# When LaunchAgents Attack: A $100 API Crash Loop Story

I woke up to a $100 API charge. Not from heavy usage. From a crash loop. 4,590 restart attempts in 12 hours. Every restart made API calls. Port conflicts triggered infinite retries. Three duplicate LaunchAgents fighting for the same port.

Here's what went wrong, what I learned, and how to prevent this from happening to you.

## The Discovery

Morning coffee. Check email. See the Anthropic usage alert. $100 overnight.

That's not supposed to happen. My background services typically cost $2-3 per day. I check the logs:

```bash
$ wc -l /tmp/clawdbot-gateway.err
4758 /tmp/clawdbot-gateway.err

$ grep -c "Gateway failed to start" /tmp/clawdbot-gateway.err
4590
```

4,590 failed starts. In 12 hours. That's **6-7 restart attempts per minute**.

The error message was consistent:

```
Gateway failed to start: another gateway instance is already listening on ws://127.0.0.1:18789
```

Port conflict. Classic. But why was it restarting thousands of times?

## The Root Cause: Three Agents, One Port

I found three LaunchAgents all trying to start the same service on the same port:

### Agent 1: `com.bioinfo.clawdbot.plist`

```xml
<key>ProgramArguments</key>
<array>
  <string>node</string>
  <string>dist/entry.js</string>
  <string>gateway</string>
  <string>--port</string>
  <string>18789</string>
</array>
<key>StartInterval</key>
<integer>900</integer>  <!-- Restart every 15 minutes -->
<key>KeepAlive</key>
<dict>
  <key>SuccessfulExit</key>
  <false/>  <!-- Restart on non-zero exit -->
  <key>Crashed</key>
  <true/>  <!-- Restart on crash -->
</dict>
<key>RunAtLoad</key>
<true/>  <!-- Start on boot -->
```

### Agent 2: `com.bioinfo.clawdbot-gateway.plist`

Same port, same service, different wrapper command.

### Agent 3: `com.clawdbot.gateway.plist`

Same port, same service, app bundle variant.

All three loaded at boot. All three trying to bind port 18789. All three configured to restart aggressively.

## The Exponential Restart Problem

Here's how `StartInterval` + `KeepAlive` + `RunAtLoad` creates a crash loop nightmare:

**Boot time (00:00):**

- All 3 agents start
- First one succeeds, binds port 18789
- Other 2 fail with port conflict
- `KeepAlive` triggers immediate restart
- Both retry and fail
- Restart again

**Every 15 minutes (StartInterval):**

- Even the successful one restarts
- Now all 3 are fighting again
- First to bind wins
- Others crash-loop

**Every crash:**

- `KeepAlive.Crashed: true` triggers restart
- Port still occupied
- Immediate failure
- Immediate retry

Result: **Exponential restart attempts**.

<div class="callout" data-callout="danger">
<div class="callout-title">Never Combine These Settings</div>
<div class="callout-content">
<strong>StartInterval</strong> (scheduled restarts) + <strong>KeepAlive: true</strong> (restart on exit) + <strong>RunAtLoad: true</strong> (start on boot) creates exponential restart behavior. Each setting is safe alone. Together they're a crash loop bomb.
</div>
</div>

## The Cost Multiplier

Each restart attempt wasn't free. The gateway initialization sequence:

1. Load service configuration
2. Initialize AI provider connections
3. **Make API calls to validate credentials**
4. Attempt to bind port
5. Fail at step 4
6. Crash
7. Repeat

Steps 1-3 happened every time. API calls in step 3 cost money.

4,590 restarts × ~$0.02 per restart ≈ **$100**

The service never stayed running long enough to do useful work. Just long enough to rack up API charges.
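The expensive part was the ordering: credential validation (the billable API calls) ran before the port bind, so every doomed restart still spent money. Claiming the port first would have made each failed attempt free. Here's a minimal sketch of that idea; the real gateway is a Node service, so this Python version with made-up init function names is just an illustration:

```python
import socket
import sys

PORT = 18789  # the gateway port from the plists above

def claim_port(port: int) -> socket.socket:
    """Bind the gateway port up front; raises OSError if another instance already holds it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", port))
    sock.listen()
    return sock

try:
    listener = claim_port(PORT)  # cheap, local, fails fast
except OSError:
    print(f"Port {PORT} already in use; another gateway is running. Exiting cleanly.")
    sys.exit(0)  # clean exit so KeepAlive (SuccessfulExit: false) won't respawn the job

# Only now do the expensive work that makes billable API calls:
# load_config(); init_providers(); validate_credentials()  # hypothetical init steps
```

With that ordering, a port conflict costs nothing but a log line.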
## The Missing Safety Net: Monitoring

What should have caught this before it cost $100?

### 1. Restart Count Monitoring

I should have been tracking restart attempts:

```bash
# Check how many times a LaunchAgent restarted today
log show --predicate 'subsystem == "com.apple.launchd"' \
  --info --debug --last 24h | \
  grep "com.bioinfo.clawdbot" | \
  grep -c "Started"
```

If that number exceeds 10 in an hour, something's wrong.

### 2. Port Conflict Detection

Before binding a port, check if it's available:

```python
import socket
import sys

def is_port_available(port):
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind(('127.0.0.1', port))
        sock.close()
        return True
    except OSError:
        return False

if not is_port_available(18789):
    print("Port 18789 already in use. Exiting.")
    sys.exit(0)  # Clean exit: don't retry a permanent failure
```

With `KeepAlive.SuccessfulExit` set to `false`, launchd only restarts the job after a non-zero exit, so exiting cleanly with status 0 is what stops the retry loop.

### 3. API Cost Alerts

I had usage alerts set, but they triggered after the damage was done. Better approach:

**Rate-based alerts:**

- Alert if API costs exceed $X per hour
- Alert if restart rate exceeds N per minute
- Alert if error rate spikes

**Pre-spend limits:**

- Some providers support hard spending caps
- Set daily/hourly budgets
- Fail gracefully when limit reached

<div class="callout" data-callout="tip">
<div class="callout-title">Monitor Cost Per Time, Not Total Cost</div>
<div class="callout-content">
Absolute spending alerts ("$100 exceeded") trigger too late. Rate-based alerts ("$10 in last hour") catch runaway costs early. Set hourly thresholds, not daily ones.
</div>
</div>

## The Fix: Radical Simplification

I removed the complexity:

### Step 1: Kill Everything

```bash
# Unload all duplicate agents
launchctl unload ~/Library/LaunchAgents/com.bioinfo.clawdbot.plist
launchctl unload ~/Library/LaunchAgents/com.clawdbot.gateway.plist
launchctl unload ~/Library/LaunchAgents/com.bioinfo.clawdbot-gateway.plist

# Kill all running processes (the [c] keeps grep from matching itself)
ps aux | grep "[c]lawdbot" | awk '{print $2}' | xargs kill -9
```

### Step 2: Disable Auto-Start

Updated the single remaining agent:

```xml
<!-- AFTER: Safe configuration -->
<key>KeepAlive</key>
<false/>  <!-- No auto-restart -->
<key>RunAtLoad</key>
<false/>  <!-- No start on boot -->
<key>ThrottleInterval</key>
<integer>60</integer>  <!-- Rate limit if manually started -->
```

### Step 3: Manual Start Only

```bash
cd ~/apps/clawdbot
pnpm clawdbot gateway --verbose
```

No LaunchAgent. No auto-restart. Manual control.

For production services, I'd use proper process management (systemd on Linux, supervised with monitoring). For development tools, manual start is safer.

## Lessons Learned

### 1. LaunchAgents Are Dangerous for API Services

LaunchAgents are perfect for running scheduled tasks (backups, cleanup scripts). They're terrible for services that make API calls:

**Good use cases:**

- Scheduled backups
- Log rotation
- Database cleanup
- File synchronization

**Bad use cases:**

- API-backed services
- Development tools
- Anything with crash risk
- Services with usage costs

If it makes API calls, don't auto-restart it. Crashes should require manual intervention.

### 2. Duplicate LaunchAgents Compound Problems

How did I end up with three duplicates? Iterative development without cleanup:

1. Created first agent (testing)
2. Created second agent (different approach)
3. App bundle added third agent (automatic)
4. Never cleaned up old ones

**Prevention:**

```bash
# List all your LaunchAgents
launchctl list | grep -i "your-service"

# Remove duplicates immediately
launchctl unload ~/Library/LaunchAgents/old-service.plist
rm ~/Library/LaunchAgents/old-service.plist
```
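Since the root cause here was three plists pointed at one port, a scan for that exact pattern is worth keeping around. A small sketch (the script name is made up; it assumes agents live in `~/Library/LaunchAgents`):

```bash
#!/bin/bash
# find-port-claimants.sh - list every LaunchAgent plist that mentions a given port
PORT="${1:-18789}"

for plist in ~/Library/LaunchAgents/*.plist; do
    # Normalize to XML first so binary plists are searchable too
    if plutil -convert xml1 -o - "$plist" 2>/dev/null | grep -q "$PORT"; then
        echo "$plist"
    fi
done
```

More than one line of output means duplicates are still fighting for the port.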
### 3. Port Conflicts Need Graceful Handling

The service should have checked port availability and exited cleanly:

```python
if not is_port_available(PORT):
    logger.error(f"Port {PORT} already in use")
    sys.exit(0)  # Exit cleanly, don't crash
```

A clean exit (code 0) with `KeepAlive.SuccessfulExit: false` prevents the restart. The service dies once instead of looping forever.

### 4. Cost Monitoring Must Be Real-Time

My monitoring was reactive (daily summaries, weekly reports). It needed to be proactive:

**Reactive monitoring (what I had):**

- Daily email: "You spent $X yesterday"
- Weekly summary: "Last 7 days: $Y"
- Too late to prevent damage

**Proactive monitoring (what I needed):**

- Alerting: "Spent $10 in last hour (unusual)"
- Rate limits: "Max $50/day hard cap"
- Anomaly detection: "Restart rate 10x normal"

I've since added real-time cost monitoring using CloudWatch alarms (AWS) and custom scripts for other providers.

## How to Prevent This

If you're running AI services with LaunchAgents, here's your checklist:

### Pre-Flight Checks

```bash
# 1. List all LaunchAgents for your service
launchctl list | grep "your-service"

# 2. Check for port conflicts before starting
lsof -i :YOUR_PORT

# 3. Verify restart count is reasonable
log show --predicate 'subsystem == "com.apple.launchd"' \
  --last 1h | grep "your-service" | grep -c "Started"
```

### Safe LaunchAgent Configuration

```xml
<!-- Safe template for services with API costs -->
<key>KeepAlive</key>
<false/>  <!-- No auto-restart -->
<key>RunAtLoad</key>
<false/>  <!-- Manual start only -->
<key>ThrottleInterval</key>
<integer>60</integer>  <!-- Rate limit restarts -->
<!-- NEVER use StartInterval for API services -->
```

### Runtime Monitoring

Create a monitoring script that runs hourly:

```bash
#!/bin/bash
# check-service-health.sh

# Count launchd "Started" events for the service in the last hour
RESTART_COUNT=$(log show --predicate 'subsystem == "com.apple.launchd"' \
  --last 1h 2>/dev/null | grep "your-service" | grep -c "Started")
ERROR_COUNT=$(grep -c "ERROR" /tmp/your-service.err)

if [ "$RESTART_COUNT" -gt 5 ]; then
  echo "WARNING: Service restarted $RESTART_COUNT times in the last hour"
  # Send alert (email, Slack, etc.)
fi

if [ "$ERROR_COUNT" -gt 100 ]; then
  echo "WARNING: $ERROR_COUNT errors in log"
  # Send alert
fi
```

Schedule it with a separate, simple LaunchAgent:

```xml
<key>ProgramArguments</key>
<array>
  <string>/path/to/check-service-health.sh</string>
</array>
<key>StartInterval</key>
<integer>3600</integer>  <!-- Every hour -->
```

<div class="callout" data-callout="warning">
<div class="callout-title">Test Your Alerts Before You Need Them</div>
<div class="callout-content">
Monitoring that's never triggered is monitoring that might not work. Deliberately trigger your alerts (manual restart loop, simulated API spike) to verify they actually notify you.
</div>
</div>

## The Real Cost

$100 in API charges hurt. But the real cost was trust.

For 12 hours, I had runaway automation making decisions without oversight. The service was designed to be helpful (auto-restart on failure). Instead, it was harmful (infinite retry on permanent failure).

This changed how I think about automation:

**Before:** "Make it resilient. Auto-recover from everything."

**After:** "Make it safe. Fail loudly on unexpected conditions."

Auto-restart is resilience theater when the failure is permanent (port conflict). Better to exit cleanly and require human intervention.
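In code terms, the shift is mostly about classifying failures before deciding whether to retry: bounded retries for transient errors, a loud clean exit for permanent ones. A rough Python sketch (the error classes and limits are illustrative, not from the actual gateway):

```python
import sys
import time

TRANSIENT = (ConnectionResetError, TimeoutError)  # worth a bounded retry
PERMANENT = (OSError,)  # e.g. the port is already bound: stop and tell a human

def start_with_bounded_retries(start_fn, max_attempts=3, delay_s=5):
    for attempt in range(1, max_attempts + 1):
        try:
            return start_fn()
        except TRANSIENT as exc:
            print(f"Transient failure ({exc}); retry {attempt}/{max_attempts}")
            time.sleep(delay_s)
        except PERMANENT as exc:
            print(f"Permanent failure ({exc}); exiting for manual intervention")
            sys.exit(0)  # clean exit so launchd doesn't respawn the job
    print("Still failing after bounded retries; giving up")
    sys.exit(0)
```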
## Your Turn

Check your LaunchAgents right now:

```bash
# List all your LaunchAgents
ls ~/Library/LaunchAgents/

# Check for suspicious restart settings
grep -l "StartInterval" ~/Library/LaunchAgents/*.plist | xargs grep -l "KeepAlive"
# Review any that have both
```

If you find duplicates or aggressive restart settings on services that make API calls, fix them before they cost you.

The $100 lesson: automation should be safe first, resilient second.

---

## Related Articles

<div class="quick-nav">

- [[building-ai-research-night-shift|My AI Research Assistant Works the Night Shift]]
- [[AI Systems & Architecture/debugging-distributed-ai-systems|Debugging Distributed AI Systems]]
- [[Practical Applications/macos-automation-patterns|macOS Automation Patterns That Work]]

</div>

- [[dgx-lab-benchmarks-vs-reality-day-4|DGX Lab: When Benchmark Numbers Meet Production Reality - Day 4]]
- [[three-days-to-build-ai-research-lab-dgx-claude|My AI Linux Expert: How Claude Code Suggested a 95,000x Faster Solution]]
- [[dgx-lab-building-complete-rag-infrastructure-day-3|DGX Lab: Building a Complete RAG Infrastructure - From Ollama to Qdrant to AnythingLLM - Day 3]]

---

<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>

<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>