# When LaunchAgents Attack: A $100 API Crash Loop Story
I woke up to a $100 API charge. Not from heavy usage. From a crash loop.
4,590 restart attempts in 12 hours. Every restart made API calls. Port conflicts triggered infinite retries. Three duplicate LaunchAgents fighting for the same port.
Here's what went wrong, what I learned, and how to prevent this from happening to you.
## The Discovery
Morning coffee. Check email. See the Anthropic usage alert.
$100 overnight.
That's not supposed to happen. My background services typically cost $2-3 per day. I check the logs:
```bash
$ wc -l /tmp/clawdbot-gateway.err
4758 /tmp/clawdbot-gateway.err
$ grep -c "Gateway failed to start" /tmp/clawdbot-gateway.err
4590
```
4,590 failed starts. In 12 hours. That's **6-7 restart attempts per minute**.
The error message was consistent:
```
Gateway failed to start: another gateway instance is already listening on ws://127.0.0.1:18789
```
Port conflict. Classic. But why was it restarting thousands of times?
## The Root Cause: Three Agents, One Port
I found three LaunchAgents all trying to start the same service on the same port:
### Agent 1: `com.bioinfo.clawdbot.plist`
```xml
<key>ProgramArguments</key>
<array>
    <string>node</string>
    <string>dist/entry.js</string>
    <string>gateway</string>
    <string>--port</string>
    <string>18789</string>
</array>
<key>StartInterval</key>
<integer>900</integer> <!-- Start every 15 minutes -->
<key>KeepAlive</key>
<dict>
    <key>SuccessfulExit</key>
    <false/> <!-- Restart on non-zero exit -->
    <key>Crashed</key>
    <true/> <!-- Restart on crash -->
</dict>
<key>RunAtLoad</key>
<true/> <!-- Start on boot -->
```
### Agent 2: `com.bioinfo.clawdbot-gateway.plist`
Same port, same service, different wrapper command.
### Agent 3: `com.clawdbot.gateway.plist`
Same port, same service, app bundle variant.
All three loaded at boot. All three trying to bind port 18789. All three configured to restart aggressively.
## The Runaway Restart Problem
Here's how `StartInterval` + `KeepAlive` + `RunAtLoad` creates a crash loop nightmare:
**Boot time (00:00):**
- All 3 agents start
- First one succeeds, binds port 18789
- Other 2 fail with port conflict
- `KeepAlive` (`SuccessfulExit: false`) triggers an immediate restart
- Both retry and fail
- Restart again
**Every 15 minutes (StartInterval):**
- launchd starts any agent that isn't already running
- Whichever instance holds port 18789 keeps it
- The rest fail to bind and crash-loop again
**Every crash:**
- `KeepAlive.Crashed: true` triggers restart
- Port still occupied
- Immediate failure
- Immediate retry
Result: **a runaway restart loop**, 6-7 attempts per minute, around the clock.
<div class="callout" data-callout="danger">
<div class="callout-title">Never Combine These Settings</div>
<div class="callout-content">
<strong>StartInterval</strong> (scheduled starts) + <strong>KeepAlive</strong> (restart on exit) + <strong>RunAtLoad: true</strong> (start on boot) creates runaway restart behavior. Each setting is safe alone. Together they're a crash loop bomb.
</div>
</div>
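To catch that combination before it bites, here's a minimal sketch (Python standard library only) that scans `~/Library/LaunchAgents` for agents setting all three keys. Treat it as a quick audit aid rather than a definitive check; `KeepAlive` sub-keys can encode subtler policies than a plain boolean.
```python
#!/usr/bin/env python3
"""Flag LaunchAgents that combine StartInterval + KeepAlive + RunAtLoad.

A minimal audit sketch; it only flags candidates for manual review.
"""
import plistlib
from pathlib import Path

AGENT_DIR = Path.home() / "Library" / "LaunchAgents"

for plist_path in sorted(AGENT_DIR.glob("*.plist")):
    try:
        with open(plist_path, "rb") as f:
            agent = plistlib.load(f)
    except Exception as exc:
        print(f"SKIP {plist_path.name}: {exc}")
        continue

    keep_alive = agent.get("KeepAlive")  # may be a bool or a dict of conditions
    aggressive = keep_alive is True or isinstance(keep_alive, dict)

    if "StartInterval" in agent and aggressive and agent.get("RunAtLoad", False):
        print(f"RISKY: {plist_path.name} combines StartInterval + KeepAlive + RunAtLoad")
```
Run it whenever you add or edit an agent; anything it flags deserves a second look before the next reboot.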
## The Cost Multiplier
Each restart attempt wasn't free. The gateway initialization sequence:
1. Load service configuration
2. Initialize AI provider connections
3. **Make API calls to validate credentials**
4. Attempt to bind port
5. Fail at step 4
6. Crash
7. Repeat
Steps 1-3 happened every time. API calls in step 3 cost money.
4,590 restarts × ~$0.02 per restart ≈ **$100**
The service never stayed running long enough to do useful work. Just long enough to rack up API charges.
## The Missing Safety Net: Monitoring
What should have caught this before it cost $100?
### 1. Restart Count Monitoring
I should have been tracking restart attempts:
```bash
# Check how many times a LaunchAgent restarted today
log show --predicate 'subsystem == "com.apple.launchd"' \
--info --debug --last 24h | \
grep "com.bioinfo.clawdbot" | \
grep -c "Started"
```
If that number exceeds 10 in an hour, something's wrong.
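Here's a rough wrapper around that same `log show` query that applies the threshold; the label and the 10-per-hour cutoff are my heuristics, not anything launchd defines.
```python
#!/usr/bin/env python3
"""Warn when a LaunchAgent label was started too many times in the last hour."""
import subprocess

LABEL = "com.bioinfo.clawdbot"  # the agent label to watch
THRESHOLD = 10                  # starts per hour that count as "something's wrong"

result = subprocess.run(
    ["log", "show",
     "--predicate", 'subsystem == "com.apple.launchd"',
     "--info", "--last", "1h"],
    capture_output=True, text=True, check=False,
)
starts = sum(
    1 for line in result.stdout.splitlines()
    if LABEL in line and "Started" in line
)
if starts > THRESHOLD:
    print(f"WARNING: {LABEL} started {starts} times in the last hour")
```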
### 2. Port Conflict Detection
Before binding a port, check if it's available:
```python
import socket
import sys

def is_port_available(port):
    """Return True if nothing is listening on 127.0.0.1:port."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind(('127.0.0.1', port))
        return True
    except OSError:
        return False

if not is_port_available(18789):
    print("Port 18789 already in use. Exiting.")
    sys.exit(1)  # Exit deliberately instead of crashing (see the exit-code note below)
```
A deliberate exit beats a crash, but watch the exit code: with the `KeepAlive` dictionary above (`SuccessfulExit: false`), launchd restarts the job on any non-zero exit, so status 1 alone won't stop the loop. Either exit 0 on a known-permanent failure like a port conflict, or disable `KeepAlive` entirely, as in the fix below.
### 3. API Cost Alerts
I had usage alerts set, but they only fired after the damage was done. A better approach, sketched after the lists below:
**Rate-based alerts:**
- Alert if API costs exceed $X per hour
- Alert if restart rate exceeds N per minute
- Alert if error rate spikes
**Pre-spend limits:**
- Some providers support hard spending caps
- Set daily/hourly budgets
- Fail gracefully when limit reached
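As a concrete example of a rate-based alert, here's a minimal sketch. It assumes your service appends one JSON line per API call (timestamp plus cost in USD) to a local log; the log path, record format, and $5/hour threshold are all illustrative, and the `print` stands in for whatever notification channel you use.
```python
#!/usr/bin/env python3
"""Rate-based cost alert: warn when the last hour's API spend exceeds a limit."""
import json
import time

LOG_PATH = "/tmp/api-costs.jsonl"  # hypothetical per-call cost log: {"ts": ..., "usd": ...}
HOURLY_LIMIT_USD = 5.00            # illustrative threshold

def spend_last_hour(path):
    cutoff = time.time() - 3600
    total = 0.0
    try:
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                if rec.get("ts", 0) >= cutoff:
                    total += float(rec.get("usd", 0.0))
    except FileNotFoundError:
        pass  # no spend logged yet
    return total

if __name__ == "__main__":
    spent = spend_last_hour(LOG_PATH)
    if spent > HOURLY_LIMIT_USD:
        # Swap print for email, Slack, or a pager
        print(f"ALERT: ${spent:.2f} spent in the last hour (limit ${HOURLY_LIMIT_USD:.2f})")
```
Schedule it every few minutes with cron or a simple LaunchAgent that sets only `StartInterval`.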
<div class="callout" data-callout="tip">
<div class="callout-title">Monitor Cost Per Time, Not Total Cost</div>
<div class="callout-content">
Absolute spending alerts ("$100 exceeded") trigger too late. Rate-based alerts ("$10 in last hour") catch runaway costs early. Set hourly thresholds, not daily ones.
</div>
</div>
## The Fix: Radical Simplification
I removed the complexity:
### Step 1: Kill Everything
```bash
# Unload all duplicate agents
launchctl unload ~/Library/LaunchAgents/com.bioinfo.clawdbot.plist
launchctl unload ~/Library/LaunchAgents/com.clawdbot.gateway.plist
launchctl unload ~/Library/LaunchAgents/com.bioinfo.clawdbot-gateway.plist
# Kill all running processes
ps aux | grep '[c]lawdbot' | awk '{print $2}' | xargs kill -9  # [c]lawdbot keeps grep from matching itself
```
### Step 2: Disable Auto-Start
Updated the single remaining agent:
```xml
<!-- AFTER: Safe configuration -->
<key>KeepAlive</key>
<false/> <!-- No auto-restart -->
<key>RunAtLoad</key>
<false/> <!-- No start on boot -->
<key>ThrottleInterval</key>
<integer>60</integer> <!-- Rate limit if manually started -->
```
### Step 3: Manual Start Only
```bash
cd ~/apps/clawdbot
pnpm clawdbot gateway --verbose
```
No LaunchAgent. No auto-restart. Manual control.
For production services, I'd use proper process management (systemd on Linux, supervised with monitoring). For development tools, manual start is safer.
## Lessons Learned
### 1. LaunchAgents Are Dangerous for API Services
LaunchAgents are perfect for running scheduled tasks (backups, cleanup scripts). They're terrible for services that make API calls:
**Good use cases:**
- Scheduled backups
- Log rotation
- Database cleanup
- File synchronization
**Bad use cases:**
- API-backed services
- Development tools
- Anything with crash risk
- Services with usage costs
If it makes API calls, don't auto-restart it. Crashes should require manual intervention.
### 2. Duplicate LaunchAgents Compound Problems
How did I end up with three duplicates? Iterative development without cleanup:
1. Created first agent (testing)
2. Created second agent (different approach)
3. App bundle added third agent (automatic)
4. Never cleaned up old ones
**Prevention:**
```bash
# List all your LaunchAgents
launchctl list | grep -i "your-service"
# Remove duplicates immediately
launchctl unload ~/Library/LaunchAgents/old-service.plist
rm ~/Library/LaunchAgents/old-service.plist
```
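Beyond grepping for labels, a rough sketch like this can surface agents that claim the same port, assuming the port is passed as a `--port` argument in `ProgramArguments` (as in the plists above); adjust the heuristic to however your services take their ports.
```python
#!/usr/bin/env python3
"""Group LaunchAgents by the port they pass via --port."""
import plistlib
from collections import defaultdict
from pathlib import Path

AGENT_DIR = Path.home() / "Library" / "LaunchAgents"
agents_by_port = defaultdict(list)

for plist_path in sorted(AGENT_DIR.glob("*.plist")):
    try:
        with open(plist_path, "rb") as f:
            args = plistlib.load(f).get("ProgramArguments", [])
    except Exception:
        continue  # unreadable plist; skip it
    for i, arg in enumerate(args):
        if arg == "--port" and i + 1 < len(args):
            agents_by_port[str(args[i + 1])].append(plist_path.name)

for port, names in agents_by_port.items():
    if len(names) > 1:
        print(f"Port {port} claimed by {len(names)} agents: {', '.join(names)}")
```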
### 3. Port Conflicts Need Graceful Handling
The service should have checked port availability and exited cleanly:
```python
if not is_port_available(PORT):
    logger.error(f"Port {PORT} already in use")
    sys.exit(1)  # Exit cleanly, don't crash
```
One caveat: `KeepAlive.SuccessfulExit: false` tells launchd to restart on any non-zero exit, so pair the deliberate exit with `KeepAlive` disabled (or exit 0 on this known-permanent failure). Either way, the service dies once instead of looping forever.
### 4. Cost Monitoring Must Be Real-Time
My monitoring was reactive (daily summaries, weekly reports). It needed to be proactive:
**Reactive monitoring (what I had):**
- Daily email: "You spent $X yesterday"
- Weekly summary: "Last 7 days: $Y"
- Too late to prevent damage
**Proactive monitoring (what I needed):**
- Alerting: "Spent $10 in last hour (unusual)"
- Rate limits: "Max $50/day hard cap"
- Anomaly detection: "Restart rate 10x normal"
I've since added real-time cost monitoring using CloudWatch alarms (AWS) and custom scripts for other providers.
## How to Prevent This
If you're running AI services with LaunchAgents, here's your checklist:
### Pre-Flight Checks
```bash
# 1. List all LaunchAgents for your service
launchctl list | grep "your-service"
# 2. Check for port conflicts before starting
lsof -i :YOUR_PORT
# 3. Verify restart count is reasonable
log show --predicate 'subsystem == "com.apple.launchd"' \
--last 1h | grep "your-service" | grep -c "Started"
```
### Safe LaunchAgent Configuration
```xml
<!-- Safe template for services with API costs -->
<key>KeepAlive</key>
<false/> <!-- No auto-restart -->
<key>RunAtLoad</key>
<false/> <!-- Manual start only -->
<key>ThrottleInterval</key>
<integer>60</integer> <!-- Rate limit restarts -->
<!-- NEVER use StartInterval for API services -->
```
### Runtime Monitoring
Create a monitoring script that runs hourly:
```bash
#!/bin/bash
# check-service-health.sh
# Count launchd "Started" events for the service in the last hour
RESTART_COUNT=$(log show --predicate 'subsystem == "com.apple.launchd"' \
    --last 1h 2>/dev/null | grep "your-service" | grep -c "Started")
# Count errors in the service log (0 if the log doesn't exist yet)
ERROR_COUNT=$(grep -c "ERROR" /tmp/your-service.err 2>/dev/null)
ERROR_COUNT=${ERROR_COUNT:-0}

if [ "$RESTART_COUNT" -gt 5 ]; then
    echo "WARNING: Service restarted $RESTART_COUNT times in the last hour"
    # Send alert (email, Slack, etc.)
fi

if [ "$ERROR_COUNT" -gt 100 ]; then
    echo "WARNING: $ERROR_COUNT errors in log"
    # Send alert
fi
```
Schedule it with a separate, simple LaunchAgent:
```xml
<key>ProgramArguments</key>
<array>
    <string>/path/to/check-service-health.sh</string>
</array>
<key>StartInterval</key>
<integer>3600</integer> <!-- Every hour -->
```
<div class="callout" data-callout="warning">
<div class="callout-title">Test Your Alerts Before You Need Them</div>
<div class="callout-content">
Monitoring that's never triggered is monitoring that might not work. Deliberately trigger your alerts (manual restart loop, simulated API spike) to verify they actually notify you.
</div>
</div>
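For example, assuming the cost-log format from the rate-alert sketch earlier, you can fake an hour's worth of spend and confirm the alert actually reaches you:
```python
#!/usr/bin/env python3
"""Simulate an API cost spike to exercise the hourly cost alert."""
import json
import time

LOG_PATH = "/tmp/api-costs.jsonl"  # same hypothetical log as the rate-alert sketch

with open(LOG_PATH, "a") as f:
    for _ in range(300):  # 300 fake calls at $0.02 each = $6 in the "last hour"
        f.write(json.dumps({"ts": time.time(), "usd": 0.02}) + "\n")

print("Wrote 300 fake cost records; now run the rate-alert script and check the notification.")
```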
## The Real Cost
$100 in API charges hurt. But the real cost was trust.
For 12 hours, I had runaway automation making decisions without oversight. The service was designed to be helpful (auto-restart on failure). Instead, it was harmful (infinite retry on permanent failure).
This changed how I think about automation:
**Before:** "Make it resilient. Auto-recover from everything."
**After:** "Make it safe. Fail loudly on unexpected conditions."
Auto-restart is resilience theater when the failure is permanent (port conflict). Better to exit cleanly and require human intervention.
## Your Turn
Check your LaunchAgents right now:
```bash
# List all your LaunchAgents
ls ~/Library/LaunchAgents/
# Check for suspicious restart settings
grep -lE "StartInterval|KeepAlive" ~/Library/LaunchAgents/*.plist
# Review any that have both
```
If you find duplicates or aggressive restart settings on services that make API calls, fix them before they cost you.
The $100 lesson: automation should be safe first, resilient second.
---
## Related Articles
<div class="quick-nav">
- [[building-ai-research-night-shift|My AI Research Assistant Works the Night Shift]]
- [[AI Systems & Architecture/debugging-distributed-ai-systems|Debugging Distributed AI Systems]]
- [[Practical Applications/macos-automation-patterns|macOS Automation Patterns That Work]]
- [[dgx-lab-benchmarks-vs-reality-day-4|DGX Lab: When Benchmark Numbers Meet Production Reality - Day 4]]
- [[three-days-to-build-ai-research-lab-dgx-claude|My AI Linux Expert: How Claude Code Suggested a 95,000x Faster Solution]]
- [[dgx-lab-building-complete-rag-infrastructure-day-3|DGX Lab: Building a Complete RAG Infrastructure - From Ollama to Qdrant to AnythingLLM - Day 3]]
</div>
---
<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>
<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>