When LaunchAgents Attack: A $100 API Crash Loop Story

I woke up to a $100 API charge. Not from heavy usage. From a crash loop.

4,590 restart attempts in 12 hours. Every restart made API calls. Port conflicts triggered infinite retries. Three duplicate LaunchAgents fighting for the same port.

Here's what went wrong, what I learned, and how to prevent this from happening to you.

The Discovery

Morning coffee. Check email. See the Anthropic usage alert.

$100 overnight.

That's not supposed to happen. My background services typically cost $2-3 per day. I check the logs:

$ wc -l /tmp/clawdbot-gateway.err
4758 /tmp/clawdbot-gateway.err

$ grep -c "Gateway failed to start" /tmp/clawdbot-gateway.err
4590

4,590 failed starts. In 12 hours. That's 6-7 restart attempts per minute.

The error message was consistent:

Gateway failed to start: another gateway instance is already listening on ws://127.0.0.1:18789

Port conflict. Classic. But why was it restarting thousands of times?

The Root Cause: Three Agents, One Port

I found three LaunchAgents all trying to start the same service on the same port:

Agent 1: com.bioinfo.clawdbot.plist

<key>ProgramArguments</key>
<array>
    <string>node</string>
    <string>dist/entry.js</string>
    <string>gateway</string>
    <string>--port</string>
    <string>18789</string>
</array>

<key>StartInterval</key>
<integer>900</integer>  <!-- Launch every 900 seconds (15 minutes) -->

<key>KeepAlive</key>
<dict>
    <key>SuccessfulExit</key>
    <false/>  <!-- Restart after any non-zero exit -->
    <key>Crashed</key>
    <true/>   <!-- Restart on crash -->
</dict>

<key>RunAtLoad</key>
<true/>  <!-- Start on boot -->

Agent 2: com.bioinfo.clawdbot-gateway.plist

Same port, same service, different wrapper command.

Agent 3: com.clawdbot.gateway.plist

Same port, same service, app bundle variant.

All three loaded at boot. All three trying to bind port 18789. All three configured to restart aggressively.
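
A duplicate like this can be spotted mechanically. The following is a minimal sketch, assuming each agent passes its port as a `--port` argument in ProgramArguments, as Agent 1 does; agents that hide the port inside a wrapper command would not be caught. The function names are illustrative and not part of any tool mentioned here.

```python
import plistlib
from collections import defaultdict
from pathlib import Path


def port_from_args(args):
    """Return the value that follows a `--port` flag, or None if there is none."""
    for flag, value in zip(args, args[1:]):
        if flag == "--port":
            return value
    return None


def find_port_conflicts(agent_dir):
    """Map each port to the plist files requesting it, keeping only ports claimed more than once."""
    claims = defaultdict(list)
    for plist_path in sorted(Path(agent_dir).glob("*.plist")):
        with open(plist_path, "rb") as handle:
            job = plistlib.load(handle)
        port = port_from_args(job.get("ProgramArguments", []))
        if port is not None:
            claims[port].append(plist_path.name)
    return {port: names for port, names in claims.items() if len(names) > 1}
```

Pass it the directory that holds your LaunchAgent plists. An empty result means no two agents request the same `--port` value.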

The Runaway Restart Problem

Here's how StartInterval + KeepAlive + RunAtLoad creates a crash loop nightmare:

Boot time (00:00):

  • All 3 agents start
  • First one succeeds, binds port 18789
  • Other 2 fail with port conflict
  • KeepAlive triggers a restart (both exited with a non-zero status)
  • Both retry and fail
  • Restart again

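Every 15 minutes (StartInterval):

  • launchd launches any of the 3 agents that is not running at that moment
  • It does not start a second copy of a job that is still running
  • If the winner has exited, all 3 are fighting again
  • First to bind wins
  • Others crash-loop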

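Every crash:

  • The gateway exits with a non-zero status, so KeepAlive (SuccessfulExit: false) triggers a restart
  • Port still occupied
  • Immediate failure
  • Another retry as soon as launchd allows it

Result: a restart loop that never ends, at 6-7 attempts per minute for as long as the port stays occupied.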

Never Combine These Settings
StartInterval (scheduled launches) + KeepAlive (restart on exit) + RunAtLoad: true (start on boot) creates runaway restart behavior when the failure is permanent. Each setting is safe alone. Together they're a crash loop bomb.
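
The rule above can be turned into a quick lint for a plist. This is a sketch only: the key names are the real launchd keys shown in the agent above, but the warning wording and the function names are my own.

```python
import plistlib


def restart_risks(job):
    """Return a list of warnings for a parsed launchd job dictionary."""
    warnings = []
    keep_alive = job.get("KeepAlive", False)
    # KeepAlive can be a plain boolean or a dictionary of restart conditions.
    restarts_on_exit = keep_alive is True or (isinstance(keep_alive, dict) and bool(keep_alive))
    if restarts_on_exit and "StartInterval" in job:
        warnings.append("KeepAlive and StartInterval are both set")
    if restarts_on_exit and job.get("RunAtLoad", False):
        warnings.append("KeepAlive with RunAtLoad starts the restart cycle at boot")
    return warnings


def check_plist(path):
    """Load a plist file from disk and report its restart risks."""
    with open(path, "rb") as handle:
        return restart_risks(plistlib.load(handle))
```

Run against Agent 1's settings, `restart_risks` returns both warnings. A job with KeepAlive set to false returns an empty list.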

The Cost Multiplier

Each restart attempt wasn't free. The gateway initialization sequence:

  1. Load service configuration
  2. Initialize AI provider connections
  3. Make API calls to validate credentials
  4. Attempt to bind port
  5. Fail at step 4
  6. Crash
  7. Repeat

Steps 1-3 happened every time. API calls in step 3 cost money.

4,590 restarts × ~$0.02 per restart ≈ $92, in line with the roughly $100 overnight charge
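
The arithmetic is easy to check. The $0.02 figure is a rough per-restart estimate, so the product lands near $92 rather than exactly $100:

```python
restarts = 4590
window_hours = 12
cost_per_restart = 0.02  # rough estimate in dollars, taken from the figure above

restarts_per_minute = restarts / (window_hours * 60)
estimated_cost = restarts * cost_per_restart

print(f"{restarts_per_minute:.1f} restarts per minute")  # prints 6.4
print(f"${estimated_cost:.2f} estimated API spend")       # prints $91.80
```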

The service never stayed running long enough to do useful work. Just long enough to rack up API charges.

The Missing Safety Net: Monitoring

What should have caught this before it cost $100?

1. Restart Count Monitoring

I should have been tracking restart attempts:

# Check how many times a LaunchAgent restarted today
log show --predicate 'subsystem == "com.apple.launchd"' \
  --info --debug --last 24h | \
  grep "com.bioinfo.clawdbot" | \
  grep -c "Started"

If that number exceeds 10 in any single hour (narrow the window with --last 1h to check), something's wrong.
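
The same check can run against the service's own error log instead of the system log. A minimal sketch, assuming the log contains the "Gateway failed to start" message shown earlier. The function names and the idea of comparing totals between two polls are illustrative, not something the original setup had.

```python
from pathlib import Path


def count_failures(log_path, marker="Gateway failed to start"):
    """Count log lines containing the failure marker; a missing log counts as zero."""
    path = Path(log_path)
    if not path.exists():
        return 0
    with path.open(errors="replace") as handle:
        return sum(1 for line in handle if marker in line)


def failures_since(previous_total, current_total, threshold=10):
    """Return (new_failures, alert) for the interval between two polls."""
    if current_total >= previous_total:
        new_failures = current_total - previous_total
    else:
        # The log was rotated or truncated, so everything in it is new.
        new_failures = current_total
    return new_failures, new_failures > threshold
```

Polled once an hour, `failures_since` applies the "more than 10 in an hour" rule directly, without needing timestamps in the log.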

2. Port Conflict Detection

Before binding a port, check if it's available:

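import socket
import sys

def is_port_available(port):
    """Return True if nothing is bound to the port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind(('127.0.0.1', port))
            return True
        except OSError:
            return False

if not is_port_available(18789):
    print("Port 18789 already in use. Exiting.", file=sys.stderr)
    # Exit 0: with KeepAlive.SuccessfulExit set to false, launchd only
    # restarts the job after a non-zero exit, so status 1 would loop.
    sys.exit(0)

Because this agent sets KeepAlive.SuccessfulExit to false, launchd restarts the job only after a non-zero exit. Exiting with status 0 is what prevents the restart; exit status 1 would trigger another one.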

3. API Cost Alerts

I had usage alerts set, but they triggered after the damage was done. Better approach:

Rate-based alerts:

  • Alert if API costs exceed $X per hour
  • Alert if restart rate exceeds N per minute
  • Alert if error rate spikes

Pre-spend limits:

  • Some providers support hard spending caps
  • Set daily/hourly budgets
  • Fail gracefully when limit reached

Monitor Cost Per Time, Not Total Cost
Absolute spending alerts ("$100 exceeded") trigger too late. Rate-based alerts ("$10 in last hour") catch runaway costs early. Set hourly thresholds, not daily ones.
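
A rate-based alert boils down to a sliding window over recent charges. The class below is a sketch under my own assumptions: you can observe each charge with a timestamp, and `SpendRateMonitor` is a made-up name, not a provider feature.

```python
from collections import deque


class SpendRateMonitor:
    """Track spend inside a sliding time window and flag when it passes a limit."""

    def __init__(self, limit_dollars, window_seconds=3600):
        self.limit = limit_dollars
        self.window = window_seconds
        self.events = deque()  # (timestamp_seconds, cost_dollars)
        self.total = 0.0

    def record(self, timestamp, cost):
        """Add a charge; return True when spend inside the window exceeds the limit."""
        self.events.append((timestamp, cost))
        self.total += cost
        # Drop charges that have aged out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old_cost = self.events.popleft()
            self.total -= old_cost
        return self.total > self.limit
```

With a $10 hourly limit and restarts costing about $0.02 each, the alert would fire after roughly 500 restarts. At 6-7 restarts per minute, that is a little over an hour into the loop instead of the next morning.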

The Fix: Radical Simplification

I removed the complexity:

Step 1: Kill Everything

# Unload all duplicate agents
launchctl unload (local path)
launchctl unload (local path)
launchctl unload (local path)

# Kill all running processes (pkill -f matches the full command line
# and, unlike ps | grep, does not match its own process)
pkill -9 -f clawdbot

Step 2: Disable Auto-Start

Updated the single remaining agent:

<!-- AFTER: Safe configuration -->
<key>KeepAlive</key>
<false/>  <!-- No auto-restart -->

<key>RunAtLoad</key>
<false/>  <!-- No start on boot -->

<key>ThrottleInterval</key>
<integer>60</integer>  <!-- Rate limit if manually started -->

Step 3: Manual Start Only

cd (local path)
pnpm clawdbot gateway --verbose

No LaunchAgent. No auto-restart. Manual control.

For production services, I'd use proper process management (systemd on Linux, supervised with monitoring). For development tools, manual start is safer.

Lessons Learned

1. LaunchAgents Are Dangerous for API Services

LaunchAgents are perfect for running scheduled tasks (backups, cleanup scripts). They're terrible for services that make API calls:

Good use cases:

  • Scheduled backups
  • Log rotation
  • Database cleanup
  • File synchronization

Bad use cases:

  • API-backed services
  • Development tools
  • Anything with crash risk
  • Services with usage costs

If it makes API calls, don't auto-restart it. Crashes should require manual intervention.

2. Duplicate LaunchAgents Compound Problems

How did I end up with three duplicates? Iterative development without cleanup:

  1. Created first agent (testing)
  2. Created second agent (different approach)
  3. App bundle added third agent (automatic)
  4. Never cleaned up old ones

Prevention:

# List all your LaunchAgents
launchctl list | grep -i "your-service"

# Remove duplicates immediately
launchctl unload (local path)
rm (local path)

3. Port Conflicts Need Graceful Handling

The service should have checked port availability and exited cleanly:

if not is_port_available(PORT):
    logger.error(f"Port {PORT} already in use")
    sys.exit(0)  # Exit status 0 so launchd does not restart the job

Exit code 0 with KeepAlive.SuccessfulExit: false prevents restart, because launchd then restarts the job only after a non-zero exit. The service dies once instead of looping forever.
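
The ordering matters as much as the exit code. In the startup sequence described under The Cost Multiplier, the paid credential check ran before the port bind. Here is a sketch of the reverse order, written in Python to match the snippet above even though the real gateway is a Node process. `validate_credentials` is a hypothetical callable standing in for the billable API check.

```python
import socket
import sys


def bind_or_none(port, host="127.0.0.1"):
    """Try to claim the port; return the listening socket, or None if it is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError:
        sock.close()
        return None


def start_gateway(port, validate_credentials):
    """Claim the port first, then run the billable startup step."""
    server = bind_or_none(port)
    if server is None:
        print(f"Port {port} already in use. Exiting before any API call.", file=sys.stderr)
        return None
    validate_credentials()  # hypothetical stand-in for the paid API check
    return server
```

Holding the socket also avoids the small gap between a separate availability check and the later bind, during which another process could take the port.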

4. Cost Monitoring Must Be Real-Time

My monitoring was reactive (daily summaries, weekly reports). It needed to be proactive:

Reactive monitoring (what I had):

  • Daily email: "You spent $X yesterday"
  • Weekly summary: "Last 7 days: $Y"
  • Too late to prevent damage

Proactive monitoring (what I needed):

  • Alerting: "Spent $10 in last hour (unusual)"
  • Rate limits: "Max $50/day hard cap"
  • Anomaly detection: "Restart rate 10x normal"
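
Two of these checks are small enough to enforce in the client itself. This is a sketch with invented names (`DailyBudget`, `is_anomalous`). It assumes the caller can estimate the cost of a call before making it, and that days are identified by any comparable label such as a date string.

```python
class DailyBudget:
    """Client-side hard cap: refuse a charge that would push today's spend past the cap."""

    def __init__(self, cap_dollars):
        self.cap = cap_dollars
        self.spent = 0.0
        self.day = None

    def try_spend(self, day, cost):
        """Record the charge and return True if it fits under the cap, else return False."""
        if day != self.day:
            # A new day resets the running total.
            self.day, self.spent = day, 0.0
        if self.spent + cost > self.cap:
            return False
        self.spent += cost
        return True


def is_anomalous(current_rate, baseline_rate, factor=10):
    """Flag a rate that is at least `factor` times the normal baseline."""
    return baseline_rate > 0 and current_rate >= factor * baseline_rate
```

A gateway that calls `try_spend` before each API request stops spending at the cap no matter how many times it is restarted, provided the running total is stored somewhere that survives restarts. This in-memory version does not do that.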

I've since added real-time cost monitoring using CloudWatch alarms (AWS) and custom scripts for other providers.

How to Prevent This

If you're running AI services with LaunchAgents, here's your checklist:

Pre-Flight Checks

# 1. List all LaunchAgents for your service
launchctl list | grep "your-service"

# 2. Check for port conflicts before starting
lsof -i :YOUR_PORT

# 3. Verify restart count is reasonable
log show --predicate 'subsystem == "com.apple.launchd"' \
  --last 1h | grep "your-service" | grep -c "Started"

Safe LaunchAgent Configuration

<!-- Safe template for services with API costs -->
<key>KeepAlive</key>
<false/>  <!-- No auto-restart -->

<key>RunAtLoad</key>
<false/>  <!-- Manual start only -->

<key>ThrottleInterval</key>
<integer>60</integer>  <!-- Rate limit restarts -->

<!-- NEVER use StartInterval for API services -->
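
If you would rather generate the file than hand-edit XML, Python's standard plistlib can write the same settings as a complete, well-formed plist. This is a sketch; the function name is mine, and the label and arguments are whatever your service uses.

```python
import plistlib


def safe_agent_plist(label, program_arguments):
    """Return XML plist bytes for an agent that never auto-starts or auto-restarts."""
    job = {
        "Label": label,
        "ProgramArguments": list(program_arguments),
        "KeepAlive": False,      # no auto-restart
        "RunAtLoad": False,      # manual start only
        "ThrottleInterval": 60,  # rate limit restarts
        # StartInterval is deliberately left out.
    }
    return plistlib.dumps(job)
```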

Runtime Monitoring

Create a monitoring script that runs hourly:

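#!/bin/bash
# check-service-health.sh

# Count launchd start events in the last hour. (launchctl list only shows
# loaded jobs, so it cannot count restarts.)
RESTART_COUNT=$(log show --predicate 'subsystem == "com.apple.launchd"' \
  --last 1h | grep "your-service" | grep -c "Started")

# grep -c prints nothing if the log file is missing, so default to 0
ERROR_COUNT=$(grep -c "ERROR" /tmp/your-service.err 2>/dev/null)
ERROR_COUNT=${ERROR_COUNT:-0}

if [ "$RESTART_COUNT" -gt 5 ]; then
    echo "WARNING: Service restarted $RESTART_COUNT times in the last hour"
    # Send alert (email, Slack, etc.)
fi

if [ "$ERROR_COUNT" -gt 100 ]; then
    echo "WARNING: $ERROR_COUNT errors in log"
    # Send alert
fi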

Schedule it with a separate, simple LaunchAgent:

<key>ProgramArguments</key>
<array>
    <string>/path/to/check-service-health.sh</string>
</array>

<key>StartInterval</key>
<integer>3600</integer>  <!-- Every hour -->

Test Your Alerts Before You Need Them
Monitoring that's never triggered is monitoring that might not work. Deliberately trigger your alerts (manual restart loop, simulated API spike) to verify they actually notify you.

The Real Cost

$100 in API charges hurt. But the real cost was trust.

For 12 hours, I had runaway automation making decisions without oversight. The service was designed to be helpful (auto-restart on failure). Instead, it was harmful (infinite retry on permanent failure).

This changed how I think about automation:

Before: "Make it resilient. Auto-recover from everything."

After: "Make it safe. Fail loudly on unexpected conditions."

Auto-restart is resilience theater when the failure is permanent (port conflict). Better to exit cleanly and require human intervention.
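
One way to encode that distinction, sketched with made-up names: retry a bounded number of times for failures that might clear on their own, and stop at once for failures that cannot. `PermanentFailure` is a hypothetical exception the service would raise for conditions such as an occupied port.

```python
import time


class PermanentFailure(Exception):
    """Raised for conditions a retry cannot fix, such as a port held by another process."""


def run_with_bounded_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); retry transient errors with doubling delays, never retry permanent ones."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except PermanentFailure:
            raise  # fail loudly; a human has to look at this
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts, surface the error
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists so the delays can be swapped out in tests. With the defaults, a task that keeps failing waits 1 second, then 2 seconds, and then raises its last error.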

Your Turn

Check your LaunchAgents right now:

# List all your LaunchAgents
ls (local path)

# Check for suspicious restart settings (lists files that set either key)
grep -lE "StartInterval|KeepAlive" (local path)

# Review any that set both

If you find duplicates or aggressive restart settings on services that make API calls, fix them before they cost you.

The $100 lesson: automation should be safe first, resilient second.


About the Author: Justin Johnson builds AI systems and writes about practical AI development.

justinhjohnson.com | Twitter | LinkedIn | Run Data Run | Subscribe
