Practical Applications · January 9, 2026 · 8 min read · shipped

When LaunchAgents Attack: A $100 API Crash Loop Story

I woke up to a $100 API charge. Not from heavy usage. From a crash loop.

4,590 restart attempts in 12 hours. Every restart made API calls. Port conflicts triggered infinite retries. Three duplicate LaunchAgents fighting for the same port.

Here's what went wrong, what I learned, and how to prevent this from happening to you.

The Discovery

Morning coffee. Check email. See the Anthropic usage alert.

$100 overnight.

That's not supposed to happen. My background services typically cost $2-3 per day. I check the logs:

$ wc -l /tmp/clawdbot-gateway.err
4758 /tmp/clawdbot-gateway.err

$ grep -c "Gateway failed to start" /tmp/clawdbot-gateway.err
4590

4,590 failed starts. In 12 hours. That's 6-7 restart attempts per minute.

The error message was consistent:

Gateway failed to start: another gateway instance is already listening on ws://127.0.0.1:18789

Port conflict. Classic. But why was it restarting thousands of times?

The Root Cause: Three Agents, One Port

I found three LaunchAgents all trying to start the same service on the same port:

Agent 1: `com.bioinfo.clawdbot.plist`

<key>ProgramArguments</key>
<array>
    <string>node</string>
    <string>dist/entry.js</string>
    <string>gateway</string>
    <string>--port</string>
    <string>18789</string>
</array>

<key>StartInterval</key>
<integer>900</integer>  <!-- Restart every 15 minutes -->

<key>KeepAlive</key>
<dict>
    <key>SuccessfulExit</key>
    <false/>  <!-- Restart after any non-zero (unsuccessful) exit -->
    <key>Crashed</key>
    <true/>   <!-- Restart on crash -->
</dict>

<key>RunAtLoad</key>
<true/>  <!-- Start on boot -->

Agent 2: `com.bioinfo.clawdbot-gateway.plist`

Same port, same service, different wrapper command.

Agent 3: `com.clawdbot.gateway.plist`

Same port, same service, app bundle variant.

All three loaded at boot. All three trying to bind port 18789. All three configured to restart aggressively.
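A duplicate like this is easy to miss by eye. Here's a minimal Python sketch that groups agent definitions by the port they request. It assumes the port is passed as a `--port` argument in ProgramArguments, the way Agent 1 does it; agents that launch through a wrapper command or app bundle may hide the port and would not be caught. The function names are made up for this illustration.

```python
import plistlib
from collections import defaultdict
from pathlib import Path


def port_from_args(args):
    """Return the value that follows a `--port` flag, or None if there is none."""
    for flag, value in zip(args, args[1:]):
        if flag == "--port":
            return value
    return None


def port_conflicts(agent_dir):
    """Map each port to the agent labels requesting it, keeping only shared ports."""
    ports = defaultdict(list)
    for path in sorted(Path(agent_dir).glob("*.plist")):
        with path.open("rb") as fh:
            job = plistlib.load(fh)
        port = port_from_args(job.get("ProgramArguments", []))
        if port is not None:
            # Fall back to the file name when the plist has no Label key
            ports[port].append(job.get("Label", path.stem))
    return {port: labels for port, labels in ports.items() if len(labels) > 1}
```

Pointing `port_conflicts` at the directory that holds your agent plists returns an empty dict when every port has a single owner.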

The Exponential Restart Problem

Here's how StartInterval + KeepAlive + RunAtLoad creates a crash loop nightmare:

Boot time (00:00):

All 3 agents start
First one succeeds, binds port 18789
Other 2 fail with port conflict
KeepAlive: true triggers immediate restart
Both retry and fail
Restart again

Every 15 minutes (StartInterval):

Even the successful one restarts
Now all 3 are fighting again
First to bind wins
Others crash-loop

Every crash:

KeepAlive.Crashed: true triggers restart
Port still occupied
Immediate failure
Immediate retry

Result: restart attempts piling up, roughly six per minute, for as long as the port stayed occupied.

Never Combine These Settings

StartInterval (scheduled launches) + KeepAlive (restart on exit) + RunAtLoad: true (start on boot) creates runaway restart behavior when a start can never succeed. Each setting is reasonable alone. Together they're a crash loop bomb.
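To make the warning concrete, here's a small Python sketch that flags a job definition combining all three keys. The inline plist is trimmed down from the Agent 1 settings shown earlier; `restart_triggers` and `is_risky` are names invented for this example, not part of launchd or any library.

```python
import plistlib

# Trimmed copy of the Agent 1 restart settings from this post
AGENT_1 = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0"><dict>
  <key>StartInterval</key><integer>900</integer>
  <key>KeepAlive</key><dict>
    <key>SuccessfulExit</key><false/>
    <key>Crashed</key><true/>
  </dict>
  <key>RunAtLoad</key><true/>
</dict></plist>"""


def restart_triggers(job):
    """List the keys in a job dict that can launch or relaunch the process."""
    triggers = []
    if job.get("StartInterval"):
        triggers.append("StartInterval")
    # KeepAlive may be a boolean or a dict of conditions; any truthy value counts
    if job.get("KeepAlive"):
        triggers.append("KeepAlive")
    if job.get("RunAtLoad"):
        triggers.append("RunAtLoad")
    return triggers


def is_risky(job):
    """True when all three launch triggers are enabled on the same job."""
    return len(restart_triggers(job)) == 3


print(restart_triggers(plistlib.loads(AGENT_1)))
# ['StartInterval', 'KeepAlive', 'RunAtLoad']
```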

The Cost Multiplier

Each restart attempt wasn't free. The gateway initialization sequence:

1. Load service configuration
2. Initialize AI provider connections
3. Make API calls to validate credentials
4. Attempt to bind port
5. Fail at step 4
6. Crash
7. Repeat

Steps 1-3 happened every time. API calls in step 3 cost money.

4,590 restarts × ~$0.02 per restart ≈ $92, which lines up with the roughly $100 charge

The service never stayed running long enough to do useful work. Just long enough to rack up API charges.
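The arithmetic is worth spelling out, because the per-restart figure is an estimate rather than a measured price. This short Python check uses only the numbers from this post and works in both directions: the total implied by $0.02 per restart, and the per-restart cost implied by the $100 charge.

```python
restarts = 4590                # failed starts counted in the error log
hours = 12                     # length of the crash loop
charge_usd = 100.0             # the overnight charge
est_cost_per_restart = 0.02    # rough per-restart estimate

print(f"estimated total: ${restarts * est_cost_per_restart:.2f}")   # $91.80
print(f"implied cost per restart: ${charge_usd / restarts:.4f}")    # $0.0218
print(f"restarts per minute: {restarts / (hours * 60):.1f}")        # 6.4
```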

The Missing Safety Net: Monitoring

What should have caught this before it cost $100?

1. Restart Count Monitoring

I should have been tracking restart attempts:

# Check how many times a LaunchAgent restarted today
log show --predicate 'subsystem == "com.apple.launchd"' \
  --info --debug --last 24h | \
  grep "com.bioinfo.clawdbot" | \
  grep -c "Started"

If that number exceeds 10 in a single hour (use --last 1h to check the hourly figure), something's wrong.

2. Port Conflict Detection

Before binding a port, check if it's available:

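import socket
import sys

def is_port_available(port):
    """Return True if nothing is bound to the port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind(('127.0.0.1', port))
            return True
        except OSError:
            return False

if not is_port_available(18789):
    print("Port 18789 already in use. Exiting.")
    sys.exit(0)  # Status 0 counts as a successful exit, so launchd won't relaunch

Exiting with status 0 keeps KeepAlive from restarting the job. With SuccessfulExit set to false, launchd relaunches only after a non-zero exit, so exit(1) here would feed the loop instead of stopping it.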
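The check only saves money if it runs before the paid steps, and checking first then binding later leaves a small gap in which another process can grab the port. One way around both problems, sketched below, is to claim the port first and hold on to the socket. `claim_port` and `start_gateway` are illustrative names, and `validate_credentials` is a hypothetical stand-in for whatever API calls the real gateway makes at startup.

```python
import socket
import sys


def claim_port(port, host="127.0.0.1"):
    """Bind and listen on the port; return the socket, or None if it is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        return None
    return sock


def start_gateway(port, validate_credentials):
    """Claim the port first; only then run the step that costs money."""
    listener = claim_port(port)
    if listener is None:
        print(f"Port {port} already in use; skipping startup.", file=sys.stderr)
        return None
    validate_credentials()  # hypothetical stand-in for the paid API calls
    return listener
```

With this ordering, a port conflict ends the attempt before any API call is made, so a failed start costs nothing.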

3. API Cost Alerts

I had usage alerts set, but they triggered after the damage was done. Better approach:

Rate-based alerts:

Alert if API costs exceed $X per hour
Alert if restart rate exceeds N per minute
Alert if error rate spikes

Pre-spend limits:

Some providers support hard spending caps
Set daily/hourly budgets
Fail gracefully when limit reached

Monitor Cost Per Time, Not Total Cost

Absolute spending alerts ("$100 exceeded") trigger too late. Rate-based alerts ("$10 in last hour") catch runaway costs early. Set hourly thresholds, not daily ones.
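A rate-based alert boils down to a sliding window over recent charges. Here's a minimal Python sketch of that idea. The class name, the limit, and the way charges get reported are all assumptions for illustration; a real setup would feed it from your provider's usage data.

```python
from collections import deque


class SpendRateMonitor:
    """Track spend inside a sliding time window and flag when it passes a limit."""

    def __init__(self, limit_usd, window_seconds=3600):
        self.limit_usd = limit_usd
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, cost) pairs, oldest first
        self._total = 0.0

    def record(self, timestamp, cost_usd):
        """Add one charge; return True if spend in the window now exceeds the limit."""
        self._events.append((timestamp, cost_usd))
        self._total += cost_usd
        # Drop charges that have aged out of the window
        while self._events and self._events[0][0] <= timestamp - self.window_seconds:
            _, old_cost = self._events.popleft()
            self._total -= old_cost
        return self._total > self.limit_usd


# One $0.25 charge per minute against a $5/hour limit
monitor = SpendRateMonitor(limit_usd=5.0)
for minute in range(60):
    if monitor.record(timestamp=minute * 60, cost_usd=0.25):
        print(f"alert at minute {minute}")  # alert at minute 20
        break
```

The threshold matters as much as the mechanism. $100 over 12 hours works out to a little over $8 per hour, so a $10/hour limit would not have fired during this incident. With a normal spend of $2-3 per day, the hourly limit can sit far lower.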

The Fix: Radical Simplification

I removed the complexity:

Step 1: Kill Everything

# Unload all duplicate agents
launchctl unload (local path)
launchctl unload (local path)
launchctl unload (local path)

# Kill all running processes
pkill -9 -f clawdbot  # Matches on the full command line; avoids grep matching itself

Step 2: Disable Auto-Start

Updated the single remaining agent:

<!-- AFTER: Safe configuration -->
<key>KeepAlive</key>
<false/>  <!-- No auto-restart -->

<key>RunAtLoad</key>
<false/>  <!-- No start on boot -->

<key>ThrottleInterval</key>
<integer>60</integer>  <!-- Rate limit if manually started -->

Step 3: Manual Start Only

cd (local path)
pnpm clawdbot gateway --verbose

No LaunchAgent. No auto-restart. Manual control.

For production services, I'd use proper process management (systemd on Linux, supervised with monitoring). For development tools, manual start is safer.

Lessons Learned

1. LaunchAgents Are Dangerous for API Services

LaunchAgents are perfect for running scheduled tasks (backups, cleanup scripts). They're terrible for services that make API calls:

Good use cases:

Scheduled backups
Log rotation
Database cleanup
File synchronization

Bad use cases:

API-backed services
Development tools
Anything with crash risk
Services with usage costs

If it makes API calls, don't auto-restart it. Crashes should require manual intervention.
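If a service does have to stay under some form of auto-start, one hedge is a startup guard that refuses to run after too many recent attempts. The sketch below keeps attempt timestamps in a small JSON file. The function name, file location, and limits are all assumptions for illustration, not settings from the setup described here.

```python
import json
import time
from pathlib import Path


def allow_start(state_file, max_starts=3, window_seconds=3600, now=None):
    """Return True and record the attempt unless the window already holds too many."""
    now = time.time() if now is None else now
    path = Path(state_file)
    try:
        history = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        history = []  # No usable history yet: treat as a first start
    # Keep only attempts that are still inside the window
    recent = [t for t in history if now - t < window_seconds]
    allowed = len(recent) < max_starts
    if allowed:
        recent.append(now)
    path.write_text(json.dumps(recent))
    return allowed
```

Called at the top of the entry point, before any API call, this caps how many times a misbehaving launcher can bill you per hour. Refused attempts are not recorded, so the guard opens again once the window has passed.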

2. Duplicate LaunchAgents Compound Problems

How did I end up with three duplicates? Iterative development without cleanup:

Created first agent (testing)
Created second agent (different approach)
App bundle added third agent (automatic)
Never cleaned up old ones

Prevention:

# List all your LaunchAgents
launchctl list | grep -i "your-service"

# Remove duplicates immediately
launchctl unload (local path)
rm (local path)

3. Port Conflicts Need Graceful Handling

The service should have checked port availability and exited cleanly:

if not is_port_available(PORT):
    logger.error(f"Port {PORT} already in use")
    sys.exit(0)  # Exit cleanly with status 0 so launchd does not relaunch

Exit code 0 with KeepAlive.SuccessfulExit: false prevents restart, because launchd relaunches the job only after a non-zero exit. The service dies once instead of looping forever.

4. Cost Monitoring Must Be Real-Time

My monitoring was reactive (daily summaries, weekly reports). It needed to be proactive:

Reactive monitoring (what I had):

Daily email: "You spent $X yesterday"
Weekly summary: "Last 7 days: $Y"
Too late to prevent damage

Proactive monitoring (what I needed):

Alerting: "Spent $10 in last hour (unusual)"
Rate limits: "Max $50/day hard cap"
Anomaly detection: "Restart rate 10x normal"

I've since added real-time cost monitoring using CloudWatch alarms (AWS) and custom scripts for other providers.

How to Prevent This

If you're running AI services with LaunchAgents, here's your checklist:

Pre-Flight Checks

# 1. List all LaunchAgents for your service
launchctl list | grep "your-service"

# 2. Check for port conflicts before starting
lsof -i :YOUR_PORT

# 3. Verify restart count is reasonable
log show --predicate 'subsystem == "com.apple.launchd"' \
  --last 1h | grep "your-service" | grep -c "Started"

Safe LaunchAgent Configuration

<!-- Safe template for services with API costs -->
<key>KeepAlive</key>
<false/>  <!-- No auto-restart -->

<key>RunAtLoad</key>
<false/>  <!-- Manual start only -->

<key>ThrottleInterval</key>
<integer>60</integer>  <!-- Rate limit restarts -->

<!-- NEVER use StartInterval for API services -->

Runtime Monitoring

Create a monitoring script that runs hourly:

#!/bin/bash
# check-service-health.sh

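# Count launches in the last hour (launchctl list only shows which jobs are loaded)
RESTART_COUNT=$(log show --predicate 'subsystem == "com.apple.launchd"' \
  --last 1h | grep "your-service" | grep -c "Started")
ERROR_COUNT=$(grep -c "ERROR" /tmp/your-service.err 2>/dev/null)
ERROR_COUNT=${ERROR_COUNT:-0}  # Treat a missing log file as zero errors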

if [ "$RESTART_COUNT" -gt 5 ]; then
    echo "WARNING: Service restarted $RESTART_COUNT times"
    # Send alert (email, Slack, etc.)
fi

if [ "$ERROR_COUNT" -gt 100 ]; then
    echo "WARNING: $ERROR_COUNT errors in log"
    # Send alert
fi

Schedule it with a separate, simple LaunchAgent:

<key>ProgramArguments</key>
<array>
    <string>/path/to/check-service-health.sh</string>
</array>

<key>StartInterval</key>
<integer>3600</integer>  <!-- Every hour -->

Test Your Alerts Before You Need Them

Monitoring that's never triggered is monitoring that might not work. Deliberately trigger your alerts (manual restart loop, simulated API spike) to verify they actually notify you.

The Real Cost

$100 in API charges hurt. But the real cost was trust.

For 12 hours, I had runaway automation making decisions without oversight. The service was designed to be helpful (auto-restart on failure). Instead, it was harmful (infinite retry on permanent failure).

This changed how I think about automation:

Before: "Make it resilient. Auto-recover from everything."

After: "Make it safe. Fail loudly on unexpected conditions."

Auto-restart is resilience theater when the failure is permanent (port conflict). Better to exit cleanly and require human intervention.
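In code, that distinction can be as simple as sorting failures into two groups before deciding whether to retry. Here's a minimal Python sketch: transient errors get a few attempts with increasing delays, while a port conflict is raised immediately. The function names and retry limits are assumptions for illustration.

```python
import errno
import time


def is_permanent(exc):
    """Treat 'address already in use' as a failure that retrying will not fix."""
    return isinstance(exc, OSError) and exc.errno == errno.EADDRINUSE


def run_with_retries(start, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call start(); retry transient errors with backoff, re-raise permanent ones at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return start()
        except Exception as exc:
            # Give up right away on permanent failures or when attempts run out
            if is_permanent(exc) or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, then 2s, then 4s, ...
```

The attempt cap is the important part. Even a transient failure stops retrying after `max_attempts`, so the worst case is a known, small number of paid startups.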

Your Turn

Check your LaunchAgents right now:

# List all your LaunchAgents
ls (local path)

# Check for suspicious restart settings
grep -l "StartInterval\|KeepAlive" (local path)

# Review any that have both

If you find duplicates or aggressive restart settings on services that make API calls, fix them before they cost you.

The $100 lesson: automation should be safe first, resilient second.

About the Author: Justin Johnson builds AI systems and writes about practical AI development.
