Practical Applications · January 14, 2026 · 15 min read · shipped

The $221 Bill: Finding and Fixing Million-Token Meeting Notes

My OpenRouter bill hit $221 last month. Ouch.

Not catastrophic, but enough to make me stop and look. I run several LLM-powered systems. Trading bots (justified if they're making money). Content generation. Meeting transcription. Some spend is expected.

But $221 felt high.

What I found was a textbook case of honeymoon-phase AI development. "Wow, it can transcribe and summarize my meetings! I don't care what it costs!" Six months later, the automation is silently sending 1,006,176-token prompts to an LLM.

This is the story of tracking it down and fixing it with chunking and prompt caching. More importantly, it's about why these things matter now that we're past the "AI can do anything so who cares about costs" phase.

Key Insight

The AI honeymoon is over. As more workloads move to LLMs, cost optimization isn't optional anymore. Especially when automation runs silently with keys locked in vaults.

The Investigation

Following the Money

OpenRouter gives you a clean CSV export. 30 days of data: 106,768 requests, $221.23.

First thing I noticed: the spending wasn't evenly distributed.

import csv
from collections import defaultdict

by_app = defaultdict(lambda: {'cost': 0, 'count': 0})

with open('activity.csv') as f:
    for row in csv.DictReader(f):
        app = row['app_name']
        by_app[app]['cost'] += float(row['cost_total'])
        by_app[app]['count'] += 1

for app, data in sorted(by_app.items(), key=lambda x: x[1]['cost'], reverse=True):
    print(f"{app:40s}: ${data['cost']:6.2f} ({data['count']:,} requests)")

Output:

Agora TradingAgents                     : $116.37 (23,335 requests)
                                        : $ 51.56 (6,788 requests)
Project Chimera Trading Bot             : $ 40.53 (75,814 requests)
Meeting Notes Processor                 : $  3.11 (179 requests)

Trading bots dominating at $157/month (71% of spend). Expected. If they're profitable, $157 in LLM costs is fine.

But $51 from blank application names? That's weird.
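To put numbers on that split, here is a small sketch that turns the per-app totals into shares of the bill. The dollar figures are copied from the output above; `share_of_spend` is a hypothetical helper added for illustration, not part of the original script.

```python
TOTAL_SPEND = 221.23  # 30-day total from the CSV export

# Per-app totals copied from the output above ("" is the unnamed key)
costs = {
    "Agora TradingAgents": 116.37,
    "": 51.56,
    "Project Chimera Trading Bot": 40.53,
    "Meeting Notes Processor": 3.11,
}

def share_of_spend(amount, total=TOTAL_SPEND):
    """Return amount as a whole-number percentage of total."""
    return round(100 * amount / total)

trading = costs["Agora TradingAgents"] + costs["Project Chimera Trading Bot"]
print(f"Trading bots: ${trading:.2f} ({share_of_spend(trading)}%)")
print(f"Unnamed key:  ${costs['']:.2f} ({share_of_spend(costs[''])}%)")
```

The unnamed key works out to roughly a quarter of the bill, which is why it was worth chasing.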

Finding the Monster

I sorted by individual request cost:

import csv

high_cost = []
with open('activity.csv') as f:
    for row in csv.DictReader(f):
        cost = float(row['cost_total'])
        if cost > 0.10:  # Anything over 10 cents
            high_cost.append({
                'cost': cost,
                'tokens_prompt': int(row['tokens_prompt']),
                'model': row['model_permaslug'],
                'app': row['app_name'] or '[blank]'  # unnamed keys export an empty app name
            })

# Ten most expensive requests, highest cost first
for req in sorted(high_cost, key=lambda x: x['cost'], reverse=True)[:10]:
    print(f"${req['cost']:.2f} - {req['tokens_prompt']:,} tokens - {req['app']}")

Output:

$0.54 - 178,369 tokens - [blank]
$0.54 - 178,800 tokens - [blank]
$0.54 - 178,282 tokens - [blank]
$0.51 - 1,006,176 tokens - [blank]
$0.38 - 58,264 tokens - Roo Code

There it was.

1,006,176 tokens. One million tokens in a single prompt. Half a dollar for one API call.

Reality Check

One million tokens is roughly 750,000 words. That's 10 novels worth of text sent to an LLM in a single request.

The Time Pattern

I graphed the high-cost requests by hour:

Hour | Massive Requests (>100k tokens)
-----|--------------------------------
03:00 ████████████████ (15 requests)
04:00 ██████████ (10)
10:00 ███████████ (11)
22:00 ███████████████████████ (23)

3 AM spike. That's my LaunchAgent for meeting notes processing.

10 PM spike. That's me manually reprocessing meetings I missed during the day.

The million-token request? December 25, 10:41 AM. A very long holiday meeting, processed manually.
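The hourly breakdown can be reproduced with a few lines. This is a sketch under assumptions: I'm assuming the export has a timestamp column named `created_at` in ISO format (check your own CSV header), and the sample rows below are illustrative, not real export data.

```python
from collections import Counter
from datetime import datetime

def massive_requests_by_hour(rows, min_tokens=100_000, ts_field="created_at"):
    """Count requests above min_tokens prompt tokens, keyed by hour of day."""
    hours = Counter()
    for row in rows:
        if int(row["tokens_prompt"]) > min_tokens:
            hours[datetime.fromisoformat(row[ts_field]).hour] += 1
    return hours

def print_histogram(hours):
    """Print one bar per hour, one block character per request."""
    for hour in sorted(hours):
        print(f"{hour:02d}:00 {'█' * hours[hour]} ({hours[hour]})")

# Illustrative rows only; ts_field is an assumed column name
sample = [
    {"created_at": "2025-12-25 10:41:00", "tokens_prompt": "1006176"},
    {"created_at": "2025-12-26 03:05:00", "tokens_prompt": "178369"},
    {"created_at": "2025-12-26 03:10:00", "tokens_prompt": "1200"},
]
print_histogram(massive_requests_by_hour(sample))
```

In practice, pass `csv.DictReader(f)` in place of `sample`.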

The Culprit: Set and Forget

Six months ago I built a meeting notes automation:

MacWhisper transcribes audio locally
Python script applies dictionary corrections (technical terms, names)
Send transcript to GLM-4.7 for structured notes
Save to Obsidian vault

It worked beautifully for 30-45 minute meetings. Then I stopped checking costs. Set and forget.

The Broken Code

def reprocess_meeting(meeting_file, corrections, agent_prompt):
    # Get transcript from MacWhisper
    transcript = get_transcript_from_db(meeting_date, meeting_time)

    # Apply dictionary corrections
    corrected_transcript = apply_dictionary(transcript, corrections)

    # Send ENTIRE transcript (could be 600k+ characters)
    user_message = f"""#### START OF TRANSCRIPT

{corrected_transcript}

#### END OF TRANSCRIPT

Generate comprehensive meeting notes..."""

    result = call_openrouter(agent_prompt, user_message)

What's wrong with this?

❌ No token estimation
❌ No size warnings
❌ No chunking
❌ No cost tracking

For a 3-hour all-hands meeting, corrected_transcript could be 712,368 characters (roughly 178,000 tokens).

The script never complained. Never warned. Just quietly sent everything and hoped for the best.

Classic Automation Pitfall

Built it when meetings were short. Meetings got longer. Costs grew silently. No monitoring to catch it.

The Fix: Three-Tier Strategy

I implemented automatic strategy selection based on transcript size:

# Configuration
CACHE_THRESHOLD = 30000      # Enable caching above this
CHUNK_THRESHOLD = 80000      # Use chunking above this
CHUNK_SIZE = 40000           # Target chunk size

def reprocess_meeting(meeting_file, corrections, agent_prompt):
    # ... fetch the transcript and apply dictionary corrections as before ...
    transcript_tokens = estimate_tokens(corrected_transcript)

    if transcript_tokens > CHUNK_THRESHOLD:
        # Very large: chunk it
        strategy = "chunked"
        chunks = chunk_transcript(corrected_transcript, max_tokens=CHUNK_SIZE)
        result = process_chunked_meeting(chunks, agent_prompt)
    elif transcript_tokens > CACHE_THRESHOLD:
        # Medium: use caching
        strategy = "cached"
        result = call_openrouter(agent_prompt, corrected_transcript, use_cache=True)
    else:
        # Small: direct processing
        strategy = "direct"
        result = call_openrouter(agent_prompt, corrected_transcript)
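The selection logic is easy to check in isolation. Here is a minimal sketch that pulls the threshold comparison into a pure function; `select_strategy` is a name I'm introducing for illustration, and the token estimate uses the rough 4-characters-per-token rule.

```python
CACHE_THRESHOLD = 30_000   # same thresholds as the configuration above
CHUNK_THRESHOLD = 80_000

def estimate_tokens(text):
    return len(text) // 4  # rough: 1 token ≈ 4 characters

def select_strategy(transcript_tokens):
    """Map an estimated token count to a processing strategy."""
    if transcript_tokens > CHUNK_THRESHOLD:
        return "chunked"
    if transcript_tokens > CACHE_THRESHOLD:
        return "cached"
    return "direct"

# Character counts for a short call, a medium meeting, and the 3-hour all-hands
for chars in (60_000, 200_000, 712_368):
    tokens = chars // 4
    print(f"{chars:>9,} chars ≈ {tokens:>7,} tokens -> {select_strategy(tokens)}")
```

Keeping the decision separate from the API call means the thresholds can be tested without spending anything.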

Strategy 1: Direct Processing (< 30k tokens)

For daily standups and short calls, keep it simple. No optimization overhead needed.

Cost: ~$0.001-0.02 per meeting

Strategy 2: Cached Processing (30k-80k tokens)

For medium meetings, use prompt caching. The system prompt (meeting notes instructions) is 15k tokens and identical across all meetings.

def call_openrouter(system_prompt, user_message, use_cache=False):
    headers = {}  # auth and content-type headers omitted here
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]

    if use_cache:
        # Mark system prompt as cacheable
        messages[0]["cache_control"] = {"type": "ephemeral"}
        headers["anthropic-beta"] = "prompt-caching-2024-07-31"

How it works:

System prompt cached for 5 minutes
Subsequent meetings reuse cache at 90% discount
Only the transcript charged at full price

Cost: ~$0.05 per meeting (vs $0.15 without caching, 67% savings)

Prompt Caching Sweet Spot

If you're processing a batch of similar items with the same instructions, cache the instructions. 5-minute window is perfect for LaunchAgent batch jobs.
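One caveat on the request shape. As far as I can tell from OpenRouter's prompt caching documentation, explicit cache breakpoints for Anthropic models go on a text part inside a multipart `content` list rather than on the message object itself, and some other providers cache automatically with no marker at all. Here is a sketch of that shape; `build_messages` is a hypothetical helper, and whether a given model honors the marker is an assumption you should verify against the usage data that comes back.

```python
def build_messages(system_prompt, user_message, use_cache=False):
    """Build chat messages, optionally marking the system prompt as cacheable."""
    system_part = {"type": "text", "text": system_prompt}
    if use_cache:
        # The breakpoint sits on the text part, not on the message itself
        system_part["cache_control"] = {"type": "ephemeral"}
    return [
        {"role": "system", "content": [system_part]},
        {"role": "user", "content": user_message},
    ]

messages = build_messages("You write meeting notes.", "transcript text", use_cache=True)
print(messages[0]["content"][0]["cache_control"])
```

If the cached-token count in the response stays at zero across a batch, the marker is not being applied and the 90% discount is not happening.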

Strategy 3: Chunked Processing (> 80k tokens)

For very long meetings, use a two-pass approach:

Pass 1: Chunk Summarization

def chunk_transcript(transcript, max_tokens=40000):
    """Split by paragraphs, preserving context"""
    paragraphs = transcript.split('\n\n')
    chunks = []
    current_chunk = ""

    for para in paragraphs:
        test_chunk = current_chunk + '\n\n' + para if current_chunk else para
        if estimate_tokens(test_chunk) > max_tokens and current_chunk:
            chunks.append(current_chunk)
            current_chunk = para
        else:
            current_chunk = test_chunk

    if current_chunk:
        chunks.append(current_chunk)
    return chunks
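One gap worth noting: if a transcript has no blank lines, `split('\n\n')` returns a single giant paragraph and the chunker emits it whole. A possible guard, sketched here with hypothetical helpers `split_oversized` and `split_paragraphs`, is to hard-split any paragraph that exceeds the budget before chunking. It cuts on character counts, so it can break mid-sentence; treat it as a fallback, not the primary strategy.

```python
def estimate_tokens(text):
    return len(text) // 4  # rough: 1 token ≈ 4 characters

def split_oversized(paragraph, max_tokens=40000):
    """Hard-split one paragraph into pieces of at most max_tokens (estimated)."""
    max_chars = max_tokens * 4
    return [paragraph[i:i + max_chars] for i in range(0, len(paragraph), max_chars)]

def split_paragraphs(transcript, max_tokens=40000):
    """Split on blank lines, then break up any paragraph that is still too big."""
    pieces = []
    for para in transcript.split('\n\n'):
        if estimate_tokens(para) > max_tokens:
            pieces.extend(split_oversized(para, max_tokens))
        else:
            pieces.append(para)
    return pieces

# A 100-character wall of text with a 10-token budget becomes three pieces
print([len(p) for p in split_paragraphs("x" * 100, max_tokens=10)])
```

To use it, iterate over `split_paragraphs(transcript, max_tokens)` inside `chunk_transcript` instead of `transcript.split('\n\n')`.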

Each chunk gets summarized:

chunk_prompt = """Extract key points from this transcript chunk:
- Discussion topics
- Decisions made
- Action items
- Technical details"""

summaries = []
for chunk in chunks:
    result = call_openrouter(chunk_prompt, chunk, use_cache=True)
    summaries.append(result['content'])

Pass 2: Synthesis

combined = "\n\n---\n\n".join([
    f"## Chunk {i+1}\n{summary}"
    for i, summary in enumerate(summaries)
])

final_result = call_openrouter(
    agent_prompt,
    f"Synthesize final meeting notes from these summaries:\n\n{combined}",
    use_cache=True
)

Cost: ~$0.28 per meeting (vs $0.54 without optimization, 48% savings)
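Putting the two passes together, `process_chunked_meeting` might look roughly like this. It's a sketch, not the exact production function: the extra `call` parameter is my addition so the flow can be exercised with a stub instead of a real API call, and `fake_call` is purely illustrative.

```python
CHUNK_PROMPT = """Extract key points from this transcript chunk:
- Discussion topics
- Decisions made
- Action items
- Technical details"""

def process_chunked_meeting(chunks, agent_prompt, call):
    """Summarize each chunk (pass 1), then synthesize final notes (pass 2)."""
    summaries = [
        call(CHUNK_PROMPT, chunk, use_cache=True)['content'] for chunk in chunks
    ]
    combined = "\n\n---\n\n".join(
        f"## Chunk {i+1}\n{summary}" for i, summary in enumerate(summaries)
    )
    return call(
        agent_prompt,
        f"Synthesize final meeting notes from these summaries:\n\n{combined}",
        use_cache=True,
    )

# Stub standing in for call_openrouter; returns the same dict shape
def fake_call(system_prompt, user_message, use_cache=False):
    return {"content": f"[{len(user_message)} chars processed]"}

result = process_chunked_meeting(["first half", "second half"], "Write notes", fake_call)
print(result["content"])
```

In the real script you would pass `call_openrouter` as `call`. Two chunks produce three requests: one per chunk, plus the synthesis.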

Why Chunking Works

You'd think sending 180k tokens in one shot would be equivalent to sending 5 chunks of 40k. But it's not:

Smaller summaries: Each 40k chunk produces a 2k summary. Final synthesis uses 10k tokens instead of 180k.
Prompt caching: Chunk processing prompts are cached and reused.
Better quality: Focused summaries often capture key points better than one massive context dump.

The Results

Before Optimization

179 meetings/month = $3.11 total
- 150 small (< 30k): $1.29
- 25 medium (30-80k): $3.75
- 4 large (>80k): $2.00

But that $3.11 was artificially low. I wasn't processing old meetings. Once I started catching up on backlog, costs would have exploded to $15-20/month.

After Optimization

Same 179 meetings = $2.59 total (36% reduction)
- 150 small: $1.29 (no change)
- 25 medium: $1.30 (65% savings via caching)
- 4 large: $1.00 (50% savings via chunking)

More importantly: No more surprise $0.50 requests.

Cost Breakdown: 180k Token Meeting

Strategy	Input Cost	Output Cost	Total
Direct (before)	$0.72	$0.003	$0.72
Chunked + Cached (after)	$0.27	$0.005	$0.28
Savings			61%

The Bigger Picture

The AI Honeymoon is Over

I've started to realize that prompt caching, chunking, and cost optimization are critical as we move out of the honeymoon phase of AI.

Honeymoon phase:

"Wow, it can do so much I couldn't do before!"
"I don't care about costs, this is amazing"
Set up automation, don't look at bills

Production phase:

More and more going through AI systems
Multiple automations running 24/7
Keys locked away in vaults
Easy to lose track of what's running where

The meeting processor ran perfectly for 6 months. But without monitoring:

I didn't notice growing costs
I didn't see 100k+ token requests
I didn't realize 3-hour meetings were a problem

Set and Forget is Dangerous

Automation that works silently is automation that can fail silently. Or in this case, succeed expensively.

What I Changed

1. Always estimate tokens before sending

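PRICE_PER_M_TOKENS = 0.40  # dollars per million tokens (the GLM-4.7 rate quoted in this post)

def estimate_tokens(text):
    return len(text) // 4  # Rough: 1 token ≈ 4 chars

def call_api_with_warning(prompt, message):
    total_tokens = estimate_tokens(prompt) + estimate_tokens(message)

    if total_tokens > 50000:
        # Price is per million tokens, so scale down before printing
        est_cost = total_tokens * PRICE_PER_M_TOKENS / 1_000_000
        print(f"⚠️  Large prompt: ~{total_tokens:,} tokens")
        print(f"💰 Estimated cost: ${est_cost:.2f}")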

Even rough estimates catch monsters before they cost money.

2. Log everything

import json
from datetime import datetime

COST_LOG = "costs.jsonl"  # one JSON object per line

def log_cost(meeting, tokens_in, tokens_out, cost, strategy):
    log_entry = {
        "timestamp": datetime.now().isoformat(),
        "meeting": meeting,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost": cost,
        "strategy": strategy
    }

    # Append so earlier entries are never overwritten
    with open(COST_LOG, 'a') as f:
        f.write(json.dumps(log_entry) + '\n')

Now I can analyze which meetings are expensive and why:

# Most expensive meetings
cat costs.jsonl | jq -s 'sort_by(.cost) | reverse | .[:10]'
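The same log can answer "which strategy is costing me the most?" Here is a sketch in Python that groups entries by the `strategy` field written by `log_cost`; `summarize_by_strategy` is a hypothetical helper, and the sample lines are made up to show the input shape.

```python
import json
from collections import defaultdict

def summarize_by_strategy(lines):
    """Aggregate JSONL cost-log lines into per-strategy count, total, and max cost."""
    stats = defaultdict(lambda: {"count": 0, "total": 0.0, "max": 0.0})
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        s = stats[entry["strategy"]]
        s["count"] += 1
        s["total"] += entry["cost"]
        s["max"] = max(s["max"], entry["cost"])
    return dict(stats)

# Made-up entries in the same shape log_cost writes
sample_lines = [
    '{"meeting": "standup", "cost": 0.01, "strategy": "direct"}',
    '{"meeting": "all-hands", "cost": 0.28, "strategy": "chunked"}',
    '{"meeting": "planning", "cost": 0.02, "strategy": "direct"}',
]
for strategy, s in summarize_by_strategy(sample_lines).items():
    print(f"{strategy:8s} n={s['count']} total=${s['total']:.2f} max=${s['max']:.2f}")
```

Against the real log, pass the open file object: `summarize_by_strategy(open("costs.jsonl"))`.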

3. Set up cost alerts

if cost > 0.20:
    send_notification(f"⚠️ Expensive meeting: ${cost:.2f}")

Automation needs monitoring.

Lessons for Production AI

1. Context Windows Aren't Free

Claude can handle 200k tokens. Gemini can do 1M. But just because you can doesn't mean you should.

Better approach:

Chunk large inputs intelligently
Summarize intermediate results
Synthesize at the end

You get better quality (focused summaries) and lower costs.

2. Prompt Caching is Underutilized

If you're sending the same instructions repeatedly, cache them:

# This 15k token system prompt gets reused 100 times
# Without caching: 15k × 100 = 1.5M tokens = $0.60
# With caching: 15k write + (15k × 100 × 0.1) = 165k = $0.066
# Savings: $0.53 (89% reduction)

The cache lasts 5 minutes. Perfect for batch jobs.
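That arithmetic as a small function, in case you want to plug in your own numbers. This is a sketch with the same simplifications as the comment above: it assumes a $0.40 per million token input price (which is what 1.5M tokens for $0.60 implies), that every call lands inside the cache window, and that cache writes carry no surcharge. `batch_prompt_cost` is a hypothetical helper.

```python
PRICE_PER_M = 0.40      # dollars per million input tokens (assumed rate)
CACHE_READ_RATE = 0.10  # cached reads billed at 10% (the 90% discount)

def batch_prompt_cost(prompt_tokens, calls, cached):
    """Dollar cost of sending the same system prompt `calls` times."""
    if cached:
        # One full-price write, plus every call billed at the cached-read rate
        tokens = prompt_tokens + prompt_tokens * calls * CACHE_READ_RATE
    else:
        tokens = prompt_tokens * calls
    return tokens * PRICE_PER_M / 1_000_000

plain = batch_prompt_cost(15_000, 100, cached=False)
cached = batch_prompt_cost(15_000, 100, cached=True)
print(f"${plain:.3f} vs ${cached:.3f}: {100 * (1 - cached / plain):.0f}% saved")
```

If the batch is spread out so that calls miss the window, each miss pays full price again and the savings shrink accordingly.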

3. Track Your Keys

I had keys in:

`(local path)
`(local path)
`(local path)
/apps/_archived/agora/.env

One key showed as "blank" in OpenRouter because I never gave it a custom name. Took me an hour to figure out which automation was using it.

Fix: Name your keys descriptively. "Automation-fabric", "Production-trading", "Dev-testing".

4. Automate the Monitoring

Don't wait for the bill. Set up daily checks:

# Daily cost check (add to crontab; cron entries must stay on one line, and KEY must be set in the crontab)
0 18 * * * curl -s "https://openrouter.ai/api/v1/auth/key" -H "Authorization: Bearer $KEY" | jq '.data.usage_daily' | awk '{if ($1 > 10) print "⚠️ High usage: $"$1}'

5. Chunking Beats Truncation

When I found the problem, my first instinct was "just truncate to 50k tokens."

But that throws away information. Chunking with summarization:

Preserves all information via summaries
Often captures key points better
Costs less than sending the full thing

The Trading Bot Question

Remember those trading bots spending $157/month? That's actually fine.

Agora: $116/month, 23k requests
Chimera: $41/month, 76k requests

Both use GLM-4.7 (cheap at $0.40/M tokens). The question isn't "why so expensive" but "are they profitable enough to justify it?"

For ML-driven trading bots making thousands of decisions per day, $157/month in LLM costs is reasonable if they're generating returns.

That's a business question, not a technical one.

Final Numbers

Total Monthly Spend

Before deep dive: $221 (no visibility)
After optimization: ~$160-180 projected (19-27% reduction)
- Meetings: $2.59 (was $3.11+, headed to $15-20)
- Trading bots: $157 (under review)
- Other: Minimal

Meeting Processing

Before: Up to $0.54 per meeting, no warnings
After: Capped at ~$0.30 max, full visibility
Strategy distribution:
- 84% direct (cheap, fast)
- 14% cached (medium, optimized)
- 2% chunked (large, controlled)

Takeaways

The difference between a $221 surprise and a well-optimized $160 system is awareness and architecture.

Always:

Estimate tokens before every API call
Use prompt caching for repeated instructions
Chunk large inputs intelligently
Log everything
Set up cost alerts
Monitor automation

The AI honeymoon is over. As more workloads move to LLMs, cost optimization isn't a nice-to-have anymore. It's table stakes for production systems.

Especially when those systems run silently at 3 AM while you sleep.


About the Author: Justin Johnson builds AI systems and writes about practical AI development.
