# The $221 Bill: Finding and Fixing Million-Token Meeting Notes
My OpenRouter bill hit $221 last month. Ouch.
Not catastrophic, but enough to make me stop and look. I run several LLM-powered systems. Trading bots (justified if they're making money). Content generation. Meeting transcription. Some spend is expected.
But $221 felt high.
What I found was a textbook case of honeymoon-phase AI development. "Wow, it can transcribe and summarize my meetings! I don't care what it costs!" Six months later, the automation is silently sending **1,006,176-token prompts** to an LLM.
This is the story of tracking it down and fixing it with chunking and prompt caching. More importantly, it's about why these things matter now that we're past the "AI can do anything so who cares about costs" phase.
> [!tip] Key Insight
> The AI honeymoon is over. As more workloads move to LLMs, cost optimization isn't optional anymore. Especially when automation runs silently with keys locked in vaults.
## The Investigation
### Following the Money
OpenRouter gives you a clean CSV export. 30 days of data: 106,768 requests, $221.23.
First thing I noticed: the spending wasn't evenly distributed.
```python
import csv
from collections import defaultdict

by_app = defaultdict(lambda: {'cost': 0, 'count': 0})

with open('activity.csv') as f:
    for row in csv.DictReader(f):
        app = row['app_name']
        by_app[app]['cost'] += float(row['cost_total'])
        by_app[app]['count'] += 1

for app, data in sorted(by_app.items(), key=lambda x: x[1]['cost'], reverse=True):
    print(f"{app:40s}: ${data['cost']:6.2f} ({data['count']:,} requests)")
```
**Output:**
```
Agora TradingAgents : $116.37 (23,335 requests)
: $ 51.56 (6,788 requests)
Project Chimera Trading Bot : $ 40.53 (75,814 requests)
Meeting Notes Processor : $ 3.11 (179 requests)
```
Trading bots dominating at $157/month (71% of spend). Expected. If they're profitable, $157 in LLM costs is fine.
But $51 from blank application names? That's weird.
### Finding the Monster
I sorted by individual request cost:
```python
high_cost = []
with open('activity.csv') as f:
    for row in csv.DictReader(f):
        cost = float(row['cost_total'])
        if cost > 0.10:  # Anything over 10 cents
            high_cost.append({
                'cost': cost,
                'tokens_prompt': int(row['tokens_prompt']),
                'model': row['model_permaslug'],
                'app': row['app_name']
            })

for req in sorted(high_cost, key=lambda x: x['cost'], reverse=True)[:10]:
    print(f"${req['cost']:.2f} - {req['tokens_prompt']:,} tokens - {req['app']}")
```
**Output:**
```
$0.54 - 178,369 tokens - [blank]
$0.54 - 178,800 tokens - [blank]
$0.54 - 178,282 tokens - [blank]
$0.51 - 1,006,176 tokens - [blank]
$0.38 - 58,264 tokens - Roo Code
```
There it was.
**1,006,176 tokens.** One million tokens in a single prompt. Half a dollar for one API call.
> [!warning] Reality Check
> One million tokens is roughly 750,000 words. That's 10 novels worth of text sent to an LLM in a single request.
### The Time Pattern
I graphed the high-cost requests by hour:
```
Hour | Massive Requests (>100k tokens)
-----|--------------------------------
03:00 ████████████████ (15 requests)
04:00 ██████████ (10)
10:00 ███████████ (11)
22:00 ███████████████████████ (23)
```
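That chart is just the export bucketed by hour. A minimal sketch of the grouping (the timestamp column name is my guess; check your export's header):

```python
import csv
from collections import Counter
from datetime import datetime

by_hour = Counter()
with open('activity.csv') as f:
    for row in csv.DictReader(f):
        if int(row['tokens_prompt'] or 0) > 100_000:
            # 'created_at' is an assumption -- use whatever timestamp
            # column your OpenRouter export actually contains
            by_hour[datetime.fromisoformat(row['created_at']).hour] += 1

for hour in sorted(by_hour):
    print(f"{hour:02d}:00 {'█' * by_hour[hour]} ({by_hour[hour]})")
```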
3 AM spike. That's my LaunchAgent for meeting notes processing.
10 PM spike. That's me manually reprocessing meetings I missed during the day.
The million-token request? December 25, 10:41 AM. A very long holiday meeting, processed manually.
## The Culprit: Set and Forget
Six months ago I built a meeting notes automation:
1. MacWhisper transcribes audio locally
2. Python script applies dictionary corrections (technical terms, names)
3. Send transcript to GLM-4.7 for structured notes
4. Save to Obsidian vault
It worked beautifully for 30-45 minute meetings. Then I stopped checking costs. Set and forget.
### The Broken Code
```python
def reprocess_meeting(meeting_file, corrections, agent_prompt):
    # Get transcript from MacWhisper's database
    transcript = get_transcript_from_db(meeting_file)

    # Apply dictionary corrections (technical terms, names)
    corrected_transcript = apply_dictionary(transcript, corrections)

    # Send the ENTIRE transcript -- could be 600k+ characters
    user_message = f"""#### START OF TRANSCRIPT
{corrected_transcript}
#### END OF TRANSCRIPT

Generate comprehensive meeting notes..."""

    result = call_openrouter(agent_prompt, user_message)
```
What's wrong with this?
- ❌ No token estimation
- ❌ No size warnings
- ❌ No chunking
- ❌ No cost tracking
For a 3-hour all-hands meeting, `corrected_transcript` could be **712,368 characters** (roughly 178,000 tokens).
The script never complained. Never warned. Just quietly sent everything and hoped for the best.
> [!info] Classic Automation Pitfall
> Built it when meetings were short. Meetings got longer. Costs grew silently. No monitoring to catch it.
## The Fix: Three-Tier Strategy
I implemented automatic strategy selection based on transcript size:
```python
# Configuration
CACHE_THRESHOLD = 30000   # Enable caching above this
CHUNK_THRESHOLD = 80000   # Use chunking above this
CHUNK_SIZE = 40000        # Target chunk size

def reprocess_meeting(meeting_file, corrections, agent_prompt):
    transcript = get_transcript_from_db(meeting_file)
    corrected_transcript = apply_dictionary(transcript, corrections)
    transcript_tokens = estimate_tokens(corrected_transcript)

    if transcript_tokens > CHUNK_THRESHOLD:
        # Very large: chunk it
        strategy = "chunked"
        chunks = chunk_transcript(corrected_transcript, max_tokens=CHUNK_SIZE)
        result = process_chunked_meeting(chunks, agent_prompt)
    elif transcript_tokens > CACHE_THRESHOLD:
        # Medium: use caching
        strategy = "cached"
        result = call_openrouter(agent_prompt, corrected_transcript, use_cache=True)
    else:
        # Small: direct processing
        strategy = "direct"
        result = call_openrouter(agent_prompt, corrected_transcript)
```
### Strategy 1: Direct Processing (< 30k tokens)
For daily standups and short calls, keep it simple. No optimization overhead needed.
**Cost:** ~$0.001-0.02 per meeting
### Strategy 2: Cached Processing (30k-80k tokens)
For medium meetings, use prompt caching. The system prompt (meeting notes instructions) is 15k tokens and identical across all meetings.
```python
def call_openrouter(system_prompt, user_message, use_cache=False):
    headers = {"Authorization": f"Bearer {OPENROUTER_API_KEY}"}  # key loaded from the vault
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]
    if use_cache:
        # Mark the (identical, reused) system prompt as cacheable
        messages[0]["cache_control"] = {"type": "ephemeral"}
        headers["anthropic-beta"] = "prompt-caching-2024-07-31"
    # ... POST messages + headers to the OpenRouter chat completions endpoint
```
**How it works:**
- System prompt cached for 5 minutes
- Subsequent meetings reuse cache at 90% discount
- Only the transcript charged at full price
**Cost:** ~$0.05 per meeting (vs $0.15 without caching, **67% savings**)
> [!tip] Prompt Caching Sweet Spot
> If you're processing a batch of similar items with the same instructions, cache the instructions. 5-minute window is perfect for LaunchAgent batch jobs.
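In practice that's just the nightly batch run processed back-to-back, so every call after the first hits the still-warm cache. A rough sketch (the queue and save helpers here are illustrative, not the actual LaunchAgent code):

```python
def process_backlog(meeting_files, corrections, agent_prompt):
    """Run queued meetings consecutively so the cached system prompt
    (5-minute TTL) stays warm for the whole batch."""
    for meeting_file in meeting_files:
        transcript = get_transcript_from_db(meeting_file)
        corrected = apply_dictionary(transcript, corrections)
        # First call writes the cache; later calls read it at the discounted rate
        notes = call_openrouter(agent_prompt, corrected, use_cache=True)
        save_to_vault(meeting_file, notes)  # illustrative: write to Obsidian
```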
### Strategy 3: Chunked Processing (> 80k tokens)
For very long meetings, use a two-pass approach:
**Pass 1: Chunk Summarization**
```python
def chunk_transcript(transcript, max_tokens=40000):
    """Split by paragraphs, preserving context"""
    paragraphs = transcript.split('\n\n')
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        test_chunk = current_chunk + '\n\n' + para if current_chunk else para
        if estimate_tokens(test_chunk) > max_tokens and current_chunk:
            chunks.append(current_chunk)
            current_chunk = para
        else:
            current_chunk = test_chunk
    if current_chunk:
        chunks.append(current_chunk)
    return chunks
```
Each chunk gets summarized:
```python
chunk_prompt = """Extract key points from this transcript chunk:
- Discussion topics
- Decisions made
- Action items
- Technical details"""

summaries = []
for chunk in chunks:
    result = call_openrouter(chunk_prompt, chunk, use_cache=True)
    summaries.append(result['content'])
```
**Pass 2: Synthesis**
```python
combined = "\n\n---\n\n".join([
    f"## Chunk {i+1}\n{summary}"
    for i, summary in enumerate(summaries)
])

final_result = call_openrouter(
    agent_prompt,
    f"Synthesize final meeting notes from these summaries:\n\n{combined}",
    use_cache=True
)
```
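Pulled together, `process_chunked_meeting` (the function the strategy selector calls above) is roughly those two passes in sequence:

```python
def process_chunked_meeting(chunks, agent_prompt):
    """Pass 1: summarize each chunk with the cached chunk prompt.
    Pass 2: synthesize final notes from the combined summaries."""
    summaries = []
    for chunk in chunks:
        result = call_openrouter(chunk_prompt, chunk, use_cache=True)
        summaries.append(result['content'])

    combined = "\n\n---\n\n".join(
        f"## Chunk {i+1}\n{summary}" for i, summary in enumerate(summaries)
    )
    return call_openrouter(
        agent_prompt,
        f"Synthesize final meeting notes from these summaries:\n\n{combined}",
        use_cache=True
    )
```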
**Cost:** ~$0.28 per meeting (vs $0.54 without optimization, **48% savings**)
### Why Chunking Works
You'd think sending 180k tokens in one shot would be equivalent to sending 5 chunks of 40k. But it's not:
1. **Smaller summaries**: Each 40k chunk produces a 2k summary. Final synthesis uses 10k tokens instead of 180k.
2. **Prompt caching**: Chunk processing prompts are cached and reused.
3. **Better quality**: Focused summaries often capture key points better than one massive context dump.
## The Results
### Before Optimization
```
179 meetings/month = $3.11 total
- 150 small (< 30k): $1.29
- 25 medium (30-80k): $3.75
- 4 large (>80k): $2.00
```
But that $3.11 was artificially low. I wasn't processing old meetings. Once I started catching up on backlog, costs would have exploded to $15-20/month.
### After Optimization
```
Same 179 meetings = $2.59 total (36% reduction)
- 150 small: $1.29 (no change)
- 25 medium: $1.30 (65% savings via caching)
- 4 large: $1.00 (50% savings via chunking)
```
More importantly: **No more surprise $0.50 requests**.
### Cost Breakdown: 180k Token Meeting
| Strategy | Input Cost | Output Cost | Total |
|----------|------------|-------------|-------|
| **Direct (before)** | $0.72 | $0.003 | **$0.72** |
| **Chunked + Cached (after)** | $0.27 | $0.005 | **$0.28** |
| **Savings** | | | **61%** |
## The Bigger Picture
### The AI Honeymoon is Over
I've started to realize that prompt caching, chunking, and cost optimization are critical as we move out of the honeymoon phase of AI.
**Honeymoon phase:**
- "Wow, it can do so much I couldn't do before!"
- "I don't care about costs, this is amazing"
- Set up automation, don't look at bills
**Production phase:**
- More and more going through AI systems
- Multiple automations running 24/7
- Keys locked away in vaults
- Easy to lose track of what's running where
The meeting processor ran perfectly for 6 months. But without monitoring:
- I didn't notice growing costs
- I didn't see 100k+ token requests
- I didn't realize 3-hour meetings were a problem
> [!warning] Set and Forget is Dangerous
> Automation that works silently is automation that can fail silently. Or in this case, succeed expensively.
### What I Changed
**1. Always estimate tokens before sending**
```python
# Rough input price for the model in use; the $0.54 / 178k-token requests
# above work out to roughly $3 per million input tokens
COST_PER_INPUT_TOKEN = 3.0 / 1_000_000

def estimate_tokens(text):
    return len(text) // 4  # Rough: 1 token ≈ 4 chars

def call_api_with_warning(prompt, message):
    total_tokens = estimate_tokens(prompt) + estimate_tokens(message)
    if total_tokens > 50000:
        print(f"⚠️ Large prompt: ~{total_tokens:,} tokens")
        print(f"💰 Estimated cost: ${total_tokens * COST_PER_INPUT_TOKEN:.2f}")
```
Even rough estimates catch monsters before they cost money.
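A quick hypothetical run against a 3-hour all-hands transcript (filename and prompt variable are illustrative) shows what the guard catches:

```python
# ~712k characters on disk ≈ 178k tokens, plus the ~15k-token system prompt
transcript = open("all_hands_transcript.txt").read()
call_api_with_warning(agent_prompt, transcript)
# Prints something like:
#   ⚠️ Large prompt: ~193,000 tokens
#   💰 Estimated cost: $0.58
```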
**2. Log everything**
```python
import json
from datetime import datetime

COST_LOG = "costs.jsonl"

def log_cost(meeting, tokens_in, tokens_out, cost, strategy):
    log_entry = {
        "timestamp": datetime.now().isoformat(),
        "meeting": meeting,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost": cost,
        "strategy": strategy
    }
    with open(COST_LOG, 'a') as f:
        f.write(json.dumps(log_entry) + '\n')
```
Now I can analyze which meetings are expensive and why:
```bash
# Most expensive meetings
cat costs.jsonl | jq -s 'sort_by(.cost) | reverse | .[:10]'
```
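The same log answers "which strategy is doing the work" without leaving Python:

```python
import json
from collections import defaultdict

totals = defaultdict(lambda: {"cost": 0.0, "count": 0})
with open("costs.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        totals[entry["strategy"]]["cost"] += entry["cost"]
        totals[entry["strategy"]]["count"] += 1

for strategy, t in sorted(totals.items(), key=lambda x: x[1]["cost"], reverse=True):
    print(f"{strategy:8s}: ${t['cost']:.2f} across {t['count']} meetings")
```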
**3. Set up cost alerts**
```python
if cost > 0.20:
send_notification(f"⚠️ Expensive meeting: ${cost:.2f}")
```
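`send_notification` can be anything that actually gets your attention. Since the pipeline already lives on a Mac, one option is a minimal `osascript` wrapper (my sketch, not necessarily what you'd ship):

```python
import subprocess

def send_notification(message):
    """Pop a macOS notification; swap in email/Slack for anything unattended."""
    subprocess.run(
        ["osascript", "-e",
         f'display notification "{message}" with title "LLM Cost Alert"'],
        check=False
    )
```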
Automation needs monitoring.
## Lessons for Production AI
### 1. Context Windows Aren't Free
Claude can handle 200k tokens. Gemini can do 1M. But just because you can doesn't mean you should.
Better approach:
- Chunk large inputs intelligently
- Summarize intermediate results
- Synthesize at the end
You get better quality (focused summaries) and lower costs.
### 2. Prompt Caching is Underutilized
If you're sending the same instructions repeatedly, cache them:
```python
# This 15k token system prompt gets reused 100 times
# Without caching: 15k × 100 = 1.5M tokens = $0.60
# With caching: 15k write + (15k × 100 × 0.1) = 165k = $0.066
# Savings: $0.53 (88% reduction)
```
The cache lasts 5 minutes. Perfect for batch jobs.
### 3. Track Your Keys
I had keys in:
- `~/.secrets/api-keys.env`
- `~/.config/fabric/.env`
- `~/.clawdbot/clawdbot.json`
- `/apps/_archived/agora/.env`
One key showed as "blank" in OpenRouter because I never gave it a custom name. Took me an hour to figure out which automation was using it.
**Fix:** Name your keys descriptively. "Automation-fabric", "Production-trading", "Dev-testing".
### 4. Automate the Monitoring
Don't wait for the bill. Set up daily checks:
```bash
# Daily cost check (add to crontab)
0 18 * * * curl -s "https://openrouter.ai/api/v1/auth/key" \
-H "Authorization: Bearer $KEY" | \
jq '.data.usage_daily' | \
awk '{if ($1 > 10) print "⚠️ High usage:
quot;$1}'
```
### 5. Chunking Beats Truncation
When I found the problem, my first instinct was "just truncate to 50k tokens."
But that throws away information. Chunking with summarization:
- Preserves all information via summaries
- Often captures key points better
- Costs less than sending the full thing
## The Trading Bot Question
Remember those trading bots spending $157/month? That's actually fine.
**Agora:** $116/month, 23k requests
**Chimera:** $40/month, 76k requests
Both use GLM-4.7 (cheap at $0.40/M tokens). The question isn't "why so expensive" but "are they profitable enough to justify it?"
For ML-driven trading bots making thousands of decisions per day, $157/month in LLM costs is reasonable if they're generating returns.
That's a business question, not a technical one.
## Final Numbers
### Total Monthly Spend
- **Before deep dive:** $221 (no visibility)
- **After optimization:** ~$160-180 projected (27% reduction)
- Meetings: $2.59 (was $3.11+, headed to $15-20)
- Trading bots: $157 (under review)
- Other: Minimal
### Meeting Processing
- **Before:** Up to $0.54 per meeting, no warnings
- **After:** Capped at ~$0.30 max, full visibility
- **Strategy distribution:**
- 84% direct (cheap, fast)
- 14% cached (medium, optimized)
- 2% chunked (large, controlled)
## Takeaways
The difference between a $221 surprise and a well-optimized $160 system is awareness and architecture.
**Always:**
- Estimate tokens before every API call
- Use prompt caching for repeated instructions
- Chunk large inputs intelligently
- Log everything
- Set up cost alerts
- Monitor automation
The AI honeymoon is over. As more workloads move to LLMs, cost optimization isn't a nice-to-have anymore. It's table stakes for production systems.
Especially when those systems run silently at 3 AM while you sleep.
---
### Related Articles
- [[AI Systems & Architecture/cognitive-architectures-for-ai-agents|Cognitive Architectures for AI Agents]]
- [[building-ai-research-night-shift|Building an AI Research Night Shift]]
- [[Cutting-Edge AI/deepseek-r1-open-reasoning-model|DeepSeek R1: Open Source Reasoning Model]]
---
<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>
<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>