# My AI Linux Expert: How Claude Code Suggested a 95,000x Faster Solution

When you have a hammer, everything looks like a nail. When you're an ML engineer, every routing problem looks like it needs a classifier. Sometimes your AI partner needs to suggest: have you tried a screwdriver?

Here's the story of how Claude Code saved me from building the wrong solution—not by writing better code, but by noticing what was already working.

## The Setup: Routing AI Requests

I just got an NVIDIA DGX workstation. I have two inference engines running (Ollama and llama.cpp), both serving the same model. I need to route incoming requests intelligently: simple chat queries to Ollama (optimized for low latency), complex code generation to llama.cpp (better for long context).

**My Instinct:** This is a classification problem. Download Arch-Router-1.5B, build a FastAPI service, use ML to route requests.

**The Plan:** Phase 1b of my infrastructure roadmap. I documented it in `NEXT_SESSION.md` on my DGX.

**My Opening Prompt to Claude Code:**

```
start phase 1b
```

That's it. Two words. No context, no explanation.

## What Happened Next

Claude didn't ask for clarification. It immediately:

1. Read `NEXT_SESSION.md` to understand Phase 1b objectives
2. Read the project roadmap to understand the overall architecture
3. Checked that both inference engines were healthy
4. Started implementing

**Key Insight:** I didn't need to repeat context. Claude read the documentation and proceeded. This is the value of documenting as you go—your AI partner can read.

**The Implementation Flow:**

```
13:02 - Me: "start phase 1b"
13:02 - Claude: [Reads NEXT_SESSION.md] "I'll implement Arch-Router integration..."
13:05 - Claude: [Downloads Arch-Router-1.5B] "Getting model from HuggingFace, 2.9GB..."
13:15 - Claude: [Implements arch_router_service.py] "Created FastAPI service with PyTorch..."
13:25 - Claude: [Writes tests] "Testing with 10 diverse prompts..."
13:30 - Claude: [Reports results] "9/10 correct routing decisions, but..."
```

**Time from "start" to working implementation:** 28 minutes

**My role:** Provided two-word directive, reviewed outputs, gave approvals

**Claude's role:** Read documentation, implemented service, tested it, reported results
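For a sense of what that first pass involved, here is a minimal sketch of a classifier-style routing service. It is not the actual `arch_router_service.py`, and the checkpoint name is a placeholder; the real service wrapped Arch-Router-1.5B, whose interface differs. The shape is what matters: weights load once at startup, and every routing decision pays for a full model forward pass.

```python
# A minimal sketch of a classifier-based router, NOT the actual arch_router_service.py.
# The model id below is a placeholder; Arch-Router-1.5B itself uses a different interface.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Weights load once at startup. In the real service this step alone took ~45 seconds.
classifier = pipeline("text-classification", model="your-org/route-classifier")

class RouteRequest(BaseModel):
    prompt: str

@app.post("/route")
def route(req: RouteRequest):
    # One full transformer forward pass per request - on CPU this is where the ~950ms goes.
    result = classifier(req.prompt, truncation=True)[0]
    return {"engine": result["label"], "confidence": result["score"]}
```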
## The Problem Emerges

After testing, Claude reported:

```
✓ Routing works (9/10 test cases correct)
✗ Routing time: ~950ms per request
✗ Target: <50ms
Status: UNACCEPTABLE
```

**My First Reaction:** "Well, it's running on CPU. We should probably use GPU."

**Claude's Analysis:**

```
The issue is running a 1.5B parameter model on CPU for inference.
PyTorch doesn't have CUDA support (CPU-only wheel).

However, looking at the test results, our heuristics already achieved
90% accuracy. The ML model was only used as fallback and actually made
the WRONG decision on the one failed case!
```

Wait. What? The heuristics were ALREADY working?

## The Heuristics (That I Didn't Know Were There)

While implementing the ML router, Claude had added "smart heuristics" as a first pass before expensive ML inference:

```python
# Heuristic 1: Context length
if context_length > 4096:
    route_to = "llamacpp"
if context_length < 1024:
    route_to = "ollama"

# Heuristic 2: Code detection
if "def " in prompt or "class " in prompt:
    route_to = "llamacpp"

# Heuristic 3: Chat detection
if "hi" in prompt or "hello" in prompt:
    route_to = "ollama"

# Heuristic 4: Latency requirement
if latency_requirement == "low":
    route_to = "ollama"
```

Looking at the test results:

- **9/10 cases routed correctly by heuristics**
- **1/10 failed case: ML model got it wrong anyway**
- **Heuristics were doing the work, ML was adding 950ms of latency**

I had been so focused on "build an ML router" that I missed the obvious: simple rules were already solving the problem.

## The Pivot: Claude's Suggestion

Here's what Claude said (from the actual conversation log):

> "Actually, let me reconsider: looking at the test results, most of the correct decisions came from the heuristics, not the model. The model was only used in case 8 (Complex Reasoning), where it actually made the WRONG decision.
>
> So the heuristics alone might be sufficient! Let me create a 'lite' version that skips ML inference and uses only heuristics."

**My Reaction:** "Wait, really? That seems too simple..."

**Claude's Response:** [Implements it] "Testing the heuristics-only version..."

<div class="callout" data-callout="tip">
<div class="callout-title">The Key Moment</div>
<div class="callout-content">
Claude didn't have ego about the ML solution it had just spent 30 minutes implementing. It looked at the test data objectively, noticed the heuristics were working, and suggested removing complexity instead of adding it. This is the value of an AI partner without attachment to any particular approach.
</div>
</div>

## The Heuristics-Only Implementation

Claude created `arch_router_lite.py` (removing all the ML code):

```python
import time


def make_routing_decision(prompt, context_length=None, latency_requirement=None):
    """
    Fast heuristic-based routing (no ML inference)

    Rules (priority order):
    1. Context length: >4k → llama.cpp, <1k → ollama
    2. Code keywords → llama.cpp
    3. Chat patterns → ollama
    4. Latency requirement → ollama if 'low'
    5. Default → ollama (lower latency)
    """
    start_time = time.time()
    engine = None
    confidence = 0.0

    # Rule 1: Context length (objective criteria)
    if context_length is not None:
        if context_length > 4096:
            engine = "llamacpp"
            confidence = 0.95
        elif context_length < 1024:
            engine = "ollama"
            confidence = 0.90

    # Rule 2: Code generation keywords
    if engine is None:
        code_keywords = ["function", "class", "def ", "import ", "implement"]
        if any(keyword in prompt.lower() for keyword in code_keywords):
            engine = "llamacpp"
            confidence = 0.85

    # Rule 3: Simple chat patterns
    if engine is None:
        chat_patterns = ["hello", "hi ", "what is", "tell me"]
        if any(pattern in prompt.lower() for pattern in chat_patterns):
            engine = "ollama"
            confidence = 0.85

    # Default: ollama (optimized for low latency)
    if engine is None:
        engine = "ollama"
        confidence = 0.70

    routing_time = (time.time() - start_time) * 1000  # ms

    return engine, confidence, routing_time
```

**What Changed:**

- No model loading (instant startup, was 45 seconds)
- No GPU/CPU inference (just string matching)
- No PyTorch dependencies (simpler deployment)
- 200 lines instead of 600 lines
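Before trusting a latency claim like "0.01ms", it is worth measuring it yourself. Here is a quick timing sketch; it assumes the function above is importable from `arch_router_lite.py` and that 10,000 iterations give a stable per-call average.

```python
# Quick sanity check of routing latency - a sketch, assuming the function above
# is importable from arch_router_lite.py.
import timeit

from arch_router_lite import make_routing_decision

prompts = [
    "Hi, how are you?",
    "Write a Python function to parse JSON",
    "Implement a binary search algorithm",
]

for prompt in prompts:
    engine, confidence, _ = make_routing_decision(prompt)
    # Time 10,000 routing calls and report the per-call average in milliseconds.
    per_call_ms = timeit.timeit(lambda: make_routing_decision(prompt), number=10_000) / 10_000 * 1000
    print(f"{engine:8s}  conf={confidence:.2f}  {per_call_ms:.4f} ms  <- {prompt!r}")
```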
## The Results: 95,000x Faster

**Performance Comparison:**

| Approach | Routing Time | Accuracy | Startup Time |
|----------|--------------|----------|--------------|
| Arch-Router-1.5B (ML) | 950ms | 90% | 45s (model load) |
| Heuristics-only | **0.01ms** | **90%** | Instant |
| **Improvement** | **95,000x faster** | Same | **Instant** |

(The headline number is simply 950 ms ÷ 0.01 ms = 95,000.)

**Test Results with Heuristics-Only:**

```
Test 1: "Hi, how are you?" → ollama ✓ (0.008ms)
Test 2: "Write a Python function to parse JSON" → llamacpp ✓ (0.009ms)
Test 3: [5000 token context] → llamacpp ✓ (0.007ms)
Test 4: "What is machine learning?" → ollama ✓ (0.008ms)
Test 5: "Implement a binary search algorithm" → llamacpp ✓ (0.010ms)
...
Accuracy: 9/10 (90%)
Average routing time: 0.008ms
```

Same accuracy. **95,000x faster.** Zero GPU overhead.

## Why This Matters

**Why I Missed the Simple Solution:**

1. **I'm an ML engineer** - When I see a routing problem, I think "train a classifier"
2. **Sunk cost** - I had already downloaded the 2.9GB model
3. **Sophistication bias** - "Heuristics" sounds less impressive than "deep learning"
4. **Not looking at the data** - I was focused on "make ML work," not "is ML needed?"

**Why Claude Caught It:**

1. **No ego** - Not attached to the ML approach it had just implemented
2. **Data-driven** - Looked at test outputs objectively
3. **No sophistication bias** - Doesn't care if a solution is "impressive"
4. **Fresh perspective** - Noticed patterns I was blind to

<div class="callout" data-callout="warning">
<div class="callout-title">The Lesson</div>
<div class="callout-content">
Sometimes the right solution is the one that seems "too simple to work." You need a partner without bias to suggest it. When your instinct says "this needs ML," pause and ask: "Does the data support that?"
</div>
</div>
## The Collaboration Pattern

**What I Did:**

- ✅ Documented the plan (NEXT_SESSION.md)
- ✅ Provided high-level direction ("start phase 1b")
- ✅ Let Claude implement the first approach
- ✅ Reviewed test results critically
- ✅ **Approved the pivot when suggested**

**What I Didn't Do:**

- ❌ Dictate implementation details
- ❌ Insist on ML because "that's what I know"
- ❌ Reject the simpler solution as "too simple"
- ❌ Let ego prevent the pivot

**What Claude Did:**

- ✅ Read existing documentation
- ✅ Implemented the first approach (ML router)
- ✅ Tested and discovered the performance issue
- ✅ **Analyzed test results objectively**
- ✅ **Noticed heuristics were already working**
- ✅ **Suggested a simpler alternative**
- ✅ Implemented the optimization
- ✅ Validated results

**The Critical Exchange:**

```
Claude: "Actually, looking at test results, the heuristics are already achieving
         90% accuracy. The ML model was only used as fallback and got it wrong."

Me: [Pauses] "Show me the heuristics version"

Claude: [Implements it] "0.008ms routing, 90% accuracy"

Me: "Perfect. Document this and let's move forward."
```

## The Actual Prompts (Surprisingly Minimal)

Throughout this 6-hour work session, my prompts were:

1. `"start phase 1b"` (2 words)
2. [Reviews outputs, approves]
3. `"that's too slow"` (implicit: fix it)
4. [Claude suggests heuristics] `"try it"`
5. [Sees results] `"perfect, document this"`

**Total words from me:** ~50-100 words over 6 hours

**Total implementation:** 2,200+ lines of code, tests, configs, documentation

**My time investment:** ~10 minutes of typing across the session

**Work accomplished:** What would have taken me 2 days alone

## Why The Heuristics Actually Work

**Rule 1: Context Length is Objective**

- Tokens are tokens
- >4k tokens needs llama.cpp's better context handling
- <1k tokens can use Ollama's speed advantage
- **No ML needed** - this is deterministic

**Rule 2: Code Keywords Are Reliable**

- "def", "class", "function" → 95% probability it's code
- Code generation needs better reasoning (llama.cpp wins)
- False positives are rare and low-cost
- **Pattern matching beats ML** for this task

**Rule 3: Chat Patterns Are Distinctive**

- "hi", "hello", "how are you" → conversational
- Conversational → Ollama is fine (and faster)
- False negatives just mean a slightly slower response
- **Simple regex beats neural network**

**Rule 4: Latency Requirements Map Directly**

- User says "low latency" → use Ollama
- No inference needed, just honor the request
- **Why would you use ML to read a flag?**

**The Insight:** These heuristics work because the domains (code vs. chat, short vs. long) have clear, observable markers. ML is overkill when pattern matching suffices.
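Claims like these are easy to pin down as tests. A sketch, assuming `make_routing_decision` is importable from `arch_router_lite.py`; the test file and cases are illustrative, not the actual test suite:

```python
# test_routing_rules.py - illustrative checks for the heuristic rules above.
import pytest

from arch_router_lite import make_routing_decision

@pytest.mark.parametrize(
    "prompt, context_length, expected",
    [
        ("Write a Python function to parse JSON", None, "llamacpp"),  # Rule 2: code keyword
        ("Summarize this document", 5000, "llamacpp"),                # Rule 1: long context
        ("Summarize this document", 500, "ollama"),                   # Rule 1: short context
        ("Tell me about Linux namespaces", None, "ollama"),           # Rule 3: chat pattern
        ("Hi, how are you?", None, "ollama"),                         # falls through to the ollama default
    ],
)
def test_routing_rules(prompt, context_length, expected):
    engine, confidence, routing_time_ms = make_routing_decision(prompt, context_length=context_length)
    assert engine == expected
    assert routing_time_ms < 1.0  # routing should stay far under the 50ms target
```

Because routing is pure string and integer comparison, these run in microseconds and can sit in the regular test suite without slowing anything down.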
## Time Saved (The Math)

**If I Built This Alone:**

| Task | Without Claude | With Claude | Savings |
|------|----------------|-------------|---------|
| Download & setup Arch-Router | 1 hour | 10 min | 50 min |
| Implement ML routing service | 3 hours | 20 min | 2h 40min |
| Write comprehensive tests | 1 hour | 10 min | 50 min |
| Discover performance issue | 30 min | Immediate | 30 min |
| Research alternatives | 2 hours | 0 (Claude suggested) | 2 hours |
| Implement heuristics | 2 hours | 15 min | 1h 45min |
| Document everything | 2 hours | 30 min | 1h 30min |
| **Total** | **11.5 hours** | **1.5 hours** | **10 hours** |

**ROI Calculation:**

- Time saved: 10 hours
- Claude Code cost: $20/month ÷ 30 days = $0.67/day
- My hourly rate: $[your rate]
- **Value created: 10 hours × $[rate] = $[value]**
- **ROI: ~1,500% on that single day**

And that doesn't count the value of not building the wrong solution.

## Lessons Learned (Tactical Takeaways)

### 1. Document Your Plans

Having `NEXT_SESSION.md` meant I could say "start phase 1b" and Claude knew exactly what to do. **ROI of 10 minutes planning: saved 30 minutes explaining.**

### 2. Let First Implementation Fail Fast

Don't debate approaches endlessly. Claude implemented the ML version quickly, discovered it was slow, pivoted. **Faster to try and learn than to plan perfectly.**

### 3. Trust Pattern Recognition

Claude noticed "heuristics are working" from test output. I was focused on "how to make ML faster." **Different lens, better insight.**

### 4. Simple Solutions Can Win

95,000x faster by removing complexity, not adding it. **"Too simple" is often the wrong criticism. Results matter more than sophistication.**

### 5. Validate Every Suggestion

Claude suggested heuristics, I made it prove the claim with test results. **Trust but verify. "Show me" is always the right response.**

### 6. Be Willing to Pivot

I spent 30 minutes heading toward the ML solution. Claude suggested scrapping it. **Ego would have cost me performance. Flexibility saved the day.**

## What Came Next

With routing solved (and fast), we continued:

- Added request logging for debugging patterns
- Implemented a LiteLLM gateway for an OpenAI-compatible API
- Built a metrics dashboard for monitoring
- Deployed 6 optional enhancements in the same session

**All in that same 6-hour session.** From "start phase 1b" to a production gateway with logging, metrics, and documentation.

The pattern continued: I provided direction, Claude implemented, we iterated. Same delegation model, different features.
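The gateway is what makes the routing invisible to callers: anything that speaks the OpenAI API can hit the DGX without knowing which engine answers. A usage sketch, where the base URL and model alias are placeholders rather than the actual deployment values:

```python
# Calling the LiteLLM gateway through its OpenAI-compatible endpoint.
# The base_url and model alias below are placeholders, not the real deployment values.
from openai import OpenAI

client = OpenAI(base_url="http://dgx.local:4000/v1", api_key="local-key")

resp = client.chat.completions.create(
    model="dgx-router",  # hypothetical alias; the gateway fronts both engines behind it
    messages=[{"role": "user", "content": "Write a Python function to parse JSON"}],
)
print(resp.choices[0].message.content)
```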
## The Bigger Picture

This isn't just about routing algorithms or DGX infrastructure.

**This is about collaboration patterns:**

When you have deep expertise in one area (ML for me), you develop blindspots. You reach for familiar tools even when they're wrong. You need a partner who can say "have you considered not using ML?"

**This is about effective delegation:**

I don't need to know how to write every line of code. I need to:

- Know what I'm trying to achieve
- Evaluate if solutions work
- Be willing to change course
- Trust expertise I don't have daily

**This is about time multipliers:**

10 hours of solo work → 1.5 hours with Claude Code. Not because Claude writes code faster (though it does), but because:

- It reads documentation instantly
- It implements while I review
- It suggests alternatives I wouldn't consider
- It has no ego about pivoting

## The Invitation

Next time you're reaching for the complex solution—the ML model, the microservice architecture, the sophisticated framework—pause.

Ask your AI partner: **"Is there a simpler way?"**

Then listen to the answer. Even if it contradicts what you were planning. Especially if it contradicts what you were planning.

Sometimes the best solution is the one you overlooked because it seemed too obvious. My AI partner helped me see that a 95,000x improvement was hiding behind "but that's too simple to work."

What improvement is hiding in your blindspot?

---

## Related Articles

<div class="quick-nav">

- [[dgx-lab-intelligent-gateway-heuristics-vs-ml-day-1|DGX Lab: When Simple Heuristics Beat ML by 95,000x - Day 1]]
- [[dgx-lab-supercharged-bashrc-ml-workflows-day-2|Supercharge Your Shell with 50+ ML Aliases]]
- [[building-production-ml-workspace-part-1-structure|Building a Production ML Workspace]]
- [[medical-llm-fine-tuning-70-to-92-percent|How I Delegated a 9-Day Medical AI Experiment (and Learned When to Step In)]]
- [[dgx-lab-benchmarks-vs-reality-day-4|DGX Lab: When Benchmark Numbers Meet Production Reality - Day 4]]

</div>

---

<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>

<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>