DGX Lab: When Simple Heuristics Beat ML by 95,000x - Day 1
Date: October 20, 2025
DGX System: NVIDIA DGX Workstation
Session Duration: ~6 hours
Primary Focus: Multi-engine AI gateway with intelligent routing
The Unexpected Discovery
Today I built something that challenged one of my core assumptions about AI systems: that more complexity is better. I set out to deploy a 1.5B parameter machine learning model for routing AI requests between inference engines. What I ended up with was 50 lines of heuristics that matched the ML model's 90% accuracy on my test suite while being 95,000x faster.
This isn't a story about ML being bad—it's about recognizing when simpler solutions can achieve the same goals with dramatically better performance.
The Challenge
I'm building a multi-engine AI gateway for my DGX workstation that intelligently routes requests between different inference backends:
- Ollama: Fast, efficient for simple queries
- llama.cpp: Better for code generation and long contexts
- Cloud models (future): Reserved for complex reasoning
The routing decision needs to happen in under 50ms to avoid adding noticeable latency to user requests. The question: which engine handles each request?
The Plan: ML-Based Router
I started with Arch-Router-1.5B, a model specifically trained to route LLM requests. The architecture was elegant:
# Load the Arch-Router model and its tokenizer
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
MODEL_ID = "nicholasKluge/Arch-Router-1.5B"
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model.eval()  # inference mode
# Route a request
def route_with_ml(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():  # no gradients needed for routing
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(dim=-1).item()
    # Class 0 maps to Ollama, anything else to llama.cpp
    return "ollama" if prediction == 0 else "llamacpp"
I built the service, deployed it, and ran my first benchmark.
Result: 950ms average routing time.
That's 19x over my 50ms target. Not even close.
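A number like this is easy to reproduce with a small timing harness. The sketch below is my assumption of what such a benchmark could look like, not the script from this session: `benchmark_router` and the stand-in `slow_router` are hypothetical names, and any callable that maps a prompt to an engine name can be passed in.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark_router(route: Callable[[str], str], prompts: List[str],
                     repeats: int = 5) -> Dict[str, float]:
    """Time route() on every prompt and report latency in milliseconds."""
    samples_ms = []
    for _ in range(repeats):
        for prompt in prompts:
            start = time.perf_counter()
            route(prompt)
            samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples_ms),
        "max_ms": max(samples_ms),
        "calls": float(len(samples_ms)),
    }


def slow_router(prompt: str) -> str:
    """Hypothetical stand-in for a model-backed router."""
    time.sleep(0.002)  # pretend inference takes about 2 ms
    return "ollama"


if __name__ == "__main__":
    stats = benchmark_router(slow_router, ["hello", "def f(): pass"])
    target_ms = 50.0  # the latency budget from this post
    print(f"mean={stats['mean_ms']:.3f} ms, within target: {stats['mean_ms'] < target_ms}")
```

Using `time.perf_counter()` matters here: sub-millisecond routers need a high-resolution clock, and averaging over repeats smooths out scheduler noise.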
The Critical Pivot
Here's where the session got interesting. Instead of immediately reaching for optimization (GPU acceleration, smaller model, quantization), I stopped and asked: What patterns is the ML model actually learning?
I analyzed the routing decisions across 100 test prompts and found clear patterns:
Pattern 1: Context Length
- Prompts <1,024 chars → Ollama (fast responses)
- Prompts >4,096 chars → llama.cpp (large context window)
Pattern 2: Content Type
- Interactive keywords ("hello", "how are you") → Ollama
- Code keywords ("def", "function", "class") → llama.cpp
Pattern 3: Explicit Requirements
- User specifies "fast" → Ollama
- User requests "detailed analysis" → llama.cpp
These weren't complex, high-dimensional patterns. They were clear decision boundaries that could be captured with simple rules.
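One way to look for patterns like these is to tabulate the router's decisions against a few cheap prompt features. This is a sketch under assumptions: `summarize_decisions`, `length_bucket`, and `CODE_HINTS` are hypothetical names, and the sample records are made up rather than taken from my 100 test prompts.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical keyword list used only to tag prompts for this analysis.
CODE_HINTS = ("def ", "function", "class ")


def length_bucket(prompt: str) -> str:
    """Bucket a prompt by character count using the thresholds noted above."""
    n = len(prompt)
    if n < 1024:
        return "<1024"
    if n > 4096:
        return ">4096"
    return "1024-4096"


def summarize_decisions(records: Iterable[Tuple[str, str]]) -> Counter:
    """Count (length bucket, has code hint, engine) combinations."""
    counts: Counter = Counter()
    for prompt, engine in records:
        has_code = any(hint in prompt.lower() for hint in CODE_HINTS)
        counts[(length_bucket(prompt), has_code, engine)] += 1
    return counts


if __name__ == "__main__":
    # Made-up records standing in for logged (prompt, engine) pairs.
    sample = [
        ("hello there", "ollama"),
        ("def add(a, b): return a + b", "llamacpp"),
        ("x" * 5000, "llamacpp"),
    ]
    for key, count in sorted(summarize_decisions(sample).items()):
        print(key, count)
```

If almost all of the mass in each feature combination lands on a single engine, that is a strong hint the decision can be written as a rule.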
The Heuristics Solution
I built a new router in 50 lines:
import re
from enum import Enum
from typing import Tuple
class Engine(Enum):
    OLLAMA = "ollama"
    LLAMACPP = "llamacpp"
# Match interactive keywords on word boundaries so "hi" does not fire inside "this"
INTERACTIVE_RE = re.compile(r"\b(hi|hello|how are you|what's)\b")
def route_with_heuristics(prompt: str) -> Tuple[Engine, float, str]:
    """Route based on heuristics, return (engine, confidence, rationale)"""
    confidence = 0.5  # Base confidence
    text = prompt.lower()
    # Context length heuristic
    prompt_length = len(prompt)
    if prompt_length > 4096:
        confidence += 0.3
        return (
            Engine.LLAMACPP,
            min(confidence, 1.0),
            "Long context (>4096 chars) better for llama.cpp"
        )
    elif prompt_length < 1024:
        confidence += 0.2
    # Code detection heuristic (plain substring match on the lowercased prompt)
    code_keywords = ["def ", "function", "class ", "import ", "const ", "let "]
    if any(kw in text for kw in code_keywords):
        confidence += 0.4
        return (
            Engine.LLAMACPP,
            min(confidence, 1.0),
            "Code generation keywords detected"
        )
    # Interactive heuristic
    if INTERACTIVE_RE.search(text):
        confidence += 0.3
        return (
            Engine.OLLAMA,
            min(confidence, 1.0),
            "Interactive query, use fast engine"
        )
    # Default: no strong signal, send to Ollama
    return (
        Engine.OLLAMA,
        confidence,
        "Short prompt, default to fast engine"
    )
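The router above covers context length and content type, but not Pattern 3 (explicit requirements). A check for it could run before the other heuristics. This is only a sketch: `explicit_preference` is a hypothetical helper, "fast" and "detailed analysis" come from the pattern list, and "quick" is an extra phrase I am assuming for illustration.

```python
import re
from typing import Optional

# Hypothetical phrase lists; tune these to the wording your users actually use.
FAST_RE = re.compile(r"\b(fast|quick)\b")
DETAILED_RE = re.compile(r"\bdetailed analysis\b")


def explicit_preference(prompt: str) -> Optional[str]:
    """Return "ollama" or "llamacpp" when the prompt states a preference, else None."""
    text = prompt.lower()
    # A request for depth wins over a request for speed when both appear.
    if DETAILED_RE.search(text):
        return "llamacpp"
    if FAST_RE.search(text):
        return "ollama"
    return None


if __name__ == "__main__":
    for prompt in ("Give me a fast answer", "I need a detailed analysis", "What is RAG?"):
        print(repr(prompt), "->", explicit_preference(prompt))
```

The word boundaries keep "fast" from matching inside words like "breakfast" or "FastAPI". Returning `None` lets the caller fall through to the length and keyword rules.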
The Results
I ran the same test suite on both approaches:
| Metric | ML Model (Arch-Router-1.5B) | Heuristics |
|---|---|---|
| Routing Time | 950ms | 0.008ms |
| Accuracy | 90% (9/10 correct) | 90% (9/10 correct) |
| Speedup | Baseline | 95,000x faster |
| Infrastructure | Requires PyTorch + model files | Pure Python, no dependencies |
| Explainability | Black box | Human-readable rationale |
| GPU Required | Yes (for <50ms target) | No |
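The heuristics matched the ML model's accuracy on this test suite while being 95,000x faster. Routing overhead went from 950ms to 0.008ms, which is essentially free. (Taken literally, 950ms / 0.008ms works out to about 119,000x; the 95,000x headline figure is what you get if the heuristic's time is rounded up to 0.01ms.)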
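For anyone repeating the comparison, accuracy here just means the fraction of hand-labeled prompts sent to the expected engine. A minimal sketch, assuming a list of (prompt, expected engine) pairs; `routing_accuracy` and `toy_router` are hypothetical names, and the labeled examples are invented for illustration rather than taken from my test suite.

```python
from typing import Callable, List, Tuple


def routing_accuracy(route: Callable[[str], str],
                     labeled: List[Tuple[str, str]]) -> float:
    """Fraction of prompts for which route() returns the expected engine."""
    if not labeled:
        raise ValueError("need at least one labeled prompt")
    correct = sum(1 for prompt, expected in labeled if route(prompt) == expected)
    return correct / len(labeled)


def toy_router(prompt: str) -> str:
    """Hypothetical stand-in router: code-looking or very long prompts go to llama.cpp."""
    return "llamacpp" if "def " in prompt or len(prompt) > 4096 else "ollama"


if __name__ == "__main__":
    # Hand-labeled examples invented for illustration.
    labeled = [
        ("hello", "ollama"),
        ("def reverse(s): ...", "llamacpp"),
        ("Explain this stack trace in detail", "llamacpp"),
        ("what's the weather like?", "ollama"),
    ]
    print(f"accuracy: {routing_accuracy(toy_router, labeled):.0%}")
```

Running both routers through the same function on the same labeled set keeps the comparison fair. A larger labeled set gives a more trustworthy percentage than a handful of prompts.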
Building the Complete System
With fast routing solved, I completed the full gateway architecture:
Architecture Overview
User Request
↓
Unified Gateway (port 5000)
↓
├─ Parse request & extract prompt
├─ Route with heuristics (0.008ms)
│ └─ Engine: ollama or llamacpp
├─ Execute on LiteLLM Proxy (port 4000)
│ └─ Forward to appropriate backend
└─ Return response + routing metadata
Key Components
1. Heuristic Router Service (arch_router_lite.py)
- FastAPI service on port 8888
- Endpoints: /route, /execute, /health, /metrics
- JSONL request logging
- Real-time metrics aggregation
2. LiteLLM Proxy (existing tool)
- OpenAI-compatible API gateway
- Manages connections to Ollama and llama.cpp
- Handles model loading, retries, fallbacks
3. Unified Gateway (unified_gateway.py)
- Single entry point on port 5000
- OpenAI-compatible /v1/chat/completions endpoint
- Automatic routing or manual model selection
- Includes routing metadata in responses
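The request flow through the gateway can be sketched without any web framework. Everything here is an assumption for illustration, not the contents of unified_gateway.py: `handle_chat_request`, `last_user_prompt`, and the `Router`/`Backend` callable types are hypothetical, and the backends are plain functions standing in for calls to the LiteLLM proxy.

```python
import time
from typing import Any, Callable, Dict, Tuple

# Hypothetical types: a router returns (engine, confidence, rationale) and a
# backend takes the request body and returns an OpenAI-style response dict.
Router = Callable[[str], Tuple[str, float, str]]
Backend = Callable[[Dict[str, Any]], Dict[str, Any]]


def last_user_prompt(body: Dict[str, Any]) -> str:
    """Return the content of the most recent user message, or an empty string."""
    for message in reversed(body.get("messages", [])):
        if message.get("role") == "user":
            return str(message.get("content", ""))
    return ""


def handle_chat_request(body: Dict[str, Any], route: Router,
                        backends: Dict[str, Backend]) -> Dict[str, Any]:
    """Pick an engine (or honor a manual choice), call it, attach routing metadata."""
    requested = body.get("model", "auto")
    start = time.perf_counter()
    if requested in backends:  # manual model selection
        engine, confidence, rationale = requested, 1.0, "Manually selected engine"
    else:  # "auto" or an unknown name falls back to the router
        engine, confidence, rationale = route(last_user_prompt(body))
    routing_time_ms = (time.perf_counter() - start) * 1000.0
    response = dict(backends[engine](body))  # copy so the backend's dict is untouched
    response["routing_metadata"] = {
        "engine": engine,
        "confidence": confidence,
        "rationale": rationale,
        "routing_time_ms": round(routing_time_ms, 3),
    }
    return response
```

Keeping the router and backends as injected callables makes the flow easy to unit test with fakes. A real handler would also need error handling for a router that returns an engine with no registered backend.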
Usage Example
import openai  # legacy module-level interface; requires openai<1.0
# Point to local gateway instead of OpenAI
openai.api_base = "http://localhost:5000/v1"
openai.api_key = "dummy"
# Auto-routing: Gateway picks best engine
response = openai.ChatCompletion.create(
    model="auto",
    messages=[{"role": "user", "content": "Write a Python function to reverse a string"}]
)
# Response includes routing metadata
print(response["routing_metadata"])
# {
# "engine": "llamacpp",
# "confidence": 0.92,
# "rationale": "Code generation keywords detected",
# "routing_time_ms": 0.008
# }
The Observability Layer
I added comprehensive logging and metrics:
Request Logging (JSONL)
Every request is logged to a local JSONL file. A sample entry (pretty-printed here; on disk each entry is a single line):
{
"timestamp": "2025-10-20T14:32:15.847Z",
"prompt_length": 67,
"prompt_preview": "Write a Python function to reverse a string",
"engine": "llamacpp",
"routing_confidence": 0.92,
"routing_rationale": "Code generation keywords detected",
"routing_time_ms": 0.008,
"inference_time_ms": 1450,
"total_time_ms": 1450.008,
"success": true
}
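Writing entries like this takes only a few lines. The following is a sketch under assumptions, not the logger from arch_router_lite.py: `log_request` and its parameters are hypothetical, and the 50-character preview length is a guess.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


def log_request(log_path: Path, prompt: str, engine: str, confidence: float,
                rationale: str, routing_time_ms: float, inference_time_ms: float,
                success: bool, preview_chars: int = 50) -> Dict[str, Any]:
    """Append one request record to a JSONL file and return the record."""
    now = datetime.now(timezone.utc)
    entry = {
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "prompt_length": len(prompt),
        "prompt_preview": prompt[:preview_chars],  # assumed preview length
        "engine": engine,
        "routing_confidence": confidence,
        "routing_rationale": rationale,
        "routing_time_ms": routing_time_ms,
        "inference_time_ms": inference_time_ms,
        "total_time_ms": routing_time_ms + inference_time_ms,
        "success": success,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")  # one JSON object per line
    return entry
```

Opening the file in append mode for each request keeps the logger stateless. With several worker processes writing at once you would want a lock or one log file per worker.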
Real-Time Metrics
The /metrics endpoint aggregates logs on-demand:
{
"total_requests": 156,
"requests_by_engine": {
"ollama": 89,
"llamacpp": 67
},
"avg_routing_time_ms": 0.009,
"avg_inference_time_ms": 1420,
"success_rate": 0.99
}
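Aggregating on demand is a single pass over the log. This is a sketch of the idea rather than the actual /metrics handler: `aggregate_metrics` is a hypothetical name, and it assumes each line carries the fields shown in the sample log entry.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable


def aggregate_metrics(lines: Iterable[str]) -> Dict[str, Any]:
    """Compute summary metrics from JSONL log lines, skipping blank lines."""
    total = 0
    successes = 0
    routing_ms = 0.0
    inference_ms = 0.0
    by_engine: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        total += 1
        successes += 1 if entry.get("success") else 0
        routing_ms += entry.get("routing_time_ms", 0.0)
        inference_ms += entry.get("inference_time_ms", 0.0)
        by_engine[entry.get("engine", "unknown")] += 1
    if total == 0:  # avoid dividing by zero on an empty log
        return {"total_requests": 0, "requests_by_engine": {},
                "avg_routing_time_ms": 0.0, "avg_inference_time_ms": 0.0,
                "success_rate": 0.0}
    return {
        "total_requests": total,
        "requests_by_engine": dict(by_engine),
        "avg_routing_time_ms": routing_ms / total,
        "avg_inference_time_ms": inference_ms / total,
        "success_rate": successes / total,
    }
```

Because the function takes any iterable of strings, an open file handle can be passed straight in and the log is streamed instead of loaded into memory.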
CLI Dashboard
I built a terminal dashboard with auto-refresh:
================================================================================
📊 Arch-Router Metrics Dashboard
================================================================================
📈 Summary Statistics
--------------------------------------------------------------------------------
Total Requests: 156
Success Rate: 99.4%
Avg Routing Time: 0.009ms
Avg Inference Time: 1420ms
🚀 Requests by Engine
--------------------------------------------------------------------------------
ollama | ████████████████████████████ | 89 ( 57.1%)
llamacpp | ████████████████████ | 67 ( 42.9%)
📝 Recent Requests (Last 10)
--------------------------------------------------------------------------------
✅ 5s ago | ollama | 1380ms | Hello! How can I help?...
✅ 12s ago | llamacpp | 1560ms | Write a function to reverse...
✅ 18s ago | ollama | 1290ms | What is machine learning?...
What I Learned
1. When Heuristics Beat Machine Learning
ML is powerful, but it's not always the right tool. Heuristics work better when:
- Decision boundaries are clear (not high-dimensional or nuanced)
- Speed is critical (<1ms requirements)
- Explainability matters (need to debug routing decisions)
- Infrastructure is constrained (no GPU, edge deployment)
This project's routing task had clear patterns that could be captured with rules. ML was overkill.
2. JSONL as a Lightweight Database
Append-only JSONL logs covered most of what I needed from a database at a fraction of the complexity:
- Fast writes (O(1) append)
- Easy parsing (one JSON per line)
- Human-readable (debugging with cat/jq/grep)
- No setup (just filesystem)
For <10K requests, aggregating metrics from the logs on demand is perfectly acceptable. Move to a real database once you have a proven need.
3. Transparency Builds Trust
Including routing_metadata in every response provides:
- Debugging without server access
- User confidence through explainability
- Optimization opportunities (users can adjust prompts)
Make your AI systems explainable by exposing decision rationale, not just results.
4. OpenAI API as Universal Interface
By implementing OpenAI's /v1/chat/completions API, my gateway works with:
- Existing SDKs (openai-python, openai-node)
- LangChain, AutoGPT, Continue.dev
- Any tool expecting OpenAI format
Standards matter. Adopt them.
Performance Metrics
Final system performance:
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Routing Overhead | <50ms | 0.008ms | ✅ 6,250x better |
| Routing Accuracy | >80% | 90% | ✅ Exceeds target |
| End-to-End Latency | <2s | 1.45s | ✅ 27% better |
| Success Rate | >95% | 99.4% | ✅ Exceeds target |
Challenges & Solutions
Challenge: Arch-Router-1.5B took 950ms on CPU, 19x over the 50ms target. GPU acceleration could reduce this to ~20ms, but still adds complexity.
Solution: Analyzed ML routing decisions to identify clear patterns. Implemented heuristics capturing the same logic with 0.008ms overhead and zero infrastructure.
Challenge: Ollama doesn't have a /health endpoint like most services; it uses /api/tags instead.
Solution: Built flexible health checking that tries standard endpoints first, then falls back to service-specific alternatives.
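The fallback logic can be sketched with the standard library alone. This is an assumption about the shape of such a check, not the code running on the DGX: `check_health` and `http_status` are hypothetical names, and the probe is injectable so the logic can be tested without a live service.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code for a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, just not with a 2xx
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def check_health(base_url: str,
                 paths: Sequence[str] = ("/health", "/api/tags"),
                 fetch: Callable[[str], Optional[int]] = http_status) -> Optional[str]:
    """Try each path in order; return the first one answering 200, else None."""
    root = base_url.rstrip("/")
    for path in paths:
        if fetch(root + path) == 200:
            return path
    return None
```

Returning the path that succeeded, rather than a bare boolean, makes it easy to log which endpoint each backend actually answers on.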
Next Steps
This gateway is Phase 1 of a larger AI infrastructure project. Coming next:
- Phase 2a: Document ingestion pipeline for RAG
- Phase 2b: Vector database integration (Chroma or Qdrant)
- Phase 2c: RAG query endpoint with context retrieval
- Cloud model integration: Add Claude/GPT-4 for complex reasoning
- Request caching: LRU cache for identical prompts
- Streaming responses: SSE for real-time token generation
Quick Start:
- Clone this approach with your own backends (Ollama, vLLM, etc.)
- Start with simple heuristics: prompt length + keyword matching
- Log every request to JSONL for analysis
- Measure: does your heuristic-based router meet your latency target?
- Only add ML if heuristics can't reach your accuracy goal
Key Insight: Measure first, optimize second. Don't assume ML is needed until simple solutions fail.
This is Day 1 of the DGX Lab Chronicles, documenting real AI experiments on NVIDIA DGX hardware. Session files and code available on the DGX system at /home/bioinfo/workspace/infrastructure/gateway/.
Related Articles
- DGX Lab: When Benchmark Numbers Meet Production Reality - Day 4 (Oct 26, 2025): NVIDIA's DGX Spark benchmarks show 82,739 tokens/sec for training. After 6 days of intensive ML workloads and feedback from the HN community, here's what the benchmarks don't tell you about precision issues, memory fragmentation, and production workarounds.
- DGX Lab: Supercharge Your Shell with 50+ ML Productivity Aliases - Day 2 (Oct 20, 2025): Transform your default shell into a productivity powerhouse with GPU monitoring shortcuts, smart aliases, and custom functions. Setup in 5 minutes, benefit forever.
- DGX Lab: Building a Complete RAG Infrastructure - From Ollama to Qdrant to AnythingLLM - Day 3 (Oct 24, 2025): Building a complete self-hosted RAG infrastructure with 8 integrated services on a single DGX workstation, supporting everything from medical AI fine-tuning to document Q&A.
About the Author: Justin Johnson builds AI systems and writes about practical AI development.
justinhjohnson.com | Twitter | LinkedIn | Run Data Run | Subscribe