
DGX Lab: When Simple Heuristics Beat ML by 95,000x - Day 1

Lab Session Info

  • Date: October 20, 2025
  • DGX System: NVIDIA DGX Workstation
  • Session Duration: ~6 hours
  • Primary Focus: Multi-engine AI gateway with intelligent routing

The Unexpected Discovery

Today I built something that challenged one of my core assumptions about AI systems: that more complex is better. I set out to deploy a 1.5B parameter machine learning model for routing AI requests between inference engines. What I ended up with was 50 lines of heuristics that matched the ML model's 90% accuracy while being 95,000x faster.

This isn't a story about ML being bad—it's about recognizing when simpler solutions can achieve the same goals with dramatically better performance.

The Challenge

I'm building a multi-engine AI gateway for my DGX workstation that intelligently routes requests between different inference backends:

  • Ollama: Fast, efficient for simple queries
  • llama.cpp: Better for code generation and long contexts
  • Cloud models (future): Reserved for complex reasoning

The routing decision needs to happen in under 50ms to avoid adding noticeable latency to user requests. The question: which engine handles each request?

The Plan: ML-Based Router

I started with Arch-Router-1.5B, a model specifically trained to route LLM requests. The architecture was elegant:

# Load Arch-Router model
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "nicholasKluge/Arch-Router-1.5B"
)
tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/Arch-Router-1.5B")

# Route a request
def route_with_ml(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():  # Inference only, so skip gradient tracking
        outputs = model(**inputs)
    # Class 0 maps to Ollama, any other class to llama.cpp
    prediction = outputs.logits.argmax(dim=-1).item()
    return "ollama" if prediction == 0 else "llamacpp"

I built the service, deployed it, and ran my first benchmark.

Result: 950ms average routing time.

That's 19x over my 50ms target. Not even close.

The Critical Pivot

Here's where the session got interesting. Instead of immediately reaching for optimization (GPU acceleration, smaller model, quantization), I stopped and asked: What patterns is the ML model actually learning?

I analyzed the routing decisions across 100 test prompts and found clear patterns:

Pattern 1: Context Length

  • Prompts <1,024 chars → Ollama (fast responses)
  • Prompts >4,096 chars → llama.cpp (large context window)

Pattern 2: Content Type

  • Interactive keywords ("hello", "how are you") → Ollama
  • Code keywords ("def", "function", "class") → llama.cpp

Pattern 3: Explicit Requirements

  • User specifies "fast" → Ollama
  • User requests "detailed analysis" → llama.cpp

These weren't complex, high-dimensional patterns. They were clear decision boundaries that could be captured with simple rules.
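
One way to surface patterns like these is to tabulate the ML router's decisions against a few cheap prompt features. The following is a minimal sketch, not the analysis script used in this session: it assumes the decisions were collected as (prompt, engine) pairs, and `length_bucket`, `summarize_decisions`, and the keyword list are hypothetical names chosen for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Illustrative keyword list (an assumption, not the exact analysis used here)
CODE_HINTS = ("def ", "function", "class ")

def length_bucket(prompt: str) -> str:
    """Bucket a prompt by character count, using the thresholds named above."""
    n = len(prompt)
    if n < 1024:
        return "short"
    if n > 4096:
        return "long"
    return "medium"

def summarize_decisions(
    decisions: Iterable[Tuple[str, str]],
) -> Dict[str, Counter]:
    """Count which engine the ML router chose for each cheap prompt feature."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for prompt, engine in decisions:
        table["length=" + length_bucket(prompt)][engine] += 1
        has_code = any(hint in prompt.lower() for hint in CODE_HINTS)
        table["code=" + str(has_code)][engine] += 1
    return dict(table)

# Tiny made-up sample, only to show the shape of the output
sample = [
    ("hello, how are you?", "ollama"),
    ("def add(a, b): return a + b", "llamacpp"),
    ("x" * 5000, "llamacpp"),
]
summary = summarize_decisions(sample)
for feature, counts in sorted(summary.items()):
    print(feature, dict(counts))
```

If one engine dominates a feature bucket, that bucket is a candidate for a hand-written rule; buckets with a mixed split are where a learned model might still earn its keep.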

The Heuristics Solution

I built a new router in 50 lines:

import re
from enum import Enum
from typing import Tuple

class Engine(Enum):
    OLLAMA = "ollama"
    LLAMACPP = "llamacpp"

# Substring matches; trailing spaces stop "def " from matching "define"
CODE_KEYWORDS = ["def ", "function", "class ", "import ", "const ", "let "]
# Whole-word matches, so "hi" does not fire on "this" or "which"
INTERACTIVE_PATTERN = re.compile(r"\b(?:hi|hello|how are you|what's)\b")

def route_with_heuristics(prompt: str) -> Tuple[Engine, float, str]:
    """Route based on heuristics, return (engine, confidence, rationale)"""
    confidence = 0.5  # Base confidence
    text = prompt.lower()  # Lowercase once for all keyword checks

    # Context length heuristic
    prompt_length = len(prompt)
    if prompt_length > 4096:
        confidence += 0.3
        return (
            Engine.LLAMACPP,
            min(confidence, 1.0),
            "Long context (>4096 chars) better for llama.cpp"
        )
    elif prompt_length < 1024:
        confidence += 0.2

    # Code detection heuristic
    if any(kw in text for kw in CODE_KEYWORDS):
        confidence += 0.4
        return (
            Engine.LLAMACPP,
            min(confidence, 1.0),
            "Code generation keywords detected"
        )

    # Interactive heuristic
    if INTERACTIVE_PATTERN.search(text):
        confidence += 0.3
        return (
            Engine.OLLAMA,
            min(confidence, 1.0),
            "Interactive query, use fast engine"
        )

    # Default: anything not matched above goes to Ollama
    return (
        Engine.OLLAMA,
        confidence,
        "No long-context or code signal, default to fast engine"
    )

The Results

I ran the same test suite on both approaches:

| Metric | ML Model (Arch-Router-1.5B) | Heuristics |
|---|---|---|
| Routing Time | 950ms | 0.008ms |
| Accuracy | 90% (9/10 correct) | 90% (9/10 correct) |
| Speedup | Baseline | 95,000x faster |
| Infrastructure | Requires PyTorch + model files | Pure Python, no dependencies |
| Explainability | Black box | Human-readable rationale |
| GPU Required | Yes (for <50ms target) | No |

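The heuristics matched the ML model's accuracy on this test suite while cutting routing overhead from 950ms to 0.008ms, which makes routing essentially free. The 95,000x headline figure corresponds to a heuristic time of about 0.01ms; dividing the two rounded numbers above gives closer to 119,000x. Either way, the gap is roughly five orders of magnitude.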
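
For readers who want to reproduce this kind of comparison, here is a minimal timing sketch. It is not the benchmark harness used in this session: `benchmark_router` and `toy_router` are hypothetical names, and the stand-in router exists only so the example runs on its own. Sub-millisecond numbers are noisy, so the sketch repeats each call many times and reports summary statistics.

```python
import statistics
import time
from typing import Callable, List

def benchmark_router(
    route: Callable[[str], object],
    prompts: List[str],
    repeats: int = 1000,
) -> dict:
    """Time route(prompt) with a high-resolution clock; report milliseconds."""
    samples_ms = []
    for _ in range(repeats):
        for prompt in prompts:
            start = time.perf_counter()
            route(prompt)
            samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "calls": len(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
    }

def toy_router(prompt: str) -> str:
    """Stand-in router (hypothetical) so the sketch runs on its own."""
    return "llamacpp" if len(prompt) > 4096 or "def " in prompt else "ollama"

stats = benchmark_router(toy_router, ["hello", "def f(): pass", "x" * 5000])
print(stats)
# A speedup figure is then just a ratio of mean latencies:
# speedup = ml_mean_ms / heuristic_mean_ms
```

Passing the ML router and the heuristic router through the same function with the same prompts keeps the comparison fair, since both pay the same measurement overhead.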

Building the Complete System

With fast routing solved, I completed the full gateway architecture:

Architecture Overview

User Request
    ↓
Unified Gateway (port 5000)
    ↓
├─ Parse request & extract prompt
├─ Route with heuristics (0.008ms)
│   └─ Engine: ollama or llamacpp
├─ Execute on LiteLLM Proxy (port 4000)
│   └─ Forward to appropriate backend
└─ Return response + routing metadata

Key Components

1. Heuristic Router Service (arch_router_lite.py)

  • FastAPI service on port 8888
  • Endpoints: /route, /execute, /health, /metrics
  • JSONL request logging
  • Real-time metrics aggregation

2. LiteLLM Proxy (existing tool)

  • OpenAI-compatible API gateway
  • Manages connections to Ollama and llama.cpp
  • Handles model loading, retries, fallbacks

3. Unified Gateway (unified_gateway.py)

  • Single entry point on port 5000
  • OpenAI-compatible /v1/chat/completions endpoint
  • Automatic routing or manual model selection
  • Includes routing metadata in responses
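
The gateway's decision step can be sketched without any web framework. This is an illustration under assumptions, not the contents of unified_gateway.py: `extract_prompt`, `plan_request`, and `demo_route` are hypothetical names, the HTTP forwarding to the LiteLLM proxy is left out, and the sketch assumes the proxy's model names match the engine names.

```python
from typing import Any, Callable, Dict, List, Tuple

Router = Callable[[str], Tuple[str, float, str]]

def extract_prompt(messages: List[Dict[str, Any]]) -> str:
    """Return the content of the most recent user message (empty if none)."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return str(message.get("content", ""))
    return ""

def plan_request(body: Dict[str, Any], route: Router) -> Dict[str, Any]:
    """Pick a backend and build the upstream payload plus routing metadata.

    model == "auto" triggers routing; any other value is treated as a
    manual selection and passed through unchanged.
    """
    requested = body.get("model", "auto")
    if requested == "auto":
        prompt = extract_prompt(body.get("messages", []))
        engine, confidence, rationale = route(prompt)
    else:
        engine, confidence, rationale = requested, 1.0, "Manual model selection"
    upstream = dict(body, model=engine)  # Copy, then override the model name
    metadata = {"engine": engine, "confidence": confidence, "rationale": rationale}
    return {"upstream_payload": upstream, "routing_metadata": metadata}

def demo_route(prompt: str) -> Tuple[str, float, str]:
    """Stand-in router (hypothetical) so the sketch is runnable."""
    if "function" in prompt.lower():
        return ("llamacpp", 0.9, "Code generation keywords detected")
    return ("ollama", 0.7, "Default to fast engine")

plan = plan_request(
    {"model": "auto",
     "messages": [{"role": "user", "content": "Write a Python function"}]},
    demo_route,
)
print(plan["routing_metadata"])
```

Keeping this step a pure function makes it easy to unit test separately from the network code that forwards `upstream_payload` and merges `routing_metadata` into the response.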

Usage Example

import openai  # This example uses the pre-1.0 openai-python interface

# Point to local gateway instead of OpenAI
openai.api_base = "http://localhost:5000/v1"
openai.api_key = "dummy"

# Auto-routing: Gateway picks best engine
response = openai.ChatCompletion.create(
    model="auto",
    messages=[{"role": "user", "content": "Write a Python function to reverse a string"}]
)

# Response includes routing metadata
print(response["routing_metadata"])
# {
#   "engine": "llamacpp",
#   "confidence": 0.92,
#   "rationale": "Code generation keywords detected",
#   "routing_time_ms": 0.008
# }

The Observability Layer

I added comprehensive logging and metrics:

Request Logging (JSONL)

Every request is logged as one JSON object per line to a local JSONL file (path omitted here):

{
  "timestamp": "2025-10-20T14:32:15.847Z",
  "prompt_length": 67,
  "prompt_preview": "Write a Python function to reverse a string",
  "engine": "llamacpp",
  "routing_confidence": 0.92,
  "routing_rationale": "Code generation keywords detected",
  "routing_time_ms": 0.008,
  "inference_time_ms": 1450,
  "total_time_ms": 1450.008,
  "success": true
}
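
A record with these fields could be written by a small append-only helper along the following lines. This is a sketch under assumptions rather than the logging code from arch_router_lite.py: `log_request` is a hypothetical name, the 80-character preview length is a guess, and the demo writes to a temporary file because the real log location is not shown.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def log_request(path: str, prompt: str, engine: str, confidence: float,
                rationale: str, routing_ms: float, inference_ms: float,
                success: bool) -> dict:
    """Append one request record to a JSONL file and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_length": len(prompt),
        "prompt_preview": prompt[:80],  # Preview length is an assumption
        "engine": engine,
        "routing_confidence": confidence,
        "routing_rationale": rationale,
        "routing_time_ms": routing_ms,
        "inference_time_ms": inference_ms,
        "total_time_ms": routing_ms + inference_ms,
        "success": success,
    }
    # Append mode plus one json.dumps per line keeps the file valid JSONL
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Demo against a temporary file
log_path = os.path.join(tempfile.mkdtemp(), "requests.jsonl")
log_request(log_path, "Write a Python function to reverse a string",
            "llamacpp", 0.92, "Code generation keywords detected",
            0.008, 1450, True)
with open(log_path, encoding="utf-8") as f:
    entries = [json.loads(line) for line in f]
print(entries[0]["engine"], entries[0]["routing_rationale"])
```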

Real-Time Metrics

The /metrics endpoint aggregates logs on-demand:

{
  "total_requests": 156,
  "requests_by_engine": {
    "ollama": 89,
    "llamacpp": 67
  },
  "avg_routing_time_ms": 0.009,
  "avg_inference_time_ms": 1420,
  "success_rate": 0.99
}
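
On-demand aggregation of this kind amounts to a single pass over the log lines. The sketch below assumes each line carries the `engine`, `routing_time_ms`, `inference_time_ms`, and `success` fields from the log format; `aggregate_metrics` is a hypothetical name, and the sample lines are made up.

```python
import json
from collections import Counter
from typing import Iterable

def aggregate_metrics(lines: Iterable[str]) -> dict:
    """Fold JSONL log lines into a summary with the fields shown above."""
    by_engine = Counter()
    total = 0
    successes = 0
    routing_ms = 0.0
    inference_ms = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Tolerate blank lines in the log
        entry = json.loads(line)
        total += 1
        by_engine[entry["engine"]] += 1
        if entry.get("success"):
            successes += 1
        routing_ms += entry["routing_time_ms"]
        inference_ms += entry["inference_time_ms"]
    if total == 0:
        return {"total_requests": 0, "requests_by_engine": {}}
    return {
        "total_requests": total,
        "requests_by_engine": dict(by_engine),
        "avg_routing_time_ms": routing_ms / total,
        "avg_inference_time_ms": inference_ms / total,
        "success_rate": successes / total,
    }

sample_lines = [
    '{"engine": "ollama", "routing_time_ms": 0.01, "inference_time_ms": 1000, "success": true}',
    '{"engine": "llamacpp", "routing_time_ms": 0.01, "inference_time_ms": 2000, "success": true}',
    '{"engine": "ollama", "routing_time_ms": 0.01, "inference_time_ms": 1500, "success": false}',
]
metrics = aggregate_metrics(sample_lines)
print(metrics)
```

Because the function takes any iterable of strings, it can be fed an open file handle directly, which keeps memory use flat as the log grows.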

CLI Dashboard

I built a terminal dashboard with auto-refresh:

================================================================================
📊 Arch-Router Metrics Dashboard
================================================================================
📈 Summary Statistics
--------------------------------------------------------------------------------
  Total Requests:     156
  Success Rate:       99.4%
  Avg Routing Time:   0.009ms
  Avg Inference Time: 1420ms

🚀 Requests by Engine
--------------------------------------------------------------------------------
  ollama     | ████████████████████████████                  |   89 ( 57.1%)
  llamacpp   | ████████████████████                          |   67 ( 42.9%)

📝 Recent Requests (Last 10)
--------------------------------------------------------------------------------
  ✅ 5s ago    | ollama     | 1380ms | Hello! How can I help?...
  ✅ 12s ago   | llamacpp   | 1560ms | Write a function to reverse...
  ✅ 18s ago   | ollama     | 1290ms | What is machine learning?...
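
The per-engine bars are simple to produce from the request counts. This sketch is an approximation of the layout above, not the dashboard's actual code: `render_engine_bars` is a hypothetical name and the 46-character bar width is an assumption.

```python
from typing import Dict, List

def render_engine_bars(counts: Dict[str, int], width: int = 46) -> List[str]:
    """Render one text bar per engine, scaled to its share of all requests."""
    total = sum(counts.values())
    rows = []
    # Busiest engine first
    for engine, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        share = count / total if total else 0.0
        filled = round(share * width)
        bar = "█" * filled + " " * (width - filled)
        rows.append(f"  {engine:<10} | {bar} | {count:>4} ({share * 100:5.1f}%)")
    return rows

rows = render_engine_bars({"ollama": 89, "llamacpp": 67})
print("\n".join(rows))
```

An auto-refreshing dashboard would re-read the metrics and redraw these rows on a timer.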

What I Learned

1. When Heuristics Beat Machine Learning

ML is powerful, but it's not always the right tool. Heuristics work better when:

  • Decision boundaries are clear (not high-dimensional or nuanced)
  • Speed is critical (<1ms requirements)
  • Explainability matters (need to debug routing decisions)
  • Infrastructure is constrained (no GPU, edge deployment)

This project's routing task had clear patterns that could be captured with rules. ML was overkill.

2. JSONL as a Lightweight Database

Append-only JSONL logs covered everything this project needed from a database, with far less complexity:

  • Fast writes (O(1) append)
  • Easy parsing (one JSON per line)
  • Human-readable (debugging with cat/jq/grep)
  • No setup (just filesystem)

For <10K requests, on-demand metrics aggregation from logs is perfectly acceptable. Scale to a real database when you have a proven need.

3. Transparency Builds Trust

Including routing_metadata in every response provides:

  • Debugging without server access
  • User confidence through explainability
  • Optimization opportunities (users can adjust prompts)

Make your AI systems explainable by exposing decision rationale, not just results.

4. OpenAI API as Universal Interface

By implementing OpenAI's /v1/chat/completions API, my gateway works with:

  • Existing SDKs (openai-python, openai-node)
  • LangChain, AutoGPT, Continue.dev
  • Any tool expecting OpenAI format

Standards matter. Adopt them.

Performance Metrics

Final system performance:

| Metric | Target | Achieved | Status |
|---|---|---|---|
| Routing Overhead | <50ms | 0.008ms | ✅ 6,250x better |
| Routing Accuracy | >80% | 90% | ✅ Exceeds target |
| End-to-End Latency | <2s | 1.45s | ✅ 27% better |
| Success Rate | >95% | 99.4% | ✅ Exceeds target |

Challenges & Solutions

Challenge: ML Model Too Slow

Arch-Router-1.5B took 950ms on CPU, 19x over the 50ms target. GPU acceleration could reduce this to ~20ms, but still adds complexity.

Solution: Extract Heuristics from ML Patterns

I analyzed the ML model's routing decisions to identify clear patterns, then implemented heuristics that capture the same logic with 0.008ms overhead and no extra infrastructure.

Challenge: Service Health Checks

Ollama doesn't have a /health endpoint like most services—it uses /api/tags instead.

Solution: Service-Specific Health Checks

I built a flexible health check that tries standard endpoints first, then falls back to service-specific alternatives.
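
A fallback check of this kind can be sketched with the standard library alone. The names `http_ok` and `first_healthy_path` are hypothetical, and the port in the demo is Ollama's default (11434), which is an assumption about the setup. The probe is injectable so the example runs offline.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional

def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, refused connections, timeouts, bad URLs
        return False

def first_healthy_path(
    base_url: str,
    paths: Iterable[str] = ("/health", "/api/tags"),
    probe: Callable[[str], bool] = http_ok,
) -> Optional[str]:
    """Try each candidate path in order; return the first that responds OK."""
    for path in paths:
        if probe(base_url.rstrip("/") + path):
            return path
    return None

# Offline demo with a fake probe that mimics the Ollama case:
# /health is missing, /api/tags answers.
def fake_probe(url: str) -> bool:
    return url.endswith("/api/tags")

print(first_healthy_path("http://localhost:11434", probe=fake_probe))
```

Returning the path that worked, rather than a bare boolean, lets the caller cache it and skip the failed probe on later checks.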

Next Steps

This gateway is Phase 1 of a larger AI infrastructure project. Coming next:

  • Phase 2a: Document ingestion pipeline for RAG
  • Phase 2b: Vector database integration (Chroma or Qdrant)
  • Phase 2c: RAG query endpoint with context retrieval
  • Cloud model integration: Add Claude/GPT-4 for complex reasoning
  • Request caching: LRU cache for identical prompts
  • Streaming responses: SSE for real-time token generation

Try It Yourself

Quick Start:

  1. Clone this approach with your own backends (Ollama, vLLM, etc.)
  2. Start with simple heuristics: prompt length + keyword matching
  3. Log every request to JSONL for analysis
  4. Measure: does your heuristic-based router meet your latency target?
  5. Only add ML if heuristics can't reach your accuracy goal

Key Insight: Measure first, optimize second. Don't assume ML is needed until simple solutions fail.
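
To check step 5 against your own data, a small accuracy harness is enough. The sketch below is illustrative only: `routing_accuracy` and `length_and_keyword_router` are hypothetical names, and the labeled prompts are made up for the example rather than taken from the test suite used in this session.

```python
from typing import Callable, List, Tuple

def routing_accuracy(
    route: Callable[[str], str],
    labeled: List[Tuple[str, str]],
) -> float:
    """Fraction of prompts where route(prompt) matches the expected engine."""
    if not labeled:
        raise ValueError("need at least one labeled prompt")
    correct = sum(1 for prompt, expected in labeled if route(prompt) == expected)
    return correct / len(labeled)

def length_and_keyword_router(prompt: str) -> str:
    """Baseline from step 2: prompt length plus keyword matching."""
    code_hints = ("def ", "function", "class ")
    if len(prompt) > 4096 or any(k in prompt.lower() for k in code_hints):
        return "llamacpp"
    return "ollama"

# Hand-labeled examples (invented for this sketch)
labeled_prompts = [
    ("hello there", "ollama"),
    ("Write a function to reverse a string", "llamacpp"),
    ("Summarize this paragraph in detail", "llamacpp"),
    ("what's the capital of France?", "ollama"),
]
accuracy = routing_accuracy(length_and_keyword_router, labeled_prompts)
print(f"accuracy: {accuracy:.0%}")
```

The third prompt is deliberately one the baseline gets wrong. Misses like that show which rule to add next, or whether the remaining errors are nuanced enough to justify a learned router.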


This is Day 1 of the DGX Lab Chronicles, documenting real AI experiments on NVIDIA DGX hardware. Session files and code available on the DGX system at /home/bioinfo/workspace/infrastructure/gateway/.


Related Articles

  • DGX Lab: When Benchmark Numbers Meet Production Reality - Day 4
  • DGX Lab: Supercharge Your Shell with 50+ ML Productivity Aliases - Day 2
  • DGX Lab: Building a Complete RAG Infrastructure - From Ollama to Qdrant to AnythingLLM - Day 3

About the Author: Justin Johnson builds AI systems and writes about practical AI development.
