AIXplorethe lab
AI Systems & Architecture10 min readshipped

Dendrite: Why O(1) Fork Latency Changes Everything for AI Agents

Dendrite: Why O(1) Fork Latency Changes Everything for AI Agents

Executive Summary
Modern inference engines like vLLM and SGLang optimize for multi-tenant throughput, serving thousands of independent requests. But agentic workloads have a fundamentally different profile: a single agent exploring multiple reasoning paths from a shared context. Dendrite inverts design priorities, providing O(1) fork latency through copy-on-write KV cache semantics. The result: 1000-10000x faster branching and 5x memory reduction for tree-structured reasoning.

The Hidden Bottleneck in Agentic Reasoning

When an AI agent reasons through a complex problem, it rarely follows a single path. Tree-of-Thought explores multiple reasoning strategies. MCTS evaluates hundreds of potential moves. Beam search maintains parallel hypotheses. All of these patterns require the same fundamental operation: forking the inference state.

Consider an AI coding agent debugging a complex issue:

[System prompt + codebase context: 4K tokens]
              │
         ┌────┴────┐
         │         │
      Fix A     Fix B        ← 2 potential fixes
         │         │
      ┌──┴──┐   ┌──┴──┐
    Test   Test Test  Test   ← Test each with 2 approaches

This modest 2-level tree requires 6 fork operations. Each fork needs access to the full 4K token context that came before it.

Here's where current systems struggle:

SystemFork OperationLatency (4K context)
vLLMRecompute prefix~50-100ms
SGLangRadixAttention lookup + copy~5-10ms
DendritePointer copy (CoW)~3μs

For our 6-branch coding agent scenario:

  • vLLM: 300-600ms just for branching overhead
  • SGLang: 30-60ms
  • Dendrite: 18μs

That 300ms might seem small, but it compounds. A Tree-of-Thought exploration with 27 branches takes 1.35 seconds in vLLM before generating a single new token. An MCTS search with 500 simulations spends 25-50 seconds on fork overhead alone.

This isn't a minor optimization opportunity. Fork latency determines which reasoning strategies are practical for interactive applications.

The Key Insight: Fork Should Be O(1)

The problem with existing systems isn't poor engineering. vLLM and SGLang are excellent at what they optimize for. The problem is they optimize for the wrong metric when it comes to agentic workloads.

vLLM's PagedAttention revolutionized multi-tenant serving by managing KV cache as paged blocks. But when you fork a sequence, you still need to copy that data or recompute it. SGLang's RadixAttention shares prefixes via a radix tree, reducing redundant computation. But creating a new branch still involves state allocation and copying.

Dendrite asks a different question: what if fork was just a pointer copy?

The answer comes from operating systems. When Unix forks a process, it doesn't copy all memory. It marks pages as copy-on-write (CoW). Both parent and child point to the same physical memory. Only when one writes to a page does the system copy that specific page.

Applied to KV cache:

Traditional (vLLM/SGLang):              Dendrite (Copy-on-Write):

Fork = Copy entire KV cache             Fork = Copy block table pointers
       O(context_length)                       O(num_blocks) ≈ O(1)

┌─────────────────────┐                 ┌─────────────────────┐
│ Parent: 4K tokens   │                 │ Block Table (Parent)│
│ [================]  │                 │ [0][1][2][3]        │
└─────────────────────┘                 └──│──│──│──│────────┘
         │ fork                            │  │  │  │
         ▼                                 ▼  ▼  ▼  ▼
┌─────────────────────┐                 ┌──────────────────────┐
│ Child: Copy 4K      │                 │ Physical Blocks      │
│ [================]  │  ← 50-100ms     │ [B0][B1][B2][B3]     │
└─────────────────────┘                 │ ref=2 (shared!)      │
                                        └──────────────────────┘
                                                   ▲
                                        ┌──│──│──│──│────────┐
                                        │ [0][1][2][3]        │
                                        │ Block Table (Child) │ ← 500ns
                                        └─────────────────────┘

Memory duplicates only when branches diverge and only at block granularity (16 tokens). A 4K token prefix shared across 6 branches uses ~1.1GB instead of 6GB.

Dendrite's Architecture

Dendrite is built around three core concepts: paged blocks, reference counting, and tree-aware scheduling.

Paged Blocks with Reference Counting

The KV cache divides into fixed-size blocks (16 tokens by default). Each physical block tracks how many sequences reference it:

// Conceptual structure
struct PhysicalBlock {
    id: BlockId,
    ref_count: AtomicU32,      // How many sequences share this block
    key_cache: [f16; BLOCK_SIZE * HEAD_SIZE],
    value_cache: [f16; BLOCK_SIZE * HEAD_SIZE],
}

// Each sequence maintains a logical view
struct BlockTable {
    blocks: Vec<BlockId>,      // Pointers to physical blocks
}

When a sequence forks, Dendrite performs a shallow copy of the block table and increments reference counts. For a 4K context (256 blocks), this means copying 256 integers and incrementing 256 counters. On modern CPUs: nanoseconds.

Copy-on-Write Trigger

The magic happens when sequences diverge. When a sequence tries to append a token to a shared block:

  1. Check the block's reference count
  2. If ref_count > 1 (shared):
    • Allocate a new physical block
    • Copy content from the shared block
    • Update this sequence's block table
    • Decrement the old block's reference count
  3. Write proceeds on the now-private block

Key property: Copying happens lazily, only for blocks that actually diverge, and only once per diverging sequence.

Tree-Aware Scheduling

Unlike vLLM's scheduler (optimized for fairness via FCFS), Dendrite's scheduler understands branch priorities. In MCTS, some branches are more promising based on value estimates. Dendrite's priority queue lets agents specify which branches to explore first, enabling efficient tree search without wasting compute on low-value paths.

Why Rust?

Dendrite is written in Rust, which might seem unusual for ML infrastructure dominated by Python and C++. Three factors drove this choice:

Memory safety without garbage collection. Copy-on-write semantics require precise control over when memory is allocated, shared, and freed. Reference counting bugs in C++ cause silent data corruption. Rust's ownership system catches these errors at compile time.

Fearless concurrency. Tree search is inherently parallel. Multiple branches can generate tokens simultaneously. Rust's type system prevents data races, letting Dendrite safely share KV cache across threads without locks on the hot path.

Embeddable library. Dendrite is designed as a library that agents embed directly, avoiding IPC overhead. Rust compiles to native code with C-compatible FFI, integrating cleanly with Python (via PyO3) or any other language.

The tradeoff is a smaller ecosystem than Python. But for infrastructure that sits below the agent framework layer, correctness and performance matter more than library availability.

Measured Performance

Real numbers from Dendrite's benchmark suite:

Fork Latency vs Context Length

TokensBlocksFork Latency
161635ns
25616650ns
102464680ns
4096256720ns
163841024790ns

16K tokens adds only 155ns vs 16 tokens. Compare this to copying 16K tokens of KV cache at memory bandwidth, which would take ~130ms.

Memory Efficiency

Tree-of-Thought scenario:

  • Problem prefix: 1024 tokens (64 blocks)
  • 3 reasoning branches, each 512 tokens
  • 3 evaluations per branch, each 128 tokens
ApproachBlocks UsedSavings
Linear (copy prefix)936 blocksbaseline
Tree (CoW sharing)232 blocks75% reduction

Effective Throughput for Tree Search

SystemEffective tok/sec (4-way branching)
vLLM40-50
SGLang80-100
Dendrite150-200

Effective tokens = unique tokens generated / wall time

Dendrite's single-sequence throughput is lower than vLLM (40-50 vs 150-200 tok/s). That's intentional. Dendrite optimizes for the branching case, not the linear case.

Real-World Impact Scenarios

AI Coding Agent

6 branches exploring different fixes:

  • vLLM: 300-600ms branching overhead
  • SGLang: 30-60ms
  • Dendrite: 18μs

User experience difference: "Why is it thinking?" vs "Instant exploration"

MCTS for Code Generation

100 simulations, average 5 moves each = 500 forks:

  • vLLM: 25-50 seconds
  • SGLang: 2.5-5 seconds
  • Dendrite: 1.5ms

MCTS becomes practical only with O(1) fork. At 25 seconds of overhead, you'd never ship an interactive MCTS-based agent. At 1.5ms, it's invisible.

Multi-Agent Debate

Multiple agents share system prompt and context, diverge in responses:

  • Memory with copying: N agents × context size
  • Memory with Dendrite: 1 × context size + divergent responses

For 8 agents sharing 8K context, that's 8GB vs ~1.2GB.

When to Use Dendrite (and When Not To)

Positioning
Dendrite and vLLM serve different use cases. This is complementary positioning, not competition. Use the right tool for your workload.

Use Dendrite For

  • Tree-of-Thought reasoning: Multiple paths from shared prompt
  • MCTS: Game playing, planning, code generation with search
  • Beam search: Translation, summarization with deep trees
  • Speculative decoding: Fast draft + verify patterns
  • Multi-agent systems: Agents sharing context efficiently

Use vLLM/SGLang For

  • Simple chat/QA: No branching needed
  • Batch processing: Independent prompts don't benefit from sharing
  • Maximum single-sequence throughput: vLLM is 3-4x faster

The Hybrid Approach

Many production systems will use both. vLLM handles high-throughput API serving for simple requests. Dendrite powers the reasoning layer when agents need to explore. The inference engine becomes a choice based on workload characteristics, not a one-size-fits-all decision.

Getting Started

Dendrite provides a Rust API designed for embedding in agent frameworks:

// Create paged cache with O(1) fork support
let cache = PagedKvCache::new(num_layers, pages, page_size, ...)?;

// Allocate root sequence and prefill
let root = cache.allocate_sequence();
// ... prefill with prompt tokens ...

// O(1) fork - shares pages via copy-on-write
let branch_a = cache.fork_sequence(root)?;
let branch_b = cache.fork_sequence(root)?;

// Each branch can now generate independently
// Pages copy only when branches write different tokens

Python bindings (via PyO3) are planned for integration with LangGraph, AutoGen, and other agent frameworks.

Current Status
Dendrite is research-grade software with 272 passing tests and working GPU inference. FP8 quantization and FlashInfer FFI integration are in progress. Production hardening is ongoing.

The Bigger Picture

The LLM inference stack has optimized for the wrong metric. Multi-tenant throughput matters for API providers serving millions of independent requests. But the next wave of AI applications isn't about serving independent requests. It's about agents that reason, explore, and search.

Tree-of-Thought showed that exploring multiple reasoning paths improves performance from 4% to 74% on complex math problems. MCTS enables superhuman game play. Speculative decoding accelerates generation 2-3x. All of these techniques share a common requirement: fast forking of inference state.

When fork takes 50-100ms, you explore fewer branches. When fork takes 500ns, you explore as many as your reasoning strategy demands. The bottleneck shifts from infrastructure to ideas.

Dendrite is a bet that agent architectures will demand different infrastructure than chatbots. As reasoning strategies grow more sophisticated, O(1) fork becomes table stakes. The question isn't whether inference engines will support efficient branching. It's whether they'll be designed for it from the start.


Dendrite is open source. Check out the GitHub repository for code, benchmarks, and contribution guidelines.

Related Articles

  • Agent Architectures with MCP
  • Building Effective AI Agents
  • DSPy: Programming Language Models at Scale
  • Manus.im System Architecture

For technical details on PagedAttention and KV cache management, see the vLLM paper and SGLang RadixAttention blog post.


Related Articles


About the Author: Justin Johnson builds AI systems and writes about practical AI development.

justinhjohnson.com | Twitter | LinkedIn | Run Data Run | Subscribe

Follow the lab

Get the next experiment

Enjoyed the breakdown on Dendrite: Why O(1) Fork Latency Changes Everything for AI Agents? New entries land roughly weekly. No digest, no roundup. Just the next build log, when it ships.