# Reproducing Warp Decode: What Happens When You Flip MoE Parallelism on Blackwell
Cursor published Warp Decode last week. No code. No independent reproductions. Just a blog post describing a different way to parallelize expert dispatch on a GPU. The claims: 1.84x throughput on B200 with better numerical accuracy than the standard approach.
I built it from scratch on a DGX Spark GB10 (Blackwell, sm_121) and tested it on two models. This post covers the reproduction, the results, why it works better on some architectures than others, and the one thing I tried that failed.
## What Warp Decode Does
Standard MoE inference organizes GPU work around experts. For each token, a router picks top-k experts. The implementation gathers all tokens assigned to expert 0, runs that expert's FFN, scatters results back, then moves to expert 1. Repeat for all active experts. This scatter-gather loop has overhead: Python dispatch, dynamic indexing, synchronization between expert batches.
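The expert-centric loop can be sketched in NumPy (shapes, names, and the ReLU activation are illustrative, not any framework's actual code):

```python
import numpy as np

def moe_forward_per_expert(x, gate_w, up_w, down_w, expert_ids, expert_probs):
    """Expert-centric dispatch: gather tokens per expert, run the expert FFN,
    scatter results back. Shapes:
      x            (tokens, hidden)
      gate_w, up_w (experts, hidden, inter)
      down_w       (experts, inter, hidden)
      expert_ids   (tokens, top_k)  router choices
      expert_probs (tokens, top_k)  routing weights
    """
    out = np.zeros_like(x)
    for e in range(gate_w.shape[0]):
        tok, slot = np.nonzero(expert_ids == e)        # gather step
        if tok.size == 0:
            continue
        h = x[tok]
        # ReLU stands in for the model's GELU/SiLU; the dispatch shape is the point.
        inter = np.maximum(h @ gate_w[e], 0.0) * (h @ up_w[e])
        # scatter-add step (np.add.at is safe even with repeated token indices)
        np.add.at(out, tok, expert_probs[tok, slot, None] * (inter @ down_w[e]))
    return out
```

The per-expert iteration is the overhead source: one gather, one FFN launch, and one scatter per active expert, every decode step.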
Warp Decode flips the axis. Instead of "which tokens go to this expert?", each GPU thread asks "what's my output value?" Each thread (or warp) owns one output dimension and pulls from whatever experts it needs. No gather. No scatter. No synchronization between experts. Every thread is independent.
Concretely, I implemented two Triton kernels:
**Kernel 1 (gate_up):** Each program instance owns a block of intermediate neurons for one token-expert pair. It streams over the hidden dimension, accumulating gate and up dot products in FP32 registers, applies the activation function, and writes one chunk of intermediate output.
**Kernel 2 (down):** Each program instance owns a block of output dimensions for one token. It loops through all top-k routed experts, loading intermediate activations and down-projection weights, folding routing weights into an FP32 accumulator. One write at the end.
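The semantics of the two kernels, emulated in NumPy (this mirrors the data flow, not the Triton tiling; the Python loops stand in for independent program instances, and ReLU stands in for the real activation):

```python
import numpy as np

def kernel1_gate_up(x, gate_w, up_w, expert_ids):
    """Each (token, routed-expert) pair independently computes its own
    intermediate chunk: act(x @ Wg) * (x @ Wu). No gather over tokens."""
    tokens, top_k = expert_ids.shape
    inter = np.empty((tokens, top_k, gate_w.shape[2]))
    for t in range(tokens):                 # independent program instances
        for k in range(top_k):
            e = expert_ids[t, k]
            g = x[t] @ gate_w[e]            # gate projection, fp32 accumulation
            u = x[t] @ up_w[e]              # up projection
            inter[t, k] = np.maximum(g, 0.0) * u   # activation (ReLU stand-in)
    return inter

def kernel2_down(inter, down_w, expert_ids, expert_probs):
    """Each token owns its output row; it loops over its top-k experts,
    folding the routing weight into the accumulator. One write at the end."""
    tokens, top_k, _ = inter.shape
    out = np.zeros((tokens, down_w.shape[2]))
    for t in range(tokens):
        for k in range(top_k):
            e = expert_ids[t, k]
            out[t] += expert_probs[t, k] * (inter[t, k] @ down_w[e])
    return out
```

Note what is absent: no index lists of "tokens assigned to expert e", no scatter, no barrier between experts.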
The key property: no shared mutable state between programs. The GPU scheduler sees a flat namespace of independent work.
## Results: Gemma 4 26B-A4B
128 experts, top-8 routing, 704-dim expert FFN, GELU activation. Brand new model (one week old at the time of testing).
Layer-level benchmarks with error bars (200 iterations, fixed seeds):
| Batch | HF Eager (ms) | Warp Decode (ms) | Speedup | Cosine Sim |
|-------|--------------|------------------|---------|------------|
| 1 | 2.39±0.93 | 0.54±0.09 | 4.42x | 0.99999976 |
| 2 | 5.30±1.82 | 1.01±0.12 | 5.23x | 0.99999970 |
| 4 | 8.66±3.04 | 1.88±0.15 | 4.60x | 0.99999976 |
| 8 | 14.16±4.29 | 3.01±0.13 | 4.71x | 0.99999970 |
Average 4.7x at the layer level. Notice the variance: HF eager has a coefficient of variation (CV) of 0.30-0.39 (the Python per-expert loop is noisy), while Warp Decode sits at 0.04-0.17 (Triton kernels are consistent).
Against HF's batched matmul (a stronger baseline that uses `torch.bmm` with gathered expert weights): 2.6-3.2x speedup.
I also tried `torch.compile()` on the eager path. It didn't help. Dynamic shapes in the per-expert loop defeat the compiler. At batch=8 it was actually slower than uncompiled.
**End-to-end generation** (full model, all 30 MoE layers patched, 64 tokens):
| Config | Tokens/sec |
|--------|-----------|
| Stock HuggingFace (eager) | 11.89 |
| Warp Decode (30 layers patched) | 16.39 |
| **Speedup** | **1.38x** |
The 4.7x layer-level speedup compresses to 1.38x end-to-end because MoE expert dispatch is roughly 25% of total forward-pass time; attention, dense MLP, embedding, and normalization are unchanged.
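The compression follows Amdahl's law. A back-of-envelope check (my framing, not from the original post):

```python
def end_to_end_speedup(moe_fraction, layer_speedup):
    """Amdahl's law: only the MoE-dispatch fraction of the forward pass
    is accelerated; everything else runs at the old speed."""
    return 1.0 / ((1.0 - moe_fraction) + moe_fraction / layer_speedup)

# With a 25% MoE fraction, a 4.7x layer win predicts ~1.25x end to end.
# The measured 1.38x corresponds to an effective fraction closer to 35%,
# so the "roughly 25%" figure is likely conservative.
```

The gap between the predicted 1.25x and measured 1.38x suggests the patched layers also shave some time outside the dispatch loop proper, though I have not profiled that separately.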
## Results: Phi-3.5-MoE
16 experts, top-2 routing, 6400-dim expert FFN, SiLU activation. This model tests a very different architecture: fewer experts, lower top-k, much larger intermediate dimension.
| Batch | HF Eager (ms) | Warp Decode (ms) | Speedup |
|-------|--------------|------------------|---------|
| 1 | 2.19 | 1.70 | 1.29x |
| 2 | 5.05 | 3.03 | 1.67x |
| 4 | 7.69 | 5.38 | 1.43x |
| 8 | 11.76 | 8.25 | 1.43x |
Smaller speedup. Top-2 routing means only 2 expert dispatches per token. The overhead Warp Decode eliminates is proportionally smaller. The technique still helps, but modestly.
## The Fusion Experiment
Cursor describes keeping "the eight intermediate results" entirely in registers, never writing them to global memory. I tried this: a single fused kernel that computes gate_up intermediate values in registers and immediately uses them for the down projection. No intermediate buffer.
It was slower. On both models. By a lot.
| Model | Intermediate | Unfused (ms) | Fused (ms) | Ratio |
|-------|-------------|-------------|-----------|-------|
| Gemma 4 | 704-dim, 11 KB/token | 0.54 | 5.10 | 0.10x |
| Phi-3.5-MoE | 6400-dim, 25 KB/token | 1.70 | 11.22 | 0.15x |
I tested with 6400-dim intermediates (Phi) specifically to see if larger buffers would tip the balance toward fusion. They didn't.
I initially thought this was a Triton limitation: Triton programs are thread blocks, not individual warps, so maybe the granularity was wrong. To test this, I wrote the same fused kernel in raw CUDA with explicit `__shfl_xor_sync` warp-level butterfly reductions. One warp per output dimension, full hardware control, compiled for sm_121.
The CUDA fused kernel was 700x slower than the unfused Triton approach.
The diagnosis: data reuse. The CUDA fused kernel loads hidden states once per intermediate dimension (5,632 times per output dimension across all experts). The Triton unfused kernel loads hidden states once per 64 intermediate dimensions (11 times per expert). That's 64x more redundant memory traffic. For Gemma 4's dimensions, this works out to ~90 GB of redundant reads at ~200 GB/s effective bandwidth, predicting ~450ms. The measured 387ms matches.
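The ~450ms prediction can be reproduced from the dimensions quoted above (FP16 elements assumed):

```python
hidden_dim = 2816        # Gemma 4 hidden size: length of each gate/up dot product
inter_dim = 704          # expert FFN intermediate dimension
top_k = 8
bytes_per_elem = 2       # fp16

# Fused CUDA V1: the hidden-state vector is re-read for every intermediate
# dimension of every routed expert, once per output dimension.
loads_per_output_dim = inter_dim * top_k                    # 5,632
vector_bytes = hidden_dim * bytes_per_elem                  # one hidden-state read
output_dims = hidden_dim                                    # down projection output size
redundant_bytes = output_dims * loads_per_output_dim * vector_bytes   # ~89 GB

effective_bw = 200e9                                        # ~200 GB/s (post's estimate)
predicted_ms = redundant_bytes / effective_bw * 1e3         # ~447 ms
```

A pure bandwidth model landing within 15% of the measured 387ms is strong evidence the kernel is doing no useful caching at all.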
The fix would be shared memory tiling: load hidden states once per block, reuse across intermediate dimensions. But that's what Triton's tile-based programming model already does automatically. You'd be reimplementing Triton's data reuse in CUDA, then adding fusion on top.
I then wrote a V2 with shared memory tiling: hidden states loaded once per block, reused across intermediate dimensions. Still 900ms. The hidden state reuse was fixed, but each thread still serializes over 704 intermediate dimensions, computing full dot products of length 2816. The inner loop has no parallelism across intermediate dims.
Triton's unfused approach splits intermediate dimensions across 64 threads working simultaneously. Getting that same parallelism inside a fused CUDA kernel while maintaining the register-local intermediate values requires multi-level tiling: shared memory for hidden states and weight tiles, cooperative thread groups for the dot products, register-level accumulation for the intermediate activations. That's serious kernel engineering, not a weekend experiment.
I iterated through 9 CUDA kernel versions, systematically fixing each bottleneck. The journey from V1 (387ms, 700x slower) to V9 (0.99ms, 1.85x slower) was instructive:
- **280x** came from getting the algorithm right: match Triton's parallelism decomposition (V1 to V6)
- **1.4x** came from vectorized loads: `half2`, `float4`, `__ldg` cache hints (V6 to V9)
- **1.85x** remains: Triton's compiler generates better instruction scheduling and register allocation
One final finding closed the loop. NVIDIA's `wmma` tensor core API only supports a few fixed tile shapes (16x16x16, 32x8x16, and 8x32x16 for FP16). At batch=1 decode, the computation is matrix-vector (N=1), far smaller than any supported tile, so tensor cores can't help. Triton's `tl.dot` falls back to vectorized FMA for this shape, not tensor cores. The remaining gap is compiler quality, not hardware utilization.
The unfused Triton approach gives you most of the possible win with a fraction of the engineering effort. That's the pragmatic answer.
## What Determines Whether Warp Decode Helps
It's the top-k routing count relative to expert count.
| Model | Experts | Top-k | Ratio | Speedup |
|-------|---------|-------|-------|---------|
| Gemma 4 | 128 | 8 | 1:16 | 4.7x |
| Phi-3.5-MoE | 16 | 2 | 1:8 | 1.4x |
High top-k with many experts means more dispatch overhead per token. Warp Decode eliminates that overhead. Low top-k with few experts means the dispatch is already cheap. Less room to improve.
Intermediate dimension doesn't drive it. Model size doesn't drive it. Routing sparsity does.
## The Industry Direction
| Model | Year | Experts | Intermediate | Top-k |
|-------|------|---------|-------------|-------|
| Mixtral 8x7B | 2023 | 8 | 14,336 | 2 |
| DeepSeek-V3 | 2025 | 256 | 2,048 | 8 |
| Gemma 4 26B-A4B | 2026 | 128 | 704 | 8 |
| Qwen3.5-35B | 2026 | 256 | 512 | 8 |
The trend: many small experts, high top-k, small intermediate dimensions. Every 2026 model follows this pattern. Fine-grained routing with more experts gives better specialization and load balancing.
This is exactly the architecture profile where Warp Decode wins. The technique isn't a novelty for one model generation. It's aligned with where MoE design is heading.
## Where These Batch Sizes Actually Matter
Before the vLLM comparison, some context on what batch sizes show up in production. This determines whether our results are relevant or academic.
**Decode dominates interactive sessions.** For a typical chat interaction, prefill (processing the prompt) takes 10-20% of wall-clock time. Decode (generating each output token sequentially) takes 80-90%. Every decode step is memory-bandwidth-bound, loading entire model weights from HBM. That's where MoE dispatch overhead lives, and where Warp Decode operates.
**Production batch sizes by workload:**
| Workload | Batch Size | Time Pressure | Our Speedup |
|----------|-----------|---------------|-------------|
| Code completion (Cursor) | 1-4 | Sub-100ms latency | 1.08-1.18x vs vLLM kernel |
| Interactive chat | 8-16 | TTFT-sensitive | 1.08-1.15x vs vLLM kernel |
| Multi-tenant serving | 32-64 | Throughput | vLLM kernel wins |
| Batch/offline | 128-2048 | Cost per token | vLLM kernel wins decisively |
Sources: vLLM production configs typically set `--max-num-seqs 256`, but the effective decode batch per GPU is 32-64. Google's TPU serving for Gemini reportedly uses batch ~30 for decode. Anthropic offers a "low-batch-size" fast mode at a 6x cost premium, which tells you the latency-sensitive market is real.
The split isn't even. Roughly 15-25% of production MoE workloads run at B=1-32 (latency-sensitive). The rest optimize for throughput. But that 15-25% is where users directly feel response time, and it's where the willingness to pay for speed is highest.
## vs vLLM's Triton Kernel (Fair Comparison)
I extracted vLLM 0.19.0's actual `fused_moe_kernel` (the `@triton.jit` function, copied verbatim) and benchmarked kernel-to-kernel, excluding preprocessing. First attempt included my Python reimplementation of `moe_align_block_size` (4-9ms overhead), which made vLLM look terrible. That was wrong. Isolating just kernel execution time gives the honest picture:
| Batch | Use Case | vLLM kernel (ms) | Warp Decode (ms) | Winner |
|-------|----------|-----------------|------------------|--------|
| 1 | Code completion | 0.40 | 0.37 | Warp (1.08x) |
| 4 | Small chat | 1.53 | 1.30 | Warp (1.18x) |
| 8 | Chat serving | 2.29 | 1.99 | Warp (1.15x) |
| 16 | Chat serving | 3.65 | 3.39 | Warp (1.08x) |
| 32 | High concurrency | 4.55 | 5.44 | vLLM (0.84x) |
| 64 | Batch inference | 5.58 | 9.57 | vLLM (0.58x) |
| 128 | Prefill | 5.57 | 17.04 | vLLM (0.33x) |
Crossover at batch ~24. Warp Decode wins the decode path (B=1-16, which covers code completion and interactive chat), vLLM's blocked matmul wins the throughput path (B=32+, which covers batch inference and prefill).
This matches Cursor's original framing: "Warp Decode optimizes exclusively for small-batch decode scenarios." It also explains why they built it. Cursor runs code completion (batch 1-4). That's exactly where this technique wins.
For the full MoE forward pass (including Python dispatch overhead), Warp Decode wins at every batch size tested up to 128, with 1.9-5.1x speedup over HF eager. But the kernel-to-kernel comparison is the honest one.
## What Would Change for Production
**Triton to CUDA.** We used Triton for rapid prototyping. Production inference engines (vLLM, TensorRT-LLM) would need compiled CUDA kernels. The algorithm translates directly. The warp-level fusion that Cursor describes would be possible in CUDA but not Triton.
**Batch size scaling.** Our HF comparisons cover batch 1-8 (single-token decode); larger batches were tested only kernel-to-kernel against vLLM. Serving-scale benchmarks inside a real engine are still needed. The parallelism story changes when you have 32+ tokens.
**Quantization interaction.** INT4/FP8 expert weights might shift the compute-vs-bandwidth balance. Worth testing.
**Full framework integration.** We monkey-patched HuggingFace's expert dispatch. A proper integration handles the router, recombination, and memory management together.
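The monkey-patch pattern looks roughly like this (duck-typed sketch; the class-name check and `fast_forward` hook are illustrative, and real HF MoE block class names vary per model):

```python
def patch_moe_layers(model, fast_forward):
    """Replace forward() in-place on every MoE block of a model.

    Duck-typed: anything exposing .modules() works (e.g. a torch.nn.Module).
    The name check below is a stand-in for matching the model's actual
    MoE block class.
    """
    patched = 0
    for module in model.modules():
        if type(module).__name__.endswith("SparseMoeBlock"):
            # Bind the replacement as an instance attribute, shadowing the
            # class method; __get__ makes it a bound method of this instance.
            module.forward = fast_forward.__get__(module, type(module))
            patched += 1
    return patched
```

This is fine for an experiment; a proper integration would register the kernel inside the serving framework rather than rebinding methods at runtime.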
## Reproducibility
38 correctness tests. Cosine similarity 0.99999976. Max absolute error 3.91e-03. Two models tested. Error bars on all measurements. Fixed random seeds. Code available.
What we don't have: comparison against Cursor's original implementation (no code available), testing on B200 hardware, or integration into a serving framework.
The reproduction validates the core idea and maps the architectural conditions where it helps. It doesn't claim to have matched or beaten the original authors' numbers on their hardware.
---
## Footnote: The Bandwidth Wall
After completing the dispatch optimization work, I ran an automated research loop (52 experiments, two parallel Claude-in-the-loop optimization agents iterating for ~45 minutes) to see what else could be squeezed out.
Both agents independently converged on the same finding: these kernels are bandwidth-bound, not compute-bound. The expert weights dominate memory traffic. Halving precision to INT8 (per-row absmax symmetric quantization) gives a clean 2x speedup with cosine similarity 0.9999. INT4 with group quantization reaches 2.5x but accuracy drops to 0.991.
| Precision | B=1 Layer Latency | Speedup | Accuracy |
|-----------|------------------|---------|----------|
| FP16 (Warp Decode) | 0.57ms | 1.0x | 1.0000 |
| INT8 | 0.28ms | 2.01x | 0.9999 |
| INT4g128 + INT8 mixed | 0.21ms | 2.58x | 0.9913 |
This isn't novel. vLLM and TensorRT-LLM already support INT8 MoE dispatch. The interesting part is the interaction: Warp Decode's dispatch optimization and weight quantization are complementary. The dispatch pattern reduces overhead from expert routing. Quantization reduces bandwidth from weight loading. They target different bottlenecks and compound.
For practitioners: if your MoE decode is slow, quantize the expert weights first (2x with no engineering). Then look at dispatch patterns (1.1-1.2x with custom kernels). Both help. Neither obsoletes the other.
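Per-row absmax symmetric INT8 quantization is simple enough to sketch (NumPy; my implementation of the scheme, not the agents' code):

```python
import numpy as np

def quantize_rows_int8(w):
    """Symmetric per-row absmax: one fp32 scale per output row, no zero point."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0.0] = 1.0                              # guard all-zero rows
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_rows(q, scale):
    return q.astype(np.float32) * scale
```

In the kernel, the dequantize folds into the weight load, so the FP32 accumulation path is unchanged; only the HBM traffic halves.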
Justin