Practical ApplicationsJune 25, 202611 min readshipped

Day One: Standing Up the Inference Platform

Run this

claude "Stand up a local OpenAI-compatible inference endpoint on a single H100: serve a 27B dense reasoning model with SGLang and DFlash speculative decoding, put it behind a LiteLLM gateway on port 4000, then benchmark single-stream tokens per second and time-to-first-token. Report the exact launch flags you used and flag any that are missing from the public DFlash docs."

claude code

The node came up on a Tuesday. Eight H100s, behind a grant clock that doesn't stop. The instinct, when a machine like that turns on, is to point it at the hard problem immediately. We didn't. Day one was scaffolding, and scaffolding is most of what makes the expensive hours count.

A borrowed node teaches you one thing fast: the science never stalls on the science. It stalls because someone can't log in, or a training run dies three hours deep on a half-copied file, or the model your autonomous agent depends on isn't actually serving yet. So the order of operations for day one wrote itself. Get the team onto the box safely. Install the software the research engine needs. Stage the data so nothing waits on a copy. Then, and only then, stand up the thing everything else leans on: local inference.

This is the build log for that day. Most of it transfers to anyone standing up real work on rented or granted hardware, whether or not you care about retinas.

Provisioning the team

Five people needed access, and exactly one of them needed the power to wipe the box.

The account the cloud console hands you is root-equivalent and destructive: the same credentials that run your jobs can also delete the instance. You do not give that to the rest of your team. So before anyone touched the node, we built the access structure: one admin account, four sandboxed accounts scoped to their own directories, a shared lane for collaboration, and a way in that didn't mean opening the firewall.

That last part is the interesting one, and it's deep enough that I've tucked it into a side quest above rather than derail the walk-through. Short version: the box dials out to a server we control and the team connects through that relay, so we never expose an inbound port on the node at all.

Installing the brain

The second job was software, and the software that mattered was Claude Code.

The research engine that runs alongside our modeling work, ARIA, is built on headless claude -p children. It needs a real Claude Code install to think. The box shipped with none of it: no Node, no claude binary, nothing. So the setup was Node 20, then Claude Code, then the part that actually takes judgment: replicating the safety hooks.

Those hooks are not optional on a shared box with an autonomous agent on it. They block destructive shell commands before they run, and they block content that shouldn't leave the building. We carried the pattern over from our existing engine and adapted two things to this environment: the paths, and the confidentiality denylist. The denylist is project-specific by design. What counts as a leak here is not what counts as a leak somewhere else, and the hook config is where that judgment lives.

Staging the data

The training corpus is public: AI-READI, a large multimodal retinal dataset built for exactly this kind of open research. Two pieces of it mattered for day one. The OCT-A corpus came to 1.07 TiB across 175,713 files. The color fundus corpus came to 50,315 DICOM images at 116 GB, spread across four different camera devices.

Pulling that down cleanly onto the node is its own discipline, and the first attempt taught the lesson the hard way. The compressed version of the lesson: stop guessing where data lives, run long transfers where a dropped connection can't kill them, and treat every byte on a delete-only box as something you should be able to regenerate.

Both corpora landed on the persistent disk with zero failed transfers, and the fundus images got extracted in place to cropped 512-pixel PNGs: 50,315 of them, no errors. That PNG set is the substrate the encoder and the research engine both read from, which matters more than it sounds. One copy of the truth, two consumers, no drift.

The main event: a fully local inference platform

Everything above was setup for this. The goal was a brain for the research engine that runs entirely on the node, with no external API keys and no tokens leaving the box. Every generated token served from our own hardware.

Before touching a single serving flag, we ran our last30days skill over the current speculative-decoding research. This field moves week to week, and the recipe that was state-of-the-art a month ago is often superseded. The skill pulls the last thirty days of papers, repos, and discussion and synthesizes what's actually working now. It pointed where we expected: block-diffusion speculative decoding, and a target of 250 to 300 tokens a second on a single stream for a model in this class.

The architecture is three tiers behind one gateway:

A warm tier that's always resident: the model the engine talks to constantly.
An on-demand tier for heavier reasoning, spun up when a job needs it.
A premium tier for the largest model, reserved for synthesis.

All three sit behind a single LiteLLM gateway on one port, OpenAI-compatible, so every consumer talks to one address and routes by model name. Two serving engines underneath: SGLang for most of it, vLLM for the one model with no first-class SGLang recipe.

Why dense, not a mixture of experts

The warm model is Qwen3.6-27B, the dense variant, paired with a DFlash drafter for speculative decoding.¹DFlash is a block-diffusion speculative-decoding method with open drafter checkpoints from Z-Lab. The small drafter proposes a block of tokens and the target model verifies them in a single pass, so accepted guesses are nearly free. The dense-versus-mixture call wasn't obvious, so we let the benchmarks make it. On the intelligence and coding indices we trust,²Artificial Analysis publishes an intelligence index and a coding index that roll several public benchmarks into one comparable score per model, which is what we use to shortlist before testing locally. the 27B dense model scored 37.1 and 53.7, ahead of the larger 35B-A3B mixture-of-experts model at 31.6 and 41.9 and ahead of the comparable Gemma model. And dense models get the bigger win from speculative decoding: a mixture model activates only a slice of its parameters per token, which leaves less headroom for a drafter to run ahead. Dense gives the drafter more to predict, so the speedup is larger. Best base score and best speedup pointed the same way.

Interactive · Speculative decoding

Model

activates every parameter per token

Draft block size16 tokens

293tok/ssingle-stream decode

5.6×vs no spec-decode

The dense model predicts the next token reliably, so most of the drafted block survives verification and rides through for free. Bigger blocks keep paying off until acceptance tails away. This is the warm tier we shipped: dense 27B + DFlash, near 287 tok/s.

Illustrative model of acceptance-driven speedup, anchored to our measured warm-tier numbers.

The recipe was wrong as shipped

This is the part worth writing down, because it cost real time. The published DFlash serving recipe does not launch as documented. Three flags are missing from the public docs, and each one aborts the launcher before the model loads:

SGLANG_ENABLE_SPEC_V2=1 as an environment variable. DFlash is built on the Spec-V2 framework, and without this the argument parser exits before it even tries.
--mamba-scheduler-strategy extra_buffer. The 27B model uses a hybrid Mamba architecture,³Mamba is a state-space sequence architecture, an alternative to pure attention. This model interleaves Mamba and attention layers, which is why the scheduler and the cache both need Mamba-specific handling. and the radix cache needs this to coexist with speculative decoding on it.
The number of draft tokens must exactly equal the DFlash block size. The reference config had them mismatched; the launcher refuses to start and tells you, in so many words, that for DFlash they must match.

There's one more, less an undocumented bug than a model-specific detail: this hybrid model's Mamba layers need their linear-attention backends set explicitly, Triton for the prefill pass and FlashInfer for decode, which the stock command leaves unset. Get those four things right and it launches clean. Get any one wrong and you get a parser abort with a message that, at best, half-explains itself.

What it does once it's running

Tuned, the warm model serves a single stream at roughly 287 tokens a second on average, peaking near 359, with time-to-first-token around 200 milliseconds. Under load it aggregates to nearly 1,000 tokens a second across 16 concurrent streams. The key-value cache holds a 260,000-token pool, and it cleanly recalls a needle planted at 136,000 tokens of context, which is the long-context behavior the research engine actually needs when it's reasoning over a pile of papers.

Reasoning models eat their own token budget

This is a reasoning model: it emits its chain of thought in one field and its answer in another. Call it with a tight output cap and the thinking consumes the whole budget, so the answer field comes back empty. That looks exactly like a model failure when it's really a configuration one. Give reasoning models generous output budgets, parse the answer field, and don't trust structured-output mode on every build.

None of this stays up by luck. The gateway runs as a managed service that survives reboot, the warm model runs in a container set to restart unless we stop it, and the benchmark harness that proves all of the above lives in the repo, not in a scratch directory that the next wipe takes with it.

Mapping the Spark learnings forward

Here's the through-line for anyone who followed the DGX Lab series: almost none of this was figured out on the H100s. It was worked out months earlier on a DGX Spark, the desktop machine, where a wrong turn costs minutes instead of metered GPU-hours.

Which engine for which model. Whether speculative decoding was worth the complexity. The gateway pattern. The three-tier brain. All of that got beaten out on the small box first, and it transferred almost wholesale.

What changed is that the data-center card made everything simpler, not harder. The Spark's GPU architecture needed special handling at every turn: custom container images per model class, fallback kernels, architecture flags the upstream tools didn't set on their own. The H100 is a stock architecture. Plain installs work. No image-juggling, no kernel patches. Worth one caveat on the speedup numbers, though: the headline DFlash figures you see quoted are measured on Blackwell with its NVFP4 tensor cores, which Hopper doesn't have. We didn't isolate the speculative-decoding speedup on our own card, our tuning took single-stream from 250 to 287 tokens a second, so treat the splashy multiples as Blackwell numbers, not Hopper ones.⁴The widely-quoted DFlash speedups are Blackwell figures, measured with NVFP4 tensor cores the H100 (Hopper) doesn't have. We report absolute throughput rather than a speedup multiple, because we tuned the served config rather than benchmarking speculative decoding on versus off. The method is the same. The hardware just gets out of the way.

That's the day. A team that can work without tripping over each other, a research engine with a brain, a terabyte of data staged clean, and a local inference stack fast enough that nothing waits on it. The next post is about waking the research engine up on top of all this, which turned out to be three walls in a trench coat.

Related experiments

Apparatus

2,630 words · 11 min read

h100
sglang
speculative-decoding
litellm
llm-serving
dgx-spark
build-in-public