Gemini Diffusion Explained: Block-Parallel Denoising at 1-2k tokens/sec
For years, large autoregressive language models like GPT-4 and Claude have dominated text generation. They generate text one token at a time: the model predicts the next token based on all the tokens that came before it, much like writing a sentence word by word. This works well for coherent text, but it is inherently sequential, and errors made early are hard to fix later.
What if text generators worked like Stable Diffusion for images? Instead of predicting words one-by-one, what if they started with pure noise and gradually refined entire paragraphs through iterative denoising?
Google DeepMind has now demonstrated this with Gemini Diffusion, and the early results challenge common assumptions about language model architecture. [1] Instead of predicting word by word, Gemini Diffusion uses a technique inspired by diffusion models for image and audio generation. It generates the entire text sequence in parallel, refining it block by block through multiple steps.
Note 1: The full technical report is not yet published. The benchmarks here are from Google DeepMind's May 2025 blog post and corroborated by Fortune's coverage. Treat these numbers as preliminary.
The shift: from autoregressive to diffusion generation
Here is the core difference:
Autoregressive (AR) models generate text sequentially, one token at a time. Like writing a sentence word by word. (Examples: GPT-4, Claude.)
Diffusion models generate the entire sequence in parallel, starting from a "noisy" state (often fully masked or random text) and iteratively refining it over multiple steps. Like starting with a mixed-up paragraph and gradually revealing the correct words.
This parallel approach allows diffusion models to correct errors anywhere in the sequence during generation, making them potentially better for tasks like editing and revision.
In more technical terms, autoregressive LLMs predict the next token given all previous ones. They are fast to train and easy to cache, but serial at inference. Discrete diffusion LMs start from pure noise (or a fully masked sequence) and iteratively denoise the whole sequence. The reverse process updates tokens in parallel and is order-agnostic, so each step can correct errors anywhere in the sequence, and a long output can be sampled in a handful of refinement passes.
Early proofs-of-concept like D3PM and DiffuSeq matched small GPT-2 quality but were 10-100x slower. Gemini Diffusion changes that equation entirely. [2]
Note 2: D3PM (Austin et al., 2021) formulated the multinomial noise schedule. DiffuSeq (Gong et al., 2022) showed diversity gains in translation. SEDD (Lou et al., 2023) closed the quality gap using score-entropy diffusion. LLaDA (2024) proved 8B-scale viability competitive with LLaMA 3-8B.
From images to words: the evolution timeline
The research arc spans four years:
- 2021: D3PM formulated multinomial noise schedule for discrete tokens
- 2022: DiffuSeq showed diversity gains in translation and summarization
- 2023: SEDD closed quality gap to GPT-2 with score-entropy diffusion
- 2024: LLaDA proved scaling viability with 8B-param model competitive with LLaMA 3-8B
- 2025: Gemini Diffusion puts diffusion on the commercial roadmap
Like earlier discrete DDPM variants, Gemini Diffusion pairs a forward corruption process (text to noise) with a learned reverse denoising process (noise to text). The breakthrough is scale (multi-billion parameters) and an aggressive block-parallel decoder that refines 128 or more tokens per step.
Consider decoding one 11-token sentence both ways. Autoregressive decoding commits one token per step, so it needs 11 sequential steps. Diffusion refines the whole block and commits the same output in 4 passes, about 2.8x fewer round-trips through the model.
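The 11-token comparison can be reproduced with a few lines of arithmetic. This is a minimal sketch, not Gemini Diffusion's actual sampler: `diffusion_schedule` is a hypothetical helper that splits the tokens evenly across passes, whereas a real sampler would pick which tokens to commit based on model confidence.

```python
import math


def autoregressive_steps(num_tokens: int) -> int:
    """An autoregressive decoder commits exactly one token per sequential step."""
    return num_tokens


def diffusion_schedule(num_tokens: int, num_passes: int) -> list[int]:
    """Return how many tokens are committed in each refinement pass.

    Assumption for illustration: an even split across passes.
    """
    remaining = num_tokens
    schedule = []
    for passes_left in range(num_passes, 0, -1):
        commit = math.ceil(remaining / passes_left)
        schedule.append(commit)
        remaining -= commit
    return schedule


ar_steps = autoregressive_steps(11)
schedule = diffusion_schedule(11, 4)
print(ar_steps, schedule, round(ar_steps / len(schedule), 2))
# prints: 11 [3, 3, 3, 2] 2.75
```

The ratio counts round-trips only. Each diffusion pass processes the whole block, so a pass is not necessarily as cheap as a single cached autoregressive step.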
What makes Gemini Diffusion stand out
| Dimension | Gemini Diffusion | Typical AR LLM |
|---|---|---|
| Decoder style | Block-parallel denoising (128+ tokens/step) | Left-to-right |
| Sampling steps | 32, distilled to 8, then to 2 | 1 forward pass per token |
| Reported speed | 1-2k tokens/s on TPU v5p | 0.3-0.6k tokens/s on the same hardware |
| Benchmarks | Roughly matches Gemini 2.0 Flash-Lite on HumanEval and BigCodeBench | State-of-practice |
| Control | Native classifier-free and prompt-guided diffusion | Needs RLHF or adapters |
How Gemini Diffusion works
At its core, Gemini Diffusion is built on the Transformer architecture, but with a key difference: it does not use the causal mask that restricts AR models to only looking at earlier tokens.
Transformer foundation
Dropping the causal mask lets Gemini Diffusion attend to the entire sequence at once, including later positions, which helps it track overall context during refinement. The backbone is still a Transformer; only the constraint that forces left-to-right generation is removed, which gives the model bidirectional attention.
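To make the difference concrete, here is a small sketch of the two attention patterns as boolean matrices. The function names are illustrative only; they are not taken from any Gemini code, and real implementations build these masks as tensors inside the attention layer.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Causal (autoregressive) attention only allows j <= i.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[bool]]:
    """Without a causal mask, every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask: list[list[bool]], i: int) -> list[int]:
    """List the positions that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


print(visible_positions(causal_mask(4), 1))         # prints: [0, 1]
print(visible_positions(bidirectional_mask(4), 1))  # prints: [0, 1, 2, 3]
```

Under the causal mask, position 1 sees only positions 0 and 1. Under the bidirectional pattern it also sees positions 2 and 3, which is what lets a denoising step use context on both sides of a masked token.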
Iterative refinement
During generation, the model starts with a sequence where most or all tokens are "masked" (hidden). In each step, it processes the entire masked sequence and fills in some of the missing tokens. This process repeats 5-10 times, gradually reducing masked tokens until a coherent output is produced.
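The loop described above can be sketched as follows. This is an illustration under strong assumptions, not the published algorithm: `toy_denoiser` is a hypothetical stand-in for the model that peeks at the target sentence and assigns made-up confidence scores, and the rule of committing the most confident predictions first is one common choice among several.

```python
import math

MASK = "<mask>"


def toy_denoiser(seq, target):
    """Hypothetical stand-in for the model.

    Returns {position: (token, confidence)} for every masked position.
    It reads `target` directly, which a real model obviously cannot do.
    """
    preds = {}
    for i, tok in enumerate(seq):
        if tok == MASK:
            confidence = ((i * 7) % 10 + 1) / 10  # fake, deterministic score
            preds[i] = (target[i], confidence)
    return preds


def iterative_unmask(target, num_steps):
    """Start fully masked and commit a share of the tokens in each step."""
    seq = [MASK] * len(target)
    history = [list(seq)]
    for steps_left in range(num_steps, 0, -1):
        preds = toy_denoiser(seq, target)
        if not preds:
            break
        # Commit enough tokens that the final step leaves nothing masked.
        k = math.ceil(len(preds) / steps_left)
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = preds[i][0]
        history.append(list(seq))
    return seq, history


target = "the quick brown fox jumps over the lazy dog".split()
final, history = iterative_unmask(target, num_steps=5)
for state in history:
    print(" ".join(state))
```

Each printed line shows the sequence after one pass, with fewer masked positions every time. A real sampler may also revise tokens it has already committed, which this sketch does not model.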
Training uses a noising/denoising scheme in which random subsets of tokens are masked at ratios from 0% to 100%, so the model learns to handle everything from small corruptions to complete generation from scratch.
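A minimal sketch of that corruption step, assuming uniform random masking with a mask ratio drawn per example. The helper names `corrupt` and `training_example` are hypothetical, and the actual noise schedule used for Gemini Diffusion has not been published.

```python
import random

MASK = "<mask>"


def corrupt(tokens, mask_ratio, rng):
    """Mask round(mask_ratio * len(tokens)) positions chosen uniformly at random.

    Returns the noisy sequence and the sorted list of masked positions.
    """
    n_mask = round(mask_ratio * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    noisy = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return noisy, sorted(positions)


def training_example(tokens, rng):
    """Draw a mask ratio in [0, 1) and build one (noisy input, targets) pair."""
    ratio = rng.random()
    noisy, positions = corrupt(tokens, ratio, rng)
    targets = {i: tokens[i] for i in positions}  # the model is trained to predict these
    return noisy, targets, ratio


rng = random.Random(0)
tokens = "diffusion models refine every token position in parallel".split()
noisy, targets, ratio = training_example(tokens, rng)
print(f"mask ratio {ratio:.2f}:", " ".join(noisy))
```

A ratio near 0 gives the model a light editing task, while a ratio near 1 is close to generating the whole sequence from scratch.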
Parallel processing
Because the model works on the whole sequence in each step, it is much faster than the token-by-token approach of AR models. It can generate many tokens concurrently in each refinement pass.
Speed optimization
Early diffusion models often required hundreds of steps, making them slow. Gemini Diffusion uses step-distillation to reach high-quality results in just a few steps, dramatically reducing generation time and reaching over 1,000 tokens per second. [3]
Note 3: Step-distillation trains a student model to emulate an N-step sampler with N/k steps. Combined with time-agnostic masking and dynamic classifier-free guidance, this compresses the 32-step sampler down to 2 steps at the cost tier while preserving quality on structured tasks like code.
Under the hood: technical innovations
Block-parallel denoising pipeline
Instead of token-by-token generation, Gemini Diffusion processes entire blocks simultaneously. Each denoising step can refine 128 or more tokens in parallel, enabling massive throughput gains.
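A rough way to see where the throughput gain comes from is to count sequential forward passes. The sketch below assumes blocks are decoded one after another and that every block uses the same number of denoising steps; both are simplifications, and the block size of 128 and the step counts of 32, 8, and 2 are the figures quoted in this article.

```python
import math


def ar_forward_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    return num_tokens


def block_diffusion_forward_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Block-parallel denoising: each block needs `steps_per_block` passes."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


num_tokens = 1024
print("autoregressive:", ar_forward_passes(num_tokens))  # prints: autoregressive: 1024
for steps in (32, 8, 2):
    passes = block_diffusion_forward_passes(num_tokens, block_size=128, steps_per_block=steps)
    print(f"diffusion, {steps} steps per block:", passes)
# prints 256, 64, and 16 passes respectively
```

Pass counts are not wall-clock time: a denoising pass over a 128-token block does more work than one cached autoregressive step, so the real speedup is smaller than the ratio of pass counts.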
Step-distillation and speculative decoding
Diffusion's biggest pain point is sampling time: naive samplers need hundreds of denoising steps. Gemini Diffusion attacks this with three techniques: step-distillation (train a student to emulate an N-step sampler with N/k steps), time-agnostic masking (predict all time-steps jointly), and dynamic classifier-free guidance (controllability without extra passes).
The result is a sampler that goes from 32 to 8 to 2 steps at different precisions, plus speculative decoding. That is how it can beat AR models on latency despite the extra denoising loop.
Control knobs
Autoregressive models typically need extensive RLHF tuning for this kind of control. Diffusion naturally supports classifier-free guidance, which trades prompt adherence against creativity and enables style transfer, guided editing, and length and toxicity control without additional training.
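For readers unfamiliar with classifier-free guidance, the standard formulation mixes the model's prompt-conditioned and unconditioned predictions with a guidance scale. The sketch below applies that textbook formula to a toy three-token vocabulary. Whether Gemini Diffusion uses exactly this form is not confirmed, and the logit values are invented for illustration.

```python
import math


def guided_logits(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    scale = 0 ignores the prompt, scale = 1 is the plain conditional model,
    and scale > 1 pushes harder toward what the prompt favours.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits over a toy vocabulary of three tokens.
cond = [2.0, 0.5, 0.0]    # with the prompt
uncond = [1.0, 1.0, 0.0]  # without the prompt

for scale in (0.0, 1.0, 3.0):
    probs = softmax(guided_logits(cond, uncond, scale))
    print(scale, [round(p, 3) for p in probs])
```

Raising the scale concentrates probability on the token the prompt favours, which is the adherence-versus-creativity knob described above. In this standard form, each guided step needs both a conditional and an unconditional prediction.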
Benchmark deep-dive: where it excels and struggles
Coding excellence
| Benchmark | Gemini Diffusion | Gemini 2.0 Flash-Lite |
|---|---|---|
| HumanEval | 89.6% | 90.2% |
| MBPP | 76.0% | 75.8% |
| LiveCodeBench | 30.9% | 28.5% |
| BigCodeBench | 45.4% | 45.8% |
Knowledge gaps
| Benchmark | Gemini Diffusion | Gemini 2.0 Flash-Lite |
|---|---|---|
| Global MMLU (Lite) | 69.1% | 79.0% |
| GPQA Diamond | 40.4% | 56.5% |
| BIG-Bench Extra Hard | 15.0% | 21.0% |
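Key insight: on these preliminary numbers, diffusion is competitive on structured tasks (code, math) where global coherence matters, landing within a few points of Gemini 2.0 Flash-Lite on all four coding benchmarks. It trails by roughly 6 to 16 points on the knowledge and reasoning benchmarks, where autoregressive models have maturity advantages.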
Why builders should care
Faster burst-generation for agents
The 1-2k tokens/second throughput enables new agent architectures. Instead of waiting seconds for responses, agents can generate comprehensive analysis, code, or documentation in near real-time.
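A quick back-of-the-envelope calculation shows what that throughput means for a single agent step. The speeds are the ranges reported in the comparison table earlier in this article; the 2,000-token response size is an arbitrary assumption, and the calculation ignores network and prompt-processing time.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream `num_tokens` at a constant decode speed."""
    return num_tokens / tokens_per_second


response_tokens = 2000  # assumed size of one agent response

for label, speed in [
    ("AR, 300 tok/s", 300),
    ("AR, 600 tok/s", 600),
    ("Diffusion, 1,000 tok/s", 1000),
    ("Diffusion, 2,000 tok/s", 2000),
]:
    print(f"{label}: {generation_seconds(response_tokens, speed):.1f} s")
# prints 6.7 s, 3.3 s, 2.0 s, and 1.0 s respectively
```

For an agent that chains many generation calls, a few seconds saved per call adds up across the whole loop.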
Better global rewrites
Diffusion's parallel nature makes it ideal for style transfer (convert technical docs to marketing copy instantly), guided editing (apply specific constraints while preserving content), and code refactoring (maintain functionality while improving structure).
Fewer hallucinations in structured output
Early empirical signals suggest diffusion models produce more consistent structured outputs (JSON, code) because they can maintain global coherence throughout generation.