Cutting-Edge AI · June 2, 2025 · 9 min read · shipped

Gemini Diffusion Explained: Block-Parallel Denoising at 1-2k tokens/sec

Run this

claude "Reproduce the Gemini Diffusion benchmark comparison from the AIXplore entry. Implement a minimal discrete diffusion sampler, compare step counts vs. autoregressive token speed on a fixed prompt, and output a markdown table matching the post format."

claude code

For years, large language models like GPT-4 and Claude have dominated text generation. They work by autoregression: predicting one token at a time, each conditioned on all the tokens that came before it, much like writing a sentence word by word. This produces coherent text, but it is inherently sequential, and errors made early are hard to fix later.

What if text generators worked like Stable Diffusion for images? Instead of predicting words one-by-one, what if they started with pure noise and gradually refined entire paragraphs through iterative denoising?

Google DeepMind just made this a reality with Gemini Diffusion, and the results challenge everything we thought we knew about language model architecture.¹ Instead of predicting word by word, Gemini Diffusion uses a technique inspired by diffusion models for image and audio generation. It generates the entire text sequence in parallel, refining it block by block through multiple steps.

¹ The full technical report is not yet published. The benchmarks here are from Google DeepMind's May 2025 blog post and corroborated by Fortune's coverage. Treat these numbers as preliminary.

The shift: from autoregressive to diffusion generation

Here is the core difference:

Autoregressive (AR) models generate text sequentially, one token at a time. Like writing a sentence word by word. (Examples: GPT-4, Claude.)

Diffusion models generate the entire sequence in parallel, starting from a "noisy" state (often fully masked or random text) and iteratively refining it over multiple steps. Like starting with a mixed-up paragraph and gradually revealing the correct words.

This parallel approach allows diffusion models to correct errors anywhere in the sequence during generation, making them potentially better for tasks like editing and revision.

Autoregressive LLMs predict the next token given all previous ones. They are fast to train and easy to cache, but serial at inference. Discrete diffusion LMs start from pure noise (or a fully masked sequence) and iteratively denoise the whole sequence. The reverse process updates all tokens in parallel and is order-agnostic, so each step can correct errors anywhere in the sequence, and long outputs can be sampled in a handful of refinements.

Early proofs-of-concept like D3PM and DiffuSeq matched small GPT-2 quality but were 10-100x slower. Gemini Diffusion changes that equation entirely.²

² D3PM (Austin et al., 2021) formulated the multinomial noise schedule. DiffuSeq (Gong et al., 2022) showed diversity gains in translation. SEDD (Lou et al., 2023) closed the quality gap using score-entropy diffusion. LLaDA (2024) proved 8B-scale viability competitive with LLaMA 3-8B.

From images to words: the evolution timeline

The research arc spans four years:

2021: D3PM formulated multinomial noise schedule for discrete tokens
2022: DiffuSeq showed diversity gains in translation and summarization
2023: SEDD closed quality gap to GPT-2 with score-entropy diffusion
2024: LLaDA proved scaling viability with 8B-param model competitive with LLaMA 3-8B
2025: Gemini Diffusion puts diffusion on the commercial roadmap

Like earlier discrete DDPM variants, Gemini Diffusion pairs a forward corruption process (text to noise, here by masking tokens) with a learned reverse denoising process (noise to text). The breakthrough is scale (multi-billion parameters) and an aggressive block-parallel decoder that refines 128 or more tokens per step.

Figure: Autoregressive vs. diffusion decoding. AR models predict left-to-right one token at a time; diffusion models denoise an entire masked block in parallel, then repeat.

Interactive demo (not reproduced here): decoding the 11-token sentence "A diffusion model refines every token in the block at once" both ways. Autoregressive decoding commits one token per step, left to right, and needs 11 sequential steps. Block-parallel diffusion refines the whole block on every denoising pass and finishes in 4 passes.

Same 11-token output. Diffusion commits it in 4 passes, not 11 sequential steps. About 2.8x fewer round-trips through the model.
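The arithmetic behind that comparison is easy to check. The sketch below is only an illustration: the helper `tokens_per_pass` and its even-split schedule are assumptions made for the example, not something the post says Gemini Diffusion does.

```python
import math

def tokens_per_pass(n_tokens: int, n_passes: int) -> list[int]:
    """Split n_tokens across n_passes as evenly as possible.

    Hypothetical schedule: each pass commits ceil(remaining / passes_left) tokens.
    """
    remaining = n_tokens
    plan = []
    for passes_left in range(n_passes, 0, -1):
        commit = math.ceil(remaining / passes_left)
        plan.append(commit)
        remaining -= commit
    return plan

sentence = "A diffusion model refines every token in the block at once".split()

ar_steps = len(sentence)                   # one sequential model call per token
plan = tokens_per_pass(len(sentence), 4)   # four denoising passes, as in the demo
speedup = ar_steps / len(plan)

print(ar_steps, plan, round(speedup, 2))   # 11 [3, 3, 3, 2] 2.75
```

The ratio is 11 / 4 = 2.75, which the post rounds to 2.8x. It counts round-trips only and says nothing about how expensive each pass is.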

What makes Gemini Diffusion stand out

Dimension	Gemini Diffusion	Typical AR LLM
Decoder style	Block-parallel denoising (128+ tokens/step)	Left-to-right
Sampling steps	32, then 8, then 2 passes (distilled)	1 pass per token
Reported speed	1-2k tokens/s on TPU v5p	0.3-0.6k on same hardware
Benchmarks	Near parity with Gemini 2.0 Flash-Lite on HumanEval and BigCodeBench	State-of-practice
Control	Native classifier-free and prompt-guided diffusion	Needs RLHF or adapters

How Gemini Diffusion works

At its core, Gemini Diffusion is built on the Transformer architecture, but with a key difference: it does not use the causal mask that restricts AR models to only looking at earlier tokens.

Transformer foundation

Gemini Diffusion keeps a Transformer backbone but drops the causal masking constraint that forces left-to-right generation. With bidirectional attention, every position can look at the entire sequence (including later tokens) at once, which gives the model global context during refinement.
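To make the difference concrete, here is a small sketch of the two attention patterns. The function name `attention_mask` and the 0/1 list-of-lists representation are choices made for this illustration; real implementations use tensors and additive masks.

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[int]]:
    """Return mask[i][j] == 1 where position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

def show(mask: list[list[int]]) -> str:
    """Render a mask as rows of 0/1 characters."""
    return "\n".join("".join(str(cell) for cell in row) for row in mask)

print("causal (autoregressive):")
print(show(attention_mask(4, causal=True)))
print("bidirectional (diffusion-style):")
print(show(attention_mask(4, causal=False)))
```

The causal mask is lower-triangular, so position 0 never sees positions 1 to 3. The bidirectional mask is all ones, so a masked slot can be filled in using context on both sides.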

During generation, the model starts with a sequence where most or all tokens are "masked" (hidden). In each step, it processes the entire masked sequence and fills in some of the missing tokens. This process repeats 5-10 times, gradually reducing masked tokens until a coherent output is produced.
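A toy version of that loop looks like the sketch below. Everything in it is an assumption for illustration: `toy_denoiser` cheats by reading the target sentence and attaching random confidences, and committing the most confident proposals first is a common heuristic in masked-token samplers, not a confirmed detail of Gemini Diffusion.

```python
import random

MASK = "<mask>"

def toy_denoiser(tokens, target, rng):
    """Stand-in for the model: propose (token, confidence) for every masked slot.

    It cheats by reading `target`; a real model would predict from context.
    """
    return {i: (target[i], rng.random()) for i, tok in enumerate(tokens) if tok == MASK}

def diffusion_decode(target, n_passes, seed=0):
    """Start fully masked and commit a share of the masked slots on each pass."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = []
    for passes_left in range(n_passes, 0, -1):
        proposals = toy_denoiser(tokens, target, rng)
        if not proposals:
            break
        # Commit the most confident proposals; the rest stay masked for later passes.
        keep = -(-len(proposals) // passes_left)  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:keep]:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history

target = "the model fills in masked tokens over several passes".split()
final, history = diffusion_decode(target, n_passes=5)
for step, snapshot in enumerate(history, start=1):
    print(step, " ".join(snapshot))
```

Each printed line has fewer `<mask>` entries than the one before, and tokens appear in no particular order, which is the order-agnostic behaviour the paragraph describes.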

During training, a random subset of tokens is masked at a ratio anywhere from 0% to 100%, and the model learns to recover the hidden tokens. Seeing the full range of ratios teaches it to handle everything from small corruptions to complete generation from scratch.
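The corruption side of training can be sketched in a few lines. This is a minimal illustration under assumptions: the names `corrupt` and `training_example` are made up here, and drawing the mask ratio uniformly is one plausible choice that the post does not confirm.

```python
import random

MASK = "<mask>"

def corrupt(tokens, mask_ratio, rng):
    """Forward (noising) step: hide round(mask_ratio * len(tokens)) random positions."""
    n_mask = round(mask_ratio * len(tokens))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    noisy = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    return noisy, sorted(hidden)

def training_example(tokens, rng):
    """Build one (noisy input, targets) pair with a randomly drawn mask ratio."""
    ratio = rng.random()  # assumption: uniform over [0, 1)
    noisy, hidden = corrupt(tokens, ratio, rng)
    targets = {i: tokens[i] for i in hidden}  # the model is scored on hidden slots only
    return noisy, targets

rng = random.Random(0)
sentence = "diffusion models learn to undo random masking".split()
for _ in range(3):
    noisy, targets = training_example(sentence, rng)
    print(" ".join(noisy), "| targets:", targets)
```

A ratio near 0 gives an editing-style task (fix a few tokens), while a ratio near 1 is close to generating the whole sequence from scratch.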

Parallel processing

Because the model works on the whole sequence in each step, it needs far fewer sequential passes than the token-by-token approach of AR models: many tokens are generated concurrently in each refinement pass.

Speed optimization

Early diffusion models often required hundreds of steps, making them slow. Gemini Diffusion uses step-distillation to efficiently achieve high-quality results in just a few steps, dramatically reducing generation time and reaching over 1,000 tokens per second.³

³ Step-distillation trains a student model to emulate an N-step sampler with N/k steps. Combined with time-agnostic masking and dynamic classifier-free guidance, this compresses the 32-step sampler down to 2 steps at the cost tier while preserving quality on structured tasks like code.

Under the hood: technical innovations

Block-parallel denoising pipeline

Instead of token-by-token generation, Gemini Diffusion processes entire blocks simultaneously. Each denoising step can refine 128 or more tokens in parallel, enabling massive throughput gains.
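A rough pass count shows where the throughput gain would come from. The sketch assumes blocks are decoded one after another with a fixed number of denoising steps per block; the post does not spell out the scheduling, so treat `block_parallel_passes` and the 1,024-token output length as illustrative.

```python
import math

def ar_passes(n_tokens: int) -> int:
    """Autoregressive decoding: one sequential model call per token."""
    return n_tokens

def block_parallel_passes(n_tokens: int, block_size: int = 128, steps_per_block: int = 8) -> int:
    """Assumed schedule: ceil(n_tokens / block_size) blocks, each denoised in steps_per_block passes."""
    n_blocks = math.ceil(n_tokens / block_size)
    return n_blocks * steps_per_block

n = 1024  # hypothetical output length
for steps in (32, 8, 2):
    dp = block_parallel_passes(n, block_size=128, steps_per_block=steps)
    print(f"{steps:>2} steps/block: {dp:>3} passes vs {ar_passes(n)} AR passes "
          f"({ar_passes(n) / dp:.0f}x fewer)")
```

Under these assumptions the 32-step sampler needs 256 passes for 1,024 tokens, and the 2-step sampler needs 16. Pass counts are not wall-clock time, since a pass over a whole block need not cost the same as one AR step.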

Step-distillation and speculative decoding

Diffusion's biggest pain-point is sampling time (hundreds of denoise steps). Gemini attacks this with three techniques: step-distillation (train a student to emulate N-step sampler with N/k steps), time-agnostic masking (predict all time-steps jointly), and dynamic classifier-free guidance (controllability without extra passes).

The result: the sampler is distilled from 32 steps to 8 and then to 2, at different precisions, and combined with speculative decoding. That is how it beats AR models on latency despite the extra denoising loop.
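The 32, 8, 2 sequence is what repeated N to N/k distillation produces with k = 4. The post does not state k, so the value below is just the one that fits the reported numbers, and `distillation_chain` is a hypothetical helper.

```python
def distillation_chain(start_steps: int, k: int, rounds: int) -> list[int]:
    """Step counts after each round of distilling an N-step sampler into N/k steps."""
    steps = [start_steps]
    for _ in range(rounds):
        if steps[-1] % k != 0:
            raise ValueError(f"{steps[-1]} steps cannot be divided evenly by k={k}")
        steps.append(steps[-1] // k)
    return steps

print(distillation_chain(32, k=4, rounds=2))  # [32, 8, 2]
```

Each round trains a student to match what the previous sampler reaches in k of its steps, so two rounds cut the number of passes by a factor of 16.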

Control knobs

Unlike autoregressive models, which typically rely on RLHF tuning or adapters for control, diffusion naturally supports classifier-free guidance. That provides a knob for prompt adherence vs. creativity, for style transfer and guided editing, and for length and toxicity control, without additional training.
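For readers who have not met classifier-free guidance, the standard formulation blends conditional and unconditional predictions with a guidance scale. The sketch applies it to a toy three-token vocabulary; the logit values are invented, and how Gemini Diffusion obtains the two sets of logits (or avoids a second pass) is not described in the post.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_logits(cond, uncond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Made-up logits for one masked slot over a three-token vocabulary.
cond = [2.0, 0.5, 0.0]    # prediction given the prompt
uncond = [1.0, 1.0, 0.5]  # prediction with the prompt dropped

for scale in (0.0, 1.0, 2.0):
    probs = softmax(guided_logits(cond, uncond, scale))
    print(scale, [round(p, 3) for p in probs])
```

A scale of 0 ignores the prompt, 1 reproduces the conditional prediction, and larger values push probability further toward what the prompt favours, trading diversity for adherence.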

Benchmark deep-dive: where it excels and struggles

Coding excellence

Benchmark	Gemini Diffusion	Gemini 2.0 Flash-Lite
HumanEval	89.6%	90.2%
MBPP	76.0%	75.8%
LiveCodeBench	30.9%	28.5%
BigCodeBench	45.4%	45.8%

Knowledge gaps

Benchmark	Gemini Diffusion	Gemini 2.0 Flash-Lite
Global MMLU	69.1%	79.0%
GPQA Diamond	40.4%	56.5%
BIG-Bench Extra Hard	15.0%	21.0%

Key insight: on these numbers, diffusion holds its own on structured tasks like code, landing within about 2.5 points of Flash-Lite on all four coding benchmarks, but trails by 6 to 16 points on knowledge and reasoning benchmarks, where autoregressive models have maturity advantages.

Why builders should care

Faster burst-generation for agents

The 1-2k tokens/second throughput enables new agent architectures. Instead of waiting seconds for responses, agents can generate comprehensive analysis, code, or documentation in near real-time.
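To put the throughput figures in wall-clock terms, the sketch below converts the reported ranges into generation time for a single response. The 2,000-token response length is a hypothetical example, and the rates are the preliminary numbers quoted in the comparison table.

```python
def seconds_to_generate(n_tokens: int, tokens_per_second: float) -> float:
    """Time to emit n_tokens at a steady throughput, ignoring prompt processing."""
    return n_tokens / tokens_per_second

# (low, high) tokens/second as reported in the post; preliminary figures.
reported = {
    "Gemini Diffusion": (1000, 2000),
    "Typical AR LLM": (300, 600),
}

n_tokens = 2000  # hypothetical response length
for name, (slow, fast) in reported.items():
    best = seconds_to_generate(n_tokens, fast)
    worst = seconds_to_generate(n_tokens, slow)
    print(f"{name}: {best:.1f}-{worst:.1f} s for {n_tokens} tokens")
```

At these rates a 2,000-token answer takes 1 to 2 seconds with diffusion and roughly 3 to 7 seconds with the AR baseline, which matters most for agents that chain many generations together.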

Better global rewrites

Diffusion's parallel nature makes it ideal for style transfer (convert technical docs to marketing copy instantly), guided editing (apply specific constraints while preserving content), and code refactoring (maintain functionality while improving structure).

Fewer hallucinations in structured output

Early empirical signals suggest diffusion models produce more consistent structured outputs (JSON, code) because they can maintain global coherence throughout generation.
