Cutting-Edge AI · June 24, 2026 · 10 min read · shipped

Two Camps of Reasoning Post-Training: What You Can Check vs What You Can't

Run this

claude "Take a task you would reflexively grade with an LLM judge (summary quality, a generated plan, a factual answer). Before you reach for the judge, try to make it verifiable instead: corpus-ground the reward so a passage match scores it, or gate a process rubric on a checkable sub-answer so only correct-answer rollouts get rubric credit. Implement the minimal version of both the judge-graded reward and the verifiable reward for one small task, run them side by side on a handful of examples, and report where the verifiable version stops being possible. That boundary is the line between the two camps."

claude code

Call something a "reasoning model" and you've named the output, not the recipe. Underneath that one label sit two training programs that fail in opposite directions, demand different tooling, and answer to different open problems. The thing that splits them is not the architecture and not the base model. It's the reward.

One program rewards a verifiable answer. There is a checker: the math answer matches, the unit test passes, the proof verifies, the extracted field is in the corpus. The other program has no checker, so it manufactures a reward for everything you still want the model to get good at, where "good" is a judgment call: a clear explanation, a sound plan, a safe diagnosis, a well-cited summary. Pick the wrong one for your task and you burn the run, either chasing a checker that doesn't exist for your domain or training a judge that your policy learns to game.

The fork is the reward, not the architecture. Everything downstream, the failure modes, the tooling, the ceiling, follows from whether a checker exists for your task.

This is the map a practitioner has to read in 2026. And the most interesting thing on it is that the two camps, which look like a wall, are turning out to be a dial.

Camp one: the reward is a checker

This is RLVR, and DeepSeek-R1 is the founding text everyone still reacts to. R1-Zero made the strong version of the claim: take a base model, skip the supervised warm-start entirely, and run pure RL against a verifiable reward. Reasoning emerges. The reported jump on AIME 2024 was pass@1 from 15.6% to 71.0%, and to 86.7% with majority voting.¹

¹ DeepSeek-R1 (arXiv 2501.12948). R1-Zero is the pure-RL-from-base result; 15.6 to 71.0 is pass@1, 86.7 is cons@64 majority voting on AIME 2024.
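To make the two AIME numbers concrete: pass@1 scores each sample on its own, while cons@64 scores only the most common answer across 64 samples. A minimal sketch with made-up samples (the answer strings and the five-sample size are illustrative assumptions, not R1 outputs):

```python
from collections import Counter

def pass_at_1(samples, gold):
    """Mean single-sample accuracy: the fraction of samples equal to gold."""
    return sum(s == gold for s in samples) / len(samples)

def majority_vote_correct(samples, gold):
    """cons@N: True if the most frequent answer among the N samples is gold."""
    winner, _count = Counter(samples).most_common(1)[0]
    return winner == gold

# Hypothetical final answers from five rollouts on one problem.
samples = ["204", "204", "17", "9", "33"]
print(pass_at_1(samples, "204"))              # 0.4
print(majority_vote_correct(samples, "204"))  # True
```

Voting can be right even when most individual samples are wrong, as long as the wrong answers disagree with each other. That is why the majority-vote figure sits above pass@1.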

The mechanism is the whole appeal. You don't write down a chain of thought for the model to imitate. You reward the outcome and let the policy find its own path there. Because the reward is deterministic and ungameable (a math answer is right or it isn't), there is no judge to corrupt, no reward model to drift. You can, in principle, exceed any teacher, because you're not distilling a teacher. You're searching, scored by a checker.
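What "scored by a checker" looks like in the smallest case is roughly the following. This is a sketch, assuming the policy is prompted to end with an `Answer:` line; that marker, and the names `extract_final_answer` and `verifiable_reward`, are hypothetical conventions chosen for illustration.

```python
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted answer equals the gold answer exactly, else 0.0.

    Nothing about the chain of thought is scored; only the outcome is.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == gold.strip() else 0.0

print(verifiable_reward("6 * 7 is 42.\nAnswer: 42", "42"))  # 1.0
print(verifiable_reward("I think it is 41.\nAnswer: 41", "42"))  # 0.0
```

The reward never inspects the reasoning text, which is the point: the policy is free to find its own path to the checked outcome.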

What you trade for that clean signal is a set of failure modes you have to budget for from day one.

Entropy collapse. RLVR drives output entropy down fast. The model gets confident, exploration dies early, and you stop discovering new paths long before you've found the good ones. The entropy literature (clip-low/clip-high schemes, entropy-aware objectives) exists because naive RLVR converges prematurely and overconfidently.
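One way to budget for this is to log policy entropy from the first step. The sketch below computes mean next-token entropy and raises a simple alarm. The distributions are made up, and `floor_ratio` is a hypothetical threshold, not a value from the entropy literature.

```python
import math

def token_entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(distributions):
    """Average entropy over a list of next-token distributions."""
    return sum(token_entropy(p) for p in distributions) / len(distributions)

def has_collapsed(entropy_history, floor_ratio=0.2):
    """Flag a run whose latest entropy fell below floor_ratio of its start.

    floor_ratio is an illustrative alarm threshold (an assumption).
    """
    return entropy_history[-1] < floor_ratio * entropy_history[0]

# Hypothetical distributions early and late in training.
early = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
late = [[0.97, 0.01, 0.01, 0.01], [1.0, 0.0, 0.0, 0.0]]
history = [mean_policy_entropy(early), mean_policy_entropy(late)]
print(has_collapsed(history))  # True
```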

Calibration degradation. This one is quieter and worse. RLVR makes models more confident in wrong answers. The DCPO work names a fundamental gradient conflict: the objective that maximizes accuracy pulls against the one that minimizes calibration error, so optimizing for the checker actively un-calibrates the model. If you care about a model that knows when it doesn't know, RLVR works against you unless you decouple the two objectives.
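Calibration error is measurable, so it can be tracked next to accuracy during a run. Below is a minimal sketch of expected calibration error (ECE) with equal-width bins. It assumes you can obtain a confidence in [0, 1] for each answer; how you elicit that confidence is left open, and the numbers are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between confidence and accuracy across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical model that says 90% on everything but is right half the time.
print(round(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0]), 3))  # 0.4
```

A checker reward only sees the `correct` column. Nothing in it penalizes the 0.9 when the answer is wrong, which is one way to read the gradient conflict described above.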

A real learnability ceiling. The Unlearnability work gives the first systematic characterization of a hard subset of examples that stay unlearnable even when correct rollouts are present in training. These examples have low gradient similarity to the rest of the data and rely on reasoning that doesn't generalize, which more compute won't fix. The ceiling is not always a tuning problem.

The Qwen confound, which deserves its own section. Before you trust that your reward signal did the work, you have to rule out that you measured a base-model prior.

The fight inside camp one: does RLVR teach, or just sharpen?

The richest argument in the whole space lives entirely inside the verifiable camp. Follow it as a back-and-forth, because each move reframes the last.

R1 sets the premise: pure RL from a base model makes reasoning emerge.

Yue and colleagues land the skeptical punch. Their paper asks whether RL incentivizes reasoning capacity beyond the base model, and the answer is uncomfortable. RLVR-trained models beat the base at pass@1, but sample wide enough and the base model catches up and overtakes them at large k. Read literally, that says RL is re-weighting the distribution toward paths the base already knew, narrowing exploration rather than extending the reasoning boundary. The framing stuck; everyone now argues in its terms.²

² Yue et al., "Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?" (arXiv 2504.13837).

The metric rebuttal goes after the measurement. The CoT-Pass@K work points out that on short-answer math, a base model lands the right number through a wrong chain often enough to flatter pass@k. The answer is two characters; guessing it is cheap. When you require the reasoning to be correct, not just the final token, the post-RLVR model holds a persistent gap over base across large K. Yue's crossover, on this account, is a CoT-blindness artifact, not a fact about capability.³

³ "RLVR Implicitly Incentivizes Correct Reasoning in Base LLMs," which introduces the CoT-Pass@K metric (arXiv 2506.14245).
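The difference between the two rulers is easy to see in code. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), and changes only what counts as "correct": the answer alone, or the answer plus the chain. The sample labels are invented, and how chain correctness gets judged is left as an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c count as correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical base-model samples for one problem: (answer_ok, chain_ok).
samples = [(True, True), (True, False), (True, False),
           (False, False), (False, False), (False, False),
           (False, False), (False, False)]

n = len(samples)
answer_correct = sum(a for a, _ in samples)          # right final answer
fully_correct = sum(a and ch for a, ch in samples)   # right answer AND right chain

print(round(pass_at_k(n, answer_correct, 4), 3))  # 0.929  (pass@4)
print(round(pass_at_k(n, fully_correct, 4), 3))   # 0.5    (CoT-pass@4)
```

Two of the three "correct" samples reached the number through a bad chain. Plain pass@k credits them; the chain-aware count does not.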

"RL doesn't teach, it just narrows" and "RL teaches, you were mis-measuring" are not the same kind of claim. One is about the result; the other is about the ruler.

The scale rebuttal blames the training budget. ProRL argues the ceiling Yue found was a too-short-training ceiling. Prolong the RL (thousands of steps, KL control, periodic reference-policy resets) and you reach strategies the base model can't produce even under heavy sampling, with the largest gains on logic puzzles rather than math. Stability and duration, not a hard wall, were the missing ingredients.⁴

⁴ ProRL: Prolonged Reinforcement Learning (arXiv 2505.24864).

Then the disturbing wrinkle. Spurious Rewards is the result that should make everyone re-run their controls. On Qwen2.5-Math-7B, RLVR with random rewards improved MATH-500 by 21.4 absolute points, nearly matching the 29.1 points from genuine ground-truth rewards. The reward signal made surprisingly little difference. But the catch is the whole result: this works on Qwen and fails on Llama3 and OLMo2. The mechanism is that RL surfaces a latent "code reasoning" behavior already sitting in Qwen's prior; the gradient is just nudging the model toward a strategy it already had.⁵

⁵ Spurious Rewards: Rethinking Training Signals in RLVR (arXiv 2506.10947). The 21.4 vs 29.1 figures are absolute MATH-500 gains on Qwen2.5-Math-7B; the abstract is explicit that the effect fails on Llama3 and OLMo2.

The synthesis lands where careful people now sit. Pass@k is a diagnostic, not an objective. RLVR does both things, but unevenly: most of the measurable gain is search compression (folding pass@k capability into pass@1 efficiency), with a smaller genuine expansion that shows up mainly with cross-domain data and many gradient steps. The sharpen-vs-teach binary was the wrong question. It's a mix, and the mix is dominated by sharpening.

Two mechanistic findings make the sharpening read hard to dismiss. The RELEX work shows RLVR weight deltas are extremely low-rank: a rank-1 approximation captures most of the gain, and you can linearly extrapolate future checkpoints from a short window with no learned model.⁶ A rank-1 nudge is easy to square with "re-weight a direction the base already had" and hard to square with "teach genuinely new capability." And the practical kicker: the RL infrastructure itself was silently corrupting results. The vLLM V0-to-V1 transition returned logprobs before temperature and penalty post-processing, so the trainer's record of what the policy did diverged from what it actually did, quietly tanking RL stability until importance-sampling corrections (TIS, sequence-level MIS) fixed the train-inference mismatch. A meaningful share of "our RL run diverged" was a logprob-semantics bug, not a reward problem.⁷

⁶ RELEX, "You Only Need Minimal RLVR Training" (arXiv 2605.21468): the rank-1, linearly-extrapolatable weight-delta result.

⁷ "vLLM V0 to V1: Correctness Before Corrections in RL" (ServiceNow, HF blog); the importance-sampling fix is also in arXiv 2605.14220.
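The shape of the importance-sampling correction can be sketched in a few lines. This assumes the truncated, token-level form: each token's probability ratio between the trainer and the sampler is capped before it weights the loss. The cap of 2.0, the function name, and the logprob values are illustrative assumptions, and the sequence-level variant is not shown.

```python
import math

def truncated_is_weights(trainer_logprobs, sampler_logprobs, cap=2.0):
    """Per-token weights min(pi_trainer / pi_sampler, cap).

    trainer_logprobs: what the training framework computes for each sampled token.
    sampler_logprobs: what the inference engine reported when it sampled them.
    cap is a hypothetical truncation constant.
    """
    return [min(math.exp(t - s), cap)
            for t, s in zip(trainer_logprobs, sampler_logprobs)]

# Hypothetical rollout: the two systems agree on the first token only.
trainer = [-1.20, -0.50, -0.10]
sampler = [-1.20, -0.90, -2.50]
weights = truncated_is_weights(trainer, sampler)
print([round(w, 3) for w in weights])  # [1.0, 1.492, 2.0]
```

When the two sets of logprobs agree, every weight is 1.0 and the correction is a no-op, so a run of weights far from 1.0 is itself a cheap diagnostic for the mismatch.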

Run two controls before you believe your reward signal did the work: a random-reward run, and a non-Qwen base. If random rewards on Qwen recover most of your gain, you measured a prior.
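The random-reward control is cheap to wire in, because it only swaps the reward function. A sketch, assuming a Bernoulli reward with `p=0.5` (an assumption about how "random" is configured) and using the Spurious Rewards figures quoted above as the worked arithmetic:

```python
import random

def ground_truth_reward(answer, gold):
    """The real checker: 1.0 on exact match."""
    return 1.0 if answer == gold else 0.0

def random_reward(answer, gold, rng, p=0.5):
    """Control reward: ignores the answer entirely and flips a coin."""
    return 1.0 if rng.random() < p else 0.0

def fraction_of_gain_recovered(control_gain, true_gain):
    """Share of the real-reward improvement that the control also produced."""
    return control_gain / true_gain

# Worked arithmetic with the MATH-500 gains quoted above (absolute points).
print(round(fraction_of_gain_recovered(21.4, 29.1), 2))  # 0.74
```

If your own control recovers anything near that share, the base-model prior is doing most of the work. The second control, a non-Qwen base, is a config change rather than code.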

If you take one operational thing from camp one, take that.

Camp two: you build the reward

Now the other half of the map, and the half where most real work lives. You cannot write a checker for "good explanation," "sound plan," "safe medical advice," "well-cited summary." There is no string to match, no test to run. So you manufacture a reward.

The dominant approaches, roughly in order of how much they resist gaming:

Rubrics as rewards. Decompose the desired behavior into binary, auditable criteria and score the fraction satisfied. Did it cite a source? Was the contraindication flagged? Is the answer responsive to the question actually asked? The RaR line, plus OpenRubrics, Rubric-ARM, and the pairwise rubric systems, all sit here. The appeal is that each criterion is individually checkable even when the whole is a judgment call.
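A minimal sketch of "score the fraction satisfied" follows. The criteria are hypothetical, and simple keyword checks stand in for whatever auditable check (human, judge model, or code) each criterion would really use.

```python
from typing import Callable, Dict

def rubric_reward(response: str, rubric: Dict[str, Callable[[str], bool]]) -> float:
    """Fraction of binary rubric criteria the response satisfies."""
    passed = sum(1 for check in rubric.values() if check(response))
    return passed / len(rubric)

# Hypothetical criteria; each is a yes/no question about the response.
rubric = {
    "cites_source": lambda r: "source:" in r.lower(),
    "flags_contraindication": lambda r: "contraindicated" in r.lower(),
    "answers_question": lambda r: "drug x" in r.lower(),
}

response = "Drug X is contraindicated alongside drug Y."
print(rubric_reward(response, rubric))  # 0.6666666666666666
```

Because every criterion returns a boolean, a low score can be audited line by line, which a single scalar from a judge does not allow.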

LLM-as-judge, with a preference for reasoning judges. When you score with a model, the work on reasoning judges in non-verifiable post-training is the finding to internalize: reasoning judges resist reward hacking where non-reasoning judges get gamed easily. And prefer pairwise or generative reward models over pointwise Likert scoring. Pointwise scalarization (rate this 1 to 10) has a discriminability ceiling and is the single easiest target to hack, because the policy only has to move a scalar.
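A pairwise reward can be sketched as below. The `judge` callable is a stand-in for a model call, and its signature is an assumption. Asking in both orders is a common precaution against position bias that this sketch adds; the text above does not prescribe it.

```python
def pairwise_score(judge, prompt, a, b):
    """Score response a against b in [0, 1], averaged over both orderings.

    judge(prompt, first, second) must return "first" or "second".
    """
    wins = 0
    if judge(prompt, a, b) == "first":
        wins += 1
    if judge(prompt, b, a) == "second":
        wins += 1
    return wins / 2

def position_biased_judge(prompt, first, second):
    """Toy judge that always prefers whatever it is shown first."""
    return "first"

def toy_length_judge(prompt, first, second):
    """Toy judge that prefers the longer response (a stand-in for a real model)."""
    return "first" if len(first) >= len(second) else "second"

print(pairwise_score(position_biased_judge, "q", "short", "a longer reply"))  # 0.5
print(pairwise_score(toy_length_judge, "q", "a longer reply", "short"))       # 1.0
```

A judge with no real preference lands on 0.5, so the policy gets no gradient from it. A pointwise 1-to-10 score has no such built-in null.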

Process supervision. Reward the steps, not just the outcome, so the model can't reach a right answer through bad reasoning and bank the credit.
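To illustrate the difference from an outcome-only reward, the sketch below scales the outcome credit by the weakest step. Taking the minimum over step scores is one possible aggregation, chosen here as an assumption; the step scores are invented and would come from a step-level verifier or judge in practice.

```python
def outcome_reward(answer_ok):
    """Outcome-only: full credit for a right answer, however it was reached."""
    return 1.0 if answer_ok else 0.0

def process_reward(step_scores, answer_ok):
    """Step-aware: a right answer earns only as much as its weakest step.

    step_scores are per-step validity scores in [0, 1] (hypothetical inputs).
    """
    if not answer_ok or not step_scores:
        return 0.0
    return min(step_scores)

# Hypothetical rollout: correct final answer, but step 2 is unsound.
steps = [0.9, 0.2, 0.8]
print(outcome_reward(True))         # 1.0
print(process_reward(steps, True))  # 0.2
```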

Two things follow that camp-one practitioners don't have to think about. First, reward hacking is the dominant failure mode, not a tail risk. Your judge is a model, and your policy is being optimized to maximize that model's score, which is the textbook setup for adversarial drift. Second, rubric quality is now your bottleneck, which is why rubric generation has become its own RL problem (OpenRubrics, Rubric-ARM). You don't escape the hard part by writing a rubric. You move it.

The frontier extension: multi-turn agentic RL

Agentic RL reads like a third camp, but it isn't. It's camp one stretched: a verifiable-ish reward (did the task succeed), a long horizon, and tool calls in the loop. Most of the open work here is coming out of Chinese labs, with Alibaba's Tongyi DeepResearch as the headline open artifact (a 30B-class MoE agent reporting competitive numbers against closed deep-research systems on HLE, BrowseComp, and GAIA), and OpenWebRL running online multi-turn RL against live websites with Playwright rollouts.

What's new is the cost structure, not the reward philosophy. You budget for:

Turn-level credit assignment. A trajectory succeeds or fails at the end, but the gradient has to find which of twenty turns mattered. Turn-level reward design and discriminative token credit assignment exist to spread that signal.
Training collapse and exploration instability. Long horizons make the exploration problem worse; uncertainty-guided exploration control (T2PO) is one response.
Rollout infrastructure as a first-class cost. Generating multi-turn tool-using trajectories is expensive enough that "rollout-as-a-service" is now a named pattern, not an implementation detail.
A live tension you have to design around: reasoning and tool-use compete during agentic RL. Tuning one can degrade the other; recent work argues they need disentangling rather than joint optimization. The agent that thinks harder is not automatically the agent that calls tools better, and naive agentic RL can trade one for the other without telling you.
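The credit-assignment item in the list above can be made concrete with the simplest baseline: discounting the end-of-trajectory reward back across turns. This is a sketch of that baseline only, not of the turn-level or token-level methods the list refers to; `gamma` and the reward values are illustrative assumptions.

```python
def turn_returns(turn_rewards, terminal_reward, gamma=0.95):
    """Discounted return for each turn of one trajectory.

    turn_rewards: optional shaped reward per turn (often all zeros).
    terminal_reward: task success signal, credited at the final turn.
    """
    rewards = list(turn_rewards)
    rewards[-1] += terminal_reward
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Three-turn trajectory that succeeds at the end, with no per-turn reward.
print(turn_returns([0.0, 0.0, 0.0], 1.0, gamma=0.5))  # [0.25, 0.5, 1.0]
```

Discounting gives earlier turns less credit purely because they are earlier, whether or not they mattered. That blind spot is what turn-level reward design tries to fix.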

This is where Yue's own paper pointed as the way out of the sharpening ceiling: if single-turn RLVR mostly compresses what the base already knows, multi-turn interaction with an environment is where genuinely new capability has room to come from.

The bridge: the camps are converging

Here is the part that makes the two-camp map worth drawing and then partly erasing. The strongest single intellectual move in the current literature is the claim that verifiable-versus-not is a continuum, not a dichotomy.

RaR draws the line explicitly: it extends RLVR beyond verifiable domains by swapping the lone checker for rubric feedback. Run that the other way and RLVR becomes the limiting case, rubric-RL collapsed to a single criterion set to exact match. Add criteria and loosen them and you slide continuously into the judge regime. The wall between the camps is just the number of rubric criteria, dialed from one to many. There is no discontinuity to defend.⁸

⁸ Rubrics as Rewards (arXiv 2507.17746). The paper frames RaR as extending RLVR with rubric feedback; reading it the other way makes RLVR the single-criterion limiting case.
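Written out, under the assumption that the rubric reward is the unweighted fraction of binary criteria satisfied (the symbols $c_i$, $n$, and $y^*$ are introduced here for illustration and are not taken from the RaR paper), the limiting case looks like this:

```latex
R_{\text{rubric}}(y) = \frac{1}{n} \sum_{i=1}^{n} c_i(y), \qquad c_i(y) \in \{0, 1\}

n = 1,\quad c_1(y) = \mathbb{1}\left[ y = y^* \right]
\;\Longrightarrow\;
R_{\text{rubric}}(y) = \mathbb{1}\left[ y = y^* \right] = R_{\text{RLVR}}(y)
```

Raising $n$ and replacing exact-match indicators with judged criteria moves the same formula toward the judge regime, with no change in its form.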

[Interactive figure: the dial. A slider runs from 1 criterion (exact match) to many criteria (judgment). At the 1-criterion setting, labeled "Verifiable reward (RLVR)": the reward is a checker (exact match; the answer is right or it is not), the dominant failure is entropy collapse and miscalibration with no judge to game, the example tasks are "Did the math answer match?" and "Did the unit test pass?", and the reward-hacking surface reads 4%.]

That reframing has a practical edge, and it's the skill separating teams right now: before you reach for a judge, try to make the task verifiable. The bridge work is a set of recipes for doing exactly that.

Corpus-ground the reward. The CorVer line builds a corpus-match signal for factual QA, replacing a neural verifier with a deterministic check against a source corpus. A domain that "needed a judge" becomes verifiable, and you get RLVR's no-hacking property back.
Gate a process rubric on a checkable sub-answer. LongTraceRL applies a rubric over gold entities along the reasoning chain, but only to correct-answer responses, positive-only by construction. Rubrics (camp two) doing process supervision on top of a verifiable outcome (camp one), with the gate doing the anti-hacking work.
Executor-ground the reward. Where a result can be run, run it. "Correct is not enough" without an executor to confirm the output does what it claims.
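The first two recipes above can be sketched together. This is an illustration of the ideas only, not the CorVer or LongTraceRL implementations: the normalization, the substring match, the function names, and the toy corpus are all assumptions.

```python
def normalize(text):
    """Lowercase and collapse whitespace so matching is not format-sensitive."""
    return " ".join(text.lower().split())

def corpus_reward(answer, corpus):
    """1.0 if the normalized answer appears in any corpus passage, else 0.0."""
    needle = normalize(answer)
    if not needle:
        return 0.0  # an empty answer is a substring of everything; refuse it
    return 1.0 if any(needle in normalize(p) for p in corpus) else 0.0

def gated_rubric_bonus(answer_ok, chain, gold_entities):
    """Process bonus, paid only to correct-answer rollouts (positive-only)."""
    if not answer_ok or not gold_entities:
        return 0.0
    text = normalize(chain)
    hits = sum(1 for e in gold_entities if normalize(e) in text)
    return hits / len(gold_entities)

corpus = ["Marie Curie won the Nobel Prize in Physics in 1903."]
chain = "The 1903 physics prize went to Becquerel and the Curies."
ok = corpus_reward("marie  curie", corpus) == 1.0

print(corpus_reward("marie  curie", corpus))                    # 1.0
print(gated_rubric_bonus(ok, chain, ["Becquerel", "1903"]))     # 1.0
print(gated_rubric_bonus(False, chain, ["Becquerel", "1903"]))  # 0.0
```

The gate is the anti-hacking piece. A rollout that name-drops every gold entity but gets the answer wrong earns nothing, so stuffing the chain with entities is not a winning strategy on its own.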

The 2026 skill is not picking a camp. It's recognizing how much of a judge-problem you can convert into a checker-problem before you concede that you genuinely can't.

Practitioners still choose a tool today. Theorists are busy arguing the tools are one mechanism. A good mental model holds both at once.

When to reach for which

The decision rule, stripped down:

Does your task have a cheap, reliable, deterministic checker? Math, code with tests, formal proof, structured extraction, anything you can corpus-match. If yes, reach for RLVR. Budget for entropy collapse, calibration degradation, and a learnability ceiling on some examples. And run the two controls, a random-reward run and a non-Qwen base, before you credit your reward signal.

No checker? Don't reach for a judge yet. Ask whether you can manufacture a checker first: corpus-ground it, gate a rubric on a checkable sub-answer, executor-ground it. Reach for a judge only when you've established you genuinely can't make the task verifiable. When you do, prefer reasoning judges and pairwise or generative reward models over pointwise Likert, and treat reward hacking as the expected failure, not the surprise.

Long-horizon tool-use? That's its own infrastructure bill, not just a reward choice. Budget for turn-level credit assignment, exploration instability, rollout cost, and the reasoning-versus-tool-use tension.
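The rule above is ordered, and the order is the point: a judge is the last branch, not the first. A toy encoding, with hypothetical names and return labels:

```python
def choose_reward(has_checker, can_build_checker):
    """Return the reward strategy, trying the verifiable options first."""
    if has_checker:
        return "rlvr"
    if can_build_checker:
        return "manufactured-checker"  # corpus-, gate-, or executor-grounded
    return "judge"                     # reasoning judge, pairwise or generative

def extra_budget(long_horizon_tools):
    """Additional costs to plan for when the task is multi-turn tool use."""
    if not long_horizon_tools:
        return []
    return ["turn-level credit assignment", "exploration instability",
            "rollout cost", "reasoning-vs-tool-use tension"]

print(choose_reward(has_checker=False, can_build_checker=True))  # manufactured-checker
print(len(extra_budget(True)))                                   # 4
```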

The uncomfortable through-line, the one to keep in front of you as you plan a run: the slice of work that is cleanly verifiable is the thin slice. Most of the value, the explanations, the plans, the judgment calls, the open-ended diagnoses, lives in the camp where the reward is exactly the thing you can't define. Which is precisely why the bridge work, turning judge-problems into checker-problems, is where the payoff is. The cleanest tool covers the smallest domain, and closing that gap is the open game.

Related reading on this site: the DeepSeek V3 technical review for the RLVR-origin lineage, the OpenAI o3 / o4-mini release analysis for how the closed labs frame checkable-domain RL, and Anthropic's multi-agent research system plus the ten-dollar autonomous agent squad for where agentic RL is heading in practice.

