# Codex, Claude Code, OpenCode: A Comparison from Inside the Harness

![[assets/codex-vs-claude-code-vs-opencode-hero.png]]

I'm writing this in Claude Code. That's not a credibility flex. It's a working confession, made before the rest of the piece starts to read like scorekeeping. For the past year I've used all three of the harnesses I'm about to compare. Claude Code is my daily driver. I often call Codex from inside it, and I keep OpenCode in regular rotation on purpose, as a discipline for staying fluent in the alternatives. I don't want to wake up in two years and discover I let myself get lured into Claude because everyone else did. The fact that Claude Code is my daily driver means I owe the other two a fair read every time I sit down to compare them, this one included. Take the rest accordingly.

The three-harness field as of May 2026 is neither converging nor consolidating. Codex CLI, Claude Code, and OpenCode all shipped meaningful updates in the last 90 days. All three look more alike on a feature checklist than they did in February. All three are still, at the architectural level, doing fundamentally different things. The temptation when you're staring at a feature checklist is to call that convergence. The temptation when you're sitting inside one of them is to call it tribalism. Neither read is right.

There's a finding worth naming early because it shapes the rest of the conversation. After a March map-file leak exposed the Claude Code source, an analysis titled "98% of Claude Code Is Not AI" argued the system is mostly deterministic scaffolding: context routing, tool dispatch, frustration regexes, and a 3,000-line conditional that decides what the model even gets to see. Gary Marcus made the same point independently, walking through the 3,167-line `print.ts` kernel with its 486 branch points and 12 levels of nesting. The model is a small, expensive component nested inside a much larger software system. That's true. It's also incomplete in a way I want to come back to.
For now, hold it as one accurate observation among several.

## The common floor

Six weeks ago you could draw a clean architectural line between these three. Today the line is smudged in interesting places.

All three now have cascading config files. Claude Code reads `CLAUDE.md`. Codex reads `AGENTS.md`. OpenCode reads both. The convention is the same: drop a markdown file at the project root, the harness picks it up, and child directories override. I've watched my own `AGENTS.md` files start surviving moves between harnesses with no edits. That's a real shift. A year ago, switching coding agents meant reauthoring your prompts.

All three have a marketplace pattern. Codex shipped 90+ first-class plugins on May 3. Claude Code has skills plus an MCP server ecosystem that the rest of the world is now writing servers for. OpenCode has its plugin layer plus the Oh-My-OpenCode pack that bundles common workflows. The shape is the same: a curated directory of small, named capabilities the harness can dispatch to.

All three now have multi-agent worktrees or parallel-session features. All three now have some form of persisted-session or wakeup primitive: Codex Persisted Goals shipped in v0.128.0 on May 1, OpenCode Warping Sessions shipped in v1.14.40 on May 7, and Claude Code's ScheduleWakeup and durable cron have been in Opus 4.7 since mid-April. All three route through MCP or an MCP-equivalent abstraction. All three have a Chrome-or-browser path for grounded web tasks: Codex for Chrome shipped May 7, and Claude Code's MCP OAuth update with enterprise TLS proxy support shipped the same day.

If you read only the changelogs, you'd conclude these are turning into the same product. If you actually use them for a week each, you'd conclude something else.

## What's actually new in the last 90 days

The shared-floor summary above hides the architectural intent inside each release. Here's the same window, sorted by harness.
**Claude Code (Opus 4.7 era)**

| Date | Shipped |
|---|---|
| 2026-04-16 | Opus 4.7 launch: 1M-token context on Max plan, 128K output limit, adaptive thinking with `xhigh` effort default |
| 2026-04-16 | `/effort` ladder, `ultrathink` keyword, browser automation 2x faster vs 4.6 |
| Late April | WarpGrep v2 boosted repo patching, drove SWE-bench Pro to 57.5% |
| 2026-04-23 | Anthropic engineering postmortem on the April quality regression |
| Throughout | 50+ and then 60+ reliability fixes per week from the Claude Code team, on a public weekly cadence |
| 2026-05-05 | Finance agent templates, ten ready-to-run for pitchbooks, KYC, month-end close |
| 2026-05-05 | Microsoft 365 add-in carrying context across Excel, Word, PowerPoint |
| 2026-05-07 | MCP OAuth update with enterprise TLS proxy support, simplified server discovery |
| Throughout | Skills + auto-memory pair pattern: every correction triggers both an auto-memory entry and a learning entry that gets promoted to the source skill |

The single biggest moat in this set is the 1M context on Max. Codex matches feature-for-feature on most other dimensions but ships a standard 200K context window. The skills + auto-memory loop is the second moat: it's the part of the harness that makes long-horizon coherence cumulative across sessions, not just within one.

**Codex CLI (GPT-5.5 "Spud" era)**

| Date | Shipped |
|---|---|
| 2026-04-16 | Background Computer Use |
| 2026-04-23 | GPT-5.5 launch ("Spud"), GitHub PR creation/review, multi-agent worktrees, AGENTS.md cascading config |
| 2026-04-28 | Bedrock availability, ending Azure exclusivity |
| 2026-05-01 | Persisted Goals, v0.128.0 |
| 2026-05-03 | Marketplace plugins, 90+ first-class |
| 2026-05-07 | Codex for Chrome extension |
| 2026-05-07 | GPT-5.5-Cyber preview for defenders |

Sam Altman framed the launch with a single tweet asking to talk to people who'd built things with 5.5 that took "ludicrous token budgets."
The center of gravity is shifting toward longer agentic runs while the harness stays terse. Greg Brockman put it more directly the same week: "GPT-5.5 is both very capable and very succinct." The verbosity contrast with Claude Code is intentional, not accidental.

**OpenCode**

| Date | Shipped |
|---|---|
| 2026-04-08 | Oh-My-OpenCode plugin pack with Ultrawork mode |
| April | VS Code extension GA, sessions sync between CLI and IDE |
| 2026-05-07 | Warping Sessions, v1.14.40 |
| 2026-05-07 | Desktop app stability, v1.14.41 |
| 2026-05-08 | OpenCode Go subscription tier: $5 first month, $10/mo thereafter, curated low-latency endpoints for DeepSeek V4 Pro, Qwen 3.6, Kimi K2.6 |

The Go tier launching yesterday is the cleanest evidence that OpenCode is monetizing without abandoning open source. The core CLI stays MIT-licensed; the paid tier is a curated endpoint bundle, not a feature gate.

**Benchmarks worth keeping in your head**

SWE-bench Pro: Claude Code 57.5% with WarpGrep v2, Codex 57.0%. Effectively a tie.

Terminal-Bench 2.0: Codex 82.7% at xhigh effort, Claude Code 53.0% at default effort. The terminal-loop benchmark is the strongest single quantitative argument for Codex on autonomous loops. It is also the one bench where the effort setting matters most: at standard effort the two harnesses tie at 53.0%.

## Where they actually differentiate

The differentiation isn't in the features. It's in what each harness is shaped to do well.

Claude Code is built for long-horizon coherence. The 1M-token context on Max, the skills plus auto-memory loop, the durable cron primitives, the cascading config: these aren't independent features. They're a stack designed around a single operating assumption, which is that you're going to want the harness to run for a long time without you watching. Boris Cherny, who leads the Claude Code team, captured the architectural premise on Hacker News in February.
When they shipped Claude Code a year earlier, the model could run unattended for less than thirty seconds before going off the rails. Now it runs for minutes, hours, and days at a time. The harness is what survives that gap. The verbosity follows. The premium pricing follows. The fact that the same harness feels heavy on a quick atomic task follows from the same commitment.

Codex CLI is built for bundled per-session productivity. The ChatGPT Plus tier rebalanced in April toward more sessions of shorter duration, not fewer sessions of longer duration. The default behavior leans terse. The marketplace-plugin layer is shaped around tasks you can name in a sentence and finish in a session. Sam Altman's tweet that GPT-5.5 in Codex was surprisingly good at non-coding tasks is the tell: Codex's center of gravity is still atomic computer work, and the non-coding angle is the expansion play. Operators on r/ChatGPTCoding describe the experience as "45 minutes and move on to the next thing," which is exactly what the harness is optimized to produce.

OpenCode is built for substrate freedom. MIT-licensed, 75+ providers natively supported, a dual-agent Build/Plan toggle, a $10/month Go tier launched yesterday, and no Anthropic or OpenAI permission required to ship anything. The architectural commitment is that no single model lab will be the right answer for every task in every quarter, and that the harness should not be the lever a vendor pulls to change the deal on you. Practitioners running OpenCode through Tailscale to keep proprietary code off third-party clouds are using exactly the property the project is shaped to provide.

These three commitments don't reduce to a feature checklist. They show up in the texture of using each harness for a week. Convergence and differentiation are happening at the same time, in different layers, and that's what makes the choice interesting rather than mechanical.
## The model still matters, for a non-obvious reason

Here's where the 98%-not-AI finding gets more interesting, not less.

If you take the claim at face value, the natural conclusion is that model selection becomes a commodity decision. Pick whichever model is cheapest per token, swap it in under the harness, ship. Plenty of people are doing exactly that. Operators routinely document running Claude Code "95% cheaper" by swapping models like MiniMax M2.7, Kimi K2.6, GLM-4.7, Qwen 3.6, and DeepSeek V4 Pro underneath, via gateway proxies like LiteLLM, NVIDIA NIM, OpenRouter, and Ollama. Anthropic's own harness, pointed at someone else's model, served by a homelab GPU. That's a real practitioner pattern, not a thought experiment.

What that pattern reveals is the opposite of "model doesn't matter." When you swap MiniMax in under Claude Code, the harness keeps working. The skills load. The MCP servers route. The auto-memory pair fires on the same triggers. But the dynamic underneath the scaffold shifts. The model handles the same context window differently. It writes back into the conversation with a different cadence. It interprets the cascading config files with a different bias. The shape of the harness is unchanged; the character coming through it is different. You notice within an hour.

This is the part the "harness is the IP" framing leaves out. The harness is the IP, and the harness is shaped around the model whose character it was built to surface. Claude Code's long-horizon coherence depends on a model that holds context coherently over hours. Codex CLI's atomic-task speed depends on a model that produces short, decisive answers without thinking out loud. OpenCode's substrate-freedom story depends on whatever model you point it at being able to follow the harness's protocols at all. Each scaffold is doing the work that lets a particular model character be useful at scale. The scaffold isn't replacing the model. It's the lens that makes the model legible.
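The swap pattern described above can be sketched in a few lines. Everything here is illustrative: the gateway URL and model alias are placeholders, and the assumption that your harness build honors `ANTHROPIC_BASE_URL` and `ANTHROPIC_MODEL` environment overrides should be verified against your own install before relying on it.

```python
import os

# Hypothetical gateway endpoint and model alias -- substitute whatever your
# LiteLLM / OpenRouter / Ollama proxy actually exposes.
GATEWAY_URL = "http://localhost:4000"
MODEL_ALIAS = "kimi-k2.6"

def swapped_env(base_url: str, model: str) -> dict:
    """Build the environment for launching the harness against a gateway
    proxy instead of the vendor's API. The two variable names below are
    the commonly reported overrides; treat both as assumptions."""
    env = os.environ.copy()
    env["ANTHROPIC_BASE_URL"] = base_url
    env["ANTHROPIC_MODEL"] = model
    return env

env = swapped_env(GATEWAY_URL, MODEL_ALIAS)
# subprocess.run(["claude"], env=env)  # uncomment to launch under the proxy
print(env["ANTHROPIC_BASE_URL"], env["ANTHROPIC_MODEL"])
```

The point of the sketch is what it does not touch: skills, MCP config, and auto-memory all ride along unchanged, which is exactly the scaffold durability operators report.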
So the practical question isn't "which model is best." It's "which character do I want running for hours without me watching, and which harness is shaped to surface that character?" When operators swap models under Claude Code and report the harness still working, they're describing how durable the scaffold is. They're also, usually within a few sessions, describing the model's character changing. Some find the swap productive because the new character fits a different work shape. Some find it disappointing because they were paying for the original character. Both reports are true, and both are about the model.

## Economics, gateway routing, and the price of character

Pricing rotates faster than this paragraph will age, so I'll stay structural.

Subscription pricing across all three is in motion. ChatGPT Plus rebalanced its Codex allocation in April. Claude's Pro tier has been rumored to lose Claude Code access entirely, and Anthropic's response on that has been ambiguous. OpenCode launched its Go subscription tier yesterday at $5 for the first month, $10 thereafter. None of these prices will be the same in 90 days. What's stable is the pattern.

The pattern is that gateway routing has become a real economic lever. A single shell alias that swaps the gateway target before launching the harness is enough to route a session to whichever provider makes sense for the work. Voice-sensitive drafting goes to the premium model whose character you want. Bulk file scanning routes to a cheaper open-weight model that's perfectly capable of grep-with-context. Sensitive code that can't leave the building routes to a self-hosted endpoint on local hardware. None of this requires changing harnesses. It requires owning your gateway layer.

The shape of the practitioner setup is small: a handful of aliases, a single gateway service that knows how to reach four or five model providers, and a per-task instinct for which alias to launch. The harness is constant. The model rotates.
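A minimal sketch of that setup, written out as a routing table rather than shell aliases so the structure is explicit. Every endpoint, model name, and CLI invocation here is a placeholder standing in for your own gateway targets, not a documented configuration.

```python
import os
import subprocess

# Illustrative routing table: task shape -> (gateway target, model alias).
# All endpoints and model names are placeholders for your own gateway layer.
ROUTES = {
    "drafting":  ("https://premium-gateway.example", "premium-voice-model"),
    "bulk-scan": ("http://gateway.local:4000",       "cheap-open-weight"),
    "sensitive": ("http://127.0.0.1:11434",          "self-hosted-local"),
}

def route(task: str) -> dict:
    """Resolve a task shape to the env overrides a gateway-aware harness
    would read. Variable names mirror the convention Claude Code users
    report; adjust for whichever harness you launch."""
    base_url, model = ROUTES[task]
    return {"ANTHROPIC_BASE_URL": base_url, "ANTHROPIC_MODEL": model}

def launch(task: str, dry_run: bool = True):
    """Launch the (placeholder) harness CLI under the chosen route.
    With dry_run, return the resolved route instead of spawning anything."""
    overrides = route(task)
    if dry_run:
        return overrides
    subprocess.run(["claude"], env={**os.environ, **overrides})

print(launch("bulk-scan"))
```

In daily use this collapses into three shell aliases; the Python form just makes the per-task routing decision visible as data.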
The bill drops by 40% to 80% versus running everything against the premium tier, depending on workload mix.

OpenCode does this natively, with first-class multi-provider support and no gateway plumbing required. Claude Code does it through the gateway pattern, which costs you a one-time setup and gives you the full Claude Code skills, MCP, and auto-memory ecosystem on top of any model. Codex CLI does it less by design but increasingly by accident, since Codex on Bedrock and reports of cross-provider Codex use both exist.

The trend across all three is that vendor lock-in at the model layer is becoming optional. The lock-in that remains is at the harness layer, where it's better defended and more justified, because that's where the actual software is.

## Situational fit

There's no single right answer. There's a shape for each kind of work, and the harness that fits the shape best is usually the one your hands reach for unprompted.

| Scenario | Best fit | Runner-up |
|---|---|---|
| Greenfield project, voice-sensitive | Claude Code | Codex CLI |
| Large legacy refactor | Claude Code (1M context) | OpenCode + Claude via gateway |
| Atomic task, time-boxed | Codex CLI | Claude Code |
| Debugging a flaky test | Codex CLI | Claude Code |
| Long-form writing or research | Claude Code (skills + auto-memory) | Codex |
| Scheduled overnight job | Claude Code (ScheduleWakeup, durable cron) | Codex (Persisted Goals) |
| Cost-constrained or private codebase | OpenCode + local models | Claude Code via gateway proxy |
| IDE-embedded | Cursor (likely Claude Code under the hood) | Claude Code terminal |
| Building a custom skill or agent | Claude Code | OpenCode (dual-agent) |

The matrix is a set of starting points to copy and mutate, not a list of buy recommendations. Three operator profiles map onto it cleanly, and they're worth naming.

The solo founder shipping atomic tasks at speed wants Codex Plus or OpenCode plus local models.
Codex Plus's per-session rebalancing fits a day made of many small tasks: forty-five minutes on a feature, ship, next thing. OpenCode plus a local model fits the same work shape if cost is the binding constraint or if the codebase has to stay on local hardware. Both setups punish you for trying to use them on long uninterrupted refactors. They reward you for doing what they're shaped for.

The senior engineer doing long-horizon refactors and voice-sensitive comms wants Claude Code Max: the 1M context, the skills layer, the auto-memory loop, ScheduleWakeup, durable cron. The premium price buys you the scaffold that lets the model run for hours without supervision and lets your written voice survive across sessions. If your work involves a six-hour refactor of a legacy codebase, or a piece of writing that needs to come out sounding like you, Claude Code Max is the only one of the three currently shaped for that.

The builder-leader running a fleet of agents on top of one of these wants Claude Code as the outer harness, with mode-switchers for routing across providers, Codex called as a tool from inside Claude Code when GPT-5.5's specific terseness or reasoning shape wins on the task, and OpenCode in regular rotation as the deliberate fluency check against the daily driver. This is my pattern. The outer scaffold survives the model swap. The skills and MCP ecosystem do the orchestration work. The inner tool dispatches handle the cases where another harness's native model character is the better fit. The bill is mixed across providers, the work shape is mixed across surfaces, and that's the point. The discipline of keeping the other two warm is what stops the daily driver from becoming the only thing you can see.

If none of those three operator profiles describes your work, the right move isn't to look for the highest-scoring tool. It's to spend a week each in the two that look closest to your shape and trust your hands to tell you which one you reach for unprompted.
Mine kept landing in Claude Code. Yours might not.

## Sidebar: autonomous agents are a different category

OpenClaw, Hermes, and the broader autonomous-agent layer are blurring at the edges with these three harnesses, but they're a distinct category. Harnesses run with a human in the loop, optimize for supervised sessions, and surface their work in a UI you can interrupt. Autonomous frameworks run as fleets without per-session human input and optimize for long-horizon unsupervised execution, which is a different reliability problem with a different set of failure modes. The OpenClaw security incident from early May is the cautionary tale that makes the distinction visible.

The harness category is what most builder-leaders are currently shipping into production. The autonomous category is what they're prototyping. A separate piece in this series will cover that category specifically.

## What to do with this

The three-harness field as of May 2026 is not converging, not consolidating, and not collapsing into a single answer. There are three live patterns, each shaped around what its native model does well, each carrying a real architectural commitment that survives even when you swap the model underneath. The right pick is the one whose commitment matches your work shape and whose character you trust to run for the longest unattended stretch you'll let it.

The question of what runs for hours without going off the rails is the same question I worked through in [[AI Development & Agents/anthropic-dreaming-continual-learning|Anthropic Said "Feels Infinite." Dreaming Is Where It Breaks.]] from the model side, and it shows up here from the harness side. The two pieces fit together. The model and the scaffold are the two halves of the same answer.

This piece is the technical companion to a more personal one on Run Data Run: [Three Tools, Three Characters, One Working Week](https://rundatarun.io/p/three-harnesses-three-characters).
That version trades the benchmark numbers for the texture of one person actually using all three across a book, a blog, a model, and a creative draft.

---

### Related Articles

- [[AI Development & Agents/anthropic-dreaming-continual-learning|Anthropic Said "Feels Infinite." Dreaming Is Where It Breaks.]]
- [[AI Development & Agents/making-claude-code-more-agentic|Making Claude Code More Agentic]]
- [[AI Development & Agents/claude-code-best-practices|Claude Code: Best Practices for Agentic Coding]]

---

<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>
<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>