the-control-plane-bet-code-as-action - AIXplore

# The control plane bet: search and agents converging on code-as-action

> [!tip] TLDR
> **The why**: Within one week in late May and early June 2026, Perplexity and Anthropic shipped the same architectural bet from different directions.
> **The shape**: The model stops calling tools. It writes code that calls tools, executes that code in a sandbox, and returns one clean result to the context. Retrieval and orchestration collapse into the same primitive.
> **The hard part**: It isn't code generation. It's moving deterministic operations (dedupe, filter, aggregate, fan-out control) into the sandbox where they consume zero context tokens.
> **Reproduction prompt for Claude Code**: *"Audit the tool-call sequences in [repo or skill name]. For any chain of 3+ sequential tool calls that could be expressed as a Python function, draft a code-as-action replacement and show me the token count change."*
>
> Read on if you want to understand why this matters for how you build agent systems today.

## The convergence

On June 1, 2026, Perplexity published "Rethinking Search as Code Generation." Four days earlier, on May 28, Anthropic had shipped Opus 4.8 with Dynamic Workflows. Two labs, four days apart, made the same architectural bet from opposite ends of the stack.

Anthropic's `ant` CLI, which shipped back in April for Claude Code to drive via the `claude-api` skill, is the configuration-as-code companion to the same thesis.

Three products. Two companies. One shared bet: the model is a **control plane**, not a participant in the retrieval or orchestration loop.

Two independent teams working on different problems (Perplexity on search quality and token cost, Anthropic on code migration and multi-step reasoning) arrived at structurally identical architectures without apparent coordination. That convergence is the signal.

---

## What's wrong with the tool-call loop

Before the architectural shift makes sense, you need to feel the problem it solves.

Standard agentic search loops look like this: call a tool, get a result, decide whether to call another, repeat until done. The model participates in every step. Each intermediate result lands in the context window and stays there. Every subsequent call happens against a window that's increasingly full of old results.

Three failure modes compound from there.

**Context pollution.** A search-and-synthesize pipeline with 12 tool calls carries 12 search results into the context, plus 12 "here's what I found" transitions, plus whatever the model appended at each step. By step 8, the model reasons over its own accumulated output more than the incoming evidence. Retrieval quality degrades invisibly, because the context window is doing work it shouldn't.

**Serial latency.** Fan-out queries (where you need ten sources in parallel, then deduplicate, then filter by date, then rerank) take ten-plus model calls in a tool-call loop. Each waits for the prior result. Those are inference roundtrips spent on deterministic operations that Python executes in microseconds.

**Manual control flow.** The model decides, on each step, whether to call another tool or stop. That decision is probabilistic. Models stop early (insufficient evidence) or run long (searching past diminishing returns) because they can't see the full aggregate until they stop. You can tune this with prompting, but you're fighting the architecture.

Code-as-action moves all three problems into the sandbox. The model writes the retrieval strategy as a program before the first tool call. Deterministic operations run in Python; only the final result reaches the context. One model call, one program, one result.

> **The model stops participating in the loop. It writes the loop as a program, runs it in a sandbox, and reads back one result.**

---

## Perplexity's Search-as-Code

Three layers:

1. **Control plane**: A frontier model reads the task and decides the retrieval strategy: fan-out queries, narrow-and-deep, time-bounded, iterative refinement.
2. **SDK**: The model writes Python against Perplexity's Agentic Search SDK (`search`, `rerank`, `filter`, `dedupe`, `join`, `retry`, `aggregate`).
3. **Sandbox**: That code runs in a secure environment with a persistent filesystem. Intermediate results live on disk between passes, not in the context window.

The CVE advisory case study puts numbers on the token reduction: 288,700 tokens dropping to 42,900, an 85.1% reduction. The mechanism is exactly what the architecture predicts: 200 search results get filtered to 12 relevant ones in the sandbox; only those 12 reach the model.

> **288,700 tokens down to 42,900. The model never sees the 188 results that got filtered out in the sandbox.**

On five benchmarks (DSQA, BrowseComp, HLE, WideSearch, WANDR), Perplexity reports Search-as-Code (SaC) above OpenAI Responses API, Anthropic Managed Agents, Exa, and Parallel Tasks. The important caveat: WANDR is Perplexity's own benchmark, and the 2.5x lead lives there. The DSQA +19.77pp and BrowseComp 0.805 numbers carry more weight because they're on independent evaluations. Wait for independent replication before restructuring production retrieval pipelines around these numbers; the architecture is sound, but the benchmark claims are vendor-graded.

Two design choices in the paper are worth taking seriously:

**Python over TypeScript or Bash.** Models write better Python. The test is which language produces the fewest bugs in zero-shot code generation. Python wins consistently. For retrieval-shaped work involving list operations, filtering, and aggregation, the standard library covers most of the SDK operations without syntactic overhead.

**Filesystem over REPL for state.** Long agentic trajectories are fragile when state lives only in a REPL. A kernel restart wipes the intermediate results the model was building on.
Scratch files that survive across sandbox restarts are a reliability improvement that compounds on multi-step tasks.

---

## Anthropic's Dynamic Workflows

Same pattern, applied to orchestration rather than retrieval. Claude auto-writes an orchestration script that fans out to parallel subagents, gathers their results, and runs adversarial verifiers before returning a final answer. The trigger is explicit ("create a workflow to...") or implicit via the `ultracode` setting, which evaluates every task as a workflow candidate.

The proof point Anthropic ships with the release: Jarred Sumner ported Bun from Zig to Rust using Dynamic Workflows (750,000 lines, first commit to merge in 11 days). Two caveats: Bun's codebase is unusually self-contained and test-covered, which is exactly the shape Dynamic Workflows handles best; and 11 days of Claude time with Sumner reviewing is not 11 days zero-touch. Still, a working port at that scale is a real data point.

The UX change in Dynamic Workflows is Claude deciding when to fan out rather than the user invoking it. The underlying Workflow tool already exists in Claude Code and already supports parallel subagent orchestration. What's new is the planning loop: Claude evaluates "this task warrants a fan-out" and writes the orchestration script without explicit prompting.

Tradeoffs Anthropic doesn't surface prominently:

- **Verifiers are still LLMs.** Adversarial verifiers reduce the hallucination rate on surface errors. They can't validate semantic correctness on code paths they don't execute. The Bun port worked because it has a real test suite covering semantic behavior.
- **Latency floor.** Verifier passes run after fan-out, sequentially. Expect a mandatory review pass that adds wall time on long tasks.
- **`ultracode` scope risk.** Auto-orchestration on a poorly-scoped task spawns subagents on the wrong abstraction. Scoping "this task" correctly is still your job; the 1000-agent ceiling is the only real backstop.

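The fan-out, gather, verify shape described in this section can be sketched with `asyncio`. This is not Anthropic's Workflow tool, nor the script Claude actually writes: `run_subagent` and `verify` are hypothetical stubs standing in for model calls, and the task names are made up for illustration.

```python
import asyncio


async def run_subagent(task: str) -> dict:
    """Hypothetical stub for a subagent; a real one would call a model."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"task": task, "output": f"ported {task}"}


async def verify(result: dict) -> bool:
    """Hypothetical verifier stub; a real verifier is another LLM pass,
    ideally backed by an executable test suite."""
    await asyncio.sleep(0)
    return result["output"].startswith("ported")


async def orchestrate(tasks: list[str], max_parallel: int = 4) -> dict:
    gate = asyncio.Semaphore(max_parallel)  # explicit ceiling on fan-out

    async def bounded(task: str) -> dict:
        async with gate:
            return await run_subagent(task)

    # Fan out: all subagents run concurrently, bounded by the semaphore.
    results = await asyncio.gather(*(bounded(t) for t in tasks))
    # Verifier pass: starts only after every subagent has returned.
    verdicts = await asyncio.gather(*(verify(r) for r in results))
    accepted = [r for r, ok in zip(results, verdicts) if ok]
    return {"accepted": len(accepted), "rejected": len(results) - len(accepted)}


summary = asyncio.run(orchestrate(["module_a", "module_b", "module_c"]))
```

Two of the tradeoffs show up directly in the structure: the verifier stage cannot begin until the slowest subagent finishes, and the only scope control in the script itself is the `max_parallel` ceiling.
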
My recommendation: use Dynamic Workflows explicitly (you invoke it, you scope it) before enabling `ultracode` always-on. Test it on a real audit or migration task, measure where it earns its token spend, then decide on the setting.

---

## The ant CLI and configuration-as-code

The third piece, shipped back in April alongside Claude Managed Agents: `ant`, the Claude Platform CLI.

```bash
brew install anthropics/tap/ant
ant messages create --model claude-opus-4-8 --message '{role: user, content: "..."}'
ant beta:agents   # launch/manage Managed Agents
ant beta:worker   # poll and run jobs when self-hosting
ant auth login
```

The architectural intent is explicit in the release: `ant` is designed for Claude Code to drive via the `claude-api` skill. When Claude Code authors API integrations, it reaches for `ant messages create` in shell rather than SDK import boilerplate. Fewer lines, same output, and the invocation is greppable in transcripts.

The more consequential piece is `ant beta:worker` for Managed Agents: poll and run jobs when self-hosting. Combined with `ant beta:agents`, this creates a Git-ops path for agent configurations. One reply in the release thread framed it well: "Agents and Skills as part of your regular deploy process, updated on each push." Stateful AI resources as versioned infrastructure, not dashboard-only artifacts.


---

## The shared architecture

Abstracting away the domain specifics, both companies ship the same compute graph:

```
task → model reads → model writes program → sandbox runs program → clean result to context
```

The key operations that move from the context window to the sandbox:

| Operation | Tool-call loop | Code-as-action |
|---|---|---|
| Fan-out queries | N model calls, N results in context | Async calls in sandboxed Python |
| Deduplication | Model summarizes duplicates | `set()` in sandbox |
| Date filtering | Model applies heuristics | `[r for r in results if r.date > threshold]` |
| Reranking | Model reorders in prompt | Sort by score, slice top K |
| Aggregation | Model synthesizes from all results | Computed field in sandbox |

The model handles judgment (what strategy, how to interpret ambiguous results, when to refine). The sandbox handles computation. The split is clean and compositional. You can add operations to either layer without touching the other.

> **The model decides the strategy. Python does the arithmetic. Only judgment touches the context window.**

This pattern has antecedents. CodeAct (Wang et al., 2024) established "model writes code, sandbox executes code, code is the tool-call interface" as a research primitive. Voyager (2023) applied it to embodied agents. What's new is production-scale retrieval systems and hosted platforms shipping on it in the same week.

---

## What this means for how you build

**Retrieval pipelines**: audit your tool-call sequences. Any chain of three or more sequential calls where the model filters or aggregates results is a candidate. The token savings are proportional to the number of intermediate results your model never has to see.

**Multi-file work in Claude Code**: explicit Dynamic Workflows (scoped, you invoke it) are worth experimenting with for codebase migrations and audits. Leave `ultracode` off until you've measured it on a workload where you can verify the output independently.

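For the retrieval-pipeline audit above, the sandbox-side replacement for a dedupe, filter, rerank chain can be sketched with the standard library alone. This is a minimal illustration, not either vendor's SDK: the `Result` record, its field names, and the `condense` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """Hypothetical search hit; field names are illustrative only."""
    url: str
    published: date
    score: float


def condense(results: list[Result], threshold: date, k: int) -> dict:
    """Dedupe, date-filter, and rerank, returning only what the model needs."""
    # Deduplicate by URL, keeping the first occurrence.
    seen: set[str] = set()
    unique: list[Result] = []
    for r in results:
        if r.url not in seen:
            seen.add(r.url)
            unique.append(r)

    # Date filter: keep results published after the threshold.
    recent = [r for r in unique if r.published > threshold]

    # Rerank: sort by score, slice top K.
    top = sorted(recent, key=lambda r: r.score, reverse=True)[:k]

    # Aggregation: counts are computed here, not synthesized by the model.
    return {
        "total_seen": len(results),
        "after_dedupe": len(unique),
        "after_filter": len(recent),
        "top": top,
    }


hits = [
    Result("https://a.example/x", date(2026, 5, 20), 0.90),
    Result("https://a.example/x", date(2026, 5, 20), 0.90),  # duplicate
    Result("https://b.example/y", date(2025, 1, 3), 0.80),   # too old
    Result("https://c.example/z", date(2026, 4, 2), 0.70),
    Result("https://d.example/w", date(2026, 3, 15), 0.95),
]
summary = condense(hits, threshold=date(2026, 1, 1), k=2)
```

Only `summary` would be handed back to the model; the five raw hits stay in the sandbox, which is where the token savings come from.
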
**Claude Platform integrations**: install `ant`. For simple API calls, `ant messages create` from a shell invocation is faster to author and grep than SDK boilerplate. The `ant beta:worker` path is worth evaluating if you run self-hosted Managed Agents or plan to.

**Homelab or small-scale search**: the pattern is accessible without Perplexity's SDK or Anthropic's hosted service. A Python function that fans out queries to multiple backends, deduplicates by domain, and returns top K to the model is the same architecture at a smaller scale. The key constraint is that deterministic operations happen before the model sees the results, not after.

---

## What this doesn't solve

Neither architecture addresses:

**Unknown query shape.** The model writes a retrieval strategy before it knows what the results look like. A wrong initial strategy runs to completion before you can correct it. SaC adds iterative refinement, but the first pass still commits to a program.

**Benchmark transparency.** WANDR is Perplexity's benchmark. "Beats Anthropic Managed Agents" is a self-graded claim on a self-built eval. Dynamic Workflows' token consumption isn't published per-workload. Budget both carefully before scaling.

**Verification ceiling.** An adversarial verifier that consistently misses the same error class creates a false floor of confidence. The failure mode is invisible until it reaches production.

**SDK portability.** SaC requires Perplexity's SDK. Dynamic Workflows requires Claude Code. The underlying pattern is portable to any LLM plus a Python sandbox; the production implementations are not.

---

## The convergence signal

Two teams working on different problems (retrieval latency and token cost, versus agentic code migration and multi-step verification) converged on the same architectural shape without apparent coordination.
That's the bet worth internalizing now, independent of either company's specific implementation: **model as control plane, sandbox as execution, context window for judgment only**. The primitives are Python and a subprocess call.
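
The closing claim, that the primitives are Python and a subprocess call, can be made concrete in a short sketch. Everything named here is an assumption for illustration: `GENERATED_PROGRAM` is a hard-coded stand-in for code a model would write, `run_program` is a hypothetical harness, and a bare child process is not a security sandbox (a real deployment would add isolation).

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a model-written program. It keeps intermediate state on disk
# and prints only a compact JSON summary to stdout.
GENERATED_PROGRAM = '''
import json
from pathlib import Path

# Hypothetical raw results; a real program would call search tools here.
raw = [{"id": i, "relevant": i % 50 == 0} for i in range(200)]
Path("raw.json").write_text(json.dumps(raw), encoding="utf-8")  # scratch file

kept = [r["id"] for r in raw if r["relevant"]]
print(json.dumps({"seen": len(raw), "kept": kept}))
'''


def run_program(program: str, workdir: Path, timeout_s: float = 10.0) -> dict:
    """Run a program in a child process; return only its JSON stdout."""
    script = workdir / "program.py"
    script.write_text(program, encoding="utf-8")
    proc = subprocess.run(
        [sys.executable, str(script)],
        cwd=workdir,          # relative paths in the program resolve here
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,           # raise if the program exits non-zero
    )
    return json.loads(proc.stdout)


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    result = run_program(GENERATED_PROGRAM, workdir)
    scratch_on_disk = (workdir / "raw.json").exists()
```

In this sketch, `result` is the only thing that would enter the context window. The 200 raw records stay in `raw.json` inside the working directory, where a later pass could read them back without the model ever carrying them.
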