# Anthropic Said "Feels Infinite." Dreaming Is Where It Breaks. ![[assets/anthropic-dreaming-continual-learning-hero.png]] At Code with Claude in San Francisco on May 6, Anthropic described "context windows that feel infinite." The press read that as a token-count announcement and ran with "infinite context window" headlines. That paraphrase loses the argument. The word doing all the work is *feel*. There is no new context primitive. The 1M-token window on Sonnet 4.6 and Opus 4.6 is the same one that went GA in Q1. The Compaction API and the memory tool already shipped. What is genuinely new is **Dreaming**: a scheduled offline process that reviews past sessions, extracts patterns, and curates memory artifacts so an agent improves over time. Harvey reported a 6x lift on task completion. No weights move. If you read the Dreaming description side by side with a continual-learning textbook, you will notice they describe the same operational signature. Replay buffers, regularization, consolidation. The continual-learning literature has been formalizing this for two years. Anthropic shipped a feature that inherits the literature's failure modes without inheriting its math. That gap is the point of this post. The argument I want to defend: at the user-visible layer, Dreaming and continual learning are indistinguishable. At the engineering-debugging layer, they fail in completely different ways and demand different fixes. Operators who collapse the two will misroute half their incidents. ## What "feels infinite" is actually made of Anthropic's stack as of May 6 has five layers, not one. None of them is a context window expansion. | Layer | What it does | Status | |---|---|---| | 1M-token window | Fixed context, GA pricing | Shipped Q1 2026 | | Compaction API (`compact-2026-01-12`) | Server-side summarization at ~150K | Beta, already in use | | Memory tool | Persistent plain-text plus structured playbooks | Shipped earlier | | **Dreaming** (research preview) | Offline session review, pattern extraction, memory curation | New, May 6 | | Outcomes / multiagent orchestration | Rubric-graded retry loop and lead/worker delegation | Public beta, May 6 | The press conflated all five layers into "infinite context window." Anthropic's own framing was more careful. The window is the same 1M ceiling. The *feel* comes from compaction plus memory plus Dreaming, which together let an agent appear to retain context indefinitely while the actual token budget at any moment stays bounded. This is not a token-count story. It is a memory-management story dressed in token-count language. The distinction matters because the failure modes live at different layers. ## Dreaming is continual learning with the consolidation target swapped Read these two sentences in sequence. > [!info] Anthropic on Dreaming > A scheduled process that reviews your agent sessions and memory stores, extracts patterns, and curates memories so your agents improve over time. > [!info] Continual learning textbook definition > A learning process in which the model incorporates new tasks while preserving performance on old ones, addressed by replay buffers, regularization, and consolidation mechanisms. Replay maps to session review. Regularization maps to memory curation. Consolidation maps to playbook extraction. The mechanisms line up almost word for word. The only structural difference is the *target* of consolidation. Continual learning writes to the weights. Dreaming writes to a memory store the agent retrieves from at runtime. 
That swap matters more than it looks. A weight update is destructive and global. A memory write is additive and scoped. A weight update is debugged with gradient inspection and benchmark regressions. A memory write is debugged by reading what got curated, what got evicted, and what the retriever surfaced this turn. Same operational signature, different forensic tools.

The result: Dreaming systems can fail in ways the continual-learning literature already has names for, like plasticity-stability tradeoffs in eviction policy or distribution shift in what gets prioritized. But operators will not recognize them, because the vocabulary is sitting in a parallel research community most builders do not read.

## The case for treating them as the same thing

Stand in the customer's seat for a moment. The system retains information across sessions. Old facts stay accessible after new ones land. It can answer about events past the training cutoff. Recurring user-specific tasks improve over time because the retrieved examples improve.

If the user-visible behavior is "the system gets better at our work as we use it together," the customer is correct to call that learning. The mechanism does not matter to them.

This is the position implicit in Anthropic's product framing, and it is the position the practical-systems literature is converging on. *In-Context Learning can Perform Continual Learning Like Humans* (arXiv 2509.22764) makes the equivalence claim explicitly. *End-to-End Test-Time Training for Long Context* (arXiv 2512.23675) goes further, compressing long context into weights at test time. The line is genuinely blurring.

Builders who insist on the old taxonomy at the user-experience layer will sound pedantic and lose the conversation. The customer does not need to know whether their fact lives in the prompt, the retriever, or the playbook. They need it to be there when they ask.

## The case against

Continual learning as the literature defines it is a property of the parameters. It is the plasticity-stability problem under shifting distributions, addressed by elastic weight consolidation (EWC), functional regularization, and replay buffers. The hard part is keeping the model from clobbering old skills when learning new ones, while the model itself is the substrate that holds both.

In-context "learning" plus memory tooling has none of those properties.

- The knowledge lives in the prompt or in retrieved chunks, not in the model.
- It evaporates the moment the context summarizer drops a fact, the retriever misses, or the embedding goes cold.
- The model itself is identical across users and sessions. Personalization lives outside it.
- There is no plasticity-stability problem because there is no plasticity. There is just retrieval.

The failure modes diverge cleanly. Continual learning fails by *forgetting*: weights drift, old tasks degrade. Dreaming fails by *eviction*: a curator decided yesterday's lesson did not earn shelf space, or the retriever ranked a stale playbook above a fresh one, or the compaction step summarized a load-bearing detail into "the user previously discussed configuration." Both look like the system forgot. Only one of them is forgetting.

The fix for forgetting is offline retraining or fine-tuning. The fix for eviction is reading the memory store, finding the missing fact, and chasing whichever component evicted it. If you treat an eviction incident as a forgetting incident, you will spend a week designing an evaluation harness for a problem that was, in fact, a stale embedding index.
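The alternative is mechanical: walk every layer a fact can live in before anyone mentions the model. A minimal sketch of that walk, with hypothetical layer names; substitute whatever your stack actually has.

```python
"""Sketch of eviction-first triage: find where a fact lived before
blaming the model. Layer names are illustrative, not a real API."""


def locate_fact(fact: str, layers: dict[str, list[str]]) -> str:
    """Walk every place a fact can live; report the first hit,
    or conclude that some component evicted it."""
    for layer_name, contents in layers.items():
        if any(fact in item for item in contents):
            return f"present in {layer_name}: retrieval or ranking bug, not forgetting"
    return "absent everywhere: check compactor summaries and Dreaming eviction logs"


# Usage: the layers are whatever your stack actually has.
layers = {
    "prompt_window": ["current turn context"],
    "retrieval_index": ["user prefers staging deploys on Fridays"],
    "scratchpad_files": [],
    "persistent_playbooks": ["deploy checklist v3"],
}
print(locate_fact("staging deploys", layers))
# -> present in retrieval_index: retrieval or ranking bug, not forgetting
```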
> [!warning] The diagnostic question that matters
> When something gets forgotten, do not ask "did the model forget?" Ask "where did this fact live, and which component evicted it?" Until you can enumerate every place a fact can live in your system, you cannot debug forgetting.

## The convergence is happening, but unevenly

Test-time training papers genuinely are blurring the line. *Rethinking Long Context Generation from the Continual...* (ACL 2025) and the test-time training work cited above are evidence that the *mechanism* is starting to converge, not just the user experience. I do not want to argue against that convergence. It is happening.

What I want to argue is that the convergence is uneven, and the speed varies by layer. At the user-experience layer it is essentially complete. At the engineering layer it is not, and conflating them makes the engineering work harder. The fix for an eviction incident is not a fine-tune. The fix for a stale playbook is not a longer context window. The fix for a weight-level forgetting problem is not a better retriever. Treating "the agent forgot" as a single category of incident loses information that was load-bearing.

This is also the place where leadership conversations get hard. Executives will hear "infinite context" and assume the system retains by default. Builders will know retention is a stack of bets, each of which can lose silently. When the bet loses and the user reports a forgotten fact, somebody has to translate "the compactor summarized your spec into three sentences" into language a stakeholder can act on. The translation gets harder when both sides have collapsed the stack into a single mental object called "the AI."

## What I would tell a builder shipping on this stack

Four things, in order of how often I have wished I had internalized them earlier.

**Treat memory as a first-class subsystem, not a side effect of long context.** If you cannot enumerate where every fact in your system lives, you cannot debug forgetting. The map you want is concrete: prompt window, retrieval index, scratchpad files, persistent playbooks, compactor summaries. Each layer has its own eviction policy. Each policy is debuggable independently. None of them is debuggable when you have collapsed them into a single mental object called "the agent's memory."

**Keep the literature's vocabulary in your incident review.** When a user reports a forgotten fact, force the question "did the model forget, or did the system evict?" before any fix gets proposed. The vocabularies live in different research communities, and your team's bias will be toward whichever one they read more recently. Make the dichotomy explicit in the postmortem template.

**Expect Dreaming to drift in ways the literature describes.** Curated memory is not a stable artifact. It distribution-shifts, it overfits to recent sessions, and it can entrench stale patterns the way a poorly tuned replay buffer entrenches old tasks. The continual-learning surveys (Wang et al. 2025; van de Ven et al. 2024) catalog these failure modes. They will appear in your Dreaming logs. Read those papers before you have your first incident, not after.

**Resist letting the marketing term colonize the diagnostic vocabulary.** "Infinite context" is a fine pitch-deck term. It is a terrible incident-review term. The diagnostic vocabulary needs to stay precise, because the fix for eviction differs from the fix for forgetting differs from the fix for stale curation.
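To see why the precision pays, route one incident under each label. A sketch with illustrative categories and owners; the taxonomy is mine, not Anthropic's.

```python
"""Sketch of routing a 'the agent forgot' incident to a fix owner.
Categories and owners are illustrative, not an established taxonomy."""
from enum import Enum


class Failure(Enum):
    FORGETTING = "weight-level degradation"           # fix: retrain or fine-tune
    EVICTION = "fact removed from a memory layer"     # fix: chase the evicting component
    STALE_CURATION = "outdated playbook outranks fresh one"  # fix: re-curate and re-rank


def route(failure: Failure) -> str:
    return {
        Failure.FORGETTING: "model team: build eval harness, schedule fine-tune",
        Failure.EVICTION: "memory infra team: audit eviction logs, restore the fact",
        Failure.STALE_CURATION: "curation team: retune Dreaming scoring, demote stale entries",
    }[failure]
```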
Three different fixes, three different teams, three different timelines. One marketing phrase covers all of them and tells you nothing about which one you are looking at.

The convergence is happening. The conflation is still expensive. Anthropic shipped a feature that earns the "feels infinite" framing, and builders who let that framing colonize their incident reviews will pay for it at the seams. The seams are where the work lives.

---

### Related Articles

- [[AI Development & Agents/anthropic-multi-agent-research-system|Anthropic's Multi-Agent Research System]]
- [[AI Development & Agents/ambient-intelligence-intent-over-instructions|When AI Knows What You Mean, Not Just What You Say]]
- [[AI Development & Agents/making-claude-code-more-agentic|Making Claude Code More Agentic]]

---

<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>

<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>