AI Systems & Architecture7 min readshipped

The Composition Threshold: When a Bigger Skill Library Starts to Mean a Better Agent

The Composition Threshold: When a Bigger Skill Library Starts to Mean a Better Agent

TLDR

The why. A post crossed my feed claiming autonomous AI research orchestrated on 800 markdown skills. We run about fifty. The trained reflex, and mine, was to scroll past: in agent systems more is usually a worse agent, because an uncurated pile taxes attention, fights itself, and rots. A skill count is normally a vanity number.

The shape. The repo is the De-Anthropocentric Research Engine, DARE: 900-odd pure-markdown files, no application code, no framework, no runtime of its own. Claude Code is the runtime; the skills are the program. They sit in a strict four-layer hierarchy (Campaign, Strategy, Tactic, SOP) where each layer can only call the one directly beneath it.

The hard part. Reconciling the reflex with the repo. The count is a symptom of strict composition, not a substitute for it, and the line where "more" flips from liability to capability is a property of architecture, not a number. Finding that line, and showing it is structural rather than numeric, was the work.

Key takeaway. Judge a skill library by granularity, layering, working-set size, and whether the agent can author its own next skill, never by the headline count. Don't chase 800. Build the structure that makes 800 safe to reach.

A post crossed my feed claiming autonomous AI research orchestrated on 800 markdown skills. We run about fifty. My first instinct was to scroll past, because I have written before that in agent systems more is rarely better, and a big skill count is the kind of number that usually means someone confused inventory for capability. An uncurated catalog of skills is a tax: every one of them competes for the model's attention at selection time, a fraction will conflict, and most will quietly rot against a moved file or a dead API. By that logic, 800 should be 800 problems.

Then I opened the repo, and the number turned out to be measuring something other than what I assumed.

What 900 markdown files actually are

The project is the De-Anthropocentric Research Engine, DARE, and the architecture is the interesting part, not the count. There is no application code. No runtime. No framework in the usual sense. The entire system is 900-odd pure-markdown files, each a self-contained instruction set, and the thing that executes them is Claude Code itself. The skills are the program; the agent is the interpreter. That is the whole stack.

The files are not a flat folder. They sit in a strict four-layer command hierarchy, and the layering is the design that makes it hold together:

  • Campaign (45+): a full research phase. "Crystallize a direction." "Acquire the literature."
  • Strategy (200+): the iteration engines inside a campaign. The search-read-reflect loop, the coverage-saturation check, the rules for when to stop.
  • Tactic (120+): combinations. A "cross-domain collision" tactic wires several atomic operations into one creative move.
  • SOP (500+): single-responsibility atoms. Run one search. Score one hypothesis. Extract one analogy.

The rule across the layers is absolute: each layer calls only the one directly beneath it. A strategy never touches a tool directly; a tactic never decides research direction. That constraint is what keeps 900 files from collapsing into mud. Every skill has exactly one concern, and the only way up is composition.

So the count is not a pile. It is what disciplined decomposition looks like when you carry it all the way down to one-action atoms. The number is a symptom of the architecture, not a substitute for it, and that distinction is the entire reconciliation. A dumped catalog of 900 skills would be the liability I expected. A layered arsenal of 900 single-responsibility skills is something else.

Arsenal, not pipeline

This is what the larger library actually buys you, stated as the qualitative shift it is, because that is the real answer to what a bigger skill set gets an agent.

Every other autonomous research system the authors name (Sakana's AI Scientist, HKUDS's AI-Researcher, Agent Laboratory, Dolphin, ARIS) runs a fixed pipeline: literature, then gap, then hypothesis, then experiment, in that order, with the agent's autonomy confined to local choices inside a stage. If the experiment phase reveals the literature review missed a subfield, there is no mechanism to walk back and fix it.

DARE is not a pipeline. It is an arsenal the agent reads and then decides how to use. The ten research packages are freely composable; the agent reads a catalog after the direction is set and chooses which packages to invoke, in what order, and when to loop back, driven by the current research state rather than a hardcoded lifecycle. A gap-analysis campaign offers fifteen-plus detection methods. A creative-ideation campaign offers thirty-one-plus generation techniques. The agent selects among them per problem.

The README says why, and it is almost word-for-word the point I had been circling: the methods are there "not because more is better, but because different research problems demand different tools, and a system locked to one approach per phase cannot adapt." Not 800 versus 50 as a quantity. The shift from executing one prescribed path to navigating a space of composable moves with the authority to backtrack. You cannot get that behavior from five mega-skills, because there is nothing to compose. You get it from five hundred atoms and a discipline for combining them.

Two failure modes, one variable

A flat catalog of 900 marketplace skills and a layered arsenal of 900 research skills have the same count and opposite value. The difference is not the number. It is whether each skill is a single-responsibility atom inside a structure that controls what can call what, so only a handful are ever live for a given task. Composition discipline is the variable. The count just rides along with it.

Where more starts to mean better

So "more isn't better" is narrower than the slogan makes it sound. More is worse below a composition threshold and better above it, and the threshold is structural.

Below it, skills are coarse, overlapping, and flat. Every one you add widens the attention cost at selection time and the odds that two of them disagree, and nothing gets more capable because there was no composition to deepen. This is the marketplace case: security reviews of large community skill stores have flagged a sizable share of listings as risky, and the size of the store is the size of that surface. There, the count is the liability I assumed at the top.

Above it, skills are single-responsibility atoms inside a hierarchy that keeps the working set small while the reach grows. Adding an SOP does not tax every decision, because the layering means only the relevant few are ever in the room for a given task. Reach scales with the catalog; load does not. That decoupling is what DARE's four layers buy. Past that line, a larger library genuinely is a more capable agent, because each addition is a new composable move rather than a new thing to wade through.

The threshold is not a number. You can be over it at 80 well-factored skills and under it at 8,000 dumped ones. It is a property of how the skills are cut and constrained.

What this says about our fifty, and about evolvable agents

Our fifty skills are fine, but they are mostly a flat set with progressive disclosure bolted on, not a layered arsenal. The lesson from DARE is not "get to 800." It is to cut skills into single-responsibility atoms and impose the call-only-downward discipline, so the count can grow later without clogging the agent, and so composition becomes possible at all.

And there is a bigger point underneath, about where this is heading. If the skills are just markdown that the runtime reads, a question I have chased before about where an agent's context should live, then an agent capable of reading a composable arsenal is, in principle, capable of writing the next file in it. The same property that lets Claude Code execute 900 skills lets it author the 901st: observe a gap during a run, write a new single-responsibility SOP in the house format, and drop it into the arsenal for next time. That is the evolvable agent, and it only works on top of the composition discipline, not the count. A pile cannot be safely self-extended, because a new skill in a pile just deepens the mud. A layered arsenal can, because the new skill has exactly one concern and a defined place in the hierarchy. The architecture that makes a large library usable by a human author is the same architecture that makes it extensible by the agent itself.

DARE's own roadmap points this direction. Its next step is autonomous skill ablation: the agent identifies redundant, overlapping, or under-used skills and fuses them into fewer, stronger composites, pruning the library the way you prune a network, fewer parameters without losing capability. Reading the arsenal and editing the arsenal are the same capability aimed in two directions, and a layered structure is what makes the second one safe.

That is the through-line from DARE back to where I think agent design is heading. The interesting number was never 800. It was the granularity underneath it.

What to check on Monday

When you see a skill count, large or small, the count tells you nothing on its own. Four things tell you which side of the threshold it is on.

  • Granularity. Is each skill one action, or a kitchen-sink mega-prompt? Atoms compose; blobs collide.
  • Layering. Is there a rule for what can call what, or can anything invoke anything? Unconstrained call graphs are how a catalog turns to mud.
  • Working set. How many skills are actually live for a single task? Reach can be huge as long as load stays small.
  • Authorship. Can the agent add a skill in the same format it reads? If yes, the library compounds. If no, it only ever shrinks in relevance.

Chase those four, not the headline number. A well-factored fifty beats a dumped five thousand, and a well-factored nine hundred beats both, for the same reason: the agent is only ever as good as what it can compose, and composition is a property you build in, not a number you reach.

Follow the lab

Get the next experiment

Enjoyed the breakdown on The Composition Threshold: When a Bigger Skill Library Starts to Mean a Better Agent? New entries land roughly weekly. No digest, no roundup. Just the next build log, when it ships.