Waking the Research Engine: Three Walls, One Experiment
The last post ended with a warm model serving tokens on the node at speed. That felt like the finish line. It was not. A model that answers is not a lab that runs, and the gap between the two turned out to be three walls, one after another, each invisible until the one before it came down.
The thing trying to wake up is ARIA, our autonomous research engine.1ARIA reads the literature, proposes experiments, scores them, and promotes the strongest into real GPU runs. On this project it decides what's worth a metered GPU-hour. It cut its teeth on the desktop DGX Spark; the node is where it gets to run at scale. Its whole job is to read, propose, score, and promote experiments without a human in the loop for each step. None of that happens if it can't think, can't talk to the model, or can't trust its own scores.
Wall one: it had no brain
ARIA thinks by spawning headless Claude Code processes.2The engine shells out to claude -p children to do its reasoning. That means a real Claude Code install, not just an API key, has to exist on the box. The node shipped with none of it. No Node.js, no claude binary.
That part was mechanical: Node, then Claude Code, then the safety hooks. The hooks are the part that takes judgment, not typing. They sit in front of every shell command the agent issues and block the destructive ones, and they block content that shouldn't leave the building. We carried the pattern over from our existing engine and swapped the confidentiality denylist for one specific to this project. An autonomous agent on a shared node without those hooks is not a tool, it's a liability.
Wall two: the gateway bug that only big requests triggered
With a brain installed, every request the agent made to the local gateway came back a server error. And here's where it got strange: a plain hand-written request to the exact same gateway worked fine.
That contradiction is the whole story, and I've put the chase in a side quest above. The short version: the agent's requests carried two dozen tool definitions, the manual test carried none, and a version of the gateway routed tool-laden requests down an experimental path that the serving engine rejected. One config flag forced the normal path and the wall came down.
Wall three: the model was eating its own answers
The engine came alive and started scoring ideas. Every single idea came back with the same score: a flat, neutral default across every dimension. A scorer that rates everything identically is not scoring, it's failing silently.
The warm model is a reasoning model. It thinks in one field and answers in another.3Reasoning models emit chain-of-thought into a separate reasoning_content channel, then the actual answer into content. If the token budget runs out during thinking, content arrives empty and any parser downstream sees nothing. The critic was calling it with a tight token cap, the kind of number that's fine for a normal model. The thinking consumed the entire budget before the answer field got a single token, so the parser got nothing and fell back to the neutral default. Raising the budget to a few thousand tokens turned the flat defaults into real, varied scores immediately.
Patching the one critic call would have left every other reasoning-model call in the codebase carrying the same latent bug. We fixed it at the shared gateway choke points instead, with a token floor, and added a guard test that statically scans the code for any reasoning-model call with a starved budget. That scan immediately flagged four more sites nobody had noticed. Fix the class, then make the class un-reintroducible.
A lane contract for the GPUs
Somewhere in here a six-GPU training run kicked off on the same box, with no record of which cards were in use. On a shared node that's a silent collision waiting to happen: two processes grabbing the same device, both crashing, neither sure why.
So before going further we wrote a lane contract and committed it where every agent and human reads it. One card for inference, one for the engine's experiments, the rest a training pool. A small tool hands out cards from the pool, sets the device mask, runs the job, and releases on exit, refusing a card that's already claimed and reclaiming any card whose process died. The moment two parties share a GPU box, coordination stops being optional, and the cheapest version is a contract plus a tool that enforces it.
Wiring it to the literature
An engine that proposes experiments without reading first just generates plausible-sounding noise. We gave it three research tools: one for novelty checks against published work, one for the biomedical literature and trials, and one for machine-learning methods.4Novelty and biomedical lookups run over OpenAlex and a biomedical MCP server (PubMed, trials, genes, diseases); methods come from arXiv. We left the protein and drug-discovery tools off as off-mission for retinal imaging and slow to start. The test that it worked was watching it pull live sources before proposing an idea, instead of inventing citations.
There was one deliberate compromise. The engine is designed to score with a panel of different model families, so the critic isn't correlated with the producer. Only the warm model was up; the other tiers weren't leased yet, so every cross-family call was erroring. Rather than block on that, we ran it in single-model mode, with the warm model playing every role, and logged it loudly as a temporary quality tradeoff with a one-line switch to restore the panel. Shipping a known, reversible compromise beats waiting for perfect.
The four-layer dig to a first experiment
Here's the state that should have been alarming and wasn't, because everything looked healthy: the engine was "running." It produced scored ideas, the logs were green, no errors. And after thirteen cycles it had promoted exactly zero of them into an experiment. Its experiment lane sat empty.
Getting from there to a first autonomous experiment was a dig through four layers, each hiding the next.
The promote bar was set above the ceiling. The engine promotes any idea scoring above a bar. The startup bar shipped at 8.0. But organic ideas, the kind grounded in real evidence, top out near 6.8, because the evidence only supports so much confidence. The bar was higher than anything the engine could actually produce, so PROMOTE could never fire. The engine wasn't stuck, exactly. It was doing precisely what it was told, forever, to no end.
This is the bug. The startup bar shipped at 8.0, but organic ideas top out near 6.8 because the evidence only supports so much confidence. The bar sits above the ceiling, so PROMOTE never fires and the engine runs ideas into the void.
Dots are the real composite scores from the engine's first ignition run.
The fix was already in the design and never firing: recompute the bar as a percentile of the live score distribution instead of a fixed constant. An adaptive bar lands within reach of the real ideas, so the best ones clear it.
The verifier couldn't read resources. With the bar fixed, every idea then failed verification with "could not extract resources." Same root cause as wall three, a reasoning-model call with a starved token budget, in a different place. This is exactly why we'd made the fix a class-wide floor with a static scan: the scan had already flagged this site. Closing it let verification actually read what each experiment needed.
The labels didn't exist. Now verification ran, and reported that the dataset the ideas wanted had no labels. The images were staged, but the table of biomarker values the experiments expected wasn't there. The raw clinical data was on the box, in a standard medical format, and building usable labels from it had one trap worth the whole side quest.
The images were the wrong file type. Last layer. Verification passed, but the experiments would have died at run time, because the image-loading code expected one format and the staged images were in another. The fix wasn't to teach the engine the other format. A parallel eval pipeline had already extracted the same images into the format the loader wanted, so we pointed the engine at that shared corpus. One copy of the data, two consumers, no drift, no second extraction.
Then it worked. Verification passed, the adaptive bar recomputed itself down to where the ideas actually lived, the top idea cleared it, and the engine promoted it into a real experiment and started designing the run. The experiment lane still has to fill, the actual GPU run is a couple of steps ahead, but the loop closed: an idea became an experiment with no human pulling it across. The engine is detached and walking on its own now.
The next post is about the thing all of this was scaffolding for: starting the encoder training, and the research-before-compute discipline that decides what the node spends its hours on.
Follow the lab
Get the next experiment
Enjoyed the breakdown on Waking the Research Engine: Three Walls, One Experiment? New entries land roughly weekly. No digest, no roundup. Just the next build log, when it ships.
Related experiments
Apparatus
1,969 words · 12 min read
- autonomous-agents
- litellm
- reasoning-models
- llm-infrastructure
- research-automation
- build-in-public