Practical Applications · June 30, 2026 · 9 min read · shipped

Keeping the Node Smoking: What Eight H100s Buy, and the Engine That Keeps Them Full

Run this

claude "On a machine with several GPUs, write a small program that keeps every chip busy: check how busy each GPU is, and whenever one goes idle, pull the next job from a waiting list, hand it that chip, run it, and release the chip when the job finishes. Make it refuse a chip another job already holds, and reclaim a chip whose job has died."

claude code

A borrowed node is a clock, not a trophy. NVIDIA, through its Inception program, gave SocialEyes eight H100 GPUs for a couple of months. The day that happens, the job changes. The hardware is not the impressive part. Anyone can rent a fast GPU by the hour. The impressive part is whether, when the two months are up, every one of those eight chips ran something worth running for nearly every hour we had it.

That is harder than it sounds. A GPU sitting idle on a borrowed clock is pure waste, and idle is where a big machine drifts the second you stop feeding it. You can watch ours not drift, right now, on the Borrowed Iron scoreboard, a public page that shows what the node is doing live. This post is the two halves of keeping it busy: what the machine gives you that a smaller one can't, and the system we built so no chip ever sits cold. I'll explain the technical bits in plain terms as they come up, so skip those asides if they're old news to you.

What the machine actually gives you

Here are the numbers, then what they mean.

The node is eight H100 chips.¹ Each chip carries 80 gigabytes of the fastest memory made, the kind that sits right against the processor so data barely has to travel.² Eight of them together is 640 gigabytes of that memory in one box. Add to that enough raw throughput to do roughly sixteen quadrillion math operations a second,³ two terabytes of ordinary system memory, and seventeen terabytes of fast local storage. You can see every bit of it lit up and working on the scoreboard; it draws those numbers straight off the machine and refreshes daily.

¹ A GPU is the chip that does the heavy, repetitive math that training an AI model is mostly made of. The H100 is NVIDIA's data-center model, the generation most large models today were trained on.

² The technical name is HBM, high-bandwidth memory, stacked right on the chip package. 80GB per chip, 640GB across the eight. The "bandwidth" part, how fast data moves in and out, is usually what caps training speed, which is why it counts for more than raw capacity.

³ That's 15.8 petaflops in FP8, the low-precision number format AI training leans on. "Peak" is the theoretical ceiling; real work runs under it, which is part of why keeping the chips fed counts for so much.

Numbers like that stay abstract until you stand them next to something. So here is the comparison I find most useful.

I came into this from the small end. The whole approach got worked out first on a DGX Spark, NVIDIA's desktop machine, which has 128 gigabytes of memory and a single GPU. It's a genuinely good research rig, and most of the hard questions got answered on it, where a wrong turn costs minutes instead of money on a borrowed clock. But it is one chip. Next to the Spark, the node has roughly five times the memory and around fifty times the speed of moving that memory around, which for this kind of work is the number that bites.
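The memory side of that comparison is plain arithmetic on figures already quoted in this post, sketched here as a quick check. The per-chip throughput is derived by dividing the node total by eight, not taken from a spec sheet, and the bandwidth ratio is left out because the post gives no per-machine bandwidth figures to compute it from.

```python
# Figures quoted in the post.
CHIPS = 8
HBM_PER_CHIP_GB = 80
SPARK_MEMORY_GB = 128
NODE_FP8_PETAFLOPS = 15.8  # peak: the theoretical ceiling, not a measured rate

node_hbm_gb = CHIPS * HBM_PER_CHIP_GB             # 640
memory_ratio = node_hbm_gb / SPARK_MEMORY_GB      # 5.0
ops_per_second = NODE_FP8_PETAFLOPS * 1e15        # about 1.58e16
per_chip_petaflops = NODE_FP8_PETAFLOPS / CHIPS   # derived, not a quoted spec

print(f"{node_hbm_gb} GB of HBM across the node")
print(f"{memory_ratio:.1f}x the Spark's memory")
print(f"{ops_per_second:.2e} operations per second at peak")
print(f"{per_chip_petaflops:.3f} petaflops per chip at peak")
```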

Here is what that buys, in everyday terms. On the small box, testing a batch of ideas is a single-file line: you run one setup, wait, read the result, run the next. You are always waiting on the one chip. On the node, the same batch is a grid. Eight setups run at the same time, each on its own chip, and the whole study finishes in the time one of them used to take. The work didn't get smarter. It got eight-wide.
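As a rough illustration of "eight-wide," here is a minimal sketch that starts one process per setup and pins each to its own chip. It assumes the jobs respect the `CUDA_VISIBLE_DEVICES` environment variable, which CUDA programs do; the function names and the `train.py` command in the usage note are hypothetical.

```python
import os
import subprocess


def launch_grid(commands, gpu_ids):
    """Start one process per command, each pinned to its own GPU."""
    if len(commands) > len(gpu_ids):
        raise ValueError("more setups than chips; queue the rest instead")
    procs = []
    for gpu, cmd in zip(gpu_ids, commands):
        # Each child sees exactly one GPU, so setups cannot collide.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(cmd, env=env))
    return procs


def wait_all(procs):
    """Block until every process exits; return the exit codes in order."""
    return [p.wait() for p in procs]
```

A call might look like `wait_all(launch_grid([["python", "train.py", "--lr", lr] for lr in rates], range(8)))`, where `train.py` and `rates` stand in for whatever study is being run. The whole grid finishes when the slowest setup does.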

The memory buys a quieter win. With 80 gigabytes on each chip you stop fighting the things that eat a small-box day: you can hold a bigger model, feed it more images at once, and train on longer stretches of data without the constant shuffling a smaller memory forces. And one chip can run a chatbot-grade model locally, fast, while the other seven train, which is why our research engine has a brain to think with that costs nothing per use. (That local setup is its own story.)

The small box didn't get replaced by the big one. It taught the method the big one now runs at scale. The data-center chip even made things simpler, not harder: the desktop Spark needed special handling at every turn, while the H100 is the standard part everything is built for, so plain installs just work. Cheap box to learn on, expensive box to run on.

An idle chip is a bug

A machine that can run eight things at once will run zero the moment nobody is watching. Keeping it full isn't a switch you flip, it's a habit you keep, and on a shared box with three of us and an autonomous agent all launching work, the habit needs structure or it falls apart into clashes and gaps.

Two pieces hold it together.

The first is a written rule for who gets which chip. We wrote it down early and put it where every person and program on the box can read it: one chip for the always-on local model, one for the research engine's own experiments, the rest a shared pool for training. A small tool hands out chips from that pool, runs the job, and gives the chip back when it's done. It refuses a chip someone already holds, and it reclaims a chip whose job has died. The instant two parties share a machine like this, agreeing who gets what stops being optional, and the cheapest version of that agreement is a written rule plus a tool that enforces it. (We learned this the loud way, with two jobs grabbing the same chip, which is the usual way these rules get written.)
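One way to enforce a rule like that is a lock file per chip that records which process holds it. What follows is a minimal sketch under that assumption, not the actual tool: `ChipPool`, `ChipBusy`, `run_on_chip`, and the lock-file layout are all hypothetical names, and the liveness check relies on POSIX `os.kill(pid, 0)` semantics.

```python
import os
import subprocess


class ChipBusy(Exception):
    """Raised when a chip is outside the pool or held by a live job."""


def pid_alive(pid):
    """POSIX check: signal 0 tests for existence without sending anything."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


class ChipPool:
    def __init__(self, lock_dir, pool, is_alive=pid_alive):
        self.lock_dir = lock_dir
        self.pool = set(pool)
        self.is_alive = is_alive
        os.makedirs(lock_dir, exist_ok=True)

    def _path(self, gpu):
        return os.path.join(self.lock_dir, f"gpu{gpu}.lock")

    def holder(self, gpu):
        """Return the PID recorded for this chip, or None if it is free."""
        try:
            with open(self._path(gpu)) as f:
                return int(f.read())
        except (FileNotFoundError, ValueError):
            return None

    def claim(self, gpu, pid):
        if gpu not in self.pool:
            raise ChipBusy(f"GPU {gpu} is not in the shared pool")
        held_by = self.holder(gpu)
        if held_by is not None:
            if self.is_alive(held_by):
                raise ChipBusy(f"GPU {gpu} is held by PID {held_by}")
            self.release(gpu)  # the holder died: reclaim the chip
        try:
            # O_EXCL makes creation atomic, so two claimers cannot both win.
            fd = os.open(self._path(gpu), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise ChipBusy(f"GPU {gpu} was claimed by someone else first")
        with os.fdopen(fd, "w") as f:
            f.write(str(pid))

    def release(self, gpu):
        try:
            os.remove(self._path(gpu))
        except FileNotFoundError:
            pass


def run_on_chip(pool, gpu, cmd):
    """Claim a chip, run the job pinned to it, and always give it back."""
    pool.claim(gpu, os.getpid())
    try:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        return subprocess.run(cmd, env=env).returncode
    finally:
        pool.release(gpu)
```

The lock records the wrapper's PID rather than the job's, so if the wrapper is killed the lock goes stale and the next claimer reclaims the chip. The reclaim step has a small race if two claimers find the same stale lock at the same instant; a production version would want a stricter handshake there.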

The second is an auto-filler. The rule stops clashes, but it doesn't stop gaps: a chip can finish its job and just sit there. So a small program keeps an eye on the chips, and the moment one goes quiet it hands it the next job from a waiting list. The pool that follows the rule now refills itself. The effect is that the box stays warm between work sessions and over weekends, when no human is steering it at all. The bones of that program are in the claude prompt at the top of this post, ready for any multi-GPU machine you'd rather not babysit.
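The gap-closing loop can be sketched in a few lines. This is an illustration under stated assumptions, not the program on the node: `is_idle` and `launch` are hypothetical hooks you supply, and a launched job only needs a `poll()` method that returns `None` while it is still running, the way `subprocess.Popen` objects behave.

```python
import time
from collections import deque


def fill_idle_chips(pool_gpus, running, queue, is_idle, launch):
    """One pass: give each free, idle pool chip the next queued job."""
    started = []
    for gpu in pool_gpus:
        job = running.get(gpu)
        if job is not None and job.poll() is None:
            continue                # chip still has a live job
        running.pop(gpu, None)      # job finished or died: free the slot
        if not queue or not is_idle(gpu):
            continue                # nothing to run, or the chip is busy elsewhere
        running[gpu] = launch(queue.popleft(), gpu)
        started.append(gpu)
    return started


def run_filler(pool_gpus, queue, is_idle, launch, interval_s=30.0, sleep=time.sleep):
    """Keep making passes until the queue is empty and every job has exited."""
    running = {}
    while queue or running:
        fill_idle_chips(pool_gpus, running, queue, is_idle, launch)
        sleep(interval_s)
```

This version drains a fixed `deque` and then stops. A filler meant to run over a weekend would loop forever and re-read the waiting list on each pass, and `launch` would go through whatever tool enforces the chip rule so the filler can never double-book.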

Interactive · Keeping the lanes full

Lanes: 0 warm brain · 1 ARIA · 2-7 training

Readout: 93% node busy · 8 of 8 lit · fully fed

The filler is on. When a training card finishes its job and drops to idle, a small program hands it the next job off the waiting list and it lights back up. The node stays pinned near full on its own, even with no human watching.

Decorative animation. Lane 0 is the always-on warm brain, lane 1 is ARIA, lanes 2-7 are the training pool the filler keeps fed.

Flip the filler off in that panel and watch what a borrowed node does on its own: the training cards finish their jobs and just sit there, dark, one by one. That drift is the entire problem this post is about. The filler is what keeps it from happening.

Why you check twice before calling a chip idle

There's a standard readout that tells you how busy each GPU is. The catch: a chip can read 0% for a heartbeat between two steps, or while it loads the next batch of images, and a careless watcher will call it idle and double-book it with a second job. So you look twice, a few seconds apart, before you act. A chip that's quiet in both looks is genuinely free; a chip that was 0% and is now 98% was just catching its breath. We caught more than one false alarm this way before it turned into a wrong move.
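Here is a minimal sketch of the look-twice check. It assumes the readout is `nvidia-smi` queried in CSV form, which is one common way to get per-GPU utilization; the function names, the five-second gap, and the 5% threshold are illustrative choices, not the values used on the node.

```python
import subprocess
import time


def parse_utilization(text):
    """Turn lines like '0, 98' into {gpu_index: utilization_percent}."""
    util = {}
    for line in text.strip().splitlines():
        index, percent = line.split(",")
        util[int(index)] = int(percent)
    return util


def read_utilization():
    """Ask nvidia-smi for each GPU's index and current utilization."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilization(out)


def idle_chips(read=read_utilization, gap_s=5.0, threshold=5, sleep=time.sleep):
    """Return the GPUs that were quiet in two looks taken gap_s seconds apart."""
    first = read()
    sleep(gap_s)
    second = read()
    # A chip missing from the second look is treated as busy, to be safe.
    return sorted(
        gpu for gpu in first
        if first[gpu] <= threshold and second.get(gpu, 100) <= threshold
    )
```

A chip that reads 0% and then 98% fails the second look and is left alone. Two looks are still a heuristic: a job stuck in a long data-loading stall can read quiet both times, so pairing this with a record of who holds each chip is the safer combination.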

The engine that never lets the queue run dry

The waiting list is only as good as what's on it, and the deepest way to keep the node full isn't a program at all. It's having something that never stops coming up with the next thing worth trying. For us that something is ARIA, our autonomous research engine, running alongside the modeling work on the two chips set aside from the training pool: the local model it thinks with, and the chip for its own experiments.

If you're new to ARIA, we've written about it a lot: how it woke up on this very node, the night it crashed our desktop box and we built monitoring in five minutes, and the run where it fired off 151 experiments overnight. On Run Data Run there's the plain-English version, Inside ARIA: teaching a machine to do science, and the idea underneath all of it, The Overnight Loop: try, measure, learn, repeat, while you sleep.

In short, ARIA reads the research literature, comes up with experiments, runs them, checks its own work for cheating,⁴ and writes the clean results up in a form the modeling side picks up automatically. On a node you're trying to keep full, that role is bigger than "a second researcher." It's what keeps the work pipeline from ever drying up, because it's generating things to try around the clock while the humans sleep.

⁴ The "cheating" check is a leakage control: we scramble the answer key and re-run the test. If the model still seems to "succeed" on scrambled answers, it was memorizing rather than learning, so the result gets thrown out. More on this in the gate aside below.
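The scrambled-answer-key check described above can be sketched generically. This is an illustration of the idea, not ARIA's code: `evaluate` is a hypothetical callable that runs a whole experiment (train, then score) on the examples and labels it is given, and `chance` and `margin` are values you would pick for the task at hand.

```python
import random


def leakage_check(evaluate, examples, labels, chance, margin=0.05, seed=0):
    """Re-run an experiment with the labels shuffled.

    Returns (real_score, scrambled_score, passed). The check passes only
    if the scrambled run falls back to roughly chance level.
    """
    rng = random.Random(seed)
    scrambled = list(labels)
    rng.shuffle(scrambled)  # break the link between each example and its answer

    real_score = evaluate(examples, labels)
    scrambled_score = evaluate(examples, scrambled)

    # A pipeline that still "succeeds" on scrambled answers is getting its
    # score from somewhere other than the signal, so the result is thrown out.
    passed = scrambled_score <= chance + margin
    return real_score, scrambled_score, passed
```

For a balanced two-class task, `chance` would be 0.5. The margin is a judgment call: too tight and honest noise fails the check, too loose and a small leak slips through.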

This week it got a sharper job. Most of ARIA's experiments are quick, cheap tests, the kind that can look thrilling and still be too small to trust. So we taught it to promote its own winners: when a quick test points somewhere promising, ARIA automatically re-runs a bigger, more careful version, several times over to be sure it wasn't a fluke, checks that for cheating, and only if the result holds up does it hand the idea to the modeling team as a candidate worth real training time. The modeling side reads those handoffs on its own.
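A promotion step like that might be shaped roughly as follows. Everything specific here is an assumption made for the sketch: `run_full` is a hypothetical callable that trains the bigger version for one seed and returns a score, and the pass rule (every repeat must beat the baseline by `min_gain`) is just one reasonable way to define "holds up," not ARIA's actual bar.

```python
def promotion_gate(run_full, baseline, seeds=(0, 1, 2), min_gain=0.01,
                   passes_leakage_check=None):
    """Re-run a promising idea at full size and decide whether to hand it off."""
    try:
        scores = [run_full(seed) for seed in seeds]
    except Exception as exc:
        # A technical failure is recorded separately from a scientific one.
        return {"verdict": "error", "reason": repr(exc), "scores": []}

    # Requiring every repeat to clear the bar guards against a lucky seed.
    if min(scores) < baseline + min_gain:
        return {"verdict": "dropped", "reason": "did not hold up at full size",
                "scores": scores}

    if passes_leakage_check is not None and not passes_leakage_check():
        return {"verdict": "dropped", "reason": "failed the leakage check",
                "scores": scores}

    return {"verdict": "promoted", "reason": "held up across repeats",
            "scores": scores}
```

Keeping "error" apart from "dropped" matters for reading the tally later: a crash says nothing about whether the idea was any good, so it can be retried once the bug is fixed.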

The strictness of that handoff bar is what makes it trustworthy instead of just more noise. Of the first handful of ideas ARIA promoted this way, none have cleared the bar yet. Two were dropped when the exciting little result fell apart at full size, and two hit technical errors. Zero winners sounds like a failure. It's the opposite. The bar is doing exactly its job: spending a little cheap compute to kill an idea that only looked good because it was small, before it costs a full training run on the expensive chips. Keeping the node full was never about keeping it busy with junk.

So keeping the node full has three layers stacked together. The written rule stops clashes. The auto-filler closes the gaps. And the research engine keeps the waiting list stocked with experiments that earned their spot. The machine stays warm because something is always feeding it, and that something is more and more choosing what to run on evidence instead of guesswork.

A live window onto the machine

I linked the scoreboard up top because you should be able to see the node working, not just take my word for it. Here is what it is, and why we built it.

It's a different kind of thing from these posts. The posts are written stories; the scoreboard is a live page. It reads a fresh snapshot off the machine every day and redraws itself, so the numbers are current without anyone typing them in. It shows the eight chips and what each is running, how much of the machine's math power is lit up versus sitting available, how many words the local model has generated, and a panel on what the research engine has been up to: papers read, experiments run, and how cleanly they ran. It even works out what those locally-generated words would have cost on a commercial AI service, against the zero it costs to run an open model on hardware you already hold.

It's also a small lesson in publishing from a private project. The node sits inside confidential research, so the page is built one field at a time from an approved list, and only two kinds of numbers are cleared to leave the building: what the compute is doing, and what the research engine is doing. The science it produces, the costs, the specifics of the data, none of that crosses to the public page, and that line is enforced in the code that builds the page, not just hidden in how it looks. You can put a public window on a private machine, but only if you decide, on purpose and number by number, what's allowed through it.
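Building "one field at a time from an approved list" is an allowlist, and a minimal sketch shows why it is safer than hiding things in the page layout. The field names and the snapshot structure below are hypothetical; only the two cleared categories, compute and the research engine, come from the text.

```python
# Hypothetical approved list: only these fields may reach the public page.
PUBLIC_FIELDS = {
    "compute": ("gpus_busy", "gpu_utilization_percent"),
    "research_engine": ("papers_read", "experiments_run"),
}


def build_public_snapshot(private_snapshot, allowed=PUBLIC_FIELDS):
    """Copy approved fields only; everything else is dropped by default."""
    public = {}
    for section, fields in allowed.items():
        source = private_snapshot.get(section, {})
        # Fields are named one by one, so a new private field cannot
        # leak just because someone added it to the snapshot.
        public[section] = {name: source[name] for name in fields if name in source}
    return public
```

The design choice is the direction of the default. A blocklist publishes anything nobody remembered to forbid; an allowlist publishes nothing until someone decides, on purpose, that a number may leave.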

What keeping it full is actually for

None of this is the goal. It's the groundwork. A node kept smoking is the setting in which the actual job, training our own model to read retinal images, gets enough attempts to land a result.

And it has started to. The clearest sign so far is simple to say: the same retinal images, trained the same way, score worse than a strong off-the-shelf model when we train on a short budget, and better than it when we train on a full one. The signal was in the images the whole time. Only enough compute, spent over enough runs, pulled it out. That single result is the entire case for keeping the node full, and it's the subject of the next post, where the training gets the detailed treatment it's earned.

The machine is borrowed and the clock doesn't stop. The discipline of keeping it full, a written rule, an auto-filler, and an engine that never runs out of good experiments, is ours to keep. Watch it run on the scoreboard; the next post is the science it's been feeding.

Related reading on this site: the DGX Lab series opener for where the small-box story starts, the day-one inference platform post for the local model that gives the engine its brain, and waking the research engine for how ARIA first got onto the node.

2,623 words · 9 min read

gpu
h100
dgx-spark
gpu-utilization
autonomous-agents
build-in-public