From a DGX Spark to a Borrowed Cluster: A Retinal-AI Lab, Built in Public
NVIDIA, through its Inception program for startups, looked at what SocialEyes is building, believed in the mission, and handed us a cluster of eight H100s for a couple of months. When a partner does that, you owe them one thing in return: make the most of every hour. This series is the build log of how we do it, written in public as it happens.
I want to start by saying who "we" is, why I'm in the room, and what you'll get out of following along.
What SocialEyes is working on
SocialEyes is a small group building AI that reads systemic health from retinal images. The eye is the one place in the body where you can photograph blood vessels and neural tissue directly, non-invasively, with a camera that already sits in a lot of clinics. [1] Signals in that image track things that have nothing to do with eyesight: cardiovascular risk, metabolic state, markers that normally need a blood draw.
[1] This area is usually called "oculomics." The reference point most people know is RETFound (Zhou et al., Nature 2023), a foundation model trained on retinal images that transfers to a range of downstream tasks.
The bet is that a strong enough foundation model over retinal images can surface those systemic signals from a single photo. Not as a replacement for a blood test, but as a cheap, scalable first look that works wherever there's a camera. The training data is grounded in public sources, including AI-READI [2], so the methodology can be talked about openly even when specific results stay in the lab.
[2] AI-READI is a public, NIH-funded multimodal dataset built for AI-ready diabetes and health research. Working from a public dataset is deliberate: it keeps the methodology shareable.
The eye is the cheapest window we have into the rest of the body. SocialEyes is trying to read through it.
Why I'm in the room
I don't run SocialEyes. I work with them, and have since the start of this year. They found me through the open work, the DGX Spark experiments and the writing on this blog, and reached out. I started helping, I dug the mission, and now we're doing genuinely cool things with this NVIDIA grant.
That origin matters to me, because it's the thing I keep arguing for in public: build the real thing, share it, and the right people show up. [3] A working system and an honest write-up change the conversation from "should we" to "how do we scale this." This collaboration is what that looks like when it works.
[3] More on why I build the way I do in my "I'm Justin Johnson, I Build Things" piece on Run Data Run.
From a small box to a borrowed one
Here's the through-line for anyone who followed the DGX Lab series: the small box came first.
A DGX Spark on a desk, 128GB of unified memory, is a real research rig at a tiny fraction of cluster cost. [4] Most of the hard questions, which serving stack, which model, how to make local inference behave, got beaten out on that small box first, where a mistake costs minutes instead of metered GPU-hours. The week-one stack work ("DGX Spark: Week One Update - Finding the Right Stack," Oct 28, 2025) and the benchmark reality check ("DGX Spark Benchmarks: 82,739 tokens/sec on Paper, the Production Reality," Oct 26, 2025) were all on the Spark.
[4] The DGX Spark is NVIDIA's desktop GB10 machine. The catch is that its GPU architecture (sm_121) needs special handling that the data-center H100 (sm_90) does not, which turns out to matter a lot when you move between them.
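The architecture difference between the Spark (sm_121) and the H100 (sm_90) is the kind of thing worth checking at the top of a script before anything architecture-specific runs. Here is a minimal sketch, assuming PyTorch is the framework in use (the post does not say). The names `arch_tag`, `detect_arch`, and `KNOWN_ARCHS` are hypothetical helpers for illustration; `torch.cuda.get_device_capability` is the real PyTorch call that reports the (major, minor) compute capability.

```python
from typing import Optional, Tuple

# Hypothetical lookup for illustration: the two architectures named in this post.
KNOWN_ARCHS = {
    "sm_90": "H100 (data center)",
    "sm_121": "DGX Spark GB10 (desktop)",
}


def arch_tag(capability: Tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into an sm_XY tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def detect_arch() -> Optional[str]:
    """Return the sm_XY tag of GPU 0, or None if PyTorch or a GPU is unavailable."""
    try:
        import torch  # assumption: PyTorch is installed on the machine
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return arch_tag(torch.cuda.get_device_capability(0))
```

Calling `detect_arch()` once at startup and looking the result up in `KNOWN_ARCHS` gives a cheap way to branch on which machine the code landed on, or to fail early on one it was never tested against.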
The borrowed 8xH100 cluster, an NVIDIA DGX Cloud node we reach through the grant, gets to stand on all of that. The lineage is the point: the homelab box is the cheap R&D rig that de-risks the expensive one. Small box teaches the method, big box runs it at scale.
An autonomous research engine in the loop
There's a second character in this story. Alongside the modeling work, we run an autonomous research engine, ARIA, that reads the literature, proposes experiments, and helps decide what's worth a GPU-hour. ARIA cut its teeth on the small box, including the night it tried to allocate 260GB on a 128GB GPU, crashed the DGX, and we built GPU monitoring in five minutes ("When ARIA Crashed the DGX: Building GPU Monitoring in 5 Minutes," Jan 17, 2026), and the run where it fired 151 experiments overnight ("AutoResearch on Blackwell GB10: 151 Experiments Overnight," Mar 14, 2026). It's running on the cluster now. There's a paper on it in the pipeline, and I'll link it here when it's out.
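The crash story, a 260GB allocation attempt on a 128GB machine, suggests the simplest possible guard: compare the request to what the box has before launching anything. The sketch below is not the monitoring system from that post, just an illustration of the idea. The function name `fits_in_memory` and the 10% default headroom are assumptions of mine.

```python
GB = 10 ** 9  # decimal gigabytes, matching how the sizes are quoted above


def fits_in_memory(requested_bytes: int, total_bytes: int, headroom: float = 0.10) -> bool:
    """Return True if the request fits while leaving `headroom` of memory free.

    `headroom` is a fraction of total memory reserved for everything else
    running on the machine. The 10% default is an illustrative assumption.
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable = int(total_bytes * (1.0 - headroom))
    return 0 <= requested_bytes <= usable


# The failure case from the story: 260GB requested on a 128GB machine.
print(fits_in_memory(260 * GB, 128 * GB))  # prints False
```

An agent that proposes its own experiments can run a check like this on each proposal and reject the ones that cannot fit, so an oversized run is refused before it starts.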
The reason it earns its place: on a metered, borrowed cluster, the expensive mistake is running the wrong experiment. Reading the right papers first is cheap. That discipline, research before compute, is a thread that'll run through several posts in this series.
What the next couple of months look like
The plan, at a high level:
- Stage the data onto the cluster cleanly, terabytes of public retinal imagery, without the hygiene mistakes that cost a day.
- Train our own retinal foundation model, with the recipe locked from the literature before the first training run.
- Run a battery of experiments across several research directions, with the autonomous engine helping triage them.
- Keep the platform productionized, not a pile of one-off scripts, so every GPU-hour the grant gives us turns into real work.
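The post does not say which hygiene mistakes it has in mind for the data-staging step. One common safeguard, shown here purely as an illustration, is to hash every file at the source and verify the copy on the cluster against that manifest. This is a minimal sketch using only the standard library. The helper names `sha256_of`, `build_manifest`, and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large images never load fully into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root, POSIX style) to its SHA-256."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose hash differs."""
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

Build the manifest where the data originates, ship it alongside the files, and run `verify` on the cluster before the first training job. An empty list means every file arrived intact.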
The next post gets concrete: standing up the box and getting the whole team working on it at once, without anyone tripping over anyone else.
What this log is for
Each entry is a self-contained engineering story, scrubbed of anything proprietary and written so you can apply it to your own borrowed or rented GPUs. You don't need to care about retinas to get value from it.
The point of writing it in public is that almost none of the hard-won lessons are specific to retinal AI. How you get the most out of a borrowed cluster on a deadline. How you give a team GPU access without handing out the keys to wipe it. How you stage a terabyte of data without guessing your way into a wasted day. How you serve a reasoning model locally with speculative decoding when the docs lag the code. Those transfer to anyone standing up real work on borrowed hardware.
So that's the setup. A mission I believe in, a partner who backed it with serious compute, and a standing invitation to watch us make the most of it in public. New parts land as the work happens. Follow along.
Related reading on this site: the DGX Lab series opener ("My AI Linux Expert: How Claude Code Suggested a 95,000x Faster Solution," Oct 21, 2025) for where the small-box lineage starts, "The Hidden Crisis in LLM Fine-Tuning: When Your Model Silently Forgets Everything" (Oct 23, 2025) for the catastrophic-forgetting problem that shapes how we'll train the retinal model, and "How I Delegated a 9-Day Medical AI Experiment (and Learned When to Step In)" (Oct 28, 2025) for an earlier medical-AI build in this same style.