Practical Applications · June 24, 2026 · 7 min read · shipped

From a DGX Spark to a Borrowed Cluster: A Retinal-AI Lab, Built in Public

NVIDIA, through its Inception program for startups, looked at what SocialEyes is building, believed in the mission, and handed us a cluster of eight H100s for a couple of months. When a partner does that, you owe them one thing in return: make the most of every hour. This series is the build log of how we do it, written in public as it happens.

I want to start by saying who "we" is, why I'm in the room, and what you'll get out of following along.

What SocialEyes is working on

SocialEyes is a small group building AI that reads systemic health from retinal images. The eye is the one place in the body where you can photograph blood vessels and neural tissue directly, non-invasively, with a camera that already sits in a lot of clinics.¹ Signals in that image track things that have nothing to do with eyesight: cardiovascular risk, metabolic state, markers that normally need a blood draw.

¹ This area is usually called "oculomics." The reference point most people know is RETFound (Zhou et al., Nature 2023), a foundation model trained on retinal images that transfers to a range of downstream tasks.

The bet is that a strong enough foundation model over retinal images can surface those systemic signals from a single photo. Not as a replacement for a blood test, but as a cheap, scalable first look that works wherever there's a camera. The training data is grounded in public sources, including AI-READI,² so the methodology can be talked about openly even when specific results stay in the lab.

² AI-READI is a public, NIH-funded multimodal dataset built for AI-ready diabetes and health research. Working from a public dataset is deliberate: it keeps the methodology shareable.

The eye is the cheapest window we have into the rest of the body. SocialEyes is trying to read through it.

Why I'm in the room

I don't run SocialEyes. I work with them, and have since the start of this year. They found me through the open work, the DGX Spark experiments and the writing on this blog, and reached out. I started helping, I dug the mission, and now we're doing genuinely cool things with this NVIDIA grant.

That origin matters to me, because it's the thing I keep arguing for in public: build the real thing, share it, and the right people show up.³ A working system and an honest write-up change the conversation from "should we" to "how do we scale this." This collaboration is what that looks like when it works.

³ More on why I build the way I do in my "I'm Justin Johnson, I Build Things" piece on Run Data Run.

From a small box to a borrowed one

Here's the through-line for anyone who followed the DGX Lab series: the small box came first.

A DGX Spark on a desk, 128GB of unified memory, is a real research rig at a tiny fraction of cluster cost.⁴ Most of the hard questions, which serving stack, which model, how to make local inference behave, got beaten out on that small box first, where a mistake costs minutes instead of metered GPU-hours. The week-one stack work and the benchmark reality check were all on the Spark.

⁴ The DGX Spark is NVIDIA's desktop GB10 machine. The catch is its GPU architecture (sm_121) needs special handling that the data-center H100 (sm_90) does not, which turns out to matter a lot when you move between them.
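The sm_121 versus sm_90 gap is easy to check for mechanically. What follows is a minimal sketch, not the tooling this series uses: it turns a (major, minor) compute capability into an `sm_XY` tag and asks whether a given build ships native code for it. The helper names and the `example_archs` list are hypothetical. On a real machine the inputs could come from PyTorch's `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`.

```python
def arch_tag(capability):
    """Turn a (major, minor) compute capability into an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_supports(capability, compiled_archs):
    """True if a build compiled for `compiled_archs` includes native code for this GPU."""
    return arch_tag(capability) in compiled_archs


# Values taken from the post: H100 is sm_90, the DGX Spark's GB10 is sm_121.
H100 = (9, 0)
SPARK = (12, 1)

# Hypothetical arch list for a build made with data-center GPUs in mind.
example_archs = ["sm_80", "sm_90"]

for name, cap in [("H100", H100), ("DGX Spark", SPARK)]:
    if build_supports(cap, example_archs):
        status = "native kernels present"
    else:
        status = "no native kernels in this build"
    print(f"{name} ({arch_tag(cap)}): {status}")
```

A missing tag is not always fatal, since some stacks may fall back to JIT-compiling PTX, but it is a cheap early warning that a package tuned for one box may need a rebuild or a different wheel on the other.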

The borrowed 8xH100 cluster, an NVIDIA DGX Cloud node we reach through the grant, gets to stand on all of that. The lineage is the point: the homelab box is the cheap R&D rig that de-risks the expensive one. Small box teaches the method, big box runs it at scale.

An autonomous research engine in the loop

There's a second character in this story. Alongside the modeling work, we run an autonomous research engine, ARIA, that reads the literature, proposes experiments, and helps decide what's worth a GPU-hour. ARIA cut its teeth on the small box, including the night it crashed the DGX and we built GPU monitoring in five minutes, and the run where it fired 151 experiments overnight. It's running on the cluster now. There's a paper on it in the pipeline, and I'll link it here when it's out.

The reason it earns its place: on a metered, borrowed cluster, the expensive mistake is running the wrong experiment. Reading the right papers first is cheap. That discipline, research before compute, is a thread that'll run through several posts in this series.

What the next couple of months look like

The plan, at a high level:

Stage the data onto the cluster cleanly, terabytes of public retinal imagery, without the hygiene mistakes that cost a day.
Train our own retinal foundation model, with the recipe locked from the literature before the first training run.
Run a battery of experiments across several research directions, with the autonomous engine helping triage them.
Keep the platform productionized, not a pile of one-off scripts, so every GPU-hour the grant gives us turns into real work.
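The post does not spell out which hygiene checks the team uses for the data-staging step, so treat this as one common option rather than their method: hash every file at the source, hash again after the copy, and diff the two manifests before any training job reads the data. The function names are hypothetical and the sketch uses only the Python standard library.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large images never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(source, staged):
    """Report files that are missing, unexpected, or changed after staging."""
    return {
        "missing": sorted(set(source) - set(staged)),
        "extra": sorted(set(staged) - set(source)),
        "changed": sorted(
            name for name in set(source) & set(staged)
            if source[name] != staged[name]
        ),
    }
```

Running `diff_manifests(build_manifest(src_dir), build_manifest(staged_dir))` and requiring all three lists to be empty is one way to catch a truncated or partial copy in minutes instead of discovering it mid-run.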

The next post gets concrete: standing up the box and getting the whole team working on it at once, without anyone tripping over anyone else.

What this log is for

How to read this series

Each entry is a self-contained engineering story, scrubbed of anything proprietary and written so you can apply it to your own borrowed or rented GPUs. You don't need to care about retinas to get value from it.

The point of writing it in public is that almost none of the hard-won lessons are specific to retinal AI. How you get the most out of a borrowed cluster on a deadline. How you give a team GPU access without handing out the keys to wipe it. How you stage a terabyte of data without guessing your way into a wasted day. How you serve a reasoning model locally with speculative decoding when the docs lag the code. Those transfer to anyone standing up real work on borrowed hardware.

So that's the setup. A mission I believe in, a partner who backed it with serious compute, and a standing invitation to watch us make the most of it in public. New parts land as the work happens. Follow along below.

Related reading on this site: the DGX Lab series opener for where the small-box lineage starts, The Hidden Crisis in LLM Fine-Tuning for the catastrophic-forgetting problem that shapes how we'll train the retinal model, and How I Delegated a 9-Day Medical AI Experiment for an earlier medical-AI build in this same style.

gpu
h100
dgx-spark
ml-infrastructure
build-in-public
retinal-ai