# AutoResearch on Blackwell GB10: 151 Experiments Overnight

![H100 vs GB10: different hardware, different optimal architectures](autoresearch-gb10-hero.png)

> [!tip] TLDR
> Clone [Karpathy's AutoResearch](https://github.com/karpathy/autoresearch), swap Flash Attention 3 for PyTorch SDPA, correct the FLOPS constant from 990 (H100) to your GPU's actual TFLOPS, and run. The key insight: on lower-TFLOPS GPUs, shrink everything (batch size, depth, dimensions) to maximize training steps per 5-minute window. The GB10's optimal config uses 6.1 GB of 128 GB VRAM.
>
> Read on if you want the full experiment log, phase analysis, and comparison with H100 results.

I wrote about the broader implications of the overnight loop pattern on [Run Data Run](https://rundatarun.io/p/the-overnight-loop). This article is the technical companion: what I changed, what the agent found, and what the data tells us about hardware-aware optimization.

## Setup

**Hardware:** NVIDIA DGX Spark, GB10 Blackwell GPU. 128 GB unified memory, 213 TFLOPS FP16 compute.

**Agent:** Claude Sonnet 4.5 running vanilla AutoResearch with two modifications:

1. **Flash Attention 3 swapped for PyTorch SDPA.** Blackwell doesn't support FA3 yet. The swap is one line:

   ```python
   # Original (H100)
   # Uses flash_attn_func from flash_attn
   # Replaced with:
   F.scaled_dot_product_attention(q, k, v, is_causal=True)
   ```

2. **FLOPS constant corrected.** AutoResearch hardcodes the H100's 990 TFLOPS. The GB10 measures 213:

   ```python
   # config
   gpu_flops = 213e12  # GB10 measured, not H100's 990e12
   ```

Everything else was Karpathy's original code. No other modifications.

## Timeline

**Afternoon session (2 hours):** 18 experiments. The agent immediately gravitated toward smaller configurations: batch size reductions, shallower models, dimension shrinking. By experiment 18, it had improved the baseline by 20.7%.

**Overnight session (16 hours):** 133 additional experiments (151 total). Twenty-six improvements kept. 122 ideas discarded.
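As an aside on the setup above: the one-line SDPA swap can be sketched as a self-contained drop-in. This is a minimal illustration, not AutoResearch's actual attention module; the `causal_attention` name is mine, and it assumes tensors already in SDPA's `(batch, heads, seq, head_dim)` layout.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Drop-in for flash_attn_func on GPUs without FA3 support.

    SDPA dispatches to the best available backend (flash, memory-efficient,
    or plain math) for the current hardware, so the same line runs on
    Blackwell, Hopper, or CPU. Expects (batch, heads, seq, head_dim).
    """
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Smoke test with small CPU tensors
q = k = v = torch.randn(2, 4, 16, 32)
print(causal_attention(q, k, v).shape)  # torch.Size([2, 4, 16, 32])
```

One layout caveat: `flash_attn_func` takes `(batch, seq, heads, head_dim)` while SDPA puts heads before the sequence dimension, so depending on the call site a `.transpose(1, 2)` on each tensor may be needed alongside the one-line swap.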
Three crashes (torch.compile failures on larger models). Final result: validation loss reduced from 1.463 to 1.135 bits per byte, a 22.5% improvement.

## The Core Discovery: Step-Limited, Not VRAM-Limited

The H100 community had already run AutoResearch extensively. Karpathy's reference run: 126 experiments on an H100, 2.82% improvement. The optimal H100 configuration: depth-8 models, batch size 2^19, about 44 GB of VRAM.

Standard approach. Bigger GPU, bigger models, more data per step. The GB10 agent went the opposite direction.

> [!warning] The Key Insight
> The GB10 is **step-limited**, not VRAM-limited. At 213 TFLOPS vs the H100's 990, the GB10 completes far fewer training steps in a 5-minute window. With H100-optimized configs, the GB10 managed only 93 steps. Not enough to learn anything useful.

The agent's solution: cut everything to maximize step count.

| Metric | H100 Optimal | GB10 Optimal | Delta |
|--------|-------------|-------------|-------|
| Model depth | 8-9 layers | 4 layers | -50% |
| Batch size | 2^19 | 2^16 | -87.5% (8x smaller) |
| Hidden dimensions | 512 | 384 | -25% |
| VRAM used | ~44 GB | ~6.1 GB | -86% |
| Steps in 5 min | ~953 | ~1,300 | +36% |
| Best val_bpb | 0.970 | 1.135 | - |

![Hardware Determines Architecture: H100 vs GB10 optimal configurations compared](autoresearch-gb10-comparison.png)

The absolute val_bpb is higher on GB10 (worse), which is expected. The GB10 has 4.6x less compute. But the relative improvement (22.5% vs 2.82%) tells us the agent found a much larger optimization surface on hardware where the default config was badly mismatched.

## Improvement Phases

The 22.5% improvement didn't happen uniformly. Three distinct phases:

### Phase 1: Batch Size Reduction (91% of total gain)

The single biggest lever. Cutting batch size from 2^19 to 2^16 reduced per-step compute by 8x, allowing roughly 8x more steps per window.
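That arithmetic can be sketched with a back-of-envelope model. The 213 TFLOPS figure and the 2^19 vs 2^16 batch sizes come from the runs above; the `flops_per_token` and `mfu` (sustained utilization) values below are illustrative assumptions, not measurements.

```python
WINDOW_S = 5 * 60  # AutoResearch's fixed 5-minute training window

def steps_in_window(peak_tflops, tokens_per_step, flops_per_token, mfu=0.4):
    """Estimate how many training steps fit in the window.

    Assumes per-step compute scales linearly with tokens_per_step and
    the GPU sustains a fixed fraction (mfu) of its peak throughput.
    """
    flops_per_step = tokens_per_step * flops_per_token
    sustained_flops = peak_tflops * 1e12 * mfu
    return int(WINDOW_S * sustained_flops / flops_per_step)

# On the GB10 (213 TFLOPS), shrinking the batch from 2**19 to 2**16
# tokens cuts per-step compute 8x, so ~8x more steps fit in the window.
h100_style = steps_in_window(213, 2**19, flops_per_token=1e8)
gb10_style = steps_in_window(213, 2**16, flops_per_token=1e8)
print(gb10_style // h100_style)  # -> 8
```

The exact step counts depend entirely on the assumed constants; the 8x ratio between the two configs does not, which is why batch size was the dominant lever.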
The tradeoff (noisier gradients from smaller batches) was more than offset by the additional learning iterations.

### Phase 2: Model Architecture Shrinking (6% of total gain)

Depth from 8 to 4 layers. Hidden dimensions from 512 to 384. Each reduction freed compute for more steps. The agent tried going even smaller (depth-2, dimensions 256) but hit diminishing returns where the model capacity was too low to learn the task.

### Phase 3: Hyperparameter and Architectural Tweaks (3% of total gain)

Fine-tuning on top of the smaller architecture:

- Highway connections (small positive)
- MLP expansion ratio adjustments
- RoPE base frequency tuning
- Learning rate schedule optimization

## What Failed

The failures are as instructive as the successes:

- **Larger models:** Crashed on `torch.compile`. The GB10's compiler support for complex architectures was incomplete.
- **SwiGLU activation:** Didn't beat ReLU squared on this hardware/task combination.
- **Multi-query attention:** Halved the step count because of a slow SDPA code path. The theoretical memory savings weren't worth the compute penalty.
- **Label smoothing:** Catastrophic. Unclear why; possibly it interacts poorly with the small-batch gradient noise.
- **EMA (0.999 decay):** Catastrophic. 1,300 steps isn't enough for an exponential moving average to converge; the momentum was still tracking initialization noise.

> [!info] Pattern Recognition
> Most failures shared a theme: techniques designed for long training runs with large batches don't transfer to short runs with small batches. The 5-minute window changes the optimization landscape fundamentally.

## Independent Validation

Three groups ran AutoResearch on GB10 independently:

1. **My run:** 151 experiments, results documented here.
2. **NVIDIA Developer Forums:** [27 experiments](https://forums.developer.nvidia.com/t/karpathys-autoresearch-customised-for-spark/362949), found similar smaller/shallower strategies.
3. **MLX port on M4:** [autoresearch-mlx](https://github.com/trevin-creator/autoresearch-mlx), found even shallower models were optimal on Apple Silicon.

Three different groups. Three different GPU architectures. Same fundamental insight: **hardware determines optimal architecture.** Copying configs from a different GPU makes things worse, not better.

## Validation Set Concerns

The community has raised [valid concerns](https://github.com/karpathy/autoresearch/discussions/43) about "spoiling the validation set." Running 151 experiments against the same metric risks overfitting to data-specific quirks.

**Mitigations in this run:**

- The 5-minute training budget limits the search space
- Changes are architectural (depth, width, batch size), not per-sample
- A 22.5% improvement is too large to plausibly be pure noise
- Independent convergence across three groups provides external validation

**Open questions:**

- Transfer to unseen test sets? Unknown.
- Transfer to different datasets? Unknown.
- Transfer to different tasks on the same hardware? Likely for the step-limited insight (it's physics), less certain for the specific architectural choices.

The step-limited discovery is real regardless of validation concerns. 213 TFLOPS is a hardware constant, not a dataset artifact.

## Reproducing This

The full code, all 151 experiment logs, and configuration files are [on GitHub](https://github.com/BioInfo/autoresearch-blackwell-gb10). To run on your own hardware:

1. Clone AutoResearch
2. Swap FA3 for SDPA if your GPU doesn't support Flash Attention 3
3. Measure your GPU's actual TFLOPS and update the config
4. Run overnight

The interesting experiment is running on different hardware and comparing what the agent discovers. Each GPU should find its own optimum. That's the whole point.

> [!tip] For DGX Spark Owners
> If you're coming from [[AI Systems & Architecture/dgx-spark-week-one-finding-the-right-stack|the Week One setup guide]], the SDPA swap is the only change needed. The NGC PyTorch container works out of the box with AutoResearch.

## What This Means

The overnight loop pattern (try, measure, keep or discard, repeat) is becoming infrastructure for hardware optimization. Every new GPU architecture ships with default recommendations tuned for benchmarks, not for the specific workload-hardware combination your users care about.

An autonomous agent with a 5-minute cycle can find hardware-specific optima overnight. Three independent groups proved it converges. The loop is the contribution, not any individual result.

For more on where this pattern is headed beyond ML training (GPU kernels, frontend performance, marketing, drug discovery), see the [companion piece on Run Data Run](https://rundatarun.io/p/the-overnight-loop).

---

### Related Articles

- [[AI Systems & Architecture/dgx-spark-week-one-finding-the-right-stack|DGX Spark: Week One - Finding the Right Stack]]
- [[Practical Applications/building-ai-research-night-shift|My AI Research Assistant Works the Night Shift]]
- [[AI Systems & Architecture/agent-architectures-with-mcp|Agent Architectures with MCP]]

---

<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>
<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>