DGX Spark: Week One Update - Finding the Right Stack
Read Day 4 first: DGX Lab: When Benchmark Numbers Meet Production Reality - Day 4 (Oct 26, 2025)
TL;DR: My initial post about DGX Spark challenges was more pessimistic than warranted. After systematic debugging, I found the right software configuration. The hardware is powerful, the learning curve is steep, and working with bleeding-edge ARM64 + Blackwell has been exactly the kind of challenge I was looking for.
What Changed
Yesterday, after my initial post about production challenges, I sat down with Claude Code and worked systematically through every issue I had documented. The conclusion? Most of my problems weren't hardware limitations. They were configuration issues.
The core problem: I was using PyTorch wheels compiled for CUDA 12.8 (cu128). The DGX Spark ships with CUDA 13.0, but the PyTorch cu128 wheels include an older ptxas compiler that doesn't recognize GB10's compute capability (sm_121). When I tried to use torch.compile(), I got:
PTXASError: ptxas fatal: Value 'sm_121a' is not defined for option 'gpu-name'
Switching to PyTorch compiled for CUDA 13.0 (cu130 wheels) fixed this completely. But the real discovery was testing NVIDIA's NGC containers.
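One way to catch this mismatch before torch.compile() fails is to check which CUDA version the installed wheel was built against. The sketch below is a minimal check, not the script I actually ran: the helper names and the (13, 0) threshold are assumptions drawn from the cu128 versus cu130 behavior described above, and torch.version.cuda is the only PyTorch attribute it reads.

```python
from typing import Optional, Tuple

# Assumption taken from the text above: cu128 wheels fail on GB10, cu130 wheels work.
MIN_CUDA_FOR_GB10 = (13, 0)


def parse_cuda_version(version: Optional[str]) -> Optional[Tuple[int, int]]:
    """Turn a string like '13.0' into (13, 0); return None for CPU-only builds."""
    if not version:
        return None
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def wheel_supports_gb10(wheel_cuda: Optional[str]) -> bool:
    """True when the wheel's CUDA version meets the assumed minimum."""
    parsed = parse_cuda_version(wheel_cuda)
    return parsed is not None and parsed >= MIN_CUDA_FOR_GB10


def installed_wheel_cuda() -> Optional[str]:
    """CUDA version the installed PyTorch wheel was built with, if any."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda
```

Calling wheel_supports_gb10(installed_wheel_cuda()) at the top of a training script gives an early, readable failure instead of a ptxas error partway through compilation.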
Performance Testing Results
I ran benchmarks on a 120M parameter transformer model across four different PyTorch configurations:
| Configuration | Eager Mode | Compiled | vs Baseline |
|---|---|---|---|
| PyTorch cu128 | Crash | Crash | N/A |
| PyTorch cu130 | 48.6ms | 38.7ms | 1.0x |
| PyTorch nightly cu130 | 43.5ms | 30.7ms | 1.6x |
| NGC Container 25.09 | 13.7ms | 13.5ms | 3.6x |
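The "vs Baseline" column is easiest to read as each configuration's compiled latency compared against the cu130 eager run (48.6ms). That reading is an inference from the numbers rather than something stated explicitly, so treat this short calculation as a sanity check under that assumption:

```python
# Latencies in milliseconds, copied from the table above.
RESULTS_MS = {
    "PyTorch cu130": {"eager": 48.6, "compiled": 38.7},
    "PyTorch nightly cu130": {"eager": 43.5, "compiled": 30.7},
    "NGC Container 25.09": {"eager": 13.7, "compiled": 13.5},
}

# Assumed baseline: the cu130 eager-mode run.
BASELINE_MS = RESULTS_MS["PyTorch cu130"]["eager"]


def speedup(latency_ms: float, baseline_ms: float = BASELINE_MS) -> float:
    """How many times faster a run is than the baseline."""
    return baseline_ms / latency_ms


for name, row in RESULTS_MS.items():
    print(
        f"{name}: eager {speedup(row['eager']):.2f}x, "
        f"compiled {speedup(row['compiled']):.2f}x"
    )
```

Under this reading, the nightly and NGC compiled runs come out at roughly 1.6x and 3.6x, matching the table, while the cu130 compiled run works out to about 1.26x over its own eager mode.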
What Actually Works Now
Training:
- NGC containers for production workloads
- PyTorch nightly cu130 for development
- Standard cache clearing patterns still recommended
Inference:
- Ollama for GPU-accelerated serving (works perfectly on ARM64)
- CPU inference as fallback when needed
- vLLM doesn't work on ARM64 + CUDA 13.0 due to library mismatches
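For reference, here is a minimal sketch of calling a local Ollama server from Python using only the standard library. It assumes Ollama is listening on its default port (11434) and uses the /api/generate endpoint with streaming turned off. The model tag "gemma2:9b" is a placeholder, so substitute whatever model you have pulled.

```python
import json
import urllib.request

# Assumes a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "gemma2:9b", timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (model tag is a placeholder)."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

Keeping request construction separate from the network call makes it easy to inspect the payload without a running server.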
Memory Management:
- The fragmentation issue is real but manageable
- torch.cuda.empty_cache() every 50 steps
- Limit long-running sessions to 2-3 hours
- Monitor VRAM usage
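The cache-clearing and session-length guidance above can be sketched as two small helpers. This is an illustration, not my actual training code: the function names are hypothetical, the 3-hour limit is simply the upper end of the range above, and the only PyTorch calls used are torch.cuda.is_available() and torch.cuda.empty_cache().

```python
import time
from typing import Callable, Optional

CLEAR_EVERY_STEPS = 50          # interval from the list above
MAX_SESSION_SECONDS = 3 * 3600  # upper end of the 2-3 hour guideline


def _default_clear() -> None:
    """Call torch.cuda.empty_cache() when PyTorch with CUDA is present."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def maybe_clear_cache(
    step: int,
    every: int = CLEAR_EVERY_STEPS,
    clear_fn: Optional[Callable[[], None]] = None,
) -> bool:
    """Clear the allocator cache on every `every`-th step; report whether it ran."""
    if step <= 0 or step % every != 0:
        return False
    (clear_fn or _default_clear)()
    return True


def session_expired(
    started_at: float,
    now: Optional[float] = None,
    limit: float = MAX_SESSION_SECONDS,
) -> bool:
    """True once the session has run for at least `limit` seconds."""
    now = time.monotonic() if now is None else now
    return now - started_at >= limit
```

In a training loop, maybe_clear_cache(step) would go after the optimizer step, and a true result from session_expired(start) is the cue to checkpoint and restart the process.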
The ARM64 + Blackwell Reality
Here's what I understand now about this platform: it's genuinely new territory. The GB10 Blackwell was announced in January 2025. CUDA 13.0 was released in August 2025. ARM64 server DGX is a new product line for NVIDIA. The combination of all three means limited real-world testing and some rough edges.
That's not a complaint. That's the reality of working with cutting-edge hardware, and it's exactly what makes this interesting.
I expect the CUDA/driver ecosystem to mature over the next 6-12 months. When driver 590.x or 600.x releases, many of these edge cases will likely disappear. But right now, if you know the workarounds, the system is absolutely usable for production work.
Lessons from Debugging
Working through this with Claude Code was actually fun. Not in a "haha fun" way, but in the way that systematically tracking down root causes in complex systems is satisfying. Every issue had a reason. Every failure pointed to something specific.
The debugging process taught me more about the ML software stack (application → PyTorch → CUDA runtime → driver → hardware) than I've learned in months of just using working systems. When something fails at the CUDA layer, you learn exactly how these layers interact.
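When a failure could sit in any of those layers, it helps to print what each layer reports before changing anything. The helper below is a rough sketch with hypothetical names. It reads torch.__version__ and torch.version.cuda (the CUDA version the wheel was built with) when PyTorch is installed, and asks nvidia-smi for the driver version when that tool is on the PATH.

```python
import platform
import shutil
import subprocess
from typing import Dict, Optional


def driver_version() -> Optional[str]:
    """Ask nvidia-smi for the driver version; None if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return None
    lines = out.strip().splitlines()
    return lines[0].strip() if lines else None


def stack_report() -> Dict[str, Optional[str]]:
    """Collect one version string per layer; missing layers stay None."""
    report: Dict[str, Optional[str]] = {
        "arch": platform.machine(),
        "python": platform.python_version(),
        "pytorch": None,
        "cuda_build": None,
        "driver": driver_version(),
    }
    try:
        import torch
    except ImportError:
        return report
    report["pytorch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda
    return report
```

Saving this report alongside each benchmark run makes it much easier to tell later which layer changed between a working and a failing configuration.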
Correcting My Initial Take
My Day 4 post about benchmark reality was too negative. I was frustrated by unexpected failures and interpreted them as fundamental hardware issues. After investigation, most were software configuration problems with clear solutions.
Before hitting these issues, I documented several successful implementations showing the system working well: building an intelligent AI gateway (Day 1), creating ML workflow optimizations (Day 2), and deploying a complete RAG infrastructure (Day 3). The Day 4 challenges were real, but they didn't invalidate the earlier successes.
The benchmarks NVIDIA publishes are accurate. I just wasn't running the right stack to achieve them. Once I switched to CUDA 13.0 and NGC containers, the performance matched expectations.
What I got wrong:
- Training wasn't broken (it worked fine all along)
- Inference issues were CUDA version related, not fundamental flaws
- llama.cpp and Ollama work well on this platform
- The "15 hours debugging training failures" were inference testing failures
What I got right:
- Memory fragmentation is real and requires workarounds
- Long training sessions need careful management
- CPU inference is more stable than GPU for certain workloads
- This platform requires expertise to configure correctly
Why This Matters
I bought this system specifically because it's bleeding edge. I want to work with the newest GPU architecture. I want to understand ARM64 ML infrastructure before it becomes mainstream. I want to figure out optimization patterns that will matter in six months when this stack is mature.
NVIDIA is building something genuinely new here. ARM64 server GPUs with Blackwell architecture represent a different approach to ML infrastructure. There's a learning curve, yes. But the performance potential is real, and getting in early means understanding the platform deeply.
Working with Claude Code through this process made it educational rather than frustrating. Having an AI pair programmer that could systematically work through CUDA versions, benchmark different configurations, and help document findings turned debugging into research.
Claude Code and I spent about 15 hours systematically working through CUDA versions, container configurations, and memory management patterns. We documented everything: which PyTorch builds work, which crash, what the error messages actually mean, and how the layers of the software stack interact.
This wasn't wasted time. Understanding the platform at this level means I know exactly what's happening when something goes wrong. I know which layer of the stack to investigate. I know when an issue is in my code versus when it's a CUDA runtime bug that needs a driver update. That knowledge is valuable for everything I'll build on this system.
When ARM64 ML infrastructure is standard in two years, I'll have two years of experience optimizing for it. That's the value of working with bleeding-edge platforms.
Current State
Right now, my DGX Spark runs:
- CUDA 13.0
- NGC PyTorch containers for training
- Ollama for inference serving
- PyTorch nightly cu130 for quick tests
- Monitoring scripts for driver updates
Training is fast. Inference is stable. The system feels solid for my workloads.
The rough edges are still there. Memory management requires attention. Long sessions need monitoring. But these are manageable engineering challenges, not fundamental blockers.
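For the monitoring side, one rough signal is the gap between memory PyTorch has reserved and memory tensors are actually using. The sketch below uses torch.cuda.memory_allocated() and torch.cuda.memory_reserved(). The "slack ratio" is my own illustrative proxy, not a direct measurement of fragmentation, and the function names are hypothetical.

```python
from typing import Dict, Optional

GIB = 1024 ** 3


def fragmentation_stats(allocated_bytes: int, reserved_bytes: int) -> Dict[str, float]:
    """Summarise allocator state; slack is memory reserved but not used by tensors."""
    slack = max(reserved_bytes - allocated_bytes, 0)
    ratio = slack / reserved_bytes if reserved_bytes else 0.0
    return {
        "allocated_gib": allocated_bytes / GIB,
        "reserved_gib": reserved_bytes / GIB,
        "slack_ratio": ratio,
    }


def current_stats() -> Optional[Dict[str, float]]:
    """Read live numbers from PyTorch; None when PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return fragmentation_stats(
        torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
    )
```

Logging current_stats() every few hundred steps shows whether the slack ratio creeps upward over a long session, which is a reasonable moment to clear the cache or restart.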
For Other Early Adopters
- Verify your CUDA version (13.0 matters for Blackwell)
- Test NGC containers (major performance difference)
- Use Ollama instead of vLLM on ARM64
- Implement cache clearing patterns
- Monitor VRAM closely during long runs
Your exact configuration might differ, but the pattern holds: get the software stack right for your hardware, and the performance is there.
The system isn't "not ready." It's ready for people willing to learn the platform and work with bleeding-edge software stacks. That's a feature, not a bug.
Technical Environment:
- Hardware: NVIDIA DGX Spark (GB10 Blackwell, ARM64)
- CUDA: 13.0 (critical for sm_121 support)
- Primary: NGC PyTorch 25.09 container
- Alternative: PyTorch 2.10.0.dev nightly cu130
- Inference: Ollama (GPU-accelerated)
- Tools: Claude Code for systematic debugging
Current Projects:
- Gemma 2 9B fine-tuning for medical QA
- Specialized oncology domain adaptation
- Multi-model inference pipeline development
Further Reading
DGX Lab Series
- Three Days to Build an AI Research Lab - The journey begins
- Day 1: Intelligent Gateway - When simple heuristics beat ML
- Day 2: ML Workflow Optimizations - Shell productivity
- Day 3: RAG Infrastructure - Production deployment
- Day 4: Benchmarks vs Reality - The initial challenges
About the Author: Justin Johnson builds AI systems and writes about practical AI development.