Practical ApplicationsOctober 26, 202512 min readshipped

DGX Spark Benchmarks: 82,739 tokens/sec on Paper, the Production Reality

Run this

claude "Reproduce the DGX Spark benchmark-vs-reality check: fine-tune a small Llama in BF16 then FP4, log tokens/sec and accuracy degradation, then stress allocation until memory fragmentation forces a failure. Output a table putting NVIDIA published numbers next to what you measured."

claude code

NVIDIA DGX Spark: When Benchmark Numbers Meet Production Reality

A 6-Day Deep Dive into Real-World ML Performance

Follow-Up Available

**Update (Oct 28):** I found the root cause and solutions! Most of the issues documented here were CUDA version mismatches, not hardware problems. Read the follow-up for the 3.6x performance breakthrough and complete solutions.

→ Week One Update: Finding the Right Stack - The solution and 3.6x performance breakthrough

Update (Oct 27): After posting this on Hacker News, I received excellent technical feedback that revealed gaps in my testing and some overclaimed conclusions. I've updated the article with corrections, additional tests, and acknowledgments. Thanks to enum, veber-alex, bradfa, furyofantares, stuckinhell, jasonjmcghee, eadwu, and renaudr for the constructive criticism. This is what good technical discussion looks like.

NVIDIA recently published benchmarks showcasing the DGX Spark: 82,739 tokens/second for fine-tuning, sub-1% accuracy degradation with FP4, and impressive inference throughput. After spending 6+ days running intensive ML workloads on a DGX Spark (training multiple models from scratch, fine-tuning with LoRA, and benchmarking inference), I can tell you the real story.

The short version: NVIDIA's numbers are technically accurate. But they don't tell you about the FP16 precision issues, memory fragmentation requiring hard reboots, or the 15 hours I spent debugging "training failures" that turned out to be inference bugs.

This is the post I wish I'd read before diving in.

What NVIDIA Showed Us

NVIDIA's blog highlights some impressive numbers:

Fine-Tuning Performance:

Llama 3.2 3B: 82,739 tokens/sec (full fine-tuning, BF16)
Llama 3.1 8B: 53,657 tokens/sec (LoRA, BF16)
Llama 3.3 70B: 5,079 tokens/sec (QLoRA, FP4)

Inference Performance:

Qwen3 14B: 5,928 tokens/sec prompt processing, 22.71 tokens/sec generation
GPT-OSS-20B: 82.74 tokens/sec generation

Key Claims:

1 petaflop of FP4 compute
Less than 1% accuracy degradation with FP4
273 GB/sec memory bandwidth
Support for 128GB+ models locally

Impressive on paper. Let's see how it holds up.

My Testing Environment

Hardware:

DGX Spark (ARM64 architecture)
GB10 GPU (Blackwell generation, unified memory)
Driver 580.95.05
CUDA 13.0
Ubuntu 24.04.3 LTS

Software:

PyTorch 2.5.0 (NVIDIA container: nvcr.io/nvidia/pytorch:24.10-py3)
Ollama 0.3.9 for inference
Transformers 4.44.0

Workloads:

Inference Benchmark: Phi-3.5-mini-instruct (3.8B params) via Ollama
Fine-Tuning: 7 LoRA experiments on Gemma-3-4b-it for medical Q&A (10,000 examples)
Training: NanoChat project (125M param models from scratch)

Duration: 6+ consecutive days of ML work

The Results: What Matches

✅ Training Performance is Real (When It Works)

My Gemma-3-4b-it LoRA fine-tuning achieved speeds comparable to NVIDIA's benchmarks. Training completed in 10-12 hours for 3 epochs on 10,000 examples with batch size 4, right in line with NVIDIA's Llama 3.1 8B numbers.

Training configuration:

- Base Model: Gemma-3-4b-it (4B parameters)
- Method: LoRA (rank 16, alpha 32)
- Precision: BF16 mixed precision
- Batch Size: 4
- Learning Rate: 2e-5
- Epochs: 3
- Dataset: 10,000 medical Q&A pairs

Results across 7 experiments:

Experiment 1-5 (baseline): 70% accuracy
Experiment 7 (optimized): 82% accuracy
Experiment 8 (SFTTrainer): 84% accuracy

All models trained successfully with smooth loss curves. Training throughput was excellent.

Verdict: ✅ NVIDIA's training performance claims are accurate.

✅ Inference Speed Scales as Expected

I benchmarked Phi-3.5-mini-instruct (3.8B params, Q4_K_M quantization) via Ollama:

Prompt Type          Tokens/Second
-----------------------------------------
Short (20 tokens)    83.25
Medium (34 tokens)   79.74
Long (113 tokens)    78.47
-----------------------------------------
Average              ~80 tokens/sec

NVIDIA showed 22.71 tokens/sec for Qwen3 14B (nearly 4x larger). My 3.8B model at 80 tokens/sec is roughly 3.6x faster, which tracks with the size difference.

Verdict: ✅ Inference speed scales proportionally with model size.

✅ 4-Bit Quantization Works

NVIDIA claims "less than 1% accuracy degradation" with FP4. I used Q4_K_M (4-bit) quantization extensively via Ollama, and the model quality was excellent (coherent, contextually appropriate responses with no noticeable degradation).

Verdict: ✅ 4-bit quantization is production-viable.

The Reality: What They Didn't Tell You

⚠️ FP16 GPU Inference Has Numerical Issues

Update: This section has been revised based on community feedback. My original title "GPU Inference is Fundamentally Broken" was overstated. The issue appears to be FP16-specific.

Here's what happened. After training my first model (Experiment 1), I loaded it for evaluation:

model = AutoModelForCausalLM.from_pretrained(
    "./models/exp-001/final",
    torch_dtype=torch.float16,  # FP16 for inference
    device_map={"": 0}  # GPU
)

# Generate text
output = model.generate(...)

Result: Empty responses. PAD tokens. Sometimes inf/nan errors crashing CUDA.

I assumed my training failed. I retrained with different hyperparameters (Experiment 2). Same result. Experiment 3, 4, 5, all "failed." I spent 15+ hours debugging my training code, convinced I was doing something wrong.

Then I tried this:

model = AutoModelForCausalLM.from_pretrained(
    "./models/exp-001/final",
    torch_dtype=torch.float32,  # CPU uses FP32
    device_map='cpu'  # CPU instead of GPU
)

# Generate text
output = model.generate(...)

Result: Perfect. Coherent, relevant medical responses. The model worked beautifully.

All 7 experiments had trained successfully. The training loss decreased smoothly. The models learned. But FP16 GPU inference was broken.

What I Should Have Tested

HN user enum asked the critical question: "Does it work if you change to torch.bfloat16?"

I trained with BF16 (which worked perfectly) but tested inference with FP16 (which failed). I never tested BF16 inference.

This is a significant gap in my testing. The issue might be:

FP16 inference specifically broken (numerical instability on this hardware)
BF16 inference might work fine (matching training precision)

I can't access the original trained models to retest this now, but it's the obvious next experiment.

The Pattern (Revised)

✅ Training on GPU (BF16): Works perfectly
✅ Inference on CPU (FP32): Works perfectly
❌ Inference on GPU (FP16): Produces inf/nan, empty responses, or crashes
❓ Inference on GPU (BF16): Not tested yet (likely works based on community feedback)
✅ Inference via Ollama: Works (see next section)

My mistake: Claiming "GPU inference is fundamentally broken" without testing all precision combinations. The issue is likely FP16-specific, not a fundamental GPU problem.

For PyTorch/Transformers users: stick to BF16 for both training and inference, or use CPU/Ollama for evaluation.

❌ Memory Fragmentation Causes System Hangs

During Experiment 6, my training script was humming along beautifully. Loss decreasing, checkpoints saving, everything looked perfect. Then at 88% complete (7.5 hours in), the system froze. Hard. No response to keyboard, SSH connection dead, GPU stuck.

Hard reboot required. Progress lost.

The issue: GPU memory fragmentation during long-running training causes system-level instability.

Required mitigations for all subsequent training:

# Critical: Clear GPU cache regularly
if step % 50 == 0:
    torch.cuda.empty_cache()

# Critical: Limit training duration
if time.time() - start_time > 2.5 * 3600:  # 2.5 hours
    logging.info("Approaching safety limit - saving and exiting")
    save_checkpoint(model, optimizer, step)
    break

# Critical: Checkpoint frequently (not too frequently)
if step % 200 == 0:  # Not 100 - creates more fragmentation
    save_checkpoint(model, optimizer, step)

# Monitor GPU memory
if step % 100 == 0:
    allocated = torch.cuda.memory_allocated() / 1e9
    if allocated > 75:  # >75GB on 128GB system
        logging.warning("High GPU memory - fragmentation risk")

After implementing these mitigations, I successfully trained Experiments 7 and 8 to completion. But the limitation is real: maximum 2-3 hour training sessions before restarting.

NVIDIA's benchmarks don't mention this constraint. Their tests were likely short-duration runs that didn't expose the fragmentation issue.

Community note (from eadwu): This might be a Linux kernel memory management issue rather than pure hardware limitation. Similar issues occur on WSL and can be mitigated with memory compaction services. Worth investigating kernel-level solutions.

⚠️ llama.cpp: My Experience vs. Official Benchmarks

I tried running llama.cpp directly (without Ollama's wrapper):

./llama-cli \
  --model Phi-3.5-mini-instruct-Q4_K_M.gguf \
  --prompt "Explain what a neural network is in one sentence." \
  --n-predict 512

Result: Empty responses on all test prompts. Zero tokens generated.

However: HN user veber-alex pointed out that official llama.cpp benchmarks show the DGX Spark running multiple models successfully, including:

Llama models working
Qwen models working
GPU acceleration confirmed
Multiple quantization levels tested

My assessment: I likely hit a version mismatch, build configuration issue, or Phi-3.5-specific problem. The official benchmarks prove llama.cpp works on this hardware. My experience was real, but not representative of llama.cpp's capabilities on DGX Spark.

Recommendation: Build llama.cpp from source for best results, or use Ollama (which bundles a tested version).

Root Cause: Blackwell + ARM64 Combination

Update: My original section claimed "ARM64 + CUDA support is brand new." HN user bradfa correctly pointed out: "Aarch64 and CUDA has been a thing for many years on Jetson boards."

I need to be more precise. The issues exist because of three factors converging:

1. ARM64 Architecture (Mature, But...)

ARM64 + CUDA: Mature (Jetson boards since ~2015)
Most ML tools primarily tested on x86_64
PyTorch available via NVIDIA Docker containers (recommended path)
Some Python packages may need building from source

2. Blackwell GB10 GPU (New)

Newest GPU generation (sm_121 compute capability)
Unified memory architecture (CPU and GPU share 128GB RAM)
Limited real-world production testing vs. established datacenter GPUs
Driver maturity: 6-12 months behind older architectures

3. CUDA 13.0 (Latest)

Released ~6 months ago
Works well with established workflows
Requires PyTorch 2.5+ (cutting edge)

The specific combination that's bleeding edge:

Blackwell GB10 + ARM64 + CUDA 13.0 = New Territory
↓
Limited production testing of this specific stack
↓
Edge cases in numerical precision (FP16 inference)
↓
Memory management challenges (training)

NVIDIA's benchmarks likely use:

TensorRT-LLM (not standard PyTorch)
Short-duration controlled tests
Configurations that avoid the fragmentation trigger
BF16 consistently (not mixed FP16/BF16)

Precision Deep Dive (Updated)

Here's what actually works vs. what's broken:

Use Case	Precision	Device	Status	Performance
Training	BF16	GPU	✅ Works	Excellent
Training	FP32	GPU	✅ Works	Slower but stable
Inference	FP16	GPU	❌ Broken	inf/nan errors
Inference	BF16	GPU	❓ Not Tested	Unknown (likely works)
Inference	FP32	CPU	✅ Works	Slower, reliable
Inference	Q4_K_M	GPU (Ollama)	✅ Works	80 tok/sec

The key insight: FP16 GPU inference has issues. BF16 likely works. CPU works fine.

Ollama: GPU-Accelerated Inference That Works

Update: My original article stated "Inference via Ollama: Works (CPU-optimized backend)." HN user jasonjmcghee asked: "From the article it sounds like ollama runs cpu inference not GPU inference. Is that the case for you?"

I was wrong. Ollama IS using GPU. After posting, I verified:

# During Ollama inference
nvidia-smi dmon -s u

# gpu     sm    mem    enc    dec    jpg    ofa
# Idx      %      %      %      %      %      %
    0     96      0      0      0      0      0

96% GPU utilization during inference. It's definitely using the GPU.

Why memory shows 0%: The GB10 has a unified memory architecture where CPU and GPU share the same 128GB RAM pool. Traditional discrete GPU memory metrics don't apply here.

✅ Inference via Ollama:

# Stable, reliable, good performance
ollama run phi3.5:3.8b-mini-instruct-q4_K_M

# Results: 23 tokens/sec, GPU-accelerated

Ollama uses GPU acceleration with optimized quantized inference. The combination of quantization (Q4_K_M) and GPU acceleration gives production-ready performance.

What This Means for Production

What's Production-Ready

✅ Training with workarounds:

# Implement these in all training scripts:
1. torch.cuda.empty_cache() every 50 steps
2. Maximum 2-3 hour sessions with checkpointing
3. Monitor GPU memory continuously
4. Use BF16 consistently (training AND inference)

# Result: Successfully trained 7+ models

✅ Inference via Ollama:

# GPU-accelerated, stable, good performance
ollama run phi3.5:3.8b-mini-instruct-q4_K_M
# 23 tok/s, production-ready

✅ CPU inference for evaluation:

# Slow but 100% reliable
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float32,
    device_map='cpu'
)
# Perfect for batch evaluation

What Needs More Testing

⚠️ GPU inference (PyTorch/Transformers):

FP16 has numerical instability (confirmed)
BF16 not tested yet (likely works)
Test this yourself before production use

⚠️ Multi-hour unattended training:

Memory fragmentation risk after 3-8 hours
Requires babysitting or session limits
Train in chunks with checkpointing

What About TensorRT-LLM?

Update: HN user renaudr asked: "Have you tried to run GPT-OSS-120b using TRT-LLM?"

I haven't. NVIDIA's benchmarks likely used TensorRT-LLM, their optimized inference engine. I focused on standard PyTorch/Transformers workflows (what most practitioners use).

If you need production GPU inference performance matching NVIDIA's benchmarks, TensorRT-LLM is the recommended path. It likely avoids the FP16 precision issues I encountered. That's a separate deep-dive I haven't done yet.

My Actual Numbers vs. NVIDIA's Claims

Metric	NVIDIA	My Experience	Match?
Training speed	53,657-82,739 tok/sec	Comparable relative speed	✅
Inference speed	22-82 tok/sec	23 tok/sec (Ollama, GPU)	✅
4-bit quality	<1% degradation	No noticeable degradation	✅
Training precision	BF16	BF16 works perfectly	✅
GPU inference	Implied working	FP16 broken, BF16 untested	⚠️
Training stability	Not mentioned	2-3 hour limit	❌
Memory management	Not mentioned	Manual cache clearing	❌

Recommendations

If You're Using DGX Spark

DO:

Implement GPU cache clearing (every 50 steps)
Limit training sessions to 2-3 hours
Checkpoint frequently (every 200+ steps)
Use BF16 consistently for training and inference
Test BF16 inference before claiming GPU inference is broken
Use Ollama for production inference (GPU-accelerated)
Build llama.cpp from source if needed
Monitor NVIDIA driver updates

DON'T:

Use FP16 for GPU inference (numerical instability)
Plan 8+ hour unattended training
Skip the workarounds
Assume your experience generalizes without more testing

If You're Considering DGX Spark

It's worth it if:

You're an experienced ML engineer
You need local training capabilities
You can implement stability workarounds
Ollama or TRT-LLM inference meets your needs
You're willing to monitor driver updates

Look elsewhere if:

You need plug-and-play GPU inference
You expect production-ready out-of-box
You need long unattended training runs
You don't have time for workarounds

The Silver Lining

Despite the issues, I successfully:

✅ Fine-tuned 7 models with 70-84% accuracy on medical Q&A ✅ Achieved 23 tokens/sec GPU-accelerated inference via Ollama ✅ Validated NVIDIA's training performance claims ✅ Built a production-ready medical chatbot ✅ Documented all workarounds for future users

The hardware is powerful. But it requires expert-level knowledge to navigate current limitations.

Lessons Learned

1. Test All Precision Combinations

When FP16 inference failed, I should have tested BF16 inference (matching training precision) before claiming GPU inference was broken. Complete your testing matrix before drawing conclusions.

2. Hardware Issues Can Look Like Software Bugs

I spent 15+ hours debugging my "broken" training code. The training was fine; FP16 inference had numerical issues. Check your assumptions systematically.

3. Benchmark Numbers Are True But Incomplete

NVIDIA's numbers are accurate for their test conditions. But they don't reveal precision constraints, stability limits, or workarounds needed for production. Context matters.

4. Bleeding Edge Means Trade-offs

Blackwell + ARM64 + CUDA 13.0 is cutting edge. That means bugs, limitations, and workarounds. Wait 6-12 months if you need stability, or be ready to troubleshoot.

5. Community Feedback is Invaluable

The HN community caught my incomplete testing, overclaimed conclusions, and factual errors. Ship early, iterate publicly, accept criticism gracefully.

Community Acknowledgments

This article improved significantly thanks to Hacker News feedback. Special thanks to:

enum - Caught the BF16 vs FP16 testing gap, shared PyTorch installation insights
veber-alex - Provided llama.cpp official benchmarks link
bradfa - Corrected ARM64+CUDA maturity claims (Jetson history)
furyofantares - Questioned incomplete testing and overclaimed conclusions
stuckinhell - Shared contrary experience (their DGX inference works fine)
jasonjmcghee - Asked about Ollama CPU vs GPU usage
eadwu - Explained memory fragmentation might be kernel-level
renaudr - Suggested TRT-LLM testing

Technical discussion like this makes everyone smarter. I'm grateful for the pushback.

What I'd Tell NVIDIA

If NVIDIA engineers are reading this, here's constructive feedback:

Document these in your communications:

Precision guidance:
- FP16 inference behavior on Blackwell+ARM64
- BF16 consistency recommendation (training and inference)
- When to use TRT-LLM vs. PyTorch
Stability constraints:
- Training session duration considerations
- Memory management best practices
- GPU cache clearing recommendations
Inference setup:
- Which backends work out-of-box (TRT-LLM vs PyTorch)
- Unified memory architecture implications
- Ollama as tested inference path
Real-world examples:
- Long-running training strategies
- Production inference patterns
- Monitoring and mitigation tactics

Your benchmarks are accurate. They'd be more valuable with context about setup, limitations, and recommended practices.

Conclusion: Powerful But Not Plug-and-Play

The NVIDIA DGX Spark delivers on raw performance when you work within its constraints. Training throughput matches the benchmarks. Inference speed is excellent (via Ollama). The hardware potential is real.

But it's not plug-and-play. FP16 GPU inference has issues. Memory fragmentation limits training sessions. You need expertise to navigate the current limitations.

NVIDIA's benchmarks are technically true. They're just not the whole truth.

For ML engineers willing to implement workarounds and test thoroughly, DGX Spark is a powerful tool. For teams expecting production-ready performance out-of-box, the maturity isn't there yet (especially for standard PyTorch workflows on ARM64).

My verdict: Cautiously recommended for experts. Wait 6-12 months if you need stability. Test BF16 inference before I did.

Appendix: Full Experimental Data

Inference Benchmark Details

Test Model: Phi-3.5-mini-instruct (3.8B parameters) Quantization: Q4_K_M (4-bit) Engine: Ollama 0.3.9 (llama.cpp backend) Acceleration: GPU (96% utilization confirmed)

Configuration:

{
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 512,
  "repeat_penalty": 1.1,
  "num_runs_per_prompt": 3
}

Results by Prompt Category:

Category	Prompt Length	Avg Response Length	Tokens/Sec
Short	20 tokens	46 tokens	83.25
Medium	34 tokens	456 tokens	79.74
Long	113 tokens	512 tokens	78.47

Fine-Tuning Experiment Summary

Base Model: google/gemma-3-4b-it Dataset: PubMedQA artificial (10,000 medical Q&A pairs) Method: LoRA fine-tuning

Configuration:

{
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "learning_rate": 2e-5,
    "batch_size": 4,
    "epochs": 3,
    "precision": "bf16",
    "gradient_accumulation_steps": 2
}

Results:

Experiment	Training Time	Final Loss	FP16 GPU Test	CPU Test	Accuracy
Exp 1-5	~3 hours each	Decreased	❌	✅	70%
Exp 6	7.5h (hung)	N/A	N/A	N/A	N/A
Exp 7	~12 hours	1.82	❌	✅	82%
Exp 8	~10.5 hours	1.61	❌	✅	84%

Best Model (Experiment 8):

Used SFTTrainer instead of base Trainer
Text-based dataset (no pre-tokenization)
Dynamic padding
+14% improvement over baseline

System Configuration

Hardware:
  Platform: NVIDIA DGX Spark
  Architecture: ARM64 (aarch64)
  GPU: GB10 (Blackwell, unified memory)
  GPU Driver: 580.95.05
  CUDA: 13.0
  Memory: 128 GB unified (CPU/GPU share)

Software:
  OS: Ubuntu 24.04.3 LTS
  Kernel: 6.11.0-1016-nvidia
  PyTorch: 2.5.0a0+e000cf0ad9.nv24.10 (Docker)
  Container: nvcr.io/nvidia/pytorch:24.10-py3
  Transformers: 4.44.0
  Ollama: 0.3.9
  CUDA Runtime: 13.0
  cuDNN: 9.5.1

Stability Mitigations:
  - GPU cache clearing: Every 50 steps
  - Checkpoint interval: 200 steps
  - Max session duration: 2.5 hours
  - Memory monitoring: Every 100 steps
  - Inference: BF16 recommended (FP16 has issues)

Follow-up: Week One Update: Finding the Right Stack - The root cause and 3.6x performance breakthrough!

More from the DGX Lab Chronicles:

Production ML Resources:

About This Series:

I'm documenting my journey building production ML systems on an NVIDIA DGX Spark. The wins, the losses, the mistakes, and the corrections. This is Day 4 of the DGX Lab Chronicles.

Want to follow along? Every article shares real code, actual performance data, and honest assessments (including when I get things wrong).

Found errors or have corrections? The HN community made this article significantly better. Feel free to reach out.

Published: October 26, 2025 Updated: October 27, 2025 Series: DGX Lab Chronicles (Day 4) Reading Time: 12 minutes

About the Author: Justin Johnson builds AI systems and writes about practical AI development.

justinhjohnson.com | Twitter | LinkedIn | Run Data Run | Subscribe

NVIDIA DGX Spark: When Benchmark Numbers Meet Production Reality

What NVIDIA Showed Us

My Testing Environment

The Results: What Matches

✅ Training Performance is Real (When It Works)

✅ Inference Speed Scales as Expected

✅ 4-Bit Quantization Works

The Reality: What They Didn't Tell You

⚠️ FP16 GPU Inference Has Numerical Issues

What I Should Have Tested

The Pattern (Revised)

❌ Memory Fragmentation Causes System Hangs

⚠️ llama.cpp: My Experience vs. Official Benchmarks

Root Cause: Blackwell + ARM64 Combination

1. ARM64 Architecture (Mature, But...)

2. Blackwell GB10 GPU (New)

3. CUDA 13.0 (Latest)

Precision Deep Dive (Updated)

Ollama: GPU-Accelerated Inference That Works

What This Means for Production

What's Production-Ready

What Needs More Testing

What About TensorRT-LLM?

My Actual Numbers vs. NVIDIA's Claims

Recommendations

If You're Using DGX Spark

If You're Considering DGX Spark

The Silver Lining

Lessons Learned

1. Test All Precision Combinations

2. Hardware Issues Can Look Like Software Bugs

3. Benchmark Numbers Are True But Incomplete

4. Bleeding Edge Means Trade-offs

5. Community Feedback is Invaluable

Community Acknowledgments

What I'd Tell NVIDIA

Conclusion: Powerful But Not Plug-and-Play

Appendix: Full Experimental Data

Inference Benchmark Details

Fine-Tuning Experiment Summary

System Configuration

Related Articles

Get the next experiment