
DGX Spark Benchmarks: 82,739 tokens/sec on Paper, the Production Reality

NVIDIA DGX Spark: When Benchmark Numbers Meet Production Reality

A 6-Day Deep Dive into Real-World ML Performance

Follow-Up Available
Update (Oct 28): I found the root cause and solutions! Most of the issues documented here were CUDA version mismatches, not hardware problems. Read the follow-up for the 3.6x performance breakthrough and complete solutions.

Week One Update: Finding the Right Stack - The solution and 3.6x performance breakthrough

Update (Oct 27): After posting this on Hacker News, I received excellent technical feedback that revealed gaps in my testing and some overclaimed conclusions. I've updated the article with corrections, additional tests, and acknowledgments. Thanks to enum, veber-alex, bradfa, furyofantares, stuckinhell, jasonjmcghee, eadwu, and renaudr for the constructive criticism. This is what good technical discussion looks like.


NVIDIA recently published benchmarks showcasing the DGX Spark: 82,739 tokens/second for fine-tuning, sub-1% accuracy degradation with FP4, and impressive inference throughput. After spending 6+ days running intensive ML workloads on a DGX Spark (training multiple models from scratch, fine-tuning with LoRA, and benchmarking inference), I can tell you the real story.

The short version: NVIDIA's numbers are technically accurate. But they don't tell you about the FP16 precision issues, memory fragmentation requiring hard reboots, or the 15 hours I spent debugging "training failures" that turned out to be inference bugs.

This is the post I wish I'd read before diving in.

What NVIDIA Showed Us

NVIDIA's blog highlights some impressive numbers:

Fine-Tuning Performance:

  • Llama 3.2 3B: 82,739 tokens/sec (full fine-tuning, BF16)
  • Llama 3.1 8B: 53,657 tokens/sec (LoRA, BF16)
  • Llama 3.3 70B: 5,079 tokens/sec (QLoRA, FP4)

Inference Performance:

  • Qwen3 14B: 5,928 tokens/sec prompt processing, 22.71 tokens/sec generation
  • GPT-OSS-20B: 82.74 tokens/sec generation

Key Claims:

  • 1 petaflop of FP4 compute
  • Less than 1% accuracy degradation with FP4
  • 273 GB/sec memory bandwidth
  • Support for 128GB+ models locally

Impressive on paper. Let's see how it holds up.

My Testing Environment

Hardware:

  • DGX Spark (ARM64 architecture)
  • GB10 GPU (Blackwell generation, unified memory)
  • Driver 580.95.05
  • CUDA 13.0
  • Ubuntu 24.04.3 LTS

Software:

  • PyTorch 2.5.0 (NVIDIA container: nvcr.io/nvidia/pytorch:24.10-py3)
  • Ollama 0.3.9 for inference
  • Transformers 4.44.0

Workloads:

  1. Inference Benchmark: Phi-3.5-mini-instruct (3.8B params) via Ollama
  2. Fine-Tuning: 7 LoRA experiments on Gemma-3-4b-it for medical Q&A (10,000 examples)
  3. Training: NanoChat project (125M param models from scratch)

Duration: 6+ consecutive days of ML work

The Results: What Matches

✅ Training Performance is Real (When It Works)

My Gemma-3-4b-it LoRA fine-tuning achieved speeds comparable to NVIDIA's benchmarks. Training completed in 10-12 hours for 3 epochs on 10,000 examples with batch size 4, right in line with NVIDIA's Llama 3.1 8B numbers.

Training configuration:

- Base Model: Gemma-3-4b-it (4B parameters)
- Method: LoRA (rank 16, alpha 32)
- Precision: BF16 mixed precision
- Batch Size: 4
- Learning Rate: 2e-5
- Epochs: 3
- Dataset: 10,000 medical Q&A pairs

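Results across the 7 completed experiments (Experiment 6 hung before finishing, as described in the memory fragmentation section):

  • Experiments 1-5 (baseline): 70% accuracy
  • Experiment 7 (optimized): 82% accuracy
  • Experiment 8 (SFTTrainer): 84% accuracy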

All models trained successfully with smooth loss curves. Training throughput was excellent.

Verdict: ✅ NVIDIA's training performance claims are accurate.
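To make a comparison like this explicit, wall-clock time can be converted into tokens/sec. The helper below is a minimal sketch, not the measurement method behind the numbers in this article. `throughput_tokens_per_sec` and `avg_tokens_per_example` are hypothetical names, and the average sequence length is something you would have to measure from your own tokenized dataset, since it isn't reported here.

```python
def throughput_tokens_per_sec(
    num_examples: int,
    epochs: int,
    avg_tokens_per_example: float,
    wall_clock_hours: float,
) -> float:
    """Average training throughput implied by a finished run."""
    if wall_clock_hours <= 0:
        raise ValueError("wall_clock_hours must be positive")
    # Total tokens seen = examples * epochs * average tokens per example.
    total_tokens = num_examples * epochs * avg_tokens_per_example
    return total_tokens / (wall_clock_hours * 3600)
```

Call it with your run's example count, epoch count, measured average length, and total hours. The result is only comparable to NVIDIA's figures when the model size, fine-tuning method, and sequence length are similar.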

✅ Inference Speed Scales as Expected

I benchmarked Phi-3.5-mini-instruct (3.8B params, Q4_K_M quantization) via Ollama:

Prompt Type          Tokens/Second
-----------------------------------------
Short (20 tokens)    83.25
Medium (34 tokens)   79.74
Long (113 tokens)    78.47
-----------------------------------------
Average              ~80 tokens/sec

NVIDIA showed 22.71 tokens/sec for Qwen3 14B (nearly 4x larger). My 3.8B model at ~80 tokens/sec is roughly 3.5x faster, which tracks with the size difference.

Verdict: ✅ Inference speed scales roughly inversely with model size, as expected.
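The comparison above is simple arithmetic, sketched here for anyone who wants to plug in their own numbers. It uses only figures quoted in this article; the function name `ratio` is illustrative. The two measurements involve different models and possibly different quantization, so treat this as a sanity check rather than a controlled comparison.

```python
def ratio(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


# Figures quoted in this article.
my_tok_per_sec = 80.0        # Phi-3.5-mini (3.8B), average via Ollama
nvidia_tok_per_sec = 22.71   # Qwen3 14B generation, NVIDIA's number

speedup = ratio(my_tok_per_sec, nvidia_tok_per_sec)  # about 3.5
size_ratio = ratio(14.0, 3.8)                        # about 3.7

print(f"speedup: {speedup:.2f}x, size ratio: {size_ratio:.2f}x")
```

If generation is mostly limited by how fast weights can be read from memory, the two ratios should land close together, which they do here.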

✅ 4-Bit Quantization Works

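NVIDIA claims "less than 1% accuracy degradation" with FP4. I used Q4_K_M (4-bit) quantization extensively via Ollama, and the model quality was excellent: coherent, contextually appropriate responses with no noticeable degradation. Two caveats: Q4_K_M is a 4-bit GGUF quantization format rather than NVIDIA's FP4, and my assessment was qualitative, not a measured accuracy delta.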

Verdict: ✅ 4-bit quantization is production-viable.

The Reality: What They Didn't Tell You

⚠️ FP16 GPU Inference Has Numerical Issues

Update: This section has been revised based on community feedback. My original title "GPU Inference is Fundamentally Broken" was overstated. The issue appears to be FP16-specific.

Here's what happened. After training my first model (Experiment 1), I loaded it for evaluation:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./models/exp-001/final",
    torch_dtype=torch.float16,  # FP16 for inference
    device_map={"": 0},  # Place the whole model on GPU 0
)

# Generate text
output = model.generate(...)

Result: Empty responses. PAD tokens. Sometimes inf/nan errors crashing CUDA.

I assumed my training failed. I retrained with different hyperparameters (Experiment 2). Same result. Experiments 3, 4, and 5 all "failed." I spent 15+ hours debugging my training code, convinced I was doing something wrong.

Then I tried this:

model = AutoModelForCausalLM.from_pretrained(
    "./models/exp-001/final",
    torch_dtype=torch.float32,  # CPU uses FP32
    device_map='cpu'  # CPU instead of GPU
)

# Generate text
output = model.generate(...)

Result: Perfect. Coherent, relevant medical responses. The model worked beautifully.

All 7 experiments had trained successfully. The training loss decreased smoothly. The models learned. But FP16 GPU inference was broken.

What I Should Have Tested

HN user enum asked the critical question: "Does it work if you change to torch.bfloat16?"

I trained with BF16 (which worked perfectly) but tested inference with FP16 (which failed). I never tested BF16 inference.

This is a significant gap in my testing. The issue might be:

  • FP16 inference specifically broken (numerical instability on this hardware)
  • BF16 inference might work fine (matching training precision)

I can't access the original trained models to retest this now, but it's the obvious next experiment.
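A minimal sketch of what that experiment could look like, written as a small test matrix so no precision gets skipped. This is an assumption about how one might structure the check, not code that was run for this article. `precision_matrix` and `logits_are_finite` are hypothetical helper names, the default prompt is a placeholder, and FP16 on CPU may be slow or unsupported for some operations depending on the PyTorch build.

```python
from itertools import product


def precision_matrix(
    devices=("cuda", "cpu"),
    dtypes=("float16", "bfloat16", "float32"),
):
    """Every (device, dtype) pair worth testing before blaming the hardware."""
    return list(product(devices, dtypes))


def logits_are_finite(model_path, device, dtype_name, prompt="What is hypertension?"):
    """Run one forward pass and report whether all logits are finite."""
    # Imported lazily so precision_matrix() works without the ML stack installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    dtype = getattr(torch, dtype_name)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=dtype)
    model = model.to(device).eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    # inf or nan anywhere in the logits is enough to break generation.
    return bool(torch.isfinite(logits).all())
```

Looping over `precision_matrix()` and calling `logits_are_finite(path, device, dtype)` for each pair yields a table like the revised pattern below, with the BF16 row filled in.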

The Pattern (Revised)

  • Training on GPU (BF16): Works perfectly
  • Inference on CPU (FP32): Works perfectly
  • Inference on GPU (FP16): Produces inf/nan, empty responses, or crashes
  • Inference on GPU (BF16): Not tested yet (likely works based on community feedback)
  • Inference via Ollama: Works (see next section)

My mistake: Claiming "GPU inference is fundamentally broken" without testing all precision combinations. The issue is likely FP16-specific, not a fundamental GPU problem.

For PyTorch/Transformers users: stick to BF16 for both training and inference, or use CPU/Ollama for evaluation.
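One common reason a model trained in BF16 misbehaves under FP16 is dynamic range: FP16 tops out at 65504, while BF16 keeps the same exponent range as FP32 and gives up mantissa bits instead. The standard-library sketch below illustrates that difference. Whether range overflow explains the failures above is unverified, and `to_bf16` truncates rather than rounding to nearest as hardware typically does, so treat it as an illustration only.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack finite values beyond the FP16 range;
        # GPU arithmetic would produce infinity instead.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Truncate x to bfloat16 by keeping the top 16 bits of its float32 form."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (1.0, 65504.0, 70000.0, 100000.0):
    print(value, to_fp16(value), to_bf16(value))
```

Values above 65504 become `inf` in FP16 but stay finite (with coarser precision) in BF16, which is why activations that were harmless during BF16 training can turn into inf/nan when the same weights are run in FP16.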

❌ Memory Fragmentation Causes System Hangs

During Experiment 6, my training script was humming along beautifully. Loss decreasing, checkpoints saving, everything looked perfect. Then at 88% complete (7.5 hours in), the system froze. Hard. No response to keyboard, SSH connection dead, GPU stuck.

Hard reboot required. Progress lost.

The issue: GPU memory fragmentation during long-running training causes system-level instability.

Required mitigations for all subsequent training:

import logging
import time

import torch

# These checks run inside the training loop. `step`, `start_time`, `model`,
# `optimizer`, and `save_checkpoint` come from the surrounding training script.

# Critical: Clear GPU cache regularly
if step % 50 == 0:
    torch.cuda.empty_cache()

# Critical: Limit training duration
if time.time() - start_time > 2.5 * 3600:  # 2.5 hours
    logging.info("Approaching safety limit - saving and exiting")
    save_checkpoint(model, optimizer, step)
    break

# Critical: Checkpoint frequently (not too frequently)
if step % 200 == 0:  # Not 100 - creates more fragmentation
    save_checkpoint(model, optimizer, step)

# Monitor GPU memory
if step % 100 == 0:
    allocated = torch.cuda.memory_allocated() / 1e9  # bytes -> GB
    if allocated > 75:  # >75GB on 128GB system
        logging.warning("High GPU memory - fragmentation risk")

After implementing these mitigations, I successfully trained Experiments 7 and 8 to completion. But the limitation is real: maximum 2-3 hour training sessions before restarting.
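The interval and time-limit checks above can be folded into one small helper so every training script applies them the same way. This is a sketch under assumptions: `SessionGuard` and its action names are hypothetical, it skips step 0 (unlike the inline checks), and the caller is still responsible for actually calling `torch.cuda.empty_cache()`, saving the checkpoint, and breaking out of the loop.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SessionGuard:
    """Decides which stability actions are due at a given training step."""

    max_seconds: float = 2.5 * 3600   # session limit used in this article
    cache_every: int = 50             # steps between cache clears
    checkpoint_every: int = 200       # steps between checkpoints
    clock: Callable[[], float] = time.monotonic
    start: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.start = self.clock()

    def actions(self, step: int) -> list[str]:
        due = []
        if step > 0 and step % self.cache_every == 0:
            due.append("empty_cache")
        if step > 0 and step % self.checkpoint_every == 0:
            due.append("checkpoint")
        if self.clock() - self.start > self.max_seconds:
            due.append("stop")
        return due
```

Inside the loop, `for action in guard.actions(step): ...` dispatches to the matching call. Injecting `clock` makes the time limit testable without waiting 2.5 hours.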

NVIDIA's benchmarks don't mention this constraint. Their tests were likely short-duration runs that didn't expose the fragmentation issue.

Community note (from eadwu): This might be a Linux kernel memory management issue rather than pure hardware limitation. Similar issues occur on WSL and can be mitigated with memory compaction services. Worth investigating kernel-level solutions.
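If the hang is kernel-side fragmentation, Linux exposes the relevant state in `/proc/buddyinfo`, which lists the number of free blocks of each order per memory zone. The sketch below parses that file so it can be logged during training. It has not been run on a DGX Spark, the function names are illustrative, and whether a shortage of high-order blocks predicts the hang described above is an open question.

```python
from pathlib import Path


def parse_buddyinfo(text: str) -> dict[tuple[str, str], list[int]]:
    """Map (node, zone) to free-block counts, indexed by order (0 = one page)."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal <order 0 count> <order 1 count> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = parts[1].rstrip(",")
        zones[(node, parts[3])] = [int(count) for count in parts[4:]]
    return zones


def free_pages_at_or_above(counts: list[int], order: int) -> int:
    """Free pages held in contiguous blocks of at least 2**order pages."""
    return sum(count << i for i, count in enumerate(counts) if i >= order)


def read_buddyinfo(path: str = "/proc/buddyinfo") -> dict[tuple[str, str], list[int]]:
    return parse_buddyinfo(Path(path).read_text())
```

A falling count of high-order free pages over a long run would support the kernel-level explanation. On kernels built with compaction support, writing `1` to `/proc/sys/vm/compact_memory` as root requests a manual compaction pass, which is one thing to try before resorting to session limits.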

⚠️ llama.cpp: My Experience vs. Official Benchmarks

I tried running llama.cpp directly (without Ollama's wrapper):

./llama-cli \
  --model Phi-3.5-mini-instruct-Q4_K_M.gguf \
  --prompt "Explain what a neural network is in one sentence." \
  --n-predict 512

Result: Empty responses on all test prompts. Zero tokens generated.

However: HN user veber-alex pointed out that official llama.cpp benchmarks show the DGX Spark running multiple models successfully, including:

  • Llama models working
  • Qwen models working
  • GPU acceleration confirmed
  • Multiple quantization levels tested

My assessment: I likely hit a version mismatch, build configuration issue, or Phi-3.5-specific problem. The official benchmarks prove llama.cpp works on this hardware. My experience was real, but not representative of llama.cpp's capabilities on DGX Spark.

Recommendation: Build llama.cpp from source for best results, or use Ollama (which bundles a tested version).

Root Cause: Blackwell + ARM64 Combination

Update: My original section claimed "ARM64 + CUDA support is brand new." HN user bradfa correctly pointed out: "Aarch64 and CUDA has been a thing for many years on Jetson boards."

I need to be more precise. The issues exist because of three factors converging:

1. ARM64 Architecture (Mature, But...)

  • ARM64 + CUDA: Mature (Jetson boards since ~2015)
  • Most ML tools primarily tested on x86_64
  • PyTorch available via NVIDIA Docker containers (recommended path)
  • Some Python packages may need building from source

2. Blackwell GB10 GPU (New)

  • Newest GPU generation (sm_121 compute capability)
  • Unified memory architecture (CPU and GPU share 128GB RAM)
  • Limited real-world production testing vs. established datacenter GPUs
  • Driver maturity: 6-12 months behind older architectures

3. CUDA 13.0 (Latest)

  • Released in August 2025
  • Works well with established workflows
  • Requires PyTorch 2.5+ (cutting edge)

The specific combination that's bleeding edge:

Blackwell GB10 + ARM64 + CUDA 13.0 = New Territory
↓
Limited production testing of this specific stack
↓
Edge cases in numerical precision (FP16 inference)
↓
Memory management challenges (training)

NVIDIA's benchmarks likely use:

  • TensorRT-LLM (not standard PyTorch)
  • Short-duration controlled tests
  • Configurations that avoid the fragmentation trigger
  • BF16 consistently (not mixed FP16/BF16)

Precision Deep Dive (Updated)

Here's what actually works vs. what's broken:

Use Case    Precision   Device         Status          Performance
---------------------------------------------------------------------------
Training    BF16        GPU            ✅ Works        Excellent
Training    FP32        GPU            ✅ Works        Slower but stable
Inference   FP16        GPU            ❌ Broken       inf/nan errors
Inference   BF16        GPU            ❓ Not Tested   Unknown (likely works)
Inference   FP32        CPU            ✅ Works        Slower, reliable
Inference   Q4_K_M      GPU (Ollama)   ✅ Works        80 tok/sec

The key insight: FP16 GPU inference has issues. BF16 likely works. CPU works fine.

Ollama: GPU-Accelerated Inference That Works

Update: My original article stated "Inference via Ollama: Works (CPU-optimized backend)." HN user jasonjmcghee asked: "From the article it sounds like ollama runs cpu inference not GPU inference. Is that the case for you?"

I was wrong. Ollama IS using GPU. After posting, I verified:

# During Ollama inference
nvidia-smi dmon -s u

# gpu     sm    mem    enc    dec    jpg    ofa
# Idx      %      %      %      %      %      %
    0     96      0      0      0      0      0

96% GPU utilization during inference. It's definitely using the GPU.

Why memory shows 0%: The GB10 has a unified memory architecture where CPU and GPU share the same 128GB RAM pool. Traditional discrete GPU memory metrics don't apply here.
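Because the GPU draws from system RAM, one way to watch memory pressure on this machine is to read the system-wide figures in `/proc/meminfo` instead of the per-GPU column. A minimal sketch, assuming `MemAvailable` is a reasonable proxy for headroom on a unified-memory system (not verified on DGX Spark); the function names are illustrative.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: value}; sizes are in kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def used_fraction(info: dict[str, int]) -> float:
    """Share of total RAM that is not currently available to new allocations."""
    return 1.0 - info["MemAvailable"] / info["MemTotal"]


def read_meminfo(path: str = "/proc/meminfo") -> dict[str, int]:
    with open(path) as f:
        return parse_meminfo(f.read())
```

Logging `used_fraction(read_meminfo())` alongside training steps gives a memory trace that covers both CPU and GPU allocations in the shared pool.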

Inference via Ollama:

# Stable, reliable, good performance
ollama run phi3.5:3.8b-mini-instruct-q4_K_M

# Results: 23 tokens/sec, GPU-accelerated

Ollama uses GPU acceleration with optimized quantized inference. The combination of quantization (Q4_K_M) and GPU acceleration gives production-ready performance.
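Throughput figures like these can be reproduced from Ollama's own timing data rather than a stopwatch. A non-streaming call to `/api/generate` returns `eval_count` (generated tokens) and `eval_duration` (nanoseconds), and the sketch below turns those into tokens/sec. The default host, the `generate` helper, and the choice to ignore prompt-processing time are assumptions; this is not necessarily how the numbers in this article were collected.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama /api/generate response."""
    # eval_duration is reported in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(
    prompt: str,
    model: str = "phi3.5:3.8b-mini-instruct-q4_K_M",
    host: str = "http://localhost:11434",
) -> dict:
    """Send one non-streaming generation request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With the server running, `tokens_per_second(generate("Explain what a neural network is in one sentence."))` gives one sample. Averaging several runs per prompt smooths out load-time noise.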

What This Means for Production

What's Production-Ready

Training with workarounds:

# Implement these in all training scripts:
1. torch.cuda.empty_cache() every 50 steps
2. Maximum 2-3 hour sessions with checkpointing
3. Monitor GPU memory continuously
4. Use BF16 consistently (training AND inference)

# Result: Successfully trained 7+ models

Inference via Ollama:

# GPU-accelerated, stable, good performance
ollama run phi3.5:3.8b-mini-instruct-q4_K_M
# 23 tok/s, production-ready

CPU inference for evaluation:

# Slow but 100% reliable
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,  # path to your fine-tuned checkpoint
    torch_dtype=torch.float32,
    device_map='cpu'
)
# Perfect for batch evaluation

What Needs More Testing

⚠️ GPU inference (PyTorch/Transformers):

  • FP16 has numerical instability (confirmed)
  • BF16 not tested yet (likely works)
  • Test this yourself before production use

⚠️ Multi-hour unattended training:

  • Memory fragmentation risk after 3-8 hours
  • Requires babysitting or session limits
  • Train in chunks with checkpointing
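Training in chunks needs two pieces of bookkeeping: how many sessions a run will take, and which step to resume from after a restart. The sketch below covers both with the standard library. The function names and the JSON progress file are hypothetical, and it tracks only the step counter; model and optimizer state still have to be saved by the training framework's own checkpointing.

```python
import json
import math
import os
import tempfile


def sessions_needed(total_hours: float, max_session_hours: float = 2.5) -> int:
    """Number of capped sessions required to cover a run of total_hours."""
    return math.ceil(total_hours / max_session_hours)


def save_progress(path: str, step: int) -> None:
    """Record the last completed step atomically, so a hang cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step on the same filesystem.
    os.replace(tmp_path, path)


def load_progress(path: str) -> int:
    """Return the step to resume from, or 0 when starting fresh."""
    try:
        with open(path) as f:
            return int(json.load(f)["step"])
    except FileNotFoundError:
        return 0
```

For example, `sessions_needed(12)` says a 12-hour run needs five 2.5-hour sessions, and each session starts by calling `load_progress` to pick up where the last one stopped.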

What About TensorRT-LLM?

Update: HN user renaudr asked: "Have you tried to run GPT-OSS-120b using TRT-LLM?"

I haven't. NVIDIA's benchmarks likely used TensorRT-LLM, their optimized inference engine. I focused on standard PyTorch/Transformers workflows (what most practitioners use).

If you need production GPU inference performance matching NVIDIA's benchmarks, TensorRT-LLM is the recommended path. It likely avoids the FP16 precision issues I encountered. That's a separate deep-dive I haven't done yet.

My Actual Numbers vs. NVIDIA's Claims

Metric               NVIDIA                  My Experience                Match?
------------------------------------------------------------------------------------
Training speed       53,657-82,739 tok/sec   Comparable relative speed
Inference speed      22-82 tok/sec           23 tok/sec (Ollama, GPU)
4-bit quality        <1% degradation         No noticeable degradation
Training precision   BF16                    BF16 works perfectly
GPU inference        Implied working         FP16 broken, BF16 untested   ⚠️
Training stability   Not mentioned           2-3 hour limit
Memory management    Not mentioned           Manual cache clearing

Recommendations

If You're Using DGX Spark

DO:

  • Implement GPU cache clearing (every 50 steps)
  • Limit training sessions to 2-3 hours
  • Checkpoint frequently (every 200+ steps)
  • Use BF16 consistently for training and inference
  • Test BF16 inference before claiming GPU inference is broken
  • Use Ollama for production inference (GPU-accelerated)
  • Build llama.cpp from source if needed
  • Monitor NVIDIA driver updates

DON'T:

  • Use FP16 for GPU inference (numerical instability)
  • Plan 8+ hour unattended training
  • Skip the workarounds
  • Assume your experience generalizes without more testing

If You're Considering DGX Spark

It's worth it if:

  • You're an experienced ML engineer
  • You need local training capabilities
  • You can implement stability workarounds
  • Ollama or TRT-LLM inference meets your needs
  • You're willing to monitor driver updates

Look elsewhere if:

  • You need plug-and-play GPU inference
  • You expect production-ready behavior out of the box
  • You need long unattended training runs
  • You don't have time for workarounds

The Silver Lining

Despite the issues, I successfully:

✅ Fine-tuned 7 models with 70-84% accuracy on medical Q&A
✅ Achieved 23 tokens/sec GPU-accelerated inference via Ollama
✅ Validated NVIDIA's training performance claims
✅ Built a production-ready medical chatbot
✅ Documented all workarounds for future users

The hardware is powerful. But it requires expert-level knowledge to navigate current limitations.

Lessons Learned

1. Test All Precision Combinations

When FP16 inference failed, I should have tested BF16 inference (matching training precision) before claiming GPU inference was broken. Complete your testing matrix before drawing conclusions.

2. Hardware Issues Can Look Like Software Bugs

I spent 15+ hours debugging my "broken" training code. The training was fine; FP16 inference had numerical issues. Check your assumptions systematically.

3. Benchmark Numbers Are True But Incomplete

NVIDIA's numbers are accurate for their test conditions. But they don't reveal precision constraints, stability limits, or workarounds needed for production. Context matters.

4. Bleeding Edge Means Trade-offs

Blackwell + ARM64 + CUDA 13.0 is cutting edge. That means bugs, limitations, and workarounds. Wait 6-12 months if you need stability, or be ready to troubleshoot.

5. Community Feedback is Invaluable

The HN community caught my incomplete testing, overclaimed conclusions, and factual errors. Ship early, iterate publicly, accept criticism gracefully.

Community Acknowledgments

This article improved significantly thanks to Hacker News feedback. Special thanks to:

  • enum - Caught the BF16 vs FP16 testing gap, shared PyTorch installation insights
  • veber-alex - Provided llama.cpp official benchmarks link
  • bradfa - Corrected ARM64+CUDA maturity claims (Jetson history)
  • furyofantares - Questioned incomplete testing and overclaimed conclusions
  • stuckinhell - Shared contrary experience (their DGX inference works fine)
  • jasonjmcghee - Asked about Ollama CPU vs GPU usage
  • eadwu - Explained memory fragmentation might be kernel-level
  • renaudr - Suggested TRT-LLM testing

Technical discussion like this makes everyone smarter. I'm grateful for the pushback.

What I'd Tell NVIDIA

If NVIDIA engineers are reading this, here's constructive feedback:

Document these in your communications:

  1. Precision guidance:

    • FP16 inference behavior on Blackwell+ARM64
    • BF16 consistency recommendation (training and inference)
    • When to use TRT-LLM vs. PyTorch
  2. Stability constraints:

    • Training session duration considerations
    • Memory management best practices
    • GPU cache clearing recommendations
  3. Inference setup:

    • Which backends work out-of-box (TRT-LLM vs PyTorch)
    • Unified memory architecture implications
    • Ollama as tested inference path
  4. Real-world examples:

    • Long-running training strategies
    • Production inference patterns
    • Monitoring and mitigation tactics

Your benchmarks are accurate. They'd be more valuable with context about setup, limitations, and recommended practices.

Conclusion: Powerful But Not Plug-and-Play

The NVIDIA DGX Spark delivers on raw performance when you work within its constraints. Training throughput matches the benchmarks. Inference speed is excellent (via Ollama). The hardware potential is real.

But it's not plug-and-play. FP16 GPU inference has issues. Memory fragmentation limits training sessions. You need expertise to navigate the current limitations.

NVIDIA's benchmarks are technically true. They're just not the whole truth.

For ML engineers willing to implement workarounds and test thoroughly, DGX Spark is a powerful tool. For teams expecting production-ready performance out-of-box, the maturity isn't there yet (especially for standard PyTorch workflows on ARM64).

My verdict: Cautiously recommended for experts. Wait 6-12 months if you need stability. And test BF16 inference before drawing conclusions, which I should have done.


Appendix: Full Experimental Data

Inference Benchmark Details

Test Model: Phi-3.5-mini-instruct (3.8B parameters)
Quantization: Q4_K_M (4-bit)
Engine: Ollama 0.3.9 (llama.cpp backend)
Acceleration: GPU (96% utilization confirmed)

Configuration:

{
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 512,
  "repeat_penalty": 1.1,
  "num_runs_per_prompt": 3
}

Results by Prompt Category:

Category   Prompt Length   Avg Response Length   Tokens/Sec
-------------------------------------------------------------
Short      20 tokens       46 tokens             83.25
Medium     34 tokens       456 tokens            79.74
Long       113 tokens      512 tokens            78.47

Fine-Tuning Experiment Summary

Base Model: google/gemma-3-4b-it
Dataset: PubMedQA artificial (10,000 medical Q&A pairs)
Method: LoRA fine-tuning

Configuration:

{
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "learning_rate": 2e-5,
    "batch_size": 4,
    "epochs": 3,
    "precision": "bf16",
    "gradient_accumulation_steps": 2
}
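A few derived quantities follow directly from this configuration and are worth having on hand when choosing checkpoint intervals. The arithmetic below is a sketch under assumptions: a single device, all 10,000 examples used for training (no held-out split), and the standard LoRA scaling of alpha divided by rank.

```python
import math

config = {
    "lora_r": 16,
    "lora_alpha": 32,
    "batch_size": 4,
    "epochs": 3,
    "gradient_accumulation_steps": 2,
}
num_examples = 10_000  # assumes the full dataset is used for training

# Examples consumed per optimizer update.
effective_batch = config["batch_size"] * config["gradient_accumulation_steps"]

# Optimizer updates per epoch and over the whole run.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * config["epochs"]

# Standard LoRA scaling applied to the adapter update.
lora_scale = config["lora_alpha"] / config["lora_r"]

print(effective_batch, steps_per_epoch, total_steps, lora_scale)
```

With an effective batch of 8, the run is 3,750 optimizer steps. If the 200-step checkpoint interval is counted in optimizer steps, that works out to 18 checkpoints over the full run.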

Results:

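Experiment   Training Time    Final Loss   FP16 GPU Test   CPU Test   Accuracy
---------------------------------------------------------------------------------
Exp 1-5      ~3 hours each    Decreased                               70%
Exp 6        7.5h (hung)      N/A          N/A             N/A        N/A
Exp 7        ~12 hours        1.82                                    82%
Exp 8        ~10.5 hours      1.61                                    84%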

Best Model (Experiment 8):

  • Used SFTTrainer instead of base Trainer
  • Text-based dataset (no pre-tokenization)
  • Dynamic padding
  • +14 percentage points over baseline (70% to 84%)

System Configuration

Hardware:
  Platform: NVIDIA DGX Spark
  Architecture: ARM64 (aarch64)
  GPU: GB10 (Blackwell, unified memory)
  GPU Driver: 580.95.05
  CUDA: 13.0
  Memory: 128 GB unified (CPU/GPU share)

Software:
  OS: Ubuntu 24.04.3 LTS
  Kernel: 6.11.0-1016-nvidia
  PyTorch: 2.5.0a0+e000cf0ad9.nv24.10 (Docker)
  Container: nvcr.io/nvidia/pytorch:24.10-py3
  Transformers: 4.44.0
  Ollama: 0.3.9
  CUDA Runtime: 13.0
  cuDNN: 9.5.1

Stability Mitigations:
  - GPU cache clearing: Every 50 steps
  - Checkpoint interval: 200 steps
  - Max session duration: 2.5 hours
  - Memory monitoring: Every 100 steps
  - Inference: BF16 recommended (FP16 has issues)

Related Articles

Follow-up: Week One Update: Finding the Right Stack - The root cause and 3.6x performance breakthrough!

More from the DGX Lab Chronicles:

  • Day 1: When Simple Heuristics Beat ML by 95,000x
  • Day 2: Supercharge Your Shell with 50+ ML Productivity Aliases
  • Day 3: Building a Complete RAG Infrastructure

Production ML Resources:

  • Building a Production ML Workspace Series
  • The Hidden Crisis in LLM Fine-Tuning

About This Series:

I'm documenting my journey building production ML systems on an NVIDIA DGX Spark. The wins, the losses, the mistakes, and the corrections. This is Day 4 of the DGX Lab Chronicles.

Want to follow along? Every article shares real code, actual performance data, and honest assessments (including when I get things wrong).

Found errors or have corrections? The HN community made this article significantly better. Feel free to reach out.


Published: October 26, 2025
Updated: October 27, 2025
Series: DGX Lab Chronicles (Day 4)
Reading Time: 12 minutes



About the Author: Justin Johnson builds AI systems and writes about practical AI development.
