When ARIA Crashed the DGX: Building GPU Monitoring in 5 Minutes
The Crash
6:51 AM. ARIA experiment hung. 10:30 AM. System crashed.
The autonomous research system I'd been running for 437 sessions tried to load a 650M parameter genomics model with 1Mb context and process 100 test sequences. The math was simple.
Embedding memory per sequence: 2.60 GB. With 100 sequences: Peak memory = 260.02 GB. GPU available: 128.5 GB.
260 GB > 128 GB = Out-of-Memory crash that froze the entire system. SSH unresponsive. Hard reboot required.
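The arithmetic is easy to sanity-check before launching a run. A minimal sketch, assuming the 2.60 GB per-sequence figure is a rounded value (which would explain why the product lands on 260.00 rather than the reported 260.02); the helper name is hypothetical:

```python
def fits_in_memory(per_item_gb: float, n_items: int, available_gb: float) -> bool:
    """Return True if holding every item at once stays within available memory."""
    peak_gb = per_item_gb * n_items
    return peak_gb <= available_gb


# Figures from the crash: 2.60 GB per sequence, 100 sequences, 128.5 GB available.
peak_gb = 2.60 * 100
fits = fits_in_memory(2.60, 100, 128.5)
print(f"peak {peak_gb:.2f} GB vs 128.5 GB available, fits: {fits}")

# Largest number of sequences that could be resident at once under the same assumption.
max_sequences = int(128.5 // 2.60)
print(f"max sequences held at once: {max_sequences}")
```

Under these numbers, fewer than half of the 100 sequences could ever be resident at once, so the run was going to fail no matter how the loop was written unless results were released between sequences.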
The experiment code had a memory leak. It created species_ids tensors inside the loop for every sequence instead of reusing them. No batching. No cleanup. Just accumulating tensors until the GPU choked.
I could have spent hours debugging the experiment code. Instead, I spent 5 minutes building a monitoring system that would catch this before it happens again.
The GB10 Architecture Challenge
The NVIDIA DGX Spark is built around the GB10 Grace Blackwell Superchip (Grace CPU + Blackwell GPU). The platform uses a unified memory architecture instead of dedicated GPU framebuffer memory.
$ nvidia-smi
+-------------------------------------------------------------------------+
| GPU Name Persistence-M | Bus-Id Disp.A |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage |
|=========================================================================|
| 0 NVIDIA GB10 On | 00000000:91:00.0 Off |
| 32% 31C P2 77W / 550W | [N/A] |
+-------------------------------------------------------------------------+
Memory-Usage | Not Supported
Not Supported. That's the documented behavior for nvidia-smi memory queries on GB10's unified memory architecture. The compute works perfectly (CUDA 13.0, PyTorch 2.9.0). Temperature works. GPU utilization works. But the traditional framebuffer-based memory reporting doesn't apply here.
This creates challenges for standard monitoring tools that expect dedicated framebuffer memory:
- Prometheus + DCGM: Uses NVML memory APIs (not available on unified memory)
- Netdata: Uses nvidia-smi memory queries (returns "Not Supported")
- Telegraf: Uses nvidia-smi memory queries (returns "Not Supported")
- Datadog, New Relic: Use DCGM (also affected, plus $180/month)
NVML (NVIDIA Management Library) is the standard API for GPU monitoring. On GB10's unified memory architecture, NVML memory info APIs return "Not Supported" by design. This isn't a driver bug. It's a fundamental architectural difference.
The ecosystem tooling is still adapting to unified memory platforms. Some tools crash when memory APIs return unexpected values. Others simply show incomplete data. I needed monitoring that works with GB10's architecture, not against it.
What I Built in 5 Minutes
Three layers of defense against memory crashes:
Layer 1: systemd-oomd (already installed)
A userspace OOM killer that ships with systemd. It uses PSI (Pressure Stall Information) to detect thrashing before the crash. When memory pressure on a monitored cgroup stays above 50%, it kills the processes in that cgroup. System stays responsive.
Layer 2: Custom GPU monitor (the 5-minute build)
A Python script that monitors GPU VRAM via PyTorch (not NVML), system RAM via /proc/meminfo, and memory pressure via PSI. It sends Telegram alerts when thresholds are exceeded.
Layer 3: Netdata (already running)
Real-time dashboard for historical metrics and visualization.
The critical piece was Layer 2. PyTorch uses the CUDA Runtime API, not NVML. It works perfectly on GB10.
The Code
import torch
def get_gpu_memory():
    """Get GPU memory via PyTorch (works on GB10)"""
    if not torch.cuda.is_available():
        return None
    allocated = torch.cuda.memory_allocated(0) / 1e9
    reserved = torch.cuda.memory_reserved(0) / 1e9
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    return {
        "used_gb": reserved,
        "total_gb": total,
        "used_pct": (reserved / total) * 100
    }
This works. Accurate VRAM usage even though nvidia-smi shows [N/A].
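One caveat worth keeping in mind: torch.cuda.memory_allocated() and torch.cuda.memory_reserved() describe the caching allocator of the process that calls them, so a monitor launched fresh from cron would see close to zero no matter what other jobs are holding. A hedged sketch of a device-wide alternative using torch.cuda.mem_get_info(), which wraps cudaMemGetInfo and returns free and total bytes. The helper names are hypothetical, and how GB10's unified memory is reported through this call is an assumption that has not been verified here:

```python
def usage_from_free_total(free_bytes: int, total_bytes: int) -> dict:
    """Convert a (free, total) byte pair into the same dict shape used above."""
    used_bytes = total_bytes - free_bytes
    return {
        "used_gb": used_bytes / 1e9,
        "total_gb": total_bytes / 1e9,
        "used_pct": (used_bytes / total_bytes * 100) if total_bytes > 0 else 0.0,
    }


def get_device_memory(device: int = 0):
    """Device-wide memory usage, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # mem_get_info reports usage across all processes on the device,
    # not just tensors allocated by this Python process.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return usage_from_free_total(free_bytes, total_bytes)
```

Calling get_device_memory() from the monitor would then return the same used_gb / total_gb / used_pct keys, so the threshold logic does not need to change.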
For system memory and PSI:
def get_memory_info():
    """Get system memory from /proc/meminfo"""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                key = parts[0].rstrip(":")
                info[key] = int(parts[1]) * 1024  # KB to bytes
    total = info.get("MemTotal", 0) / 1e9
    available = info.get("MemAvailable", 0) / 1e9
    used = total - available
    return {
        "used_gb": used,
        "total_gb": total,
        "used_pct": (used / total * 100) if total > 0 else 0
    }
def get_psi_pressure():
    """Read PSI from /proc/pressure/memory"""
    with open("/proc/pressure/memory") as f:
        lines = f.readlines()
    some_line = [l for l in lines if l.startswith("some")][0]
    some_avg10 = float(some_line.split()[1].split("=")[1])
    full_line = [l for l in lines if l.startswith("full")][0]
    full_avg10 = float(full_line.split()[1].split("=")[1])
    return {"some_avg10": some_avg10, "full_avg10": full_avg10}
PSI (Pressure Stall Information) is the secret weapon. It's been in the Linux kernel since 4.20 (2018), but most people don't use it.
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=45967
full avg10=0.00 avg60=0.00 avg300=0.00 total=45766
When some avg10 hits 50%, it means that for half of the last 10 seconds at least one task was stalled waiting for memory. That's thrashing. That's when you get alerted, while the system is still responsive.
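The PSI file format is regular enough to parse in full rather than picking out one field by position. A small sketch, using the sample output above as input; the function name is hypothetical, and the only assumption is the documented layout of one "some" line and one "full" line with key=value pairs:

```python
def parse_psi(text: str) -> dict:
    """Parse /proc/pressure/* content into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # avg10/avg60/avg300 are percentages; total is cumulative stall time
            # in microseconds, so it is kept as an integer.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


sample = """some avg10=0.00 avg60=0.00 avg300=0.00 total=45967
full avg10=0.00 avg60=0.00 avg300=0.00 total=45766"""
psi = parse_psi(sample)
print(psi["some"]["avg10"], psi["full"]["total"])
```

Keeping avg60 and avg300 around makes it possible to tell a ten-second spike apart from pressure that has been building for minutes.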
The threshold check:
def check_thresholds(metrics):
    """Check if any thresholds exceeded"""
    alerts = []
    if metrics.gpu_used_pct > 85.0:
        alerts.append(f"GPU memory at {metrics.gpu_used_pct:.1f}%")
    if metrics.memory_used_pct > 80.0:
        alerts.append(f"System RAM at {metrics.memory_used_pct:.1f}%")
    if metrics.psi_some_avg10 > 50.0:
        alerts.append(f"Memory pressure at {metrics.psi_some_avg10:.1f}% - System is thrashing!")
    return alerts
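check_thresholds reads attributes from a metrics object that the excerpts here never define. One way it could be assembled from the three collector dicts, as a sketch: the field names are taken from the snippets in this post, while the Metrics class name, the build_metrics helper, and the choice to report zeros when no GPU is visible are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metrics:
    gpu_used_gb: float
    gpu_total_gb: float
    gpu_used_pct: float
    memory_used_gb: float
    memory_total_gb: float
    memory_used_pct: float
    psi_some_avg10: float
    psi_full_avg10: float


def build_metrics(gpu: Optional[dict], mem: dict, psi: dict) -> Metrics:
    """Combine the GPU, RAM, and PSI dicts into one object."""
    # get_gpu_memory() returns None without CUDA; fall back to zeros (assumed policy).
    gpu = gpu or {"used_gb": 0.0, "total_gb": 0.0, "used_pct": 0.0}
    return Metrics(
        gpu_used_gb=gpu["used_gb"],
        gpu_total_gb=gpu["total_gb"],
        gpu_used_pct=gpu["used_pct"],
        memory_used_gb=mem["used_gb"],
        memory_total_gb=mem["total_gb"],
        memory_used_pct=mem["used_pct"],
        psi_some_avg10=psi["some_avg10"],
        psi_full_avg10=psi["full_avg10"],
    )


# Sample values matching the test alert shown later in this post.
metrics = build_metrics(
    {"used_gb": 111.8, "total_gb": 128.0, "used_pct": 87.3},
    {"used_gb": 57.8, "total_gb": 128.0, "used_pct": 45.2},
    {"some_avg10": 0.0, "full_avg10": 0.0},
)
print(metrics.gpu_used_pct, metrics.memory_used_pct, metrics.psi_some_avg10)
```

In the real script the three arguments would come from get_gpu_memory(), get_memory_info(), and get_psi_pressure().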
Telegram integration:
import subprocess
def send_telegram_alert(alerts, metrics):
    """Send alert via Telegram bot"""
    message = f"""⚠️ *Resource Alert*
GPU: {metrics.gpu_used_pct:.1f}% ({metrics.gpu_used_gb:.1f}/{metrics.gpu_total_gb:.1f} GB)
RAM: {metrics.memory_used_pct:.1f}% ({metrics.memory_used_gb:.1f}/{metrics.memory_total_gb:.1f} GB)
PSI: {metrics.psi_some_avg10:.1f}%
*Issues:*
"""
    for alert in alerts:
        message += f"• {alert}\n"
    # notify_script is the path to the Telegram wrapper, defined elsewhere in the script
    subprocess.run([notify_script, message])
The complete script is 200 lines. Deployed via cron (every 5 minutes). Logs to (local path).
Cron vs Systemd Service
I chose cron over a systemd service. Simpler deployment, lower resource usage, and 5-minute checks are sufficient for catching problems before they become crashes.
Telegram Bot Setup (2 minutes)
Creating the bot was the easiest part:
- Message @BotFather on Telegram: /newbot
- Name: DGX Alerts
- Get bot token
- Start chat with your bot, send any message
- Visit https://api.telegram.org/bot<TOKEN>/getUpdates to get chat ID
- Save credentials:
cat > (local path) <<EOF
TELEGRAM_BOT_TOKEN="your-token-here"
TELEGRAM_CHAT_ID="your-chat-id"
EOF
chmod 600 (local path)
Instant alerts to my phone. Better than email (don't check constantly), desktop notifications (server has no GUI), or Slack (don't use personally).
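The notify wrapper itself isn't shown in this post. For reference, here is a hedged sketch of what talking to the Bot API directly could look like using only the standard library: it builds a sendMessage request and pulls chat IDs out of a getUpdates response. The function names are hypothetical, and this is not necessarily how the wrapper does it.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.telegram.org"


def build_send_message_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Bot API sendMessage method."""
    url = f"{API_BASE}/bot{token}/sendMessage"
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def send_message(token: str, chat_id: str, text: str, timeout: float = 10.0) -> bool:
    """Send a plain-text message; returns the API's "ok" flag."""
    request = build_send_message_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return bool(payload.get("ok"))


def extract_chat_ids(get_updates_payload: dict) -> list:
    """Collect distinct chat IDs from a getUpdates JSON response."""
    chat_ids = []
    for update in get_updates_payload.get("result", []):
        chat = update.get("message", {}).get("chat", {})
        if "id" in chat and chat["id"] not in chat_ids:
            chat_ids.append(chat["id"])
    return chat_ids
```

A call such as send_message(token, chat_id, "GPU memory at 87.3%") would read the two values from the credentials file saved above. The message is sent as plain text here; rendering the *bold* markers in the alert template would additionally need the API's parse_mode parameter.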
systemd-oomd Configuration (1 minute)
Ubuntu ships systemd-oomd by default since 22.04. Most people don't configure it.
# /etc/systemd/oomd.conf.d/dgx.conf
[OOM]
SwapUsedLimit=90%
DefaultMemoryPressureLimit=50%
DefaultMemoryPressureDurationSec=20s
This means:
- If swap usage > 90%, kill offending process
- If memory pressure > 50% for 20 seconds, kill offending process
- Act fast (20 seconds, not the default 30)
systemd-oomd prevents total system freeze. If a process is thrashing, it gets killed before SSH becomes unresponsive. You can still log in and investigate instead of hard rebooting.
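The duration setting matters as much as the limit: a brief spike above 50% should not trigger a kill, while pressure that stays above the limit for the whole window should. The idea can be illustrated with a few lines of Python. This is only a model of the "limit plus duration" rule, not systemd-oomd's actual implementation; the function name and the 5-second sampling interval are assumptions.

```python
import math


def sustained_over_limit(samples, limit_pct: float, duration_s: float, interval_s: float) -> bool:
    """True if consecutive samples stay above limit_pct for at least duration_s."""
    needed = max(1, math.ceil(duration_s / interval_s))
    run = 0
    for value in samples:
        # Any sample at or below the limit resets the streak.
        run = run + 1 if value > limit_pct else 0
        if run >= needed:
            return True
    return False


# Pressure sampled every 5 seconds, checked against a 50% limit held for 20 seconds.
spike = [10.0, 70.0, 80.0, 20.0, 10.0]
sustained = [55.0, 60.0, 65.0, 70.0, 30.0]
print(sustained_over_limit(spike, 50.0, 20.0, 5.0))
print(sustained_over_limit(sustained, 50.0, 20.0, 5.0))
```

The first series spends only two samples above the limit and is ignored; the second stays above it for four consecutive samples and would count as sustained pressure.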
It's already saved the system twice since I configured it.
Testing
Validation with synthetic load:
# Allocate ~1.6GB GPU memory (20000 x 20000 float32 values)
import torch
x = torch.randn(20000, 20000, device='cuda')
# Monitor catches it
(local path)
Result:
🖥️ DGX Spark [11:15:23]
⚠️ Resource Alert
GPU: 87.3% (111.8/128.0 GB)
RAM: 45.2% (57.8/128.0 GB)
PSI: 0.0%
Issues:
• GPU memory at 87.3% [threshold: 85%]
Real Telegram alert showing GPU memory crossing the 85% threshold. System detected high memory usage and sent instant notification before crash could occur.
Alert arrived on my phone instantly. System still responsive. Training job could continue or be gracefully stopped.
Why DIY Beats Enterprise Here
Standard solution (Prometheus + Grafana + DCGM):
- Setup time: 4-8 hours
- Maintenance: 1-2 hours/month
- Resources: 1-2GB RAM, 5-10% CPU (3 services)
- GB10 unified memory: NVML memory APIs not available
- Cost: Free (but complex)
DIY solution:
- Setup time: 5 minutes (one-time)
- Maintenance: <30 min/month (review logs)
- Resources: <250MB RAM, <3% CPU
- GB10 unified memory: PyTorch CUDA Runtime API works
- Cost: $0
Enterprise monitoring is designed for multi-node clusters with SRE teams and traditional GPU architectures. For a single homelab node with unified memory architecture, a 200-line Python script is better. Easier to understand, faster to debug, zero cost, tailored to the platform's actual capabilities.
When Enterprise Tools Make Sense
For traditional PCIe GPUs with dedicated VRAM, Prometheus + DCGM is the right choice. It's battle-tested and scales beautifully. For unified memory platforms like GB10, you need monitoring that works with the architecture rather than expecting framebuffer-style telemetry.
Performance Impact
Measured overhead:
CPU Usage:
- Before monitoring: 2-3% idle, 85-92% training
- After monitoring: No measurable change
- systemd-oomd: <0.1% CPU
- dgx-monitor (cron): <0.1% CPU (runs 5 seconds every 5 minutes)
Memory:
- Before: 8-10GB idle
- After: +200MB (systemd-oomd 10MB, dgx-monitor 50MB, Netdata 150MB)
Disk I/O:
- Log growth: ~6MB/day
- Negligible for 1.92TB NVMe
Network:
- Telegram API: <5KB/day
- Negligible
Zero impact on training performance.
What I Learned
1. Unified memory architectures need different approaches
The GB10 Grace Blackwell Superchip uses unified memory instead of dedicated GPU framebuffer memory. This architectural choice affects how monitoring works. Standard NVML-based tools expect traditional memory reporting. Unified memory platforms need different monitoring approaches.
2. PyTorch as a monitoring tool
We think of PyTorch as a training framework. It's also a reliable way to query GPU state via CUDA Runtime API. When NVML fails, PyTorch succeeds.
3. PSI is underutilized
Pressure Stall Information catches thrashing before the system freezes. It's been in the kernel since 2018, and barely anyone uses it. Game-changer for memory monitoring.
4. systemd-oomd saves systems
Configure it properly (50% threshold, 20 second delay). It's already killed runaway processes twice before total freeze. Don't ignore it.
5. Telegram for server alerts is excellent
Setup takes a couple of minutes. API is dead simple. Notifications are instant and free. Better than email (slow), SMS (costs money), or Slack (enterprise overkill).
6. The 5-minute rule
When a system crashes, you have two choices: spend hours debugging the root cause, or spend 5 minutes building safeguards so it can't happen again. I chose safeguards. The ARIA experiment still needs fixing, but at least I'll get a warning before the next crash.
The Files
Complete implementation:
(local path)
├── scripts/
│ ├── dgx-monitor.py # Main monitoring script (200 lines)
│ ├── notify # Telegram notification wrapper
│ └── dgx-monitor.cron # Cron job template
├── docs/
│ └── MONITORING.md # Full documentation
└── logs/
└── dgx-monitor.log # Alert history
Cron configuration:
# /etc/cron.d/dgx-monitor
*/5 * * * * username (local path) >> (local path) 2>&1
systemd-oomd configuration:
# /etc/systemd/oomd.conf.d/dgx.conf
[OOM]
SwapUsedLimit=90%
DefaultMemoryPressureLimit=50%
DefaultMemoryPressureDurationSec=20s
What's Next
The monitoring system works. Now I need to fix the actual ARIA experiment code:
- Move species_ids outside the loop (create once, reuse)
- Add batching (process multiple sequences together)
- Reduce context size (1Mb is excessive, use realistic lengths)
- Add memory monitoring in the experiment loop (clear cache periodically)
- Enable quick mode by default (validate with small datasets first)
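The first two fixes share one shape: build the reusable input once, then walk the sequences in fixed-size batches so only one batch of results is in flight at a time. The ARIA code isn't shown here, so this is a structural sketch only; make_species_ids and process_batch are hypothetical stand-ins for the real tensor construction and model call, and the batch size of 8 is arbitrary.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_experiment(
    sequences: Sequence,
    make_species_ids: Callable[[], object],
    process_batch: Callable[[Sequence, object], List],
    batch_size: int = 8,
) -> List:
    """Process sequences in batches, creating species_ids once instead of per sequence."""
    species_ids = make_species_ids()  # hoisted out of the loop and reused
    results = []
    for batch in batched(sequences, batch_size):
        results.extend(process_batch(batch, species_ids))
    return results


# Stand-in callables to show the call pattern; the real ones would touch the GPU.
calls = {"make": 0, "batches": 0}

def fake_make_species_ids():
    calls["make"] += 1
    return "species-ids"

def fake_process_batch(batch, species_ids):
    calls["batches"] += 1
    return [len(seq) for seq in batch]

outputs = run_experiment(["ACGT"] * 100, fake_make_species_ids, fake_process_batch)
print(calls, len(outputs))
```

With real tensors, process_batch would also need to return only small summaries (or move results off the GPU) so that per-sequence embeddings are not kept alive across batches.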
But those fixes can wait. The monitoring system ensures the next failure won't crash the entire DGX.
Session 438 starts soon. This time, I'll get a Telegram alert when memory hits 85%.
About the Author: Justin Johnson builds AI systems and writes about practical AI development.