when-aria-crashed-the-dgx - AIXplore

# When ARIA Crashed the DGX: Building GPU Monitoring in 5 Minutes ## The Crash 6:51 AM. ARIA experiment hung. 10:30 AM. System crashed. The autonomous research system I'd been running for 437 sessions tried to load a 650M parameter genomics model with 1Mb context and process 100 test sequences. The math was simple. Embedding memory per sequence: 2.60 GB. With 100 sequences: Peak memory = 260.02 GB. GPU available: 128.5 GB. 260 GB > 128 GB = Out-of-Memory crash that froze the entire system. SSH unresponsive. Hard reboot required. The experiment code had a memory leak. It created `species_ids` tensors inside the loop for every sequence instead of reusing them. No batching. No cleanup. Just accumulating tensors until the GPU choked. I could have spent hours debugging the experiment code. Instead, I spent 5 minutes building a monitoring system that would catch this before it happens again. ## The GB10 Architecture Challenge The NVIDIA DGX Spark is built around the GB10 Grace Blackwell Superchip (Grace CPU + Blackwell GPU). The platform uses unified memory architecture instead of dedicated GPU framebuffer memory. ```bash $ nvidia-smi +-------------------------------------------------------------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | |=========================================================================| | 0 NVIDIA GB10 On | 00000000:91:00.0 Off | | 32% 31C P2 77W / 550W | [N/A] | +-------------------------------------------------------------------------+ Memory-Usage | Not Supported ``` `Not Supported`. That's the documented behavior for `nvidia-smi` memory queries on GB10's unified memory architecture. The compute works perfectly (CUDA 13.0, PyTorch 2.9.0). Temperature works. GPU utilization works. But the traditional framebuffer-based memory reporting doesn't apply here. This creates challenges for standard monitoring tools that expect dedicated framebuffer memory: - Prometheus + DCGM: Uses NVML memory APIs (not available on unified memory) - Netdata: Uses nvidia-smi memory queries (returns "Not Supported") - Telegraf: Uses nvidia-smi memory queries (returns "Not Supported") - Datadog, New Relic: Use DCGM (also affected, plus $180/month) NVML (NVIDIA Management Library) is the standard API for GPU monitoring. On GB10's unified memory architecture, NVML memory info APIs return "Not Supported" by design. This isn't a driver bug. It's a fundamental architectural difference. The ecosystem tooling is still adapting to unified memory platforms. Some tools crash when memory APIs return unexpected values. Others simply show incomplete data. I needed monitoring that works with GB10's architecture, not against it. ## What I Built in 5 Minutes Three layers of defense against memory crashes: **Layer 1: systemd-oomd** (already installed) Linux's built-in OOM prevention. Uses PSI (Pressure Stall Information) to detect thrashing before the crash. When memory pressure hits 50%, it kills the offending process. System stays responsive. **Layer 2: Custom GPU monitor** (the 5-minute build) Python script that monitors GPU VRAM via PyTorch (not NVML), system RAM via `/proc/meminfo`, and memory pressure via PSI. Sends Telegram alerts when thresholds exceeded. **Layer 3: Netdata** (already running) Real-time dashboard for historical metrics and visualization. The critical piece was Layer 2. PyTorch uses the CUDA Runtime API, not NVML. It works perfectly on GB10. ## The Code ```python import torch def get_gpu_memory(): """Get GPU memory via PyTorch (works on GB10)""" if not torch.cuda.is_available(): return None allocated = torch.cuda.memory_allocated(0) / 1e9 reserved = torch.cuda.memory_reserved(0) / 1e9 total = torch.cuda.get_device_properties(0).total_memory / 1e9 return { "used_gb": reserved, "total_gb": total, "used_pct": (reserved / total) * 100 } ``` This works. Accurate VRAM usage even though nvidia-smi shows `[N/A]`. For system memory and PSI: ```python def get_memory_info(): """Get system memory from /proc/meminfo""" info = {} with open("/proc/meminfo") as f: for line in f: parts = line.split() if len(parts) >= 2: key = parts[0].rstrip(":") info[key] = int(parts[1]) * 1024 # KB to bytes total = info.get("MemTotal", 0) / 1e9 available = info.get("MemAvailable", 0) / 1e9 used = total - available return { "used_gb": used, "total_gb": total, "used_pct": (used / total * 100) if total > 0 else 0 } def get_psi_pressure(): """Read PSI from /proc/pressure/memory""" with open("/proc/pressure/memory") as f: lines = f.readlines() some_line = [l for l in lines if l.startswith("some")][0] some_avg10 = float(some_line.split()[1].split("=")[1]) full_line = [l for l in lines if l.startswith("full")][0] full_avg10 = float(full_line.split()[1].split("=")[1]) return {"some_avg10": some_avg10, "full_avg10": full_avg10} ``` PSI (Pressure Stall Information) is the secret weapon. It's been in the Linux kernel since 4.20 (2018), but most people don't use it. ```bash $ cat /proc/pressure/memory some avg10=0.00 avg60=0.00 avg300=0.00 total=45967 full avg10=0.00 avg60=0.00 avg300=0.00 total=45766 ``` When `some avg10` hits 50%, it means half the time in the last 10 seconds, tasks were waiting for memory. That's thrashing. That's when you get alerted, while the system is still responsive. The threshold check: ```python def check_thresholds(metrics): """Check if any thresholds exceeded""" alerts = [] if metrics.gpu_used_pct > 85.0: alerts.append(f"GPU memory at {metrics.gpu_used_pct:.1f}%") if metrics.memory_used_pct > 80.0: alerts.append(f"System RAM at {metrics.memory_used_pct:.1f}%") if metrics.psi_some_avg10 > 50.0: alerts.append(f"Memory pressure at {metrics.psi_some_avg10:.1f}% - System is thrashing!") return alerts ``` Telegram integration: ```python def send_telegram_alert(alerts, metrics): """Send alert via Telegram bot""" message = f"""⚠️ *Resource Alert* GPU: {metrics.gpu_used_pct:.1f}% ({metrics.gpu_used_gb:.1f}/{metrics.gpu_total_gb:.1f} GB) RAM: {metrics.memory_used_pct:.1f}% ({metrics.memory_used_gb:.1f}/{metrics.memory_total_gb:.1f} GB) PSI: {metrics.psi_some_avg10:.1f}% *Issues:* """ for alert in alerts: message += f"• {alert}\n" subprocess.run([notify_script, message]) ``` The complete script is 200 lines. Deployed via cron (every 5 minutes). Logs to `~/logs/dgx-monitor.log`. > [!tip] Cron vs Systemd Service > I chose cron over a systemd service. Simpler deployment, lower resource usage, and 5-minute checks are sufficient for catching problems before they become crashes. ## Telegram Bot Setup (2 minutes) Creating the bot was the easiest part: 1. Message @BotFather on Telegram: `/newbot` 2. Name: DGX Alerts 3. Get bot token 4. Start chat with your bot, send any message 5. Visit `https://api.telegram.org/bot<TOKEN>/getUpdates` to get chat ID 6. Save credentials: ```bash cat > ~/.telegram_config <<EOF TELEGRAM_BOT_TOKEN="your-token-here" TELEGRAM_CHAT_ID="your-chat-id" EOF chmod 600 ~/.telegram_config ``` Instant alerts to my phone. Better than email (don't check constantly), desktop notifications (server has no GUI), or Slack (don't use personally). ## systemd-oomd Configuration (1 minute) Ubuntu ships systemd-oomd by default since 22.04. Most people don't configure it. ```ini # /etc/systemd/oomd.conf.d/dgx.conf [OOM] SwapUsedLimit=90% DefaultMemoryPressureLimit=50% DefaultMemoryPressureDurationSec=20s ``` This means: - If swap usage > 90%, kill offending process - If memory pressure > 50% for 20 seconds, kill offending process - Act fast (20 seconds, not the default 30) systemd-oomd prevents total system freeze. If a process is thrashing, it gets killed before SSH becomes unresponsive. You can still log in and investigate instead of hard rebooting. It's already saved the system twice since I configured it. ## Testing Validation with synthetic load: ```python # Allocate ~1.5GB GPU memory import torch x = torch.randn(20000, 20000, device='cuda') # Monitor catches it ~/workspace/infrastructure/scripts/dgx-monitor.py ``` Result: ``` 🖥️ DGX Spark [11:15:23] ⚠️ Resource Alert GPU: 87.3% (111.8/128.0 GB) RAM: 45.2% (57.8/128.0 GB) PSI: 0.0% Issues: • GPU memory at 87.3% [threshold: 85%] ``` ![DGX Monitoring Alert](../assets/dgx-monitoring-alert.jpg) *Real Telegram alert showing GPU memory crossing the 85% threshold. System detected high memory usage and sent instant notification before crash could occur.* Alert arrived on my phone instantly. System still responsive. Training job could continue or be gracefully stopped. ## Why DIY Beats Enterprise Here **Standard solution (Prometheus + Grafana + DCGM):** - Setup time: 4-8 hours - Maintenance: 1-2 hours/month - Resources: 1-2GB RAM, 5-10% CPU (3 services) - GB10 unified memory: NVML memory APIs not available - Cost: Free (but complex) **DIY solution:** - Setup time: 5 minutes (one-time) - Maintenance: <30 min/month (review logs) - Resources: <250MB RAM, <3% CPU - GB10 unified memory: PyTorch CUDA Runtime API works - Cost: $0 Enterprise monitoring is designed for multi-node clusters with SRE teams and traditional GPU architectures. For a single homelab node with unified memory architecture, a 200-line Python script is better. Easier to understand, faster to debug, zero cost, tailored to the platform's actual capabilities. > [!tip] When Enterprise Tools Make Sense > For traditional PCIe GPUs with dedicated VRAM, Prometheus + DCGM is the right choice. It's battle-tested and scales beautifully. For unified memory platforms like GB10, you need monitoring that works with the architecture rather than expecting framebuffer-style telemetry. ## Performance Impact Measured overhead: **CPU Usage:** - Before monitoring: 2-3% idle, 85-92% training - After monitoring: No measurable change - systemd-oomd: <0.1% CPU - dgx-monitor (cron): <0.1% CPU (runs 5 seconds every 5 minutes) **Memory:** - Before: 8-10GB idle - After: +200MB (systemd-oomd 10MB, dgx-monitor 50MB, Netdata 150MB) **Disk I/O:** - Log growth: ~6MB/day - Negligible for 1.92TB NVMe **Network:** - Telegram API: <5KB/day - Negligible Zero impact on training performance. ## What I Learned **1. Unified memory architectures need different approaches** The GB10 Grace Blackwell Superchip uses unified memory instead of dedicated GPU framebuffer memory. This architectural choice affects how monitoring works. Standard NVML-based tools expect traditional memory reporting. Unified memory platforms need different monitoring approaches. **2. PyTorch as a monitoring tool** We think of PyTorch as a training framework. It's also a reliable way to query GPU state via CUDA Runtime API. When NVML fails, PyTorch succeeds. **3. PSI is underutilized** Pressure Stall Information catches thrashing before the system freezes. It's been in the kernel since 2018, barely anyone uses it. Game-changer for memory monitoring. **4. systemd-oomd saves systems** Configure it properly (50% threshold, 20 second delay). It's already killed runaway processes twice before total freeze. Don't ignore it. **5. Telegram for server alerts is excellent** Setup takes 5 minutes. API is dead simple. Notifications are instant and free. Better than email (slow), SMS (costs money), or Slack (enterprise overkill). **6. The 5-minute rule** When a system crashes, you have two choices: spend hours debugging the root cause, or spend 5 minutes building safeguards so it can't happen again. I chose safeguards. The ARIA experiment still needs fixing, but at least I'll get a warning before the next crash. ## The Files Complete implementation: ``` ~/workspace/infrastructure/ ├── scripts/ │ ├── dgx-monitor.py # Main monitoring script (200 lines) │ ├── notify # Telegram notification wrapper │ └── dgx-monitor.cron # Cron job template ├── docs/ │ └── MONITORING.md # Full documentation └── logs/ └── dgx-monitor.log # Alert history ``` Cron configuration: ```bash # /etc/cron.d/dgx-monitor */5 * * * * username ~/workspace/infrastructure/scripts/dgx-monitor.py >> ~/logs/dgx-monitor-cron.log 2>&1 ``` systemd-oomd configuration: ```bash # /etc/systemd/oomd.conf.d/dgx.conf [OOM] SwapUsedLimit=90% DefaultMemoryPressureLimit=50% DefaultMemoryPressureDurationSec=20s ``` ## What's Next The monitoring system works. Now I need to fix the actual ARIA experiment code: 1. **Move `species_ids` outside the loop** (create once, reuse) 2. **Add batching** (process multiple sequences together) 3. **Reduce context size** (1Mb is excessive, use realistic lengths) 4. **Add memory monitoring in the experiment loop** (clear cache periodically) 5. **Enable quick mode by default** (validate with small datasets first) But those fixes can wait. The monitoring system ensures the next failure won't crash the entire DGX. Session 438 starts soon. This time, I'll get a Telegram alert when memory hits 85%. --- ### Related Articles - [[Practical Applications/debugging-claude-code-with-claude|Debugging Claude Code with Claude: A Meta-Optimization Journey]] - [[Practical Applications/building-ai-research-night-shift|Building an AI Research Night Shift]] - [[Practical Applications/syncing-claude-code-configs-across-machines|Syncing Claude Code Configs Across Machines]] --- About the Author: Justin Johnson builds AI systems and writes about practical AI development. <a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a>