# Building a Production ML Workspace: Part 3 - Experiment Tracking and Reproducibility You've organized your workspace and built a documentation system. Your files are structured, your logs are comprehensive, and you're capturing daily progress. But there's still a critical problem: **Six months from now, you won't be able to reproduce your best experiment.** You'll remember "that one experiment with great results" but you won't recall the exact model version, hyperparameters, dataset preprocessing steps, or random seed that made it work. Without systematic experiment tracking, your ML work becomes a collection of unreproducible results. This article shows you how to build experiment tracking systems that ensure every experiment is reproducible, comparable, and builds toward accumulated knowledge. <div class="callout" data-callout="info"> <div class="callout-title">About This Series</div> <div class="callout-content"> This is Part 3 of a 5-part series on building production ML workspaces. Previous parts: - [[building-production-ml-workspace-part-1-structure|Part 1: Workspace Structure]] - [[building-production-ml-workspace-part-2-documentation|Part 2: Documentation Systems]] Coming next: - **Part 4**: Building Production-Ready AI Agents - **Part 5**: Ollama Model Management and Workflow Integration </div> </div> --- ## The Reproducibility Crisis ML practitioners face a reproducibility problem that costs time and credibility: **The Problem:** - "I got 94% accuracy last month, but I can't reproduce it" - "Which dataset version did I use for that experiment?" - "What were the hyperparameters that worked so well?" - "Why did this experiment fail when the previous one succeeded?" **The Cost:** - Wasted GPU time re-running experiments - Lost insights from unreproducible results - Inability to build on past successes - Difficulty comparing experiments fairly - No clear path from prototype to production **The Root Cause:** - Missing configuration files - Undocumented preprocessing steps - Unknown model versions - Missing random seeds - Incomplete environment specifications - No experiment metadata --- ## The Experiment Tracking System Our solution: A comprehensive experiment template and tracking system that captures everything needed for reproducibility. ### System Components ``` Experiment Tracking System ├── Experiment Template # Standardized structure ├── Configuration Management # All settings in version control ├── Progress Tracking # Real-time status updates ├── Results Capture # Standardized outputs ├── Comparison Tools # Cross-experiment analysis └── Lifecycle Management # Active → Completed → Archived ``` Each component ensures experiments are reproducible, comparable, and build toward knowledge accumulation. 
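One concrete example before walking through the template: of the root causes listed earlier, missing random seeds is the easiest to eliminate in code. Below is a minimal sketch that pins every random number generator in one place; the `seed_utils.py` filename and `set_seed` helper are illustrative, not part of the template that follows.

```python
# seed_utils.py - illustrative sketch: pin every RNG an experiment touches.
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed Python's random module, NumPy, and (if installed) PyTorch."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional; skipped if the experiment doesn't use PyTorch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


if __name__ == "__main__":
    set_seed(42)  # keep this value in the experiment's config so it is recorded
```

Calling a helper like this at the top of `train.py` and `evaluate.py`, with the seed value taken from the experiment's configuration file, closes one entire class of irreproducibility.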
--- ## The Complete Experiment Template Every experiment gets its own directory created from this template: ```bash experiments/active/YYYYMMDD-HHMMSS-experiment-name/ ├── README.md # Experiment documentation ├── config.yaml # All configuration ├── environment.yaml # Conda/pip dependencies ├── data/ # Experiment-specific data │ ├── inputs/ # Input data (symlinks) │ └── outputs/ # Generated data ├── src/ # Experiment code │ ├── prepare_data.py # Data preparation │ ├── train.py # Training code │ ├── evaluate.py # Evaluation code │ └── utils.py # Helper functions ├── results/ # All outputs │ ├── metrics.json # Quantitative metrics │ ├── plots/ # Visualizations │ ├── models/ # Saved models │ └── predictions/ # Predictions for analysis ├── logs/ # Execution logs │ ├── training.log # Training output │ └── evaluation.log # Evaluation output ├── notebooks/ # Analysis notebooks │ └── analysis.ipynb # Results analysis └── PROGRESS.md # Real-time progress tracking ``` --- ## The Experiment README Template Location: `templates/experiment/README.md` ```markdown # Experiment: [Descriptive Name] **Status:** 🔄 Active | ✅ Completed | ⚠️ Failed | 📦 Archived **Created:** YYYY-MM-DD HH:MM:SS **Last Updated:** YYYY-MM-DD HH:MM:SS --- ## Quick Summary **Objective:** [One sentence describing what you're trying to accomplish] **Hypothesis:** [What you expect to happen and why] **Result:** [Final outcome - fill in when complete] **Key Finding:** [Most important insight - fill in when complete] --- ## Experiment Details ### Goal [Detailed description of what you're trying to achieve and why it matters] ### Research Questions 1. [Question 1] 2. [Question 2] 3. [Question 3] ### Success Criteria - [ ] Metric 1: Target value - [ ] Metric 2: Target value - [ ] Qualitative criterion --- ## Methodology ### Model - **Base Model:** [e.g., llama3.1:8b] - **Model Version:** [specific version or commit] - **Custom Modifications:** [any changes to base model] ### Dataset - **Name:** [dataset name and version] - **Size:** [number of samples] - **Location:** `data/inputs/[dataset-name]` - **Preprocessing:** [description of preprocessing steps] - **Splits:** Train (X%), Val (Y%), Test (Z%) ### Configuration - **Configuration File:** `config.yaml` - **Key Hyperparameters:** - Learning rate: X - Batch size: Y - Epochs: Z - Temperature: T - Other important parameters ### Environment - **GPU:** [e.g., NVIDIA RTX 4090] - **CUDA Version:** [version] - **Python Version:** [version] - **Dependencies:** `environment.yaml` --- ## Execution ### Setup Commands ```bash # Create environment conda env create -f environment.yaml conda activate experiment-env # Prepare data python src/prepare_data.py # Verify setup python src/utils.py --verify ``` ### Training ```bash # Run training python src/train.py --config config.yaml # Monitor progress tail -f logs/training.log ``` ### Evaluation ```bash # Run evaluation python src/evaluate.py --checkpoint results/models/best_model.pt # View results python src/utils.py --summarize ``` --- ## Results ### Quantitative Metrics | Metric | Target | Achieved | Status | |--------|--------|----------|--------| | Accuracy | ≥85% | [fill in] | ⏳ | | F1 Score | ≥0.80 | [fill in] | ⏳ | | Inference Time | <100ms | [fill in] | ⏳ | | GPU Memory | <8GB | [fill in] | ⏳ | ### Qualitative Results [Description of qualitative findings] ### Visualizations - Loss curves: `results/plots/loss_curves.png` - Confusion matrix: `results/plots/confusion_matrix.png` - Sample predictions: `results/predictions/samples.txt` --- ## Analysis 
### What Worked Well 1. [Success 1] 2. [Success 2] ### What Didn't Work 1. [Issue 1 and why] 2. [Issue 2 and why] ### Surprises - [Unexpected finding 1] - [Unexpected finding 2] ### Key Insights 1. **Insight 1:** [Description and impact] 2. **Insight 2:** [Description and impact] --- ## Comparison with Previous Experiments | Experiment | Metric 1 | Metric 2 | Key Difference | |------------|----------|----------|----------------| | This one | [value] | [value] | [what changed] | | [Previous] | [value] | [value] | [baseline] | | [Another] | [value] | [value] | [comparison] | **Performance Change:** [Better/Worse/Similar] - [Explanation] --- ## Reproducibility Checklist - [ ] All code committed to version control - [ ] Configuration file complete and tested - [ ] Environment file captures all dependencies - [ ] Random seeds set and documented - [ ] Data preprocessing steps documented - [ ] Results files saved - [ ] Logs captured completely - [ ] README fully filled out --- ## Next Steps ### Immediate Actions 1. [Action 1] 2. [Action 2] ### Future Experiments - **Idea 1:** [Description and rationale] - **Idea 2:** [Description and rationale] ### Production Path [If results are good, what are the steps to production?] --- ## Resources ### Code - Training script: `src/train.py` - Evaluation script: `src/evaluate.py` - Configuration: `config.yaml` ### Data - Input data: `data/inputs/` - Outputs: `data/outputs/` ### Logs - Training log: `logs/training.log` - Evaluation log: `logs/evaluation.log` ### External References - [Paper/blog post 1] - [Documentation 1] --- ## Notes ### [YYYY-MM-DD] [Dated notes about decisions, observations, or issues encountered] ### [YYYY-MM-DD] [More dated notes] --- ## Metadata **Created By:** [Your name] **Related Experiments:** [Links to related experiment directories] **Tags:** [experiment-type, model-family, dataset-name] **Duration:** [Total time spent] **GPU Hours:** [Total GPU hours used] ``` --- ## Configuration Management All experiment configuration goes in `config.yaml`: ```yaml # config.yaml - Complete experiment configuration experiment: name: "sentiment-analysis-llama3" description: "Sentiment analysis on product reviews" created: "2024-10-19T14:30:00" random_seed: 42 model: name: "llama3.1:8b" version: "latest" temperature: 0.3 max_tokens: 100 top_p: 0.9 frequency_penalty: 0.0 presence_penalty: 0.0 data: dataset_name: "product_reviews" dataset_version: "v2.0" train_path: "data/inputs/train.csv" val_path: "data/inputs/val.csv" test_path: "data/inputs/test.csv" preprocessing: lowercase: true remove_stopwords: false max_length: 512 truncation: true training: batch_size: 32 epochs: 3 learning_rate: 0.0001 optimizer: "adam" weight_decay: 0.01 warmup_steps: 100 gradient_accumulation_steps: 1 max_grad_norm: 1.0 evaluation: metrics: - accuracy - f1_score - precision - recall save_predictions: true plot_confusion_matrix: true hardware: device: "cuda" gpu_memory_limit: 8GB mixed_precision: true logging: level: "INFO" log_interval: 10 # Log every N steps save_checkpoints: true checkpoint_interval: 500 # Save every N steps output: results_dir: "results/" plots_dir: "results/plots/" models_dir: "results/models/" predictions_dir: "results/predictions/" ``` **Why YAML for configuration?** - Human-readable and version-controllable - Easy to compare across experiments - Can be loaded directly in Python - Supports comments for documentation --- ## Progress Tracking with PROGRESS.md Track real-time experiment progress: ```markdown # Experiment Progress **Last Updated:** 
2024-10-19 15:45:00 --- ## Current Status 🔄 **Phase:** Training (Epoch 2/3) 📊 **Progress:** 67% Complete ⏱️ **Elapsed Time:** 1h 24m ⏰ **Estimated Remaining:** 42m --- ## Checklist ### Setup ✅ - [x] Environment created - [x] Dependencies installed - [x] Data downloaded and verified - [x] Configuration validated ### Data Preparation ✅ - [x] Data loaded - [x] Preprocessing applied - [x] Train/val/test splits created - [x] Data quality checks passed ### Training 🔄 - [x] Epoch 1/3 complete - Loss: 0.452, Val Loss: 0.389 - [x] Epoch 2/3 complete - Loss: 0.356, Val Loss: 0.334 - [ ] Epoch 3/3 - In progress - [ ] Final checkpoint saved ### Evaluation ⏳ - [ ] Load best checkpoint - [ ] Run evaluation on test set - [ ] Generate predictions - [ ] Create visualizations - [ ] Calculate all metrics ### Analysis ⏳ - [ ] Compare with baseline - [ ] Analyze errors - [ ] Document insights - [ ] Update README with results --- ## Timeline | Timestamp | Event | Details | |-----------|-------|---------| | 14:30:00 | Started | Environment setup | | 14:35:00 | Data loaded | 10,000 samples | | 14:40:00 | Training started | Epoch 1/3 | | 15:02:00 | Epoch 1 complete | Loss: 0.452 → 0.389 | | 15:24:00 | Epoch 2 complete | Loss: 0.356 → 0.334 | | 15:45:00 | Epoch 3 in progress | ~67% complete | --- ## Metrics (Real-time) ### Training Metrics - **Current Loss:** 0.312 - **Best Val Loss:** 0.334 (Epoch 2) - **Learning Rate:** 0.0001 - **GPU Utilization:** 92% - **Memory Usage:** 6.8GB / 8GB ### Observations - Loss decreasing steadily - No signs of overfitting yet - GPU utilization good - Memory usage within limits --- ## Issues Encountered ### Issue 1: Data Loading Slow - **Time:** 14:32:00 - **Problem:** Initial data loading took 5 minutes - **Solution:** Implemented data caching - **Impact:** Reduced to 30 seconds ### Issue 2: None yet --- ## Next Actions 1. ✅ Wait for Epoch 3 to complete (~42m) 2. ⏳ Run evaluation on test set 3. ⏳ Generate visualizations 4. ⏳ Compare with baseline experiment 5. ⏳ Update README with final results ``` **Update this file throughout the experiment** to track progress and capture issues as they happen. --- ## Creating New Experiments ### Automation Script Create `scripts/utilities/new_experiment.sh`: ```bash #!/bin/bash # Usage: ./scripts/utilities/new_experiment.sh "experiment-description" EXPERIMENT_NAME=$1 TIMESTAMP=$(date +%Y%m%d-%H%M%S) EXPERIMENT_DIR="experiments/active/${TIMESTAMP}-${EXPERIMENT_NAME}" TEMPLATE_DIR="templates/experiment" if [ -z "$EXPERIMENT_NAME" ]; then echo "Usage: $0 'experiment-description'" exit 1 fi echo "Creating new experiment: ${EXPERIMENT_DIR}" # Create directory structure mkdir -p "${EXPERIMENT_DIR}"/{data/{inputs,outputs},src,results/{plots,models,predictions},logs,notebooks} # Copy templates cp "${TEMPLATE_DIR}/README.md" "${EXPERIMENT_DIR}/" cp "${TEMPLATE_DIR}/config.yaml" "${EXPERIMENT_DIR}/" cp "${TEMPLATE_DIR}/environment.yaml" "${EXPERIMENT_DIR}/" cp "${TEMPLATE_DIR}/PROGRESS.md" "${EXPERIMENT_DIR}/" cp -r "${TEMPLATE_DIR}/src/"* "${EXPERIMENT_DIR}/src/" cp "${TEMPLATE_DIR}/notebooks/analysis.ipynb" "${EXPERIMENT_DIR}/notebooks/" # Update timestamps in README sed -i "s/YYYY-MM-DD HH:MM:SS/$(date '+%Y-%m-%d %H:%M:%S')/g" "${EXPERIMENT_DIR}/README.md" # Initialize git (optional) # cd "${EXPERIMENT_DIR}" && git init echo "✅ Experiment created: ${EXPERIMENT_DIR}" echo "" echo "Next steps:" echo "1. cd ${EXPERIMENT_DIR}" echo "2. Edit config.yaml with your configuration" echo "3. Edit README.md with experiment details" echo "4. 
Run: conda env create -f environment.yaml" echo "5. Start experimenting!" ``` Usage: ```bash ./scripts/utilities/new_experiment.sh "sentiment-analysis-llama3" ``` --- ## Experiment Comparison ### Results Comparison Tool Create `scripts/utilities/compare_experiments.py`: ```python #!/usr/bin/env python3 import json import sys from pathlib import Path from tabulate import tabulate def load_metrics(experiment_dir): """Load metrics from experiment""" metrics_file = Path(experiment_dir) / "results" / "metrics.json" if not metrics_file.exists(): return None with open(metrics_file) as f: return json.load(f) def compare_experiments(experiment_dirs): """Compare multiple experiments""" results = [] for exp_dir in experiment_dirs: exp_path = Path(exp_dir) exp_name = exp_path.name metrics = load_metrics(exp_path) if metrics: results.append({ 'Experiment': exp_name, 'Accuracy': f"{metrics.get('accuracy', 0):.2%}", 'F1 Score': f"{metrics.get('f1_score', 0):.3f}", 'Inference Time': f"{metrics.get('inference_time_ms', 0):.1f}ms", 'GPU Memory': f"{metrics.get('gpu_memory_gb', 0):.1f}GB", }) if results: print("\n" + "="*80) print("EXPERIMENT COMPARISON") print("="*80 + "\n") print(tabulate(results, headers='keys', tablefmt='grid')) print() else: print("No metrics found in specified experiments") if __name__ == "__main__": if len(sys.argv) < 2: print("Usage: compare_experiments.py <exp_dir1> <exp_dir2> ...") sys.exit(1) compare_experiments(sys.argv[1:]) ``` Usage: ```bash python scripts/utilities/compare_experiments.py \ experiments/completed/20241019-143000-sentiment-llama3/ \ experiments/completed/20241019-155000-sentiment-phi3/ ``` --- ## Lifecycle Management ### Moving Experiments Through States **Active → Completed:** ```bash # When experiment finishes successfully mv experiments/active/20241019-143000-experiment-name \ experiments/completed/ ``` **Completed → Archived:** ```bash # After 30 days or when no longer needed for reference mv experiments/completed/20241019-143000-experiment-name \ experiments/archived/ ``` ### Automated Archival Script Create `scripts/automation/archive_old_experiments.sh`: ```bash #!/bin/bash # Archive experiments older than 30 days from completed/ COMPLETED_DIR="experiments/completed" ARCHIVED_DIR="experiments/archived" DAYS_OLD=30 echo "Archiving experiments older than ${DAYS_OLD} days..." find "${COMPLETED_DIR}" -maxdepth 1 -type d -mtime +${DAYS_OLD} | while read exp_dir; do if [ "$exp_dir" != "$COMPLETED_DIR" ]; then exp_name=$(basename "$exp_dir") echo "Archiving: ${exp_name}" mv "$exp_dir" "${ARCHIVED_DIR}/" fi done echo "✅ Archival complete" ``` Run monthly: ```bash ./scripts/automation/archive_old_experiments.sh ``` --- ## Experiment Registry Track all experiments in a central registry: ### Registry File: EXPERIMENTS.md Location: `~/workspace/EXPERIMENTS.md` ```markdown # Experiment Registry Central tracking of all ML experiments. --- ## Active Experiments | Started | Name | Objective | Status | Notes | |---------|------|-----------|--------|-------| | 2024-10-19 | sentiment-llama3 | Sentiment analysis with Llama 3.1 | 🔄 Training | Epoch 2/3 | --- ## Completed Experiments (Last 30 Days) | Completed | Name | Result | Best Metric | Link | |-----------|------|--------|-------------|------| | 2024-10-18 | baseline-sentiment | Success | 82% accuracy | [[experiments/completed/20241018-090000-baseline-sentiment]] | --- ## Top Performers ### Sentiment Analysis 1. **sentiment-llama3** - 91% accuracy (2024-10-19) 2. 
**baseline-sentiment** - 82% accuracy (2024-10-18) ### Code Generation [To be filled] --- ## Insights Database ### Temperature Settings - **Finding:** Lower temperature (0.3) improves consistency for classification - **Evidence:** sentiment-llama3 (0.3 temp) vs baseline-sentiment (0.7 temp) - **Applicability:** Classification tasks ### Batch Size - **Finding:** Batch size 32 optimal for RTX 4090 with 8B models - **Evidence:** Multiple experiments - **Applicability:** Similar GPU and model size --- ``` Update this after every experiment completes. --- ## Best Practices ### Before Starting an Experiment 1. **Create from template** - Use the automation script 2. **Fill out README** - Complete objective, hypothesis, success criteria 3. **Configure completely** - All settings in config.yaml 4. **Set random seeds** - For reproducibility 5. **Document environment** - Complete environment.yaml ### During the Experiment 1. **Update PROGRESS.md** - Track real-time progress 2. **Monitor logs** - Watch for issues early 3. **Capture observations** - Note surprises in README 4. **Save checkpoints** - Regular model checkpoints 5. **Track metrics** - Log to metrics.json ### After the Experiment 1. **Complete README** - Fill in all results sections 2. **Run comparison** - Compare with previous experiments 3. **Extract insights** - What did you learn? 4. **Update registry** - Add to EXPERIMENTS.md 5. **Move to completed** - Clear active/ directory 6. **Create session log** - If significant (Part 2) --- ## Integration with Documentation System The experiment system integrates with the three-tier documentation from Part 2: **Tier 1 (Execution Logs):** - Stored in `experiments/.../logs/` - Detailed technical output - For debugging **Tier 2 (Activity Log):** ```markdown ## 2024-10-19 ### Experiments - Started: 20241019-143000-sentiment-llama3 - Goal: Improve sentiment accuracy with Llama 3.1 - Status: Training (Epoch 2/3) - Current: 89% validation accuracy ``` **Tier 3 (Session Log):** - Created when experiment completes - Links to experiment directory - Provides narrative and insights --- ## Common Pitfalls **Pitfall 1: Incomplete configuration** - **Problem:** Missing hyperparameters make reproduction impossible - **Solution:** Use config.yaml template, fill completely **Pitfall 2: No random seeds** - **Problem:** Can't reproduce exact results - **Solution:** Always set and document random seeds **Pitfall 3: Undocumented preprocessing** - **Problem:** Different preprocessing gives different results - **Solution:** Document every preprocessing step in README and code **Pitfall 4: Lost model versions** - **Problem:** "Which model version did I use?" - **Solution:** Record exact model versions in config.yaml **Pitfall 5: No comparison** - **Problem:** Don't know if experiment improved things - **Solution:** Always compare with baseline or previous experiments --- ## What's Next You now have systematic experiment tracking that ensures reproducibility and enables knowledge accumulation. Your experiments have standardized structures, comprehensive documentation, and clear lifecycle management. In **Part 4: Building Production-Ready AI Agents**, we'll apply these same principles to agent development: - Agent templates and scaffolding - Tool integration patterns - Testing and evaluation frameworks - Production deployment readiness - Agent-specific monitoring We'll create agents that are as well-structured and reproducible as your experiments. 
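Before the takeaways, two short sketches show how pieces of the system above connect in code. First, the configuration section claimed that `config.yaml` can be loaded directly in Python. Here is a minimal sketch, assuming PyYAML is installed (`pip install pyyaml`); the `load_config` helper is an illustrative name rather than part of the template.

```python
# load_config.py - illustrative sketch; assumes PyYAML is installed.
import yaml


def load_config(path: str = "config.yaml") -> dict:
    """Load the experiment configuration into a plain dictionary."""
    with open(path) as f:
        return yaml.safe_load(f)


if __name__ == "__main__":
    config = load_config()
    # Nested YAML sections become nested dicts, so values read exactly as
    # they appear in the config.yaml shown earlier.
    print(config["experiment"]["random_seed"])  # 42
    print(config["model"]["temperature"])       # 0.3
    print(config["training"]["batch_size"])     # 32
```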
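Second, `compare_experiments.py` assumes every run writes a `results/metrics.json` containing keys such as `accuracy`, `f1_score`, `inference_time_ms`, and `gpu_memory_gb`. Here is a minimal sketch of the write side, as it might appear at the end of `evaluate.py`; the `save_metrics` helper is an assumption, and the metric values are placeholders standing in for whatever your evaluation actually computes.

```python
# save_metrics.py - illustrative sketch of writing results/metrics.json
# in the shape that compare_experiments.py reads.
import json
from pathlib import Path


def save_metrics(metrics: dict, results_dir: str = "results") -> Path:
    """Write metrics where the comparison tool expects to find them."""
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "metrics.json"
    out_file.write_text(json.dumps(metrics, indent=2))
    return out_file


if __name__ == "__main__":
    save_metrics({
        "accuracy": 0.91,           # placeholder values only
        "f1_score": 0.88,
        "inference_time_ms": 84.0,
        "gpu_memory_gb": 6.8,
    })
```

As long as both sides agree on this small schema, any two experiments can be dropped into the comparison table above.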
--- ## Key Takeaways - **Reproducibility requires comprehensive tracking** of configuration, environment, and methodology - **Standardized templates** ensure nothing gets forgotten - **Progress tracking** captures real-time state for debugging - **Experiment comparison** reveals what works consistently - **Lifecycle management** keeps workspace organized - **Integration with documentation** creates complete knowledge capture --- ## Resources **Templates:** - Experiment README template (provided above) - Configuration template (config.yaml) - Progress tracking template (PROGRESS.md) **Scripts:** - `new_experiment.sh` - Create experiments from template - `compare_experiments.py` - Compare experiment results - `archive_old_experiments.sh` - Lifecycle management **Tools:** - Python logging module - YAML for configuration - JSON for metrics storage - Git for version control --- ## Series Navigation - **Previous:** [[building-production-ml-workspace-part-2-documentation|Part 2: Documentation Systems]] - **Next:** [[building-production-ml-workspace-part-4-agents|Part 4: Building Production-Ready AI Agents]] - **Series Home:** Building a Production ML Workspace on GPU Infrastructure --- **Questions or suggestions?** Find me on Twitter [@bioinfo](https://twitter.com/bioinfo) or at [rundatarun.io](https://rundatarun.io) --- ### Related Articles - [[building-production-ml-workspace-part-2-documentation|Building a Production ML Workspace: Part 2 - Documentation Systems That Scale]] - [[building-production-ml-workspace-part-1-structure|Building a Production ML Workspace: Part 1 - Designing an Organized Structure]] - [[building-production-ml-workspace-part-5-collaboration|Building a Production ML Workspace: Part 5 - Team Collaboration and Workflow Integration]] --- <p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p> <p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>