# Building a Production ML Workspace: Part 3 - Experiment Tracking and Reproducibility
You've organized your workspace and built a documentation system. Your files are structured, your logs are comprehensive, and you're capturing daily progress. But there's still a critical problem:
**Six months from now, you won't be able to reproduce your best experiment.**
You'll remember "that one experiment with great results" but you won't recall the exact model version, hyperparameters, dataset preprocessing steps, or random seed that made it work. Without systematic experiment tracking, your ML work becomes a collection of unreproducible results.
This article shows you how to build experiment tracking systems that ensure every experiment is reproducible, comparable, and builds toward accumulated knowledge.
<div class="callout" data-callout="info">
<div class="callout-title">About This Series</div>
<div class="callout-content">
This is Part 3 of a 5-part series on building production ML workspaces. Previous parts:
- [[building-production-ml-workspace-part-1-structure|Part 1: Workspace Structure]]
- [[building-production-ml-workspace-part-2-documentation|Part 2: Documentation Systems]]
Coming next:
- **Part 4**: Building Production-Ready AI Agents
- **Part 5**: Ollama Model Management and Workflow Integration
</div>
</div>
---
## The Reproducibility Crisis
ML practitioners face a reproducibility problem that costs time and credibility:
**The Problem:**
- "I got 94% accuracy last month, but I can't reproduce it"
- "Which dataset version did I use for that experiment?"
- "What were the hyperparameters that worked so well?"
- "Why did this experiment fail when the previous one succeeded?"
**The Cost:**
- Wasted GPU time re-running experiments
- Lost insights from unreproducible results
- Inability to build on past successes
- Difficulty comparing experiments fairly
- No clear path from prototype to production
**The Root Cause:**
- Missing configuration files
- Undocumented preprocessing steps
- Unknown model versions
- Missing random seeds
- Incomplete environment specifications
- No experiment metadata
---
## The Experiment Tracking System
Our solution: A comprehensive experiment template and tracking system that captures everything needed for reproducibility.
### System Components
```
Experiment Tracking System
├── Experiment Template # Standardized structure
├── Configuration Management # All settings in version control
├── Progress Tracking # Real-time status updates
├── Results Capture # Standardized outputs
├── Comparison Tools # Cross-experiment analysis
└── Lifecycle Management # Active → Completed → Archived
```
Each component helps keep experiments reproducible and comparable, and ensures every run builds toward accumulated knowledge.
---
## The Complete Experiment Template
Every experiment gets its own directory created from this template:
```bash
experiments/active/YYYYMMDD-HHMMSS-experiment-name/
├── README.md # Experiment documentation
├── config.yaml # All configuration
├── environment.yaml # Conda/pip dependencies
├── data/ # Experiment-specific data
│ ├── inputs/ # Input data (symlinks)
│ └── outputs/ # Generated data
├── src/ # Experiment code
│ ├── prepare_data.py # Data preparation
│ ├── train.py # Training code
│ ├── evaluate.py # Evaluation code
│ └── utils.py # Helper functions
├── results/ # All outputs
│ ├── metrics.json # Quantitative metrics
│ ├── plots/ # Visualizations
│ ├── models/ # Saved models
│ └── predictions/ # Predictions for analysis
├── logs/ # Execution logs
│ ├── training.log # Training output
│ └── evaluation.log # Evaluation output
├── notebooks/ # Analysis notebooks
│ └── analysis.ipynb # Results analysis
└── PROGRESS.md # Real-time progress tracking
```
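The `data/inputs/` directory holds symlinks rather than copies, so large shared datasets live in one place and each experiment simply points at the version it used. A minimal sketch of creating such a link, assuming a shared dataset directory somewhere in the workspace (the paths in the example are illustrative, not part of the template):

```python
#!/usr/bin/env python3
"""Link a shared dataset into an experiment's data/inputs/ without copying it."""
from pathlib import Path

def link_dataset(experiment_dir: str, shared_dataset: str) -> Path:
    """Create data/inputs/<dataset-name> as a symlink to the shared copy."""
    inputs_dir = Path(experiment_dir) / "data" / "inputs"
    inputs_dir.mkdir(parents=True, exist_ok=True)
    target = Path(shared_dataset).expanduser().resolve()
    link = inputs_dir / target.name
    if not link.exists() and not link.is_symlink():
        link.symlink_to(target)
    return link

# Example (illustrative paths):
# link_dataset("experiments/active/20241019-143000-sentiment-llama3",
#              "~/workspace/datasets/product_reviews/v2.0")
```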
---
## The Experiment README Template
Location: `templates/experiment/README.md`
````markdown
# Experiment: [Descriptive Name]
**Status:** 🔄 Active | ✅ Completed | ⚠️ Failed | 📦 Archived
**Created:** YYYY-MM-DD HH:MM:SS
**Last Updated:** YYYY-MM-DD HH:MM:SS
---
## Quick Summary
**Objective:** [One sentence describing what you're trying to accomplish]
**Hypothesis:** [What you expect to happen and why]
**Result:** [Final outcome - fill in when complete]
**Key Finding:** [Most important insight - fill in when complete]
---
## Experiment Details
### Goal
[Detailed description of what you're trying to achieve and why it matters]
### Research Questions
1. [Question 1]
2. [Question 2]
3. [Question 3]
### Success Criteria
- [ ] Metric 1: Target value
- [ ] Metric 2: Target value
- [ ] Qualitative criterion
---
## Methodology
### Model
- **Base Model:** [e.g., llama3.1:8b]
- **Model Version:** [specific version or commit]
- **Custom Modifications:** [any changes to base model]
### Dataset
- **Name:** [dataset name and version]
- **Size:** [number of samples]
- **Location:** `data/inputs/[dataset-name]`
- **Preprocessing:** [description of preprocessing steps]
- **Splits:** Train (X%), Val (Y%), Test (Z%)
### Configuration
- **Configuration File:** `config.yaml`
- **Key Hyperparameters:**
  - Learning rate: X
  - Batch size: Y
  - Epochs: Z
  - Temperature: T
  - Other important parameters
### Environment
- **GPU:** [e.g., NVIDIA RTX 4090]
- **CUDA Version:** [version]
- **Python Version:** [version]
- **Dependencies:** `environment.yaml`
---
## Execution
### Setup Commands
```bash
# Create environment
conda env create -f environment.yaml
conda activate experiment-env
# Prepare data
python src/prepare_data.py
# Verify setup
python src/utils.py --verify
```
### Training
```bash
# Run training
python src/train.py --config config.yaml
# Monitor progress
tail -f logs/training.log
```
### Evaluation
```bash
# Run evaluation
python src/evaluate.py --checkpoint results/models/best_model.pt
# View results
python src/utils.py --summarize
```
---
## Results
### Quantitative Metrics
| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| Accuracy | ≥85% | [fill in] | ⏳ |
| F1 Score | ≥0.80 | [fill in] | ⏳ |
| Inference Time | <100ms | [fill in] | ⏳ |
| GPU Memory | <8GB | [fill in] | ⏳ |
### Qualitative Results
[Description of qualitative findings]
### Visualizations
- Loss curves: `results/plots/loss_curves.png`
- Confusion matrix: `results/plots/confusion_matrix.png`
- Sample predictions: `results/predictions/samples.txt`
---
## Analysis
### What Worked Well
1. [Success 1]
2. [Success 2]
### What Didn't Work
1. [Issue 1 and why]
2. [Issue 2 and why]
### Surprises
- [Unexpected finding 1]
- [Unexpected finding 2]
### Key Insights
1. **Insight 1:** [Description and impact]
2. **Insight 2:** [Description and impact]
---
## Comparison with Previous Experiments
| Experiment | Metric 1 | Metric 2 | Key Difference |
|------------|----------|----------|----------------|
| This one | [value] | [value] | [what changed] |
| [Previous] | [value] | [value] | [baseline] |
| [Another] | [value] | [value] | [comparison] |
**Performance Change:** [Better/Worse/Similar] - [Explanation]
---
## Reproducibility Checklist
- [ ] All code committed to version control
- [ ] Configuration file complete and tested
- [ ] Environment file captures all dependencies
- [ ] Random seeds set and documented
- [ ] Data preprocessing steps documented
- [ ] Results files saved
- [ ] Logs captured completely
- [ ] README fully filled out
---
## Next Steps
### Immediate Actions
1. [Action 1]
2. [Action 2]
### Future Experiments
- **Idea 1:** [Description and rationale]
- **Idea 2:** [Description and rationale]
### Production Path
[If results are good, what are the steps to production?]
---
## Resources
### Code
- Training script: `src/train.py`
- Evaluation script: `src/evaluate.py`
- Configuration: `config.yaml`
### Data
- Input data: `data/inputs/`
- Outputs: `data/outputs/`
### Logs
- Training log: `logs/training.log`
- Evaluation log: `logs/evaluation.log`
### External References
- [Paper/blog post 1]
- [Documentation 1]
---
## Notes
### [YYYY-MM-DD]
[Dated notes about decisions, observations, or issues encountered]
### [YYYY-MM-DD]
[More dated notes]
---
## Metadata
**Created By:** [Your name]
**Related Experiments:** [Links to related experiment directories]
**Tags:** [experiment-type, model-family, dataset-name]
**Duration:** [Total time spent]
**GPU Hours:** [Total GPU hours used]
````
---
## Configuration Management
All experiment configuration goes in `config.yaml`:
```yaml
# config.yaml - Complete experiment configuration
experiment:
  name: "sentiment-analysis-llama3"
  description: "Sentiment analysis on product reviews"
  created: "2024-10-19T14:30:00"
  random_seed: 42

model:
  name: "llama3.1:8b"
  version: "latest"  # pin a specific tag or digest for strict reproducibility
  temperature: 0.3
  max_tokens: 100
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0

data:
  dataset_name: "product_reviews"
  dataset_version: "v2.0"
  train_path: "data/inputs/train.csv"
  val_path: "data/inputs/val.csv"
  test_path: "data/inputs/test.csv"
  preprocessing:
    lowercase: true
    remove_stopwords: false
    max_length: 512
    truncation: true

training:
  batch_size: 32
  epochs: 3
  learning_rate: 0.0001
  optimizer: "adam"
  weight_decay: 0.01
  warmup_steps: 100
  gradient_accumulation_steps: 1
  max_grad_norm: 1.0

evaluation:
  metrics:
    - accuracy
    - f1_score
    - precision
    - recall
  save_predictions: true
  plot_confusion_matrix: true

hardware:
  device: "cuda"
  gpu_memory_limit: "8GB"
  mixed_precision: true

logging:
  level: "INFO"
  log_interval: 10          # Log every N steps
  save_checkpoints: true
  checkpoint_interval: 500  # Save every N steps

output:
  results_dir: "results/"
  plots_dir: "results/plots/"
  models_dir: "results/models/"
  predictions_dir: "results/predictions/"
```
**Why YAML for configuration?**
- Human-readable and version-controllable
- Easy to compare across experiments
- Can be loaded directly in Python (see the sketch below)
- Supports comments for documentation
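A minimal sketch of that loading step, assuming PyYAML is installed in the experiment environment (the `load_config` helper is illustrative, not part of the template):

```python
import yaml  # PyYAML, assumed to be listed in environment.yaml

def load_config(path: str = "config.yaml") -> dict:
    """Return the experiment configuration as a nested dict."""
    with open(path) as f:
        return yaml.safe_load(f)

config = load_config()
print(config["experiment"]["random_seed"])   # 42
print(config["training"]["learning_rate"])   # 0.0001
```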
---
## Progress Tracking with PROGRESS.md
Track real-time experiment progress:
```markdown
# Experiment Progress
**Last Updated:** 2024-10-19 15:45:00
---
## Current Status
🔄 **Phase:** Training (Epoch 2/3)
📊 **Progress:** 67% Complete
⏱️ **Elapsed Time:** 1h 24m
⏰ **Estimated Remaining:** 42m
---
## Checklist
### Setup ✅
- [x] Environment created
- [x] Dependencies installed
- [x] Data downloaded and verified
- [x] Configuration validated
### Data Preparation ✅
- [x] Data loaded
- [x] Preprocessing applied
- [x] Train/val/test splits created
- [x] Data quality checks passed
### Training 🔄
- [x] Epoch 1/3 complete - Loss: 0.452, Val Loss: 0.389
- [x] Epoch 2/3 complete - Loss: 0.356, Val Loss: 0.334
- [ ] Epoch 3/3 - In progress
- [ ] Final checkpoint saved
### Evaluation ⏳
- [ ] Load best checkpoint
- [ ] Run evaluation on test set
- [ ] Generate predictions
- [ ] Create visualizations
- [ ] Calculate all metrics
### Analysis ⏳
- [ ] Compare with baseline
- [ ] Analyze errors
- [ ] Document insights
- [ ] Update README with results
---
## Timeline
| Timestamp | Event | Details |
|-----------|-------|---------|
| 14:30:00 | Started | Environment setup |
| 14:35:00 | Data loaded | 10,000 samples |
| 14:40:00 | Training started | Epoch 1/3 |
| 15:02:00 | Epoch 1 complete | Loss: 0.452 → 0.389 |
| 15:24:00 | Epoch 2 complete | Loss: 0.356 → 0.334 |
| 15:45:00 | Epoch 3 in progress | ~67% complete |
---
## Metrics (Real-time)
### Training Metrics
- **Current Loss:** 0.312
- **Best Val Loss:** 0.334 (Epoch 2)
- **Learning Rate:** 0.0001
- **GPU Utilization:** 92%
- **Memory Usage:** 6.8GB / 8GB
### Observations
- Loss decreasing steadily
- No signs of overfitting yet
- GPU utilization good
- Memory usage within limits
---
## Issues Encountered
### Issue 1: Data Loading Slow
- **Time:** 14:32:00
- **Problem:** Initial data loading took 5 minutes
- **Solution:** Implemented data caching
- **Impact:** Reduced to 30 seconds
### Issue 2: None yet
---
## Next Actions
1. ✅ Wait for Epoch 3 to complete (~42m)
2. ⏳ Run evaluation on test set
3. ⏳ Generate visualizations
4. ⏳ Compare with baseline experiment
5. ⏳ Update README with final results
```
**Update this file throughout the experiment** to track progress and capture issues as they happen.
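Updating PROGRESS.md by hand works, but training code can also append timeline rows as milestones land. A minimal sketch against the template above (the `log_progress` helper and its in-place editing approach are assumptions, not part of the template):

```python
#!/usr/bin/env python3
"""Append a timeline row to PROGRESS.md and refresh its Last Updated stamp."""
from datetime import datetime
from pathlib import Path

def log_progress(progress_file: str, event: str, details: str) -> None:
    """Add '| HH:MM:SS | event | details |' to the end of the Timeline table."""
    path = Path(progress_file)
    lines = path.read_text().splitlines()
    now = datetime.now()
    row = f"| {now:%H:%M:%S} | {event} | {details} |"
    out, in_timeline, seen_rows, inserted = [], False, False, False
    for line in lines:
        if line.startswith("**Last Updated:**"):
            line = f"**Last Updated:** {now:%Y-%m-%d %H:%M:%S}"
        if line.startswith("## Timeline"):
            in_timeline = True
        elif in_timeline and line.startswith("|"):
            seen_rows = True
        elif in_timeline and seen_rows and not inserted:
            # First line after the Timeline table: the new row goes just above it.
            out.append(row)
            inserted = True
        out.append(line)
    if not inserted:
        out.append(row)  # Fallback: append at the end of the file
    path.write_text("\n".join(out) + "\n")

# Example:
# log_progress("PROGRESS.md", "Epoch 3 complete", "Loss: 0.301, Val Loss: 0.318")
```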
---
## Creating New Experiments
### Automation Script
Create `scripts/utilities/new_experiment.sh`:
```bash
#!/bin/bash
# Usage: ./scripts/utilities/new_experiment.sh "experiment-description"

EXPERIMENT_NAME="$1"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
EXPERIMENT_DIR="experiments/active/${TIMESTAMP}-${EXPERIMENT_NAME}"
TEMPLATE_DIR="templates/experiment"

if [ -z "$EXPERIMENT_NAME" ]; then
    echo "Usage: $0 'experiment-description'"
    exit 1
fi

echo "Creating new experiment: ${EXPERIMENT_DIR}"

# Create directory structure
mkdir -p "${EXPERIMENT_DIR}"/{data/{inputs,outputs},src,results/{plots,models,predictions},logs,notebooks}

# Copy templates
cp "${TEMPLATE_DIR}/README.md" "${EXPERIMENT_DIR}/"
cp "${TEMPLATE_DIR}/config.yaml" "${EXPERIMENT_DIR}/"
cp "${TEMPLATE_DIR}/environment.yaml" "${EXPERIMENT_DIR}/"
cp "${TEMPLATE_DIR}/PROGRESS.md" "${EXPERIMENT_DIR}/"
cp -r "${TEMPLATE_DIR}/src/"* "${EXPERIMENT_DIR}/src/"
cp "${TEMPLATE_DIR}/notebooks/analysis.ipynb" "${EXPERIMENT_DIR}/notebooks/"

# Update timestamps in README
sed -i "s/YYYY-MM-DD HH:MM:SS/$(date '+%Y-%m-%d %H:%M:%S')/g" "${EXPERIMENT_DIR}/README.md"

# Initialize git (optional)
# cd "${EXPERIMENT_DIR}" && git init

echo "✅ Experiment created: ${EXPERIMENT_DIR}"
echo ""
echo "Next steps:"
echo "1. cd ${EXPERIMENT_DIR}"
echo "2. Edit config.yaml with your configuration"
echo "3. Edit README.md with experiment details"
echo "4. Run: conda env create -f environment.yaml"
echo "5. Start experimenting!"
```
Usage:
```bash
./scripts/utilities/new_experiment.sh "sentiment-analysis-llama3"
```
---
## Experiment Comparison
### Results Comparison Tool
Create `scripts/utilities/compare_experiments.py`:
```python
#!/usr/bin/env python3
import json
import sys
from pathlib import Path

from tabulate import tabulate


def load_metrics(experiment_dir):
    """Load metrics from experiment"""
    metrics_file = Path(experiment_dir) / "results" / "metrics.json"
    if not metrics_file.exists():
        return None
    with open(metrics_file) as f:
        return json.load(f)


def compare_experiments(experiment_dirs):
    """Compare multiple experiments"""
    results = []
    for exp_dir in experiment_dirs:
        exp_path = Path(exp_dir)
        exp_name = exp_path.name
        metrics = load_metrics(exp_path)
        if metrics:
            results.append({
                'Experiment': exp_name,
                'Accuracy': f"{metrics.get('accuracy', 0):.2%}",
                'F1 Score': f"{metrics.get('f1_score', 0):.3f}",
                'Inference Time': f"{metrics.get('inference_time_ms', 0):.1f}ms",
                'GPU Memory': f"{metrics.get('gpu_memory_gb', 0):.1f}GB",
            })

    if results:
        print("\n" + "=" * 80)
        print("EXPERIMENT COMPARISON")
        print("=" * 80 + "\n")
        print(tabulate(results, headers='keys', tablefmt='grid'))
        print()
    else:
        print("No metrics found in specified experiments")


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: compare_experiments.py <exp_dir1> <exp_dir2> ...")
        sys.exit(1)
    compare_experiments(sys.argv[1:])
```
Usage:
```bash
python scripts/utilities/compare_experiments.py \
experiments/completed/20241019-143000-sentiment-llama3/ \
experiments/completed/20241019-155000-sentiment-phi3/
```
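For the comparison to be meaningful, each experiment's evaluation step has to write `results/metrics.json` with the keys the script reads (`accuracy`, `f1_score`, `inference_time_ms`, `gpu_memory_gb`). A minimal sketch of that write, with placeholder values (the `save_metrics` helper is illustrative):

```python
#!/usr/bin/env python3
"""Write results/metrics.json in the format compare_experiments.py expects."""
import json
from pathlib import Path

def save_metrics(metrics: dict, results_dir: str = "results") -> None:
    """Dump the metrics dict to results/metrics.json."""
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "metrics.json", "w") as f:
        json.dump(metrics, f, indent=2)

# Example (placeholder values -- fill in from your evaluation run):
# save_metrics({
#     "accuracy": 0.91,
#     "f1_score": 0.88,
#     "inference_time_ms": 74.2,
#     "gpu_memory_gb": 6.8,
# })
```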
---
## Lifecycle Management
### Moving Experiments Through States
**Active → Completed:**
```bash
# When experiment finishes successfully
mv experiments/active/20241019-143000-experiment-name \
experiments/completed/
```
**Completed → Archived:**
```bash
# After 30 days or when no longer needed for reference
mv experiments/completed/20241019-143000-experiment-name \
experiments/archived/
```
### Automated Archival Script
Create `scripts/automation/archive_old_experiments.sh`:
```bash
#!/bin/bash
# Archive experiments older than 30 days from completed/

COMPLETED_DIR="experiments/completed"
ARCHIVED_DIR="experiments/archived"
DAYS_OLD=30

echo "Archiving experiments older than ${DAYS_OLD} days..."

find "${COMPLETED_DIR}" -mindepth 1 -maxdepth 1 -type d -mtime +${DAYS_OLD} | while IFS= read -r exp_dir; do
    exp_name=$(basename "$exp_dir")
    echo "Archiving: ${exp_name}"
    mv "$exp_dir" "${ARCHIVED_DIR}/"
done

echo "✅ Archival complete"
```
Run monthly:
```bash
./scripts/automation/archive_old_experiments.sh
```
---
## Experiment Registry
Track all experiments in a central registry:
### Registry File: EXPERIMENTS.md
Location: `~/workspace/EXPERIMENTS.md`
```markdown
# Experiment Registry
Central tracking of all ML experiments.
---
## Active Experiments
| Started | Name | Objective | Status | Notes |
|---------|------|-----------|--------|-------|
| 2024-10-19 | sentiment-llama3 | Sentiment analysis with Llama 3.1 | 🔄 Training | Epoch 2/3 |
---
## Completed Experiments (Last 30 Days)
| Completed | Name | Result | Best Metric | Link |
|-----------|------|--------|-------------|------|
| 2024-10-18 | baseline-sentiment | Success | 82% accuracy | [[experiments/completed/20241018-090000-baseline-sentiment]] |
---
## Top Performers
### Sentiment Analysis
1. **sentiment-llama3** - 91% accuracy (2024-10-19)
2. **baseline-sentiment** - 82% accuracy (2024-10-18)
### Code Generation
[To be filled]
---
## Insights Database
### Temperature Settings
- **Finding:** Lower temperature (0.3) improves consistency for classification
- **Evidence:** sentiment-llama3 (0.3 temp) vs baseline-sentiment (0.7 temp)
- **Applicability:** Classification tasks
### Batch Size
- **Finding:** Batch size 32 optimal for RTX 4090 with 8B models
- **Evidence:** Multiple experiments
- **Applicability:** Similar GPU and model size
---
```
Update this after every experiment completes.
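Keeping the registry current is easier when a small script drafts the table rows from whatever sits in `experiments/completed/`. A minimal sketch that prints rows you can paste into EXPERIMENTS.md, assuming each experiment wrote the `results/metrics.json` described earlier (the script and its output format are illustrative):

```python
#!/usr/bin/env python3
"""Print EXPERIMENTS.md table rows for every completed experiment."""
import json
from pathlib import Path

def registry_rows(completed_dir: str = "experiments/completed") -> None:
    """Emit one markdown table row per completed experiment directory."""
    for exp_dir in sorted(Path(completed_dir).iterdir()):
        if not exp_dir.is_dir():
            continue
        metrics_file = exp_dir / "results" / "metrics.json"
        best = "n/a"
        if metrics_file.exists():
            metrics = json.loads(metrics_file.read_text())
            if "accuracy" in metrics:
                best = f"{metrics['accuracy']:.0%} accuracy"
        # Directory names start with YYYYMMDD-HHMMSS-, so date and name are recoverable.
        date = exp_dir.name[:8]
        completed = f"{date[:4]}-{date[4:6]}-{date[6:8]}"
        name = exp_dir.name[16:]
        # Result column is left as a placeholder to fill in by hand.
        print(f"| {completed} | {name} | [fill in] | {best} | [[{exp_dir}]] |")

if __name__ == "__main__":
    registry_rows()
```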
---
## Best Practices
### Before Starting an Experiment
1. **Create from template** - Use the automation script
2. **Fill out README** - Complete objective, hypothesis, success criteria
3. **Configure completely** - All settings in config.yaml
4. **Set random seeds** - For reproducibility
5. **Document environment** - Complete environment.yaml
### During the Experiment
1. **Update PROGRESS.md** - Track real-time progress
2. **Monitor logs** - Watch for issues early
3. **Capture observations** - Note surprises in README
4. **Save checkpoints** - Regular model checkpoints
5. **Track metrics** - Log to metrics.json
### After the Experiment
1. **Complete README** - Fill in all results sections
2. **Run comparison** - Compare with previous experiments
3. **Extract insights** - What did you learn?
4. **Update registry** - Add to EXPERIMENTS.md
5. **Move to completed** - Clear active/ directory
6. **Create session log** - If significant (Part 2)
---
## Integration with Documentation System
The experiment system integrates with the three-tier documentation from Part 2:
**Tier 1 (Execution Logs):**
- Stored in `experiments/.../logs/`
- Detailed technical output
- For debugging
**Tier 2 (Activity Log):**
```markdown
## 2024-10-19
### Experiments
- Started: 20241019-143000-sentiment-llama3
- Goal: Improve sentiment accuracy with Llama 3.1
- Status: Training (Epoch 2/3)
- Current: 89% validation accuracy
```
**Tier 3 (Session Log):**
- Created when experiment completes
- Links to experiment directory
- Provides narrative and insights
---
## Common Pitfalls
**Pitfall 1: Incomplete configuration**
- **Problem:** Missing hyperparameters make reproduction impossible
- **Solution:** Use config.yaml template, fill completely
**Pitfall 2: No random seeds**
- **Problem:** Can't reproduce exact results
- **Solution:** Always set and document random seeds (see the sketch after this list)
**Pitfall 3: Undocumented preprocessing**
- **Problem:** Different preprocessing gives different results
- **Solution:** Document every preprocessing step in README and code
**Pitfall 4: Lost model versions**
- **Problem:** "Which model version did I use?"
- **Solution:** Record exact model versions in config.yaml
**Pitfall 5: No comparison**
- **Problem:** Don't know if experiment improved things
- **Solution:** Always compare with baseline or previous experiments
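For Pitfall 2, a minimal seed-setting sketch, assuming NumPy and PyTorch are in the experiment environment (drop whichever libraries you don't use, and record the seed in `config.yaml` either way):

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed every RNG the experiment touches so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)  # Matches experiment.random_seed in config.yaml
```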
---
## What's Next
You now have systematic experiment tracking that ensures reproducibility and enables knowledge accumulation. Your experiments have standardized structures, comprehensive documentation, and clear lifecycle management.
In **Part 4: Building Production-Ready AI Agents**, we'll apply these same principles to agent development:
- Agent templates and scaffolding
- Tool integration patterns
- Testing and evaluation frameworks
- Production deployment readiness
- Agent-specific monitoring
We'll create agents that are as well-structured and reproducible as your experiments.
---
## Key Takeaways
- **Reproducibility requires comprehensive tracking** of configuration, environment, and methodology
- **Standardized templates** ensure nothing gets forgotten
- **Progress tracking** captures real-time state for debugging
- **Experiment comparison** reveals what works consistently
- **Lifecycle management** keeps workspace organized
- **Integration with documentation** creates complete knowledge capture
---
## Resources
**Templates:**
- Experiment README template (provided above)
- Configuration template (config.yaml)
- Progress tracking template (PROGRESS.md)
**Scripts:**
- `new_experiment.sh` - Create experiments from template
- `compare_experiments.py` - Compare experiment results
- `archive_old_experiments.sh` - Lifecycle management
**Tools:**
- Python logging module
- YAML for configuration
- JSON for metrics storage
- Git for version control
---
## Series Navigation
- **Previous:** [[building-production-ml-workspace-part-2-documentation|Part 2: Documentation Systems]]
- **Next:** [[building-production-ml-workspace-part-4-agents|Part 4: Building Production-Ready AI Agents]]
- **Series Home:** Building a Production ML Workspace on GPU Infrastructure
---
**Questions or suggestions?** Find me on Twitter [@bioinfo](https://twitter.com/bioinfo) or at [rundatarun.io](https://rundatarun.io)
---
### Related Articles
- [[building-production-ml-workspace-part-2-documentation|Building a Production ML Workspace: Part 2 - Documentation Systems That Scale]]
- [[building-production-ml-workspace-part-1-structure|Building a Production ML Workspace: Part 1 - Designing an Organized Structure]]
- [[building-production-ml-workspace-part-5-collaboration|Building a Production ML Workspace: Part 5 - Team Collaboration and Workflow Integration]]
---
<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>
<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>