Practical Applications · October 19, 2025 · 9 min read

Building a Production ML Workspace: Part 3 - Experiment Tracking and Reproducibility

You've organized your workspace and built a documentation system. Your files are structured, your logs are comprehensive, and you're capturing daily progress. But there's still a critical problem:

Six months from now, you won't be able to reproduce your best experiment.

You'll remember "that one experiment with great results" but you won't recall the exact model version, hyperparameters, dataset preprocessing steps, or random seed that made it work. Without systematic experiment tracking, your ML work becomes a collection of unreproducible results.

This article shows you how to build experiment tracking systems that ensure every experiment is reproducible, comparable, and builds toward accumulated knowledge.

About This Series

This is Part 3 of a 5-part series on building production ML workspaces. Part 2 covered documentation systems (see Series Navigation at the end of this article).

Coming next:

Part 4: Building Production-Ready AI Agents
Part 5: Ollama Model Management and Workflow Integration

The Reproducibility Crisis

ML practitioners face a reproducibility problem that costs time and credibility:

The Problem:

"I got 94% accuracy last month, but I can't reproduce it"
"Which dataset version did I use for that experiment?"
"What were the hyperparameters that worked so well?"
"Why did this experiment fail when the previous one succeeded?"

The Cost:

Wasted GPU time re-running experiments
Lost insights from unreproducible results
Inability to build on past successes
Difficulty comparing experiments fairly
No clear path from prototype to production

The Root Cause:

Missing configuration files
Undocumented preprocessing steps
Unknown model versions
Missing random seeds
Incomplete environment specifications
No experiment metadata

The Experiment Tracking System

Our solution: A comprehensive experiment template and tracking system that captures everything needed for reproducibility.

System Components

Experiment Tracking System
├── Experiment Template       # Standardized structure
├── Configuration Management  # All settings in version control
├── Progress Tracking        # Real-time status updates
├── Results Capture          # Standardized outputs
├── Comparison Tools         # Cross-experiment analysis
└── Lifecycle Management     # Active → Completed → Archived

Each component ensures experiments are reproducible, comparable, and build toward knowledge accumulation.

The Complete Experiment Template

Every experiment gets its own directory created from this template:

experiments/active/YYYYMMDD-HHMMSS-experiment-name/
├── README.md                 # Experiment documentation
├── config.yaml              # All configuration
├── environment.yaml         # Conda/pip dependencies
├── data/                    # Experiment-specific data
│   ├── inputs/             # Input data (symlinks)
│   └── outputs/            # Generated data
├── src/                     # Experiment code
│   ├── prepare_data.py     # Data preparation
│   ├── train.py            # Training code
│   ├── evaluate.py         # Evaluation code
│   └── utils.py            # Helper functions
├── results/                 # All outputs
│   ├── metrics.json        # Quantitative metrics
│   ├── plots/              # Visualizations
│   ├── models/             # Saved models
│   └── predictions/        # Predictions for analysis
├── logs/                    # Execution logs
│   ├── training.log        # Training output
│   └── evaluation.log      # Evaluation output
├── notebooks/              # Analysis notebooks
│   └── analysis.ipynb      # Results analysis
└── PROGRESS.md             # Real-time progress tracking
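A template only helps if each experiment directory actually matches it. The sketch below is one way to check a directory against the tree above. The function name `missing_entries` and the two constants are hypothetical, and the required paths are simply copied from the tree, so adjust them if your template differs.

```python
from pathlib import Path

# Required entries, taken from the directory tree above.
REQUIRED_FILES = ["README.md", "config.yaml", "environment.yaml", "PROGRESS.md"]
REQUIRED_DIRS = [
    "data/inputs", "data/outputs", "src",
    "results/plots", "results/models", "results/predictions",
    "logs", "notebooks",
]


def missing_entries(experiment_dir):
    """Return the template paths that are absent from experiment_dir."""
    root = Path(experiment_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list means the layout is complete. Anything else is a list of paths to create before the run starts.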

The Experiment README Template

Location: templates/experiment/README.md

# Experiment: [Descriptive Name]

**Status:** 🔄 Active | ✅ Completed | ⚠️ Failed | 📦 Archived
**Created:** YYYY-MM-DD HH:MM:SS
**Last Updated:** YYYY-MM-DD HH:MM:SS

---

## Quick Summary

**Objective:** [One sentence describing what you're trying to accomplish]

**Hypothesis:** [What you expect to happen and why]

**Result:** [Final outcome - fill in when complete]

**Key Finding:** [Most important insight - fill in when complete]

---

## Experiment Details

### Goal
[Detailed description of what you're trying to achieve and why it matters]

### Research Questions
1. [Question 1]
2. [Question 2]
3. [Question 3]

### Success Criteria
- [ ] Metric 1: Target value
- [ ] Metric 2: Target value
- [ ] Qualitative criterion

---

## Methodology

### Model
- **Base Model:** [e.g., llama3.1:8b]
- **Model Version:** [specific version or commit]
- **Custom Modifications:** [any changes to base model]

### Dataset
- **Name:** [dataset name and version]
- **Size:** [number of samples]
- **Location:** `data/inputs/[dataset-name]`
- **Preprocessing:** [description of preprocessing steps]
- **Splits:** Train (X%), Val (Y%), Test (Z%)

### Configuration
- **Configuration File:** `config.yaml`
- **Key Hyperparameters:**
  - Learning rate: X
  - Batch size: Y
  - Epochs: Z
  - Temperature: T
  - Other important parameters

### Environment
- **GPU:** [e.g., NVIDIA RTX 4090]
- **CUDA Version:** [version]
- **Python Version:** [version]
- **Dependencies:** `environment.yaml`

---

## Execution

### Setup Commands
```bash
# Create environment
conda env create -f environment.yaml
conda activate experiment-env

# Prepare data
python src/prepare_data.py

# Verify setup
python src/utils.py --verify
```

### Training
```bash
# Run training
python src/train.py --config config.yaml

# Monitor progress
tail -f logs/training.log
```

### Evaluation
```bash
# Run evaluation
python src/evaluate.py --checkpoint results/models/best_model.pt

# View results
python src/utils.py --summarize
```

---

## Results

### Quantitative Metrics

| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| Accuracy | ≥85% | [fill in] | ⏳ |
| F1 Score | ≥0.80 | [fill in] | ⏳ |
| Inference Time | <100ms | [fill in] | ⏳ |
| GPU Memory | <8GB | [fill in] | ⏳ |

### Qualitative Results
[Description of qualitative findings]

### Visualizations
- Loss curves: `results/plots/loss_curves.png`
- Confusion matrix: `results/plots/confusion_matrix.png`
- Sample predictions: `results/predictions/samples.txt`

---

## Analysis

### What Worked Well
- [Success 1]
- [Success 2]

### What Didn't Work
- [Issue 1 and why]
- [Issue 2 and why]

### Surprises
- [Unexpected finding 1]
- [Unexpected finding 2]

### Key Insights
1. **Insight 1:** [Description and impact]
2. **Insight 2:** [Description and impact]

---

## Comparison with Previous Experiments

| Experiment | Metric 1 | Metric 2 | Key Difference |
|------------|----------|----------|----------------|
| This one | [value] | [value] | [what changed] |
| [Previous] | [value] | [value] | [baseline] |
| [Another] | [value] | [value] | [comparison] |

**Performance Change:** [Better/Worse/Similar] - [Explanation]

---

## Reproducibility Checklist

- [ ] All code committed to version control
- [ ] Configuration file complete and tested
- [ ] Environment file captures all dependencies
- [ ] Random seeds set and documented
- [ ] Data preprocessing steps documented
- [ ] Results files saved
- [ ] Logs captured completely
- [ ] README fully filled out

---

## Next Steps

### Immediate Actions
- [ ] [Action 1]
- [ ] [Action 2]

### Future Experiments
1. **Idea 1:** [Description and rationale]
2. **Idea 2:** [Description and rationale]

### Production Path
[If results are good, what are the steps to production?]

---

## Resources

### Code
- Training script: `src/train.py`
- Evaluation script: `src/evaluate.py`
- Configuration: `config.yaml`

### Data
- Input data: `data/inputs/`
- Outputs: `data/outputs/`

### Logs
- Training log: `logs/training.log`
- Evaluation log: `logs/evaluation.log`

### External References
- [Paper/blog post 1]
- [Documentation 1]

---

## Notes

### [YYYY-MM-DD]
[Dated notes about decisions, observations, or issues encountered]

### [YYYY-MM-DD]
[More dated notes]

---

## Metadata

**Created By:** [Your name]
**Related Experiments:** [Links to related experiment directories]
**Tags:** [experiment-type, model-family, dataset-name]
**Duration:** [Total time spent]
**GPU Hours:** [Total GPU hours used]

Configuration Management

All experiment configuration goes in `config.yaml`:

```yaml
# config.yaml - Complete experiment configuration

experiment:
  name: "sentiment-analysis-llama3"
  description: "Sentiment analysis on product reviews"
  created: "2024-10-19T14:30:00"
  random_seed: 42

model:
  name: "llama3.1:8b"
  version: "latest"  # Replace with an exact tag or digest; "latest" can change between runs
  temperature: 0.3
  max_tokens: 100
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0

data:
  dataset_name: "product_reviews"
  dataset_version: "v2.0"
  train_path: "data/inputs/train.csv"
  val_path: "data/inputs/val.csv"
  test_path: "data/inputs/test.csv"
  preprocessing:
    lowercase: true
    remove_stopwords: false
    max_length: 512
    truncation: true

training:
  batch_size: 32
  epochs: 3
  learning_rate: 0.0001
  optimizer: "adam"
  weight_decay: 0.01
  warmup_steps: 100
  gradient_accumulation_steps: 1
  max_grad_norm: 1.0

evaluation:
  metrics:
    - accuracy
    - f1_score
    - precision
    - recall
  save_predictions: true
  plot_confusion_matrix: true

hardware:
  device: "cuda"
  gpu_memory_limit: "8GB"
  mixed_precision: true

logging:
  level: "INFO"
  log_interval: 10  # Log every N steps
  save_checkpoints: true
  checkpoint_interval: 500  # Save every N steps

output:
  results_dir: "results/"
  plots_dir: "results/plots/"
  models_dir: "results/models/"
  predictions_dir: "results/predictions/"
```

Why YAML for configuration?

Human-readable and version-controllable
Easy to compare across experiments
Can be loaded directly in Python
Supports comments for documentation
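To make "easy to compare across experiments" concrete, here is a minimal sketch that loads two configs and reports only the settings that differ. It assumes PyYAML is installed for `load_config`. The names `load_config`, `flatten`, and `diff_configs` are hypothetical helpers, not part of any library.

```python
def load_config(path):
    """Load a config.yaml file into a nested dict (requires PyYAML)."""
    import yaml  # third-party: pip install pyyaml
    with open(path) as f:
        return yaml.safe_load(f)


def flatten(config, prefix=""):
    """Turn nested dicts into dotted keys, e.g. {'training.batch_size': 32}."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def diff_configs(a, b):
    """Return {dotted_key: (value_in_a, value_in_b)} for settings that differ."""
    flat_a, flat_b = flatten(a), flatten(b)
    return {
        key: (flat_a.get(key), flat_b.get(key))
        for key in sorted(flat_a.keys() | flat_b.keys())
        if flat_a.get(key) != flat_b.get(key)
    }


# Example with inline dicts; in practice both would come from load_config().
baseline = {"model": {"name": "llama3.1:8b", "temperature": 0.7},
            "training": {"batch_size": 32}}
candidate = {"model": {"name": "llama3.1:8b", "temperature": 0.3},
             "training": {"batch_size": 32}}
print(diff_configs(baseline, candidate))  # {'model.temperature': (0.7, 0.3)}
```

A key that exists in only one config shows up with `None` on the other side, which is usually enough to spot a setting that was added or dropped between runs.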

Progress Tracking with PROGRESS.md

Track real-time experiment progress:

# Experiment Progress

**Last Updated:** 2024-10-19 15:45:00

---

## Current Status

🔄 **Phase:** Training (Epoch 3/3)
📊 **Progress:** 67% Complete
⏱️ **Elapsed Time:** 1h 15m
⏰ **Estimated Remaining:** 42m

---

## Checklist

### Setup ✅
- [x] Environment created
- [x] Dependencies installed
- [x] Data downloaded and verified
- [x] Configuration validated

### Data Preparation ✅
- [x] Data loaded
- [x] Preprocessing applied
- [x] Train/val/test splits created
- [x] Data quality checks passed

### Training 🔄
- [x] Epoch 1/3 complete - Loss: 0.452, Val Loss: 0.389
- [x] Epoch 2/3 complete - Loss: 0.356, Val Loss: 0.334
- [ ] Epoch 3/3 - In progress
- [ ] Final checkpoint saved

### Evaluation ⏳
- [ ] Load best checkpoint
- [ ] Run evaluation on test set
- [ ] Generate predictions
- [ ] Create visualizations
- [ ] Calculate all metrics

### Analysis ⏳
- [ ] Compare with baseline
- [ ] Analyze errors
- [ ] Document insights
- [ ] Update README with results

---

## Timeline

| Timestamp | Event | Details |
|-----------|-------|---------|
| 14:30:00 | Started | Environment setup |
| 14:35:00 | Data loaded | 10,000 samples |
| 14:40:00 | Training started | Epoch 1/3 |
| 15:02:00 | Epoch 1 complete | Loss: 0.452, Val Loss: 0.389 |
| 15:24:00 | Epoch 2 complete | Loss: 0.356, Val Loss: 0.334 |
| 15:45:00 | Epoch 3 in progress | ~67% complete |

---

## Metrics (Real-time)

### Training Metrics
- **Current Loss:** 0.312
- **Best Val Loss:** 0.334 (Epoch 2)
- **Learning Rate:** 0.0001
- **GPU Utilization:** 92%
- **Memory Usage:** 6.8GB / 8GB

### Observations
- Loss decreasing steadily
- No signs of overfitting yet
- GPU utilization good
- Memory usage within limits

---

## Issues Encountered

### Issue 1: Data Loading Slow
- **Time:** 14:32:00
- **Problem:** Initial data loading took 5 minutes
- **Solution:** Implemented data caching
- **Impact:** Reduced to 30 seconds

### Issue 2: None yet

---

## Next Actions

1. 🔄 Wait for Epoch 3 to complete (~42m)
2. ⏳ Run evaluation on test set
3. ⏳ Generate visualizations
4. ⏳ Compare with baseline experiment
5. ⏳ Update README with final results

Update this file throughout the experiment to track progress and capture issues as they happen.

Creating New Experiments

Automation Script

Create scripts/utilities/new_experiment.sh:

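#!/bin/bash

# Usage: ./scripts/utilities/new_experiment.sh "experiment-description"
# Run from the workspace root so the relative paths below resolve.

# Stop on the first failed command, unset variable, or failed pipeline stage
set -euo pipefail

EXPERIMENT_NAME="${1:-}"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
EXPERIMENT_DIR="experiments/active/${TIMESTAMP}-${EXPERIMENT_NAME}"
TEMPLATE_DIR="templates/experiment"

if [ -z "$EXPERIMENT_NAME" ]; then
    echo "Usage: $0 'experiment-description'"
    exit 1
fi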

echo "Creating new experiment: ${EXPERIMENT_DIR}"

# Create directory structure
mkdir -p "${EXPERIMENT_DIR}"/{data/{inputs,outputs},src,results/{plots,models,predictions},logs,notebooks}

# Copy templates
cp "${TEMPLATE_DIR}/README.md" "${EXPERIMENT_DIR}/"
cp "${TEMPLATE_DIR}/config.yaml" "${EXPERIMENT_DIR}/"
cp "${TEMPLATE_DIR}/environment.yaml" "${EXPERIMENT_DIR}/"
cp "${TEMPLATE_DIR}/PROGRESS.md" "${EXPERIMENT_DIR}/"
cp -r "${TEMPLATE_DIR}/src/"* "${EXPERIMENT_DIR}/src/"
cp "${TEMPLATE_DIR}/notebooks/analysis.ipynb" "${EXPERIMENT_DIR}/notebooks/"

# Update timestamps in README
# Note: `sed -i` with no suffix argument is GNU sed syntax; BSD/macOS sed needs `sed -i ''`
sed -i "s/YYYY-MM-DD HH:MM:SS/$(date '+%Y-%m-%d %H:%M:%S')/g" "${EXPERIMENT_DIR}/README.md"

# Initialize git (optional)
# cd "${EXPERIMENT_DIR}" && git init

echo "✅ Experiment created: ${EXPERIMENT_DIR}"
echo ""
echo "Next steps:"
echo "1. cd ${EXPERIMENT_DIR}"
echo "2. Edit config.yaml with your configuration"
echo "3. Edit README.md with experiment details"
echo "4. Run: conda env create -f environment.yaml"
echo "5. Start experimenting!"

Usage:

./scripts/utilities/new_experiment.sh "sentiment-analysis-llama3"
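Because the script names every directory `YYYYMMDD-HHMMSS-name`, other tools can recover the creation time and experiment name from the path alone. A small sketch of that parsing follows. The function name `parse_experiment_dir` is hypothetical.

```python
from datetime import datetime


def parse_experiment_dir(dir_name):
    """Split 'YYYYMMDD-HHMMSS-name' into (created_at, name)."""
    # The timestamp is 15 characters, followed by a dash and a non-empty name.
    if len(dir_name) < 17 or dir_name[15] != "-":
        raise ValueError(f"not an experiment directory name: {dir_name}")
    stamp, name = dir_name[:15], dir_name[16:]
    # strptime raises ValueError if the timestamp part is malformed.
    return datetime.strptime(stamp, "%Y%m%d-%H%M%S"), name
```

Sorting directory names alphabetically already gives chronological order, so the parsed timestamp is mainly useful for reports and age checks.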

Experiment Comparison

Results Comparison Tool

Create scripts/utilities/compare_experiments.py:

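#!/usr/bin/env python3
"""Compare metrics.json files across experiment directories."""
import json
import sys
from pathlib import Path

from tabulate import tabulate  # third-party: pip install tabulate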

def load_metrics(experiment_dir):
    """Load metrics from experiment"""
    metrics_file = Path(experiment_dir) / "results" / "metrics.json"
    if not metrics_file.exists():
        return None

    with open(metrics_file) as f:
        return json.load(f)

def compare_experiments(experiment_dirs):
    """Compare multiple experiments"""
    results = []

    for exp_dir in experiment_dirs:
        exp_path = Path(exp_dir)
        exp_name = exp_path.name
        metrics = load_metrics(exp_path)

        if metrics:
            results.append({
                'Experiment': exp_name,
                'Accuracy': f"{metrics.get('accuracy', 0):.2%}",
                'F1 Score': f"{metrics.get('f1_score', 0):.3f}",
                'Inference Time': f"{metrics.get('inference_time_ms', 0):.1f}ms",
                'GPU Memory': f"{metrics.get('gpu_memory_gb', 0):.1f}GB",
            })

    if results:
        print("\n" + "="*80)
        print("EXPERIMENT COMPARISON")
        print("="*80 + "\n")
        print(tabulate(results, headers='keys', tablefmt='grid'))
        print()
    else:
        print("No metrics found in specified experiments")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: compare_experiments.py <exp_dir1> <exp_dir2> ...")
        sys.exit(1)

    compare_experiments(sys.argv[1:])

Usage:

python scripts/utilities/compare_experiments.py \
    experiments/completed/20241019-143000-sentiment-llama3/ \
    experiments/completed/20241019-155000-sentiment-phi3/

Lifecycle Management

Moving Experiments Through States

Active → Completed:

# When experiment finishes successfully
mv experiments/active/20241019-143000-experiment-name \
   experiments/completed/

Completed → Archived:

# After 30 days or when no longer needed for reference
mv experiments/completed/20241019-143000-experiment-name \
   experiments/archived/
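If you would rather have a guard around these moves, the same transitions can be sketched in Python. This is an illustration under assumptions: `move_experiment` is a hypothetical helper, and the rule that an experiment needs `results/metrics.json` before it counts as completed is a choice made here, not something the workflow above requires.

```python
import shutil
from pathlib import Path

STATES = ("active", "completed", "archived")


def move_experiment(experiment_dir, new_state):
    """Move experiments/<state>/<name> to experiments/<new_state>/<name>."""
    source = Path(experiment_dir)
    if new_state not in STATES:
        raise ValueError(f"unknown state: {new_state}")
    if source.parent.name not in STATES:
        raise ValueError(f"{source} is not inside a state directory")
    # Assumed guard: a run only counts as completed once metrics were written.
    if new_state == "completed" and not (source / "results" / "metrics.json").exists():
        raise FileNotFoundError("results/metrics.json is missing")
    target_parent = source.parent.parent / new_state
    target_parent.mkdir(parents=True, exist_ok=True)
    target = target_parent / source.name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    return Path(shutil.move(str(source), str(target)))
```

Refusing to overwrite an existing target keeps a re-run with the same name from silently replacing earlier results.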

Automated Archival Script

Create scripts/automation/archive_old_experiments.sh:

#!/bin/bash

# Archive experiments older than 30 days from completed/

set -euo pipefail

COMPLETED_DIR="experiments/completed"
ARCHIVED_DIR="experiments/archived"
DAYS_OLD=30

echo "Archiving experiments older than ${DAYS_OLD} days..."

# Make sure the destination exists before moving anything
mkdir -p "${ARCHIVED_DIR}"

# -mindepth 1 skips COMPLETED_DIR itself; -mtime tests each directory's last-modified time
find "${COMPLETED_DIR}" -mindepth 1 -maxdepth 1 -type d -mtime +"${DAYS_OLD}" | while IFS= read -r exp_dir; do
    exp_name=$(basename "$exp_dir")
    echo "Archiving: ${exp_name}"
    mv "$exp_dir" "${ARCHIVED_DIR}/"
done

echo "✅ Archival complete"

Run monthly:

./scripts/automation/archive_old_experiments.sh

Experiment Registry

Track all experiments in a central registry:

Registry File: EXPERIMENTS.md

Location: (local path)

# Experiment Registry

Central tracking of all ML experiments.

---

## Active Experiments

| Started | Name | Objective | Status | Notes |
|---------|------|-----------|--------|-------|
| 2024-10-19 | sentiment-llama3 | Sentiment analysis with Llama 3.1 | 🔄 Training | Epoch 2/3 |

---

## Completed Experiments (Last 30 Days)

| Completed | Name | Result | Best Metric | Link |
|-----------|------|--------|-------------|------|
| 2024-10-18 | baseline-sentiment | Success | 82% accuracy | [20241018-090000-baseline-sentiment](experiments/completed/20241018-090000-baseline-sentiment/) |

---

## Top Performers

### Sentiment Analysis
1. **sentiment-llama3** - 91% accuracy (2024-10-19)
2. **baseline-sentiment** - 82% accuracy (2024-10-18)

### Code Generation
[To be filled]

---

## Insights Database

### Temperature Settings
- **Finding:** Lower temperature (0.3) improves consistency for classification
- **Evidence:** sentiment-llama3 (0.3 temp) vs baseline-sentiment (0.7 temp)
- **Applicability:** Classification tasks

### Batch Size
- **Finding:** Batch size 32 optimal for RTX 4090 with 8B models
- **Evidence:** Multiple experiments
- **Applicability:** Similar GPU and model size

---

Update this after every experiment completes.

Best Practices

Before Starting an Experiment

Create from template - Use the automation script
Fill out README - Complete objective, hypothesis, success criteria
Configure completely - All settings in config.yaml
Set random seeds - For reproducibility
Document environment - Complete environment.yaml
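`environment.yaml` records the packages, but not the interpreter, operating system, or code revision a run actually used. A minimal sketch for capturing those alongside the experiment is below. The function names are hypothetical, and the git lookup assumes the experiment lives inside a git repository, returning `None` otherwise.

```python
import json
import platform
import subprocess
import sys


def git_commit():
    """Return the current commit hash, or None outside a repo or without git."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def environment_snapshot():
    """Collect basic facts about the machine and code version for this run."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "git_commit": git_commit(),
    }


if __name__ == "__main__":
    print(json.dumps(environment_snapshot(), indent=2))
```

Saving this output next to `config.yaml` at the start of a run covers the Environment section of the README without manual copying.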

During the Experiment

Update PROGRESS.md - Track real-time progress
Monitor logs - Watch for issues early
Capture observations - Note surprises in README
Save checkpoints - Regular model checkpoints
Track metrics - Log to metrics.json
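For the metrics step, a small helper keeps `results/metrics.json` in the shape the comparison script reads. This is a sketch: `save_metrics` is a hypothetical name, and the keys in the usage note are the ones `compare_experiments.py` looks up.

```python
import json
from pathlib import Path


def save_metrics(experiment_dir, **metrics):
    """Merge the given metrics into results/metrics.json and return the full dict."""
    metrics_file = Path(experiment_dir) / "results" / "metrics.json"
    metrics_file.parent.mkdir(parents=True, exist_ok=True)
    # Keep earlier values so training and evaluation can each add their own keys.
    existing = json.loads(metrics_file.read_text()) if metrics_file.exists() else {}
    existing.update(metrics)
    metrics_file.write_text(json.dumps(existing, indent=2, sort_keys=True))
    return existing
```

A call such as `save_metrics(".", accuracy=0.91, f1_score=0.89)` from the experiment directory writes fractions, which matches the percent formatting used in the comparison table.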

After the Experiment

Complete README - Fill in all results sections
Run comparison - Compare with previous experiments
Extract insights - What did you learn?
Update registry - Add to EXPERIMENTS.md
Move to completed - Clear active/ directory
Create session log - If the experiment was significant (see Part 2)

Integration with Documentation System

The experiment system integrates with the three-tier documentation from Part 2:

Tier 1 (Execution Logs):

Stored in experiments/.../logs/
Detailed technical output
For debugging

Tier 2 (Activity Log):

## 2024-10-19

### Experiments
- Started: 20241019-143000-sentiment-llama3
  - Goal: Improve sentiment accuracy with Llama 3.1
  - Status: Training (Epoch 2/3)
  - Current: 89% validation accuracy

Tier 3 (Session Log):

Created when experiment completes
Links to experiment directory
Provides narrative and insights

Common Pitfalls

Pitfall 1: Incomplete configuration

Problem: Missing hyperparameters make reproduction impossible
Solution: Use config.yaml template, fill completely

Pitfall 2: No random seeds

Problem: Can't reproduce exact results
Solution: Always set and document random seeds
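A seed only helps if it reaches every library that draws random numbers. The sketch below shows one common pattern, with `set_seed` as a hypothetical helper name. It seeds NumPy and PyTorch only if they are installed. Seeding alone does not guarantee bit-identical GPU results, since some operations are nondeterministic.

```python
import os
import random


def set_seed(seed):
    """Seed the common sources of randomness; skip libraries that are not installed."""
    random.seed(seed)
    # Only affects Python processes started after this point, not the current one.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass
```

Calling `set_seed(config["experiment"]["random_seed"])` at the top of `train.py` keeps the seed in `config.yaml` as the single source of truth.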

Pitfall 3: Undocumented preprocessing

Problem: Different preprocessing gives different results
Solution: Document every preprocessing step in README and code

Pitfall 4: Lost model versions

Problem: "Which model version did I use?"
Solution: Record exact model versions in config.yaml
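For files you hold locally, such as a dataset file or a checkpoint under `results/models/`, a content hash is a version identifier that cannot drift. A minimal sketch, with `sha256_of_file` as a hypothetical helper name:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest in `config.yaml` or the README lets you confirm later that the file on disk is the one the experiment used.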

Pitfall 5: No comparison

Problem: Don't know if experiment improved things
Solution: Always compare with baseline or previous experiments
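The comparison table shows values side by side, but the direction and size of the change are often what you want. A small sketch that computes per-metric differences between two loaded `metrics.json` dicts is below. `metric_deltas` is a hypothetical helper, and the example numbers are the accuracies from the registry sample above.

```python
def metric_deltas(baseline, candidate):
    """Return {metric: candidate - baseline} for numeric metrics present in both."""
    deltas = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if isinstance(base_value, (int, float)) and isinstance(new_value, (int, float)):
            deltas[name] = new_value - base_value
    return deltas


# Example: baseline-sentiment vs sentiment-llama3
for name, delta in metric_deltas({"accuracy": 0.82}, {"accuracy": 0.91}).items():
    print(f"{name}: {delta:+.3f}")
```

Whether a positive delta is an improvement depends on the metric: higher is better for accuracy, lower is better for inference time and memory.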

What's Next

You now have systematic experiment tracking that ensures reproducibility and enables knowledge accumulation. Your experiments have standardized structures, comprehensive documentation, and clear lifecycle management.

In Part 4: Building Production-Ready AI Agents, we'll apply these same principles to agent development:

Agent templates and scaffolding
Tool integration patterns
Testing and evaluation frameworks
Production deployment readiness
Agent-specific monitoring

We'll create agents that are as well-structured and reproducible as your experiments.

Key Takeaways

Reproducibility requires comprehensive tracking of configuration, environment, and methodology
Standardized templates ensure nothing gets forgotten
Progress tracking captures real-time state for debugging
Experiment comparison reveals what works consistently
Lifecycle management keeps workspace organized
Integration with documentation creates complete knowledge capture

Resources

Templates:

Experiment README template (provided above)
Configuration template (config.yaml)
Progress tracking template (PROGRESS.md)

Scripts:

new_experiment.sh - Create experiments from template
compare_experiments.py - Compare experiment results
archive_old_experiments.sh - Lifecycle management

Tools:

Python logging module
YAML for configuration
JSON for metrics storage
Git for version control

Series Navigation

Previous: Part 2: Documentation Systems
Next: Part 4: Building Production-Ready AI Agents
Series Home: Building a Production ML Workspace on GPU Infrastructure

Questions or suggestions? Find me on Twitter @bioinfo or at rundatarun.io

About the Author: Justin Johnson builds AI systems and writes about practical AI development.
