Building a Production ML Workspace: Part 3 - Experiment Tracking and Reproducibility

You've organized your workspace and built a documentation system. Your files are structured, your logs are comprehensive, and you're capturing daily progress. But there's still a critical problem:

Six months from now, you won't be able to reproduce your best experiment.

You'll remember "that one experiment with great results," but you won't recall the exact model version, hyperparameters, dataset preprocessing steps, or random seed that made it work. Without systematic experiment tracking, your ML work becomes a collection of unreproducible results.

This article shows you how to build an experiment tracking system that makes every experiment reproducible and comparable, so that each one adds to your accumulated knowledge.

About This Series

This is Part 3 of a 5-part series on building production ML workspaces. Previous parts:

  • Part 1: Workspace Structure
  • Part 2: Documentation Systems

Coming next:

  • Part 4: Building Production-Ready AI Agents
  • Part 5: Ollama Model Management and Workflow Integration

The Reproducibility Crisis

ML practitioners face a reproducibility problem that costs time and credibility:

The Problem:

  • "I got 94% accuracy last month, but I can't reproduce it"
  • "Which dataset version did I use for that experiment?"
  • "What were the hyperparameters that worked so well?"
  • "Why did this experiment fail when the previous one succeeded?"

The Cost:

  • Wasted GPU time re-running experiments
  • Lost insights from unreproducible results
  • Inability to build on past successes
  • Difficulty comparing experiments fairly
  • No clear path from prototype to production

The Root Cause:

  • Missing configuration files
  • Undocumented preprocessing steps
  • Unknown model versions
  • Missing random seeds
  • Incomplete environment specifications
  • No experiment metadata
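Two of these root causes, missing random seeds and incomplete environment information, can be addressed with a few lines at the top of a training script. The following is a minimal sketch, not the series' own code: the function names `set_seed` and `capture_environment` are hypothetical, and NumPy and PyTorch are treated as optional and only seeded if they happen to be installed.

```python
import os
import platform
import random
import subprocess
import sys


def set_seed(seed: int) -> None:
    """Seed every random number generator that is available."""
    # PYTHONHASHSEED only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np  # optional dependency (assumption)
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional dependency (assumption)
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


def capture_environment() -> dict:
    """Collect basic facts about the run for the experiment metadata."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = None  # git is not installed, or this is not a repository
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "git_commit": commit,
    }
```

Seeding makes the random number streams repeatable, but it does not by itself guarantee bit-identical results on a GPU, so the captured environment details still matter.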

The Experiment Tracking System

Our solution: A comprehensive experiment template and tracking system that captures everything needed for reproducibility.

System Components

Experiment Tracking System
├── Experiment Template       # Standardized structure
├── Configuration Management  # All settings in version control
├── Progress Tracking         # Real-time status updates
├── Results Capture           # Standardized outputs
├── Comparison Tools          # Cross-experiment analysis
└── Lifecycle Management      # Active → Completed → Archived

Each component helps keep experiments reproducible and comparable, so that results accumulate into reusable knowledge.


The Complete Experiment Template

Every experiment gets its own directory created from this template:

experiments/active/YYYYMMDD-HHMMSS-experiment-name/
├── README.md                 # Experiment documentation
├── config.yaml              # All configuration
├── environment.yaml         # Conda/pip dependencies
├── data/                    # Experiment-specific data
│   ├── inputs/             # Input data (symlinks)
│   └── outputs/            # Generated data
├── src/                     # Experiment code
│   ├── prepare_data.py     # Data preparation
│   ├── train.py            # Training code
│   ├── evaluate.py         # Evaluation code
│   └── utils.py            # Helper functions
├── results/                 # All outputs
│   ├── metrics.json        # Quantitative metrics
│   ├── plots/              # Visualizations
│   ├── models/             # Saved models
│   └── predictions/        # Predictions for analysis
├── logs/                    # Execution logs
│   ├── training.log        # Training output
│   └── evaluation.log      # Evaluation output
├── notebooks/              # Analysis notebooks
│   └── analysis.ipynb      # Results analysis
└── PROGRESS.md             # Real-time progress tracking

The Experiment README Template

Location: templates/experiment/README.md

# Experiment: [Descriptive Name]

**Status:** 🔄 Active | ✅ Completed | ⚠️ Failed | 📦 Archived
**Created:** YYYY-MM-DD HH:MM:SS
**Last Updated:** YYYY-MM-DD HH:MM:SS

---

## Quick Summary

**Objective:** [One sentence describing what you're trying to accomplish]

**Hypothesis:** [What you expect to happen and why]

**Result:** [Final outcome - fill in when complete]

**Key Finding:** [Most important insight - fill in when complete]

---

## Experiment Details

### Goal
[Detailed description of what you're trying to achieve and why it matters]

### Research Questions
1. [Question 1]
2. [Question 2]
3. [Question 3]

### Success Criteria
- [ ] Metric 1: Target value
- [ ] Metric 2: Target value
- [ ] Qualitative criterion

---

## Methodology

### Model
- **Base Model:** [e.g., llama3.1:8b]
- **Model Version:** [specific version or commit]
- **Custom Modifications:** [any changes to base model]

### Dataset
- **Name:** [dataset name and version]
- **Size:** [number of samples]
- **Location:** `data/inputs/[dataset-name]`
- **Preprocessing:** [description of preprocessing steps]
- **Splits:** Train (X%), Val (Y%), Test (Z%)

### Configuration
- **Configuration File:** `config.yaml`
- **Key Hyperparameters:**
  - Learning rate: X
  - Batch size: Y
  - Epochs: Z
  - Temperature: T
  - Other important parameters

### Environment
- **GPU:** [e.g., NVIDIA RTX 4090]
- **CUDA Version:** [version]
- **Python Version:** [version]
- **Dependencies:** `environment.yaml`

---

## Execution

### Setup Commands
```bash
# Create environment
conda env create -f environment.yaml
conda activate experiment-env

# Prepare data
python src/prepare_data.py

# Verify setup
python src/utils.py --verify
```

### Training
```bash
# Run training
python src/train.py --config config.yaml

# Monitor progress
tail -f logs/training.log
```

### Evaluation
```bash
# Run evaluation
python src/evaluate.py --checkpoint results/models/best_model.pt

# View results
python src/utils.py --summarize
```

## Results

### Quantitative Metrics

| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| Accuracy | ≥85% | [fill in] | |
| F1 Score | ≥0.80 | [fill in] | |
| Inference Time | <100ms | [fill in] | |
| GPU Memory | <8GB | [fill in] | |

### Qualitative Results
[Description of qualitative findings]

### Visualizations
- Loss curves: `results/plots/loss_curves.png`
- Confusion matrix: `results/plots/confusion_matrix.png`
- Sample predictions: `results/predictions/samples.txt`

## Analysis

### What Worked Well
1. [Success 1]
2. [Success 2]

### What Didn't Work
1. [Issue 1 and why]
2. [Issue 2 and why]

### Surprises
- [Unexpected finding 1]
- [Unexpected finding 2]

### Key Insights
1. Insight 1: [Description and impact]
2. Insight 2: [Description and impact]

## Comparison with Previous Experiments

| Experiment | Metric 1 | Metric 2 | Key Difference |
|------------|----------|----------|----------------|
| This one | [value] | [value] | [what changed] |
| [Previous] | [value] | [value] | [baseline] |
| [Another] | [value] | [value] | [comparison] |

**Performance Change:** [Better/Worse/Similar] - [Explanation]


---

## Reproducibility Checklist

- [ ] All code committed to version control
- [ ] Configuration file complete and tested
- [ ] Environment file captures all dependencies
- [ ] Random seeds set and documented
- [ ] Data preprocessing steps documented
- [ ] Results files saved
- [ ] Logs captured completely
- [ ] README fully filled out

## Next Steps

### Immediate Actions
1. [Action 1]
2. [Action 2]

### Future Experiments
- Idea 1: [Description and rationale]
- Idea 2: [Description and rationale]

### Production Path
[If results are good, what are the steps to production?]


---

## Resources

### Code
- Training script: `src/train.py`
- Evaluation script: `src/evaluate.py`
- Configuration: `config.yaml`

### Data
- Input data: `data/inputs/`
- Outputs: `data/outputs/`

### Logs
- Training log: `logs/training.log`
- Evaluation log: `logs/evaluation.log`

### External References
- [Paper/blog post 1]
- [Documentation 1]

## Notes

### [YYYY-MM-DD]
[Dated notes about decisions, observations, or issues encountered]

### [YYYY-MM-DD]
[More dated notes]


---

## Metadata

**Created By:** [Your name]
**Related Experiments:** [Links to related experiment directories]
**Tags:** [experiment-type, model-family, dataset-name]
**Duration:** [Total time spent]
**GPU Hours:** [Total GPU hours used]


Configuration Management

All experiment configuration goes in config.yaml:

```yaml
# config.yaml - Complete experiment configuration

experiment:
  name: "sentiment-analysis-llama3"
  description: "Sentiment analysis on product reviews"
  created: "2024-10-19T14:30:00"
  random_seed: 42

model:
  name: "llama3.1:8b"
  version: "latest"  # Replace with an exact tag or digest; "latest" cannot be reproduced later
  temperature: 0.3
  max_tokens: 100
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0

data:
  dataset_name: "product_reviews"
  dataset_version: "v2.0"
  train_path: "data/inputs/train.csv"
  val_path: "data/inputs/val.csv"
  test_path: "data/inputs/test.csv"
  preprocessing:
    lowercase: true
    remove_stopwords: false
    max_length: 512
    truncation: true

training:
  batch_size: 32
  epochs: 3
  learning_rate: 0.0001
  optimizer: "adam"
  weight_decay: 0.01
  warmup_steps: 100
  gradient_accumulation_steps: 1
  max_grad_norm: 1.0

evaluation:
  metrics:
    - accuracy
    - f1_score
    - precision
    - recall
  save_predictions: true
  plot_confusion_matrix: true

hardware:
  device: "cuda"
  gpu_memory_limit: 8GB
  mixed_precision: true

logging:
  level: "INFO"
  log_interval: 10  # Log every N steps
  save_checkpoints: true
  checkpoint_interval: 500  # Save every N steps

output:
  results_dir: "results/"
  plots_dir: "results/plots/"
  models_dir: "results/models/"
  predictions_dir: "results/predictions/"
```

Why YAML for configuration?

  • Human-readable and version-controllable
  • Easy to compare across experiments
  • Can be loaded directly in Python
  • Supports comments for documentation
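As an illustration of loading the file directly in Python, here is a minimal sketch that parses config.yaml and checks it for a few reproducibility gaps before a run starts. It assumes PyYAML is installed; `REQUIRED_SECTIONS`, `validate_config`, and `load_config` are hypothetical names, and the specific checks are one possible choice rather than part of the original setup.

```python
from pathlib import Path

# Sections this sketch treats as mandatory (an assumption, adjust as needed)
REQUIRED_SECTIONS = ("experiment", "model", "data", "training")


def validate_config(config: dict) -> list:
    """Return a list of problems that would hurt reproducibility."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in config:
            problems.append(f"missing section: {section}")
    if "random_seed" not in (config.get("experiment") or {}):
        problems.append("experiment.random_seed is not set")
    if (config.get("model") or {}).get("version") in (None, "latest"):
        problems.append("model.version is not pinned to an exact version")
    return problems


def load_config(path="config.yaml") -> dict:
    """Parse config.yaml and fail fast if it is incomplete."""
    import yaml  # PyYAML; imported here so validate_config works without it

    config = yaml.safe_load(Path(path).read_text(encoding="utf-8")) or {}
    problems = validate_config(config)
    if problems:
        raise ValueError("; ".join(problems))
    return config
```

Failing at load time is a deliberate choice here: a run that starts with an unpinned model version or no seed cannot be repaired afterwards.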

Progress Tracking with PROGRESS.md

Track real-time experiment progress:

# Experiment Progress

**Last Updated:** 2024-10-19 15:45:00

---

## Current Status

🔄 **Phase:** Training (Epoch 2/3)
📊 **Progress:** 67% Complete
⏱️ **Elapsed Time:** 1h 15m
⏰ **Estimated Remaining:** 42m

---

## Checklist

### Setup ✅
- [x] Environment created
- [x] Dependencies installed
- [x] Data downloaded and verified
- [x] Configuration validated

### Data Preparation ✅
- [x] Data loaded
- [x] Preprocessing applied
- [x] Train/val/test splits created
- [x] Data quality checks passed

### Training 🔄
- [x] Epoch 1/3 complete - Loss: 0.452, Val Loss: 0.389
- [x] Epoch 2/3 complete - Loss: 0.356, Val Loss: 0.334
- [ ] Epoch 3/3 - In progress
- [ ] Final checkpoint saved

### Evaluation ⏳
- [ ] Load best checkpoint
- [ ] Run evaluation on test set
- [ ] Generate predictions
- [ ] Create visualizations
- [ ] Calculate all metrics

### Analysis ⏳
- [ ] Compare with baseline
- [ ] Analyze errors
- [ ] Document insights
- [ ] Update README with results

---

## Timeline

| Timestamp | Event | Details |
|-----------|-------|---------|
| 14:30:00 | Started | Environment setup |
| 14:35:00 | Data loaded | 10,000 samples |
| 14:40:00 | Training started | Epoch 1/3 |
| 15:02:00 | Epoch 1 complete | Loss: 0.452, Val Loss: 0.389 |
| 15:24:00 | Epoch 2 complete | Loss: 0.356, Val Loss: 0.334 |
| 15:45:00 | Epoch 3 in progress | ~67% complete |

---

## Metrics (Real-time)

### Training Metrics
- **Current Loss:** 0.312
- **Best Val Loss:** 0.334 (Epoch 2)
- **Learning Rate:** 0.0001
- **GPU Utilization:** 92%
- **Memory Usage:** 6.8GB / 8GB

### Observations
- Loss decreasing steadily
- No signs of overfitting yet
- GPU utilization good
- Memory usage within limits

---

## Issues Encountered

### Issue 1: Data Loading Slow
- **Time:** 14:32:00
- **Problem:** Initial data loading took 5 minutes
- **Solution:** Implemented data caching
- **Impact:** Reduced to 30 seconds

### Issue 2: None yet

---

## Next Actions

1. 🔄 Wait for Epoch 3 to complete (~42m)
2. ⏳ Run evaluation on test set
3. ⏳ Generate visualizations
4. ⏳ Compare with baseline experiment
5. ⏳ Update README with final results

Update this file throughout the experiment to track progress and capture issues as they happen.
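Manual updates are easy to forget during a long run, so the training script could append timeline rows itself. The following is a rough sketch under the assumption that PROGRESS.md keeps the `## Timeline` heading, the pipe-table layout, and the `**Last Updated:**` line shown above; `add_timeline_event` is a hypothetical helper name.

```python
from datetime import datetime


def add_timeline_event(markdown, event, details, when=None):
    """Return PROGRESS.md text with one more row in the Timeline table."""
    when = when or datetime.now()
    lines = markdown.splitlines()
    start = lines.index("## Timeline")  # ValueError if the heading is missing

    # Find the last table row that follows the heading
    last_row = None
    for i in range(start + 1, len(lines)):
        if lines[i].startswith("|"):
            last_row = i
        elif last_row is not None:
            break
    if last_row is None:
        raise ValueError("no table found under '## Timeline'")

    lines.insert(last_row + 1, f"| {when:%H:%M:%S} | {event} | {details} |")

    # Keep the "Last Updated" stamp in sync with the newest event
    stamp = f"**Last Updated:** {when:%Y-%m-%d %H:%M:%S}"
    lines = [stamp if line.startswith("**Last Updated:**") else line
             for line in lines]
    return "\n".join(lines) + "\n"
```

The function works on text rather than on a file, so a caller would read PROGRESS.md, pass the contents through `add_timeline_event`, and write the result back.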


Creating New Experiments

Automation Script

Create scripts/utilities/new_experiment.sh:

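#!/bin/bash

# Usage: ./scripts/utilities/new_experiment.sh "experiment-description"

set -euo pipefail  # Exit on errors, unset variables, and failed pipelines

EXPERIMENT_NAME="${1:-}"
if [ -z "$EXPERIMENT_NAME" ]; then
    echo "Usage: $0 'experiment-description'" >&2
    exit 1
fi

TIMESTAMP=$(date +%Y%m%d-%H%M%S)
EXPERIMENT_DIR="experiments/active/${TIMESTAMP}-${EXPERIMENT_NAME}"
TEMPLATE_DIR="templates/experiment"

# Run from the workspace root so the relative template path resolves
if [ ! -d "$TEMPLATE_DIR" ]; then
    echo "Template directory not found: ${TEMPLATE_DIR}" >&2
    exit 1
fi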

echo "Creating new experiment: ${EXPERIMENT_DIR}"

# Create directory structure
mkdir -p "${EXPERIMENT_DIR}"/{data/{inputs,outputs},src,results/{plots,models,predictions},logs,notebooks}

# Copy templates
cp "${TEMPLATE_DIR}/README.md" "${EXPERIMENT_DIR}/"
cp "${TEMPLATE_DIR}/config.yaml" "${EXPERIMENT_DIR}/"
cp "${TEMPLATE_DIR}/environment.yaml" "${EXPERIMENT_DIR}/"
cp "${TEMPLATE_DIR}/PROGRESS.md" "${EXPERIMENT_DIR}/"
cp -r "${TEMPLATE_DIR}/src/"* "${EXPERIMENT_DIR}/src/"
cp "${TEMPLATE_DIR}/notebooks/analysis.ipynb" "${EXPERIMENT_DIR}/notebooks/"

# Update timestamps in README (-i.bak works with both GNU and BSD sed)
sed -i.bak "s/YYYY-MM-DD HH:MM:SS/$(date '+%Y-%m-%d %H:%M:%S')/g" "${EXPERIMENT_DIR}/README.md"
rm -f "${EXPERIMENT_DIR}/README.md.bak"

# Initialize git (optional)
# cd "${EXPERIMENT_DIR}" && git init

echo "✅ Experiment created: ${EXPERIMENT_DIR}"
echo ""
echo "Next steps:"
echo "1. cd ${EXPERIMENT_DIR}"
echo "2. Edit config.yaml with your configuration"
echo "3. Edit README.md with experiment details"
echo "4. Run: conda env create -f environment.yaml"
echo "5. Start experimenting!"

Usage:

./scripts/utilities/new_experiment.sh "sentiment-analysis-llama3"

Experiment Comparison

Results Comparison Tool

Create scripts/utilities/compare_experiments.py:

#!/usr/bin/env python3
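"""Compare results/metrics.json across experiment directories."""
import json
import sys
from pathlib import Path

from tabulate import tabulate  # Third-party: pip install tabulate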

def load_metrics(experiment_dir):
    """Load metrics from experiment"""
    metrics_file = Path(experiment_dir) / "results" / "metrics.json"
    if not metrics_file.exists():
        return None

    with open(metrics_file) as f:
        return json.load(f)

def compare_experiments(experiment_dirs):
    """Compare multiple experiments"""
    results = []

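    for exp_dir in experiment_dirs:
        exp_path = Path(exp_dir)
        exp_name = exp_path.name
        metrics = load_metrics(exp_path)

        if metrics is None:
            # Report skipped experiments instead of dropping them silently
            print(f"Warning: no results/metrics.json in {exp_dir}", file=sys.stderr)
            continue

        # Missing keys fall back to 0, so a 0 in the table may mean "not recorded"
        results.append({
            'Experiment': exp_name,
            'Accuracy': f"{metrics.get('accuracy', 0):.2%}",
            'F1 Score': f"{metrics.get('f1_score', 0):.3f}",
            'Inference Time': f"{metrics.get('inference_time_ms', 0):.1f}ms",
            'GPU Memory': f"{metrics.get('gpu_memory_gb', 0):.1f}GB",
        })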

    if results:
        print("\n" + "="*80)
        print("EXPERIMENT COMPARISON")
        print("="*80 + "\n")
        print(tabulate(results, headers='keys', tablefmt='grid'))
        print()
    else:
        print("No metrics found in specified experiments")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: compare_experiments.py <exp_dir1> <exp_dir2> ...")
        sys.exit(1)

    compare_experiments(sys.argv[1:])

Usage:

python scripts/utilities/compare_experiments.py \
    experiments/completed/20241019-143000-sentiment-llama3/ \
    experiments/completed/20241019-155000-sentiment-phi3/
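The comparison only works if every experiment writes results/metrics.json with the same keys. A minimal sketch of the writing side follows, using the four keys that compare_experiments.py reads (`accuracy`, `f1_score`, `inference_time_ms`, `gpu_memory_gb`). The helper name `save_metrics` is hypothetical, and storing accuracy as a fraction between 0 and 1 is inferred from the `:.2%` format used in the comparison script.

```python
import json
from pathlib import Path


def save_metrics(experiment_dir, accuracy, f1_score, inference_time_ms, gpu_memory_gb):
    """Write results/metrics.json with the keys compare_experiments.py reads."""
    if not 0.0 <= accuracy <= 1.0:
        # The comparison script formats accuracy with ':.2%', so 0.91 means 91%
        raise ValueError("accuracy must be a fraction between 0 and 1")

    metrics = {
        "accuracy": float(accuracy),
        "f1_score": float(f1_score),
        "inference_time_ms": float(inference_time_ms),
        "gpu_memory_gb": float(gpu_memory_gb),
    }
    results_dir = Path(experiment_dir) / "results"
    results_dir.mkdir(parents=True, exist_ok=True)

    metrics_file = results_dir / "metrics.json"
    metrics_file.write_text(json.dumps(metrics, indent=2) + "\n", encoding="utf-8")
    return metrics_file
```

Calling this once at the end of evaluate.py would keep the file format identical across experiments, which is what makes the table comparable.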

Lifecycle Management

Moving Experiments Through States

Active → Completed:

# When experiment finishes successfully
mv experiments/active/20241019-143000-experiment-name \
   experiments/completed/

Completed → Archived:

# After 30 days or when no longer needed for reference
mv experiments/completed/20241019-143000-experiment-name \
   experiments/archived/

Automated Archival Script

Create scripts/automation/archive_old_experiments.sh:

#!/bin/bash

# Archive experiments older than 30 days from completed/

COMPLETED_DIR="experiments/completed"
ARCHIVED_DIR="experiments/archived"
DAYS_OLD=30

echo "Archiving experiments older than ${DAYS_OLD} days..."

# Make sure the destination exists before moving anything
mkdir -p "${ARCHIVED_DIR}"

# -mindepth 1 excludes the completed/ directory itself
find "${COMPLETED_DIR}" -mindepth 1 -maxdepth 1 -type d -mtime +"${DAYS_OLD}" | while IFS= read -r exp_dir; do
    exp_name=$(basename "$exp_dir")
    echo "Archiving: ${exp_name}"
    mv "$exp_dir" "${ARCHIVED_DIR}/"
done

echo "✅ Archival complete"

Run monthly:

./scripts/automation/archive_old_experiments.sh
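One caveat with `find -mtime` is that a directory's modification time changes whenever an entry is added to or removed from it, so it does not always reflect when the experiment ran. An alternative is to read the age from the `YYYYMMDD-HHMMSS` prefix of the directory name. This is only a sketch with hypothetical function names, and it measures time since the experiment was created, not since it was completed.

```python
from datetime import datetime, timedelta
from pathlib import Path


def experiment_started_at(dir_name: str) -> datetime:
    """Parse the YYYYMMDD-HHMMSS prefix of an experiment directory name."""
    return datetime.strptime(dir_name[:15], "%Y%m%d-%H%M%S")


def experiments_older_than(completed_dir, days, now=None):
    """List experiment directories whose name timestamp is older than `days`."""
    cutoff = (now or datetime.now()) - timedelta(days=days)
    old = []
    for path in sorted(Path(completed_dir).iterdir()):
        if not path.is_dir():
            continue
        try:
            started = experiment_started_at(path.name)
        except ValueError:
            continue  # Name does not follow the convention; leave it alone
        if started < cutoff:
            old.append(path)
    return old
```

The function only returns candidates. Moving them is left to the caller, which makes it easy to print the list as a dry run before archiving anything.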

Experiment Registry

Track all experiments in a central registry:

Registry File: EXPERIMENTS.md

Location: (local path)

# Experiment Registry

Central tracking of all ML experiments.

---

## Active Experiments

| Started | Name | Objective | Status | Notes |
|---------|------|-----------|--------|-------|
| 2024-10-19 | sentiment-llama3 | Sentiment analysis with Llama 3.1 | 🔄 Training | Epoch 2/3 |

---

## Completed Experiments (Last 30 Days)

| Completed | Name | Result | Best Metric | Link |
|-----------|------|--------|-------------|------|
| 2024-10-18 | baseline-sentiment | Success | 82% accuracy | [experiments/completed/20241018-090000-baseline-sentiment](experiments/completed/20241018-090000-baseline-sentiment) |

---

## Top Performers

### Sentiment Analysis
1. **sentiment-llama3** - 91% accuracy (2024-10-19)
2. **baseline-sentiment** - 82% accuracy (2024-10-18)

### Code Generation
[To be filled]

---

## Insights Database

### Temperature Settings
- **Finding:** Lower temperature (0.3) improves consistency for classification
- **Evidence:** sentiment-llama3 (0.3 temp) vs baseline-sentiment (0.7 temp)
- **Applicability:** Classification tasks

### Batch Size
- **Finding:** Batch size 32 optimal for RTX 4090 with 8B models
- **Evidence:** Multiple experiments
- **Applicability:** Similar GPU and model size

---

Update this after every experiment completes.


Best Practices

Before Starting an Experiment

  1. Create from template - Use the automation script
  2. Fill out README - Complete objective, hypothesis, success criteria
  3. Configure completely - All settings in config.yaml
  4. Set random seeds - For reproducibility
  5. Document environment - Complete environment.yaml

During the Experiment

  1. Update PROGRESS.md - Track real-time progress
  2. Monitor logs - Watch for issues early
  3. Capture observations - Note surprises in README
  4. Save checkpoints - Regular model checkpoints
  5. Track metrics - Log to metrics.json

After the Experiment

  1. Complete README - Fill in all results sections
  2. Run comparison - Compare with previous experiments
  3. Extract insights - What did you learn?
  4. Update registry - Add to EXPERIMENTS.md
  5. Move to completed - Clear active/ directory
  6. Create session log - If the experiment was significant (see Part 2)

Integration with Documentation System

The experiment system integrates with the three-tier documentation from Part 2:

Tier 1 (Execution Logs):

  • Stored in experiments/.../logs/
  • Detailed technical output
  • For debugging
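For the Tier 1 logs, Python's standard logging module can write to the experiment's logs/ directory and to the console at the same time. Below is a minimal sketch; `setup_logging` is a hypothetical helper, and the default level mirrors the `logging.level` value from config.yaml.

```python
import logging
from pathlib import Path


def setup_logging(experiment_dir, name="training", level="INFO"):
    """Send log records to logs/<name>.log and to the console."""
    log_dir = Path(experiment_dir) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()   # Avoid duplicate lines if called twice
    logger.propagate = False  # Do not also emit through the root logger

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handlers = (
        logging.FileHandler(log_dir / f"{name}.log", encoding="utf-8"),
        logging.StreamHandler(),
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

With `name="training"` the output lands in logs/training.log, the same file that `tail -f logs/training.log` follows; passing `name="evaluation"` gives the matching evaluation log.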

Tier 2 (Activity Log):

## 2024-10-19

### Experiments
- Started: 20241019-143000-sentiment-llama3
  - Goal: Improve sentiment accuracy with Llama 3.1
  - Status: Training (Epoch 2/3)
  - Current: 89% validation accuracy

Tier 3 (Session Log):

  • Created when experiment completes
  • Links to experiment directory
  • Provides narrative and insights

Common Pitfalls

Pitfall 1: Incomplete configuration

  • Problem: Missing hyperparameters make reproduction impossible
  • Solution: Use config.yaml template, fill completely

Pitfall 2: No random seeds

  • Problem: Can't reproduce exact results
  • Solution: Always set and document random seeds

Pitfall 3: Undocumented preprocessing

  • Problem: Different preprocessing gives different results
  • Solution: Document every preprocessing step in README and code

Pitfall 4: Lost model versions

  • Problem: "Which model version did I use?"
  • Solution: Record exact model versions in config.yaml

Pitfall 5: No comparison

  • Problem: Don't know if experiment improved things
  • Solution: Always compare with baseline or previous experiments
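Several of these pitfalls come down to not knowing exactly which inputs a run used. One way to make that checkable is to record a checksum for every file under data/inputs/ alongside the results. The sketch below uses SHA-256 from the standard library; the function names are hypothetical, and where the digests get stored (README, metrics.json, or a separate file) is left open.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_inputs(input_dir) -> dict:
    """Map each file under input_dir (relative path) to its SHA-256 digest."""
    root = Path(input_dir)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

If two experiments report different digests for the same file name, their metrics are not directly comparable, whatever the dataset version label says.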

What's Next

You now have systematic experiment tracking that ensures reproducibility and enables knowledge accumulation. Your experiments have standardized structures, comprehensive documentation, and clear lifecycle management.

In Part 4: Building Production-Ready AI Agents, we'll apply these same principles to agent development:

  • Agent templates and scaffolding
  • Tool integration patterns
  • Testing and evaluation frameworks
  • Production deployment readiness
  • Agent-specific monitoring

We'll create agents that are as well-structured and reproducible as your experiments.


Key Takeaways

  • Reproducibility requires comprehensive tracking of configuration, environment, and methodology
  • Standardized templates ensure nothing gets forgotten
  • Progress tracking captures real-time state for debugging
  • Experiment comparison reveals what works consistently
  • Lifecycle management keeps workspace organized
  • Integration with documentation creates complete knowledge capture

Resources

Templates:

  • Experiment README template (provided above)
  • Configuration template (config.yaml)
  • Progress tracking template (PROGRESS.md)

Scripts:

  • new_experiment.sh - Create experiments from template
  • compare_experiments.py - Compare experiment results
  • archive_old_experiments.sh - Lifecycle management

Tools:

  • Python logging module
  • YAML for configuration
  • JSON for metrics storage
  • Git for version control

Series Navigation

  • Previous: Part 2: Documentation Systems
  • Next: Part 4: Building Production-Ready AI Agents
  • Series Home: Building a Production ML Workspace on GPU Infrastructure

Questions or suggestions? Find me on Twitter @bioinfo or at rundatarun.io



About the Author: Justin Johnson builds AI systems and writes about practical AI development.
