
Building a Production ML Workspace: Part 2 - Documentation Systems That Scale

You've built a clean workspace structure. Your directories are organized, your experiments have homes, and your Ollama models are properly categorized. But there's a problem:

Three months from now, you won't remember why you ran that experiment.

You won't recall which hyperparameters worked, what insights you gained, or why Model A outperformed Model B. Without documentation, your past work becomes a black box—inaccessible and unreproducible.

This article shows you how to build a three-tier documentation system that captures ML work at different levels of detail, serves different audiences, and transforms your experiments into shareable knowledge.

About This Series

This is Part 2 of a 5-part series on building production ML workspaces. Previous parts:

  • Part 1: Workspace Structure

Coming next:

  • Part 3: Experiment Tracking
  • Part 4: Agent Templates
  • Part 5: Ollama Integration

The Documentation Problem

Most ML practitioners face these challenges:

The Short-Term Problem:

  • Which model did I use for that benchmark?
  • What dataset preprocessing steps did I apply?
  • Why did this experiment fail?
  • How do I reproduce these results?

The Long-Term Problem:

  • What have I learned this month?
  • Which approaches consistently work?
  • What patterns am I seeing across projects?
  • How do I share this knowledge?

The Communication Problem:

  • How do I explain my work to teammates?
  • How do I create blog posts from experiments?
  • How do I document insights without losing technical details?

A single level of documentation can't solve all of these. Execution logs are too detailed for review. High-level summaries lack debugging context. You need different documentation for different purposes.


The Three-Tier Documentation System

Our solution: Three complementary documentation levels, each serving specific needs:

Tier 1: EXECUTION LOGS
├─ Purpose: Technical debugging and troubleshooting
├─ Audience: You (future debugging)
├─ Detail Level: High (full stack traces, timing, errors)
└─ Location: (local path)

Tier 2: ACTIVITY LOG
├─ Purpose: Daily progress tracking and review
├─ Audience: You (weekly review, accountability)
├─ Detail Level: Medium (summaries, key decisions, outcomes)
└─ Location: (local path)

Tier 3: SESSION LOGS
├─ Purpose: Knowledge sharing and blog content
├─ Audience: Others (teammates, blog readers)
├─ Detail Level: Narrative (context, process, insights)
└─ Location: (local path) or blog drafts/

Each tier captures the same work from a different perspective. Let's build each one.


Tier 1: Execution Logs

Execution logs capture technical details for debugging. When something breaks at 2 AM, you'll be grateful these exist.

Log Structure

(local path)
├── experiments/        # Experiment execution logs
│   └── 20241019-143000-sentiment-analysis.log
├── training/          # Model training logs
│   └── 20241019-155500-llama-finetuning.log
└── agents/            # Agent execution logs
    └── 20241020-090000-customer-agent.log

What to Log

For experiments:

  • Timestamp of each step
  • Model and version used
  • Hyperparameters and configuration
  • Dataset information
  • Execution time for each step
  • Memory usage and GPU utilization
  • Errors and warnings (with full stack traces)
  • Output file locations

For training:

  • Training configuration
  • Epoch-by-epoch metrics
  • Loss curves
  • Validation metrics
  • Checkpoint locations
  • Hardware utilization
  • Training time

For agents:

  • Input/output for each interaction
  • Tool calls and responses
  • Decision reasoning
  • Latency per operation
  • Errors and retry logic
  • Token usage
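
Per-step timing and error capture, two of the items listed above, can be handled by a small context manager. This is a minimal sketch using only the standard library; the name `log_step` and the logger name `"experiment"` are hypothetical and not part of the workspace described in this series.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("experiment")


@contextmanager
def log_step(name, log=logger):
    """Log the start, duration, and any failure of one experiment step."""
    start = time.perf_counter()
    log.info("Step started: %s", name)
    try:
        yield
    except Exception:
        # exc_info=True attaches the full stack trace to the log record
        log.error("Step failed: %s (%.1fs)", name, time.perf_counter() - start, exc_info=True)
        raise
    else:
        log.info("Step complete: %s (%.1fs)", name, time.perf_counter() - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
    with log_step("load-dataset"):
        time.sleep(0.1)  # stand-in for real work
```

Wrapping each stage in `with log_step(...)` keeps timing consistent across experiments without scattering `time` calls through the code. The exception is re-raised, so a failed step still stops the run.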

Example Execution Log

2024-10-19 14:30:00 - INFO - Experiment: sentiment-analysis
2024-10-19 14:30:00 - INFO - Model: llama3.1:8b
2024-10-19 14:30:01 - INFO - Dataset: reviews_processed_20241019.csv (10000 samples)
2024-10-19 14:30:01 - INFO - Config: {temperature: 0.3, max_tokens: 100}
2024-10-19 14:30:01 - INFO - Starting inference...
2024-10-19 14:30:15 - INFO - Batch 1/100 complete (14.2s, 7.0 tokens/sec)
2024-10-19 14:30:28 - INFO - Batch 2/100 complete (13.1s, 7.6 tokens/sec)
...
2024-10-19 14:52:45 - INFO - Inference complete (22m 44s)
2024-10-19 14:52:45 - INFO - Results saved: results/20241019-sentiment-analysis/
2024-10-19 14:52:45 - INFO - Accuracy: 87.3%, F1: 0.852
2024-10-19 14:52:45 - INFO - Peak GPU memory: 6.2GB
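
Because every line in the example follows the same `timestamp - level - message` shape, the log can be read back into structured records. The sketch below is one way to do that with a regular expression; `parse_log_line` is a hypothetical helper, and the pattern assumes the exact layout shown above (with optional milliseconds, which Python's default `asctime` adds).

```python
import re
from datetime import datetime

# Matches "YYYY-MM-DD HH:MM:SS - LEVEL - message", with optional ",mmm" milliseconds.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?:,\d{3})? - (?P<level>[A-Z]+) - (?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict with timestamp, level, and message, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
        "level": match["level"],
        "message": match["message"],
    }


sample = [
    "2024-10-19 14:30:01 - INFO - Starting inference...",
    "...",  # elided lines are skipped because they do not match
    "2024-10-19 14:52:45 - INFO - Inference complete (22m 44s)",
]
records = [r for r in map(parse_log_line, sample) if r is not None]
elapsed = records[-1]["timestamp"] - records[0]["timestamp"]
print(elapsed)  # 0:22:44
```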

Python Logging Template

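import logging
from datetime import datetime
from pathlib import Path

experiment_name = "sentiment-analysis"  # set this per experiment

# Setup logging
log_dir = Path("logs/experiments")
log_dir.mkdir(parents=True, exist_ok=True)  # FileHandler raises if the directory is missing
log_file = log_dir / f"{datetime.now().strftime('%Y%m%d-%H%M%S')}-{experiment_name}.log"
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',  # matches the example log above (no milliseconds)
    handlers=[
        logging.FileHandler(log_file),
        logging.StreamHandler()  # Also print to console
    ]
)

logger = logging.getLogger(__name__)

# Usage in your code
# (model_name, dataset_name, dataset, and run_experiment() are placeholders for your own code)
logger.info(f"Experiment: {experiment_name}")
logger.info(f"Model: {model_name}")
logger.info(f"Dataset: {dataset_name} ({len(dataset)} samples)")

try:
    results = run_experiment()
    logger.info(f"Results: {results}")
except Exception:
    # logger.exception logs at ERROR level and appends the full stack trace
    logger.exception("Experiment failed")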

Best Practices

  • Structured logging: Use consistent formats for parsing
  • Timestamps: Every log entry needs a timestamp
  • Context: Log enough to reproduce the exact scenario
  • Errors: Always include full stack traces
  • Rotation: Archive old logs periodically (> 30 days)
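
The rotation step can be scripted rather than done by hand. The following is a rough sketch, assuming logs live under a `logs/` directory as in the structure above and that a `logs/archive/` directory is an acceptable destination; `archive_old_logs` and the archive location are assumptions, not something the workspace already provides.

```python
import gzip
import shutil
import time
from pathlib import Path


def archive_old_logs(log_root="logs", archive_root="logs/archive", max_age_days=30):
    """Gzip *.log files older than max_age_days into archive_root and remove the originals."""
    log_root, archive_root = Path(log_root), Path(archive_root)
    cutoff = time.time() - max_age_days * 86400
    archived = []
    for log_path in sorted(log_root.rglob("*.log")):
        if archive_root in log_path.parents:
            continue  # skip anything already under the archive directory
        if log_path.stat().st_mtime >= cutoff:
            continue  # modified recently enough to keep in place
        # Mirror the sub-directory layout (experiments/, training/, agents/) in the archive
        target = archive_root / log_path.relative_to(log_root).with_suffix(".log.gz")
        target.parent.mkdir(parents=True, exist_ok=True)
        with log_path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        log_path.unlink()
        archived.append(target)
    return archived
```

Age is judged by file modification time, so a log that is still being appended to is never archived. Compressing instead of deleting keeps old runs available for the rare case where you need them.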

Tier 2: Activity Log

The activity log is your daily work journal. It answers "What did I do today?" and creates accountability.

ACTIVITY.md Structure

Location: (local path)

# ML Workspace Activity Log

Quick daily summaries of ML development work. For detailed technical logs, see logs/ directory.

---

## 2024-10-19

### Models
- Pulled llama3.1:8b from Ollama
- Created custom model: llama3-sentiment:v1 (temperature: 0.3, system prompt optimized)
- Tested phi3:mini for code generation tasks

### Experiments
- Started: 20241019-143000-sentiment-analysis
  - Goal: Evaluate llama3.1 on product reviews
  - Status: Complete
  - Result: 87.3% accuracy, 0.852 F1 score
  - Insight: Lower temperature (0.3) significantly improved consistency

### Agents
- Built prototype customer-support-agent
  - Uses llama3-sentiment:v1
  - Integrated with knowledge base tool
  - Tested with 50 sample queries
  - Next: Add conversation history

### Datasets
- Processed: reviews_raw_20241019.csv → reviews_processed_20241019.csv
  - Cleaned 10,000 product reviews
  - Removed duplicates, normalized text
  - Added sentiment labels from manual annotation

### Code
- Created: scripts/utilities/preprocess_reviews.py
- Updated: templates/experiment/README.md with sentiment analysis template

### Learning
- Discovery: Llama3.1 at temp 0.3 gives more consistent sentiment predictions than 0.7
- Insight: Batch processing (100 samples/batch) optimal for GPU utilization
- Note: Fine-tuning might help with domain-specific language

### Next Steps
- [ ] Fine-tune llama3 on domain-specific reviews
- [ ] Expand customer-support-agent with multi-turn conversations
- [ ] Create comparison benchmark: llama3.1 vs phi3 vs mistral

---

## 2024-10-18

### Models
- Initial Ollama setup
- Pulled: llama3.1:8b, phi3:mini, mistral:7b

### Workspace
- Created complete workspace structure (30+ directories)
- Set up logging infrastructure
- Created experiment and agent templates

### Learning
- Ollama model management best practices
- Workspace organization patterns for ML

### Next Steps
- [ ] Run first sentiment analysis experiment
- [ ] Create custom Ollama model for specific domain

---

What to Include

Models:

  • Models pulled, created, or tested
  • Custom model configurations
  • Performance observations

Experiments:

  • Experiments started or completed
  • Key results and metrics
  • Important insights

Agents:

  • Agent development progress
  • Features added or tested
  • Integration work

Datasets:

  • Data processing done
  • Dataset creation or cleaning
  • Annotations added

Code:

  • Scripts created or updated
  • Tools built
  • Automation added

Learning:

  • Insights gained
  • Patterns discovered
  • Mistakes made (and learned from)

Next Steps:

  • Priorities for tomorrow/next session
  • Blockers to resolve
  • Ideas to explore

Benefits of Activity Logging

Accountability:

  • End each day reviewing what was done
  • Identify if time was well-spent
  • Maintain momentum

Pattern Recognition:

  • See which approaches work consistently
  • Identify time sinks
  • Track skill development

Blog Content:

  • Daily entries become session summaries
  • Insights turn into article topics
  • Natural documentation of your journey

External Memory:

  • Quickly recall "what did I do last week?"
  • Find experiments by date
  • Track project evolution
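
Looking up "what did I do on a given day" is easy to automate because each entry starts with a `## YYYY-MM-DD` heading and ends at a `---` rule. Here is a small sketch that splits the log on those markers; `entries_by_date` is a hypothetical helper and the sample text is a shortened copy of the example log above.

```python
import re


def entries_by_date(activity_text):
    """Split an activity log into {date: entry_text}, keyed by its '## YYYY-MM-DD' headings."""
    entries = {}
    current = None
    for line in activity_text.splitlines():
        heading = re.fullmatch(r"## (\d{4}-\d{2}-\d{2})\s*", line)
        if heading:
            current = heading.group(1)
            entries.setdefault(current, [])  # repeated headings for one day are merged
        elif line.strip() == "---":
            current = None  # a horizontal rule closes the current day's entry
        elif current is not None:
            entries[current].append(line)
    return {day: "\n".join(lines).strip() for day, lines in entries.items()}


sample = """# ML Workspace Activity Log

---

## 2024-10-19

### Experiments
- Started: 20241019-143000-sentiment-analysis

---

## 2024-10-18

### Models
- Pulled: llama3.1:8b, phi3:mini, mistral:7b

---
"""

log = entries_by_date(sample)
print(log["2024-10-18"])
```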

Tier 3: Session Logs

Session logs are narrative documentation for sharing knowledge. They're the bridge between your work and blog posts.

When to Create Session Logs

Not every day needs a session log. Create them when:

  • You complete a significant experiment
  • You build something worth sharing
  • You learn something important
  • You solve a challenging problem
  • You have insights worth documenting

Session Log Structure

Location: (local path) or blog drafts/

# Session: [Descriptive Title]

**Date:** YYYY-MM-DD
**Session Type:** [Setup/Experiment/Development/Optimization]
**Duration:** ~X hours
**Objective:** [What you set out to accomplish]

---

## Session Overview

[High-level summary of what happened and why it matters]

**Key Achievement:** [Most important outcome in one sentence]

---

## Tasks Completed

### 1. [Task Name] ✓
**Task:** [What needed to be done]

**Process:**
- Step 1
- Step 2
- Step 3

**Execution:**
```bash
# Commands or code used
```

**Details:**
- Important details
- Decisions made
- Challenges encountered

**Outcome:** [What happened]

### 2. [Next Task] ✓

[Same structure]

---

## Key Decisions & Rationale

### Decision 1: [Decision Name]

**Decision:** [What you decided]

**Rationale:**
- Reason 1
- Reason 2

**Alternative Considered:** [What else you thought about]
**Rejected Because:** [Why you didn't choose it]

---

## Technical Insights

### Insight 1: [Insight Name]

**Discovery:** [What you learned]

**Impact:**
- How this affects your work
- Why this matters

**Evidence:** [Data or observations]

---

## Results & Metrics

[Quantitative results if applicable]

- Metric 1: Value
- Metric 2: Value
- Performance: Details

## Lessons Learned

1. **Lesson 1:** [What you learned]
   - Why this matters
   - How to apply it
2. **Lesson 2:** [Another learning]

---

## Next Steps & Recommendations

**Immediate:**
1. Action item 1
2. Action item 2

**Future:**
- Future direction 1
- Future direction 2

## Blog Post Ideas

### Post 1: "[Title]"
- Angle: [How to approach the topic]
- Target Audience: [Who this helps]
- Key Takeaways: [Main points]

## Resources Referenced

- Link 1
- Link 2
- Documentation reference

## Summary

[Wrap up with key achievements, success metrics, and what made this session valuable]


Example: From Activity Log to Session Log

Activity Log Entry (Tier 2):

## 2024-10-19

### Experiments
- Completed: 20241019-143000-sentiment-analysis
  - Result: 87.3% accuracy, 0.852 F1 score
  - Insight: Lower temperature improved consistency

Session Log (Tier 3):

# Session: Sentiment Analysis with Llama 3.1

**Key Achievement:** Built production-ready sentiment analysis pipeline achieving 87.3% accuracy with optimized inference.

[Full narrative with context, process, insights, and lessons learned]

**Blog Post Ideas:**
- "Building a Production Sentiment Analyzer with Llama 3.1"
- "Temperature Tuning for Consistent LLM Predictions"

Session Log Benefits

Knowledge Sharing:

  • Teammates learn from your process
  • Context for future you
  • Blog-ready content

Reproducibility:

  • Detailed enough to recreate
  • Includes rationale for decisions
  • Documents what worked and why

Portfolio Building:

  • Demonstrates skills and thinking
  • Shows problem-solving approach
  • Creates shareable artifacts

Integration: How the Three Tiers Work Together

Here's how a real ML project flows through all three tiers:

Example: Training a Custom Model

Tier 1 (Execution Log):

2024-10-19 15:00:00 - INFO - Starting model training
2024-10-19 15:00:01 - INFO - Base model: llama3.1:8b
2024-10-19 15:00:01 - INFO - Dataset: reviews_processed_20241019.csv (10000 samples)
2024-10-19 15:00:01 - INFO - Config: {epochs: 3, batch_size: 32, learning_rate: 0.0001}
2024-10-19 15:00:01 - INFO - Epoch 1/3 starting...
2024-10-19 15:15:23 - INFO - Epoch 1/3 complete - Loss: 0.452, Val Loss: 0.389
2024-10-19 15:15:23 - INFO - Checkpoint saved: checkpoints/epoch_1.pt
...

Tier 2 (Activity Log):

### Experiments
- Completed: Fine-tuning llama3.1 on product reviews
  - 3 epochs, final loss: 0.312
  - Validation accuracy improved from 82% → 91%
  - Created: llama3-reviews:v1
  - Next: Deploy to production agent

Tier 3 (Session Log):

# Session: Fine-Tuning Llama 3.1 for Domain-Specific Sentiment Analysis

**Key Achievement:** Fine-tuned model improved accuracy by 9 percentage points (82% → 91%) on domain-specific reviews.

## Process
[Detailed narrative of preparation, training, evaluation]

## Key Insights
1. Domain-specific fine-tuning significantly improves accuracy
2. 3 epochs optimal (4+ epochs showed overfitting)
3. Lower learning rate (0.0001) more stable than default

## Blog Post Ideas
- "Fine-Tuning Llama 3.1: A Practical Guide"
- "When to Fine-Tune vs. Prompt Engineering"

Result: Same work, three perspectives:

  • Tier 1: Technical debugging reference
  • Tier 2: Quick daily summary
  • Tier 3: Shareable knowledge

Documentation Automation

Make documentation easy with these automation strategies:

1. Logging Boilerplate

# scripts/utilities/setup_logging.py
import logging
from datetime import datetime
from pathlib import Path

def setup_experiment_logging(experiment_name, log_dir="logs/experiments"):
    """Set up logging for an experiment"""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    log_file = f"{log_dir}/{timestamp}-{experiment_name}.log"

    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(log_file),
            logging.StreamHandler()
        ],
        force=True  # Python 3.8+: replace handlers left by an earlier call in this process
    )
    return logging.getLogger(__name__), log_file

# Usage
logger, log_file = setup_experiment_logging("sentiment-analysis")
logger.info("Experiment started")

2. Activity Log Helper

# Add to (local path) or create script
alias log-activity='code (local path)'
alias today='date +%Y-%m-%d'

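# Quick entry function
# printf is used because bash's echo does not expand \n unless called with -e
function ml-log() {
    printf '\n## %s\n\n' "$(date +%Y-%m-%d)" >> (local path)
    printf '### %s\n- %s\n\n' "$1" "$2" >> (local path)
}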

# Usage
ml-log "Experiments" "Completed sentiment analysis: 87.3% accuracy"
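
The shell function writes a new date heading on every call, so several entries on the same day produce repeated headings. A Python variant can check for today's heading first. This is only a sketch: `log_activity` is a hypothetical name, the default path `ACTIVITY.md` is an assumption about where your log lives, and like the shell version it appends to the end of the file.

```python
from datetime import date
from pathlib import Path


def log_activity(section, note, path="ACTIVITY.md", today=None):
    """Append '- note' under '### section', adding today's '## date' heading only once."""
    path = Path(path)
    today = today or date.today().isoformat()
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    lines = []
    if f"## {today}" not in text.splitlines():
        lines += ["", f"## {today}", ""]  # first entry of the day gets the date heading
    lines += [f"### {section}", f"- {note}", ""]
    with path.open("a", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    log_activity("Experiments", "Completed sentiment analysis: 87.3% accuracy")
```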

3. Session Template Generator

#!/bin/bash
# scripts/utilities/new_session.sh

SESSION_NAME=$1
SESSION_TYPE=$2
DATE=$(date +%Y-%m-%d)
FILENAME="docs/sessions/${DATE}-${SESSION_NAME}.md"

mkdir -p "$(dirname "$FILENAME")"  # create docs/sessions/ on first use

cat > "$FILENAME" << EOF
# Session: $SESSION_NAME

**Date:** $DATE
**Session Type:** $SESSION_TYPE
**Duration:** ~X hours
**Objective:** [Fill in objective]

---

## Session Overview

[High-level summary]

**Key Achievement:** [Most important outcome]

---

## Tasks Completed

### 1. Task Name ✓

...

EOF

echo "Created session log: $FILENAME"
code "$FILENAME"

Usage:

./scripts/utilities/new_session.sh "sentiment-analysis" "Experiment"

Best Practices

For All Tiers

Be Consistent:

  • Log daily (Tier 2)
  • Use templates (Tier 3)
  • Automate logging setup (Tier 1)

Be Honest:

  • Document failures and learnings
  • Record what didn't work
  • Note mistakes and fixes

Be Actionable:

  • Include next steps
  • Note blockers
  • Record questions

Tier-Specific Tips

Execution Logs (Tier 1):

  • Log everything (disk is cheap)
  • Include full stack traces
  • Timestamp every entry
  • Archive after 30 days

Activity Log (Tier 2):

  • Write end-of-day
  • Keep entries concise (5-10 lines per section)
  • Use checkboxes for next steps
  • Review weekly

Session Logs (Tier 3):

  • Write within 24 hours (while fresh)
  • Include enough context for others
  • Focus on "why" not just "what"
  • Tag potential blog topics
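
For the weekly review of the activity log, the checkbox convention makes it possible to list every next step that is still open. The sketch below assumes the `## YYYY-MM-DD` headings and `- [ ]` items used in the example log; `open_next_steps` is a hypothetical helper, and the sample text is illustrative (one item is marked done to show the filtering).

```python
import re

DATE_HEADING_RE = re.compile(r"## (\d{4}-\d{2}-\d{2})\s*")
OPEN_BOX_RE = re.compile(r"\s*- \[ \] (.+)")


def open_next_steps(activity_text):
    """Return (date, task) pairs for every unchecked '- [ ]' item, in file order."""
    pending = []
    current_date = None
    for line in activity_text.splitlines():
        heading = DATE_HEADING_RE.fullmatch(line)
        if heading:
            current_date = heading.group(1)
            continue
        box = OPEN_BOX_RE.fullmatch(line)
        if box:
            pending.append((current_date, box.group(1).strip()))
    return pending


sample = """## 2024-10-19

### Next Steps
- [ ] Fine-tune llama3 on domain-specific reviews
- [x] Create comparison benchmark

## 2024-10-18

### Next Steps
- [ ] Create custom Ollama model for specific domain
"""

for day, task in open_next_steps(sample):
    print(f"{day}: {task}")
```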

Common Pitfalls

Pitfall 1: Too much documentation

  • Problem: Spending more time documenting than doing
  • Solution: Start with Tier 2 only, add others as needed

Pitfall 2: Inconsistent logging

  • Problem: Logging for a week, then stopping
  • Solution: Make it easy (templates, aliases) and make it a habit

Pitfall 3: Wrong detail level

  • Problem: Technical details in activity log, summaries in execution logs
  • Solution: Remember tier purposes: debug, review, share

Pitfall 4: No connection between tiers

  • Problem: Logs mention "experiment X" but no link to execution logs
  • Solution: Reference log files in activity log, cross-link freely
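
One way to catch broken links between tiers is to check that every experiment ID mentioned in the activity log has a matching execution log. The sketch below assumes IDs follow the `YYYYMMDD-HHMMSS-name` pattern used for log file names in this article and that logs sit under a `logs/` directory; `missing_execution_logs` is a hypothetical helper.

```python
import re
from pathlib import Path

# Experiment IDs follow the YYYYMMDD-HHMMSS-name pattern used for log file names.
EXPERIMENT_ID_RE = re.compile(r"\b\d{8}-\d{6}-[a-z0-9][a-z0-9-]*")


def missing_execution_logs(activity_text, log_root="logs"):
    """Return experiment IDs mentioned in the activity log that have no matching .log file."""
    available = {p.stem for p in Path(log_root).rglob("*.log")}
    mentioned = sorted(set(EXPERIMENT_ID_RE.findall(activity_text)))
    return [exp_id for exp_id in mentioned if exp_id not in available]
```

Running this during the weekly review gives a short list of entries whose Tier 1 evidence is missing or was never written.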

What's Next

You now have a three-tier documentation system that captures ML work at multiple levels. But documentation is only part of reproducibility—you also need systematic experiment tracking.

In Part 3: Experiment Tracking and Reproducibility, we'll build:

  • Complete experiment templates with comprehensive READMEs
  • Progress tracking systems
  • Success criteria frameworks
  • Results comparison tools
  • Lifecycle management automation

We'll ensure every experiment is reproducible, comparable, and builds toward knowledge accumulation.


Key Takeaways

  • Three tiers serve different purposes: debugging, review, sharing
  • Execution logs capture technical details for troubleshooting
  • Activity logs track daily progress and build accountability
  • Session logs transform work into shareable knowledge
  • Automation makes documentation sustainable
  • Consistency matters more than perfection

Resources

Templates:

  • Python logging setup (provided above)
  • Activity log template (provided above)
  • Session log template (provided above)
  • Bash automation scripts (provided above)

Tools:

  • Python logging module
  • Bash aliases and functions
  • Code editor integration

Series Navigation

  • Previous: Part 1: Workspace Structure
  • Next: Part 3: Experiment Tracking (coming soon)
  • Series Home: Building a Production ML Workspace on GPU Infrastructure

Questions or suggestions? Find me on Twitter @bioinfo or at rundatarun.io


About the Author: Justin Johnson builds AI systems and writes about practical AI development.
