Practical Applications · October 19, 2025 · 7 min read · shipped

Building a Production ML Workspace: Part 2 - Documentation Systems That Scale

You've built a clean workspace structure. Your directories are organized, your experiments have homes, and your Ollama models are properly categorized. But there's a problem:

Three months from now, you won't remember why you ran that experiment.

You won't recall which hyperparameters worked, what insights you gained, or why Model A outperformed Model B. Without documentation, your past work becomes a black box—inaccessible and unreproducible.

This article shows you how to build a three-tier documentation system that captures ML work at different levels of detail, serves different audiences, and transforms your experiments into shareable knowledge.

About This Series

This is Part 2 of a 5-part series on building production ML workspaces. Previous parts:

Part 1: Workspace Structure

Coming next:

Part 3: Experiment Tracking
Part 4: Agent Templates
Part 5: Ollama Integration

The Documentation Problem

Most ML practitioners face these challenges:

The Short-Term Problem:

Which model did I use for that benchmark?
What dataset preprocessing steps did I apply?
Why did this experiment fail?
How do I reproduce these results?

The Long-Term Problem:

What have I learned this month?
Which approaches consistently work?
What patterns am I seeing across projects?
How do I share this knowledge?

The Communication Problem:

How do I explain my work to teammates?
How do I create blog posts from experiments?
How do I document insights without losing technical details?

A single level of documentation can't solve all of these. Execution logs are too detailed for review. High-level summaries lack debugging context. You need different documentation for different purposes.

The Three-Tier Documentation System

Our solution: Three complementary documentation levels, each serving specific needs:

Tier 1: EXECUTION LOGS
├─ Purpose: Technical debugging and troubleshooting
├─ Audience: You (future debugging)
├─ Detail Level: High (full stack traces, timing, errors)
└─ Location: (local path)

Tier 2: ACTIVITY LOG
├─ Purpose: Daily progress tracking and review
├─ Audience: You (weekly review, accountability)
├─ Detail Level: Medium (summaries, key decisions, outcomes)
└─ Location: (local path)

Tier 3: SESSION LOGS
├─ Purpose: Knowledge sharing and blog content
├─ Audience: Others (teammates, blog readers)
├─ Detail Level: Narrative (context, process, insights)
└─ Location: (local path) or blog drafts/

Each tier captures the same work from a different perspective. Let's build each one.

Tier 1: Execution Logs

Execution logs capture technical details for debugging. When something breaks at 2 AM, you'll be grateful these exist.

Log Structure

(local path)
├── experiments/        # Experiment execution logs
│   └── 20241019-143000-sentiment-analysis.log
├── training/          # Model training logs
│   └── 20241019-155500-llama-finetuning.log
└── agents/            # Agent execution logs
    └── 20241020-090000-customer-agent.log

What to Log

For experiments:

Timestamp of each step
Model and version used
Hyperparameters and configuration
Dataset information
Execution time for each step
Memory usage and GPU utilization
Errors and warnings (with full stack traces)
Output file locations

For training:

Training configuration
Epoch-by-epoch metrics
Loss curves
Validation metrics
Checkpoint locations
Hardware utilization
Training time

For agents:

Input/output for each interaction
Tool calls and responses
Decision reasoning
Latency per operation
Errors and retry logic
Token usage
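The per-step timing and error items in the lists above can be captured with a small context manager. This is a minimal sketch using only the standard library; `timed_step` and the logger name `experiment` are hypothetical names chosen for illustration, not part of the original templates.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("experiment")  # hypothetical logger name


@contextmanager
def timed_step(name, log=logger):
    """Log the start, duration, and any failure of one named step."""
    start = time.perf_counter()
    log.info("Step started: %s", name)
    try:
        yield
    except Exception:
        # log.exception records the full stack trace at ERROR level.
        log.exception("Step failed: %s (%.2fs)", name, time.perf_counter() - start)
        raise
    else:
        log.info("Step complete: %s (%.2fs)", name, time.perf_counter() - start)


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
with timed_step("load-dataset"):
    rows = list(range(10_000))  # stand-in for real work
```

Wrapping each stage (loading, inference, evaluation) this way gives every step a start line, a duration, and a stack trace on failure without repeating timing code.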

Example Execution Log

2024-10-19 14:30:00 - INFO - Experiment: sentiment-analysis
2024-10-19 14:30:00 - INFO - Model: llama3.1:8b
2024-10-19 14:30:01 - INFO - Dataset: reviews_processed_20241019.csv (10000 samples)
2024-10-19 14:30:01 - INFO - Config: {temperature: 0.3, max_tokens: 100}
2024-10-19 14:30:01 - INFO - Starting inference...
2024-10-19 14:30:15 - INFO - Batch 1/100 complete (14.2s, 7.0 samples/sec)
2024-10-19 14:30:28 - INFO - Batch 2/100 complete (13.1s, 7.6 samples/sec)
...
2024-10-19 14:52:45 - INFO - Inference complete (22m 44s)
2024-10-19 14:52:45 - INFO - Results saved: results/20241019-sentiment-analysis/
2024-10-19 14:52:45 - INFO - Accuracy: 87.3%, F1: 0.852
2024-10-19 14:52:45 - INFO - Peak GPU memory: 6.2GB

Python Logging Template

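import logging
from datetime import datetime
from pathlib import Path

# Placeholder values: replace with your own experiment details
experiment_name = "sentiment-analysis"
model_name = "llama3.1:8b"

# Setup logging (create the log directory first, or FileHandler will fail)
log_dir = Path("logs/experiments")
log_dir.mkdir(parents=True, exist_ok=True)
log_file = log_dir / f"{datetime.now().strftime('%Y%m%d-%H%M%S')}-{experiment_name}.log"
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file),
        logging.StreamHandler()  # Also print to console
    ]
)

logger = logging.getLogger(__name__)

# Usage in your code
# (dataset_name, dataset, and run_experiment() come from your own project)
logger.info(f"Experiment: {experiment_name}")
logger.info(f"Model: {model_name}")
logger.info(f"Dataset: {dataset_name} ({len(dataset)} samples)")

try:
    results = run_experiment()
    logger.info(f"Results: {results}")
except Exception:
    # logger.exception logs at ERROR level and appends the full stack trace
    logger.exception("Experiment failed")
    raise  # re-raise so the failure is not silently swallowed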

Best Practices

Structured logging: Use consistent formats for parsing
Timestamps: Every log entry needs a timestamp
Context: Log enough to reproduce the exact scenario
Errors: Always include full stack traces
Rotation: Archive old logs periodically (> 30 days)
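One way to get the consistent, parseable format recommended above is to emit each entry as a JSON object on its own line. The sketch below is an illustration built on the standard `logging` module; the `JsonFormatter` class and the `fields` key passed through `extra` are conventions assumed for this example, not something the article's templates define.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed as extra={"fields": {...}} are merged into the entry.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["traceback"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
json_logger = logging.getLogger("experiment.json")  # hypothetical logger name
json_logger.setLevel(logging.INFO)
json_logger.addHandler(handler)

json_logger.info(
    "Inference complete",
    extra={"fields": {"accuracy": 0.873, "f1": 0.852}},
)
```

Because every line is valid JSON, later analysis can load the log with `json.loads` line by line instead of relying on regular expressions.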

Tier 2: Activity Log

The activity log is your daily work journal. It answers "What did I do today?" and creates accountability.

ACTIVITY.md Structure

Location: (local path)

# ML Workspace Activity Log

Quick daily summaries of ML development work. For detailed technical logs, see logs/ directory.

---

## 2024-10-19

### Models
- Pulled llama3.1:8b from Ollama
- Created custom model: llama3-sentiment:v1 (temperature: 0.3, system prompt optimized)
- Tested phi3:mini for code generation tasks

### Experiments
- Started: 20241019-143000-sentiment-analysis
  - Goal: Evaluate llama3.1 on product reviews
  - Status: Complete
  - Result: 87.3% accuracy, 0.852 F1 score
  - Insight: Lower temperature (0.3) significantly improved consistency

### Agents
- Built prototype customer-support-agent
  - Uses llama3-sentiment:v1
  - Integrated with knowledge base tool
  - Tested with 50 sample queries
  - Next: Add conversation history

### Datasets
- Processed: reviews_raw_20241019.csv → reviews_processed_20241019.csv
  - Cleaned 10,000 product reviews
  - Removed duplicates, normalized text
  - Added sentiment labels from manual annotation

### Code
- Created: scripts/utilities/preprocess_reviews.py
- Updated: templates/experiment/README.md with sentiment analysis template

### Learning
- Discovery: Llama3.1 at temp 0.3 gives more consistent sentiment predictions than 0.7
- Insight: Batch processing (100 samples/batch) optimal for GPU utilization
- Note: Fine-tuning might help with domain-specific language

### Next Steps
- [ ] Fine-tune llama3 on domain-specific reviews
- [ ] Expand customer-support-agent with multi-turn conversations
- [ ] Create comparison benchmark: llama3.1 vs phi3 vs mistral

---

## 2024-10-18

### Models
- Initial Ollama setup
- Pulled: llama3.1:8b, phi3:mini, mistral:7b

### Workspace
- Created complete workspace structure (30+ directories)
- Set up logging infrastructure
- Created experiment and agent templates

### Learning
- Ollama model management best practices
- Workspace organization patterns for ML

### Next Steps
- [ ] Run first sentiment analysis experiment
- [ ] Create custom Ollama model for specific domain

---

What to Include

Models:

Models pulled, created, or tested
Custom model configurations
Performance observations

Experiments:

Experiments started or completed
Key results and metrics
Important insights

Agents:

Agent development progress
Features added or tested
Integration work

Datasets:

Data processing done
Dataset creation or cleaning
Annotations added

Code:

Scripts created or updated
Tools built
Automation added

Learning:

Insights gained
Patterns discovered
Mistakes made (and learned from)

Next Steps:

Priorities for tomorrow/next session
Blockers to resolve
Ideas to explore

Benefits of Activity Logging

Accountability:

End each day reviewing what was done
Identify if time was well-spent
Maintain momentum

Pattern Recognition:

See which approaches work consistently
Identify time sinks
Track skill development

Blog Content:

Daily entries become session summaries
Insights turn into article topics
Natural documentation of journey

External Memory:

Quickly recall "what did I do last week?"
Find experiments by date
Track project evolution
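Because the activity log follows a fixed heading pattern (`## YYYY-MM-DD` for days, `### Section` for categories), it can be queried with a few lines of code. The following is a rough sketch under that assumption; `parse_activity_log` and `open_next_steps` are hypothetical helper names.

```python
import re
from collections import defaultdict

DATE_HEADING = re.compile(r"^## (\d{4}-\d{2}-\d{2})\s*$")
SECTION_HEADING = re.compile(r"^### (.+?)\s*$")


def parse_activity_log(text):
    """Return {date: {section: [lines]}} for an ACTIVITY.md-style document."""
    entries = defaultdict(lambda: defaultdict(list))
    date = section = None
    for line in text.splitlines():
        if m := DATE_HEADING.match(line):
            date, section = m.group(1), None
        elif m := SECTION_HEADING.match(line):
            section = m.group(1)
        elif date and section and line.strip() and line.strip() != "---":
            entries[date][section].append(line.rstrip())
    return {d: dict(s) for d, s in entries.items()}


def open_next_steps(entries):
    """Collect unchecked '- [ ]' items across all dates, oldest first."""
    prefix = "- [ ] "
    return [
        (date, line.strip()[len(prefix):])
        for date in sorted(entries)
        for line in entries[date].get("Next Steps", [])
        if line.strip().startswith(prefix)
    ]


sample = """\
## 2024-10-19

### Experiments
- Started: 20241019-143000-sentiment-analysis
  - Status: Complete

### Next Steps
- [ ] Fine-tune llama3 on domain-specific reviews

---

## 2024-10-18

### Next Steps
- [ ] Run first sentiment analysis experiment
"""

log = parse_activity_log(sample)
print(log["2024-10-19"]["Experiments"])
print(open_next_steps(log))
```

A weekly review can start from `open_next_steps`, which lists every unchecked item together with the day it was written.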

Tier 3: Session Logs

Session logs are narrative documentation for sharing knowledge. They're the bridge between your work and blog posts.

When to Create Session Logs

Not every day needs a session log. Create them when:

You complete a significant experiment
You build something worth sharing
You learn something important
You solve a challenging problem
You have insights worth documenting

Session Log Structure

Location: (local path) or blog drafts/

# Session: [Descriptive Title]

**Date:** YYYY-MM-DD
**Session Type:** [Setup/Experiment/Development/Optimization]
**Duration:** ~X hours
**Objective:** [What you set out to accomplish]

---

## Session Overview

[High-level summary of what happened and why it matters]

**Key Achievement:** [Most important outcome in one sentence]

---

## Tasks Completed

### 1. [Task Name] ✓
**Task:** [What needed to be done]

**Process:**
- Step 1
- Step 2
- Step 3

**Execution:**
```bash
# Commands or code used
```

**Details:**
- Important details
- Decisions made
- Challenges encountered

**Outcome:** [What happened]

### 2. [Next Task] ✓

[Same structure]

## Key Decisions & Rationale

### Decision 1: [Decision Name]

**Decision:** [What you decided]

**Rationale:**
- Reason 1
- Reason 2

**Alternative Considered:** [What else you thought about]
**Rejected Because:** [Why you didn't choose it]

## Technical Insights

### Insight 1: [Insight Name]

**Discovery:** [What you learned]

**Impact:**
- How this affects your work
- Why this matters

**Evidence:** [Data or observations]

## Results & Metrics

[Quantitative results if applicable]

- Metric 1: Value
- Metric 2: Value
- Performance: Details

## Lessons Learned

- Lesson 1: [What you learned]
  - Why this matters
  - How to apply it
- Lesson 2: [Another learning]

## Next Steps & Recommendations

**Immediate:**
- Action item 1
- Action item 2

**Future:**
- Future direction 1
- Future direction 2

## Blog Post Ideas

**Post 1: "[Title]"**
- Angle: [How to approach the topic]
- Target Audience: [Who this helps]
- Key Takeaways: [Main points]

## Resources Referenced

- Link 1
- Link 2
- Documentation reference

## Summary

[Wrap up with key achievements, success metrics, and what made this session valuable]


Example: From Activity Log to Session Log

Activity Log Entry (Tier 2):

## 2024-10-19

### Experiments
- Completed: 20241019-143000-sentiment-analysis
  - Result: 87.3% accuracy, 0.852 F1 score
  - Insight: Lower temperature improved consistency

Session Log (Tier 3):

# Session: Sentiment Analysis with Llama 3.1

**Key Achievement:** Built production-ready sentiment analysis pipeline achieving 87.3% accuracy with optimized inference.

[Full narrative with context, process, insights, and lessons learned]

**Blog Post Ideas:**
- "Building a Production Sentiment Analyzer with Llama 3.1"
- "Temperature Tuning for Consistent LLM Predictions"

Session Log Benefits

Knowledge Sharing:

Teammates learn from your process
Context for future you
Blog-ready content

Reproducibility:

Detailed enough to recreate
Includes rationale for decisions
Documents what worked and why

Portfolio Building:

Demonstrates skills and thinking
Shows problem-solving approach
Creates shareable artifacts

Integration: How the Three Tiers Work Together

Here's how a real ML project flows through all three tiers:

Example: Training a Custom Model

Tier 1 (Execution Log):

2024-10-19 15:00:00 - INFO - Starting model training
2024-10-19 15:00:01 - INFO - Base model: llama3.1:8b
2024-10-19 15:00:01 - INFO - Dataset: reviews_processed_20241019.csv (10000 samples)
2024-10-19 15:00:01 - INFO - Config: {epochs: 3, batch_size: 32, learning_rate: 0.0001}
2024-10-19 15:00:01 - INFO - Epoch 1/3 starting...
2024-10-19 15:15:23 - INFO - Epoch 1/3 complete - Loss: 0.452, Val Loss: 0.389
2024-10-19 15:15:23 - INFO - Checkpoint saved: checkpoints/epoch_1.pt
...

Tier 2 (Activity Log):

### Experiments
- Completed: Fine-tuning llama3.1 on product reviews
  - 3 epochs, final loss: 0.312
  - Validation accuracy improved from 82% → 91%
  - Created: llama3-reviews:v1
  - Next: Deploy to production agent

Tier 3 (Session Log):

# Session: Fine-Tuning Llama 3.1 for Domain-Specific Sentiment Analysis

**Key Achievement:** Fine-tuned model improved accuracy by 9 percentage points (82% → 91%) on domain-specific reviews.

## Process
[Detailed narrative of preparation, training, evaluation]

## Key Insights
1. Domain-specific fine-tuning significantly improves accuracy
2. 3 epochs optimal (4+ epochs showed overfitting)
3. Lower learning rate (0.0001) more stable than default

## Blog Post Ideas
- "Fine-Tuning Llama 3.1: A Practical Guide"
- "When to Fine-Tune vs. Prompt Engineering"

Result: Same work, three perspectives:

Tier 1: Technical debugging reference
Tier 2: Quick daily summary
Tier 3: Shareable knowledge
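Since Tier 2 summarizes what Tier 1 already recorded, a first draft of the activity entry can be derived from the execution log. The sketch below assumes the `timestamp - LEVEL - message` format from the logging template and `Key: value` style messages like those in the example log; `summarize_execution_log` and the choice of keys are illustrative assumptions.

```python
import re

# Matches "timestamp - LEVEL - message"; the default %(asctime)s format adds
# milliseconds (",123"), so that part is optional.
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:,\d{3})? - (?P<level>[A-Z]+) - (?P<msg>.*)$"
)
# Message keys worth carrying into the daily summary (an illustrative choice).
SUMMARY_KEYS = ("Experiment", "Model", "Accuracy", "Results saved")


def summarize_execution_log(lines):
    """Build a draft activity-log bullet from execution log lines."""
    fields, problems = {}, 0
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip "..." and continuation lines such as tracebacks
        if match["level"] in ("WARNING", "ERROR", "CRITICAL"):
            problems += 1
        key, sep, value = match["msg"].partition(": ")
        if sep and key in SUMMARY_KEYS:
            fields[key] = value

    out = [f"- Completed: {fields.get('Experiment', 'unknown experiment')}"]
    if "Model" in fields:
        out.append(f"  - Model: {fields['Model']}")
    if "Accuracy" in fields:
        out.append(f"  - Result: Accuracy {fields['Accuracy']}")
    if "Results saved" in fields:
        out.append(f"  - Output: {fields['Results saved']}")
    out.append(f"  - Warnings/errors logged: {problems}")
    return "\n".join(out)


example_log = [
    "2024-10-19 14:30:00 - INFO - Experiment: sentiment-analysis",
    "2024-10-19 14:30:00 - INFO - Model: llama3.1:8b",
    "...",
    "2024-10-19 14:52:45 - INFO - Results saved: results/20241019-sentiment-analysis/",
    "2024-10-19 14:52:45 - INFO - Accuracy: 87.3%, F1: 0.852",
]
print(summarize_execution_log(example_log))
```

The output is only a starting point: the insight and next-step lines in the activity log still need to be written by hand.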

Documentation Automation

Make documentation easy with these automation strategies:

1. Logging Boilerplate

# scripts/utilities/setup_logging.py
import logging
from datetime import datetime
from pathlib import Path

def setup_experiment_logging(experiment_name, log_dir="logs/experiments"):
    """Set up logging for an experiment"""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    log_file = f"{log_dir}/{timestamp}-{experiment_name}.log"

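    # force=True (Python 3.8+) replaces handlers left over from an earlier call,
    # so a second experiment in the same process gets its own log file
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(log_file),
            logging.StreamHandler()
        ],
        force=True
    )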
    return logging.getLogger(__name__), log_file

# Usage
logger, log_file = setup_experiment_logging("sentiment-analysis")
logger.info("Experiment started")

2. Activity Log Helper

# Add to (local path) or create script
alias log-activity='code (local path)'
alias today='date +%Y-%m-%d'

# Quick entry function
function ml-log() {
    # printf interprets \n in both bash and zsh; plain echo does not in bash
    printf '\n## %s\n\n' "$(date +%Y-%m-%d)" >> (local path)
    printf '### %s\n- %s\n\n' "$1" "$2" >> (local path)
}

# Usage
ml-log "Experiments" "Completed sentiment analysis: 87.3% accuracy"
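The shell function above writes a new date heading on every call. If you would rather file each entry under an existing day and section, the same idea can be sketched in Python. This is an illustrative alternative, not the article's helper: `add_activity_entry` is a hypothetical name, it works on the log text rather than a file, and it appends new days at the end of the log (as the shell function does) instead of keeping the newest day first.

```python
from datetime import date


def _render(lines):
    """Join lines and end the text with exactly one newline."""
    return "\n".join(lines).rstrip("\n") + "\n"


def add_activity_entry(text, section, entry, day=None):
    """Return the log text with '- entry' filed under '## day' / '### section'."""
    day = day or date.today().isoformat()
    lines = text.splitlines()
    bullet = f"- {entry}"
    if f"## {day}" not in lines:
        # No heading for this day yet: append a fresh block at the end.
        return _render(lines + ["", f"## {day}", "", f"### {section}", bullet])

    start = lines.index(f"## {day}")
    # The day's block runs until the next date heading, a '---' rule, or EOF.
    end = next(
        (i for i in range(start + 1, len(lines))
         if lines[i].startswith("## ") or lines[i].strip() == "---"),
        len(lines),
    )
    heading = f"### {section}"
    if heading not in lines[start:end]:
        # Section missing for this day: add it at the end of the day's block.
        while lines[end - 1].strip() == "":
            end -= 1
        lines[end:end] = ["", heading, bullet]
        return _render(lines)

    sec = lines.index(heading, start, end)
    # Insert after the last non-blank line of the existing section.
    insert_at = next(
        (i for i in range(sec + 1, end) if lines[i].startswith("### ")), end
    )
    while lines[insert_at - 1].strip() == "":
        insert_at -= 1
    lines.insert(insert_at, bullet)
    return _render(lines)


log_text = "# ML Workspace Activity Log\n\n## 2024-10-19\n\n### Experiments\n- Started run A\n"
print(add_activity_entry(log_text, "Experiments", "Completed run A", day="2024-10-19"))
```

To use it on a real file, read the log with `Path.read_text()`, pass the text through `add_activity_entry`, and write the result back.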

3. Session Template Generator

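#!/bin/bash
# scripts/utilities/new_session.sh
set -euo pipefail

SESSION_NAME="${1:?usage: new_session.sh <session-name> <session-type>}"
SESSION_TYPE="${2:?usage: new_session.sh <session-name> <session-type>}"
DATE=$(date +%Y-%m-%d)
FILENAME="docs/sessions/${DATE}-${SESSION_NAME}.md"

mkdir -p "$(dirname "$FILENAME")"  # make sure the target directory exists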

cat > "$FILENAME" << EOF
# Session: $SESSION_NAME

**Date:** $DATE
**Session Type:** $SESSION_TYPE
**Duration:** ~X hours
**Objective:** [Fill in objective]

---

## Session Overview

[High-level summary]

**Key Achievement:** [Most important outcome]

---

## Tasks Completed

### 1. Task Name ✓

...

EOF

echo "Created session log: $FILENAME"
code "$FILENAME"

Usage:

./scripts/utilities/new_session.sh "sentiment-analysis" "Experiment"

Best Practices

For All Tiers

Be Consistent:

Log daily (Tier 2)
Use templates (Tier 3)
Automate logging setup (Tier 1)

Be Honest:

Document failures and learnings
Record what didn't work
Note mistakes and fixes

Be Actionable:

Include next steps
Note blockers
Record questions

Tier-Specific Tips

Execution Logs (Tier 1):

Log everything (disk is cheap)
Include full stack traces
Timestamp every entry
Archive after 30 days

Activity Log (Tier 2):

Write end-of-day
Keep entries concise (5-10 lines per section)
Use checkboxes for next steps
Review weekly

Session Logs (Tier 3):

Write within 24 hours (while fresh)
Include enough context for others
Focus on "why" not just "what"
Tag potential blog topics
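The "archive after 30 days" tip for execution logs is easy to automate. The sketch below is one possible approach using only the standard library; `archive_old_logs`, the gzip format, and the separate archive directory are assumptions made for illustration, and the age check relies on each file's modification time.

```python
import gzip
import shutil
import time
from pathlib import Path


def archive_old_logs(log_dir, archive_dir, max_age_days=30, now=None):
    """Gzip *.log files older than max_age_days into archive_dir.

    Returns the list of archive paths that were written.
    """
    log_dir, archive_dir = Path(log_dir), Path(archive_dir)
    current = now if now is not None else time.time()
    cutoff = current - max_age_days * 86400
    archived = []
    for log_path in sorted(log_dir.rglob("*.log")):
        if log_path.stat().st_mtime >= cutoff:
            continue  # recent enough to keep uncompressed
        # Mirror the sub-directory layout (experiments/, training/, agents/).
        relative = log_path.relative_to(log_dir)
        target = archive_dir / relative.with_suffix(".log.gz")
        target.parent.mkdir(parents=True, exist_ok=True)
        with log_path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        log_path.unlink()  # delete the original only after the copy succeeds
        archived.append(target)
    return archived
```

Calling `archive_old_logs("logs", "logs-archive")` from a weekly scheduled job would keep the active log directory small while preserving older runs in compressed form; both directory names here are placeholders.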

Common Pitfalls

Pitfall 1: Too much documentation

Problem: Spending more time documenting than doing
Solution: Start with Tier 2 only, add others as needed

Pitfall 2: Inconsistent logging

Problem: Logging for a week, then stopping
Solution: Make it easy (templates, aliases) and make it a habit

Pitfall 3: Wrong detail level

Problem: Technical details in activity log, summaries in execution logs
Solution: Remember tier purposes: debug, review, share

Pitfall 4: No connection between tiers

Problem: Logs mention "experiment X" but no link to execution logs
Solution: Reference log files in activity log, cross-link freely
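Cross-links between tiers can also be checked mechanically. The sketch below assumes experiment IDs follow the `YYYYMMDD-HHMMSS-name` pattern used in this article and that each execution log is named `<id>.log`; `find_unlinked_experiments` is a hypothetical helper.

```python
import re
from pathlib import Path

# Experiment IDs in this article look like 20241019-143000-sentiment-analysis.
EXPERIMENT_ID = re.compile(r"\b\d{8}-\d{6}-[a-z0-9][a-z0-9-]*")


def find_unlinked_experiments(activity_text, log_root):
    """Return experiment IDs mentioned in the activity log with no <id>.log file."""
    existing = {path.stem for path in Path(log_root).rglob("*.log")}
    mentioned = sorted(set(EXPERIMENT_ID.findall(activity_text)))
    return [exp_id for exp_id in mentioned if exp_id not in existing]
```

Running this against the activity log and the log directory during a weekly review flags entries whose execution logs were never written or have since been moved.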

What's Next

You now have a three-tier documentation system that captures ML work at multiple levels. But documentation is only part of reproducibility—you also need systematic experiment tracking.

In Part 3: Experiment Tracking and Reproducibility, we'll build:

Complete experiment templates with comprehensive READMEs
Progress tracking systems
Success criteria frameworks
Results comparison tools
Lifecycle management automation

We'll ensure every experiment is reproducible, comparable, and builds toward knowledge accumulation.

Key Takeaways

Three tiers serve different purposes: debugging, review, sharing
Execution logs capture technical details for troubleshooting
Activity logs track daily progress and build accountability
Session logs transform work into shareable knowledge
Automation makes documentation sustainable
Consistency matters more than perfection

Resources

Templates:

Python logging setup (provided above)
Activity log template (provided above)
Session log template (provided above)
Bash automation scripts (provided above)

Tools:

Python logging module
Bash aliases and functions
Code editor integration

Series Navigation

Previous: Part 1: Workspace Structure
Next: Part 3: Experiment Tracking (coming soon)
Series Home: Building a Production ML Workspace on GPU Infrastructure

Questions or suggestions? Find me on Twitter @bioinfo or at rundatarun.io

About the Author: Justin Johnson builds AI systems and writes about practical AI development.

