Building a Production ML Workspace: Part 2 - Documentation Systems That Scale
You've built a clean workspace structure. Your directories are organized, your experiments have homes, and your Ollama models are properly categorized. But there's a problem:
Three months from now, you won't remember why you ran that experiment.
You won't recall which hyperparameters worked, what insights you gained, or why Model A outperformed Model B. Without documentation, your past work becomes a black box—inaccessible and unreproducible.
This article shows you how to build a three-tier documentation system that captures ML work at different levels of detail, serves different audiences, and transforms your experiments into shareable knowledge.
This is Part 2 of a 5-part series on building production ML workspaces. Previous parts:
- Part 1: Workspace Structure ("Designing an Organized Structure", Oct 19, 2025)
Coming next:
- Part 3: Experiment Tracking
- Part 4: Agent Templates
- Part 5: Ollama Integration
The Documentation Problem
Most ML practitioners face these challenges:
The Short-Term Problem:
- Which model did I use for that benchmark?
- What dataset preprocessing steps did I apply?
- Why did this experiment fail?
- How do I reproduce these results?
The Long-Term Problem:
- What have I learned this month?
- Which approaches consistently work?
- What patterns am I seeing across projects?
- How do I share this knowledge?
The Communication Problem:
- How do I explain my work to teammates?
- How do I create blog posts from experiments?
- How do I document insights without losing technical details?
No single level of documentation can solve all of these. Execution logs are too detailed for review, and high-level summaries lack debugging context. You need different documentation for different purposes.
The Three-Tier Documentation System
Our solution: Three complementary documentation levels, each serving specific needs:
Tier 1: EXECUTION LOGS
├─ Purpose: Technical debugging and troubleshooting
├─ Audience: You (future debugging)
├─ Detail Level: High (full stack traces, timing, errors)
└─ Location: (local path)
Tier 2: ACTIVITY LOG
├─ Purpose: Daily progress tracking and review
├─ Audience: You (weekly review, accountability)
├─ Detail Level: Medium (summaries, key decisions, outcomes)
└─ Location: (local path)
Tier 3: SESSION LOGS
├─ Purpose: Knowledge sharing and blog content
├─ Audience: Others (teammates, blog readers)
├─ Detail Level: Narrative (context, process, insights)
└─ Location: (local path) or blog drafts/
Each tier captures the same work from a different perspective. Let's build each one.
Tier 1: Execution Logs
Execution logs capture technical details for debugging. When something breaks at 2 AM, you'll be grateful these exist.
Log Structure
(local path)
├── experiments/          # Experiment execution logs
│   └── 20241019-143000-sentiment-analysis.log
├── training/             # Model training logs
│   └── 20241019-155500-llama-finetuning.log
└── agents/               # Agent execution logs
    └── 20241020-090000-customer-agent.log
What to Log
For experiments:
- Timestamp of each step
- Model and version used
- Hyperparameters and configuration
- Dataset information
- Execution time for each step
- Memory usage and GPU utilization
- Errors and warnings (with full stack traces)
- Output file locations
For training:
- Training configuration
- Epoch-by-epoch metrics
- Loss curves
- Validation metrics
- Checkpoint locations
- Hardware utilization
- Training time
For agents:
- Input/output for each interaction
- Tool calls and responses
- Decision reasoning
- Latency per operation
- Errors and retry logic
- Token usage
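Several of the items above (timestamps, per-step execution time, errors with stack traces) can be captured by wrapping each step in a small helper. The following is a minimal sketch, not part of the original workspace: the `log_step` name and the `"experiment"` logger name are assumptions made for illustration.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name; use whatever your logging setup provides
logger = logging.getLogger("experiment")


@contextmanager
def log_step(name):
    """Log the start, duration, and any failure of one experiment step."""
    start = time.perf_counter()
    logger.info("Step started: %s", name)
    try:
        yield
    except Exception:
        # logger.exception records the full stack trace at ERROR level;
        # the error is then re-raised so the caller still sees it
        logger.exception("Step failed: %s (%.1fs)", name, time.perf_counter() - start)
        raise
    else:
        logger.info("Step complete: %s (%.1fs)", name, time.perf_counter() - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s - %(levelname)s - %(message)s")
    with log_step("load-dataset"):
        rows = list(range(10_000))  # stand-in for real loading work
```

Wrapping steps this way keeps timing and error handling in one place, so individual experiment scripts only need a `with log_step(...)` line per step.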
Example Execution Log
2024-10-19 14:30:00 - INFO - Experiment: sentiment-analysis
2024-10-19 14:30:00 - INFO - Model: llama3.1:8b
2024-10-19 14:30:01 - INFO - Dataset: reviews_processed_20241019.csv (10000 samples)
2024-10-19 14:30:01 - INFO - Config: {temperature: 0.3, max_tokens: 100}
2024-10-19 14:30:01 - INFO - Starting inference...
2024-10-19 14:30:15 - INFO - Batch 1/100 complete (14.2s, 7.0 tokens/sec)
2024-10-19 14:30:28 - INFO - Batch 2/100 complete (13.1s, 7.6 tokens/sec)
...
2024-10-19 14:52:45 - INFO - Inference complete (22m 44s)
2024-10-19 14:52:45 - INFO - Results saved: results/20241019-sentiment-analysis/
2024-10-19 14:52:45 - INFO - Accuracy: 87.3%, F1: 0.852
2024-10-19 14:52:45 - INFO - Peak GPU memory: 6.2GB
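Because every line follows the same `timestamp - LEVEL - message` layout, a log like this can be parsed back into structured records. The sketch below is one way to do it; `parse_log_line` and `elapsed_seconds` are hypothetical helper names, and the pattern assumes the format shown in this article.

```python
import re
from datetime import datetime

# Matches "2024-10-19 14:30:00 - INFO - message". The optional ",123" covers
# the milliseconds that Python's default asctime format appends.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?:,\d{3})?"
    r" - (?P<level>[A-Z]+) - (?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None  # e.g. a traceback continuation line or "..."
    return {
        "timestamp": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
        "level": match["level"],
        "message": match["message"],
    }


def elapsed_seconds(lines):
    """Seconds between the first and last parseable entries."""
    stamps = [entry["timestamp"] for entry in map(parse_log_line, lines) if entry]
    return (stamps[-1] - stamps[0]).total_seconds() if stamps else 0.0
```

Lines that do not match (stack trace continuations, elisions) are skipped rather than raising, which keeps the parser usable on logs that contain multi-line errors.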
Python Logging Template
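import logging
from datetime import datetime
from pathlib import Path

# Placeholder values; replace with your own experiment details
experiment_name = "sentiment-analysis"
model_name = "llama3.1:8b"

# Setup logging
log_dir = Path("logs/experiments")
log_dir.mkdir(parents=True, exist_ok=True)  # FileHandler fails if the directory is missing
log_file = log_dir / f"{datetime.now().strftime('%Y%m%d-%H%M%S')}-{experiment_name}.log"
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file),
        logging.StreamHandler()  # Also print to console
    ]
)
logger = logging.getLogger(__name__)

# Usage in your code
# dataset_name, dataset, and run_experiment() are assumed to come from your own code
logger.info(f"Experiment: {experiment_name}")
logger.info(f"Model: {model_name}")
logger.info(f"Dataset: {dataset_name} ({len(dataset)} samples)")
try:
    results = run_experiment()
    logger.info(f"Results: {results}")
except Exception:
    # logger.exception logs at ERROR level and includes the full stack trace
    logger.exception("Experiment failed")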
Best Practices
- Structured logging: Use consistent formats for parsing
- Timestamps: Every log entry needs a timestamp
- Context: Log enough to reproduce the exact scenario
- Errors: Always include full stack traces
- Rotation: Archive logs older than 30 days on a regular schedule
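The rotation step can be scripted. Here is a minimal sketch, assuming logs are judged by file modification time and that the archive directory sits outside the log directory; the `archive_old_logs` function and its parameters are hypothetical, not part of the original workspace.

```python
import shutil
import time
from pathlib import Path


def archive_old_logs(log_dir, archive_dir, max_age_days=30):
    """Move *.log files older than max_age_days into archive_dir.

    Age is judged by file modification time. Returns the moved paths.
    Assumes archive_dir is not located inside log_dir.
    """
    log_dir, archive_dir = Path(log_dir), Path(archive_dir)
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    moved = []
    # sorted() materializes the listing before any file is moved
    for log_file in sorted(log_dir.rglob("*.log")):
        if log_file.stat().st_mtime >= cutoff:
            continue  # still recent, leave in place
        # Preserve the subdirectory (experiments/, training/, agents/)
        target = archive_dir / log_file.relative_to(log_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(log_file), str(target))
        moved.append(target)
    return moved
```

Moving rather than deleting keeps old logs recoverable; compressing the archive directory afterwards is an optional further step.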
Tier 2: Activity Log
The activity log is your daily work journal. It answers "What did I do today?" and creates accountability.
ACTIVITY.md Structure
Location: (local path)
# ML Workspace Activity Log
Quick daily summaries of ML development work. For detailed technical logs, see logs/ directory.
---
## 2024-10-19
### Models
- Pulled llama3.1:8b from Ollama
- Created custom model: llama3-sentiment:v1 (temperature: 0.3, system prompt optimized)
- Tested phi3:mini for code generation tasks
### Experiments
- Started: 20241019-143000-sentiment-analysis
- Goal: Evaluate llama3.1 on product reviews
- Status: Complete
- Result: 87.3% accuracy, 0.852 F1 score
- Insight: Lower temperature (0.3) significantly improved consistency
### Agents
- Built prototype customer-support-agent
- Uses llama3-sentiment:v1
- Integrated with knowledge base tool
- Tested with 50 sample queries
- Next: Add conversation history
### Datasets
- Processed: reviews_raw_20241019.csv → reviews_processed_20241019.csv
- Cleaned 10,000 product reviews
- Removed duplicates, normalized text
- Added sentiment labels from manual annotation
### Code
- Created: scripts/utilities/preprocess_reviews.py
- Updated: templates/experiment/README.md with sentiment analysis template
### Learning
- Discovery: Llama3.1 at temp 0.3 gives more consistent sentiment predictions than 0.7
- Insight: Batch processing (100 samples/batch) optimal for GPU utilization
- Note: Fine-tuning might help with domain-specific language
### Next Steps
- [ ] Fine-tune llama3 on domain-specific reviews
- [ ] Expand customer-support-agent with multi-turn conversations
- [ ] Create comparison benchmark: llama3.1 vs phi3 vs mistral
---
## 2024-10-18
### Models
- Initial Ollama setup
- Pulled: llama3.1:8b, phi3:mini, mistral:7b
### Workspace
- Created complete workspace structure (30+ directories)
- Set up logging infrastructure
- Created experiment and agent templates
### Learning
- Ollama model management best practices
- Workspace organization patterns for ML
### Next Steps
- [ ] Run first sentiment analysis experiment
- [ ] Create custom Ollama model for specific domain
---
What to Include
Models:
- Models pulled, created, or tested
- Custom model configurations
- Performance observations
Experiments:
- Experiments started or completed
- Key results and metrics
- Important insights
Agents:
- Agent development progress
- Features added or tested
- Integration work
Datasets:
- Data processing done
- Dataset creation or cleaning
- Annotations added
Code:
- Scripts created or updated
- Tools built
- Automation added
Learning:
- Insights gained
- Patterns discovered
- Mistakes made (and learned from)
Next Steps:
- Priorities for tomorrow/next session
- Blockers to resolve
- Ideas to explore
Benefits of Activity Logging
Accountability:
- End each day reviewing what was done
- Identify if time was well-spent
- Maintain momentum
Pattern Recognition:
- See which approaches work consistently
- Identify time sinks
- Track skill development
Blog Content:
- Daily entries become session summaries
- Insights turn into article topics
- Natural documentation of journey
External Memory:
- Quickly recall "what did I do last week?"
- Find experiments by date
- Track project evolution
Tier 3: Session Logs
Session logs are narrative documentation for sharing knowledge. They're the bridge between your work and blog posts.
When to Create Session Logs
Not every day needs a session log. Create them when:
- You complete a significant experiment
- You build something worth sharing
- You learn something important
- You solve a challenging problem
- You have insights worth documenting
Session Log Structure
Location: (local path) or blog drafts/
# Session: [Descriptive Title]
**Date:** YYYY-MM-DD
**Session Type:** [Setup/Experiment/Development/Optimization]
**Duration:** ~X hours
**Objective:** [What you set out to accomplish]
---
## Session Overview
[High-level summary of what happened and why it matters]
**Key Achievement:** [Most important outcome in one sentence]
---
## Tasks Completed
### 1. [Task Name] ✓
**Task:** [What needed to be done]
**Process:**
- Step 1
- Step 2
- Step 3
**Execution:**
```bash
# Commands or code used
```
**Details:**
- Important details
- Decisions made
- Challenges encountered
**Outcome:** [What happened]
### 2. [Next Task] ✓
[Same structure]
## Key Decisions & Rationale
### Decision 1: [Decision Name]
**Decision:** [What you decided]
**Rationale:**
- Reason 1
- Reason 2
**Alternative Considered:** [What else you thought about]
**Rejected Because:** [Why you didn't choose it]
## Technical Insights
### Insight 1: [Insight Name]
**Discovery:** [What you learned]
**Impact:**
- How this affects your work
- Why this matters
**Evidence:** [Data or observations]
## Results & Metrics
[Quantitative results if applicable]
- Metric 1: Value
- Metric 2: Value
- Performance: Details
## Lessons Learned
- Lesson 1: [What you learned]
  - Why this matters
  - How to apply it
- Lesson 2: [Another learning]
## Next Steps & Recommendations
**Immediate:**
- Action item 1
- Action item 2
**Future:**
- Future direction 1
- Future direction 2
## Blog Post Ideas
### Post 1: "[Title]"
- Angle: [How to approach the topic]
- Target Audience: [Who this helps]
- Key Takeaways: [Main points]
## Resources Referenced
- Link 1
- Link 2
- Documentation reference
## Summary
[Wrap up with key achievements, success metrics, and what made this session valuable]
Example: From Activity Log to Session Log
Activity Log Entry (Tier 2):
## 2024-10-19
### Experiments
- Completed: 20241019-143000-sentiment-analysis
- Result: 87.3% accuracy, 0.852 F1 score
- Insight: Lower temperature improved consistency
Session Log (Tier 3):
# Session: Sentiment Analysis with Llama 3.1
**Key Achievement:** Built production-ready sentiment analysis pipeline achieving 87.3% accuracy with optimized inference.
[Full narrative with context, process, insights, and lessons learned]
**Blog Post Ideas:**
- "Building a Production Sentiment Analyzer with Llama 3.1"
- "Temperature Tuning for Consistent LLM Predictions"
Session Log Benefits
Knowledge Sharing:
- Teammates learn from your process
- Context for future you
- Blog-ready content
Reproducibility:
- Detailed enough to recreate
- Includes rationale for decisions
- Documents what worked and why
Portfolio Building:
- Demonstrates skills and thinking
- Shows problem-solving approach
- Creates shareable artifacts
Integration: How the Three Tiers Work Together
Here's how a real ML project flows through all three tiers:
Example: Training a Custom Model
Tier 1 (Execution Log):
2024-10-19 15:00:00 - INFO - Starting model training
2024-10-19 15:00:01 - INFO - Base model: llama3.1:8b
2024-10-19 15:00:01 - INFO - Dataset: reviews_processed_20241019.csv (10000 samples)
2024-10-19 15:00:01 - INFO - Config: {epochs: 3, batch_size: 32, learning_rate: 0.0001}
2024-10-19 15:00:01 - INFO - Epoch 1/3 starting...
2024-10-19 15:15:23 - INFO - Epoch 1/3 complete - Loss: 0.452, Val Loss: 0.389
2024-10-19 15:15:23 - INFO - Checkpoint saved: checkpoints/epoch_1.pt
...
Tier 2 (Activity Log):
### Experiments
- Completed: Fine-tuning llama3.1 on product reviews
- 3 epochs, final loss: 0.312
- Validation accuracy improved from 82% → 91%
- Created: llama3-reviews:v1
- Next: Deploy to production agent
Tier 3 (Session Log):
# Session: Fine-Tuning Llama 3.1 for Domain-Specific Sentiment Analysis
**Key Achievement:** Fine-tuned model improved accuracy by 9 percentage points (82% → 91%) on domain-specific reviews.
## Process
[Detailed narrative of preparation, training, evaluation]
## Key Insights
1. Domain-specific fine-tuning significantly improves accuracy
2. 3 epochs optimal (4+ epochs showed overfitting)
3. Lower learning rate (0.0001) more stable than default
## Blog Post Ideas
- "Fine-Tuning Llama 3.1: A Practical Guide"
- "When to Fine-Tune vs. Prompt Engineering"
Result: Same work, three perspectives:
- Tier 1: Technical debugging reference
- Tier 2: Quick daily summary
- Tier 3: Shareable knowledge
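The hand-off from Tier 1 to Tier 2 can be partly automated by pulling the epoch metrics out of the execution log and formatting them as an activity bullet. This is a rough sketch that assumes the epoch line format shown in the Tier 1 example above; `epoch_metrics` and `activity_summary` are hypothetical helper names.

```python
import re

# Matches lines like "Epoch 1/3 complete - Loss: 0.452, Val Loss: 0.389"
EPOCH_LINE = re.compile(
    r"Epoch (?P<epoch>\d+)/(?P<total>\d+) complete"
    r" - Loss: (?P<loss>[\d.]+), Val Loss: (?P<val_loss>[\d.]+)"
)


def epoch_metrics(log_lines):
    """Extract per-epoch loss values from Tier 1 log lines."""
    metrics = []
    for line in log_lines:
        match = EPOCH_LINE.search(line)
        if match:
            metrics.append({
                "epoch": int(match["epoch"]),
                "loss": float(match["loss"]),
                "val_loss": float(match["val_loss"]),
            })
    return metrics


def activity_summary(title, log_lines):
    """Format a short Tier 2 bullet from the epochs found in the log."""
    metrics = epoch_metrics(log_lines)
    if not metrics:
        return f"- {title}: no completed epochs found in log"
    last = metrics[-1]
    return (f"- {title}\n"
            f"  - Epochs completed: {last['epoch']}, final loss: {last['loss']:.3f}"
            f" (val loss: {last['val_loss']:.3f})")
```

The generated bullet is only a starting point: insights and next steps still need to be written by hand, since they are not in the execution log.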
Documentation Automation
Make documentation easy with these automation strategies:
1. Logging Boilerplate
# scripts/utilities/setup_logging.py
import logging
from datetime import datetime
from pathlib import Path


def setup_experiment_logging(experiment_name, log_dir="logs/experiments"):
    """Set up logging for an experiment and return (logger, log_file)."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    log_file = f"{log_dir}/{timestamp}-{experiment_name}.log"
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(log_file),  # Persist to the experiment log
            logging.StreamHandler()         # Also print to console
        ],
        force=True  # Python 3.8+: replace handlers left by an earlier call
    )
    return logging.getLogger(__name__), log_file


# Usage
logger, log_file = setup_experiment_logging("sentiment-analysis")
logger.info("Experiment started")
2. Activity Log Helper
# Add to (local path) or create script
alias log-activity='code (local path)'
alias today='date +%Y-%m-%d'

# Quick entry function
# printf is used because plain echo does not expand \n in bash
# Note: each call appends a new date heading, even on the same day
function ml-log() {
    printf '\n## %s\n\n' "$(date +%Y-%m-%d)" >> (local path)
    printf '### %s\n- %s\n\n' "$1" "$2" >> (local path)
}

# Usage
ml-log "Experiments" "Completed sentiment analysis: 87.3% accuracy"
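If you prefer to script the same quick-entry idea in Python, the date heading can be written only once per day. This is a minimal sketch under stated assumptions: `log_activity` is a hypothetical helper, the file is the ACTIVITY.md described earlier, and entries are appended to the end of the file.

```python
from datetime import date
from pathlib import Path


def log_activity(activity_file, section, note, today=None):
    """Append a note under today's date, adding the date heading only once."""
    path = Path(activity_file)
    heading = f"## {(today or date.today()).isoformat()}"
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    parts = []
    # Add the heading only if no line in the file already equals it
    if heading not in existing.splitlines():
        parts.append(f"\n{heading}\n")
    parts.append(f"### {section}\n- {note}\n")
    with path.open("a", encoding="utf-8") as handle:
        handle.write("\n".join(parts))
```

A call such as `log_activity("ACTIVITY.md", "Experiments", "Completed sentiment analysis: 87.3% accuracy")` mirrors the usage shown above. Section headings can still repeat within a day; merging them would require rewriting the file rather than appending.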
3. Session Template Generator
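#!/bin/bash
# scripts/utilities/new_session.sh
# Usage: new_session.sh <session-name> <session-type>
SESSION_NAME="$1"
SESSION_TYPE="$2"
DATE=$(date +%Y-%m-%d)
FILENAME="docs/sessions/${DATE}-${SESSION_NAME}.md"
mkdir -p "$(dirname "$FILENAME")"  # Create docs/sessions/ if it does not exist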
cat > "$FILENAME" << EOF
# Session: $SESSION_NAME
**Date:** $DATE
**Session Type:** $SESSION_TYPE
**Duration:** ~X hours
**Objective:** [Fill in objective]
---
## Session Overview
[High-level summary]
**Key Achievement:** [Most important outcome]
---
## Tasks Completed
### 1. Task Name ✓
...
EOF
echo "Created session log: $FILENAME"
code "$FILENAME"
Usage:
./scripts/utilities/new_session.sh "sentiment-analysis" "Experiment"
Best Practices
For All Tiers
Be Consistent:
- Log daily (Tier 2)
- Use templates (Tier 3)
- Automate logging setup (Tier 1)
Be Honest:
- Document failures and learnings
- Record what didn't work
- Note mistakes and fixes
Be Actionable:
- Include next steps
- Note blockers
- Record questions
Tier-Specific Tips
Execution Logs (Tier 1):
- Log everything (disk is cheap)
- Include full stack traces
- Timestamp every entry
- Archive after 30 days
Activity Log (Tier 2):
- Write end-of-day
- Keep entries concise (5-10 lines per section)
- Use checkboxes for next steps
- Review weekly
Session Logs (Tier 3):
- Write within 24 hours (while fresh)
- Include enough context for others
- Focus on "why" not just "what"
- Tag potential blog topics
Common Pitfalls
Pitfall 1: Too much documentation
- Problem: Spending more time documenting than doing
- Solution: Start with Tier 2 only, add others as needed
Pitfall 2: Inconsistent logging
- Problem: Logging for a week, then stopping
- Solution: Make it easy (templates, aliases), make it habit
Pitfall 3: Wrong detail level
- Problem: Technical details in activity log, summaries in execution logs
- Solution: Remember tier purposes: debug, review, share
Pitfall 4: No connection between tiers
- Problem: Logs mention "experiment X" but no link to execution logs
- Solution: Reference log files in activity log, cross-link freely
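Cross-links between tiers can also be checked mechanically. The sketch below scans activity log text for experiment IDs in the `YYYYMMDD-HHMMSS-name` form used in this article and reports any that have no matching `.log` file; `missing_execution_logs` is a hypothetical helper, and the ID pattern is an assumption based on the examples above.

```python
import re
from pathlib import Path

# Experiment IDs in this article look like 20241019-143000-sentiment-analysis
EXPERIMENT_ID = re.compile(r"\b\d{8}-\d{6}-[a-z0-9]+(?:-[a-z0-9]+)*\b")


def missing_execution_logs(activity_text, log_root):
    """Return experiment IDs mentioned in the activity log with no .log file."""
    log_root = Path(log_root)
    # File stems are the log file names without the ".log" suffix
    available = {path.stem for path in log_root.rglob("*.log")}
    mentioned = sorted(set(EXPERIMENT_ID.findall(activity_text)))
    return [exp_id for exp_id in mentioned if exp_id not in available]
```

Running a check like this during the weekly review surfaces entries whose execution logs were never written or have already been archived.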
What's Next
You now have a three-tier documentation system that captures ML work at multiple levels. But documentation is only part of reproducibility—you also need systematic experiment tracking.
In Part 3: Experiment Tracking and Reproducibility, we'll build:
- Complete experiment templates with comprehensive READMEs
- Progress tracking systems
- Success criteria frameworks
- Results comparison tools
- Lifecycle management automation
We'll ensure every experiment is reproducible, comparable, and builds toward knowledge accumulation.
Key Takeaways
- Three tiers serve different purposes: debugging, review, sharing
- Execution logs capture technical details for troubleshooting
- Activity logs track daily progress and build accountability
- Session logs transform work into shareable knowledge
- Automation makes documentation sustainable
- Consistency matters more than perfection
Resources
Templates:
- Python logging setup (provided above)
- Activity log template (provided above)
- Session log template (provided above)
- Bash automation scripts (provided above)
Tools:
- Python logging module
- Bash aliases and functions
- Code editor integration
Series Navigation
- Previous: Part 1: Workspace Structure ("Designing an Organized Structure", Oct 19, 2025)
- Next: Part 3: Experiment Tracking (coming soon)
- Series Home: Building a Production ML Workspace on GPU Infrastructure
Questions or suggestions? Find me on Twitter @bioinfo or at rundatarun.io
About the Author: Justin Johnson builds AI systems and writes about practical AI development.