# Building a Production ML Workspace: Part 2 - Documentation Systems That Scale
You've built a clean workspace structure. Your directories are organized, your experiments have homes, and your Ollama models are properly categorized. But there's a problem:
**Three months from now, you won't remember why you ran that experiment.**
You won't recall which hyperparameters worked, what insights you gained, or why Model A outperformed Model B. Without documentation, your past work becomes a black box—inaccessible and unreproducible.
This article shows you how to build a three-tier documentation system that captures ML work at different levels of detail, serves different audiences, and transforms your experiments into shareable knowledge.
<div class="callout" data-callout="info">
<div class="callout-title">About This Series</div>
<div class="callout-content">
This is Part 2 of a 5-part series on building production ML workspaces. Previous parts:
- [[building-production-ml-workspace-part-1-structure|Part 1: Workspace Structure]]
Coming next:
- **Part 3**: Experiment Tracking
- **Part 4**: Agent Templates
- **Part 5**: Ollama Integration
</div>
</div>
---
## The Documentation Problem
Most ML practitioners face these challenges:
**The Short-Term Problem:**
- Which model did I use for that benchmark?
- What dataset preprocessing steps did I apply?
- Why did this experiment fail?
- How do I reproduce these results?
**The Long-Term Problem:**
- What have I learned this month?
- Which approaches consistently work?
- What patterns am I seeing across projects?
- How do I share this knowledge?
**The Communication Problem:**
- How do I explain my work to teammates?
- How do I create blog posts from experiments?
- How do I document insights without losing technical details?
**A single level of documentation can't solve all of these.** Execution logs are too detailed for review. High-level summaries lack debugging context. You need different documentation for different purposes.
---
## The Three-Tier Documentation System
Our solution: Three complementary documentation levels, each serving specific needs:
```
Tier 1: EXECUTION LOGS
├─ Purpose: Technical debugging and troubleshooting
├─ Audience: You (future debugging)
├─ Detail Level: High (full stack traces, timing, errors)
└─ Location: ~/workspace/logs/
Tier 2: ACTIVITY LOG
├─ Purpose: Daily progress tracking and review
├─ Audience: You (weekly review, accountability)
├─ Detail Level: Medium (summaries, key decisions, outcomes)
└─ Location: ~/workspace/ACTIVITY.md
Tier 3: SESSION LOGS
├─ Purpose: Knowledge sharing and blog content
├─ Audience: Others (teammates, blog readers)
├─ Detail Level: Narrative (context, process, insights)
└─ Location: ~/docs/sessions/ or blog drafts/
```
Each tier captures the same work from a different perspective. Let's build each one.
---
## Tier 1: Execution Logs
Execution logs capture technical details for debugging. When something breaks at 2 AM, you'll be grateful these exist.
### Log Structure
```bash
~/workspace/logs/
├── experiments/        # Experiment execution logs
│   └── 20241019-143000-sentiment-analysis.log
├── training/           # Model training logs
│   └── 20241019-155500-llama-finetuning.log
└── agents/             # Agent execution logs
    └── 20241020-090000-customer-agent.log
```
### What to Log
**For experiments:**
- Timestamp of each step
- Model and version used
- Hyperparameters and configuration
- Dataset information
- Execution time for each step
- Memory usage and GPU utilization
- Errors and warnings (with full stack traces)
- Output file locations
**For training:**
- Training configuration
- Epoch-by-epoch metrics
- Loss curves
- Validation metrics
- Checkpoint locations
- Hardware utilization
- Training time
**For agents:**
- Input/output for each interaction
- Tool calls and responses
- Decision reasoning
- Latency per operation
- Errors and retry logic
- Token usage
### Example Execution Log
```
2024-10-19 14:30:00 - INFO - Experiment: sentiment-analysis
2024-10-19 14:30:00 - INFO - Model: llama3.1:8b
2024-10-19 14:30:01 - INFO - Dataset: reviews_processed_20241019.csv (10000 samples)
2024-10-19 14:30:01 - INFO - Config: {temperature: 0.3, max_tokens: 100}
2024-10-19 14:30:01 - INFO - Starting inference...
2024-10-19 14:30:15 - INFO - Batch 1/100 complete (14.2s, 7.0 tokens/sec)
2024-10-19 14:30:28 - INFO - Batch 2/100 complete (13.1s, 7.6 tokens/sec)
...
2024-10-19 14:52:45 - INFO - Inference complete (22m 44s)
2024-10-19 14:52:45 - INFO - Results saved: results/20241019-sentiment-analysis/
2024-10-19 14:52:45 - INFO - Accuracy: 87.3%, F1: 0.852
2024-10-19 14:52:45 - INFO - Peak GPU memory: 6.2GB
```
### Python Logging Template
```python
import logging
from datetime import datetime

# Set up logging
log_file = f"logs/experiments/{datetime.now().strftime('%Y%m%d-%H%M%S')}-{experiment_name}.log"

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file),
        logging.StreamHandler()  # Also print to console
    ]
)

logger = logging.getLogger(__name__)

# Usage in your code
logger.info(f"Experiment: {experiment_name}")
logger.info(f"Model: {model_name}")
logger.info(f"Dataset: {dataset_name} ({len(dataset)} samples)")

try:
    results = run_experiment()
    logger.info(f"Results: {results}")
except Exception as e:
    logger.error(f"Experiment failed: {str(e)}", exc_info=True)
```
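The template above produces human-readable lines. If you also want logs you can parse programmatically (for example, to compare runs later), a JSON-lines variant works well alongside it. A minimal sketch, assuming the same `logs/experiments/` layout; the field names and file path are illustrative, not a fixed schema:
```python
import json
import logging
from datetime import datetime, timezone

class JsonLineFormatter(logging.Formatter):
    """Format each record as one JSON object per line for easy parsing."""
    def format(self, record):
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Pick up structured fields passed via `extra={...}` (illustrative keys)
        for key in ("model", "dataset", "config", "metrics"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

json_handler = logging.FileHandler("logs/experiments/structured.jsonl")
json_handler.setFormatter(JsonLineFormatter())
logger = logging.getLogger("experiment")
logger.setLevel(logging.INFO)
logger.addHandler(json_handler)

# Usage
logger.info("Inference complete", extra={"metrics": {"accuracy": 0.873, "f1": 0.852}})
```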
### Best Practices
- **Structured logging:** Use consistent formats for parsing
- **Timestamps:** Every log entry needs a timestamp
- **Context:** Log enough to reproduce the exact scenario
- **Errors:** Always include full stack traces
- **Rotation:** Archive logs older than ~30 days (a simple archiving sketch follows below)
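Rotation doesn't need a dedicated tool. A small script run occasionally (or from cron) can compress stale logs into an archive directory. A minimal sketch, assuming the `logs/` layout above; the 30-day cutoff, script path, and `logs/archive/` destination are illustrative defaults:
```python
# scripts/utilities/archive_logs.py  (path is illustrative)
import gzip
import shutil
import time
from pathlib import Path

LOG_ROOT = Path("logs")               # matches ~/workspace/logs/
ARCHIVE_ROOT = Path("logs/archive")   # assumed destination for old logs
MAX_AGE_DAYS = 30

def archive_old_logs():
    """Gzip and move .log files older than MAX_AGE_DAYS into the archive."""
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for log_file in LOG_ROOT.rglob("*.log"):
        if log_file.stat().st_mtime >= cutoff:
            continue  # still fresh, leave it in place
        dest = (ARCHIVE_ROOT / log_file.relative_to(LOG_ROOT)).with_suffix(".log.gz")
        dest.parent.mkdir(parents=True, exist_ok=True)
        with open(log_file, "rb") as src, gzip.open(dest, "wb") as out:
            shutil.copyfileobj(src, out)
        log_file.unlink()  # remove the original after compressing

if __name__ == "__main__":
    archive_old_logs()
```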
---
## Tier 2: Activity Log
The activity log is your daily work journal. It answers "What did I do today?" and creates accountability.
### ACTIVITY.md Structure
Location: `~/workspace/ACTIVITY.md`
```markdown
# ML Workspace Activity Log
Quick daily summaries of ML development work. For detailed technical logs, see logs/ directory.
---
## 2024-10-19
### Models
- Pulled llama3.1:8b from Ollama
- Created custom model: llama3-sentiment:v1 (temperature: 0.3, system prompt optimized)
- Tested phi3:mini for code generation tasks
### Experiments
- Started: 20241019-143000-sentiment-analysis
- Goal: Evaluate llama3.1 on product reviews
- Status: Complete
- Result: 87.3% accuracy, 0.852 F1 score
- Insight: Lower temperature (0.3) significantly improved consistency
### Agents
- Built prototype customer-support-agent
- Uses llama3-sentiment:v1
- Integrated with knowledge base tool
- Tested with 50 sample queries
- Next: Add conversation history
### Datasets
- Processed: reviews_raw_20241019.csv → reviews_processed_20241019.csv
- Cleaned 10,000 product reviews
- Removed duplicates, normalized text
- Added sentiment labels from manual annotation
### Code
- Created: scripts/utilities/preprocess_reviews.py
- Updated: templates/experiment/README.md with sentiment analysis template
### Learning
- Discovery: Llama3.1 at temp 0.3 gives more consistent sentiment predictions than 0.7
- Insight: Batch processing (100 samples/batch) optimal for GPU utilization
- Note: Fine-tuning might help with domain-specific language
### Next Steps
- [ ] Fine-tune llama3 on domain-specific reviews
- [ ] Expand customer-support-agent with multi-turn conversations
- [ ] Create comparison benchmark: llama3.1 vs phi3 vs mistral
---
## 2024-10-18
### Models
- Initial Ollama setup
- Pulled: llama3.1:8b, phi3:mini, mistral:7b
### Workspace
- Created complete workspace structure (30+ directories)
- Set up logging infrastructure
- Created experiment and agent templates
### Learning
- Ollama model management best practices
- Workspace organization patterns for ML
### Next Steps
- [ ] Run first sentiment analysis experiment
- [ ] Create custom Ollama model for specific domain
---
```
### What to Include
**Models:**
- Models pulled, created, or tested
- Custom model configurations
- Performance observations
**Experiments:**
- Experiments started or completed
- Key results and metrics
- Important insights
**Agents:**
- Agent development progress
- Features added or tested
- Integration work
**Datasets:**
- Data processing done
- Dataset creation or cleaning
- Annotations added
**Code:**
- Scripts created or updated
- Tools built
- Automation added
**Learning:**
- Insights gained
- Patterns discovered
- Mistakes made (and learned from)
**Next Steps:**
- Priorities for tomorrow/next session
- Blockers to resolve
- Ideas to explore
### Benefits of Activity Logging
**Accountability:**
- End each day reviewing what was done
- Identify if time was well-spent
- Maintain momentum
**Pattern Recognition:**
- See which approaches work consistently
- Identify time sinks
- Track skill development
**Blog Content:**
- Daily entries become session summaries
- Insights turn into article topics
- Natural documentation of journey
**External Memory:**
- Quickly recall "what did I do last week?"
- Find experiments by date
- Track project evolution
---
## Tier 3: Session Logs
Session logs are narrative documentation for sharing knowledge. They're the bridge between your work and blog posts.
### When to Create Session Logs
Not every day needs a session log. Create them when:
- You complete a significant experiment
- You build something worth sharing
- You learn something important
- You solve a challenging problem
- You have insights worth documenting
### Session Log Structure
Location: `~/docs/sessions/` or blog `drafts/`
````markdown
# Session: [Descriptive Title]
**Date:** YYYY-MM-DD
**Session Type:** [Setup/Experiment/Development/Optimization]
**Duration:** ~X hours
**Objective:** [What you set out to accomplish]
---
## Session Overview
[High-level summary of what happened and why it matters]
**Key Achievement:** [Most important outcome in one sentence]
---
## Tasks Completed
### 1. [Task Name] ✓
**Task:** [What needed to be done]
**Process:**
- Step 1
- Step 2
- Step 3
**Execution:**
```bash
# Commands or code used
```
**Details:**
- Important details
- Decisions made
- Challenges encountered
**Outcome:** [What happened]
### 2. [Next Task] ✓
[Same structure]
---
## Key Decisions & Rationale
### Decision 1: [Decision Name]
**Decision:** [What you decided]
**Rationale:**
- Reason 1
- Reason 2
**Alternative Considered:** [What else you thought about]
**Rejected Because:** [Why you didn't choose it]
---
## Technical Insights
### Insight 1: [Insight Name]
**Discovery:** [What you learned]
**Impact:**
- How this affects your work
- Why this matters
**Evidence:** [Data or observations]
---
## Results & Metrics
[Quantitative results if applicable]
- Metric 1: Value
- Metric 2: Value
- Performance: Details
---
## Lessons Learned
1. **Lesson 1:** [What you learned]
- Why this matters
- How to apply it
2. **Lesson 2:** [Another learning]
---
## Next Steps & Recommendations
**Immediate:**
1. Action item 1
2. Action item 2
**Future:**
- Future direction 1
- Future direction 2
---
## Blog Post Ideas
**Post 1:** "[Title]"
- Angle: [How to approach the topic]
- Target Audience: [Who this helps]
- Key Takeaways: [Main points]
---
## Resources Referenced
- Link 1
- Link 2
- Documentation reference
---
## Summary
[Wrap up with key achievements, success metrics, and what made this session valuable]
````
### Example: From Activity Log to Session Log
**Activity Log Entry (Tier 2):**
```markdown
## 2024-10-19
### Experiments
- Completed: 20241019-143000-sentiment-analysis
- Result: 87.3% accuracy, 0.852 F1 score
- Insight: Lower temperature improved consistency
```
**Session Log (Tier 3):**
```markdown
# Session: Sentiment Analysis with Llama 3.1
**Key Achievement:** Built production-ready sentiment analysis pipeline achieving 87.3% accuracy with optimized inference.
[Full narrative with context, process, insights, and lessons learned]
**Blog Post Ideas:**
- "Building a Production Sentiment Analyzer with Llama 3.1"
- "Temperature Tuning for Consistent LLM Predictions"
```
### Session Log Benefits
**Knowledge Sharing:**
- Teammates learn from your process
- Context for future you
- Blog-ready content
**Reproducibility:**
- Detailed enough to recreate
- Includes rationale for decisions
- Documents what worked and why
**Portfolio Building:**
- Demonstrates skills and thinking
- Shows problem-solving approach
- Creates shareable artifacts
---
## Integration: How the Three Tiers Work Together
Here's how a real ML project flows through all three tiers:
### Example: Training a Custom Model
**Tier 1 (Execution Log):**
```
2024-10-19 15:00:00 - INFO - Starting model training
2024-10-19 15:00:01 - INFO - Base model: llama3.1:8b
2024-10-19 15:00:01 - INFO - Dataset: reviews_processed_20241019.csv (10000 samples)
2024-10-19 15:00:01 - INFO - Config: {epochs: 3, batch_size: 32, learning_rate: 0.0001}
2024-10-19 15:00:01 - INFO - Epoch 1/3 starting...
2024-10-19 15:15:23 - INFO - Epoch 1/3 complete - Loss: 0.452, Val Loss: 0.389
2024-10-19 15:15:23 - INFO - Checkpoint saved: checkpoints/epoch_1.pt
...
```
**Tier 2 (Activity Log):**
```markdown
### Experiments
- Completed: Fine-tuning llama3.1 on product reviews
- 3 epochs, final loss: 0.312
- Validation accuracy improved from 82% → 91%
- Created: llama3-reviews:v1
- Next: Deploy to production agent
```
**Tier 3 (Session Log):**
```markdown
# Session: Fine-Tuning Llama 3.1 for Domain-Specific Sentiment Analysis
**Key Achievement:** Fine-tuned model improved accuracy by 9% (82% → 91%) on domain-specific reviews.
## Process
[Detailed narrative of preparation, training, evaluation]
## Key Insights
1. Domain-specific fine-tuning significantly improves accuracy
2. 3 epochs optimal (4+ epochs showed overfitting)
3. Lower learning rate (0.0001) more stable than default
## Blog Post Ideas
- "Fine-Tuning Llama 3.1: A Practical Guide"
- "When to Fine-Tune vs. Prompt Engineering"
```
**Result:** Same work, three perspectives:
- Tier 1: Technical debugging reference
- Tier 2: Quick daily summary
- Tier 3: Shareable knowledge
---
## Documentation Automation
Make documentation easy with these automation strategies:
### 1. Logging Boilerplate
```python
# scripts/utilities/setup_logging.py
import logging
from datetime import datetime
from pathlib import Path

def setup_experiment_logging(experiment_name, log_dir="logs/experiments"):
    """Set up logging for an experiment"""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    log_file = f"{log_dir}/{timestamp}-{experiment_name}.log"
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(log_file),
            logging.StreamHandler()
        ]
    )
    return logging.getLogger(__name__), log_file

# Usage
logger, log_file = setup_experiment_logging("sentiment-analysis")
logger.info("Experiment started")
```
### 2. Activity Log Helper
```bash
# Add to ~/.bashrc or create a script
alias log-activity='code ~/workspace/ACTIVITY.md'
alias today='date +%Y-%m-%d'

# Quick entry function (printf interprets \n; plain echo would print it literally)
function ml-log() {
    printf '\n## %s\n\n' "$(date +%Y-%m-%d)" >> ~/workspace/ACTIVITY.md
    printf '### %s\n- %s\n' "$1" "$2" >> ~/workspace/ACTIVITY.md
}

# Usage
ml-log "Experiments" "Completed sentiment analysis: 87.3% accuracy"
```
### 3. Session Template Generator
```bash
#!/bin/bash
# scripts/utilities/new_session.sh
SESSION_NAME=$1
SESSION_TYPE=$2
DATE=$(date +%Y-%m-%d)
FILENAME="docs/sessions/${DATE}-${SESSION_NAME}.md"
mkdir -p docs/sessions
cat > "$FILENAME" << EOF
# Session: $SESSION_NAME
**Date:** $DATE
**Session Type:** $SESSION_TYPE
**Duration:** ~X hours
**Objective:** [Fill in objective]
---
## Session Overview
[High-level summary]
**Key Achievement:** [Most important outcome]
---
## Tasks Completed
### 1. Task Name ✓
...
EOF
echo "Created session log: $FILENAME"
code "$FILENAME"
```
Usage:
```bash
./scripts/utilities/new_session.sh "sentiment-analysis" "Experiment"
```
---
## Best Practices
### For All Tiers
**Be Consistent:**
- Log daily (Tier 2)
- Use templates (Tier 3)
- Automate logging setup (Tier 1)
**Be Honest:**
- Document failures and learnings
- Record what didn't work
- Note mistakes and fixes
**Be Actionable:**
- Include next steps
- Note blockers
- Record questions
### Tier-Specific Tips
**Execution Logs (Tier 1):**
- Log everything (disk is cheap)
- Include full stack traces
- Timestamp every entry
- Archive after 30 days
**Activity Log (Tier 2):**
- Write end-of-day
- Keep entries concise (5-10 lines per section)
- Use checkboxes for next steps
- Review weekly
**Session Logs (Tier 3):**
- Write within 24 hours (while fresh)
- Include enough context for others
- Focus on "why" not just "what"
- Tag potential blog topics
---
## Common Pitfalls
**Pitfall 1: Too much documentation**
- **Problem:** Spending more time documenting than doing
- **Solution:** Start with Tier 2 only, add others as needed
**Pitfall 2: Inconsistent logging**
- **Problem:** Logging for a week, then stopping
- **Solution:** Make it easy (templates, aliases), make it habit
**Pitfall 3: Wrong detail level**
- **Problem:** Technical details in activity log, summaries in execution logs
- **Solution:** Remember tier purposes: debug, review, share
**Pitfall 4: No connection between tiers**
- **Problem:** Logs mention "experiment X" but no link to execution logs
- **Solution:** Reference log files in the activity log and cross-link freely; a small helper (sketched below) can do this automatically
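One lightweight way to keep the tiers connected is to have each experiment script append its own activity entry, including the path to its execution log, when it finishes. A minimal sketch; the helper name and paths are illustrative:
```python
from pathlib import Path

ACTIVITY_FILE = Path.home() / "workspace" / "ACTIVITY.md"

def append_activity(section, summary, log_file=None):
    """Append a short entry to ACTIVITY.md, cross-referencing the Tier 1 log."""
    lines = [f"\n### {section}", f"- {summary}"]
    if log_file:
        lines.append(f"  - Log: `{log_file}`")  # Tier 2 entry points back to Tier 1
    with open(ACTIVITY_FILE, "a") as f:
        f.write("\n".join(lines) + "\n")

# Usage: call at the end of an experiment run
append_activity(
    "Experiments",
    "Completed sentiment analysis: 87.3% accuracy",
    log_file="logs/experiments/20241019-143000-sentiment-analysis.log",
)
```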
---
## What's Next
You now have a three-tier documentation system that captures ML work at multiple levels. But documentation is only part of reproducibility—you also need systematic experiment tracking.
In **Part 3: Experiment Tracking and Reproducibility**, we'll build:
- Complete experiment templates with comprehensive READMEs
- Progress tracking systems
- Success criteria frameworks
- Results comparison tools
- Lifecycle management automation
We'll ensure every experiment is reproducible, comparable, and builds toward knowledge accumulation.
---
## Key Takeaways
- **Three tiers serve different purposes**: debugging, review, sharing
- **Execution logs** capture technical details for troubleshooting
- **Activity logs** track daily progress and build accountability
- **Session logs** transform work into shareable knowledge
- **Automation** makes documentation sustainable
- **Consistency** matters more than perfection
---
## Resources
**Templates:**
- Python logging setup (provided above)
- Activity log template (provided above)
- Session log template (provided above)
- Bash automation scripts (provided above)
**Tools:**
- Python `logging` module
- Bash aliases and functions
- Code editor integration
---
## Series Navigation
- **Previous:** [[building-production-ml-workspace-part-1-structure|Part 1: Workspace Structure]]
- **Next:** Part 3: Experiment Tracking (coming soon)
- **Series Home:** Building a Production ML Workspace on GPU Infrastructure
---
**Questions or suggestions?** Find me on Twitter [@bioinfo](https://twitter.com/bioinfo) or at [rundatarun.io](https://rundatarun.io)
---
### Related Articles
- [[building-production-ml-workspace-part-3-experiments|Building a Production ML Workspace: Part 3 - Experiment Tracking and Reproducibility]]
- [[building-production-ml-workspace-part-4-agents|Building a Production ML Workspace: Part 4 - Production-Ready AI Agent Templates]]
- [[building-production-ml-workspace-part-5-collaboration|Building a Production ML Workspace: Part 5 - Team Collaboration and Workflow Integration]]
---
<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>
<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>