# Building a Production ML Workspace: Part 2 - Documentation Systems That Scale

You've built a clean workspace structure. Your directories are organized, your experiments have homes, and your Ollama models are properly categorized. But there's a problem:

**Three months from now, you won't remember why you ran that experiment.**

You won't recall which hyperparameters worked, what insights you gained, or why Model A outperformed Model B. Without documentation, your past work becomes a black box—inaccessible and unreproducible.

This article shows you how to build a three-tier documentation system that captures ML work at different levels of detail, serves different audiences, and transforms your experiments into shareable knowledge.

<div class="callout" data-callout="info">
<div class="callout-title">About This Series</div>
<div class="callout-content">

This is Part 2 of a 5-part series on building production ML workspaces.

Previous parts:
- [[building-production-ml-workspace-part-1-structure|Part 1: Workspace Structure]]

Coming next:
- **Part 3**: Experiment Tracking
- **Part 4**: Agent Templates
- **Part 5**: Ollama Integration

</div>
</div>

---

## The Documentation Problem

Most ML practitioners face these challenges:

**The Short-Term Problem:**
- Which model did I use for that benchmark?
- What dataset preprocessing steps did I apply?
- Why did this experiment fail?
- How do I reproduce these results?

**The Long-Term Problem:**
- What have I learned this month?
- Which approaches consistently work?
- What patterns am I seeing across projects?
- How do I share this knowledge?

**The Communication Problem:**
- How do I explain my work to teammates?
- How do I create blog posts from experiments?
- How do I document insights without losing technical details?

**Single-level documentation can't solve all these.** Execution logs are too detailed for review. High-level summaries lack debugging context. You need different documentation for different purposes.

---

## The Three-Tier Documentation System

Our solution: Three complementary documentation levels, each serving specific needs:

```
Tier 1: EXECUTION LOGS
├─ Purpose: Technical debugging and troubleshooting
├─ Audience: You (future debugging)
├─ Detail Level: High (full stack traces, timing, errors)
└─ Location: ~/workspace/logs/

Tier 2: ACTIVITY LOG
├─ Purpose: Daily progress tracking and review
├─ Audience: You (weekly review, accountability)
├─ Detail Level: Medium (summaries, key decisions, outcomes)
└─ Location: ~/workspace/ACTIVITY.md

Tier 3: SESSION LOGS
├─ Purpose: Knowledge sharing and blog content
├─ Audience: Others (teammates, blog readers)
├─ Detail Level: Narrative (context, process, insights)
└─ Location: ~/docs/sessions/ or blog drafts/
```

Each tier captures the same work from a different perspective. Let's build each one.

---

## Tier 1: Execution Logs

Execution logs capture technical details for debugging. When something breaks at 2 AM, you'll be grateful these exist.
### Log Structure

```bash
~/workspace/logs/
├── experiments/     # Experiment execution logs
│   └── 20241019-143000-sentiment-analysis.log
├── training/        # Model training logs
│   └── 20241019-155500-llama-finetuning.log
└── agents/          # Agent execution logs
    └── 20241020-090000-customer-agent.log
```

### What to Log

**For experiments:**
- Timestamp of each step
- Model and version used
- Hyperparameters and configuration
- Dataset information
- Execution time for each step
- Memory usage and GPU utilization
- Errors and warnings (with full stack traces)
- Output file locations

**For training:**
- Training configuration
- Epoch-by-epoch metrics
- Loss curves
- Validation metrics
- Checkpoint locations
- Hardware utilization
- Training time

**For agents:**
- Input/output for each interaction
- Tool calls and responses
- Decision reasoning
- Latency per operation
- Errors and retry logic
- Token usage

### Example Execution Log

```
2024-10-19 14:30:00 - INFO - Experiment: sentiment-analysis
2024-10-19 14:30:00 - INFO - Model: llama3.1:8b
2024-10-19 14:30:01 - INFO - Dataset: reviews_processed_20241019.csv (10000 samples)
2024-10-19 14:30:01 - INFO - Config: {temperature: 0.3, max_tokens: 100}
2024-10-19 14:30:01 - INFO - Starting inference...
2024-10-19 14:30:15 - INFO - Batch 1/100 complete (14.2s, 7.0 tokens/sec)
2024-10-19 14:30:28 - INFO - Batch 2/100 complete (13.1s, 7.6 tokens/sec)
...
2024-10-19 14:52:45 - INFO - Inference complete (22m 44s)
2024-10-19 14:52:45 - INFO - Results saved: results/20241019-sentiment-analysis/
2024-10-19 14:52:45 - INFO - Accuracy: 87.3%, F1: 0.852
2024-10-19 14:52:45 - INFO - Peak GPU memory: 6.2GB
```

### Python Logging Template

```python
import logging
from datetime import datetime

# Setup logging (experiment_name, model_name, etc. come from your experiment code)
log_file = f"logs/experiments/{datetime.now().strftime('%Y%m%d-%H%M%S')}-{experiment_name}.log"

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file),
        logging.StreamHandler()  # Also print to console
    ]
)

logger = logging.getLogger(__name__)

# Usage in your code
logger.info(f"Experiment: {experiment_name}")
logger.info(f"Model: {model_name}")
logger.info(f"Dataset: {dataset_name} ({len(dataset)} samples)")

try:
    results = run_experiment()
    logger.info(f"Results: {results}")
except Exception as e:
    logger.error(f"Experiment failed: {str(e)}", exc_info=True)
```

### Best Practices

- **Structured logging:** Use consistent formats for parsing
- **Timestamps:** Every log entry needs a timestamp
- **Context:** Log enough to reproduce the exact scenario
- **Errors:** Always include full stack traces
- **Rotation:** Archive old logs periodically (> 30 days); see the sketch below
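Archiving is the one practice that tends to get skipped, so it's worth scripting. Here's a minimal sketch that sweeps stale `.log` files into an archive folder; the `archive_old_logs` helper, the `logs/archive/` destination, and the 30-day cutoff are illustrative assumptions you can adjust to your own layout.

```python
# scripts/utilities/archive_logs.py (illustrative helper, not part of the workspace yet)
import shutil
import time
from pathlib import Path

def archive_old_logs(log_root="logs", archive_dir="logs/archive", max_age_days=30):
    """Move .log files older than max_age_days into an archive directory."""
    cutoff = time.time() - max_age_days * 86400
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)

    for log_file in Path(log_root).rglob("*.log"):
        # Skip files already inside the archive directory
        if archive in log_file.parents:
            continue
        if log_file.stat().st_mtime < cutoff:
            # Preserve the subfolder (experiments/, training/, agents/) inside the archive
            target = archive / log_file.relative_to(log_root)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(log_file), str(target))
            print(f"Archived: {log_file} -> {target}")

if __name__ == "__main__":
    archive_old_logs()
```

Run it from the workspace root, either by hand or from a weekly cron entry.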
---

## Tier 2: Activity Log

The activity log is your daily work journal. It answers "What did I do today?" and creates accountability.

### ACTIVITY.md Structure

Location: `~/workspace/ACTIVITY.md`

```markdown
# ML Workspace Activity Log

Quick daily summaries of ML development work.
For detailed technical logs, see logs/ directory.

---

## 2024-10-19

### Models
- Pulled llama3.1:8b from Ollama
- Created custom model: llama3-sentiment:v1 (temperature: 0.3, system prompt optimized)
- Tested phi3:mini for code generation tasks

### Experiments
- Started: 20241019-143000-sentiment-analysis
  - Goal: Evaluate llama3.1 on product reviews
  - Status: Complete
  - Result: 87.3% accuracy, 0.852 F1 score
  - Insight: Lower temperature (0.3) significantly improved consistency

### Agents
- Built prototype customer-support-agent
  - Uses llama3-sentiment:v1
  - Integrated with knowledge base tool
  - Tested with 50 sample queries
  - Next: Add conversation history

### Datasets
- Processed: reviews_raw_20241019.csv → reviews_processed_20241019.csv
  - Cleaned 10,000 product reviews
  - Removed duplicates, normalized text
  - Added sentiment labels from manual annotation

### Code
- Created: scripts/utilities/preprocess_reviews.py
- Updated: templates/experiment/README.md with sentiment analysis template

### Learning
- Discovery: Llama3.1 at temp 0.3 gives more consistent sentiment predictions than 0.7
- Insight: Batch processing (100 samples/batch) optimal for GPU utilization
- Note: Fine-tuning might help with domain-specific language

### Next Steps
- [ ] Fine-tune llama3 on domain-specific reviews
- [ ] Expand customer-support-agent with multi-turn conversations
- [ ] Create comparison benchmark: llama3.1 vs phi3 vs mistral

---

## 2024-10-18

### Models
- Initial Ollama setup
- Pulled: llama3.1:8b, phi3:mini, mistral:7b

### Workspace
- Created complete workspace structure (30+ directories)
- Set up logging infrastructure
- Created experiment and agent templates

### Learning
- Ollama model management best practices
- Workspace organization patterns for ML

### Next Steps
- [ ] Run first sentiment analysis experiment
- [ ] Create custom Ollama model for specific domain

---
```

### What to Include

**Models:**
- Models pulled, created, or tested
- Custom model configurations
- Performance observations

**Experiments:**
- Experiments started or completed
- Key results and metrics
- Important insights

**Agents:**
- Agent development progress
- Features added or tested
- Integration work

**Datasets:**
- Data processing done
- Dataset creation or cleaning
- Annotations added

**Code:**
- Scripts created or updated
- Tools built
- Automation added

**Learning:**
- Insights gained
- Patterns discovered
- Mistakes made (and learned from)

**Next Steps:**
- Priorities for tomorrow/next session
- Blockers to resolve
- Ideas to explore

### Benefits of Activity Logging

**Accountability:**
- End each day reviewing what was done
- Identify if time was well-spent
- Maintain momentum

**Pattern Recognition:**
- See which approaches work consistently
- Identify time sinks
- Track skill development

**Blog Content:**
- Daily entries become session summaries
- Insights turn into article topics
- Natural documentation of journey

**External Memory:**
- Quickly recall "what did I do last week?" (see the review sketch below)
- Find experiments by date
- Track project evolution
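Because every entry starts with a `## YYYY-MM-DD` header, the weekly review can be scripted. Here's a minimal sketch, assuming that header format and the `~/workspace/ACTIVITY.md` location; the seven-day window and the `print_recent_entries` name are just illustrative.

```python
# scripts/utilities/weekly_review.py (illustrative; assumes '## YYYY-MM-DD' day headers)
from datetime import date, timedelta
from pathlib import Path

def print_recent_entries(activity_file="~/workspace/ACTIVITY.md", days=7):
    """Print the activity-log entries whose day header falls within the last `days` days."""
    cutoff = date.today() - timedelta(days=days)
    keep = False
    for line in Path(activity_file).expanduser().read_text().splitlines():
        if line.startswith("## "):
            try:
                keep = date.fromisoformat(line[3:].strip()) >= cutoff
            except ValueError:
                keep = False  # a '##' heading that isn't a date ends the current entry
        if keep:
            print(line)

if __name__ == "__main__":
    print_recent_entries()
```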
---

## Tier 3: Session Logs

Session logs are narrative documentation for sharing knowledge. They're the bridge between your work and blog posts.

### When to Create Session Logs

Not every day needs a session log. Create them when:

- You complete a significant experiment
- You build something worth sharing
- You learn something important
- You solve a challenging problem
- You have insights worth documenting

### Session Log Structure

Location: `~/docs/sessions/` or blog `drafts/`

````markdown
# Session: [Descriptive Title]

**Date:** YYYY-MM-DD
**Session Type:** [Setup/Experiment/Development/Optimization]
**Duration:** ~X hours
**Objective:** [What you set out to accomplish]

---

## Session Overview

[High-level summary of what happened and why it matters]

**Key Achievement:** [Most important outcome in one sentence]

---

## Tasks Completed

### 1. [Task Name] ✓

**Task:** [What needed to be done]

**Process:**
- Step 1
- Step 2
- Step 3

**Execution:**
```bash
# Commands or code used
```

**Details:**
- Important details
- Decisions made
- Challenges encountered

**Outcome:** [What happened]

### 2. [Next Task] ✓

[Same structure]

---

## Key Decisions & Rationale

### Decision 1: [Decision Name]

**Decision:** [What you decided]

**Rationale:**
- Reason 1
- Reason 2

**Alternative Considered:** [What else you thought about]

**Rejected Because:** [Why you didn't choose it]

---

## Technical Insights

### Insight 1: [Insight Name]

**Discovery:** [What you learned]

**Impact:**
- How this affects your work
- Why this matters

**Evidence:** [Data or observations]

---

## Results & Metrics

[Quantitative results if applicable]

- Metric 1: Value
- Metric 2: Value
- Performance: Details

---

## Lessons Learned

1. **Lesson 1:** [What you learned]
   - Why this matters
   - How to apply it

2. **Lesson 2:** [Another learning]

---

## Next Steps & Recommendations

**Immediate:**
1. Action item 1
2. Action item 2

**Future:**
- Future direction 1
- Future direction 2

---

## Blog Post Ideas

**Post 1:** "[Title]"
- Angle: [How to approach the topic]
- Target Audience: [Who this helps]
- Key Takeaways: [Main points]

---

## Resources Referenced

- Link 1
- Link 2
- Documentation reference

---

## Summary

[Wrap up with key achievements, success metrics, and what made this session valuable]
````

### Example: From Activity Log to Session Log

**Activity Log Entry (Tier 2):**

```markdown
## 2024-10-19

### Experiments
- Completed: 20241019-143000-sentiment-analysis
  - Result: 87.3% accuracy, 0.852 F1 score
  - Insight: Lower temperature improved consistency
```

**Session Log (Tier 3):**

```markdown
# Session: Sentiment Analysis with Llama 3.1

**Key Achievement:** Built production-ready sentiment analysis pipeline achieving 87.3% accuracy with optimized inference.
[Full narrative with context, process, insights, and lessons learned]

**Blog Post Ideas:**
- "Building a Production Sentiment Analyzer with Llama 3.1"
- "Temperature Tuning for Consistent LLM Predictions"
```

### Session Log Benefits

**Knowledge Sharing:**
- Teammates learn from your process
- Context for future you
- Blog-ready content

**Reproducibility:**
- Detailed enough to recreate
- Includes rationale for decisions
- Documents what worked and why

**Portfolio Building:**
- Demonstrates skills and thinking
- Shows problem-solving approach
- Creates shareable artifacts

---

## Integration: How the Three Tiers Work Together

Here's how a real ML project flows through all three tiers:

### Example: Training a Custom Model

**Tier 1 (Execution Log):**

```
2024-10-19 15:00:00 - INFO - Starting model training
2024-10-19 15:00:01 - INFO - Base model: llama3.1:8b
2024-10-19 15:00:01 - INFO - Dataset: reviews_processed_20241019.csv (10000 samples)
2024-10-19 15:00:01 - INFO - Config: {epochs: 3, batch_size: 32, learning_rate: 0.0001}
2024-10-19 15:00:01 - INFO - Epoch 1/3 starting...
2024-10-19 15:15:23 - INFO - Epoch 1/3 complete - Loss: 0.452, Val Loss: 0.389
2024-10-19 15:15:23 - INFO - Checkpoint saved: checkpoints/epoch_1.pt
...
```

**Tier 2 (Activity Log):**

```markdown
### Experiments
- Completed: Fine-tuning llama3.1 on product reviews
  - 3 epochs, final loss: 0.312
  - Validation accuracy improved from 82% → 91%
  - Created: llama3-reviews:v1
  - Next: Deploy to production agent
```

**Tier 3 (Session Log):**

```markdown
# Session: Fine-Tuning Llama 3.1 for Domain-Specific Sentiment Analysis

**Key Achievement:** Fine-tuned model improved accuracy by 9 percentage points (82% → 91%) on domain-specific reviews.

## Process
[Detailed narrative of preparation, training, evaluation]

## Key Insights
1. Domain-specific fine-tuning significantly improves accuracy
2. 3 epochs optimal (4+ epochs showed overfitting)
3. Lower learning rate (0.0001) more stable than default

## Blog Post Ideas
- "Fine-Tuning Llama 3.1: A Practical Guide"
- "When to Fine-Tune vs. Prompt Engineering"
```

**Result:** Same work, three perspectives:

- Tier 1: Technical debugging reference
- Tier 2: Quick daily summary
- Tier 3: Shareable knowledge

---

## Documentation Automation

Make documentation easy with these automation strategies:

### 1. Logging Boilerplate

```python
# scripts/utilities/setup_logging.py
import logging
from datetime import datetime
from pathlib import Path

def setup_experiment_logging(experiment_name, log_dir="logs/experiments"):
    """Set up logging for an experiment"""
    Path(log_dir).mkdir(parents=True, exist_ok=True)

    timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    log_file = f"{log_dir}/{timestamp}-{experiment_name}.log"

    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(log_file),
            logging.StreamHandler()
        ]
    )

    return logging.getLogger(__name__), log_file

# Usage
logger, log_file = setup_experiment_logging("sentiment-analysis")
logger.info("Experiment started")
```

### 2. Activity Log Helper

```bash
# Add to ~/.bashrc or create script
alias log-activity='code ~/workspace/ACTIVITY.md'
alias today='date +%Y-%m-%d'

# Quick entry function (printf interprets \n; plain echo would write it literally)
function ml-log() {
    printf "\n## %s\n\n" "$(date +%Y-%m-%d)" >> ~/workspace/ACTIVITY.md
    printf "### %s\n- %s\n" "$1" "$2" >> ~/workspace/ACTIVITY.md
}

# Usage
ml-log "Experiments" "Completed sentiment analysis: 87.3% accuracy"
```
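If you'd rather write the entry from inside your experiment code, the same idea works in Python. The helper below is a sketch, not one of the workspace scripts above: the `append_activity` name and the optional execution-log argument are illustrative, and it simply appends a dated section like `ml-log` does. Passing the Tier 1 log path gives you the cross-linking habit recommended under Common Pitfalls below.

```python
# scripts/utilities/activity.py (hypothetical helper; pairs a Tier 2 entry with its Tier 1 log)
from datetime import date
from pathlib import Path

ACTIVITY_FILE = Path("~/workspace/ACTIVITY.md").expanduser()

def append_activity(section, summary, log_file=None):
    """Append a dated section with a one-line summary and an optional execution-log path."""
    entry = f"\n## {date.today().isoformat()}\n\n### {section}\n- {summary}\n"
    if log_file:
        entry += f"  - Execution log: {log_file}\n"  # cross-link Tier 1 from Tier 2
    with ACTIVITY_FILE.open("a") as f:
        f.write(entry)

# Usage, e.g. at the end of an experiment script:
# append_activity("Experiments", "Completed sentiment analysis: 87.3% accuracy", log_file)
```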
### 3. Session Template Generator

```bash
#!/bin/bash
# scripts/utilities/new_session.sh

SESSION_NAME=$1
SESSION_TYPE=$2
DATE=$(date +%Y-%m-%d)
FILENAME="docs/sessions/${DATE}-${SESSION_NAME}.md"

mkdir -p "$(dirname "$FILENAME")"

cat > "$FILENAME" << EOF
# Session: $SESSION_NAME

**Date:** $DATE
**Session Type:** $SESSION_TYPE
**Duration:** ~X hours
**Objective:** [Fill in objective]

---

## Session Overview

[High-level summary]

**Key Achievement:** [Most important outcome]

---

## Tasks Completed

### 1. Task Name ✓
...
EOF

echo "Created session log: $FILENAME"
code "$FILENAME"
```

Usage:

```bash
./scripts/utilities/new_session.sh "sentiment-analysis" "Experiment"
```

---

## Best Practices

### For All Tiers

**Be Consistent:**
- Log daily (Tier 2)
- Use templates (Tier 3)
- Automate logging setup (Tier 1)

**Be Honest:**
- Document failures and learnings
- Record what didn't work
- Note mistakes and fixes

**Be Actionable:**
- Include next steps
- Note blockers
- Record questions

### Tier-Specific Tips

**Execution Logs (Tier 1):**
- Log everything (disk is cheap)
- Include full stack traces
- Timestamp every entry
- Archive after 30 days

**Activity Log (Tier 2):**
- Write end-of-day
- Keep entries concise (5-10 lines per section)
- Use checkboxes for next steps
- Review weekly

**Session Logs (Tier 3):**
- Write within 24 hours (while fresh)
- Include enough context for others
- Focus on "why" not just "what"
- Tag potential blog topics

---

## Common Pitfalls

**Pitfall 1: Too much documentation**
- **Problem:** Spending more time documenting than doing
- **Solution:** Start with Tier 2 only, add others as needed

**Pitfall 2: Inconsistent logging**
- **Problem:** Logging for a week, then stopping
- **Solution:** Make it easy (templates, aliases), make it habit

**Pitfall 3: Wrong detail level**
- **Problem:** Technical details in activity log, summaries in execution logs
- **Solution:** Remember tier purposes: debug, review, share

**Pitfall 4: No connection between tiers**
- **Problem:** Logs mention "experiment X" but no link to execution logs
- **Solution:** Reference log files in activity log, cross-link freely

---

## What's Next

You now have a three-tier documentation system that captures ML work at multiple levels. But documentation is only part of reproducibility—you also need systematic experiment tracking.

In **Part 3: Experiment Tracking and Reproducibility**, we'll build:

- Complete experiment templates with comprehensive READMEs
- Progress tracking systems
- Success criteria frameworks
- Results comparison tools
- Lifecycle management automation

We'll ensure every experiment is reproducible, comparable, and builds toward knowledge accumulation.
---

## Key Takeaways

- **Three tiers serve different purposes**: debugging, review, sharing
- **Execution logs** capture technical details for troubleshooting
- **Activity logs** track daily progress and build accountability
- **Session logs** transform work into shareable knowledge
- **Automation** makes documentation sustainable
- **Consistency** matters more than perfection

---

## Resources

**Templates:**
- Python logging setup (provided above)
- Activity log template (provided above)
- Session log template (provided above)
- Bash automation scripts (provided above)

**Tools:**
- Python `logging` module
- Bash aliases and functions
- Code editor integration

---

## Series Navigation

- **Previous:** [[building-production-ml-workspace-part-1-structure|Part 1: Workspace Structure]]
- **Next:** Part 3: Experiment Tracking (coming soon)
- **Series Home:** Building a Production ML Workspace on GPU Infrastructure

---

**Questions or suggestions?** Find me on Twitter [@bioinfo](https://twitter.com/bioinfo) or at [rundatarun.io](https://rundatarun.io)

---

### Related Articles

- [[building-production-ml-workspace-part-3-experiments|Building a Production ML Workspace: Part 3 - Experiment Tracking and Reproducibility]]
- [[building-production-ml-workspace-part-4-agents|Building a Production ML Workspace: Part 4 - Production-Ready AI Agent Templates]]
- [[building-production-ml-workspace-part-5-collaboration|Building a Production ML Workspace: Part 5 - Team Collaboration and Workflow Integration]]

---

<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>

<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>