# Building a Production ML Workspace: Part 5 - Team Collaboration and Workflow Integration You've built the foundation: workspace structure, documentation systems, experiment tracking, and production-ready agents. Your individual workflows are solid. But **ML development is a team sport**. When multiple researchers share GPU infrastructure, collaborate on experiments, and deploy agents to production, new challenges emerge: - How do team members discover and reuse each other's work? - How do you prevent conflicts when multiple people experiment simultaneously? - How do you maintain consistency across different developer environments? - How do you automate the workflow from experiment to production deployment? - How do you handle model versioning and Ollama model lifecycle management? This final article completes the series by showing you how to **integrate everything into a collaborative, automated workflow** that scales from solo research to full team production. <div class="callout" data-callout="info"> <div class="callout-title">About This Series</div> <div class="callout-content"> This is Part 5 (final) of a 5-part series on building production ML workspaces: - [[building-production-ml-workspace-part-1-structure|Part 1: Workspace Structure]] - [[building-production-ml-workspace-part-2-documentation|Part 2: Documentation Systems]] - [[building-production-ml-workspace-part-3-experiments|Part 3: Experiment Tracking]] - [[building-production-ml-workspace-part-4-agents|Part 4: Production-Ready AI Agent Templates]] This article ties everything together with team collaboration and workflow automation. </div> </div> --- ## The Team Collaboration Problem Individual productivity is different from team effectiveness. Here's what breaks down when teams scale: **Discovery Problems:** - "Did someone already try this approach?" - "Where's the trained model Sarah mentioned?" - "Which agent template should I start from?" - "What experiments ran last week?" **Conflict Problems:** - Two researchers overwrite each other's experiments - Model files conflict in shared directories - GPU allocation conflicts during training - Inconsistent Python environments cause "works on my machine" **Quality Problems:** - No peer review before production deployment - Undocumented experiments no one can reproduce - Agents deployed without proper testing - Configuration drift across environments **Workflow Problems:** - Manual steps between experiment and deployment - No clear promotion path from prototype to production - Unclear ownership of models and agents - No standardized release process --- ## The Integrated Workflow Solution Our solution: **Automated workflows with clear promotion paths and team visibility**. 
``` Integrated ML Workflow │ ├── Individual Development │ ├── Local experiments with tracking │ ├── Personal branches for prototypes │ ├── Automated environment setup │ └── Self-service model management │ ├── Team Collaboration │ ├── Shared experiment registry │ ├── Code review for production code │ ├── Automated testing gates │ └── Centralized model registry │ ├── Production Pipeline │ ├── Automated deployment │ ├── Model versioning │ ├── Monitoring & alerting │ └── Rollback capabilities │ └── Governance ├── Resource allocation ├── Cost tracking ├── Compliance & security └── Knowledge sharing ``` --- ## Part 1: Version Control Strategy ### Repository Structure Monorepo approach for shared workspace: ```bash ml-workspace/ ├── .git/ ├── .github/ │ └── workflows/ # CI/CD automation │ ├── experiment-validation.yml │ ├── agent-tests.yml │ └── deploy-production.yml ├── experiments/ │ ├── active/ # Current experiments │ │ └── [researcher]/ # Personal namespace │ └── archive/ # Completed experiments ├── agents/ │ ├── prototypes/ # WIP agents (no review needed) │ └── production/ # Reviewed production agents ├── models/ │ ├── registry.yaml # Model catalog │ └── checkpoints/ # Versioned model files ├── shared/ # Team utilities │ ├── tools/ # Common tools │ ├── prompts/ # Reusable prompts │ └── configs/ # Standard configs ├── docs/ │ ├── runbooks/ # Operational guides │ └── adrs/ # Architecture decisions └── scripts/ ├── setup-env.sh # Environment setup ├── sync-ollama.sh # Model sync └── deploy-agent.sh # Deployment automation ``` ### Branching Strategy **Branch Types:** ```bash main # Production-ready code ├── develop # Integration branch ├── experiment/* # Individual experiments ├── feature/* # New capabilities └── hotfix/* # Production fixes ``` **Workflow:** ```bash # Start new experiment git checkout develop git checkout -b experiment/username/model-comparison # Work on experiment # ... run experiments, document results ... # Share experiment (no merge) git push origin experiment/username/model-comparison # Promote to production (requires review) git checkout develop git merge experiment/username/model-comparison # ... PR review, tests pass ... git checkout main git merge develop ``` ### What to Commit vs. 
What to Ignore

**.gitignore Configuration:**

```bash
# Commit these:
# - Experiment code
# - Configuration files
# - Documentation
# - Small reference datasets (<10MB)
# - Model registry metadata

# Ignore these (use .gitignore):
*.pyc
__pycache__/
.ipynb_checkpoints/
*.log
.env

# Large files
*.pth
*.safetensors
*.gguf
datasets/large/
models/checkpoints/*.bin

# Experiment artifacts (tracked separately)
experiments/*/outputs/
experiments/*/runs/
mlruns/
wandb/

# Personal configs
.vscode/
.idea/
*.swp
```

**Large File Strategy:**

```bash
# Use Git LFS for model files
git lfs track "*.pth"
git lfs track "*.safetensors"

# Or use external storage with manifests
models/
├── registry.yaml        # Committed (metadata only)
└── checkpoints/
    └── .gitignore       # Ignore actual files
# Actual files stored in shared NAS or S3
```

---

## Part 2: Ollama Model Management

### Model Registry System

Centralized catalog of all models:

**models/registry.yaml:**

```yaml
models:
  llama3.1-8b-base:
    source: "ollama"
    model_name: "llama3.1:8b"
    version: "latest"
    purpose: "General purpose chat and reasoning"
    tags: ["base", "chat", "reasoning"]
    owners: ["team"]
    created: "2024-10-01"
    updated: "2024-10-15"

  medical-assistant-v2:
    source: "custom"
    base_model: "llama3.1:8b"
    modelfile: "modelfiles/medical-assistant-v2.txt"
    version: "v2.1.0"
    purpose: "Medical query assistant with RAG"
    tags: ["custom", "medical", "rag"]
    owners: ["sarah"]
    created: "2024-10-10"
    updated: "2024-10-18"
    performance:
      accuracy: 0.89
      latency_p50: "850ms"
      latency_p95: "1.2s"

  code-reviewer-v1:
    source: "custom"
    base_model: "codellama:13b"
    modelfile: "modelfiles/code-reviewer-v1.txt"
    version: "v1.0.0"
    purpose: "Code review and security analysis"
    tags: ["custom", "code", "security"]
    owners: ["john"]
    created: "2024-10-15"
    status: "production"
```

### Modelfile Version Control

**modelfiles/medical-assistant-v2.txt:**

```dockerfile
FROM llama3.1:8b

# System prompt
SYSTEM """You are a medical research assistant with expertise in clinical trials
and biomedical literature. Provide accurate, evidence-based responses with
citations when possible. Always acknowledge uncertainty."""

# Parameters optimized for medical domain
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1

# Custom stop sequences
PARAMETER stop "###"
PARAMETER stop "<END>"

# Template for structured responses
TEMPLATE """### Question:
{{ .Prompt }}

### Response:
{{ .Response }}

### Confidence: [High/Medium/Low]
### Citations: [If applicable]
"""
```

### Model Sync Script

**scripts/sync-ollama.sh:**

```bash
#!/bin/bash
# Sync Ollama models across team based on registry

set -e

REGISTRY_FILE="models/registry.yaml"
MODELFILES_DIR="modelfiles"

echo "🔄 Syncing Ollama models from registry..."

# Parse registry and pull/build models
# (Requires mikefarah yq v4 for YAML parsing)

# Pull base models
echo "📥 Pulling base models..."
yq eval '.models[] | select(.source == "ollama") | .model_name' "$REGISTRY_FILE" | \
while read -r model; do
    echo "  Pulling $model..."
    ollama pull "$model"
done

# Build custom models from Modelfiles
# Custom models are named by their registry key, so iterate over entries
echo "🔨 Building custom models..."
yq eval '.models | to_entries | .[] | select(.value.source == "custom") | [.key, .value.modelfile] | @tsv' "$REGISTRY_FILE" | \
while IFS=$'\t' read -r name modelfile; do
    if [ -f "$modelfile" ]; then
        echo "  Building $name from $modelfile..."
        ollama create "$name" -f "$modelfile"
    else
        echo "  ⚠️ Modelfile not found: $modelfile"
    fi
done

echo "✅ Model sync complete!"
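# List the locally installed models so each teammate can confirm the sync
# matches the registry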
echo "" echo "Available models:" ollama list ``` ### Model Lifecycle Management **Model States:** ``` Development → Testing → Staging → Production → Deprecated ↓ ↓ ↓ ↓ ↓ Experiment Validation Preview Live Archived ``` **Promotion Script:** ```bash #!/bin/bash # scripts/promote-model.sh MODEL_NAME=$1 FROM_ENV=$2 TO_ENV=$3 echo "🚀 Promoting model: $MODEL_NAME" echo " From: $FROM_ENV → To: $TO_ENV" # Validation checks case $TO_ENV in testing) echo "✓ Running unit tests..." pytest tests/models/test_${MODEL_NAME}.py ;; staging) echo "✓ Running integration tests..." pytest tests/integration/test_${MODEL_NAME}_integration.py echo "✓ Performance benchmarks..." python scripts/benchmark-model.py "$MODEL_NAME" ;; production) echo "✓ Final validation..." python scripts/validate-production-ready.py "$MODEL_NAME" echo "✓ Creating backup..." # Backup current production model ;; esac # Update registry echo "📝 Updating registry..." python scripts/update-registry.py "$MODEL_NAME" --environment "$TO_ENV" echo "✅ Promotion complete!" ``` --- ## Part 3: Experiment Collaboration ### Shared Experiment Registry Team dashboard for all experiments: **scripts/generate-experiment-dashboard.py:** ```python #!/usr/bin/env python3 """Generate team experiment dashboard""" import yaml import json from pathlib import Path from datetime import datetime, timedelta import pandas as pd def scan_experiments(): """Scan all experiments and build registry""" experiments = [] exp_dir = Path("experiments/active") for researcher_dir in exp_dir.iterdir(): if not researcher_dir.is_dir(): continue researcher = researcher_dir.name for exp_dir in researcher_dir.iterdir(): metadata_file = exp_dir / "metadata.yaml" if not metadata_file.exists(): continue with open(metadata_file) as f: meta = yaml.safe_load(f) experiments.append({ "researcher": researcher, "experiment": exp_dir.name, "goal": meta.get("goal", "N/A"), "status": meta.get("status", "unknown"), "created": meta.get("created"), "updated": meta.get("updated"), "tags": meta.get("tags", []), "best_metric": meta.get("results", {}).get("best_metric") }) return experiments def generate_dashboard(experiments): """Generate HTML dashboard""" df = pd.DataFrame(experiments) html = f""" <html> <head><title>Team Experiments Dashboard</title></head> <body> <h1>ML Team Experiments</h1> <p>Last updated: {datetime.now().strftime('%Y-%m-%d %H:%M')}</p> <h2>Active Experiments ({len(df)})</h2> {df.to_html(index=False)} <h2>Recent Activity</h2> {df.sort_values('updated', ascending=False).head(10).to_html(index=False)} </body> </html> """ with open("docs/experiment-dashboard.html", "w") as f: f.write(html) print("✅ Dashboard generated: docs/experiment-dashboard.html") if __name__ == "__main__": experiments = scan_experiments() generate_dashboard(experiments) ``` ### Experiment Handoff Process When handing off experiments between team members: **1. 
Document thoroughly:**

````markdown
# experiments/active/sarah/medical-rag/HANDOFF.md

## Experiment Handoff

**From:** Sarah Johnson
**To:** Mike Chen
**Date:** 2024-10-19

### Current Status
- Completed initial RAG pipeline with llama3.1:8b
- Best accuracy: 89% on validation set
- Main bottleneck: Retrieval latency (avg 850ms)

### What Works
- Document chunking strategy (500 tokens, 50 overlap)
- Embedding model: all-MiniLM-L6-v2
- Reranking with cross-encoder significantly improves results

### What Doesn't Work
- ChromaDB too slow for >100K documents
- Need better medical entity recognition
- Current prompt struggles with complex multi-hop questions

### Next Steps
1. Try Qdrant or Milvus for vector store
2. Fine-tune NER model on medical corpus
3. Implement query decomposition for complex questions

### Files to Review
- `src/rag_pipeline.py` - Main pipeline
- `experiments/results/analysis.ipynb` - Performance analysis
- `docs/architecture.md` - System design

### How to Run
```bash
conda activate medical-rag
python src/train.py --config configs/baseline.yaml
python src/evaluate.py --checkpoint outputs/best_model.pth
```

### Questions?
Slack: @sarah or email: [email protected]
````

**2. Pair programming session:**
- 30-60 minute walkthrough
- Run the experiment together
- Answer questions in real-time

**3. Update documentation:**
- Ensure README is current
- Add inline comments for complex logic
- Update architecture diagrams

---

## Part 4: Environment Consistency

### Reproducible Environments

**environment.yaml (conda):**

```yaml
name: ml-workspace
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - pytorch=2.1.0
  - pytorch-cuda=12.1
  - transformers=4.35.0
  - numpy=1.24.0
  - pandas=2.1.0
  - scikit-learn=1.3.0
  - jupyter=1.0.0
  - pip
  - pip:
      - ollama==0.1.7
      - chromadb==0.4.15
      - langchain==0.0.335
      - wandb==0.15.12
```

**Setup Script:**

```bash
#!/bin/bash
# scripts/setup-env.sh

echo "🔧 Setting up ML workspace environment..."

# Check prerequisites
command -v conda >/dev/null 2>&1 || {
    echo "❌ Conda not found. Please install Miniconda/Anaconda first."
    exit 1
}

command -v ollama >/dev/null 2>&1 || {
    echo "❌ Ollama not found. Please install from ollama.ai"
    exit 1
}

# Create conda environment
echo "📦 Creating conda environment..."
conda env create -f environment.yaml

# Activate environment
echo "🔄 Activating environment..."
eval "$(conda shell.bash hook)"
conda activate ml-workspace

# Sync Ollama models
echo "🤖 Syncing Ollama models..."
bash scripts/sync-ollama.sh

# Initialize experiment tracking
echo "📊 Setting up experiment tracking..."
python scripts/init-mlflow.py

# Verify installation
echo "✅ Verifying installation..."
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"
ollama list

echo ""
echo "✅ Setup complete!"
echo "   Activate with: conda activate ml-workspace"
```

### Docker for Production Agents

**agents/production/[agent]/Dockerfile:**

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install Ollama client
RUN pip install ollama

# Copy application
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
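# Note: config.yaml and logs/ are also mounted at runtime via docker-compose
# (see docker-compose.yml below), so they can be updated without rebuilding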
# Health check HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \ CMD python -c "import requests; requests.get('http://localhost:8000/health')" # Run agent server CMD ["python", "-m", "agent.server"] ``` **docker-compose.yml:** ```yaml version: '3.8' services: agent: build: . container_name: medical-assistant ports: - "8000:8000" environment: - OLLAMA_HOST=http://ollama:11434 - LOG_LEVEL=INFO volumes: - ./logs:/app/logs - ./config.yaml:/app/config.yaml depends_on: - ollama restart: unless-stopped ollama: image: ollama/ollama:latest container_name: ollama-server ports: - "11434:11434" volumes: - ollama-data:/root/.ollama deploy: resources: reservations: devices: - driver: nvidia count: 1 capabilities: [gpu] volumes: ollama-data: ``` --- ## Part 5: CI/CD Automation ### GitHub Actions for Experiments **.github/workflows/experiment-validation.yml:** ```yaml name: Validate Experiment on: push: paths: - 'experiments/**' pull_request: paths: - 'experiments/**' jobs: validate: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Set up Python uses: actions/setup-python@v4 with: python-version: '3.11' - name: Check metadata exists run: | python scripts/validate-experiment-metadata.py - name: Verify documentation run: | python scripts/check-experiment-docs.py - name: Run linting run: | pip install ruff ruff check experiments/ - name: Test experiment code run: | pip install pytest pytest experiments/*/tests/ -v ``` ### Agent Testing Pipeline **.github/workflows/agent-tests.yml:** ```yaml name: Agent Tests on: pull_request: paths: - 'agents/production/**' jobs: test: runs-on: ubuntu-latest services: ollama: image: ollama/ollama:latest ports: - 11434:11434 steps: - uses: actions/checkout@v3 - name: Set up Python uses: actions/setup-python@v4 with: python-version: '3.11' - name: Install dependencies run: | pip install -r agents/production/${{ github.event.pull_request.head.ref }}/requirements.txt pip install pytest pytest-cov - name: Pull required models run: | ollama pull llama3.1:8b - name: Run unit tests run: | pytest agents/production/*/tests/ \ --cov=agents/production \ --cov-report=html \ --cov-fail-under=80 - name: Run integration tests run: | pytest agents/production/*/tests/test_integration.py -v - name: Security scan run: | pip install bandit bandit -r agents/production/ - name: Upload coverage uses: codecov/codecov-action@v3 with: files: ./coverage.xml ``` ### Production Deployment Pipeline **.github/workflows/deploy-production.yml:** ```yaml name: Deploy to Production on: push: branches: - main paths: - 'agents/production/**' jobs: deploy: runs-on: ubuntu-latest environment: production steps: - uses: actions/checkout@v3 - name: Build Docker image run: | cd agents/production/$AGENT_NAME docker build -t $REGISTRY/$AGENT_NAME:$VERSION . 
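          # AGENT_NAME, REGISTRY, and VERSION are assumed to be supplied as
          # workflow-level env vars or repository variables (not shown here)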
- name: Run smoke tests run: | docker run --rm $REGISTRY/$AGENT_NAME:$VERSION python -m pytest tests/smoke/ - name: Push to registry run: | echo ${{ secrets.REGISTRY_TOKEN }} | docker login -u ${{ secrets.REGISTRY_USER }} --password-stdin docker push $REGISTRY/$AGENT_NAME:$VERSION - name: Deploy to Kubernetes run: | kubectl set image deployment/$AGENT_NAME \ $AGENT_NAME=$REGISTRY/$AGENT_NAME:$VERSION - name: Verify deployment run: | kubectl rollout status deployment/$AGENT_NAME kubectl get pods -l app=$AGENT_NAME - name: Smoke test production run: | python scripts/smoke-test-prod.py $AGENT_NAME ``` --- ## Part 6: Team Workflows ### Daily Standup Dashboard **scripts/generate-standup.py:** ```python #!/usr/bin/env python3 """Generate daily standup report""" from datetime import datetime, timedelta import subprocess import yaml from pathlib import Path def get_recent_commits(): """Get commits from last 24 hours""" yesterday = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d') result = subprocess.run( ['git', 'log', f'--since={yesterday}', '--pretty=format:%an|%s'], capture_output=True, text=True ) commits = [line.split('|') for line in result.stdout.strip().split('\n') if line] return commits def get_active_experiments(): """Get experiments updated in last 24 hours""" experiments = [] exp_dir = Path("experiments/active") for researcher_dir in exp_dir.iterdir(): for exp in researcher_dir.iterdir(): metadata = exp / "metadata.yaml" if not metadata.exists(): continue mtime = datetime.fromtimestamp(metadata.stat().st_mtime) if mtime > datetime.now() - timedelta(days=1): with open(metadata) as f: meta = yaml.safe_load(f) experiments.append({ 'researcher': researcher_dir.name, 'experiment': exp.name, 'status': meta.get('status'), 'goal': meta.get('goal') }) return experiments def generate_report(): """Generate standup report""" print("📊 Daily Standup Report") print(f"📅 {datetime.now().strftime('%Y-%m-%d')}") print("=" * 50) print("\n🚀 Recent Commits:") commits = get_recent_commits() for author, message in commits[:10]: print(f" • {author}: {message}") print("\n🧪 Active Experiments:") experiments = get_active_experiments() for exp in experiments: print(f" • {exp['researcher']}: {exp['experiment']}") print(f" Goal: {exp['goal']}") print(f" Status: {exp['status']}") print("\n📈 MLflow Experiments:") # Query MLflow for recent runs print(" Run: python scripts/query-mlflow.py --since yesterday") if __name__ == "__main__": generate_report() ``` ### Weekly Team Review **scripts/weekly-review.py:** ```python #!/usr/bin/env python3 """Generate weekly team review""" import pandas as pd from datetime import datetime, timedelta from pathlib import Path import yaml import json def analyze_experiments(): """Analyze experiment activity""" week_ago = datetime.now() - timedelta(days=7) stats = { 'total_experiments': 0, 'completed': 0, 'in_progress': 0, 'by_researcher': {} } exp_dir = Path("experiments/active") for researcher_dir in exp_dir.iterdir(): researcher = researcher_dir.name stats['by_researcher'][researcher] = 0 for exp in researcher_dir.iterdir(): metadata = exp / "metadata.yaml" if not metadata.exists(): continue with open(metadata) as f: meta = yaml.safe_load(f) created = datetime.fromisoformat(meta.get('created', '2000-01-01')) if created > week_ago: stats['total_experiments'] += 1 stats['by_researcher'][researcher] += 1 status = meta.get('status') if status == 'completed': stats['completed'] += 1 elif status == 'in_progress': stats['in_progress'] += 1 return stats def analyze_models(): """Analyze 
model registry""" with open("models/registry.yaml") as f: registry = yaml.safe_load(f) return { 'total_models': len(registry.get('models', {})), 'production_models': sum( 1 for m in registry.get('models', {}).values() if m.get('status') == 'production' ), 'custom_models': sum( 1 for m in registry.get('models', {}).values() if m.get('source') == 'custom' ) } def generate_report(): """Generate comprehensive weekly report""" print("📊 Weekly Team Review") print(f"📅 Week ending {datetime.now().strftime('%Y-%m-%d')}") print("=" * 70) exp_stats = analyze_experiments() print("\n🧪 Experiment Activity:") print(f" Total new experiments: {exp_stats['total_experiments']}") print(f" Completed: {exp_stats['completed']}") print(f" In progress: {exp_stats['in_progress']}") print("\n By researcher:") for researcher, count in exp_stats['by_researcher'].items(): print(f" • {researcher}: {count} experiments") model_stats = analyze_models() print("\n🤖 Model Registry:") print(f" Total models: {model_stats['total_models']}") print(f" Production models: {model_stats['production_models']}") print(f" Custom models: {model_stats['custom_models']}") print("\n📈 Key Metrics:") print(" GPU Utilization: [Query from monitoring]") print(" Agent Uptime: [Query from deployment]") print(" Experiment Success Rate: [Calculate from results]") print("\n💡 Recommendations:") print(" • [Auto-generated based on patterns]") print(" • Review abandoned experiments") print(" • Share successful experiment patterns") if __name__ == "__main__": generate_report() ``` ### Code Review Guidelines **docs/runbooks/code-review-checklist.md:** ```markdown # Code Review Checklist ## For Experiments ### Required - [ ] Metadata file exists and is complete - [ ] README documents experiment goal and methodology - [ ] Configuration is externalized (no hardcoded values) - [ ] Dependencies listed in requirements.txt - [ ] Results directory structure follows template - [ ] Code runs without errors ### Recommended - [ ] Inline comments explain complex logic - [ ] Visualization notebooks for results - [ ] Performance metrics documented - [ ] Comparison with baseline ## For Production Agents ### Critical (Must Pass) - [ ] All tests pass (unit, integration, behavior) - [ ] Test coverage >80% - [ ] No security vulnerabilities (bandit scan passes) - [ ] Complete README with usage examples - [ ] Configuration externalized - [ ] Logging comprehensive - [ ] Error handling robust - [ ] Dockerfile builds successfully ### Important (Should Pass) - [ ] Code follows team style guide - [ ] Docstrings for all public functions - [ ] Type hints for function signatures - [ ] Performance benchmarks run - [ ] Metrics collection implemented - [ ] Health check endpoint works ### Nice to Have - [ ] Architecture diagram included - [ ] Troubleshooting guide in README - [ ] Example use cases demonstrated - [ ] Monitoring dashboards defined ## Review Process 1. **Self Review**: Author completes checklist before PR 2. **Peer Review**: Team member reviews code 3. **Testing**: CI/CD pipeline validates automatically 4. **Approval**: 1 approval required for experiments, 2 for production agents 5. 
**Merge**: Squash and merge with descriptive message
```

---

## Part 7: Resource Management

### GPU Allocation

**scripts/check-gpu-usage.sh:**

```bash
#!/bin/bash
# Check GPU usage and availability

echo "🎮 GPU Resource Status"
echo "====================="

# Check NVIDIA GPUs
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
    --format=csv,noheader,nounits | \
while IFS=',' read -r index name util mem_used mem_total; do
    echo "GPU $index: $name"
    echo "  Utilization: ${util}%"
    echo "  Memory: ${mem_used}MB / ${mem_total}MB"

    # Check if GPU is idle (< 20% utilization)
    if [ "$util" -lt 20 ]; then
        echo "  Status: ✅ Available"
    else
        echo "  Status: 🔴 Busy"
        # Show which process is using it
        nvidia-smi --query-compute-apps=pid,process_name,used_memory \
            --format=csv,noheader | grep -v "^$" | \
        while IFS=',' read -r pid process mem; do
            user=$(ps -o user= -p "$pid" 2>/dev/null)
            echo "    Process: $process (PID: $pid, User: $user)"
        done
    fi
    echo ""
done
```

### Resource Reservation System

**scripts/reserve-gpu.py:**

```python
#!/usr/bin/env python3
"""GPU reservation system"""

import json
import sys
from datetime import datetime, timedelta
from pathlib import Path

RESERVATIONS_FILE = "shared/gpu-reservations.json"

def load_reservations():
    """Load current reservations"""
    if Path(RESERVATIONS_FILE).exists():
        with open(RESERVATIONS_FILE) as f:
            return json.load(f)
    return {"reservations": []}

def save_reservations(data):
    """Save reservations"""
    with open(RESERVATIONS_FILE, 'w') as f:
        json.dump(data, f, indent=2)

def reserve_gpu(gpu_id, user, duration_hours, purpose):
    """Reserve a GPU"""
    data = load_reservations()

    # Check if GPU is already reserved
    now = datetime.now()
    for res in data['reservations']:
        if res['gpu_id'] == gpu_id:
            end_time = datetime.fromisoformat(res['end_time'])
            if end_time > now:
                print(f"❌ GPU {gpu_id} is reserved by {res['user']} until {end_time}")
                return False

    # Create reservation
    reservation = {
        'gpu_id': gpu_id,
        'user': user,
        'purpose': purpose,
        'start_time': now.isoformat(),
        'end_time': (now + timedelta(hours=duration_hours)).isoformat()
    }

    data['reservations'].append(reservation)
    save_reservations(data)

    print(f"✅ GPU {gpu_id} reserved for {user}")
    print(f"   Duration: {duration_hours} hours")
    print(f"   Until: {reservation['end_time']}")
    return True

def list_reservations():
    """List all current reservations"""
    data = load_reservations()
    now = datetime.now()

    print("📅 Current GPU Reservations")
    print("=" * 50)

    active_reservations = [
        res for res in data['reservations']
        if datetime.fromisoformat(res['end_time']) > now
    ]

    if not active_reservations:
        print("No active reservations")
        return

    for res in active_reservations:
        end = datetime.fromisoformat(res['end_time'])
        remaining = end - now
        print(f"GPU {res['gpu_id']}: {res['user']}")
        print(f"  Purpose: {res['purpose']}")
        print(f"  Time remaining: {remaining}")
        print("")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage:")
        print("  reserve-gpu.py list")
        print("  reserve-gpu.py reserve <gpu_id> <user> <hours> <purpose>")
        sys.exit(1)

    command = sys.argv[1]

    if command == "list":
        list_reservations()
    elif command == "reserve":
        if len(sys.argv) != 6:
            print("Usage: reserve-gpu.py reserve <gpu_id> <user> <hours> <purpose>")
            sys.exit(1)
        gpu_id = int(sys.argv[2])
        user = sys.argv[3]
        hours = int(sys.argv[4])
        purpose = sys.argv[5]
        reserve_gpu(gpu_id, user, hours, purpose)
```

### Cost Tracking

**scripts/track-compute-costs.py:**

```python
#!/usr/bin/env python3
"""Track compute costs"""

import json
from datetime import datetime, timedelta
from pathlib import Path # Cost per GPU hour (example pricing) GPU_COST_PER_HOUR = { 'A100': 3.00, 'A6000': 2.00, 'RTX4090': 1.50 } def calculate_experiment_cost(experiment_dir): """Calculate cost for an experiment""" metadata_file = Path(experiment_dir) / "metadata.yaml" if not metadata_file.exists(): return 0.0 # Parse training time from logs or metadata # This is simplified - you'd parse actual logs training_hours = 2.5 # Example gpu_type = "A100" # From metadata cost = training_hours * GPU_COST_PER_HOUR.get(gpu_type, 1.0) return cost def generate_cost_report(): """Generate monthly cost report""" print("💰 Compute Cost Report") print("=" * 50) total_cost = 0.0 costs_by_user = {} exp_dir = Path("experiments/active") for researcher_dir in exp_dir.iterdir(): researcher = researcher_dir.name costs_by_user[researcher] = 0.0 for exp in researcher_dir.iterdir(): cost = calculate_experiment_cost(exp) costs_by_user[researcher] += cost total_cost += cost print(f"Total cost: ${total_cost:.2f}") print("\nBy researcher:") for user, cost in sorted(costs_by_user.items(), key=lambda x: x[1], reverse=True): print(f" {user}: ${cost:.2f}") if __name__ == "__main__": generate_cost_report() ``` --- ## Part 8: Complete Workflow Example Let's see how everything fits together with a real workflow from experiment to production. ### Scenario: Building a Medical Q&A Agent **Day 1-2: Initial Experiment** ```bash # Sarah starts new experiment cd ml-workspace git checkout develop git checkout -b experiment/sarah/medical-qa # Setup experiment mkdir -p experiments/active/sarah/medical-qa cd experiments/active/sarah/medical-qa # Initialize experiment cat > metadata.yaml <<EOF experiment_id: "medical-qa-v1" researcher: "sarah" created: "2024-10-19" updated: "2024-10-19" status: "in_progress" goal: "Build RAG-based medical Q&A system" hypothesis: "llama3.1:8b with medical corpus RAG can answer clinical questions" tags: ["rag", "medical", "qa"] mlflow_experiment: "medical-qa" EOF # Create README cat > README.md <<EOF # Medical Q&A Experiment ## Goal Build RAG system for medical question answering. ## Approach 1. Index medical corpus (PubMed abstracts) 2. Implement retrieval with ChromaDB 3. Test llama3.1:8b with retrieved context 4. Evaluate on MedQA benchmark ## Running \`\`\`bash python train.py --config config.yaml python evaluate.py --checkpoint outputs/best_model.pth \`\`\` EOF # Write experiment code # ... implement RAG pipeline ... # Track with MLflow python train.py # Commit and share git add . 
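# Heavy artifacts (outputs/, checkpoints, mlruns/) stay out of git via the
# workspace .gitignore; only code, configs, and metadata are committed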
git commit -m "Initial medical QA RAG experiment" git push origin experiment/sarah/medical-qa ``` **Day 3-4: Iteration and Results** ```bash # Run multiple experiments mlflow ui # View results # Best config identified # - Chunk size: 500 tokens # - Retrieval: top-5 with reranking # - Model: llama3.1:8b, temp=0.3 # Document results cat > results/summary.md <<EOF # Results Summary ## Best Configuration - Accuracy: 89% - Latency p50: 850ms - Retrieval precision: 0.85 ## Conclusion Ready for prototype agent deployment EOF # Update metadata # status: completed ``` **Day 5: Promote to Prototype Agent** ```bash # Create prototype agent cd agents/prototypes/ mkdir medical-qa cd medical-qa # Copy experiment code cp -r ../../../experiments/active/sarah/medical-qa/src/* ./ # Create agent wrapper cat > agent.py <<EOF """Medical QA Agent - Prototype""" from rag_pipeline import MedicalRAG class MedicalQAAgent: def __init__(self): self.rag = MedicalRAG(model="llama3.1:8b") def answer_question(self, question): context = self.rag.retrieve(question) answer = self.rag.generate(question, context) return answer EOF # Test prototype python test_agent.py # Share with team git add . git commit -m "Add medical QA prototype agent" git push origin experiment/sarah/medical-qa ``` **Week 2: Production Promotion (Team Review)** ```bash # Mike volunteers to productionize git checkout experiment/sarah/medical-qa cd agents/production/ # Use production template cp -r templates/agent-template medical-qa-agent cd medical-qa-agent # Fill out complete structure # - Add comprehensive tests # - Create Dockerfile # - Write complete README # - Add configuration # - Implement monitoring # Create PR git checkout -b feature/medical-qa-production git add . git commit -m "Production-ready medical QA agent - Complete test suite (85% coverage) - Dockerized with health checks - Comprehensive documentation - Monitoring and metrics - Performance benchmarks" git push origin feature/medical-qa-production # Open PR on GitHub # - CI/CD runs automatically # - Tests pass # - Sarah reviews and approves # - Merge to develop ``` **Week 3: Production Deployment** ```bash # Merge to main triggers deployment git checkout main git merge develop # GitHub Actions: # 1. Build Docker image # 2. Run tests # 3. Push to registry # 4. Deploy to Kubernetes # 5. Run smoke tests # Agent now live at https://api.company.com/medical-qa # Update model registry cat >> models/registry.yaml <<EOF medical-qa-v1: source: "custom" base_model: "llama3.1:8b" version: "v1.0.0" purpose: "Medical question answering with RAG" status: "production" owners: ["sarah", "mike"] performance: accuracy: 0.89 latency_p50: "850ms" EOF ``` ### Timeline Summary - **Days 1-4**: Individual experiment (fast iteration) - **Day 5**: Prototype agent (share with team) - **Week 2**: Production hardening (quality focus) - **Week 3**: Deployment (automated pipeline) **Total: 3 weeks from idea to production** --- ## Best Practices ### For Individual Researchers **1. Document as you go** - Write README first (defines what you're building) - Update metadata.yaml daily - Commit frequently with good messages - Screenshot interesting results **2. Follow templates** - Use experiment template - Use agent template for production - Don't skip required fields - Consistency helps others help you **3. Share early** - Push experiments even if incomplete - Ask for code review on tricky parts - Present results in team meetings - Write handoff docs when switching projects ### For Team Leads **1. 
Automate everything** - Use CI/CD for validation - Generate dashboards automatically - Automate environment setup - Script common workflows **2. Maintain standards** - Enforce code review for production - Require tests for production agents - Keep templates updated - Document architecture decisions **3. Foster collaboration** - Weekly demo sessions - Pair programming for complex features - Shared Slack channel for questions - Regular retrospectives ### For Production Deployment **1. Progressive rollout** - Deploy to staging first - Canary deployment (10% → 50% → 100%) - Monitor metrics closely - Have rollback plan ready **2. Observability** - Comprehensive logging - Metrics dashboards - Alerting on errors - Performance tracking **3. Documentation** - Runbooks for operations - Troubleshooting guides - Architecture diagrams - API documentation --- ## Common Pitfalls to Avoid ### Don't ❌ **Skip documentation** - Future you (and your team) will regret it ❌ **Commit large binary files** - Use Git LFS or external storage ❌ **Work directly on main** - Always use feature branches ❌ **Deploy without tests** - Tests prevent production disasters ❌ **Hardcode secrets** - Use environment variables or secret management ❌ **Ignore failed CI** - Fix immediately, don't accumulate debt ❌ **Deploy without monitoring** - You need visibility in production ❌ **Skip code review** - Fresh eyes catch bugs and improve design ### Do ✅ **Use templates consistently** - Reduces cognitive load ✅ **Automate repetitive tasks** - Scripts save time and reduce errors ✅ **Version everything** - Code, models, prompts, configs ✅ **Communicate proactively** - Share blockers and successes ✅ **Test thoroughly** - Unit, integration, and end-to-end ✅ **Monitor production** - Metrics, logs, alerts ✅ **Document decisions** - ADRs explain the "why" ✅ **Iterate quickly** - Prototype → Test → Production --- ## What You've Built By completing this series, you now have: **Foundation** (Part 1): - Well-organized workspace structure - Separation of experiments, agents, and documentation - Clear file naming conventions - Shared utilities and templates **Knowledge System** (Part 2): - Memory bank for context persistence - Learning logs for insights - Project documentation - Architecture decision records **Experiment Tracking** (Part 3): - MLflow integration - Reproducible experiments - Version-controlled configurations - Result visualization **Production Agents** (Part 4): - Two-track system (prototype/production) - Standardized agent structure - Comprehensive testing - Deployment readiness **Team Collaboration** (Part 5): - Version control strategy - Model registry and lifecycle - CI/CD automation - Resource management - Complete workflows --- ## Series Conclusion You've built a **production-ready ML workspace** that scales from individual research to full team collaboration. This system: **Accelerates Development:** - Templates reduce setup time - Automation removes manual steps - Clear structure reduces decisions - Reusable components speed development **Improves Quality:** - Tests catch bugs early - Code review improves design - Standards ensure consistency - Monitoring catches production issues **Enables Collaboration:** - Shared structure everyone understands - Version control prevents conflicts - Documentation enables handoffs - Registry provides discoverability **Scales with Your Team:** - Works for solo researchers - Supports small teams - Scales to larger organizations - Adapts to changing needs --- ## Key Takeaways 1. 
**Structure enables speed** - Good organization removes friction 2. **Automate everything** - Scripts and CI/CD save countless hours 3. **Documentation is infrastructure** - Undocumented work is wasted work 4. **Version all artifacts** - Code, models, prompts, configs 5. **Test before production** - Tests prevent disasters 6. **Monitor what matters** - You can't improve what you don't measure 7. **Collaborate deliberately** - Clear processes enable teamwork 8. **Iterate continuously** - Prototype → Test → Improve → Repeat --- ## Next Steps **Immediate:** 1. Set up your workspace structure (Part 1) 2. Initialize version control 3. Create first experiment with template 4. Set up MLflow tracking **Short-term (1-2 weeks):** 1. Build memory bank system 2. Create agent templates 3. Implement basic CI/CD 4. Set up model registry **Long-term (1-3 months):** 1. Establish team workflows 2. Build automation scripts 3. Deploy first production agent 4. Iterate based on team feedback **Ongoing:** 1. Refine templates based on learnings 2. Update documentation regularly 3. Share successful patterns 4. Continuously improve automation --- ## Resources **Code Examples:** - All scripts from this series (GitHub link) - Template repository - Example experiments - Sample agents **Further Reading:** - [[AI Development & Agents/agent-architecture-patterns|Agent Architecture Patterns]] - [[AI Systems & Architecture/ml-system-design|ML System Design]] - [[Practical Applications/experiment-tracking-best-practices|Experiment Tracking Best Practices]] **Tools:** - MLflow for experiment tracking - Ollama for local LLMs - Docker for containerization - GitHub Actions for CI/CD --- ## Series Navigation - **Part 1:** [[building-production-ml-workspace-part-1-structure|Workspace Structure]] - **Part 2:** [[building-production-ml-workspace-part-2-documentation|Documentation Systems]] - **Part 3:** [[building-production-ml-workspace-part-3-experiments|Experiment Tracking]] - **Part 4:** [[building-production-ml-workspace-part-4-agents|Production-Ready AI Agents]] - **Part 5:** Team Collaboration and Workflow Integration (this article) --- **Questions or want to share your workspace setup?** Find me on Twitter [@bioinfo](https://twitter.com/bioinfo) or at [rundatarun.io](https://rundatarun.io) **Thank you for following this series!** I hope it helps you build better ML workflows and more effective teams. --- ### Related Articles - [[building-production-ml-workspace-part-4-agents|Building a Production ML Workspace: Part 4 - Production-Ready AI Agent Templates]] - [[building-production-ml-workspace-part-2-documentation|Building a Production ML Workspace: Part 2 - Documentation Systems That Scale]] - [[building-production-ml-workspace-part-3-experiments|Building a Production ML Workspace: Part 3 - Experiment Tracking and Reproducibility]] --- <p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p> <p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>