Practical Applications · October 19, 2025 · 14 min read

Building a Production ML Workspace: Part 5 - Team Collaboration and Workflow Integration

You've built the foundation: workspace structure, documentation systems, experiment tracking, and production-ready agents. Your individual workflows are solid. But ML development is a team sport.

When multiple researchers share GPU infrastructure, collaborate on experiments, and deploy agents to production, new challenges emerge:

How do team members discover and reuse each other's work?
How do you prevent conflicts when multiple people experiment simultaneously?
How do you maintain consistency across different developer environments?
How do you automate the workflow from experiment to production deployment?
How do you handle model versioning and Ollama model lifecycle management?

This final article completes the series by showing you how to integrate everything into a collaborative, automated workflow that scales from solo research to full team production.

About This Series

This is Part 5, the final installment of a 5-part series on building production ML workspaces. It ties everything together with team collaboration and workflow automation.

The Team Collaboration Problem

Individual productivity is different from team effectiveness. Here's what breaks down when teams scale:

Discovery Problems:

"Did someone already try this approach?"
"Where's the trained model Sarah mentioned?"
"Which agent template should I start from?"
"What experiments ran last week?"

Conflict Problems:

Two researchers overwrite each other's experiments
Model files conflict in shared directories
GPU allocation conflicts during training
Inconsistent Python environments cause "works on my machine"

Quality Problems:

No peer review before production deployment
Undocumented experiments no one can reproduce
Agents deployed without proper testing
Configuration drift across environments

Workflow Problems:

Manual steps between experiment and deployment
No clear promotion path from prototype to production
Unclear ownership of models and agents
No standardized release process

The Integrated Workflow Solution

Our solution: Automated workflows with clear promotion paths and team visibility.

Integrated ML Workflow
│
├── Individual Development
│   ├── Local experiments with tracking
│   ├── Personal branches for prototypes
│   ├── Automated environment setup
│   └── Self-service model management
│
├── Team Collaboration
│   ├── Shared experiment registry
│   ├── Code review for production code
│   ├── Automated testing gates
│   └── Centralized model registry
│
├── Production Pipeline
│   ├── Automated deployment
│   ├── Model versioning
│   ├── Monitoring & alerting
│   └── Rollback capabilities
│
└── Governance
    ├── Resource allocation
    ├── Cost tracking
    ├── Compliance & security
    └── Knowledge sharing

Part 1: Version Control Strategy

Repository Structure

Monorepo approach for shared workspace:

ml-workspace/
├── .git/
├── .github/
│   └── workflows/          # CI/CD automation
│       ├── experiment-validation.yml
│       ├── agent-tests.yml
│       └── deploy-production.yml
├── experiments/
│   ├── active/            # Current experiments
│   │   └── [researcher]/  # Personal namespace
│   └── archive/           # Completed experiments
├── agents/
│   ├── prototypes/        # WIP agents (no review needed)
│   └── production/        # Reviewed production agents
├── models/
│   ├── registry.yaml     # Model catalog
│   └── checkpoints/      # Versioned model files
├── shared/               # Team utilities
│   ├── tools/           # Common tools
│   ├── prompts/         # Reusable prompts
│   └── configs/         # Standard configs
├── docs/
│   ├── runbooks/        # Operational guides
│   └── adrs/            # Architecture decisions
└── scripts/
    ├── setup-env.sh     # Environment setup
    ├── sync-ollama.sh   # Model sync
    └── deploy-agent.sh  # Deployment automation

Branching Strategy

Branch Types:

main                    # Production-ready code
├── develop            # Integration branch
├── experiment/*       # Individual experiments
├── feature/*         # New capabilities
└── hotfix/*          # Production fixes

Workflow:

# Start new experiment
git checkout develop
git checkout -b experiment/username/model-comparison

# Work on experiment
# ... run experiments, document results ...

# Share experiment (no merge)
git push origin experiment/username/model-comparison

# Promote to production (requires review)
git checkout develop
git merge experiment/username/model-comparison
# ... PR review, tests pass ...
git checkout main
git merge develop
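A naming convention like this is easier to keep when a script checks it, for example in CI or a pre-push hook. The following is a minimal sketch, not part of the workspace scripts: the function name `is_valid_branch` and the allowed character set are assumptions, and the `experiment/<user>/<topic>` shape is inferred from the commands above.

```python
import re

# Allowed branch shapes, inferred from the branch list above. The exact
# character set (lowercase letters, digits, '.', '_', '-') is an assumption.
_SEGMENT = r"[a-z0-9][a-z0-9._-]*"
BRANCH_PATTERNS = [
    re.compile(r"main|develop"),
    re.compile(rf"experiment/{_SEGMENT}/{_SEGMENT}"),  # experiment/<user>/<topic>
    re.compile(rf"(feature|hotfix)/{_SEGMENT}"),
]


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name matches one of the team conventions."""
    return any(pattern.fullmatch(name) for pattern in BRANCH_PATTERNS)
```

Calling `is_valid_branch("experiment/sarah/medical-qa")` returns `True`, while a branch missing the researcher namespace, such as `experiment/medical-qa`, is rejected.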

What to Commit vs. What to Ignore

.gitignore Configuration:

# Commit these:
# - Experiment code
# - Configuration files
# - Documentation
# - Small reference datasets (<10MB)
# - Model registry metadata

# Ignore these (use .gitignore):
*.pyc
__pycache__/
.ipynb_checkpoints/
*.log
.env

# Large files
*.pth
*.safetensors
*.gguf
datasets/large/
models/checkpoints/*.bin

# Experiment artifacts (tracked separately)
experiments/**/outputs/
experiments/**/runs/
mlruns/
wandb/

# Personal configs
.vscode/
.idea/
*.swp

Large File Strategy:

# Use Git LFS for model files
# (remove these patterns from .gitignore first; ignored files are never staged)
git lfs track "*.pth"
git lfs track "*.safetensors"

# Or use external storage with manifests
models/
├── registry.yaml       # Committed (metadata only)
└── checkpoints/
    └── .gitignore      # Ignore actual files
    # Actual files stored in shared NAS or S3
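When checkpoints live outside Git, a small manifest of sizes and checksums lets teammates confirm that the file they downloaded is the one the registry describes. Here is one possible sketch; the function names and the manifest layout are assumptions rather than part of the workspace scripts.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GB checkpoints never load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(checkpoint_dir: Path) -> dict:
    """Map each checkpoint file name to its size and SHA-256 digest."""
    manifest = {}
    for path in sorted(checkpoint_dir.iterdir()):
        # Skip subdirectories and the .gitignore that keeps the files out of Git
        if path.is_file() and path.name != ".gitignore":
            manifest[path.name] = {
                "size_bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            }
    return manifest
```

The resulting dictionary can be committed next to `registry.yaml` (for example as JSON) while the checkpoint files themselves stay on the NAS or in S3.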

Part 2: Ollama Model Management

Model Registry System

Centralized catalog of all models:

models/registry.yaml:

models:
  llama3.1-8b-base:
    source: "ollama"
    model_name: "llama3.1:8b"
    version: "latest"
    purpose: "General purpose chat and reasoning"
    tags: ["base", "chat", "reasoning"]
    owners: ["team"]
    created: "2024-10-01"
    updated: "2024-10-15"

  medical-assistant-v2:
    source: "custom"
    base_model: "llama3.1:8b"
    modelfile: "modelfiles/medical-assistant-v2.txt"
    version: "v2.1.0"
    purpose: "Medical query assistant with RAG"
    tags: ["custom", "medical", "rag"]
    owners: ["sarah"]
    created: "2024-10-10"
    updated: "2024-10-18"
    performance:
      accuracy: 0.89
      latency_p50: "850ms"
      latency_p95: "1.2s"

  code-reviewer-v1:
    source: "custom"
    base_model: "codellama:13b"
    modelfile: "modelfiles/code-reviewer-v1.txt"
    version: "v1.0.0"
    purpose: "Code review and security analysis"
    tags: ["custom", "code", "security"]
    owners: ["john"]
    created: "2024-10-15"
    status: "production"

Modelfile Version Control

modelfiles/medical-assistant-v2.txt:

FROM llama3.1:8b

# System prompt
SYSTEM """You are a medical research assistant with expertise in clinical
trials and biomedical literature. Provide accurate, evidence-based responses
with citations when possible. Always acknowledge uncertainty."""

# Parameters optimized for medical domain
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1

# Custom stop sequences
PARAMETER stop "###"
PARAMETER stop "<END>"

# Template for structured responses
TEMPLATE """### Question: {{ .Prompt }}

### Response:
{{ .Response }}

### Confidence: [High/Medium/Low]
### Citations: [If applicable]
"""

Model Sync Script

scripts/sync-ollama.sh:

#!/bin/bash
# Sync Ollama models across team based on registry

set -e

REGISTRY_FILE="models/registry.yaml"
MODELFILES_DIR="modelfiles"

echo "🔄 Syncing Ollama models from registry..."

# Parse registry and pull/build models
# (Requires yq for YAML parsing)

# Pull base models
echo "📥 Pulling base models..."
yq eval '.models[] | select(.source == "ollama") | .model_name' "$REGISTRY_FILE" | \
while read -r model; do
    echo "  Pulling $model..."
    ollama pull "$model"
done

# Build custom models from Modelfiles
echo "🔨 Building custom models..."
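# Custom entries have no model_name field, so the registry key is used as the Ollama model name
yq eval '.models | to_entries | .[] | select(.value.source == "custom") | [.key, .value.modelfile] | @tsv' "$REGISTRY_FILE" | \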
while IFS=$'\t' read -r name modelfile; do
    if [ -f "$modelfile" ]; then
        echo "  Building $name from $modelfile..."
        ollama create "$name" -f "$modelfile"
    else
        echo "  ⚠️  Modelfile not found: $modelfile"
    fi
done

echo "✅ Model sync complete!"
echo ""
echo "Available models:"
ollama list

Model Lifecycle Management

Model States:

Development → Testing → Staging → Production → Deprecated
     ↓           ↓         ↓          ↓            ↓
 Experiment   Validation  Preview   Live      Archived
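The lifecycle above implies that a model should not jump from Development straight to Production. A minimal sketch of that rule follows; the function name `can_promote` and the choice to allow deprecation from any stage are assumptions, since the diagram only shows the linear path.

```python
# Stage order taken from the lifecycle diagram above.
STAGES = ["development", "testing", "staging", "production", "deprecated"]


def can_promote(from_env: str, to_env: str) -> bool:
    """Allow moving exactly one stage forward, or deprecating from any stage."""
    if from_env not in STAGES or to_env not in STAGES:
        return False
    if to_env == "deprecated":
        # Assumption: any live stage may be retired directly
        return from_env != "deprecated"
    return STAGES.index(to_env) == STAGES.index(from_env) + 1
```

A promotion script could call a check like this before running any tests, so that an invalid transition is rejected early.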

Promotion Script:

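#!/bin/bash
# scripts/promote-model.sh

# Abort on the first failed check so a failing test never reaches the registry update
set -euo pipefail

if [ "$#" -ne 3 ]; then
    echo "Usage: $0 <model-name> <from-env> <to-env>" >&2
    exit 1
fi

MODEL_NAME=$1
FROM_ENV=$2
TO_ENV=$3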

echo "🚀 Promoting model: $MODEL_NAME"
echo "   From: $FROM_ENV → To: $TO_ENV"

# Validation checks
case $TO_ENV in
    testing)
        echo "✓ Running unit tests..."
        pytest tests/models/test_${MODEL_NAME}.py
        ;;
    staging)
        echo "✓ Running integration tests..."
        pytest tests/integration/test_${MODEL_NAME}_integration.py
        echo "✓ Performance benchmarks..."
        python scripts/benchmark-model.py "$MODEL_NAME"
        ;;
    production)
        echo "✓ Final validation..."
        python scripts/validate-production-ready.py "$MODEL_NAME"
        echo "✓ Creating backup..."
        # Backup current production model
        ;;
esac

# Update registry
echo "📝 Updating registry..."
python scripts/update-registry.py "$MODEL_NAME" --environment "$TO_ENV"

echo "✅ Promotion complete!"

Part 3: Experiment Collaboration

Shared Experiment Registry

Team dashboard for all experiments:

scripts/generate-experiment-dashboard.py:

#!/usr/bin/env python3
"""Generate team experiment dashboard"""

import yaml
import json
from pathlib import Path
from datetime import datetime, timedelta
import pandas as pd

def scan_experiments():
    """Scan all experiments and build registry"""
    experiments = []

    exp_dir = Path("experiments/active")
    for researcher_dir in exp_dir.iterdir():
        if not researcher_dir.is_dir():
            continue

        researcher = researcher_dir.name

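        # Use a distinct name so the outer exp_dir is not shadowed
        for experiment_dir in researcher_dir.iterdir():
            metadata_file = experiment_dir / "metadata.yaml"
            if not metadata_file.exists():
                continue

            with open(metadata_file) as f:
                # An empty metadata file loads as None; fall back to an empty dict
                meta = yaml.safe_load(f) or {}

            experiments.append({
                "researcher": researcher,
                "experiment": experiment_dir.name,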
                "goal": meta.get("goal", "N/A"),
                "status": meta.get("status", "unknown"),
                "created": meta.get("created"),
                "updated": meta.get("updated"),
                "tags": meta.get("tags", []),
                "best_metric": meta.get("results", {}).get("best_metric")
            })

    return experiments

def generate_dashboard(experiments):
    """Generate HTML dashboard"""
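    # Fixing the columns keeps sort_values('updated') working when no experiments exist
    columns = ["researcher", "experiment", "goal", "status",
               "created", "updated", "tags", "best_metric"]
    df = pd.DataFrame(experiments, columns=columns)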

    html = f"""
    <html>
    <head><title>Team Experiments Dashboard</title></head>
    <body>
    <h1>ML Team Experiments</h1>
    <p>Last updated: {datetime.now().strftime('%Y-%m-%d %H:%M')}</p>

    <h2>Active Experiments ({len(df)})</h2>
    {df.to_html(index=False)}

    <h2>Recent Activity</h2>
    {df.sort_values('updated', ascending=False).head(10).to_html(index=False)}

    </body>
    </html>
    """

    with open("docs/experiment-dashboard.html", "w") as f:
        f.write(html)

    print("✅ Dashboard generated: docs/experiment-dashboard.html")

if __name__ == "__main__":
    experiments = scan_experiments()
    generate_dashboard(experiments)
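The same records can answer the discovery question "Did someone already try this approach?" without opening the HTML page. A minimal sketch, assuming records shaped like the dictionaries built by `scan_experiments()`; the function name `find_experiments` is hypothetical.

```python
def find_experiments(experiments, tag=None, text=None):
    """Filter experiment records by tag and/or a case-insensitive goal substring."""
    matches = []
    for exp in experiments:
        if tag is not None and tag not in (exp.get("tags") or []):
            continue
        if text is not None and text.lower() not in str(exp.get("goal", "")).lower():
            continue
        matches.append(exp)
    return matches
```

For example, `find_experiments(scan_experiments(), tag="rag")` would list every active experiment tagged `rag`, regardless of who owns it.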

Experiment Handoff Process

When handing off experiments between team members:

1. Document thoroughly:

# experiments/active/sarah/medical-rag/HANDOFF.md

## Experiment Handoff

**From:** Sarah Johnson
**To:** Mike Chen
**Date:** 2024-10-19

### Current Status
- Completed initial RAG pipeline with llama3.1:8b
- Best accuracy: 89% on validation set
- Main bottleneck: Retrieval latency (avg 850ms)

### What Works
- Document chunking strategy (500 tokens, 50 overlap)
- Embedding model: all-MiniLM-L6-v2
- Reranking with cross-encoder significantly improves results

### What Doesn't Work
- ChromaDB too slow for >100K documents
- Need better medical entity recognition
- Current prompt struggles with complex multi-hop questions

### Next Steps
1. Try Qdrant or Milvus for vector store
2. Fine-tune NER model on medical corpus
3. Implement query decomposition for complex questions

### Files to Review
- `src/rag_pipeline.py` - Main pipeline
- `experiments/results/analysis.ipynb` - Performance analysis
- `docs/architecture.md` - System design

### How to Run
```bash
conda activate medical-rag
python src/train.py --config configs/baseline.yaml
python src/evaluate.py --checkpoint outputs/best_model.pth
```

### Questions?

Slack: @sarah or email: sarah@company.com


2. Pair programming session:

30-60 minute walkthrough
Run the experiment together
Answer questions in real-time

3. Update documentation:

Ensure README is current
Add inline comments for complex logic
Update architecture diagrams

Part 4: Environment Consistency

Reproducible Environments

environment.yaml (conda):

name: ml-workspace
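# pytorch-cuda is published on the pytorch channel and pulls CUDA libraries from nvidia
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults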
dependencies:
  - python=3.11
  - pytorch=2.1.0
  - pytorch-cuda=12.1
  - transformers=4.35.0
  - numpy=1.24.0
  - pandas=2.1.0
  - scikit-learn=1.3.0
  - jupyter=1.0.0
  - pip
  - pip:
      - ollama==0.1.7
      - chromadb==0.4.15
      - langchain==0.0.335
      - wandb==0.15.12
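Pinned versions only help if each machine actually matches them. The sketch below compares a set of pins with what is installed in the active environment, using the standard library's `importlib.metadata`. The function name `find_version_drift` is hypothetical, and the pins have to be passed in by the caller; parsing them out of `environment.yaml` is left out.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def find_version_drift(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (pinned, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    drift = {}
    for package, pinned in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            drift[package] = (pinned, installed)
    return drift
```

A call such as `find_version_drift({"transformers": "4.35.0", "chromadb": "0.4.15"})` returns an empty dictionary when the environment matches. Note that the lookup uses distribution names as pip sees them, so the conda package `pytorch` would be checked as `torch`.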

Setup Script:

#!/bin/bash
# scripts/setup-env.sh

echo "🔧 Setting up ML workspace environment..."

# Check prerequisites
command -v conda >/dev/null 2>&1 || {
    echo "❌ Conda not found. Please install Miniconda/Anaconda first."
    exit 1
}

command -v ollama >/dev/null 2>&1 || {
    echo "❌ Ollama not found. Please install from ollama.ai"
    exit 1
}

# Create conda environment
echo "📦 Creating conda environment..."
conda env create -f environment.yaml

# Activate environment
echo "🔄 Activating environment..."
eval "$(conda shell.bash hook)"
conda activate ml-workspace

# Sync Ollama models
echo "🤖 Syncing Ollama models..."
bash scripts/sync-ollama.sh

# Initialize experiment tracking
echo "📊 Setting up experiment tracking..."
python scripts/init-mlflow.py

# Verify installation
echo "✅ Verifying installation..."
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"
ollama list

echo ""
echo "✅ Setup complete!"
echo "   Activate with: conda activate ml-workspace"

Docker for Production Agents

agents/production/[agent]/Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install Ollama client
RUN pip install ollama

# Copy application
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Health check
# curl -f exits non-zero on HTTP errors, so a failing /health marks the container unhealthy
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -fsS http://localhost:8000/health || exit 1

# Run agent server
CMD ["python", "-m", "agent.server"]
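The health check above assumes the agent server answers on `/health` at port 8000. The actual `agent.server` module is not shown in this article, so the following is only a standard-library sketch of what such an endpoint could look like; `health_response`, `HealthHandler`, and `make_server` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Tuple


def health_response(path: str) -> Tuple[int, bytes]:
    """Return (status code, JSON body) for a request path."""
    if path == "/health":
        return 200, json.dumps({"status": "ok"}).encode()
    return 404, json.dumps({"error": "not found"}).encode()


class HealthHandler(BaseHTTPRequestHandler):
    """Serve GET requests using health_response()."""

    def do_GET(self):
        status, body = health_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8000) -> ThreadingHTTPServer:
    """Build the server; call serve_forever() on the result to start it."""
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

A real agent would likely extend the health response to confirm that it can reach Ollama, so that the container is only reported healthy when it can actually serve requests.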

docker-compose.yml:

version: '3.8'

services:
  agent:
    build: .
    container_name: medical-assistant
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_HOST=http://ollama:11434
      - LOG_LEVEL=INFO
    volumes:
      - ./logs:/app/logs
      - ./config.yaml:/app/config.yaml
    depends_on:
      - ollama
    restart: unless-stopped

  ollama:
    image: ollama/ollama:latest
    container_name: ollama-server
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama-data:

Part 5: CI/CD Automation

GitHub Actions for Experiments

.github/workflows/experiment-validation.yml:

name: Validate Experiment

on:
  push:
    paths:
      - 'experiments/**'
  pull_request:
    paths:
      - 'experiments/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Check metadata exists
        run: |
          python scripts/validate-experiment-metadata.py

      - name: Verify documentation
        run: |
          python scripts/check-experiment-docs.py

      - name: Run linting
        run: |
          pip install ruff
          ruff check experiments/

      - name: Test experiment code
        run: |
          pip install pytest
          # Experiments live under experiments/active/<researcher>/<experiment>/
          pytest experiments/active/*/*/tests/ -v
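The workflow calls `scripts/validate-experiment-metadata.py`, which is not listed in this article. As a rough sketch of the kind of check it might perform, the function below validates one parsed `metadata.yaml`. The required fields and the status values are assumptions based on the metadata examples elsewhere in this series, not a definitive schema.

```python
# Assumed minimum schema; adjust to whatever the team agrees on.
REQUIRED_FIELDS = ("researcher", "created", "status", "goal")
KNOWN_STATUSES = {"in_progress", "completed"}


def validate_metadata(meta) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(meta, dict):
        return ["metadata.yaml is empty or not a mapping"]
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not meta.get(name)
    ]
    status = meta.get("status")
    if status and status not in KNOWN_STATUSES:
        problems.append(f"unknown status: {status}")
    if not isinstance(meta.get("tags", []), list):
        problems.append("tags must be a list")
    return problems
```

A CI script could load each `metadata.yaml` with `yaml.safe_load`, pass the result to `validate_metadata`, and exit non-zero if any list of problems is non-empty.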

Agent Testing Pipeline

.github/workflows/agent-tests.yml:

name: Agent Tests

on:
  pull_request:
    paths:
      - 'agents/production/**'

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      ollama:
        image: ollama/ollama:latest
        ports:
          - 11434:11434

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          # Install requirements for every production agent; the PR branch name is not an agent directory
          for req in agents/production/*/requirements.txt; do
            pip install -r "$req"
          done
          pip install pytest pytest-cov

      - name: Pull required models
        run: |
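          # The ollama CLI is not installed on the runner, so pull through the service container's HTTP API
          curl -fsS http://localhost:11434/api/pull -d '{"model": "llama3.1:8b", "stream": false}'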

      - name: Run unit tests
        run: |
          pytest agents/production/*/tests/ \
            --cov=agents/production \
            --cov-report=xml \
            --cov-fail-under=80

      - name: Run integration tests
        run: |
          pytest agents/production/*/tests/test_integration.py -v

      - name: Security scan
        run: |
          pip install bandit
          bandit -r agents/production/

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml

Production Deployment Pipeline

.github/workflows/deploy-production.yml:

name: Deploy to Production

on:
  push:
    branches:
      - main
    paths:
      - 'agents/production/**'

jobs:
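  deploy:
    runs-on: ubuntu-latest
    environment: production
    # AGENT_NAME, REGISTRY, and VERSION are not defined in this file; they are assumed to
    # come from a job-level env block or repository variables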
    steps:
      - uses: actions/checkout@v3

      - name: Build Docker image
        run: |
          cd agents/production/$AGENT_NAME
          docker build -t $REGISTRY/$AGENT_NAME:$VERSION .

      - name: Run smoke tests
        run: |
          docker run --rm $REGISTRY/$AGENT_NAME:$VERSION python -m pytest tests/smoke/

      - name: Push to registry
        run: |
          echo ${{ secrets.REGISTRY_TOKEN }} | docker login -u ${{ secrets.REGISTRY_USER }} --password-stdin
          docker push $REGISTRY/$AGENT_NAME:$VERSION

      - name: Deploy to Kubernetes
        run: |
          kubectl set image deployment/$AGENT_NAME \
            $AGENT_NAME=$REGISTRY/$AGENT_NAME:$VERSION

      - name: Verify deployment
        run: |
          kubectl rollout status deployment/$AGENT_NAME
          kubectl get pods -l app=$AGENT_NAME

      - name: Smoke test production
        run: |
          python scripts/smoke-test-prod.py $AGENT_NAME
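The final step runs `scripts/smoke-test-prod.py`, which is not shown here. A freshly rolled-out pod may need a few seconds before it answers, so a smoke test usually retries. The sketch below illustrates that pattern with the standard library only; the function names, the retry defaults, and the idea of probing a health URL are assumptions, not the actual script.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, and timeouts
        return False


def wait_until_healthy(url, attempts=5, delay=2.0, probe=http_ok, sleep=time.sleep):
    """Probe `url` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

The `probe` and `sleep` parameters exist so the retry logic can be unit-tested without a network; in the pipeline, the defaults would be used and the script would exit non-zero when `wait_until_healthy` returns `False`.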

Part 6: Team Workflows

Daily Standup Dashboard

scripts/generate-standup.py:

#!/usr/bin/env python3
"""Generate daily standup report"""

from datetime import datetime, timedelta
import subprocess
import yaml
from pathlib import Path

def get_recent_commits():
    """Get commits from last 24 hours"""
    result = subprocess.run(
        ['git', 'log', '--since=24 hours ago', '--pretty=format:%an|%s'],
        capture_output=True, text=True
    )
    # Split on the first '|' only so commit subjects containing '|' stay intact
    commits = [line.split('|', 1) for line in result.stdout.splitlines() if line]
    return commits

def get_active_experiments():
    """Get experiments updated in last 24 hours"""
    experiments = []
    exp_dir = Path("experiments/active")

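    for researcher_dir in exp_dir.iterdir():
        # Skip stray files such as .gitkeep; only directories hold experiments
        if not researcher_dir.is_dir():
            continue
        for exp in researcher_dir.iterdir():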
            metadata = exp / "metadata.yaml"
            if not metadata.exists():
                continue

            mtime = datetime.fromtimestamp(metadata.stat().st_mtime)
            if mtime > datetime.now() - timedelta(days=1):
                with open(metadata) as f:
                    meta = yaml.safe_load(f)
                experiments.append({
                    'researcher': researcher_dir.name,
                    'experiment': exp.name,
                    'status': meta.get('status'),
                    'goal': meta.get('goal')
                })

    return experiments

def generate_report():
    """Generate standup report"""
    print("📊 Daily Standup Report")
    print(f"📅 {datetime.now().strftime('%Y-%m-%d')}")
    print("=" * 50)

    print("\n🚀 Recent Commits:")
    commits = get_recent_commits()
    for author, message in commits[:10]:
        print(f"  • {author}: {message}")

    print("\n🧪 Active Experiments:")
    experiments = get_active_experiments()
    for exp in experiments:
        print(f"  • {exp['researcher']}: {exp['experiment']}")
        print(f"    Goal: {exp['goal']}")
        print(f"    Status: {exp['status']}")

    print("\n📈 MLflow Experiments:")
    # Query MLflow for recent runs
    print("  Run: python scripts/query-mlflow.py --since yesterday")

if __name__ == "__main__":
    generate_report()

Weekly Team Review

scripts/weekly-review.py:

#!/usr/bin/env python3
"""Generate weekly team review"""

from datetime import datetime, timedelta
from pathlib import Path
import yaml

def analyze_experiments():
    """Analyze experiment activity"""
    week_ago = datetime.now() - timedelta(days=7)
    stats = {
        'total_experiments': 0,
        'completed': 0,
        'in_progress': 0,
        'by_researcher': {}
    }

    exp_dir = Path("experiments/active")
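    for researcher_dir in exp_dir.iterdir():
        # Skip stray files such as .gitkeep; only directories hold experiments
        if not researcher_dir.is_dir():
            continue
        researcher = researcher_dir.name
        stats['by_researcher'][researcher] = 0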

        for exp in researcher_dir.iterdir():
            metadata = exp / "metadata.yaml"
            if not metadata.exists():
                continue

            with open(metadata) as f:
                meta = yaml.safe_load(f)

            # str() covers unquoted YAML dates, which PyYAML loads as date objects
            created = datetime.fromisoformat(str(meta.get('created', '2000-01-01')))
            if created > week_ago:
                stats['total_experiments'] += 1
                stats['by_researcher'][researcher] += 1

                status = meta.get('status')
                if status == 'completed':
                    stats['completed'] += 1
                elif status == 'in_progress':
                    stats['in_progress'] += 1

    return stats

def analyze_models():
    """Analyze model registry"""
    with open("models/registry.yaml") as f:
        registry = yaml.safe_load(f)

    return {
        'total_models': len(registry.get('models', {})),
        'production_models': sum(
            1 for m in registry.get('models', {}).values()
            if m.get('status') == 'production'
        ),
        'custom_models': sum(
            1 for m in registry.get('models', {}).values()
            if m.get('source') == 'custom'
        )
    }

def generate_report():
    """Generate comprehensive weekly report"""
    print("📊 Weekly Team Review")
    print(f"📅 Week ending {datetime.now().strftime('%Y-%m-%d')}")
    print("=" * 70)

    exp_stats = analyze_experiments()
    print("\n🧪 Experiment Activity:")
    print(f"  Total new experiments: {exp_stats['total_experiments']}")
    print(f"  Completed: {exp_stats['completed']}")
    print(f"  In progress: {exp_stats['in_progress']}")
    print("\n  By researcher:")
    for researcher, count in exp_stats['by_researcher'].items():
        print(f"    • {researcher}: {count} experiments")

    model_stats = analyze_models()
    print("\n🤖 Model Registry:")
    print(f"  Total models: {model_stats['total_models']}")
    print(f"  Production models: {model_stats['production_models']}")
    print(f"  Custom models: {model_stats['custom_models']}")

    print("\n📈 Key Metrics:")
    print("  GPU Utilization: [Query from monitoring]")
    print("  Agent Uptime: [Query from deployment]")
    print("  Experiment Success Rate: [Calculate from results]")

    print("\n💡 Recommendations:")
    print("  • [Auto-generated based on patterns]")
    print("  • Review abandoned experiments")
    print("  • Share successful experiment patterns")

if __name__ == "__main__":
    generate_report()

Code Review Guidelines

docs/runbooks/code-review-checklist.md:

# Code Review Checklist

## For Experiments

### Required
- [ ] Metadata file exists and is complete
- [ ] README documents experiment goal and methodology
- [ ] Configuration is externalized (no hardcoded values)
- [ ] Dependencies listed in requirements.txt
- [ ] Results directory structure follows template
- [ ] Code runs without errors

### Recommended
- [ ] Inline comments explain complex logic
- [ ] Visualization notebooks for results
- [ ] Performance metrics documented
- [ ] Comparison with baseline

## For Production Agents

### Critical (Must Pass)
- [ ] All tests pass (unit, integration, behavior)
- [ ] Test coverage >80%
- [ ] No security vulnerabilities (bandit scan passes)
- [ ] Complete README with usage examples
- [ ] Configuration externalized
- [ ] Logging comprehensive
- [ ] Error handling robust
- [ ] Dockerfile builds successfully

### Important (Should Pass)
- [ ] Code follows team style guide
- [ ] Docstrings for all public functions
- [ ] Type hints for function signatures
- [ ] Performance benchmarks run
- [ ] Metrics collection implemented
- [ ] Health check endpoint works

### Nice to Have
- [ ] Architecture diagram included
- [ ] Troubleshooting guide in README
- [ ] Example use cases demonstrated
- [ ] Monitoring dashboards defined

## Review Process

1. **Self Review**: Author completes checklist before PR
2. **Peer Review**: Team member reviews code
3. **Testing**: CI/CD pipeline validates automatically
4. **Approval**: 1 approval required for experiments, 2 for production agents
5. **Merge**: Squash and merge with descriptive message

Part 7: Resource Management

GPU Allocation

scripts/check-gpu-usage.sh:

#!/bin/bash
# Check GPU usage and availability

echo "🎮 GPU Resource Status"
echo "====================="

# Check NVIDIA GPUs
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
  --format=csv,noheader,nounits | \
while IFS=',' read -r index name util mem_used mem_total; do
    echo "GPU $index: $name"
    echo "  Utilization: ${util}%"
    echo "  Memory: ${mem_used}MB / ${mem_total}MB"

    # Check if GPU is idle (< 20% utilization)
    if [ "$util" -lt 20 ]; then
        echo "  Status: ✅ Available"
    else
        echo "  Status: 🔴 Busy"
        # Show which processes are using this GPU (-i limits the query to one device)
        nvidia-smi -i "$index" --query-compute-apps=pid,process_name,used_memory \
          --format=csv,noheader | grep -v "^$" | \
        while IFS=',' read -r pid process mem; do
            user=$(ps -o user= -p "$pid" 2>/dev/null)
            echo "    Process: $process (PID: $pid, User: $user)"
        done
    fi
    echo ""
done

Resource Reservation System

scripts/reserve-gpu.py:

#!/usr/bin/env python3
"""GPU reservation system"""

import json
import sys
from datetime import datetime, timedelta
from pathlib import Path

RESERVATIONS_FILE = "shared/gpu-reservations.json"

def load_reservations():
    """Load current reservations"""
    if Path(RESERVATIONS_FILE).exists():
        with open(RESERVATIONS_FILE) as f:
            return json.load(f)
    return {"reservations": []}

def save_reservations(data):
    """Save reservations"""
    with open(RESERVATIONS_FILE, 'w') as f:
        json.dump(data, f, indent=2)

def reserve_gpu(gpu_id, user, duration_hours, purpose):
    """Reserve a GPU"""
    data = load_reservations()

    # Check if GPU is already reserved
    now = datetime.now()
    for res in data['reservations']:
        if res['gpu_id'] == gpu_id:
            end_time = datetime.fromisoformat(res['end_time'])
            if end_time > now:
                print(f"❌ GPU {gpu_id} is reserved by {res['user']} until {end_time}")
                return False

    # Create reservation
    reservation = {
        'gpu_id': gpu_id,
        'user': user,
        'purpose': purpose,
        'start_time': now.isoformat(),
        'end_time': (now + timedelta(hours=duration_hours)).isoformat()
    }

    data['reservations'].append(reservation)
    save_reservations(data)

    print(f"✅ GPU {gpu_id} reserved for {user}")
    print(f"   Duration: {duration_hours} hours")
    print(f"   Until: {reservation['end_time']}")
    return True

def list_reservations():
    """List all current reservations"""
    data = load_reservations()
    now = datetime.now()

    print("📅 Current GPU Reservations")
    print("=" * 50)

    active_reservations = [
        res for res in data['reservations']
        if datetime.fromisoformat(res['end_time']) > now
    ]

    if not active_reservations:
        print("No active reservations")
        return

    for res in active_reservations:
        end = datetime.fromisoformat(res['end_time'])
        remaining = end - now
        print(f"GPU {res['gpu_id']}: {res['user']}")
        print(f"  Purpose: {res['purpose']}")
        print(f"  Time remaining: {remaining}")
        print("")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage:")
        print("  reserve-gpu.py list")
        print("  reserve-gpu.py reserve <gpu_id> <user> <hours> <purpose>")
        sys.exit(1)

    command = sys.argv[1]

    if command == "list":
        list_reservations()
    elif command == "reserve":
        if len(sys.argv) != 6:
            print("Usage: reserve-gpu.py reserve <gpu_id> <user> <hours> <purpose>")
            sys.exit(1)

        gpu_id = int(sys.argv[2])
        user = sys.argv[3]
        hours = int(sys.argv[4])
        purpose = sys.argv[5]

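        # Exit non-zero when the GPU is already taken so calling scripts can react
        sys.exit(0 if reserve_gpu(gpu_id, user, hours, purpose) else 1)
    else:
        print(f"Unknown command: {command}")
        sys.exit(1)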
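One detail worth noting: `reserve_gpu` appends to the reservations list but nothing ever removes expired entries, so the JSON file grows without bound. A small helper could drop them before each save. This is a sketch with a hypothetical function name, assuming the same `end_time` ISO format the script writes.

```python
from datetime import datetime


def prune_expired(reservations: list, now: datetime) -> list:
    """Keep only reservations whose ISO-formatted end_time is still in the future."""
    return [
        res for res in reservations
        if datetime.fromisoformat(res["end_time"]) > now
    ]
```

Calling `data['reservations'] = prune_expired(data['reservations'], now)` just before `save_reservations(data)` would keep the file limited to active reservations. Because the file is shared, two people reserving at the same moment can still overwrite each other; a file lock or a small service would be needed to rule that out.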

Cost Tracking

scripts/track-compute-costs.py:

#!/usr/bin/env python3
"""Track compute costs"""

import json
from datetime import datetime, timedelta
from pathlib import Path

# Cost per GPU hour (example pricing)
GPU_COST_PER_HOUR = {
    'A100': 3.00,
    'A6000': 2.00,
    'RTX4090': 1.50
}

def calculate_experiment_cost(experiment_dir):
    """Calculate cost for an experiment"""
    metadata_file = Path(experiment_dir) / "metadata.yaml"
    if not metadata_file.exists():
        return 0.0

    # Parse training time from logs or metadata
    # This is simplified - you'd parse actual logs
    training_hours = 2.5  # Example
    gpu_type = "A100"  # From metadata

    cost = training_hours * GPU_COST_PER_HOUR.get(gpu_type, 1.0)
    return cost

def generate_cost_report():
    """Generate monthly cost report"""
    print("💰 Compute Cost Report")
    print("=" * 50)

    total_cost = 0.0
    costs_by_user = {}

    exp_dir = Path("experiments/active")
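    for researcher_dir in exp_dir.iterdir():
        # Skip stray files such as .gitkeep; only directories hold experiments
        if not researcher_dir.is_dir():
            continue
        researcher = researcher_dir.name
        costs_by_user[researcher] = 0.0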

        for exp in researcher_dir.iterdir():
            cost = calculate_experiment_cost(exp)
            costs_by_user[researcher] += cost
            total_cost += cost

    print(f"Total cost: ${total_cost:.2f}")
    print("\nBy researcher:")
    for user, cost in sorted(costs_by_user.items(), key=lambda x: x[1], reverse=True):
        print(f"  {user}: ${cost:.2f}")

if __name__ == "__main__":
    generate_cost_report()
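The script above hard-codes `training_hours` and `gpu_type`, so every experiment costs the same. One way to make the numbers real is to record usage in each experiment's metadata and compute the cost from that. The sketch below assumes two hypothetical metadata fields, `gpu_hours` and `gpu_type`, which are not part of the metadata format shown in this series.

```python
# Same example pricing as the script above.
GPU_COST_PER_HOUR = {"A100": 3.00, "A6000": 2.00, "RTX4090": 1.50}


def cost_from_metadata(meta: dict, default_rate: float = 1.0) -> float:
    """Compute cost from the hypothetical `gpu_hours` and `gpu_type` fields."""
    hours = float(meta.get("gpu_hours", 0.0))
    # Unknown or missing GPU types fall back to default_rate
    rate = GPU_COST_PER_HOUR.get(meta.get("gpu_type"), default_rate)
    return hours * rate
```

With this helper, `calculate_experiment_cost` could load `metadata.yaml` and return `cost_from_metadata(meta)`; experiments that never recorded GPU time would simply report a cost of zero.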

Part 8: Complete Workflow Example

Let's see how everything fits together with a real workflow from experiment to production.

Scenario: Building a Medical Q&A Agent

Day 1-2: Initial Experiment

# Sarah starts new experiment
cd ml-workspace
git checkout develop
git checkout -b experiment/sarah/medical-qa

# Setup experiment
mkdir -p experiments/active/sarah/medical-qa
cd experiments/active/sarah/medical-qa

# Initialize experiment
cat > metadata.yaml <<EOF
experiment_id: "medical-qa-v1"
researcher: "sarah"
created: "2024-10-19"
updated: "2024-10-19"
status: "in_progress"
goal: "Build RAG-based medical Q&A system"
hypothesis: "llama3.1:8b with medical corpus RAG can answer clinical questions"
tags: ["rag", "medical", "qa"]
mlflow_experiment: "medical-qa"
EOF

# Create README
cat > README.md <<EOF
# Medical Q&A Experiment

## Goal
Build RAG system for medical question answering.

## Approach
1. Index medical corpus (PubMed abstracts)
2. Implement retrieval with ChromaDB
3. Test llama3.1:8b with retrieved context
4. Evaluate on MedQA benchmark

## Running
\`\`\`bash
python train.py --config config.yaml
python evaluate.py --checkpoint outputs/best_model.pth
\`\`\`
EOF

# Write experiment code
# ... implement RAG pipeline ...

# Track with MLflow
python train.py

# Commit and share
git add .
git commit -m "Initial medical QA RAG experiment"
git push origin experiment/sarah/medical-qa
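
The directory and `metadata.yaml` setup above is the kind of step that is easy to script. A minimal sketch follows, assuming the `experiments/active/<researcher>/<name>/` layout used in this walkthrough. The `init_experiment` helper and its `version` parameter are hypothetical, and it writes only a subset of the metadata fields shown above.

```python
import json
from datetime import date
from pathlib import Path


def init_experiment(workspace, researcher, name, goal, version=1):
    """Create experiments/active/<researcher>/<name>/ with a starter metadata.yaml."""
    exp_dir = Path(workspace) / "experiments" / "active" / researcher / name
    # exist_ok=False: refuse to touch a directory someone has already created
    exp_dir.mkdir(parents=True, exist_ok=False)

    today = date.today().isoformat()
    fields = {
        "experiment_id": f"{name}-v{version}",
        "researcher": researcher,
        "created": today,
        "updated": today,
        "status": "in_progress",
        "goal": goal,
    }
    # json.dumps produces double-quoted strings, which are also valid YAML scalars
    lines = [f"{key}: {json.dumps(value)}" for key, value in fields.items()]
    (exp_dir / "metadata.yaml").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return exp_dir


if __name__ == "__main__":
    created = init_experiment(".", "sarah", "medical-qa", "Build RAG-based medical Q&A system")
    print(f"Created {created}")
```

Failing on an existing directory is a deliberate choice here: it turns an accidental overwrite of a teammate's experiment into an immediate error.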

Day 3-4: Iteration and Results

# Run multiple experiments
mlflow ui  # View results

# Best config identified
# - Chunk size: 500 tokens
# - Retrieval: top-5 with reranking
# - Model: llama3.1:8b, temp=0.3

# Document results
mkdir -p results
cat > results/summary.md <<EOF
# Results Summary

## Best Configuration
- Accuracy: 89%
- Latency p50: 850ms
- Retrieval precision: 0.85

## Conclusion
Ready for prototype agent deployment
EOF

# Update metadata
# status: completed
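
The final "update metadata" step above is left as a comment. A small sketch of doing it programmatically, assuming `metadata.yaml` keeps `status` and `updated` as flat top-level `key: "value"` lines as in the Day 1-2 example. The helper names `set_field` and `mark_completed` are hypothetical.

```python
import re
from datetime import date


def set_field(metadata_text, key, value):
    """Replace a top-level `key: ...` line, or append it if the key is absent."""
    new_line = f'{key}: "{value}"'
    pattern = re.compile(rf"^{re.escape(key)}:.*$", flags=re.MULTILINE)
    if pattern.search(metadata_text):
        # A function replacement avoids backslash processing in the new value
        return pattern.sub(lambda _match: new_line, metadata_text, count=1)
    if metadata_text and not metadata_text.endswith("\n"):
        metadata_text += "\n"
    return metadata_text + new_line + "\n"


def mark_completed(metadata_text, today=None):
    """Set status to completed and refresh the updated date."""
    today = today or date.today().isoformat()
    text = set_field(metadata_text, "status", "completed")
    return set_field(text, "updated", today)


if __name__ == "__main__":
    before = 'researcher: "sarah"\nupdated: "2024-10-19"\nstatus: "in_progress"\n'
    print(mark_completed(before, today="2024-10-22"), end="")
```

Because the edit is line-based, it leaves comments and field order untouched, which keeps the resulting diff small in code review.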

Day 5: Promote to Prototype Agent

# Create prototype agent
cd agents/prototypes/
mkdir medical-qa
cd medical-qa

# Copy experiment code
cp -r ../../../experiments/active/sarah/medical-qa/src/* ./

# Create agent wrapper
cat > agent.py <<EOF
"""Medical QA Agent - Prototype"""
from rag_pipeline import MedicalRAG

class MedicalQAAgent:
    def __init__(self):
        self.rag = MedicalRAG(model="llama3.1:8b")

    def answer_question(self, question):
        context = self.rag.retrieve(question)
        answer = self.rag.generate(question, context)
        return answer
EOF

# Test prototype
python test_agent.py

# Share with team
git add .
git commit -m "Add medical QA prototype agent"
git push origin experiment/sarah/medical-qa
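
The prototype step runs `python test_agent.py` without showing the file. One possible shape is sketched below, assuming the agent accepts its RAG pipeline as a constructor argument. That is a small change from the `agent.py` above, which builds `MedicalRAG` itself, but it lets the test run without Ollama or an index. `FakeRAG` and its canned strings are hypothetical stand-ins, not part of the series code.

```python
class MedicalQAAgent:
    """Variant of the prototype agent that takes its RAG pipeline as a dependency."""

    def __init__(self, rag):
        self.rag = rag

    def answer_question(self, question):
        context = self.rag.retrieve(question)
        return self.rag.generate(question, context)


class FakeRAG:
    """Stand-in pipeline that records calls and returns canned text."""

    def __init__(self):
        self.calls = []

    def retrieve(self, question):
        self.calls.append(("retrieve", question))
        return ["canned passage"]

    def generate(self, question, context):
        self.calls.append(("generate", question))
        return f"answer built from {len(context)} passage(s): {context[0]}"


def test_answer_uses_retrieved_context():
    rag = FakeRAG()
    agent = MedicalQAAgent(rag)

    answer = agent.answer_question("example question")

    # The retrieved passage must reach the generation step
    assert "canned passage" in answer
    # Retrieval must happen before generation, once each
    assert [name for name, _ in rag.calls] == ["retrieve", "generate"]


if __name__ == "__main__":
    test_answer_uses_retrieved_context()
    print("ok")
```

A test like this only checks the wiring between retrieval and generation. Answer quality still needs the benchmark evaluation from the experiment phase.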

Week 2: Production Promotion (Team Review)

# Mike volunteers to productionize
git checkout experiment/sarah/medical-qa
cd agents/production/

# Use production template
cp -r templates/agent-template medical-qa-agent
cd medical-qa-agent

# Fill out complete structure
# - Add comprehensive tests
# - Create Dockerfile
# - Write complete README
# - Add configuration
# - Implement monitoring

# Create PR
git checkout -b feature/medical-qa-production
git add .
git commit -m "Production-ready medical QA agent

- Complete test suite (85% coverage)
- Dockerized with health checks
- Comprehensive documentation
- Monitoring and metrics
- Performance benchmarks"

git push origin feature/medical-qa-production

# Open PR on GitHub
# - CI/CD runs automatically
# - Tests pass
# - Sarah reviews and approves
# - Merge to develop

Week 3: Production Deployment

# Merge to main and push; the push triggers deployment
git checkout main
git merge develop
git push origin main

# GitHub Actions:
# 1. Build Docker image
# 2. Run tests
# 3. Push to registry
# 4. Deploy to Kubernetes
# 5. Run smoke tests

# Agent now live at https://api.company.com/medical-qa

# Update model registry
cat >> models/registry.yaml <<EOF
  medical-qa-v1:
    source: "custom"
    base_model: "llama3.1:8b"
    version: "v1.0.0"
    purpose: "Medical question answering with RAG"
    status: "production"
    owners: ["sarah", "mike"]
    performance:
      accuracy: 0.89
      latency_p50: "850ms"
EOF

Timeline Summary

Days 1-4: Individual experiment (fast iteration)
Day 5: Prototype agent (share with team)
Week 2: Production hardening (quality focus)
Week 3: Deployment (automated pipeline)

Total: 3 weeks from idea to production

Best Practices

For Individual Researchers

1. Document as you go

Write README first (defines what you're building)
Update metadata.yaml daily
Commit frequently with good messages
Screenshot interesting results

2. Follow templates

Use experiment template
Use agent template for production
Don't skip required fields
Consistency helps others help you

3. Share early

Push experiments even if incomplete
Ask for code review on tricky parts
Present results in team meetings
Write handoff docs when switching projects

For Team Leads

1. Automate everything

Use CI/CD for validation
Generate dashboards automatically
Automate environment setup
Script common workflows

2. Maintain standards

Enforce code review for production
Require tests for production agents
Keep templates updated
Document architecture decisions

3. Foster collaboration

Weekly demo sessions
Pair programming for complex features
Shared Slack channel for questions
Regular retrospectives

For Production Deployment

1. Progressive rollout

Deploy to staging first
Canary deployment (10% → 50% → 100%)
Monitor metrics closely
Have rollback plan ready
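
The 10% → 50% → 100% canary stages can be sketched as deterministic bucketing: hash a stable key into one of 100 buckets and send the request to the new version when the bucket falls below the current percentage. This is a minimal illustration, not a description of any particular deployment tool. The function names and the choice of a user ID as the key are assumptions.

```python
import hashlib

ROLLOUT_STAGES = (10, 50, 100)  # percentages from the rollout plan above


def bucket_for(key: str) -> int:
    """Map a stable key (for example a user ID) to a bucket in the range 0-99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_canary(key: str, percent: int) -> bool:
    """Return True if this key should be served by the canary at the given stage."""
    return bucket_for(key) < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for stage in ROLLOUT_STAGES:
        share = sum(use_canary(u, stage) for u in users) / len(users)
        print(f"stage {stage}%: {share:.1%} of users on canary")
```

Hashing rather than random sampling keeps each user on the same version for the whole stage, and anyone in the 10% group stays in the canary at 50%, so widening the rollout never flips a user back to the old version.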

2. Observability

Comprehensive logging
Metrics dashboards
Alerting on errors
Performance tracking
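
As a rough illustration of the logging and performance-tracking items above, the sketch below wraps a function to record per-call latency and error counts using only the standard library. `LatencyTracker` is a hypothetical helper. A production agent would more likely export these numbers to a metrics system than keep them in memory.

```python
import logging
import statistics
import time
from functools import wraps

logger = logging.getLogger("agent.metrics")


class LatencyTracker:
    """Collects call latencies in milliseconds and counts failures."""

    def __init__(self):
        self.samples_ms = []
        self.errors = 0

    def track(self, func):
        """Decorator that times each call and logs any exception before re-raising."""

        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                self.errors += 1
                logger.exception("call to %s failed", func.__name__)
                raise
            finally:
                # Runs for both successful and failed calls
                self.samples_ms.append((time.perf_counter() - start) * 1000)

        return wrapper

    def p50_ms(self):
        """Median latency, or None before any call has been recorded."""
        return statistics.median(self.samples_ms) if self.samples_ms else None


if __name__ == "__main__":
    tracker = LatencyTracker()

    @tracker.track
    def answer_question(question):
        time.sleep(0.01)  # placeholder for retrieval and generation
        return f"answer to: {question}"

    for _ in range(5):
        answer_question("example question")
    print(f"calls={len(tracker.samples_ms)} errors={tracker.errors} p50={tracker.p50_ms():.1f}ms")
```

The p50 value computed here is the same statistic recorded as `latency_p50` in the model registry entry, so the production number can be compared directly with the experiment result.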

3. Documentation

Runbooks for operations
Troubleshooting guides
Architecture diagrams
API documentation

Common Pitfalls to Avoid

Don't

❌ Skip documentation - Future you (and your team) will regret it

❌ Commit large binary files - Use Git LFS or external storage

❌ Work directly on main - Always use feature branches

❌ Deploy without tests - Tests prevent production disasters

❌ Hardcode secrets - Use environment variables or secret management

❌ Ignore failed CI - Fix immediately, don't accumulate debt

❌ Deploy without monitoring - You need visibility in production

❌ Skip code review - Fresh eyes catch bugs and improve design
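
For the hardcoded-secrets item, a minimal sketch of reading a credential from the environment and failing fast when it is absent. The variable name `MEDICAL_QA_API_KEY` and the helper `require_secret` are hypothetical; a secret manager would slot in behind the same function.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not present in the environment."""


def require_secret(name, environ=None):
    """Return the named secret, or raise with a message that names the variable."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        # The message names the variable but never echoes a secret value
        raise MissingSecretError(f"Required environment variable {name} is not set")
    return value


if __name__ == "__main__":
    try:
        api_key = require_secret("MEDICAL_QA_API_KEY")  # hypothetical variable name
        print(f"Loaded key of length {len(api_key)}")
    except MissingSecretError as exc:
        print(f"Configuration error: {exc}")
```

Checking at startup means a misconfigured deployment fails immediately, rather than on the first request that needs the credential.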

Do

✅ Use templates consistently - Reduces cognitive load

✅ Automate repetitive tasks - Scripts save time and reduce errors

✅ Version everything - Code, models, prompts, configs

✅ Communicate proactively - Share blockers and successes

✅ Test thoroughly - Unit, integration, and end-to-end

✅ Monitor production - Metrics, logs, alerts

✅ Document decisions - ADRs explain the "why"

✅ Iterate quickly - Prototype → Test → Production

What You've Built

By completing this series, you now have:

Foundation (Part 1):

Well-organized workspace structure
Separation of experiments, agents, and documentation
Clear file naming conventions
Shared utilities and templates

Knowledge System (Part 2):

Memory bank for context persistence
Learning logs for insights
Project documentation
Architecture decision records

Experiment Tracking (Part 3):

MLflow integration
Reproducible experiments
Version-controlled configurations
Result visualization

Production Agents (Part 4):

Two-track system (prototype/production)
Standardized agent structure
Comprehensive testing
Deployment readiness

Team Collaboration (Part 5):

Version control strategy
Model registry and lifecycle
CI/CD automation
Resource management
Complete workflows

Series Conclusion

You've built a production-ready ML workspace that scales from individual research to full team collaboration. This system:

Accelerates Development:

Templates reduce setup time
Automation removes manual steps
Clear structure reduces decisions
Reusable components speed development

Improves Quality:

Tests catch bugs early
Code review improves design
Standards ensure consistency
Monitoring catches production issues

Enables Collaboration:

Shared structure everyone understands
Version control prevents conflicts
Documentation enables handoffs
Registry provides discoverability

Scales with Your Team:

Works for solo researchers
Supports small teams
Scales to larger organizations
Adapts to changing needs

Key Takeaways

Structure enables speed - Good organization removes friction
Automate everything - Scripts and CI/CD remove manual, error-prone steps
Documentation is infrastructure - Undocumented work is wasted work
Version all artifacts - Code, models, prompts, configs
Test before production - Tests prevent disasters
Monitor what matters - You can't improve what you don't measure
Collaborate deliberately - Clear processes enable teamwork
Iterate continuously - Prototype → Test → Improve → Repeat

Next Steps

Immediate:

Set up your workspace structure (Part 1)
Initialize version control
Create first experiment with template
Set up MLflow tracking

Short-term (1-2 weeks):

Build memory bank system
Create agent templates
Implement basic CI/CD
Set up model registry

Long-term (1-3 months):

Establish team workflows
Build automation scripts
Deploy first production agent
Iterate based on team feedback

Ongoing:

Refine templates based on learnings
Update documentation regularly
Share successful patterns
Continuously improve automation

Resources

Code Examples:

All scripts from this series (GitHub link)
Template repository
Example experiments
Sample agents

Tools:

MLflow for experiment tracking
Ollama for local LLMs
Docker for containerization
GitHub Actions for CI/CD

Series Navigation

Part 1: Workspace Structure
Part 2: Documentation Systems
Part 3: Experiment Tracking
Part 4: Production-Ready AI Agents
Part 5: Team Collaboration and Workflow Integration (this article)

Questions or want to share your workspace setup? Find me on Twitter @bioinfo or at rundatarun.io

Thank you for following this series! I hope it helps you build better ML workflows and more effective teams.

About the Author: Justin Johnson builds AI systems and writes about practical AI development.

justinhjohnson.com | Twitter | LinkedIn | Run Data Run | Subscribe
