A Production ML Workspace· Part 4 of 5
Practical Applications10 min readshipped

Building a Production ML Workspace: Part 4 - Production-Ready AI Agent Templates

Building a Production ML Workspace: Part 4 - Production-Ready AI Agent Templates

You've organized your workspace, built documentation systems, and implemented experiment tracking. Your ML workflows are reproducible and well-documented. Now it's time to tackle the most complex artifact in your workspace: AI agents.

Unlike experiments (which run once) or models (which are static), agents are dynamic systems that interact with tools, maintain state, and make decisions. Without proper structure, agent development quickly becomes tangled code with unclear responsibilities and impossible debugging.

This article shows you how to build production-ready AI agents using standardized templates, clear architecture patterns, comprehensive testing, and deployment readiness frameworks.

About This Series

This is Part 4 of a 5-part series on building production ML workspaces. Previous parts:

  • Part 1: Workspace Structure
  • Part 2: Documentation Systems
  • Part 3: Experiment Tracking

Coming next:

  • Part 5: Ollama Model Management and Workflow Integration

The Agent Development Problem

AI agents present unique challenges that experiments and models don't:

Complexity Challenges:

  • Multiple components (LLM, tools, memory, state)
  • Complex interaction patterns
  • Error handling across tool calls
  • State management and conversation history
  • Tool orchestration and chaining

Production Challenges:

  • How do you test an agent thoroughly?
  • How do you debug multi-step reasoning?
  • How do you version control prompts?
  • How do you monitor production behavior?
  • How do you handle tool failures gracefully?

Organizational Challenges:

  • Prototypes that never reach production
  • Unclear distinction between experimental and production code
  • Lack of reusable components
  • No standardized agent structure
  • Difficulty onboarding to existing agents

The Two-Track Agent System

Our solution: Separate tracks for prototype and production agents with clear promotion criteria.

Agent Development System
│
├── Prototype Track (agents/prototypes/)
│   ├── Fast iteration
│   ├── Minimal documentation
│   ├── Breaking changes OK
│   └── No tests required
│
└── Production Track (agents/production/)
    ├── Stable API
    ├── Comprehensive docs
    ├── Full test coverage
    └── Monitoring & logging

When to Use Each Track

Prototype Track:

  • Initial exploration and experimentation
  • Testing new LLM capabilities
  • Rapid tool integration trials
  • Proof-of-concept development
  • Research and learning

Production Track:

  • Agents used by others or in automation
  • Critical business workflows
  • Public-facing applications
  • Deployed services
  • Agents with users who expect reliability

Production Agent Template Structure

Every production agent gets this standardized structure:

agents/production/agent-name/
├── README.md                  # Complete documentation
├── agent.py                   # Main agent code
├── config.yaml               # Configuration
├── requirements.txt          # Dependencies
├── environment.yaml          # Full environment
├── prompts/                  # Version-controlled prompts
│   ├── system.txt           # System prompt
│   ├── user_template.txt    # User prompt template
│   └── few_shot.yaml        # Few-shot examples
├── tools/                    # Agent tools
│   ├── __init__.py
│   ├── tool_registry.py     # Tool management
│   └── [tool_name].py       # Individual tools
├── memory/                   # State and memory
│   ├── conversation.py      # Conversation history
│   ├── context.py          # Context management
│   └── state.py            # Agent state
├── tests/                   # Comprehensive tests
│   ├── test_agent.py       # Agent behavior tests
│   ├── test_tools.py       # Tool tests
│   └── test_integration.py # End-to-end tests
├── logs/                    # Execution logs
│   └── [timestamp].log
├── metrics/                 # Performance tracking
│   └── metrics.json
└── deployment/             # Deployment assets
    ├── Dockerfile
    ├── docker-compose.yml
    └── deploy.sh

The Complete Agent README

Location: agents/production/[agent-name]/README.md

# Agent: [Agent Name]

**Status:** 🟢 Production | 🟡 Beta | 🔴 Experimental
**Version:** v1.0.0
**Last Updated:** YYYY-MM-DD
**Maintainer:** Your Name

---

## Quick Summary

**Purpose:** [One sentence describing what this agent does]

**Use Cases:**
1. [Primary use case]
2. [Secondary use case]
3. [Additional use case]

**Key Capabilities:**
- [Capability 1]
- [Capability 2]
- [Capability 3]

---

## Quick Start

### Installation

```bash
# Clone or navigate to agent directory
cd agents/production/agent-name

# Create environment
conda env create -f environment.yaml
conda activate agent-env

# Install dependencies
pip install -r requirements.txt

# Verify installation
python agent.py --verify

Basic Usage

from agent import AgentName

# Initialize agent
agent = AgentName(
    model="llama3.1:8b",
    temperature=0.7,
    max_iterations=5
)

# Run agent
result = agent.run("Your task here")
print(result)

Configuration

Edit config.yaml:

model:
  name: "llama3.1:8b"
  temperature: 0.7
  max_tokens: 2000

agent:
  max_iterations: 5
  max_tool_calls: 10
  timeout_seconds: 300

Architecture

System Overview

User Input
    ↓
[Agent Core] ←→ [LLM (Ollama)]
    ↓
[Tool Orchestrator]
    ↓
[Tools] → [External APIs/Systems]
    ↓
[Memory/State]
    ↓
Agent Response

Components

Agent Core (agent.py):

  • Main agent logic and reasoning loop
  • Decision-making and planning
  • Error handling and recovery
  • Response generation

Tool System (tools/):

  • Tool registry and management
  • Individual tool implementations
  • Tool call validation
  • Result processing

Memory System (memory/):

  • Conversation history management
  • Context window management
  • State persistence
  • Session management

Prompt System (prompts/):

  • System prompt definition
  • User prompt templates
  • Few-shot examples
  • Prompt versioning

Detailed Usage

Basic Example

from agent import CustomerSupportAgent

# Initialize
agent = CustomerSupportAgent()

# Simple query
response = agent.run(
    query="How do I reset my password?",
    context={"user_id": "12345"}
)

print(response.text)
print(f"Tools used: {response.tools_called}")
print(f"Confidence: {response.confidence}")

Advanced Example

# Multi-turn conversation
agent = CustomerSupportAgent()

conversation_id = agent.start_conversation(
    user_id="12345",
    initial_context={"account_type": "premium"}
)

# Turn 1
response1 = agent.continue_conversation(
    conversation_id=conversation_id,
    message="I can't log in to my account"
)

# Turn 2
response2 = agent.continue_conversation(
    conversation_id=conversation_id,
    message="I tried that already, still not working"
)

# Get conversation history
history = agent.get_conversation_history(conversation_id)

Streaming Response

# Stream agent response
for chunk in agent.run_streaming("Complex analysis task"):
    print(chunk.text, end="")
    if chunk.tool_call:
        print(f"\n[Using tool: {chunk.tool_call.name}]")

Tools Available

Built-in Tools

  1. knowledge_base_search

    • Purpose: Search internal knowledge base
    • Input: Query string
    • Output: Relevant documents
    • Example: {"query": "password reset policy"}
  2. api_call

    • Purpose: Call external API
    • Input: Endpoint, method, params
    • Output: API response
    • Example: {"endpoint": "/user/status", "method": "GET"}
  3. database_query

    • Purpose: Query user database
    • Input: Query parameters
    • Output: User data
    • Example: {"user_id": "12345", "fields": ["email", "status"]}

Adding Custom Tools

# tools/custom_tool.py
from tools.base import BaseTool

class CustomTool(BaseTool):
    name = "custom_tool"
    description = "What this tool does"

    def __init__(self):
        super().__init__()

    def execute(self, **kwargs):
        """Execute tool logic"""
        # Your implementation
        return {"result": "value"}

    def validate_input(self, **kwargs):
        """Validate inputs"""
        required = ["param1", "param2"]
        return all(k in kwargs for k in required)

Register in tools/tool_registry.py:

from tools.custom_tool import CustomTool

TOOLS = {
    "custom_tool": CustomTool(),
    # ... other tools
}

Configuration Reference

config.yaml Structure

# Model configuration
model:
  name: "llama3.1:8b"          # Ollama model
  temperature: 0.7              # Sampling temperature
  max_tokens: 2000             # Max response length
  top_p: 0.9                   # Nucleus sampling

# Agent behavior
agent:
  max_iterations: 5            # Max reasoning loops
  max_tool_calls: 10          # Max tool invocations
  timeout_seconds: 300        # Overall timeout
  retry_on_error: true        # Retry failed operations
  max_retries: 3              # Max retry attempts

# Tools configuration
tools:
  enabled:
    - knowledge_base_search
    - api_call
    - database_query
  timeout_per_tool: 30        # Seconds per tool

# Memory configuration
memory:
  max_history_length: 10      # Messages to keep
  context_window: 4096        # Token limit
  summarize_old: true         # Summarize old messages

# Logging
logging:
  level: "INFO"
  log_tool_calls: true
  log_prompts: false          # Sensitive: disable in prod
  save_conversations: true

Testing

Running Tests

# All tests
pytest tests/

# Specific test file
pytest tests/test_agent.py

# With coverage
pytest --cov=. tests/

# Verbose output
pytest -v tests/

Test Structure

Unit Tests (test_tools.py):

def test_knowledge_base_search():
    """Test knowledge base search tool"""
    tool = KnowledgeBaseSearch()
    result = tool.execute(query="test query")
    assert result["documents"] is not None
    assert len(result["documents"]) > 0

Integration Tests (test_integration.py):

def test_agent_conversation_flow():
    """Test complete conversation flow"""
    agent = CustomerSupportAgent()

    # Start conversation
    conv_id = agent.start_conversation(user_id="test")

    # First message
    response1 = agent.continue_conversation(
        conversation_id=conv_id,
        message="I need help"
    )
    assert response1.text is not None

    # Follow-up
    response2 = agent.continue_conversation(
        conversation_id=conv_id,
        message="Tell me more"
    )
    assert len(response2.conversation_history) == 4  # 2 exchanges

Behavior Tests (test_agent.py):

def test_agent_handles_tool_failure():
    """Test graceful handling of tool failures"""
    agent = CustomerSupportAgent()

    # Mock tool to fail
    with patch('tools.api_call.execute') as mock_tool:
        mock_tool.side_effect = Exception("API error")

        response = agent.run("Query requiring API call")

        # Agent should recover gracefully
        assert response.text is not None
        assert "error" in response.text.lower()
        assert response.success is False

Monitoring & Metrics

Metrics Tracked

{
  "conversation_id": "abc123",
  "timestamp": "2024-10-19T14:30:00",
  "duration_seconds": 4.2,
  "model_used": "llama3.1:8b",
  "tokens": {
    "prompt": 1024,
    "completion": 256,
    "total": 1280
  },
  "tools_called": [
    {
      "name": "knowledge_base_search",
      "duration_ms": 120,
      "success": true
    }
  ],
  "iterations": 2,
  "success": true,
  "error": null,
  "user_feedback": null
}

Accessing Metrics

# Get agent metrics
from metrics import AgentMetrics

metrics = AgentMetrics.load("metrics/metrics.json")

# Summary statistics
print(f"Total conversations: {metrics.total_conversations}")
print(f"Average duration: {metrics.avg_duration}s")
print(f"Success rate: {metrics.success_rate}%")
print(f"Most used tools: {metrics.top_tools}")

Deployment

Docker Deployment

# Build image
docker build -t agent-name:v1.0.0 .

# Run container
docker run -d \
  --name agent-name \
  -p 8000:8000 \
  -v $(pwd)/logs:/app/logs \
  -e OLLAMA_HOST=http://host.docker.internal:11434 \
  agent-name:v1.0.0

# Check logs
docker logs -f agent-name

API Server

The agent includes a FastAPI server:

# Start server
python -m agent.server

# Server runs on http://localhost:8000

API Endpoints:

# Health check
curl http://localhost:8000/health

# Run agent
curl -X POST http://localhost:8000/run \
  -H "Content-Type: application/json" \
  -d '{"query": "Your question here"}'

# Start conversation
curl -X POST http://localhost:8000/conversations \
  -H "Content-Type: application/json" \
  -d '{"user_id": "12345"}'

Troubleshooting

Common Issues

Issue: Agent times out

  • Cause: Task too complex or tools too slow
  • Solution: Increase timeout_seconds in config.yaml or optimize tools

Issue: Tool calls fail

  • Cause: Tool errors or invalid inputs
  • Solution: Check logs/ for detailed error messages, verify tool inputs

Issue: Poor responses

  • Cause: Prompt issues or model limitations
  • Solution: Review prompts/system.txt, add few-shot examples

Issue: Memory errors

  • Cause: Context window exceeded
  • Solution: Enable summarize_old in memory config

Debug Mode

# Enable debug logging
agent = CustomerSupportAgent(debug=True)

# This logs:
# - Full prompts sent to LLM
# - Tool inputs/outputs
# - Reasoning steps
# - Error traces

Development Workflow

Prototype → Production Checklist

  • Code Complete

    • All features implemented
    • Error handling comprehensive
    • Configuration externalized
  • Documentation

    • README complete
    • Docstrings for all functions
    • Usage examples provided
    • Architecture documented
  • Testing

    • Unit tests (>80% coverage)
    • Integration tests
    • Behavior tests
    • All tests passing
  • Performance

    • Response time acceptable (<5s typical)
    • Tool timeouts configured
    • Memory usage reasonable
  • Production Readiness

    • Logging comprehensive
    • Metrics collection enabled
    • Deployment assets created
    • Security review complete

Versioning Strategy

Semantic Versioning:

  • v1.0.0 - Major: Breaking API changes
  • v1.1.0 - Minor: New features, backward compatible
  • v1.1.1 - Patch: Bug fixes

Prompt Versioning:

prompts/
├── v1/
│   ├── system.txt
│   └── user_template.txt
└── v2/
    ├── system.txt
    └── user_template.txt

Best Practices

Agent Design

Keep It Focused:

  • One agent, one clear purpose
  • Don't build "do everything" agents
  • Prefer specialized agents with clear domains

Fail Gracefully:

  • Validate inputs before tool calls
  • Handle tool failures without crashing
  • Provide helpful error messages
  • Always return a response (even if degraded)

Be Observable:

  • Log all tool calls
  • Track metrics consistently
  • Make debugging easy
  • Provide clear status indicators

Prompt Engineering

Version Control Prompts:

  • Keep prompts in separate files
  • Use git for prompt versioning
  • Document prompt changes
  • A/B test prompt variations

Structure System Prompts:

1. Role definition
2. Capabilities and limitations
3. Tool descriptions
4. Response format requirements
5. Behavioral guidelines

Tool Development

Make Tools Atomic:

  • One tool, one clear function
  • Avoid tool interdependencies
  • Return structured data
  • Include success/failure indicators

Validate Everything:

  • Check inputs before execution
  • Validate outputs before returning
  • Handle edge cases explicitly
  • Provide clear error messages

Performance Optimization

Response Time

Target: <5 seconds for typical queries

Strategies:

  1. Cache tool results - Don't repeat identical queries
  2. Parallel tool calls - When tools are independent
  3. Optimize prompts - Shorter prompts = faster inference
  4. Early stopping - Return when confident enough

Memory Management

Context Window Management:

# Summarize old messages when context is full
if context_length > max_context:
    summary = summarize_messages(old_messages)
    context = [summary] + recent_messages

Cost Optimization

Token Usage:

  • Monitor tokens per conversation
  • Set appropriate max_tokens
  • Use smaller models when possible
  • Cache frequent queries

Security Considerations

Input Validation:

  • Sanitize all user inputs
  • Validate tool parameters
  • Prevent injection attacks
  • Limit query length

Access Control:

  • Authenticate API requests
  • Authorize tool access per user
  • Audit sensitive operations
  • Rate limit requests

Data Privacy:

  • Don't log sensitive data
  • Encrypt stored conversations
  • Comply with privacy regulations
  • Provide data deletion

What's Next

You now have production-ready AI agent templates with standardized structure, comprehensive testing, and deployment readiness. Your agents are well-architected, observable, and maintainable.

In Part 5: Ollama Model Management and Workflow Integration, we'll complete the series by covering:

  • Ollama model lifecycle management
  • Custom Modelfile patterns
  • Model versioning and registry
  • Integration with agents and experiments
  • Complete workflow automation

We'll tie together all the pieces—workspace structure, documentation, experiments, and agents—into a unified Ollama-centered workflow.


Key Takeaways

  • Two-track system separates prototypes from production agents
  • Standardized templates ensure consistent agent structure
  • Comprehensive testing makes agents reliable and maintainable
  • Clear architecture with separated concerns (agent/tools/memory/prompts)
  • Production readiness includes deployment, monitoring, and security
  • Documentation makes agents usable by others

Resources

Templates:

  • Agent README template (provided above)
  • Agent structure (directory layout)
  • Test templates (unit/integration/behavior)

Code Examples:

  • Basic agent implementation
  • Tool development pattern
  • Memory management
  • API server

Series Navigation

  • Previous: Part 3: Experiment Tracking
  • Next: Part 5: Ollama Integration (coming soon)
  • Series Home: Building a Production ML Workspace on GPU Infrastructure

Questions or suggestions? Find me on Twitter @bioinfo or at rundatarun.io


Related Articles


About the Author: Justin Johnson builds AI systems and writes about practical AI development.

justinhjohnson.com | Twitter | LinkedIn | Run Data Run | Subscribe

Follow the lab

Get the next experiment

Enjoyed the breakdown on Building a Production ML Workspace: Part 4 - Production-Ready AI Agent Templates? New entries land roughly weekly. No digest, no roundup. Just the next build log, when it ships.

Links to this entry