Building a Production ML Workspace: Part 4 - Production-Ready AI Agent Templates
Building a Production ML Workspace: Part 4 - Production-Ready AI Agent Templates
You've organized your workspace, built documentation systems, and implemented experiment tracking. Your ML workflows are reproducible and well-documented. Now it's time to tackle the most complex artifact in your workspace: AI agents.
Unlike experiments (which run once) or models (which are static), agents are dynamic systems that interact with tools, maintain state, and make decisions. Without proper structure, agent development quickly becomes tangled code with unclear responsibilities and impossible debugging.
This article shows you how to build production-ready AI agents using standardized templates, clear architecture patterns, comprehensive testing, and deployment readiness frameworks.
This is Part 4 of a 5-part series on building production ML workspaces. Previous parts:
- Part 1: Workspace StructureshippedPractical ApplicationsOct 19, 2025Building a Production ML Workspace: Part 1 - Designing an Organized StructureLearn how to design a scalable ML workspace structure that handles Ollama models, fine-tuning, agents, and experiments without becoming chaotic.
- Part 2: Documentation SystemsshippedPractical ApplicationsOct 19, 2025Building a Production ML Workspace: Part 2 - Documentation Systems That ScaleBuild a three-tier documentation system that captures ML work for debugging, review, and blog content—turning your experiments into shareable knowledge.
- Part 3: Experiment TrackingshippedPractical ApplicationsOct 19, 2025Building a Production ML Workspace: Part 3 - Experiment Tracking and ReproducibilityBuild systematic experiment tracking with templates, progress monitoring, and lifecycle management to ensure every ML experiment is reproducible and builds toward knowledge.
Coming next:
- Part 5: Ollama Model Management and Workflow Integration
The Agent Development Problem
AI agents present unique challenges that experiments and models don't:
Complexity Challenges:
- Multiple components (LLM, tools, memory, state)
- Complex interaction patterns
- Error handling across tool calls
- State management and conversation history
- Tool orchestration and chaining
Production Challenges:
- How do you test an agent thoroughly?
- How do you debug multi-step reasoning?
- How do you version control prompts?
- How do you monitor production behavior?
- How do you handle tool failures gracefully?
Organizational Challenges:
- Prototypes that never reach production
- Unclear distinction between experimental and production code
- Lack of reusable components
- No standardized agent structure
- Difficulty onboarding to existing agents
The Two-Track Agent System
Our solution: Separate tracks for prototype and production agents with clear promotion criteria.
Agent Development System
│
├── Prototype Track (agents/prototypes/)
│ ├── Fast iteration
│ ├── Minimal documentation
│ ├── Breaking changes OK
│ └── No tests required
│
└── Production Track (agents/production/)
├── Stable API
├── Comprehensive docs
├── Full test coverage
└── Monitoring & logging
When to Use Each Track
Prototype Track:
- Initial exploration and experimentation
- Testing new LLM capabilities
- Rapid tool integration trials
- Proof-of-concept development
- Research and learning
Production Track:
- Agents used by others or in automation
- Critical business workflows
- Public-facing applications
- Deployed services
- Agents with users who expect reliability
Production Agent Template Structure
Every production agent gets this standardized structure:
agents/production/agent-name/
├── README.md # Complete documentation
├── agent.py # Main agent code
├── config.yaml # Configuration
├── requirements.txt # Dependencies
├── environment.yaml # Full environment
├── prompts/ # Version-controlled prompts
│ ├── system.txt # System prompt
│ ├── user_template.txt # User prompt template
│ └── few_shot.yaml # Few-shot examples
├── tools/ # Agent tools
│ ├── __init__.py
│ ├── tool_registry.py # Tool management
│ └── [tool_name].py # Individual tools
├── memory/ # State and memory
│ ├── conversation.py # Conversation history
│ ├── context.py # Context management
│ └── state.py # Agent state
├── tests/ # Comprehensive tests
│ ├── test_agent.py # Agent behavior tests
│ ├── test_tools.py # Tool tests
│ └── test_integration.py # End-to-end tests
├── logs/ # Execution logs
│ └── [timestamp].log
├── metrics/ # Performance tracking
│ └── metrics.json
└── deployment/ # Deployment assets
├── Dockerfile
├── docker-compose.yml
└── deploy.sh
The Complete Agent README
Location: agents/production/[agent-name]/README.md
# Agent: [Agent Name]
**Status:** 🟢 Production | 🟡 Beta | 🔴 Experimental
**Version:** v1.0.0
**Last Updated:** YYYY-MM-DD
**Maintainer:** Your Name
---
## Quick Summary
**Purpose:** [One sentence describing what this agent does]
**Use Cases:**
1. [Primary use case]
2. [Secondary use case]
3. [Additional use case]
**Key Capabilities:**
- [Capability 1]
- [Capability 2]
- [Capability 3]
---
## Quick Start
### Installation
```bash
# Clone or navigate to agent directory
cd agents/production/agent-name
# Create environment
conda env create -f environment.yaml
conda activate agent-env
# Install dependencies
pip install -r requirements.txt
# Verify installation
python agent.py --verify
Basic Usage
from agent import AgentName
# Initialize agent
agent = AgentName(
model="llama3.1:8b",
temperature=0.7,
max_iterations=5
)
# Run agent
result = agent.run("Your task here")
print(result)
Configuration
Edit config.yaml:
model:
name: "llama3.1:8b"
temperature: 0.7
max_tokens: 2000
agent:
max_iterations: 5
max_tool_calls: 10
timeout_seconds: 300
Architecture
System Overview
User Input
↓
[Agent Core] ←→ [LLM (Ollama)]
↓
[Tool Orchestrator]
↓
[Tools] → [External APIs/Systems]
↓
[Memory/State]
↓
Agent Response
Components
Agent Core (agent.py):
- Main agent logic and reasoning loop
- Decision-making and planning
- Error handling and recovery
- Response generation
Tool System (tools/):
- Tool registry and management
- Individual tool implementations
- Tool call validation
- Result processing
Memory System (memory/):
- Conversation history management
- Context window management
- State persistence
- Session management
Prompt System (prompts/):
- System prompt definition
- User prompt templates
- Few-shot examples
- Prompt versioning
Detailed Usage
Basic Example
from agent import CustomerSupportAgent
# Initialize
agent = CustomerSupportAgent()
# Simple query
response = agent.run(
query="How do I reset my password?",
context={"user_id": "12345"}
)
print(response.text)
print(f"Tools used: {response.tools_called}")
print(f"Confidence: {response.confidence}")
Advanced Example
# Multi-turn conversation
agent = CustomerSupportAgent()
conversation_id = agent.start_conversation(
user_id="12345",
initial_context={"account_type": "premium"}
)
# Turn 1
response1 = agent.continue_conversation(
conversation_id=conversation_id,
message="I can't log in to my account"
)
# Turn 2
response2 = agent.continue_conversation(
conversation_id=conversation_id,
message="I tried that already, still not working"
)
# Get conversation history
history = agent.get_conversation_history(conversation_id)
Streaming Response
# Stream agent response
for chunk in agent.run_streaming("Complex analysis task"):
print(chunk.text, end="")
if chunk.tool_call:
print(f"\n[Using tool: {chunk.tool_call.name}]")
Tools Available
Built-in Tools
-
knowledge_base_search
- Purpose: Search internal knowledge base
- Input: Query string
- Output: Relevant documents
- Example:
{"query": "password reset policy"}
-
api_call
- Purpose: Call external API
- Input: Endpoint, method, params
- Output: API response
- Example:
{"endpoint": "/user/status", "method": "GET"}
-
database_query
- Purpose: Query user database
- Input: Query parameters
- Output: User data
- Example:
{"user_id": "12345", "fields": ["email", "status"]}
Adding Custom Tools
# tools/custom_tool.py
from tools.base import BaseTool
class CustomTool(BaseTool):
name = "custom_tool"
description = "What this tool does"
def __init__(self):
super().__init__()
def execute(self, **kwargs):
"""Execute tool logic"""
# Your implementation
return {"result": "value"}
def validate_input(self, **kwargs):
"""Validate inputs"""
required = ["param1", "param2"]
return all(k in kwargs for k in required)
Register in tools/tool_registry.py:
from tools.custom_tool import CustomTool
TOOLS = {
"custom_tool": CustomTool(),
# ... other tools
}
Configuration Reference
config.yaml Structure
# Model configuration
model:
name: "llama3.1:8b" # Ollama model
temperature: 0.7 # Sampling temperature
max_tokens: 2000 # Max response length
top_p: 0.9 # Nucleus sampling
# Agent behavior
agent:
max_iterations: 5 # Max reasoning loops
max_tool_calls: 10 # Max tool invocations
timeout_seconds: 300 # Overall timeout
retry_on_error: true # Retry failed operations
max_retries: 3 # Max retry attempts
# Tools configuration
tools:
enabled:
- knowledge_base_search
- api_call
- database_query
timeout_per_tool: 30 # Seconds per tool
# Memory configuration
memory:
max_history_length: 10 # Messages to keep
context_window: 4096 # Token limit
summarize_old: true # Summarize old messages
# Logging
logging:
level: "INFO"
log_tool_calls: true
log_prompts: false # Sensitive: disable in prod
save_conversations: true
Testing
Running Tests
# All tests
pytest tests/
# Specific test file
pytest tests/test_agent.py
# With coverage
pytest --cov=. tests/
# Verbose output
pytest -v tests/
Test Structure
Unit Tests (test_tools.py):
def test_knowledge_base_search():
"""Test knowledge base search tool"""
tool = KnowledgeBaseSearch()
result = tool.execute(query="test query")
assert result["documents"] is not None
assert len(result["documents"]) > 0
Integration Tests (test_integration.py):
def test_agent_conversation_flow():
"""Test complete conversation flow"""
agent = CustomerSupportAgent()
# Start conversation
conv_id = agent.start_conversation(user_id="test")
# First message
response1 = agent.continue_conversation(
conversation_id=conv_id,
message="I need help"
)
assert response1.text is not None
# Follow-up
response2 = agent.continue_conversation(
conversation_id=conv_id,
message="Tell me more"
)
assert len(response2.conversation_history) == 4 # 2 exchanges
Behavior Tests (test_agent.py):
def test_agent_handles_tool_failure():
"""Test graceful handling of tool failures"""
agent = CustomerSupportAgent()
# Mock tool to fail
with patch('tools.api_call.execute') as mock_tool:
mock_tool.side_effect = Exception("API error")
response = agent.run("Query requiring API call")
# Agent should recover gracefully
assert response.text is not None
assert "error" in response.text.lower()
assert response.success is False
Monitoring & Metrics
Metrics Tracked
{
"conversation_id": "abc123",
"timestamp": "2024-10-19T14:30:00",
"duration_seconds": 4.2,
"model_used": "llama3.1:8b",
"tokens": {
"prompt": 1024,
"completion": 256,
"total": 1280
},
"tools_called": [
{
"name": "knowledge_base_search",
"duration_ms": 120,
"success": true
}
],
"iterations": 2,
"success": true,
"error": null,
"user_feedback": null
}
Accessing Metrics
# Get agent metrics
from metrics import AgentMetrics
metrics = AgentMetrics.load("metrics/metrics.json")
# Summary statistics
print(f"Total conversations: {metrics.total_conversations}")
print(f"Average duration: {metrics.avg_duration}s")
print(f"Success rate: {metrics.success_rate}%")
print(f"Most used tools: {metrics.top_tools}")
Deployment
Docker Deployment
# Build image
docker build -t agent-name:v1.0.0 .
# Run container
docker run -d \
--name agent-name \
-p 8000:8000 \
-v $(pwd)/logs:/app/logs \
-e OLLAMA_HOST=http://host.docker.internal:11434 \
agent-name:v1.0.0
# Check logs
docker logs -f agent-name
API Server
The agent includes a FastAPI server:
# Start server
python -m agent.server
# Server runs on http://localhost:8000
API Endpoints:
# Health check
curl http://localhost:8000/health
# Run agent
curl -X POST http://localhost:8000/run \
-H "Content-Type: application/json" \
-d '{"query": "Your question here"}'
# Start conversation
curl -X POST http://localhost:8000/conversations \
-H "Content-Type: application/json" \
-d '{"user_id": "12345"}'
Troubleshooting
Common Issues
Issue: Agent times out
- Cause: Task too complex or tools too slow
- Solution: Increase
timeout_secondsin config.yaml or optimize tools
Issue: Tool calls fail
- Cause: Tool errors or invalid inputs
- Solution: Check logs/ for detailed error messages, verify tool inputs
Issue: Poor responses
- Cause: Prompt issues or model limitations
- Solution: Review
prompts/system.txt, add few-shot examples
Issue: Memory errors
- Cause: Context window exceeded
- Solution: Enable
summarize_oldin memory config
Debug Mode
# Enable debug logging
agent = CustomerSupportAgent(debug=True)
# This logs:
# - Full prompts sent to LLM
# - Tool inputs/outputs
# - Reasoning steps
# - Error traces
Development Workflow
Prototype → Production Checklist
-
Code Complete
- All features implemented
- Error handling comprehensive
- Configuration externalized
-
Documentation
- README complete
- Docstrings for all functions
- Usage examples provided
- Architecture documented
-
Testing
- Unit tests (>80% coverage)
- Integration tests
- Behavior tests
- All tests passing
-
Performance
- Response time acceptable (<5s typical)
- Tool timeouts configured
- Memory usage reasonable
-
Production Readiness
- Logging comprehensive
- Metrics collection enabled
- Deployment assets created
- Security review complete
Versioning Strategy
Semantic Versioning:
- v1.0.0 - Major: Breaking API changes
- v1.1.0 - Minor: New features, backward compatible
- v1.1.1 - Patch: Bug fixes
Prompt Versioning:
prompts/
├── v1/
│ ├── system.txt
│ └── user_template.txt
└── v2/
├── system.txt
└── user_template.txt
Best Practices
Agent Design
Keep It Focused:
- One agent, one clear purpose
- Don't build "do everything" agents
- Prefer specialized agents with clear domains
Fail Gracefully:
- Validate inputs before tool calls
- Handle tool failures without crashing
- Provide helpful error messages
- Always return a response (even if degraded)
Be Observable:
- Log all tool calls
- Track metrics consistently
- Make debugging easy
- Provide clear status indicators
Prompt Engineering
Version Control Prompts:
- Keep prompts in separate files
- Use git for prompt versioning
- Document prompt changes
- A/B test prompt variations
Structure System Prompts:
1. Role definition
2. Capabilities and limitations
3. Tool descriptions
4. Response format requirements
5. Behavioral guidelines
Tool Development
Make Tools Atomic:
- One tool, one clear function
- Avoid tool interdependencies
- Return structured data
- Include success/failure indicators
Validate Everything:
- Check inputs before execution
- Validate outputs before returning
- Handle edge cases explicitly
- Provide clear error messages
Performance Optimization
Response Time
Target: <5 seconds for typical queries
Strategies:
- Cache tool results - Don't repeat identical queries
- Parallel tool calls - When tools are independent
- Optimize prompts - Shorter prompts = faster inference
- Early stopping - Return when confident enough
Memory Management
Context Window Management:
# Summarize old messages when context is full
if context_length > max_context:
summary = summarize_messages(old_messages)
context = [summary] + recent_messages
Cost Optimization
Token Usage:
- Monitor tokens per conversation
- Set appropriate
max_tokens - Use smaller models when possible
- Cache frequent queries
Security Considerations
Input Validation:
- Sanitize all user inputs
- Validate tool parameters
- Prevent injection attacks
- Limit query length
Access Control:
- Authenticate API requests
- Authorize tool access per user
- Audit sensitive operations
- Rate limit requests
Data Privacy:
- Don't log sensitive data
- Encrypt stored conversations
- Comply with privacy regulations
- Provide data deletion
What's Next
You now have production-ready AI agent templates with standardized structure, comprehensive testing, and deployment readiness. Your agents are well-architected, observable, and maintainable.
In Part 5: Ollama Model Management and Workflow Integration, we'll complete the series by covering:
- Ollama model lifecycle management
- Custom Modelfile patterns
- Model versioning and registry
- Integration with agents and experiments
- Complete workflow automation
We'll tie together all the pieces—workspace structure, documentation, experiments, and agents—into a unified Ollama-centered workflow.
Key Takeaways
- Two-track system separates prototypes from production agents
- Standardized templates ensure consistent agent structure
- Comprehensive testing makes agents reliable and maintainable
- Clear architecture with separated concerns (agent/tools/memory/prompts)
- Production readiness includes deployment, monitoring, and security
- Documentation makes agents usable by others
Resources
Templates:
- Agent README template (provided above)
- Agent structure (directory layout)
- Test templates (unit/integration/behavior)
Code Examples:
- Basic agent implementation
- Tool development pattern
- Memory management
- API server
Series Navigation
- Previous: Part 3: Experiment TrackingshippedPractical ApplicationsOct 19, 2025Building a Production ML Workspace: Part 3 - Experiment Tracking and ReproducibilityBuild systematic experiment tracking with templates, progress monitoring, and lifecycle management to ensure every ML experiment is reproducible and builds toward knowledge.
- Next: Part 5: Ollama Integration (coming soon)
- Series Home: Building a Production ML Workspace on GPU Infrastructure
Questions or suggestions? Find me on Twitter @bioinfo or at rundatarun.io
Related Articles
- Building a Production ML Workspace: Part 5 - Team Collaboration and Workflow IntegrationshippedPractical ApplicationsOct 19, 2025Building a Production ML Workspace: Part 5 - Team Collaboration and Workflow IntegrationComplete your production ML workspace with team collaboration patterns, workflow automation, version control strategies, and integration frameworks that scale.
- Building a Production ML Workspace: Part 2 - Documentation Systems That ScaleshippedPractical ApplicationsOct 19, 2025Building a Production ML Workspace: Part 2 - Documentation Systems That ScaleBuild a three-tier documentation system that captures ML work for debugging, review, and blog content—turning your experiments into shareable knowledge.
- Building a Production ML Workspace: Part 3 - Experiment Tracking and ReproducibilityshippedPractical ApplicationsOct 19, 2025Building a Production ML Workspace: Part 3 - Experiment Tracking and ReproducibilityBuild systematic experiment tracking with templates, progress monitoring, and lifecycle management to ensure every ML experiment is reproducible and builds toward knowledge.
About the Author: Justin Johnson builds AI systems and writes about practical AI development.
justinhjohnson.com | Twitter | LinkedIn | Run Data Run | Subscribe
Follow the lab
Get the next experiment
Enjoyed the breakdown on Building a Production ML Workspace: Part 4 - Production-Ready AI Agent Templates? New entries land roughly weekly. No digest, no roundup. Just the next build log, when it ships.
Related experiments
Apparatus
1,555 words · 10 min read
- ai-agents
- agent-development
- ml-development
- best-practices
- production-systems
Links to this entry
- Building a Production ML Workspace: Part 2 - Documentation Systems That Scale
- Building a Production ML Workspace: Part 3 - Experiment Tracking and Reproducibility
- Building a Production ML Workspace: Part 5 - Team Collaboration and Workflow Integration
- DGX Lab: When Simple Heuristics Beat ML by 95,000x - Day 1
- The Hidden Crisis in LLM Fine-Tuning: When Your Model Silently Forgets Everything