# Building a Production ML Workspace: Part 4 - Production-Ready AI Agent Templates

You've organized your workspace, built documentation systems, and implemented experiment tracking. Your ML workflows are reproducible and well-documented. Now it's time to tackle the most complex artifact in your workspace: **AI agents**.

Unlike experiments (which run once) or models (which are static), agents are dynamic systems that interact with tools, maintain state, and make decisions. Without proper structure, agent development quickly turns into tangled code with unclear responsibilities that is nearly impossible to debug.

This article shows you how to build production-ready AI agents using standardized templates, clear architecture patterns, comprehensive testing, and deployment readiness frameworks.

<div class="callout" data-callout="info">
<div class="callout-title">About This Series</div>
<div class="callout-content">

This is Part 4 of a 5-part series on building production ML workspaces.

Previous parts:

- [[building-production-ml-workspace-part-1-structure|Part 1: Workspace Structure]]
- [[building-production-ml-workspace-part-2-documentation|Part 2: Documentation Systems]]
- [[building-production-ml-workspace-part-3-experiments|Part 3: Experiment Tracking]]

Coming next:

- **Part 5**: Ollama Model Management and Workflow Integration

</div>
</div>

---

## The Agent Development Problem

AI agents present unique challenges that experiments and models don't:

**Complexity Challenges:**

- Multiple components (LLM, tools, memory, state)
- Complex interaction patterns
- Error handling across tool calls
- State management and conversation history
- Tool orchestration and chaining

**Production Challenges:**

- How do you test an agent thoroughly?
- How do you debug multi-step reasoning?
- How do you version control prompts?
- How do you monitor production behavior?
- How do you handle tool failures gracefully?

**Organizational Challenges:**

- Prototypes that never reach production
- Unclear distinction between experimental and production code
- Lack of reusable components
- No standardized agent structure
- Difficulty onboarding to existing agents

---

## The Two-Track Agent System

Our solution: separate tracks for prototype and production agents, with clear promotion criteria for moving between them.
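Those promotion criteria work best when they're checkable. The sketch below is illustrative rather than part of the template: `check_promotion_readiness` is a hypothetical helper that simply verifies a prototype directory already contains the production-track artifacts described later in this article before you move it.

```python
from pathlib import Path

# Artifacts the production track expects (see the template structure below).
REQUIRED = [
    "README.md",
    "agent.py",
    "config.yaml",
    "requirements.txt",
    "tests",
    "deployment",
]

def check_promotion_readiness(agent_dir: str) -> list[str]:
    """Return a list of missing artifacts; an empty list means ready to promote."""
    root = Path(agent_dir)
    return [name for name in REQUIRED if not (root / name).exists()]

if __name__ == "__main__":
    missing = check_promotion_readiness("agents/prototypes/my-agent")  # hypothetical path
    if missing:
        print("Not ready to promote; missing:", ", ".join(missing))
    else:
        print("All required artifacts present.")
```

With that guardrail in mind, the two tracks sit side by side under `agents/`: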
```
Agent Development System
│
├── Prototype Track (agents/prototypes/)
│   ├── Fast iteration
│   ├── Minimal documentation
│   ├── Breaking changes OK
│   └── No tests required
│
└── Production Track (agents/production/)
    ├── Stable API
    ├── Comprehensive docs
    ├── Full test coverage
    └── Monitoring & logging
```

### When to Use Each Track

**Prototype Track:**

- Initial exploration and experimentation
- Testing new LLM capabilities
- Rapid tool integration trials
- Proof-of-concept development
- Research and learning

**Production Track:**

- Agents used by others or in automation
- Critical business workflows
- Public-facing applications
- Deployed services
- Agents with users who expect reliability

---

## Production Agent Template Structure

Every production agent gets this standardized structure:

```bash
agents/production/agent-name/
├── README.md                # Complete documentation
├── agent.py                 # Main agent code
├── config.yaml              # Configuration
├── requirements.txt         # Dependencies
├── environment.yaml         # Full environment
├── prompts/                 # Version-controlled prompts
│   ├── system.txt           # System prompt
│   ├── user_template.txt    # User prompt template
│   └── few_shot.yaml        # Few-shot examples
├── tools/                   # Agent tools
│   ├── __init__.py
│   ├── tool_registry.py     # Tool management
│   └── [tool_name].py       # Individual tools
├── memory/                  # State and memory
│   ├── conversation.py      # Conversation history
│   ├── context.py           # Context management
│   └── state.py             # Agent state
├── tests/                   # Comprehensive tests
│   ├── test_agent.py        # Agent behavior tests
│   ├── test_tools.py        # Tool tests
│   └── test_integration.py  # End-to-end tests
├── logs/                    # Execution logs
│   └── [timestamp].log
├── metrics/                 # Performance tracking
│   └── metrics.json
└── deployment/              # Deployment assets
    ├── Dockerfile
    ├── docker-compose.yml
    └── deploy.sh
```
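To keep every new agent consistent with this layout, it helps to scaffold it from a script rather than by hand. The following is a minimal sketch, not part of the template: directory and file names mirror the tree above, and placeholder entries like `[tool_name].py` are left out.

```python
from pathlib import Path

# Subdirectories and placeholder files from the standard layout above.
DIRS = ["prompts", "tools", "memory", "tests", "logs", "metrics", "deployment"]
FILES = [
    "README.md", "agent.py", "config.yaml", "requirements.txt",
    "environment.yaml", "prompts/system.txt", "prompts/user_template.txt",
    "prompts/few_shot.yaml", "tools/__init__.py", "tools/tool_registry.py",
    "memory/conversation.py", "memory/context.py", "memory/state.py",
    "tests/test_agent.py", "tests/test_tools.py", "tests/test_integration.py",
    "metrics/metrics.json", "deployment/Dockerfile",
    "deployment/docker-compose.yml", "deployment/deploy.sh",
]

def scaffold_agent(name: str, root: str = "agents/production") -> Path:
    """Create the standardized directory structure for a new production agent."""
    base = Path(root) / name
    for d in DIRS:
        (base / d).mkdir(parents=True, exist_ok=True)
    for f in FILES:
        (base / f).touch(exist_ok=True)
    return base

if __name__ == "__main__":
    print(f"Scaffolded {scaffold_agent('customer-support-agent')}")
```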
---

## The Complete Agent README

Location: `agents/production/[agent-name]/README.md`

The template opens with a status header and summary:

```markdown
# Agent: [Agent Name]

**Status:** 🟢 Production | 🟡 Beta | 🔴 Experimental
**Version:** v1.0.0
**Last Updated:** YYYY-MM-DD
**Maintainer:** Your Name

---

## Quick Summary

**Purpose:** [One sentence describing what this agent does]

**Use Cases:**
1. [Primary use case]
2. [Secondary use case]
3. [Additional use case]

**Key Capabilities:**
- [Capability 1]
- [Capability 2]
- [Capability 3]
```

The remaining sections of the template are covered one by one below, from quick start through deployment and operations.

---

## Quick Start

### Installation

```bash
# Clone or navigate to agent directory
cd agents/production/agent-name

# Create environment
conda env create -f environment.yaml
conda activate agent-env

# Install dependencies
pip install -r requirements.txt

# Verify installation
python agent.py --verify
```

### Basic Usage

```python
from agent import AgentName

# Initialize agent
agent = AgentName(
    model="llama3.1:8b",
    temperature=0.7,
    max_iterations=5
)

# Run agent
result = agent.run("Your task here")
print(result)
```

### Configuration

Edit `config.yaml`:

```yaml
model:
  name: "llama3.1:8b"
  temperature: 0.7
  max_tokens: 2000

agent:
  max_iterations: 5
  max_tool_calls: 10
  timeout_seconds: 300
```

---

## Architecture

### System Overview

```
User Input
    ↓
[Agent Core] ←→ [LLM (Ollama)]
    ↓
[Tool Orchestrator]
    ↓
[Tools] → [External APIs/Systems]
    ↓
[Memory/State]
    ↓
Agent Response
```

### Components

**Agent Core (`agent.py`):**

- Main agent logic and reasoning loop
- Decision-making and planning
- Error handling and recovery
- Response generation

**Tool System (`tools/`):**

- Tool registry and management
- Individual tool implementations
- Tool call validation
- Result processing

**Memory System (`memory/`):**

- Conversation history management
- Context window management
- State persistence
- Session management

**Prompt System (`prompts/`):**

- System prompt definition
- User prompt templates
- Few-shot examples
- Prompt versioning

---

## Detailed Usage

### Basic Example

```python
from agent import CustomerSupportAgent

# Initialize
agent = CustomerSupportAgent()

# Simple query
response = agent.run(
    query="How do I reset my password?",
    context={"user_id": "12345"}
)

print(response.text)
print(f"Tools used: {response.tools_called}")
print(f"Confidence: {response.confidence}")
```

### Advanced Example

```python
# Multi-turn conversation
agent = CustomerSupportAgent()
conversation_id = agent.start_conversation(
    user_id="12345",
    initial_context={"account_type": "premium"}
)

# Turn 1
response1 = agent.continue_conversation(
    conversation_id=conversation_id,
    message="I can't log in to my account"
)

# Turn 2
response2 = agent.continue_conversation(
    conversation_id=conversation_id,
    message="I tried that already, still not working"
)

# Get conversation history
history = agent.get_conversation_history(conversation_id)
```

### Streaming Response

```python
# Stream agent response
for chunk in agent.run_streaming("Complex analysis task"):
    print(chunk.text, end="")
    if chunk.tool_call:
        print(f"\n[Using tool: {chunk.tool_call.name}]")
```
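The examples above treat the agent as a black box. Under the hood, the agent core runs the bounded reason-and-act loop described in the Architecture section: ask the LLM for a decision, execute any requested tool, feed the result back, and stop at a final answer or at `max_iterations`. The sketch below only illustrates that pattern; it is not the template's `agent.py`, and the `call_llm` and `run_tool` callables stand in for your Ollama call and tool-registry dispatch.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentResponse:
    text: str
    tools_called: list = field(default_factory=list)
    success: bool = True

def run(task: str, call_llm: Callable, run_tool: Callable,
        max_iterations: int = 5) -> AgentResponse:
    """Bounded reason/act loop: ask the LLM for a decision, execute any
    requested tool, feed the result back, and stop on a final answer or
    when the iteration cap is reached."""
    messages = [{"role": "user", "content": task}]
    tools_called = []
    for _ in range(max_iterations):
        # call_llm is expected to return either {"answer": ...} or
        # {"tool": ..., "args": {...}} -- a deliberate simplification.
        decision = call_llm(messages)
        if "answer" in decision:
            return AgentResponse(decision["answer"], tools_called)
        try:
            result = run_tool(decision["tool"], decision.get("args", {}))
            tools_called.append(decision["tool"])
            messages.append({"role": "tool", "content": str(result)})
        except Exception as exc:
            # Fail gracefully: report the error back to the LLM and keep going.
            messages.append({"role": "tool", "content": f"Tool error: {exc}"})
    return AgentResponse("Stopped at the iteration limit without a final answer.",
                         tools_called, success=False)
```

The built-in tools such a loop would dispatch to are described next.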
---

## Tools Available

### Built-in Tools

1. **knowledge_base_search**
   - Purpose: Search internal knowledge base
   - Input: Query string
   - Output: Relevant documents
   - Example: `{"query": "password reset policy"}`

2. **api_call**
   - Purpose: Call external API
   - Input: Endpoint, method, params
   - Output: API response
   - Example: `{"endpoint": "/user/status", "method": "GET"}`

3. **database_query**
   - Purpose: Query user database
   - Input: Query parameters
   - Output: User data
   - Example: `{"user_id": "12345", "fields": ["email", "status"]}`

### Adding Custom Tools

```python
# tools/custom_tool.py
from tools.base import BaseTool

class CustomTool(BaseTool):
    name = "custom_tool"
    description = "What this tool does"

    def __init__(self):
        super().__init__()

    def execute(self, **kwargs):
        """Execute tool logic"""
        # Your implementation
        return {"result": "value"}

    def validate_input(self, **kwargs):
        """Validate inputs"""
        required = ["param1", "param2"]
        return all(k in kwargs for k in required)
```

Register in `tools/tool_registry.py`:

```python
from tools.custom_tool import CustomTool

TOOLS = {
    "custom_tool": CustomTool(),
    # ... other tools
}
```

---

## Configuration Reference

### config.yaml Structure

```yaml
# Model configuration
model:
  name: "llama3.1:8b"        # Ollama model
  temperature: 0.7           # Sampling temperature
  max_tokens: 2000           # Max response length
  top_p: 0.9                 # Nucleus sampling

# Agent behavior
agent:
  max_iterations: 5          # Max reasoning loops
  max_tool_calls: 10         # Max tool invocations
  timeout_seconds: 300       # Overall timeout
  retry_on_error: true       # Retry failed operations
  max_retries: 3             # Max retry attempts

# Tools configuration
tools:
  enabled:
    - knowledge_base_search
    - api_call
    - database_query
  timeout_per_tool: 30       # Seconds per tool

# Memory configuration
memory:
  max_history_length: 10     # Messages to keep
  context_window: 4096       # Token limit
  summarize_old: true        # Summarize old messages

# Logging
logging:
  level: "INFO"
  log_tool_calls: true
  log_prompts: false         # Sensitive: disable in prod
  save_conversations: true
```
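How this file gets consumed is up to the agent, but a common pattern is to load it once at startup and pass the values into the agent and its tools. A minimal sketch, assuming PyYAML is installed and `config.yaml` follows the structure above:

```python
from pathlib import Path
import yaml  # PyYAML

def load_config(path: str = "config.yaml") -> dict:
    """Load the agent configuration from YAML into a plain dictionary."""
    with Path(path).open() as f:
        return yaml.safe_load(f)

config = load_config()
model_name = config["model"]["name"]               # e.g. "llama3.1:8b"
max_iterations = config["agent"]["max_iterations"]
enabled_tools = config["tools"]["enabled"]
```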
---

## Testing

### Running Tests

```bash
# All tests
pytest tests/

# Specific test file
pytest tests/test_agent.py

# With coverage
pytest --cov=. tests/

# Verbose output
pytest -v tests/
```

### Test Structure

**Unit Tests (`test_tools.py`):**

```python
def test_knowledge_base_search():
    """Test knowledge base search tool"""
    tool = KnowledgeBaseSearch()
    result = tool.execute(query="test query")

    assert result["documents"] is not None
    assert len(result["documents"]) > 0
```

**Integration Tests (`test_integration.py`):**

```python
def test_agent_conversation_flow():
    """Test complete conversation flow"""
    agent = CustomerSupportAgent()

    # Start conversation
    conv_id = agent.start_conversation(user_id="test")

    # First message
    response1 = agent.continue_conversation(
        conversation_id=conv_id,
        message="I need help"
    )
    assert response1.text is not None

    # Follow-up
    response2 = agent.continue_conversation(
        conversation_id=conv_id,
        message="Tell me more"
    )
    assert len(response2.conversation_history) == 4  # 2 exchanges
```

**Behavior Tests (`test_agent.py`):**

```python
from unittest.mock import patch

def test_agent_handles_tool_failure():
    """Test graceful handling of tool failures"""
    agent = CustomerSupportAgent()

    # Mock tool to fail
    with patch('tools.api_call.execute') as mock_tool:
        mock_tool.side_effect = Exception("API error")

        response = agent.run("Query requiring API call")

        # Agent should recover gracefully
        assert response.text is not None
        assert "error" in response.text.lower()
        assert response.success is False
```

---

## Monitoring & Metrics

### Metrics Tracked

```json
{
  "conversation_id": "abc123",
  "timestamp": "2024-10-19T14:30:00",
  "duration_seconds": 4.2,
  "model_used": "llama3.1:8b",
  "tokens": {
    "prompt": 1024,
    "completion": 256,
    "total": 1280
  },
  "tools_called": [
    {
      "name": "knowledge_base_search",
      "duration_ms": 120,
      "success": true
    }
  ],
  "iterations": 2,
  "success": true,
  "error": null,
  "user_feedback": null
}
```

### Accessing Metrics

```python
# Get agent metrics
from metrics import AgentMetrics

metrics = AgentMetrics.load("metrics/metrics.json")

# Summary statistics
print(f"Total conversations: {metrics.total_conversations}")
print(f"Average duration: {metrics.avg_duration}s")
print(f"Success rate: {metrics.success_rate}%")
print(f"Most used tools: {metrics.top_tools}")
```
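The `AgentMetrics` loader above reads records; how they get written is up to the agent. One simple option is to append one record per conversation, as in this sketch, which assumes `metrics/metrics.json` stores a JSON list rather than a single object:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def record_metrics(entry: dict, path: str = "metrics/metrics.json") -> None:
    """Append one conversation record to the metrics file (stored as a JSON list)."""
    metrics_file = Path(path)
    records = json.loads(metrics_file.read_text()) if metrics_file.exists() else []
    records.append(entry)
    metrics_file.parent.mkdir(parents=True, exist_ok=True)
    metrics_file.write_text(json.dumps(records, indent=2))

record_metrics({
    "conversation_id": "abc123",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "duration_seconds": 4.2,
    "success": True,
})
```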
---

## Deployment

### Docker Deployment

```bash
# Build image
docker build -t agent-name:v1.0.0 .

# Run container
docker run -d \
  --name agent-name \
  -p 8000:8000 \
  -v $(pwd)/logs:/app/logs \
  -e OLLAMA_HOST=http://host.docker.internal:11434 \
  agent-name:v1.0.0

# Check logs
docker logs -f agent-name
```

### API Server

The agent includes a FastAPI server:

```bash
# Start server
python -m agent.server

# Server runs on http://localhost:8000
```

**API Endpoints:**

```bash
# Health check
curl http://localhost:8000/health

# Run agent
curl -X POST http://localhost:8000/run \
  -H "Content-Type: application/json" \
  -d '{"query": "Your question here"}'

# Start conversation
curl -X POST http://localhost:8000/conversations \
  -H "Content-Type: application/json" \
  -d '{"user_id": "12345"}'
```

---

## Troubleshooting

### Common Issues

**Issue: Agent times out**
- **Cause:** Task too complex or tools too slow
- **Solution:** Increase `timeout_seconds` in config.yaml or optimize tools

**Issue: Tool calls fail**
- **Cause:** Tool errors or invalid inputs
- **Solution:** Check logs/ for detailed error messages, verify tool inputs

**Issue: Poor responses**
- **Cause:** Prompt issues or model limitations
- **Solution:** Review `prompts/system.txt`, add few-shot examples

**Issue: Memory errors**
- **Cause:** Context window exceeded
- **Solution:** Enable `summarize_old` in memory config

### Debug Mode

```python
# Enable debug logging
agent = CustomerSupportAgent(debug=True)

# This logs:
# - Full prompts sent to LLM
# - Tool inputs/outputs
# - Reasoning steps
# - Error traces
```

---

## Development Workflow

### Prototype → Production Checklist

- [ ] **Code Complete**
  - [ ] All features implemented
  - [ ] Error handling comprehensive
  - [ ] Configuration externalized
- [ ] **Documentation**
  - [ ] README complete
  - [ ] Docstrings for all functions
  - [ ] Usage examples provided
  - [ ] Architecture documented
- [ ] **Testing**
  - [ ] Unit tests (>80% coverage)
  - [ ] Integration tests
  - [ ] Behavior tests
  - [ ] All tests passing
- [ ] **Performance**
  - [ ] Response time acceptable (<5s typical)
  - [ ] Tool timeouts configured
  - [ ] Memory usage reasonable
- [ ] **Production Readiness**
  - [ ] Logging comprehensive
  - [ ] Metrics collection enabled
  - [ ] Deployment assets created
  - [ ] Security review complete

### Versioning Strategy

**Semantic Versioning:**

- **v1.0.0** - Major: Breaking API changes
- **v1.1.0** - Minor: New features, backward compatible
- **v1.1.1** - Patch: Bug fixes

**Prompt Versioning:**

```
prompts/
├── v1/
│   ├── system.txt
│   └── user_template.txt
└── v2/
    ├── system.txt
    └── user_template.txt
```

---

## Best Practices

### Agent Design

**Keep It Focused:**

- One agent, one clear purpose
- Don't build "do everything" agents
- Prefer specialized agents with clear domains

**Fail Gracefully:**

- Validate inputs before tool calls
- Handle tool failures without crashing
- Provide helpful error messages
- Always return a response (even if degraded)

**Be Observable:**

- Log all tool calls
- Track metrics consistently
- Make debugging easy
- Provide clear status indicators

### Prompt Engineering

**Version Control Prompts:**

- Keep prompts in separate files
- Use git for prompt versioning
- Document prompt changes
- A/B test prompt variations

**Structure System Prompts:**

```
1. Role definition
2. Capabilities and limitations
3. Tool descriptions
4. Response format requirements
5. Behavioral guidelines
```
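Keeping prompts in files means the agent has to load and fill them at runtime. A minimal sketch of that, assuming the versioned `prompts/` layout shown under Versioning Strategy and a `{query}` placeholder in the user template (both assumptions, not part of the template):

```python
from pathlib import Path

def load_prompts(version: str = "v2", prompts_dir: str = "prompts") -> tuple[str, str]:
    """Read the system prompt and user template for a given prompt version."""
    base = Path(prompts_dir) / version
    system = (base / "system.txt").read_text()
    user_template = (base / "user_template.txt").read_text()
    return system, user_template

system_prompt, user_template = load_prompts("v2")
# Fill the template; assumes it contains a {query} placeholder.
user_prompt = user_template.format(query="How do I reset my password?")
```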
### Tool Development

**Make Tools Atomic:**

- One tool, one clear function
- Avoid tool interdependencies
- Return structured data
- Include success/failure indicators

**Validate Everything:**

- Check inputs before execution
- Validate outputs before returning
- Handle edge cases explicitly
- Provide clear error messages

---

## Performance Optimization

### Response Time

**Target:** <5 seconds for typical queries

**Strategies:**

1. **Cache tool results** - Don't repeat identical queries
2. **Parallel tool calls** - When tools are independent
3. **Optimize prompts** - Shorter prompts = faster inference
4. **Early stopping** - Return when confident enough

### Memory Management

**Context Window Management:**

```python
# Summarize old messages when context is full
if context_length > max_context:
    summary = summarize_messages(old_messages)
    context = [summary] + recent_messages
```

### Cost Optimization

**Token Usage:**

- Monitor tokens per conversation
- Set appropriate `max_tokens`
- Use smaller models when possible
- Cache frequent queries

---

## Security Considerations

**Input Validation:**

- Sanitize all user inputs
- Validate tool parameters
- Prevent injection attacks
- Limit query length

**Access Control:**

- Authenticate API requests
- Authorize tool access per user
- Audit sensitive operations
- Rate limit requests

**Data Privacy:**

- Don't log sensitive data
- Encrypt stored conversations
- Comply with privacy regulations
- Provide data deletion
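Input validation in particular can be wired in as a small pre-flight guard that runs before the agent ever sees the query. The sketch below covers the length-limit and sanitization points above; the limit and character filter are illustrative choices, not part of the template:

```python
import re

MAX_QUERY_LENGTH = 2000  # illustrative limit

def validate_query(query: str) -> str:
    """Reject oversized input and strip control characters before the agent runs."""
    if not query or not query.strip():
        raise ValueError("Query is empty.")
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(f"Query exceeds {MAX_QUERY_LENGTH} characters.")
    # Drop control characters that could confuse logging or downstream tools.
    return re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", query).strip()

clean = validate_query("  How do I reset my password?  ")
```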
---

## What's Next

You now have production-ready AI agent templates with standardized structure, comprehensive testing, and deployment readiness. Your agents are well-architected, observable, and maintainable.

In **Part 5: Ollama Model Management and Workflow Integration**, we'll complete the series by covering:

- Ollama model lifecycle management
- Custom Modelfile patterns
- Model versioning and registry
- Integration with agents and experiments
- Complete workflow automation

We'll tie together all the pieces (workspace structure, documentation, experiments, and agents) into a unified Ollama-centered workflow.

---

## Key Takeaways

- **Two-track system** separates prototypes from production agents
- **Standardized templates** ensure consistent agent structure
- **Comprehensive testing** makes agents reliable and maintainable
- **Clear architecture** with separated concerns (agent/tools/memory/prompts)
- **Production readiness** includes deployment, monitoring, and security
- **Documentation** makes agents usable by others

---

## Resources

**Templates:**

- Agent README template (provided above)
- Agent structure (directory layout)
- Test templates (unit/integration/behavior)

**Code Examples:**

- Basic agent implementation
- Tool development pattern
- Memory management
- API server

---

## Series Navigation

- **Previous:** [[building-production-ml-workspace-part-3-experiments|Part 3: Experiment Tracking]]
- **Next:** Part 5: Ollama Integration (coming soon)
- **Series Home:** Building a Production ML Workspace on GPU Infrastructure

---

**Questions or suggestions?** Find me on Twitter [@bioinfo](https://twitter.com/bioinfo) or at [rundatarun.io](https://rundatarun.io)

---

### Related Articles

- [[building-production-ml-workspace-part-5-collaboration|Building a Production ML Workspace: Part 5 - Team Collaboration and Workflow Integration]]
- [[building-production-ml-workspace-part-2-documentation|Building a Production ML Workspace: Part 2 - Documentation Systems That Scale]]
- [[building-production-ml-workspace-part-3-experiments|Building a Production ML Workspace: Part 3 - Experiment Tracking and Reproducibility]]

---

<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>

<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>