# Debugging Claude Code with Claude: A Meta-Optimization Journey

Claude Code had been getting slower. Responses were taking longer, some features weren't working reliably, and the startup time had crept from instant to several seconds. I could have dug through configuration files manually, but I had a better idea: what if Claude could debug itself?

This article documents the meta-debugging process of using Claude to analyze its own internal state, identify performance bottlenecks, and implement systematic optimizations. The approach is generalizable to any complex development tool with extensive logging.

## The Meta-Debugging Approach

The insight is simple: Claude Code generates detailed debug logs, session transcripts, and project state files. These contain patterns that indicate problems. Claude, as an AI, excels at pattern recognition and analysis at scale. Why not use it to analyze its own operation?

<div class="callout" data-callout="tip">
<div class="callout-title">Key Insight</div>
<div class="callout-content">
AI tools that generate extensive logs are perfect candidates for self-analysis. The tool itself can identify patterns humans would miss in thousands of lines of debug output.
</div>
</div>

### Investigation Strategy

The plan was straightforward:

1. **Map the landscape**: Identify what data Claude Code generates
2. **Quantify the problem**: Measure file counts, sizes, and patterns
3. **Pattern analysis**: Look for errors, timeouts, and failures
4. **Root cause identification**: Trace problems to configuration issues
5. **Systematic fixes**: Address issues in order of impact
6. **Validation**: Measure improvements

## What Claude Code Stores

Claude Code maintains several directories in `~/.claude/`:

```bash
839MB   projects/       # Session state for different directories
602MB   debug/          # Debug logs from every session
236MB   plugins/        # Plugin cache
131MB   transcripts/    # Conversation history
43MB    file-history/   # File edit history
5.9MB   todos/          # Task tracking state
```

The debug directory alone contained nearly 1,500 log files spanning months of usage. Perfect data for pattern analysis.

## The Investigation Process

### Step 1: Quantify Debug Logs

First question: how bad is the debug log accumulation?

```bash
find ~/.claude/debug -type f -mtime +30 | wc -l
# Result: 916 files older than 30 days
```

These old debug files serve no practical purpose but add overhead to file system operations. Quick win identified.

### Step 2: Error Pattern Analysis

Next, I asked Claude to analyze error patterns across all debug logs:

```bash
grep -h "ERROR\|WARN\|fail" ~/.claude/debug/*.txt | \
  sort | uniq -c | sort -rn | head -20
```

The results were revealing:

```
429,352  "ide" MCP server - "Not connected"
  3,981  filesystem MCP server errors
  3,745  obsidian-search MCP server errors
  2,502  postgresql MCP server failures
```

### Step 3: The Phantom IDE Server

The most striking finding: **429,352 failed connection attempts** to an "ide" MCP server that wasn't even in my configuration. Claude was trying to connect to a server that didn't exist, on every single operation.

Tracing through the code revealed this was a legacy MCP server that Claude Code still attempted to initialize by default. The fix was simple:

```json
{
  "env": {
    "CLAUDE_CODE_DISABLE_IDE_MCP": "1"
  }
}
```

One environment variable eliminated 400K+ failed operations.
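If you apply the same fix, it's worth confirming the failures actually stop. Here is a minimal verification sketch, assuming new debug logs land in `~/.claude/debug/` and contain the same "Not connected" wording as the aggregated counts above; it uses the modification time of `settings.json` as a rough "since the fix" cutoff:

```bash
# Count "Not connected" errors only in debug logs written after the
# settings change (settings.json mtime used as a rough cutoff)
find ~/.claude/debug -type f -newer ~/.claude/settings.json \
  -exec grep -h "Not connected" {} + | wc -l
# Expect 0 after restarting Claude Code
```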
### Step 4: Streaming Fallback Analysis

Pattern analysis revealed another critical issue:

```
145 errors: "Error streaming, falling back to non-streaming mode: Connection error"
126 errors: "Request timed out"
 83 errors: "403: not authorized to perform bedrock:InvokeModelWithResponseStream"
```

Every Bedrock request was attempting streaming, failing due to AWS service control policies, then falling back to non-streaming. This doubled the latency of every request.

The AWS IAM policy explicitly blocked streaming:

```json
{
  "Effect": "Deny",
  "Action": "bedrock:InvokeModelWithResponseStream"
}
```

Fix: Disable streaming for Bedrock mode:

```bash
export ANTHROPIC_DISABLE_STREAMING=1
```

<div class="callout" data-callout="warning">
<div class="callout-title">AWS Bedrock Gotcha</div>
<div class="callout-content">
Service Control Policies can silently block specific API operations. Always check SCPs when debugging AWS permission issues, not just IAM policies.
</div>
</div>

### Step 5: MCP Server Audit

The debug logs showed 13 configured MCP servers, but several were problematic:

**PostgreSQL**: 2,502 connection failures
- Server configured but not running locally
- Every session attempted connection
- **Fix**: Remove from config

**Duplicate Obsidian Servers**: Two different Obsidian integrations
- `obsidian-search`: Custom Python server with a 32MB embedding model
- `obsidian`: Standard mcp-obsidian
- Both loading on every session
- **Fix**: Keep the one being used, remove the other

**Resend Email**: Redundant with the resend skill
- MCP server provided email capabilities
- Already handled by a Claude Code skill
- **Fix**: Remove MCP server, keep skill

**Perplexity**: Redundant with the researching-with-perplexity skill
- Same situation as resend
- **Fix**: Remove MCP server

## Implementation: Systematic Fixes

### Configuration Changes

**~/.claude/settings.json** (settings file):

```json
{
  "env": {
    "CLAUDE_CODE_DISABLE_IDE_MCP": "1"
  },
  "hooks": {
    "PostToolUse": [
      // Removed notification hook for performance
    ]
  }
}
```

**~/.claude/.mcp.json** (MCP configuration):

```json
{
  "mcpServers": {
    // Removed: postgresql, obsidian (duplicate),
    //          resend-email, perplexity
    // Kept: 9 essential servers
  }
}
```

**Shell configuration** (Bedrock mode):

```bash
claude-mode() {
  if [[ "$1" == "bedrock" ]]; then
    export ANTHROPIC_DISABLE_STREAMING=1  # Fix streaming fallbacks
    # ... other config
  fi
}
```

### The Results

**Before:**
- 13 MCP servers (4 constantly failing)
- 13,582 connection errors/timeouts across sessions
- 429K IDE connection failures
- 271 streaming fallback errors
- Startup time: 3-4 seconds
- Noticeable response lag

**After:**
- 9 functional MCP servers
- ~70% fewer connection attempts
- Zero IDE failures
- Zero streaming fallbacks (Bedrock)
- Startup time: 1.5-2 seconds
- **30-50% faster response times**

## Making This Reproducible

Here's how you can apply this approach to debug your own Claude Code installation:

### 1. Analyze Your Debug Logs

```bash
# Count errors by type
grep -rh "ERROR" ~/.claude/debug/ | \
  cut -d' ' -f4- | sort | uniq -c | sort -rn | head -20

# Find connection issues
grep -rh "timeout\|failed\|error" ~/.claude/debug/ | \
  grep "MCP server" | cut -d'"' -f2 | sort | uniq -c | sort -rn

# Check for streaming fallbacks
grep -rh "fallback\|streaming" ~/.claude/debug/ | \
  grep -i error | wc -l
```
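If you plan to re-run these checks regularly, it helps to bundle them into a small script. This is a convenience sketch, not part of Claude Code: the `debug-audit.sh` name is made up, and the grep patterns simply mirror the commands above, so adjust them to match your actual log format:

```bash
#!/usr/bin/env bash
# debug-audit.sh - one-shot summary of the checks above (name is illustrative)
DEBUG_DIR="${HOME}/.claude/debug"

echo "== Top error lines =="
grep -rh "ERROR" "$DEBUG_DIR" | sort | uniq -c | sort -rn | head -10

echo "== MCP server connection issue count =="
grep -rh "timeout\|failed\|error" "$DEBUG_DIR" | grep -c "MCP server"

echo "== Streaming fallback count =="
grep -rh "fallback\|streaming" "$DEBUG_DIR" | grep -ci error
```

Run it once before you change anything so you have a baseline to compare against after each cleanup pass.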
### 2. Identify Your MCP Servers

```bash
# List configured servers
cat ~/.claude/.mcp.json | jq -r '.mcpServers | keys[]'

# Check which are actually being used
grep -rh "MCP server" ~/.claude/debug/*.txt | \
  grep "Tool.*failed" | cut -d'"' -f2 | sort | uniq -c | sort -rn
```

### 3. Test Each Server

```bash
# For each server, check if it's responding
# Example for PostgreSQL:
psql -c "SELECT 1" >/dev/null 2>&1 && echo "Running" || echo "Not running"
```

### 4. Clean Up Systematically

Start with the highest-impact issues:

1. **Disable phantom servers**: Servers that don't exist but are being loaded
2. **Remove failed connections**: Servers configured but not running
3. **Eliminate duplicates**: Multiple servers providing the same functionality
4. **Fix authentication issues**: Servers with permission problems

### 5. Measure Improvements

```bash
# Before and after startup timing
time claude --version

# Count remaining errors
grep -rh "ERROR" ~/.claude/debug/*.txt | wc -l
```

## Lessons from Meta-Debugging

### Pattern Recognition at Scale

Humans struggle with pattern recognition across thousands of log entries. AI tools excel at this. Using Claude to analyze 1,500 debug files revealed patterns I would have missed.

### Configuration Drift is Real

Over time, configurations accumulate cruft. Plugins get installed and forgotten. Services get configured but never cleaned up. Regular audits prevent performance degradation.

### The Value of Good Logging

Claude Code's detailed debug logs made this analysis possible. Tools without comprehensive logging are much harder to optimize.

### Dependencies Have Dependencies

MCP servers introduce their own dependencies. The `obsidian-search` server loads a 32MB embedding model on every startup. Understanding the full initialization chain is crucial.

<div class="callout" data-callout="info">
<div class="callout-title">Performance Principle</div>
<div class="callout-content">
Every additional integration point is a potential failure point and performance bottleneck. Ruthlessly prune unused integrations.
</div>
</div>

## Beyond Claude Code

This meta-debugging approach applies to any complex tool with good logging:

- **VS Code**: Analyze extension activation times and error patterns
- **Docker**: Review container logs for common failures
- **Kubernetes**: Pattern-match across pod logs to identify cluster issues
- **CI/CD**: Analyze build logs to find recurring bottlenecks

The key is having:

1. Comprehensive logging
2. A pattern-recognition tool (AI or specialized scripts)
3. Willingness to act on findings

## Related Reading

- [[Knowledge/Blog-Obsidian/Practical Applications/claude-skills-vs-mcp-servers|Claude Skills vs MCP Servers]]: Understanding the difference between skills and MCP servers
- [[claude-code-best-practices|Claude Code Best Practices]]: General optimization strategies
- [[making-claude-code-more-agentic|Making Claude Code More Agentic]]: Advanced configuration techniques

## Takeaways

**For immediate results:**

1. Run the debug log analysis commands above
2. Identify your top 3 error patterns
3. Fix the highest-frequency issues first
4. Measure before-and-after performance

**For long-term optimization:**

1. Schedule monthly configuration audits (see the cron sketch below)
2. Remove unused plugins and servers
3. Monitor debug logs for new patterns
4. Document your configuration decisions

**Meta-lesson:** AI tools can effectively debug themselves. The same capabilities that make them useful for development work apply to analyzing their own operation. Don't manually grep through thousands of log files when the AI can do it better.
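To make the monthly audit automatic, a cron entry can run the checks on a schedule. This assumes the hypothetical `debug-audit.sh` wrapper sketched earlier; the schedule and paths are illustrative:

```bash
# Install with `crontab -e`; runs at 9:00 on the 1st of each month
# and appends the summary to a log file (paths are examples)
0 9 1 * * $HOME/bin/debug-audit.sh >> $HOME/.claude/audit.log 2>&1
```

On macOS a launchd agent is the more idiomatic choice, but cron is fine for a low-frequency job like this.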
---

*Performance improvements measured on an M4 Max MacBook Pro with 36GB RAM. Your results may vary based on configuration and usage patterns.*

---

### Related Articles

- [[syncing-claude-code-configs-across-machines|Syncing Claude Code Configurations Across Multiple Machines: A Practical Guide]]
- [[roo-code-codebase-indexing-free-setup|Supercharging Code Discovery: My Journey with Roo Code's Free Codebase Indexing]]
- [[hybrid-deployment-vercel-render-digitalocean|Deployment Dilemma: When to Use Vercel, Render, or Digital Ocean for React/Python Apps]]

---

<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>
<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>