# How I Delegated a 9-Day Medical AI Experiment (and Learned When to Step In)
**The Setup:** Fine-tune a medical AI model to answer questions from PubMedQA with high accuracy.
**The Reality:** 9 days, 7 experiments, 60+ hours of work, and constant decisions about when to delegate and when to take control.
**The Result:** 92.4% accuracy and clarity on when AI collaboration works best.
This isn't a story about Claude writing all my code. It's about the dance between delegation and decision-making that turned a 70% baseline into production-ready medical AI.
---
## Day 1-2: Setting Up the Baseline
### My Role: Define the Goal
I knew what I wanted:
- Fine-tune Google's Gemma-3-4b model
- Target: Medical question answering (PubMedQA dataset)
- Method: LoRA (efficient fine-tuning)
- Success metric: Statistically significant improvement
**What I told Claude:**
> "Set up LoRA fine-tuning for Gemma-3-4b on PubMedQA. Start with standard hyperparameters. I want a reproducible baseline before we optimize anything."
### Claude's Role: Implement and Iterate
Claude generated the training pipeline:
- Data loading and preprocessing
- LoRA configuration
- Training loop with logging
- Evaluation scripts
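In rough form, that baseline pipeline looked like the sketch below. This is my simplified reconstruction, not Claude's exact code; the model ID, dataset fields, prompt format, and LoRA target modules are assumptions for illustration.

```python
# Baseline LoRA fine-tuning sketch for PubMedQA (illustrative reconstruction).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "google/gemma-3-4b-it"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Baseline LoRA config; Experiment 7 later raised r to 64 and alpha to 128.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Turn each PubMedQA example into a context/question -> yes/no/maybe prompt.
def to_features(example):
    context = " ".join(example["context"]["contexts"])
    prompt = (f"Context: {context}\nQuestion: {example['question']}\n"
              f"Answer (yes/no/maybe): {example['final_decision']}")
    return tokenizer(prompt, truncation=True, max_length=1024)

raw = load_dataset("pubmed_qa", "pqa_labeled", split="train")
train_dataset = raw.map(to_features, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="exp1-baseline",
                           learning_rate=2e-4,   # baseline LR, lowered later
                           num_train_epochs=3,
                           per_device_train_batch_size=4,
                           logging_steps=10),
    train_dataset=train_dataset,
    # Copies input_ids into labels so the causal LM loss is computed.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```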
**Five experiments later:** Consistent 70% accuracy.
### The Decision Point
Most people would have stopped here with "it's not working." But consistent results across five runs meant **this was a real baseline**, not random noise.
**I decided:** Keep the 70% as our North Star. Now we optimize.
---
## Day 3-4: The Hyperparameter Phase
### Delegation Strategy: Test One Dimension
**My instruction:**
> "Keep the training code. Change only the hyperparameters. Try higher LoRA rank, lower learning rate, and add GPU cache clearing every 50 steps."
**Why I specified these:**
- Higher rank: More capacity for medical knowledge
- Lower learning rate: Avoid catastrophic forgetting
- Cache clearing: GPU stability on this hardware
### Claude's Implementation
Generated Experiment 7 with:
```python
# Increased from r=16 to r=64
lora_config = LoraConfig(r=64, lora_alpha=128, ...)

# Reduced from 2e-4 to 5e-5
learning_rate=5e-5

# Added stability: clear the GPU cache every 50 steps
if step % 50 == 0:
    torch.cuda.empty_cache()
```
**Result:** 82% accuracy (+12 percentage points over the 70% baseline)
### What I Learned
**Delegation worked because:**
- I provided clear constraints ("change only hyperparameters")
- Claude handled the implementation details
- I made the strategic choice of what to optimize
**I didn't need to:** Write the code, debug syntax errors, or structure the experiment.
**I did need to:** Decide which hyperparameters mattered and when to stop.
---
## Day 5-6: The Architecture Change
### The Strategic Decision
With 82% in hand, the question: "Is this good enough?"
For medical AI, I wanted 90%+. That meant changing architecture, not just hyperparameters.
**My instruction:**
> "Switch from standard Trainer to SFTTrainer from HuggingFace TRL. Keep the optimized hyperparameters from Experiment 7. SFTTrainer handles text-to-text better for Q&A tasks."
### Why I Chose This Path
**I knew from experience:**
- SFTTrainer optimizes for instruction-following
- Dynamic padding improves efficiency
- Better label handling for Q&A format
**Claude didn't suggest this.** This was domain knowledge I brought to the table.
### The Implementation
Claude converted the training pipeline:
- Reformatted dataset for text input
- Switched Trainer class to SFTTrainer
- Kept all hyperparameters from Experiment 7
- Added dynamic padding config
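The switch itself is small in code terms. Here's a rough sketch of the conversion against a recent TRL release; argument names have shifted between TRL versions, and the dataset formatting reuses the assumptions (the `raw` dataset and Experiment 7 `lora_config`) from the baseline sketch above.

```python
# Switch from the standard Trainer to TRL's SFTTrainer (illustrative sketch).
from trl import SFTConfig, SFTTrainer

# Reformat each PubMedQA example into a single "text" field for SFT.
def to_text(example):
    context = " ".join(example["context"]["contexts"])
    return {"text": (f"Context: {context}\n"
                     f"Question: {example['question']}\n"
                     f"Answer (yes/no/maybe): {example['final_decision']}")}

sft_dataset = raw.map(to_text, remove_columns=raw.column_names)

trainer = SFTTrainer(
    model="google/gemma-3-4b-it",     # base model; LoRA is applied via peft_config
    train_dataset=sft_dataset,
    peft_config=lora_config,          # the r=64 / alpha=128 config from Experiment 7
    args=SFTConfig(
        output_dir="exp8-sft",
        learning_rate=5e-5,           # carried over from Experiment 7
        num_train_epochs=3,
        per_device_train_batch_size=4,
    ),
    # Depending on the TRL version, you may also need to pass the tokenizer
    # explicitly (tokenizer=... in older releases, processing_class=... in newer ones).
)
trainer.train()
```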
**Training time:** 10 hours overnight.
**Result next morning:** Quick eval showed 89% accuracy.
### The Pivot Point
**Me:** "That's promising but not conclusive. We need full evaluation on 500 examples, not 100."
**Claude:** Implemented comprehensive evaluation script with:
- 500 examples (proper sample size)
- McNemar's test for statistical significance
- Detailed error analysis
- Multiple temperature testing
**Final result:** 92.4% accuracy, p < 0.001 (highly significant).
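The heart of that validation is a paired test on per-example correctness. A minimal version looks like this (my sketch, not Claude's exact script; the correctness arrays below are random stand-ins you'd replace with real evaluation output):

```python
# Paired comparison of two models evaluated on the same 500 examples.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
# Stand-in data: replace with per-example correctness flags from your eval runs.
baseline_correct = rng.random(500) < 0.82
finetuned_correct = rng.random(500) < 0.924

# 2x2 table of paired outcomes: rows = baseline right/wrong, cols = fine-tuned right/wrong.
table = np.array([
    [np.sum(baseline_correct & finetuned_correct),
     np.sum(baseline_correct & ~finetuned_correct)],
    [np.sum(~baseline_correct & finetuned_correct),
     np.sum(~baseline_correct & ~finetuned_correct)],
])

# McNemar's exact test only cares about the discordant pairs (off-diagonal cells).
result = mcnemar(table, exact=True)
print(f"McNemar statistic={result.statistic}, p={result.pvalue:.4g}")
```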
---
## Day 7: The Hardware Surprise
### When Everything Looked Broken
Ran Experiment 8 evaluation. Got nonsensical results:
- All predictions were "maybe"
- Empty responses
- Random gibberish
**My first instinct:** "The training failed."
### The Debugging Dance
**What I did:**
1. Checked training logs (looked fine)
2. Validated checkpoints (valid)
3. Reviewed dataset format (correct)
**What I asked Claude:**
> "Training looks good but evaluation is broken. Test inference on CPU instead of GPU."
**Why I made this call:** I'd seen GPU inference issues on this hardware before (ARM64 + Blackwell).
### The Discovery
CPU inference: **92.4% accuracy**. Perfect.
The model was fine. GPU inference was broken on this specific hardware.
**Time saved:** 15-20 hours of future debugging by documenting this hardware limitation.
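The check itself takes only a few lines: load the fine-tuned checkpoint on CPU and generate. A sketch (the checkpoint path and prompt are hypothetical):

```python
# Sanity-check inference on CPU to rule out a GPU/driver problem.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

CHECKPOINT = "exp8-sft/checkpoint-final"   # hypothetical LoRA checkpoint path

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")  # base tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(CHECKPOINT).to("cpu").eval()

prompt = ("Context: ...\n"
          "Question: Does the intervention improve outcomes?\n"
          "Answer (yes/no/maybe):")
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=5)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```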
### What This Taught Me
**I could delegate:** Code writing, testing different approaches, running evaluations.
**I couldn't delegate:** Knowing when to question the premise. "Maybe the model isn't broken, maybe the evaluation platform is."
This was human intuition from past experience.
---
## The Decision Framework That Emerged
Over 9 days, a clear pattern emerged about what to delegate and what to keep under my own control.
### What I Delegated Successfully
**✅ Implementation Details**
- Writing training loops
- Data preprocessing pipelines
- Evaluation scripts
- Logging and monitoring
**✅ Systematic Testing**
- Running experiments
- Testing hyperparameters
- Error analysis
- Statistical validation
**✅ Documentation**
- Session logs
- README updates
- Deployment guides
- Code comments
**Time saved:** 40+ hours of coding and documentation.
### What I Had to Decide
**🎯 Strategic Direction**
- Which hyperparameters to optimize first
- When to change architecture
- Required sample size for validation
- Temperature settings for medical Q&A
**🎯 Quality Bar**
- 82% is good, but 90%+ is the target
- Need statistical significance, not just improvement
- Production deployment requirements
**🎯 Problem Diagnosis**
- "Model is broken" vs "Evaluation platform is broken"
- When to pivot vs when to persist
- Hardware limitations vs software bugs
**Time invested:** Maybe 10 hours total across 9 days.
---
## The ROI Calculation
### Time Breakdown
**Total experiment:** 60+ hours
- Training time: 40 hours (mostly unattended)
- Evaluation: 15 hours
- Analysis and documentation: 5 hours
**My active time:** ~10-12 hours
- Strategic decisions: 3 hours
- Reviewing results: 4 hours
- Problem diagnosis: 3 hours
- Final validation: 2 hours
**Claude's contribution:** ~48 hours of work (coding, testing, documentation)
### What I Gained
**Direct time savings:** 48 hours of coding and testing
**Quality improvements:**
- Statistical validation I might have skipped (N=500 vs N=100)
- Comprehensive error analysis
- Complete documentation
- Production-ready deployment
**Knowledge captured:**
- Hardware limitation documented
- Progressive optimization strategy validated
- Temperature tuning insights
**Value beyond this project:** The hardware discovery alone saved 15-20 hours on future experiments.
---
## The Collaboration Patterns
### Pattern 1: "Try This Approach"
**When to use:** You know the direction but not the implementation.
**Example:**
> "Switch to SFTTrainer from TRL library. Keep the hyperparameters from Experiment 7."
**Claude handles:** Implementation, testing, debugging.
**You handle:** Knowing SFTTrainer exists and when to use it.
### Pattern 2: "Diagnose This Problem"
**When to use:** Something is broken but you're not sure what.
**Example:**
> "Evaluation shows all 'maybe' predictions. Check: training loss, checkpoint validity, dataset format, and try CPU inference."
**Claude handles:** Systematic testing of each hypothesis.
**You handle:** Defining the diagnostic checklist.
### Pattern 3: "Validate This Thoroughly"
**When to use:** Results look good but need statistical proof.
**Example:**
> "Quick eval shows 89% vs 84%. Run full evaluation on 500 examples with McNemar's test for significance."
**Claude handles:** Implementing proper statistical tests.
**You handle:** Knowing what "thorough" means in your domain.
### Pattern 4: "Explore and Report"
**When to use:** You want to understand tradeoffs.
**Example:**
> "Test temperatures 0.3, 0.7, and 1.0. Compare accuracy, error types, and confidence."
**Claude handles:** Running experiments and analyzing results.
**You handle:** Deciding which temperature to use in production.
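In practice that sweep was a short loop over generation settings. A sketch, reusing the `model`, `tokenizer`, and `inputs` from the CPU-inference snippet earlier (the real comparison also scored accuracy and error types at each temperature):

```python
# Compare answer behavior across sampling temperatures on the same prompt.
for temperature in (0.3, 0.7, 1.0):
    output = model.generate(
        **inputs,
        do_sample=True,          # sampling must be on for temperature to have an effect
        temperature=temperature,
        max_new_tokens=5,
    )
    answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    print(f"T={temperature}: {answer.strip()}")
```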
---
## The Moments I Had to Intervene
### Intervention 1: Sample Size (Day 6)
**Quick eval showed:** 89% vs 84% (p=0.0736, not significant)
**Claude's conclusion:** "Improvement not statistically significant."
**My call:** "Run full eval with N=500. Quick eval has insufficient power to detect 5-10% differences."
**Why I intervened:** Domain knowledge about medical ML validation requirements.
**Result:** Full eval showed p < 0.001 (highly significant).
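A back-of-the-envelope power calculation makes the point. This sketch uses a rough independent-proportions approximation, which understates the power of a paired test like McNemar's, so treat the numbers as a rough lower bound:

```python
# Approximate power to detect an 84% vs 89% accuracy difference at different N.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.89, 0.84)   # Cohen's h for the two accuracies
analysis = NormalIndPower()
for n in (100, 500):
    power = analysis.solve_power(effect_size=effect, nobs1=n, alpha=0.05)
    print(f"N={n}: power ~ {power:.2f}")
```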
### Intervention 2: Architecture Choice (Day 5)
**Situation:** 82% accuracy with optimized hyperparameters.
**Claude's suggestion:** Continue tuning hyperparameters.
**My call:** "Switch to SFTTrainer instead. Hyperparameters are optimized enough."
**Why I intervened:** Knew from experience that architectural changes would yield more improvement than hyperparameter tweaking.
**Result:** Jumped from 82% to 92.4%.
### Intervention 3: GPU vs CPU (Day 7)
**Situation:** Evaluation producing nonsense on GPU.
**Claude's approach:** Debugging training code.
**My call:** "Try CPU inference first. This hardware has known GPU issues."
**Why I intervened:** Past experience with this specific hardware.
**Result:** CPU inference worked perfectly, revealed GPU limitation.
---
## What Surprised Me
### Surprise 1: Documentation Quality
Claude's documentation was often **better than mine**:
- More complete
- Better structured
- Included examples I would have skipped
**Why:** Thorough documentation takes time I'm usually too impatient to spend.
### Surprise 2: Error Analysis Depth
Generated comprehensive error breakdowns:
- Hedging errors: 36 → 0
- Empty responses: 22 → 20
- Misclassifications: 19 → 13
**I would have looked at:** Overall accuracy only.
**Claude showed:** Where the improvement came from.
### Surprise 3: Statistical Rigor
Implemented proper statistical tests without prompting:
- McNemar's test for paired samples
- Cohen's h for effect size
- 95% confidence intervals
**Why this matters:** Medical AI needs validation rigor. Claude defaulted to best practices.
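Both effect-size numbers are short calculations. A sketch (the 92.4%-on-500 count and the 70% baseline come from this article; the Wilson interval method is my choice):

```python
# Effect size and confidence interval for the final result.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# e.g., the final 92.4% against the original 70% baseline.
print(f"Cohen's h: {cohens_h(0.924, 0.70):.3f}")

# 95% Wilson interval for 462/500 correct (92.4% on N=500).
low, high = proportion_confint(count=462, nobs=500, alpha=0.05, method="wilson")
print(f"95% CI for fine-tuned accuracy: [{low:.3f}, {high:.3f}]")
```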
---
## The Anti-Patterns I Avoided
### Anti-Pattern 1: "Just Fix Everything"
**Temptation:** Change hyperparameters, architecture, and dataset simultaneously.
**Why bad:** Can't tell what helped.
**What I did:** Progressive refinement (baseline → hyperparameters → architecture).
**Result:** Clear understanding of each contribution.
### Anti-Pattern 2: "Claude Will Figure It Out"
**Temptation:** Give vague instructions and expect magic.
**Why bad:** Claude doesn't have your domain knowledge or strategic vision.
**What I did:** Clear, specific instructions with constraints.
**Result:** Efficient collaboration with minimal back-and-forth.
### Anti-Pattern 3: "Quick Eval Is Good Enough"
**Temptation:** N=100 showed improvement, ship it.
**Why bad:** Not statistically significant (p=0.0736).
**What I did:** Required N=500 for proper validation.
**Result:** Achieved p < 0.001, publishable result.
---
## Lessons for Your Next ML Project
### 1. Define Success Upfront
Before delegating anything:
- What accuracy do you need?
- What sample size for validation?
- What deployment constraints?
**I defined:** 90%+ accuracy, N=500 validation, CPU-compatible inference.
### 2. Start with a Baseline
Five experiments at 70% weren't failures. They were establishing a reproducible starting point.
**Without this:** No way to measure improvement.
### 3. Progressive Refinement Beats Big Bang
Optimize one dimension at a time:
1. Baseline
2. Hyperparameters
3. Architecture
**Why:** Know what contributes to improvement.
### 4. Bring Domain Knowledge to Decisions
**Claude knew:** How to implement SFTTrainer.
**I knew:** When to switch to SFTTrainer.
**The combination:** 92.4% accuracy.
### 5. Question Your Assumptions
When evaluation looked broken, I could have spent 20 hours debugging training code.
**Instead:** "Maybe GPU inference is broken, not the model."
**Result:** 5-minute fix (switch to CPU).
### 6. Document the Journey
Claude generated session logs, README updates, and deployment guides.
**Time cost:** Nearly zero (automated).
**Value:** Can return to project months later with full context.
---
## The Bottom Line
**60+ hours of work over 9 days.**
**My active time:** 10-12 hours of strategic decisions and validation.
**Result:** Production-ready medical AI with 92.4% accuracy and complete documentation.
**Time saved:** 48 hours of coding, testing, and documentation.
**Knowledge gained:** Hardware discovery, optimization strategies, validation rigor.
---
## When This Pattern Works
This delegation pattern works best when:
**✅ You have clear goals**
- Know what success looks like
- Can define quality bar
- Understand validation requirements
**✅ You can make strategic decisions**
- What to optimize first
- When to pivot
- When "good enough" isn't enough
**✅ You have domain knowledge**
- Know the tools in your space
- Recognize patterns from experience
- Can diagnose root causes
**✅ You're willing to iterate**
- Not looking for one-shot solutions
- Open to progressive refinement
- Can course-correct based on results
---
## When You Need to Step In
Watch for these signals:
**🚨 Strategic Decisions**
- Which approach to try next
- When to change direction
- Setting quality standards
**🚨 Domain Knowledge**
- Tool selection (SFTTrainer vs Trainer)
- Validation requirements (N=500 not N=100)
- Hardware limitations
**🚨 Root Cause Analysis**
- "Model broken" vs "Platform broken"
- When to question premises
- Pattern recognition from experience
---
## What I'd Do Differently
### Start with Statistical Requirements
**I said:** "Get to 90%+ accuracy."
**I should have said:** "Get to 90%+ accuracy with N=500, p < 0.05, validated on CPU and GPU."
**Why:** Would have avoided the quick eval detour.
### Document Hardware Limitations First
**I discovered:** GPU inference broken on day 7.
**I should have:** Tested both GPU and CPU inference on day 1.
**Why:** Would have saved 15-20 hours of confusion.
### Define "Done" Explicitly
**I said:** "Optimize until it's good."
**I should have said:** "Stop at 90%+ accuracy OR after 3 optimization cycles."
**Why:** Clearer stopping point prevents over-optimization.
---
## Your Turn
Next time you're facing a complex ML project:
**Ask yourself:**
1. What's my baseline?
2. What am I optimizing for?
3. What can I delegate and what must I decide?
4. When will I know I'm done?
**Try delegating:**
- Implementation details
- Systematic testing
- Documentation
- Error analysis
**Keep control of:**
- Strategic direction
- Quality standards
- Domain expertise
- When to pivot
The goal isn't to automate everything. It's to amplify your expertise by delegating the 80% of work that's systematic, while focusing your time on the 20% that requires human judgment.
---
## Related Articles
- [[dgx-lab-building-complete-rag-infrastructure-day-3|Building a Complete RAG Infrastructure - Day 3]]
- [[dgx-lab-benchmarks-vs-reality-day-4|DGX Lab: When Benchmark Numbers Meet Production Reality - Day 4]]
- [[dgx-lab-intelligent-gateway-heuristics-vs-ml-day-1|DGX Lab: When Simple Heuristics Beat ML by 95,000x - Day 1]]
- [[catastrophic-forgetting-llm-fine-tuning|The Hidden Crisis in LLM Fine-Tuning]]
- [[building-production-ml-workspace-part-3-experiments|Experiment Tracking and Reproducibility]]
- [[dgx-lab-supercharged-bashrc-ml-workflows-day-2|Supercharge Your Shell with ML Productivity Aliases]]
---
**Want to learn more about AI-assisted ML workflows?** This approach to progressive refinement and strategic delegation applies across domains. The key is knowing when to trust your AI partner and when to step in with human expertise.
**Questions about delegating complex experiments?** Reach out on [Twitter/X](https://twitter.com/bioinfo) or check out more DGX Lab Chronicles articles.
---
<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>
<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>