# How I Delegated a 9-Day Medical AI Experiment (and Learned When to Step In)
**The Setup:** Fine-tune a medical AI model to answer questions from PubMedQA with high accuracy.
**The Reality:** 9 days, 7 experiments, 60+ hours of work, and constant decisions about when to delegate and when to take control.
**The Result:** 92.4% accuracy and clarity on when AI collaboration works best.
This isn't a story about Claude writing all my code. It's about the dance between delegation and decision-making that turned a 70% baseline into production-ready medical AI.
---
## Day 1-2: Setting Up the Baseline
### My Role: Define the Goal
I knew what I wanted:
- Fine-tune Google's Gemma-3-4b model
- Target: Medical question answering (PubMedQA dataset)
- Method: LoRA (efficient fine-tuning)
- Success metric: Statistically significant improvement
**What I told Claude:**
> "Set up LoRA fine-tuning for Gemma-3-4b on PubMedQA. Start with standard hyperparameters. I want a reproducible baseline before we optimize anything."
### Claude's Role: Implement and Iterate
Claude generated the training pipeline:
- Data loading and preprocessing
- LoRA configuration
- Training loop with logging
- Evaluation scripts
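In rough form, that baseline pipeline looked like the sketch below. This is my simplified reconstruction, not Claude's exact code; the model ID, dataset fields, prompt format, and LoRA target modules are assumptions for illustration.

```python
# Baseline LoRA fine-tuning sketch for PubMedQA (illustrative reconstruction).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "google/gemma-3-4b-it"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Baseline LoRA config; Experiment 7 later raised r to 64 and alpha to 128.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Turn each PubMedQA example into a context/question -> yes/no/maybe prompt.
def to_features(example):
    context = " ".join(example["context"]["contexts"])
    prompt = (f"Context: {context}\nQuestion: {example['question']}\n"
              f"Answer (yes/no/maybe): {example['final_decision']}")
    return tokenizer(prompt, truncation=True, max_length=1024)

raw = load_dataset("pubmed_qa", "pqa_labeled", split="train")
train_dataset = raw.map(to_features, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="exp1-baseline",
                           learning_rate=2e-4,   # baseline LR, lowered later
                           num_train_epochs=3,
                           per_device_train_batch_size=4,
                           logging_steps=10),
    train_dataset=train_dataset,
    # Copies input_ids into labels so the causal LM loss is computed.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```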
**Five experiments later:** Consistent 70% accuracy.
### The Decision Point
Most people would have stopped here with "it's not working." But consistent results across five runs meant **this was a real baseline**, not random noise.
**I decided:** Keep the 70% as our North Star. Now we optimize.
---
## Day 3-4: The Hyperparameter Phase
### Delegation Strategy: Test One Dimension
**My instruction:**
> "Keep the training code. Change only the hyperparameters. Try higher LoRA rank, lower learning rate, and add GPU cache clearing every 50 steps."
**Why I specified these:**
- Higher rank: More capacity for medical knowledge
- Lower learning rate: Avoid catastrophic forgetting
- Cache clearing: GPU stability on this hardware
### Claude's Implementation
Generated Experiment 7 with:
```python
# Increased from r=16 to r=64
lora_config = LoraConfig(r=64, lora_alpha=128, ...)

# Reduced from 2e-4 to 5e-5
learning_rate=5e-5

# Added stability: clear the GPU cache every 50 steps
if step % 50 == 0:
    torch.cuda.empty_cache()
```
**Result:** 82% accuracy (+12 percentage points over the 70% baseline)
### What I Learned
**Delegation worked because:**
- I provided clear constraints ("change only hyperparameters")
- Claude handled the implementation details
- I made the strategic choice of what to optimize
**I didn't need to:** Write the code, debug syntax errors, or structure the experiment.
**I did need to:** Decide which hyperparameters mattered and when to stop.
---
## Day 5-6: The Architecture Change
### The Strategic Decision
With 82% in hand, the question: "Is this good enough?"
For medical AI, I wanted 90%+. That meant changing architecture, not just hyperparameters.
**My instruction:**
> "Switch from standard Trainer to SFTTrainer from HuggingFace TRL. Keep the optimized hyperparameters from Experiment 7. SFTTrainer handles text-to-text better for Q&A tasks."
### Why I Chose This Path
**I knew from experience:**
- SFTTrainer optimizes for instruction-following
- Dynamic padding improves efficiency
- Better label handling for Q&A format
**Claude didn't suggest this.** This was domain knowledge I brought to the table.
### The Implementation
Claude converted the training pipeline:
- Reformatted dataset for text input
- Switched Trainer class to SFTTrainer
- Kept all hyperparameters from Experiment 7
- Added dynamic padding config
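The switch itself is small in code terms. Here's a rough sketch of the conversion against a recent TRL release; argument names have shifted between TRL versions, and the dataset formatting reuses the assumptions (the `raw` dataset and Experiment 7 `lora_config`) from the baseline sketch above.

```python
# Switch from the standard Trainer to TRL's SFTTrainer (illustrative sketch).
from trl import SFTConfig, SFTTrainer

# Reformat each PubMedQA example into a single "text" field for SFT.
def to_text(example):
    context = " ".join(example["context"]["contexts"])
    return {"text": (f"Context: {context}\n"
                     f"Question: {example['question']}\n"
                     f"Answer (yes/no/maybe): {example['final_decision']}")}

sft_dataset = raw.map(to_text, remove_columns=raw.column_names)

trainer = SFTTrainer(
    model="google/gemma-3-4b-it",     # base model; LoRA is applied via peft_config
    train_dataset=sft_dataset,
    peft_config=lora_config,          # the r=64 / alpha=128 config from Experiment 7
    args=SFTConfig(
        output_dir="exp8-sft",
        learning_rate=5e-5,           # carried over from Experiment 7
        num_train_epochs=3,
        per_device_train_batch_size=4,
    ),
    # Depending on the TRL version, you may also need to pass the tokenizer
    # explicitly (tokenizer=... in older releases, processing_class=... in newer ones).
)
trainer.train()
```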
**Training time:** 10 hours overnight.
**Result next morning:** Quick eval showed 89% accuracy.
### The Pivot Point
**Me:** "That's promising but not conclusive. We need full evaluation on 500 examples, not 100."
**Claude:** Implemented comprehensive evaluation script with:
- 500 examples (proper sample size)
- McNemar's test for statistical significance
- Detailed error analysis
- Multiple temperature testing
**Final result:** 92.4% accuracy, p < 0.001 (highly significant).
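The heart of that validation is a paired test on per-example correctness. A minimal version looks like this (my sketch, not Claude's exact script; the correctness arrays below are random stand-ins you'd replace with real evaluation output):

```python
# Paired comparison of two models evaluated on the same 500 examples.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
# Stand-in data: replace with per-example correctness flags from your eval runs.
baseline_correct = rng.random(500) < 0.82
finetuned_correct = rng.random(500) < 0.924

# 2x2 table of paired outcomes: rows = baseline right/wrong, cols = fine-tuned right/wrong.
table = np.array([
    [np.sum(baseline_correct & finetuned_correct),
     np.sum(baseline_correct & ~finetuned_correct)],
    [np.sum(~baseline_correct & finetuned_correct),
     np.sum(~baseline_correct & ~finetuned_correct)],
])

# McNemar's exact test only cares about the discordant pairs (off-diagonal cells).
result = mcnemar(table, exact=True)
print(f"McNemar statistic={result.statistic}, p={result.pvalue:.4g}")
```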
---
## Day 7: The Hardware Surprise
### When Everything Looked Broken
Ran Experiment 8 evaluation. Got nonsensical results:
- All predictions were "maybe"
- Empty responses
- Random gibberish
**My first instinct:** "The training failed."
### The Debugging Dance
**What I did:**
1. Checked training logs (looked fine)
2. Validated checkpoints (valid)
3. Reviewed dataset format (correct)
**What I asked Claude:**
> "Training looks good but evaluation is broken. Test inference on CPU instead of GPU."
**Why I made this call:** I'd seen GPU inference issues on this hardware before (ARM64 + Blackwell).
### The Discovery
CPU inference: **92.4% accuracy**. Perfect.
The model was fine. GPU inference was broken on this specific hardware.
**Time saved:** 15-20 hours of future debugging by documenting this hardware limitation.
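The check itself takes only a few lines: load the fine-tuned checkpoint on CPU and generate. A sketch (the checkpoint path and prompt are hypothetical):

```python
# Sanity-check inference on CPU to rule out a GPU/driver problem.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

CHECKPOINT = "exp8-sft/checkpoint-final"   # hypothetical LoRA checkpoint path

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")  # base tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(CHECKPOINT).to("cpu").eval()

prompt = ("Context: ...\n"
          "Question: Does the intervention improve outcomes?\n"
          "Answer (yes/no/maybe):")
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=5)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```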
### What This Taught Me
**I could delegate:** Code writing, testing different approaches, running evaluations.
**I couldn't delegate:** Knowing when to question the premise. "Maybe the model isn't broken, maybe the evaluation platform is."
This was human intuition from past experience.
---
## The Decision Framework That Emerged
Over 9 days, a clear pattern emerged about what to delegate and what to keep under my own control.
### What I Delegated Successfully
**✅ Implementation Details**
- Writing training loops
- Data preprocessing pipelines
- Evaluation scripts
- Logging and monitoring
**✅ Systematic Testing**
- Running experiments
- Testing hyperparameters
- Error analysis
- Statistical validation
**✅ Documentation**
- Session logs
- README updates
- Deployment guides
- Code comments
**Time saved:** 40+ hours of coding and documentation.
### What I Had to Decide
**🎯 Strategic Direction**
- Which hyperparameters to optimize first
- When to change architecture
- Required sample size for validation
- Temperature settings for medical Q&A
**🎯 Quality Bar**
- 82% is good, but 90%+ is the target
- Need statistical significance, not just improvement
- Production deployment requirements
**🎯 Problem Diagnosis**
- "Model is broken" vs "Evaluation platform is broken"
- When to pivot vs when to persist
- Hardware limitations vs software bugs
**Time invested:** Maybe 10 hours total across 9 days.
---
## The ROI Calculation
### Time Breakdown
**Total experiment:** 60+ hours
- Training time: 40 hours (mostly unattended)
- Evaluation: 15 hours
- Analysis and documentation: 5 hours
**My active time:** ~10-12 hours
- Strategic decisions: 3 hours
- Reviewing results: 4 hours
- Problem diagnosis: 3 hours
- Final validation: 2 hours
**Claude's contribution:** ~48 hours of work (coding, testing, documentation)
### What I Gained
**Direct time savings:** 48 hours of coding and testing
**Quality improvements:**
- Statistical validation I might have skipped (N=500 vs N=100)
- Comprehensive error analysis
- Complete documentation
- Production-ready deployment
**Knowledge captured:**
- Hardware limitation documented
- Progressive optimization strategy validated
- Temperature tuning insights
**Value beyond this project:** The hardware discovery alone saved 15-20 hours on future experiments.
---
## The Collaboration Patterns
### Pattern 1: "Try This Approach"
**When to use:** You know the direction but not the implementation.
**Example:**
> "Switch to SFTTrainer from TRL library. Keep the hyperparameters from Experiment 7."
**Claude handles:** Implementation, testing, debugging.
**You handle:** Knowing SFTTrainer exists and when to use it.
### Pattern 2: "Diagnose This Problem"
**When to use:** Something is broken but you're not sure what.
**Example:**
> "Evaluation shows all 'maybe' predictions. Check: training loss, checkpoint validity, dataset format, and try CPU inference."
**Claude handles:** Systematic testing of each hypothesis.
**You handle:** Defining the diagnostic checklist.
### Pattern 3: "Validate This Thoroughly"
**When to use:** Results look good but need statistical proof.
**Example:**
> "Quick eval shows 89% vs 84%. Run full evaluation on 500 examples with McNemar's test for significance."
**Claude handles:** Implementing proper statistical tests.
**You handle:** Knowing what "thorough" means in your domain.
### Pattern 4: "Explore and Report"
**When to use:** You want to understand tradeoffs.
**Example:**
> "Test temperatures 0.3, 0.7, and 1.0. Compare accuracy, error types, and confidence."
**Claude handles:** Running experiments and analyzing results.
**You handle:** Deciding which temperature to use in production.
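In practice that sweep was a short loop over generation settings. A sketch, reusing the `model`, `tokenizer`, and `inputs` from the CPU-inference snippet earlier (the real comparison also scored accuracy and error types at each temperature):

```python
# Compare answer behavior across sampling temperatures on the same prompt.
for temperature in (0.3, 0.7, 1.0):
    output = model.generate(
        **inputs,
        do_sample=True,          # sampling must be on for temperature to have an effect
        temperature=temperature,
        max_new_tokens=5,
    )
    answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    print(f"T={temperature}: {answer.strip()}")
```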
---
## The Moments I Had to Intervene
### Intervention 1: Sample Size (Day 6)
**Quick eval showed:** 89% vs 84% (p=0.0736, not significant)
**Claude's conclusion:** "Improvement not statistically significant."
**My call:** "Run full eval with N=500. Quick eval has insufficient power to detect 5-10% differences."
**Why I intervened:** Domain knowledge about medical ML validation requirements.
**Result:** Full eval showed p < 0.001 (highly significant).
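A back-of-the-envelope power calculation makes the point. This sketch uses a rough independent-proportions approximation, which understates the power of a paired test like McNemar's, so treat the numbers as a rough lower bound:

```python
# Approximate power to detect an 84% vs 89% accuracy difference at different N.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.89, 0.84)   # Cohen's h for the two accuracies
analysis = NormalIndPower()
for n in (100, 500):
    power = analysis.solve_power(effect_size=effect, nobs1=n, alpha=0.05)
    print(f"N={n}: power ~ {power:.2f}")
```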
### Intervention 2: Architecture Choice (Day 5)
**Situation:** 82% accuracy with optimized hyperparameters.
**Claude's suggestion:** Continue tuning hyperparameters.
**My call:** "Switch to SFTTrainer instead. Hyperparameters are optimized enough."
**Why I intervened:** Knew from experience that architectural changes would yield more improvement than hyperparameter tweaking.
**Result:** Jumped from 82% to 92.4%.
### Intervention 3: GPU vs CPU (Day 7)
**Situation:** Evaluation producing nonsense on GPU.
**Claude's approach:** Debugging training code.
**My call:** "Try CPU inference first. This hardware has known GPU issues."
**Why I intervened:** Past experience with this specific hardware.
**Result:** CPU inference worked perfectly, revealed GPU limitation.
---
## What Surprised Me
### Surprise 1: Documentation Quality
Claude's documentation was often **better than mine**:
- More complete
- Better structured
- Included examples I would have skipped
**Why:** Thorough documentation takes time I'm usually too impatient to spend.
### Surprise 2: Error Analysis Depth
Generated comprehensive error breakdowns:
- Hedging errors: 36 → 0
- Empty responses: 22 → 20
- Misclassifications: 19 → 13
**I would have looked at:** Overall accuracy only.
**Claude showed:** Where the improvement came from.
### Surprise 3: Statistical Rigor
Implemented proper statistical tests without prompting:
- McNemar's test for paired samples
- Cohen's h for effect size
- 95% confidence intervals
**Why this matters:** Medical AI needs validation rigor. Claude defaulted to best practices.
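Both effect-size numbers are short calculations. A sketch (the 92.4%-on-500 count and the 70% baseline come from this article; the Wilson interval method is my choice):

```python
# Effect size and confidence interval for the final result.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# e.g., the final 92.4% against the original 70% baseline.
print(f"Cohen's h: {cohens_h(0.924, 0.70):.3f}")

# 95% Wilson interval for 462/500 correct (92.4% on N=500).
low, high = proportion_confint(count=462, nobs=500, alpha=0.05, method="wilson")
print(f"95% CI for fine-tuned accuracy: [{low:.3f}, {high:.3f}]")
```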
---
## The Anti-Patterns I Avoided
### Anti-Pattern 1: "Just Fix Everything"
**Temptation:** Change hyperparameters, architecture, and dataset simultaneously.
**Why bad:** Can't tell what helped.
**What I did:** Progressive refinement (baseline → hyperparameters → architecture).
**Result:** Clear understanding of each contribution.
### Anti-Pattern 2: "Claude Will Figure It Out"
**Temptation:** Give vague instructions and expect magic.
**Why bad:** Claude doesn't have your domain knowledge or strategic vision.
**What I did:** Clear, specific instructions with constraints.
**Result:** Efficient collaboration with minimal back-and-forth.
### Anti-Pattern 3: "Quick Eval Is Good Enough"
**Temptation:** N=100 showed improvement, ship it.
**Why bad:** Not statistically significant (p=0.0736).
**What I did:** Required N=500 for proper validation.
**Result:** Achieved p < 0.001, publishable result.
---
## Lessons for Your Next ML Project
### 1. Define Success Upfront
Before delegating anything:
- What accuracy do you need?
- What sample size for validation?
- What deployment constraints?
**I defined:** 90%+ accuracy, N=500 validation, CPU-compatible inference.
### 2. Start with a Baseline
Five experiments at 70% weren't failures. They were establishing a reproducible starting point.
**Without this:** No way to measure improvement.
### 3. Progressive Refinement Beats Big Bang
Optimize one dimension at a time:
1. Baseline
2. Hyperparameters
3. Architecture
**Why:** Know what contributes to improvement.
### 4. Bring Domain Knowledge to Decisions
**Claude knew:** How to implement SFTTrainer.
**I knew:** When to switch to SFTTrainer.
**The combination:** 92.4% accuracy.
### 5. Question Your Assumptions
When evaluation looked broken, I could have spent 20 hours debugging training code.
**Instead:** "Maybe GPU inference is broken, not the model."
**Result:** 5-minute fix (switch to CPU).
### 6. Document the Journey
Claude generated session logs, README updates, and deployment guides.
**Time cost:** Nearly zero (automated).
**Value:** Can return to project months later with full context.
---
## The Bottom Line
**60+ hours of work over 9 days.**
**My active time:** 10-12 hours of strategic decisions and validation.
**Result:** Production-ready medical AI with 92.4% accuracy and complete documentation.
**Time saved:** 48 hours of coding, testing, and documentation.
**Knowledge gained:** Hardware discovery, optimization strategies, validation rigor.
---
## When This Pattern Works
This delegation pattern works best when:
**✅ You have clear goals**
- Know what success looks like
- Can define quality bar
- Understand validation requirements
**✅ You can make strategic decisions**
- What to optimize first
- When to pivot
- When "good enough" isn't enough
**✅ You have domain knowledge**
- Know the tools in your space
- Recognize patterns from experience
- Can diagnose root causes
**✅ You're willing to iterate**
- Not looking for one-shot solutions
- Open to progressive refinement
- Can course-correct based on results
---
## When You Need to Step In
Watch for these signals:
**🚨 Strategic Decisions**
- Which approach to try next
- When to change direction
- Setting quality standards
**🚨 Domain Knowledge**
- Tool selection (SFTTrainer vs Trainer)
- Validation requirements (N=500 not N=100)
- Hardware limitations
**🚨 Root Cause Analysis**
- "Model broken" vs "Platform broken"
- When to question premises
- Pattern recognition from experience
---
## What I'd Do Differently
### Start with Statistical Requirements
**I said:** "Get to 90%+ accuracy."
**I should have said:** "Get to 90%+ accuracy with N=500, p < 0.05, validated on CPU and GPU."
**Why:** Would have avoided the quick eval detour.
### Document Hardware Limitations First
**I discovered:** GPU inference broken on day 7.
**I should have:** Tested both GPU and CPU inference on day 1.
**Why:** Would have saved 15-20 hours of confusion.
### Define "Done" Explicitly
**I said:** "Optimize until it's good."
**I should have said:** "Stop at 90%+ accuracy OR after 3 optimization cycles."
**Why:** Clearer stopping point prevents over-optimization.
---
## Your Turn
Next time you're facing a complex ML project:
**Ask yourself:**
1. What's my baseline?
2. What am I optimizing for?
3. What can I delegate and what must I decide?
4. When will I know I'm done?
**Try delegating:**
- Implementation details
- Systematic testing
- Documentation
- Error analysis
**Keep control of:**
- Strategic direction
- Quality standards
- Domain expertise
- When to pivot
The goal isn't to automate everything. It's to amplify your expertise by delegating the 80% of work that's systematic, while focusing your time on the 20% that requires human judgment.
---
## Related Articles
- [[dgx-lab-building-complete-rag-infrastructure-day-3|Building a Complete RAG Infrastructure - Day 3]]
- [[dgx-lab-benchmarks-vs-reality-day-4|DGX Lab: When Benchmark Numbers Meet Production Reality - Day 4]]
- [[dgx-lab-intelligent-gateway-heuristics-vs-ml-day-1|DGX Lab: When Simple Heuristics Beat ML by 95,000x - Day 1]]
- [[catastrophic-forgetting-llm-fine-tuning|The Hidden Crisis in LLM Fine-Tuning]]
- [[building-production-ml-workspace-part-3-experiments|Experiment Tracking and Reproducibility]]
- [[dgx-lab-supercharged-bashrc-ml-workflows-day-2|Supercharge Your Shell with ML Productivity Aliases]]
---
**Want to learn more about AI-assisted ML workflows?** This approach to progressive refinement and strategic delegation applies across domains. The key is knowing when to trust your AI partner and when to step in with human expertise.
**Questions about delegating complex experiments?** Reach out on [Twitter/X](https://twitter.com/bioinfo) or check out more DGX Lab Chronicles articles.
---
<p style="text-align: center;"><strong>About the Author</strong>: Justin Johnson builds AI systems and writes about practical AI development.</p>
<p style="text-align: center;"><a href="https://justinhjohnson.com">justinhjohnson.com</a> | <a href="https://twitter.com/bioinfo">Twitter</a> | <a href="https://www.linkedin.com/in/justinhaywardjohnson/">LinkedIn</a> | <a href="https://rundatarun.io">Run Data Run</a> | <a href="https://subscribe.rundatarun.io">Subscribe</a></p>