Practical Applications14 min readshipped

How I Delegated a 9-Day Medical AI Experiment (and Learned When to Step In)

How I Delegated a 9-Day Medical AI Experiment (and Learned When to Step In)

The Setup: Fine-tune a medical AI model to answer questions from PubMedQA with high accuracy.

The Reality: 9 days, 7 experiments, 60+ hours of work, and constant decisions about when to delegate and when to take control.

The Result: 92.4% accuracy and clarity on when AI collaboration works best.

This isn't a story about Claude writing all my code. It's about the dance between delegation and decision-making that turned a 70% baseline into production-ready medical AI.


Day 1-2: Setting Up the Baseline

My Role: Define the Goal

I knew what I wanted:

  • Fine-tune Google's Gemma-3-4b model
  • Target: Medical question answering (PubMedQA dataset)
  • Method: LoRA (efficient fine-tuning)
  • Success metric: Statistically significant improvement

What I told Claude:

"Set up LoRA fine-tuning for Gemma-3-4b on PubMedQA. Start with standard hyperparameters. I want a reproducible baseline before we optimize anything."

Claude's Role: Implement and Iterate

Claude generated the training pipeline:

  • Data loading and preprocessing
  • LoRA configuration
  • Training loop with logging
  • Evaluation scripts

Five experiments later: Consistent 70% accuracy.

The Decision Point

Most people would have stopped here with "it's not working." But consistent results across five runs meant this was a real baseline, not random noise.

I decided: Keep the 70% as our North Star. Now we optimize.


Day 3-4: The Hyperparameter Phase

Delegation Strategy: Test One Dimension

My instruction:

"Keep the training code. Change only the hyperparameters. Try higher LoRA rank, lower learning rate, and add GPU cache clearing every 50 steps."

Why I specified these:

  • Higher rank: More capacity for medical knowledge
  • Lower learning rate: Avoid catastrophic forgetting
  • Cache clearing: GPU stability on this hardware

Claude's Implementation

Generated Experiment 7 with:

# Increased from r=16 to r=64
lora_config = LoraConfig(r=64, lora_alpha=128, ...)

# Reduced from 2e-4 to 5e-5
learning_rate=5e-5

# Added stability
if step % 50 == 0:
    torch.cuda.empty_cache()

Result: 82% accuracy (+12% improvement)

What I Learned

Delegation worked because:

  • I provided clear constraints ("change only hyperparameters")
  • Claude handled the implementation details
  • I made the strategic choice of what to optimize

I didn't need to: Write the code, debug syntax errors, or structure the experiment.

I did need to: Decide which hyperparameters mattered and when to stop.


Day 5-6: The Architecture Change

The Strategic Decision

With 82% in hand, the question: "Is this good enough?"

For medical AI, I wanted 90%+. That meant changing architecture, not just hyperparameters.

My instruction:

"Switch from standard Trainer to SFTTrainer from HuggingFace TRL. Keep the optimized hyperparameters from Experiment 7. SFTTrainer handles text-to-text better for Q&A tasks."

Why I Chose This Path

I knew from experience:

  • SFTTrainer optimizes for instruction-following
  • Dynamic padding improves efficiency
  • Better label handling for Q&A format

Claude didn't suggest this. This was domain knowledge I brought to the table.

The Implementation

Claude converted the training pipeline:

  • Reformatted dataset for text input
  • Switched Trainer class to SFTTrainer
  • Kept all hyperparameters from Experiment 7
  • Added dynamic padding config

Training time: 10 hours overnight.

Result next morning: Quick eval showed 89% accuracy.

The Pivot Point

Me: "That's promising but not conclusive. We need full evaluation on 500 examples, not 100."

Claude: Implemented comprehensive evaluation script with:

  • 500 examples (proper sample size)
  • McNemar's test for statistical significance
  • Detailed error analysis
  • Multiple temperature testing

Final result: 92.4% accuracy, p < 0.001 (extremely significant).


Day 7: The Hardware Surprise

When Everything Looked Broken

Ran Experiment 8 evaluation. Got nonsensical results:

  • All predictions were "maybe"
  • Empty responses
  • Random gibberish

My first instinct: "The training failed."

The Debugging Dance

What I did:

  1. Checked training logs (looked fine)
  2. Validated checkpoints (valid)
  3. Reviewed dataset format (correct)

What I asked Claude:

"Training looks good but evaluation is broken. Test inference on CPU instead of GPU."

Why I made this call: I'd seen GPU inference issues on this hardware before (ARM64 + Blackwell).

The Discovery

CPU inference: 92.4% accuracy. Perfect.

The model was fine. GPU inference was broken on this specific hardware.

Time saved: 15-20 hours of future debugging by documenting this hardware limitation.

What This Taught Me

I could delegate: Code writing, testing different approaches, running evaluations.

I couldn't delegate: Knowing when to question the premise. "Maybe the model isn't broken, maybe the evaluation platform is."

This was human intuition from past experience.


The Decision Framework That Emerged

Over 9 days, a pattern became clear about what to delegate and what to control.

What I Delegated Successfully

✅ Implementation Details

  • Writing training loops
  • Data preprocessing pipelines
  • Evaluation scripts
  • Logging and monitoring

✅ Systematic Testing

  • Running experiments
  • Testing hyperparameters
  • Error analysis
  • Statistical validation

✅ Documentation

  • Session logs
  • README updates
  • Deployment guides
  • Code comments

Time saved: 40+ hours of coding and documentation.

What I Had to Decide

🎯 Strategic Direction

  • Which hyperparameters to optimize first
  • When to change architecture
  • Required sample size for validation
  • Temperature settings for medical Q&A

🎯 Quality Bar

  • 82% is good, but 90%+ is the target
  • Need statistical significance, not just improvement
  • Production deployment requirements

🎯 Problem Diagnosis

  • "Model is broken" vs "Evaluation platform is broken"
  • When to pivot vs when to persist
  • Hardware limitations vs software bugs

Time invested: Maybe 10 hours total across 9 days.


The ROI Calculation

Time Breakdown

Total experiment: 60+ hours

  • Training time: 40 hours (mostly unattended)
  • Evaluation: 15 hours
  • Analysis and documentation: 5 hours

My active time: ~10-12 hours

  • Strategic decisions: 3 hours
  • Reviewing results: 4 hours
  • Problem diagnosis: 3 hours
  • Final validation: 2 hours

Claude's contribution: ~48 hours of work (coding, testing, documentation)

What I Gained

Direct time savings: 48 hours of coding and testing

Quality improvements:

  • Statistical validation I might have skipped (N=500 vs N=100)
  • Comprehensive error analysis
  • Complete documentation
  • Production-ready deployment

Knowledge captured:

  • Hardware limitation documented
  • Progressive optimization strategy validated
  • Temperature tuning insights

Value beyond this project: The hardware discovery alone saved 15-20 hours on future experiments.


The Collaboration Patterns

Pattern 1: "Try This Approach"

When to use: You know the direction but not the implementation.

Example:

"Switch to SFTTrainer from TRL library. Keep the hyperparameters from Experiment 7."

Claude handles: Implementation, testing, debugging.

You handle: Knowing SFTTrainer exists and when to use it.

Pattern 2: "Diagnose This Problem"

When to use: Something is broken but you're not sure what.

Example:

"Evaluation shows all 'maybe' predictions. Check: training loss, checkpoint validity, dataset format, and try CPU inference."

Claude handles: Systematic testing of each hypothesis.

You handle: Defining the diagnostic checklist.

Pattern 3: "Validate This Thoroughly"

When to use: Results look good but need statistical proof.

Example:

"Quick eval shows 89% vs 84%. Run full evaluation on 500 examples with McNemar's test for significance."

Claude handles: Implementing proper statistical tests.

You handle: Knowing what "thorough" means in your domain.

Pattern 4: "Explore and Report"

When to use: You want to understand tradeoffs.

Example:

"Test temperatures 0.3, 0.7, and 1.0. Compare accuracy, error types, and confidence."

Claude handles: Running experiments and analyzing results.

You handle: Deciding which temperature to use in production.


The Moments I Had to Intervene

Intervention 1: Sample Size (Day 6)

Quick eval showed: 89% vs 84% (p=0.0736, not significant)

Claude's conclusion: "Improvement not statistically significant."

My call: "Run full eval with N=500. Quick eval has insufficient power to detect 5-10% differences."

Why I intervened: Domain knowledge about medical ML validation requirements.

Result: Full eval showed p < 0.001 (extremely significant).

Intervention 2: Architecture Choice (Day 5)

Situation: 82% accuracy with optimized hyperparameters.

Claude's suggestion: Continue tuning hyperparameters.

My call: "Switch to SFTTrainer instead. Hyperparameters are optimized enough."

Why I intervened: Knew from experience that architectural changes would yield more improvement than hyperparameter tweaking.

Result: Jumped from 82% to 92.4%.

Intervention 3: GPU vs CPU (Day 7)

Situation: Evaluation producing nonsense on GPU.

Claude's approach: Debugging training code.

My call: "Try CPU inference first. This hardware has known GPU issues."

Why I intervened: Past experience with this specific hardware.

Result: CPU inference worked perfectly, revealed GPU limitation.


What Surprised Me

Surprise 1: Documentation Quality

Claude's documentation was often better than mine:

  • More complete
  • Better structured
  • Included examples I would have skipped

Why: Takes time I'm usually too impatient to spend.

Surprise 2: Error Analysis Depth

Generated comprehensive error breakdowns:

  • Hedging errors: 36 → 0
  • Empty responses: 22 → 20
  • Misclassifications: 19 → 13

I would have looked at: Overall accuracy only.

Claude showed: Where the improvement came from.

Surprise 3: Statistical Rigor

Implemented proper statistical tests without prompting:

  • McNemar's test for paired samples
  • Cohen's h for effect size
  • 95% confidence intervals

Why this matters: Medical AI needs validation rigor. Claude defaulted to best practices.


The Anti-Patterns I Avoided

Anti-Pattern 1: "Just Fix Everything"

Temptation: Change hyperparameters, architecture, and dataset simultaneously.

Why bad: Can't tell what helped.

What I did: Progressive refinement (baseline → hyperparameters → architecture).

Result: Clear understanding of each contribution.

Anti-Pattern 2: "Claude Will Figure It Out"

Temptation: Give vague instructions and expect magic.

Why bad: Claude doesn't have your domain knowledge or strategic vision.

What I did: Clear, specific instructions with constraints.

Result: Efficient collaboration with minimal back-and-forth.

Anti-Pattern 3: "Quick Eval Is Good Enough"

Temptation: N=100 showed improvement, ship it.

Why bad: Not statistically significant (p=0.0736).

What I did: Required N=500 for proper validation.

Result: Achieved p < 0.001, publishable result.


Lessons for Your Next ML Project

1. Define Success Upfront

Before delegating anything:

  • What accuracy do you need?
  • What sample size for validation?
  • What deployment constraints?

I defined: 90%+ accuracy, N=500 validation, CPU-compatible inference.

2. Start with a Baseline

Five experiments at 70% weren't failures. They were establishing a reproducible starting point.

Without this: No way to measure improvement.

3. Progressive Refinement Beats Big Bang

Optimize one dimension at a time:

  1. Baseline
  2. Hyperparameters
  3. Architecture

Why: Know what contributes to improvement.

4. Bring Domain Knowledge to Decisions

Claude knew: How to implement SFTTrainer.

I knew: When to switch to SFTTrainer.

The combination: 92.4% accuracy.

5. Question Your Assumptions

When evaluation looked broken, I could have spent 20 hours debugging training code.

Instead: "Maybe GPU inference is broken, not the model."

Result: 5-minute fix (switch to CPU).

6. Document the Journey

Claude generated session logs, README updates, and deployment guides.

Time cost: Nearly zero (automated).

Value: Can return to project months later with full context.


The Bottom Line

60+ hours of work over 9 days.

My active time: 10-12 hours of strategic decisions and validation.

Result: Production-ready medical AI with 92.4% accuracy and complete documentation.

Time saved: 48 hours of coding, testing, and documentation.

Knowledge gained: Hardware discovery, optimization strategies, validation rigor.


When This Pattern Works

This delegation pattern works best when:

✅ You have clear goals

  • Know what success looks like
  • Can define quality bar
  • Understand validation requirements

✅ You can make strategic decisions

  • What to optimize first
  • When to pivot
  • When "good enough" isn't enough

✅ You have domain knowledge

  • Know the tools in your space
  • Recognize patterns from experience
  • Can diagnose root causes

✅ You're willing to iterate

  • Not looking for one-shot solutions
  • Open to progressive refinement
  • Can course-correct based on results

When You Need to Step In

Watch for these signals:

🚨 Strategic Decisions

  • Which approach to try next
  • When to change direction
  • Setting quality standards

🚨 Domain Knowledge

  • Tool selection (SFTTrainer vs Trainer)
  • Validation requirements (N=500 not N=100)
  • Hardware limitations

🚨 Root Cause Analysis

  • "Model broken" vs "Platform broken"
  • When to question premises
  • Pattern recognition from experience

What I'd Do Differently

Start with Statistical Requirements

I said: "Get to 90%+ accuracy."

I should have said: "Get to 90%+ accuracy with N=500, p < 0.05, validated on CPU and GPU."

Why: Would have avoided the quick eval detour.

Document Hardware Limitations First

I discovered: GPU inference broken on day 7.

I should have: Tested both GPU and CPU inference on day 1.

Why: Would have saved 15-20 hours of confusion.

Define "Done" Explicitly

I said: "Optimize until it's good."

I should have said: "Stop at 90%+ accuracy OR after 3 optimization cycles."

Why: Clearer stopping point prevents over-optimization.


Your Turn

Next time you're facing a complex ML project:

Ask yourself:

  1. What's my baseline?
  2. What am I optimizing for?
  3. What can I delegate and what must I decide?
  4. When will I know I'm done?

Try delegating:

  • Implementation details
  • Systematic testing
  • Documentation
  • Error analysis

Keep control of:

  • Strategic direction
  • Quality standards
  • Domain expertise
  • When to pivot

The goal isn't to automate everything. It's to amplify your expertise by delegating the 80% of work that's systematic, while focusing your time on the 20% that requires human judgment.


Related Articles

  • Building a Complete RAG Infrastructure - Day 3
  • The Hidden Crisis in LLM Fine-Tuning
  • Experiment Tracking and Reproducibility
  • Supercharge Your Shell with ML Productivity Aliases

Want to learn more about AI-assisted ML workflows? This approach to progressive refinement and strategic delegation applies across domains. The key is knowing when to trust your AI partner and when to step in with human expertise.

Questions about delegating complex experiments? Reach out on Twitter/X or check out more DGX Lab Chronicles articles.


Related Articles

  • DGX Lab: When Benchmark Numbers Meet Production Reality - Day 4
  • DGX Lab: Building a Complete RAG Infrastructure - From Ollama to Qdrant to AnythingLLM - Day 3
  • DGX Lab: When Simple Heuristics Beat ML by 95,000x - Day 1

About the Author: Justin Johnson builds AI systems and writes about practical AI development.

justinhjohnson.com | Twitter | LinkedIn | Run Data Run | Subscribe

Follow the lab

Get the next experiment

Enjoyed the breakdown on How I Delegated a 9-Day Medical AI Experiment (and Learned When to Step In)? New entries land roughly weekly. No digest, no roundup. Just the next build log, when it ships.

Links to this entry