Practical ApplicationsOctober 28, 202514 min readshipped

How I Delegated a 9-Day Medical AI Experiment (and Learned When to Step In)

The Setup: Fine-tune a medical AI model to answer questions from PubMedQA with high accuracy.

The Reality: 9 days, 7 experiments, 60+ hours of work, and constant decisions about when to delegate and when to take control.

The Result: 92.4% accuracy and clarity on when AI collaboration works best.

This isn't a story about Claude writing all my code. It's about the dance between delegation and decision-making that turned a 70% baseline into production-ready medical AI.

Day 1-2: Setting Up the Baseline

My Role: Define the Goal

I knew what I wanted:

Fine-tune Google's Gemma-3-4b model
Target: Medical question answering (PubMedQA dataset)
Method: LoRA (efficient fine-tuning)
Success metric: Statistically significant improvement

What I told Claude:

"Set up LoRA fine-tuning for Gemma-3-4b on PubMedQA. Start with standard hyperparameters. I want a reproducible baseline before we optimize anything."

Claude's Role: Implement and Iterate

Claude generated the training pipeline:

Data loading and preprocessing
LoRA configuration
Training loop with logging
Evaluation scripts

Five experiments later: Consistent 70% accuracy.

The Decision Point

Most people would have stopped here with "it's not working." But consistent results across five runs meant this was a real baseline, not random noise.

I decided: Keep the 70% as our North Star. Now we optimize.

Day 3-4: The Hyperparameter Phase

Delegation Strategy: Test One Dimension

My instruction:

"Keep the training code. Change only the hyperparameters. Try higher LoRA rank, lower learning rate, and add GPU cache clearing every 50 steps."

Why I specified these:

Higher rank: More capacity for medical knowledge
Lower learning rate: Avoid catastrophic forgetting
Cache clearing: GPU stability on this hardware

Claude's Implementation

Generated Experiment 7 with:

# Increased from r=16 to r=64
lora_config = LoraConfig(r=64, lora_alpha=128, ...)

# Reduced from 2e-4 to 5e-5
learning_rate=5e-5

# Added stability
if step % 50 == 0:
    torch.cuda.empty_cache()

Result: 82% accuracy (+12% improvement)

What I Learned

Delegation worked because:

I provided clear constraints ("change only hyperparameters")
Claude handled the implementation details
I made the strategic choice of what to optimize

I didn't need to: Write the code, debug syntax errors, or structure the experiment.

I did need to: Decide which hyperparameters mattered and when to stop.

Day 5-6: The Architecture Change

The Strategic Decision

With 82% in hand, the question: "Is this good enough?"

For medical AI, I wanted 90%+. That meant changing architecture, not just hyperparameters.

My instruction:

"Switch from standard Trainer to SFTTrainer from HuggingFace TRL. Keep the optimized hyperparameters from Experiment 7. SFTTrainer handles text-to-text better for Q&A tasks."

Why I Chose This Path

I knew from experience:

SFTTrainer optimizes for instruction-following
Dynamic padding improves efficiency
Better label handling for Q&A format

Claude didn't suggest this. This was domain knowledge I brought to the table.

The Implementation

Claude converted the training pipeline:

Reformatted dataset for text input
Switched Trainer class to SFTTrainer
Kept all hyperparameters from Experiment 7
Added dynamic padding config

Training time: 10 hours overnight.

Result next morning: Quick eval showed 89% accuracy.

The Pivot Point

Me: "That's promising but not conclusive. We need full evaluation on 500 examples, not 100."

Claude: Implemented comprehensive evaluation script with:

500 examples (proper sample size)
McNemar's test for statistical significance
Detailed error analysis
Multiple temperature testing

Final result: 92.4% accuracy, p < 0.001 (extremely significant).

Day 7: The Hardware Surprise

When Everything Looked Broken

Ran Experiment 8 evaluation. Got nonsensical results:

All predictions were "maybe"
Empty responses
Random gibberish

My first instinct: "The training failed."

The Debugging Dance

What I did:

Checked training logs (looked fine)
Validated checkpoints (valid)
Reviewed dataset format (correct)

What I asked Claude:

"Training looks good but evaluation is broken. Test inference on CPU instead of GPU."

Why I made this call: I'd seen GPU inference issues on this hardware before (ARM64 + Blackwell).

The Discovery

CPU inference: 92.4% accuracy. Perfect.

The model was fine. GPU inference was broken on this specific hardware.

Time saved: 15-20 hours of future debugging by documenting this hardware limitation.

What This Taught Me

I could delegate: Code writing, testing different approaches, running evaluations.

I couldn't delegate: Knowing when to question the premise. "Maybe the model isn't broken, maybe the evaluation platform is."

This was human intuition from past experience.

The Decision Framework That Emerged

Over 9 days, a pattern became clear about what to delegate and what to control.

What I Delegated Successfully

✅ Implementation Details

Writing training loops
Data preprocessing pipelines
Evaluation scripts
Logging and monitoring

✅ Systematic Testing

Running experiments
Testing hyperparameters
Error analysis
Statistical validation

✅ Documentation

Session logs
README updates
Deployment guides
Code comments

Time saved: 40+ hours of coding and documentation.

What I Had to Decide

🎯 Strategic Direction

Which hyperparameters to optimize first
When to change architecture
Required sample size for validation
Temperature settings for medical Q&A

🎯 Quality Bar

82% is good, but 90%+ is the target
Need statistical significance, not just improvement
Production deployment requirements

🎯 Problem Diagnosis

"Model is broken" vs "Evaluation platform is broken"
When to pivot vs when to persist
Hardware limitations vs software bugs

Time invested: Maybe 10 hours total across 9 days.

The ROI Calculation

Time Breakdown

Total experiment: 60+ hours

Training time: 40 hours (mostly unattended)
Evaluation: 15 hours
Analysis and documentation: 5 hours

My active time: ~10-12 hours

Strategic decisions: 3 hours
Reviewing results: 4 hours
Problem diagnosis: 3 hours
Final validation: 2 hours

Claude's contribution: ~48 hours of work (coding, testing, documentation)

What I Gained

Direct time savings: 48 hours of coding and testing

Quality improvements:

Statistical validation I might have skipped (N=500 vs N=100)
Comprehensive error analysis
Complete documentation
Production-ready deployment

Knowledge captured:

Hardware limitation documented
Progressive optimization strategy validated
Temperature tuning insights

Value beyond this project: The hardware discovery alone saved 15-20 hours on future experiments.

The Collaboration Patterns

Pattern 1: "Try This Approach"

When to use: You know the direction but not the implementation.

Example:

"Switch to SFTTrainer from TRL library. Keep the hyperparameters from Experiment 7."

Claude handles: Implementation, testing, debugging.

You handle: Knowing SFTTrainer exists and when to use it.

Pattern 2: "Diagnose This Problem"

When to use: Something is broken but you're not sure what.

Example:

"Evaluation shows all 'maybe' predictions. Check: training loss, checkpoint validity, dataset format, and try CPU inference."

Claude handles: Systematic testing of each hypothesis.

You handle: Defining the diagnostic checklist.

Pattern 3: "Validate This Thoroughly"

When to use: Results look good but need statistical proof.

Example:

"Quick eval shows 89% vs 84%. Run full evaluation on 500 examples with McNemar's test for significance."

Claude handles: Implementing proper statistical tests.

You handle: Knowing what "thorough" means in your domain.

Pattern 4: "Explore and Report"

When to use: You want to understand tradeoffs.

Example:

"Test temperatures 0.3, 0.7, and 1.0. Compare accuracy, error types, and confidence."

Claude handles: Running experiments and analyzing results.

You handle: Deciding which temperature to use in production.

The Moments I Had to Intervene

Intervention 1: Sample Size (Day 6)

Quick eval showed: 89% vs 84% (p=0.0736, not significant)

Claude's conclusion: "Improvement not statistically significant."

My call: "Run full eval with N=500. Quick eval has insufficient power to detect 5-10% differences."

Why I intervened: Domain knowledge about medical ML validation requirements.

Result: Full eval showed p < 0.001 (extremely significant).

Intervention 2: Architecture Choice (Day 5)

Situation: 82% accuracy with optimized hyperparameters.

Claude's suggestion: Continue tuning hyperparameters.

My call: "Switch to SFTTrainer instead. Hyperparameters are optimized enough."

Why I intervened: Knew from experience that architectural changes would yield more improvement than hyperparameter tweaking.

Result: Jumped from 82% to 92.4%.

Intervention 3: GPU vs CPU (Day 7)

Situation: Evaluation producing nonsense on GPU.

Claude's approach: Debugging training code.

My call: "Try CPU inference first. This hardware has known GPU issues."

Why I intervened: Past experience with this specific hardware.

Result: CPU inference worked perfectly, revealed GPU limitation.

What Surprised Me

Surprise 1: Documentation Quality

Claude's documentation was often better than mine:

More complete
Better structured
Included examples I would have skipped

Why: Takes time I'm usually too impatient to spend.

Surprise 2: Error Analysis Depth

Generated comprehensive error breakdowns:

Hedging errors: 36 → 0
Empty responses: 22 → 20
Misclassifications: 19 → 13

I would have looked at: Overall accuracy only.

Claude showed: Where the improvement came from.

Surprise 3: Statistical Rigor

Implemented proper statistical tests without prompting:

McNemar's test for paired samples
Cohen's h for effect size
95% confidence intervals

Why this matters: Medical AI needs validation rigor. Claude defaulted to best practices.

The Anti-Patterns I Avoided

Anti-Pattern 1: "Just Fix Everything"

Temptation: Change hyperparameters, architecture, and dataset simultaneously.

Why bad: Can't tell what helped.

What I did: Progressive refinement (baseline → hyperparameters → architecture).

Result: Clear understanding of each contribution.

Anti-Pattern 2: "Claude Will Figure It Out"

Temptation: Give vague instructions and expect magic.

Why bad: Claude doesn't have your domain knowledge or strategic vision.

What I did: Clear, specific instructions with constraints.

Result: Efficient collaboration with minimal back-and-forth.

Anti-Pattern 3: "Quick Eval Is Good Enough"

Temptation: N=100 showed improvement, ship it.

Why bad: Not statistically significant (p=0.0736).

What I did: Required N=500 for proper validation.

Result: Achieved p < 0.001, publishable result.

Lessons for Your Next ML Project

1. Define Success Upfront

Before delegating anything:

What accuracy do you need?
What sample size for validation?
What deployment constraints?

I defined: 90%+ accuracy, N=500 validation, CPU-compatible inference.

2. Start with a Baseline

Five experiments at 70% weren't failures. They were establishing a reproducible starting point.

Without this: No way to measure improvement.

3. Progressive Refinement Beats Big Bang

Optimize one dimension at a time:

Baseline
Hyperparameters
Architecture

Why: Know what contributes to improvement.

4. Bring Domain Knowledge to Decisions

Claude knew: How to implement SFTTrainer.

I knew: When to switch to SFTTrainer.

The combination: 92.4% accuracy.

5. Question Your Assumptions

When evaluation looked broken, I could have spent 20 hours debugging training code.

Instead: "Maybe GPU inference is broken, not the model."

Result: 5-minute fix (switch to CPU).

6. Document the Journey

Claude generated session logs, README updates, and deployment guides.

Time cost: Nearly zero (automated).

Value: Can return to project months later with full context.

The Bottom Line

60+ hours of work over 9 days.

My active time: 10-12 hours of strategic decisions and validation.

Result: Production-ready medical AI with 92.4% accuracy and complete documentation.

Time saved: 48 hours of coding, testing, and documentation.

Knowledge gained: Hardware discovery, optimization strategies, validation rigor.

When This Pattern Works

This delegation pattern works best when:

✅ You have clear goals

Know what success looks like
Can define quality bar
Understand validation requirements

✅ You can make strategic decisions

What to optimize first
When to pivot
When "good enough" isn't enough

✅ You have domain knowledge

Know the tools in your space
Recognize patterns from experience
Can diagnose root causes

✅ You're willing to iterate

Not looking for one-shot solutions
Open to progressive refinement
Can course-correct based on results

When You Need to Step In

Watch for these signals:

🚨 Strategic Decisions

Which approach to try next
When to change direction
Setting quality standards

🚨 Domain Knowledge

Tool selection (SFTTrainer vs Trainer)
Validation requirements (N=500 not N=100)
Hardware limitations

🚨 Root Cause Analysis

"Model broken" vs "Platform broken"
When to question premises
Pattern recognition from experience

What I'd Do Differently

Start with Statistical Requirements

I said: "Get to 90%+ accuracy."

I should have said: "Get to 90%+ accuracy with N=500, p < 0.05, validated on CPU and GPU."

Why: Would have avoided the quick eval detour.

Document Hardware Limitations First

I discovered: GPU inference broken on day 7.

I should have: Tested both GPU and CPU inference on day 1.

Why: Would have saved 15-20 hours of confusion.

Define "Done" Explicitly

I said: "Optimize until it's good."

I should have said: "Stop at 90%+ accuracy OR after 3 optimization cycles."

Why: Clearer stopping point prevents over-optimization.

Your Turn

Next time you're facing a complex ML project:

Ask yourself:

What's my baseline?
What am I optimizing for?
What can I delegate and what must I decide?
When will I know I'm done?

Try delegating:

Implementation details
Systematic testing
Documentation
Error analysis

Keep control of:

Strategic direction
Quality standards
Domain expertise
When to pivot

The goal isn't to automate everything. It's to amplify your expertise by delegating the 80% of work that's systematic, while focusing your time on the 20% that requires human judgment.

Want to learn more about AI-assisted ML workflows? This approach to progressive refinement and strategic delegation applies across domains. The key is knowing when to trust your AI partner and when to step in with human expertise.

Questions about delegating complex experiments? Reach out on Twitter/X or check out more DGX Lab Chronicles articles.

About the Author: Justin Johnson builds AI systems and writes about practical AI development.

justinhjohnson.com | Twitter | LinkedIn | Run Data Run | Subscribe

Related experiments

Apparatus

1,213 words · 14 min read

ai-collaboration
delegation
ml-ops
medical-ai
fine-tuning

How I Delegated a 9-Day Medical AI Experiment (and Learned When to Step In)

Day 1-2: Setting Up the Baseline

My Role: Define the Goal

Claude's Role: Implement and Iterate

The Decision Point

Day 3-4: The Hyperparameter Phase

Delegation Strategy: Test One Dimension

Claude's Implementation

What I Learned

Day 5-6: The Architecture Change

The Strategic Decision

Why I Chose This Path

The Implementation

The Pivot Point

Day 7: The Hardware Surprise

When Everything Looked Broken

The Debugging Dance

The Discovery

What This Taught Me

The Decision Framework That Emerged

What I Delegated Successfully

What I Had to Decide

The ROI Calculation

Time Breakdown

What I Gained

The Collaboration Patterns

Pattern 1: "Try This Approach"

Pattern 2: "Diagnose This Problem"

Pattern 3: "Validate This Thoroughly"

Pattern 4: "Explore and Report"

The Moments I Had to Intervene

Intervention 1: Sample Size (Day 6)

Intervention 2: Architecture Choice (Day 5)

Intervention 3: GPU vs CPU (Day 7)

What Surprised Me

Surprise 1: Documentation Quality

Surprise 2: Error Analysis Depth

Surprise 3: Statistical Rigor

The Anti-Patterns I Avoided

Anti-Pattern 1: "Just Fix Everything"

Anti-Pattern 2: "Claude Will Figure It Out"

Anti-Pattern 3: "Quick Eval Is Good Enough"

Lessons for Your Next ML Project

1. Define Success Upfront

2. Start with a Baseline

3. Progressive Refinement Beats Big Bang

4. Bring Domain Knowledge to Decisions

5. Question Your Assumptions

6. Document the Journey

The Bottom Line

When This Pattern Works

When You Need to Step In

What I'd Do Differently

Start with Statistical Requirements

Document Hardware Limitations First

Define "Done" Explicitly

Your Turn

Related Articles

Related Articles

Get the next experiment

Related experiments

Apparatus

Links to this entry