
The Hidden Crisis in LLM Fine-Tuning: When Your Model Silently Forgets Everything

I thought I had a successful fine-tuning experiment. The metrics showed 52% accuracy on a medical QA task. Four subsequent experiments failed at 0%. Something seemed wrong, so I investigated.

What I discovered was worse than I imagined: all five experiments had failed. The "successful" baseline never worked at all. The models weren't generating text. They were returning empty strings, silently forgetting everything they once knew.

This is catastrophic forgetting in action, and it's one of the most insidious problems in modern LLM fine-tuning.

The Silent Killer

What Catastrophic Forgetting Looks Like

When a model experiences complete catastrophic forgetting:

  • No errors or warnings during training
  • Model loads successfully and inference runs
  • Zero new tokens generated (returns only the prompt)
  • Evaluation metrics show 0% accuracy
  • No indication of what went wrong

Here's what I saw in the logs:

Prompt length: 2171 chars
Full decoded length: 2171 chars  # SAME LENGTH!
New tokens generated: 0
Response: '' (empty string)
Predicted answer: 'unknown'
Accuracy: 0/3 (0%)

The model was loaded, inference was running, but nothing was coming out. Just the prompt echoed back.

How We Got Here: The Investigation

Experiment 1 Seemed to Work

Initial results from Experiment 1:

  • Accuracy: 52% on PubMedQA test set
  • Training loss: 1.99 (reasonable)
  • LoRA configuration: rank=16, alpha=32, LR=2e-4

This became our baseline. We built four more experiments trying to improve on it.

Experiments 2-5 All Failed

Every subsequent experiment produced 0% accuracy:

  • Experiment 2: Different learning rate (1e-4)
  • Experiment 3: More epochs (3 instead of 2)
  • Experiment 4: Lower learning rate (5e-5)
  • Experiment 5: Combined best settings

The pattern seemed clear: Experiment 1 worked, everything else broke. But why?

The Smoking Gun

When I tested Experiment 1 with simple prompts ("Hello, how are you?", "What is 2+2?"), I got empty strings. Every time.

That shouldn't be possible with a model showing 52% accuracy.

Deeper investigation revealed the truth:

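# Testing generation capability
# (assumes `model` and `tokenizer` are a loaded Hugging Face causal LM and its tokenizer)
test_prompts = [
    "Hello, how are you?",
    "What is 2+2?",
    "Is the sky blue? Answer yes or no."
]

for prompt in test_prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=100)
    # Count tokens, not characters: generate() returns prompt IDs + new IDs
    new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
    print(f"Generated tokens: {new_tokens}")
    # Result: 0, 0, 0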

Experiment 1 never generated a single token. The 52% accuracy result was invalid, likely from a script version mismatch or evaluation bug.

All five experiments had completely failed.
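
One way to catch an invalid headline number like this is to have the evaluation script report the share of empty responses next to accuracy. The sketch below is an illustration, not the script used in these experiments: `EvalReport`, `score_predictions`, the prefix-match scoring rule, and the 5% threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalReport:
    accuracy: float      # fraction of examples answered correctly
    empty_rate: float    # fraction of examples with an empty response
    trustworthy: bool    # False when too many responses were empty


def score_predictions(responses, labels, max_empty_rate=0.05):
    """Score short answers and flag runs dominated by empty output.

    `responses` are generated strings with the prompt already stripped;
    `labels` are gold answers. The threshold is an assumed default.
    """
    if len(responses) != len(labels):
        raise ValueError("responses and labels must have the same length")
    if not responses:
        raise ValueError("nothing to score")

    empty = sum(1 for r in responses if not r.strip())
    correct = sum(
        1 for r, gold in zip(responses, labels)
        if r.strip() and r.strip().lower().startswith(gold.lower())
    )
    n = len(responses)
    empty_rate = empty / n
    return EvalReport(
        accuracy=correct / n,
        empty_rate=empty_rate,
        trustworthy=empty_rate <= max_empty_rate,
    )


# Three empty responses, as in the logs above
report = score_predictions(["", "", ""], ["yes", "no", "maybe"])
print(report)
```

A report with `trustworthy=False` is a cue to read the raw outputs before quoting any accuracy figure.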

The Root Cause: Data, Not Hyperparameters

After confirming total failure, I dove into recent research on LoRA fine-tuning (2024-2025 papers and industry guides). The answer was staring me in the face.

What We Used

  • 200 training examples
  • Rank: 16
  • Alpha: 32
  • Learning rate: varied (5e-5 to 2e-4)
  • Epochs: 1-3

What Research Says You Need

Parameter      | Our Config    | Recommended      | Gap
Dataset Size   | 200           | 5,000-100,000    | 25-500x under
Rank (r)       | 16            | 8-16             | ✅ Correct
Alpha          | 32            | 2×rank (16-32)   | ✅ Correct
Learning Rate  | 5e-5 to 2e-4  | 1e-4 to 2e-4     | ✅ Close
Epochs         | 1-3           | 1-3              | ✅ Correct

The problem wasn't our hyperparameters. We were obsessing over learning rates and epochs while sitting on a dataset 25x smaller than the minimum threshold.
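
The 25x and 500x figures are just the recommended range divided by what we had. As a quick worked check (the helper name `shortfall` is made up for this example):

```python
def shortfall(have, recommended_min, recommended_max):
    """Return how many times smaller `have` is than each end of the range."""
    return recommended_min / have, recommended_max / have


low, high = shortfall(have=200, recommended_min=5_000, recommended_max=100_000)
print(f"Dataset is {low:.0f}x to {high:.0f}x below the recommended range")
```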

Critical Finding

Research consensus from multiple 2024-2025 studies:

  • Minimum: 5,000 examples for basic fine-tuning success
  • Optimal: 50,000-100,000 examples for strong performance
  • Saturation: ~100,000 examples (diminishing returns beyond)

With 200 examples, catastrophic forgetting is almost guaranteed.

Why Dataset Size Matters More Than You Think

The latest research on catastrophic forgetting reveals scaling laws:

"LoRA Learns Less and Forgets Less" (Biderman et al., 2024):

  • Forgetting increases with number of update steps
  • Parameter-efficient methods still suffer from forgetting with small datasets
  • Dataset size has a larger effect than hyperparameter tuning

"Scaling Laws for Forgetting When Fine-Tuning LLMs" (2024):

  • Can achieve good results with <1% of weights trainable
  • Only if dataset size is sufficient (5k+ examples)
  • Forgetting scales predictably with data insufficiency

The Irony: We Had the Data

Here's what hurts: we had access to massive datasets all along:

Available on HuggingFace:

  • PubMedQA Artificial: 211,269 examples
  • MedQA (English): 12,723 examples
  • MedMCQA: ~194,000 examples

We used a tiny 200-example subset for "quick experiments."

Quick experiments that all failed.

What Modern Research Tells Us

I reviewed recent papers and industry guides to understand best practices:

Key Sources (2024-2025)

  1. Sebastian Raschka - "Practical Tips for Finetuning LLMs Using LoRA"
  2. Databricks - "Efficient Fine-Tuning with LoRA"
  3. QLoRA paper (Dettmers et al.)
  4. Med42 study on medical domain LoRA
  5. Google AI - "Fine-tune Gemma in Keras using LoRA"

Consensus Configuration

For Medical Domain QA:

lora_config = {
    "r": 16,                      # Rank (medical needs higher than r=8 baseline)
    "lora_alpha": 32,             # Alpha = 2×rank (Microsoft recommendation)
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj"      # MLP (all layers!)
    ],
    "lora_dropout": 0.05,
    "learning_rate": 1e-4,        # Industry standard
    "num_epochs": 2,              # Conservative
    "dataset_size": 10000         # Well above 5k threshold
}

Critical: Target all seven projection modules listed above (four attention, three MLP), not just attention. The QLoRA paper reported that applying LoRA to all linear layers was needed to match full fine-tuning quality.
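
The dict above mixes adapter settings with training and data settings. In practice these usually end up in different places, so here is a minimal sketch of splitting them. It assumes you use the Hugging Face `peft` library, whose `LoraConfig` accepts `r`, `lora_alpha`, `target_modules`, and `lora_dropout`; the helper `split_config` is hypothetical.

```python
# Keys that match keyword arguments of peft.LoraConfig. Everything else in the
# flat dict is a training or data setting and belongs with the trainer.
ADAPTER_KEYS = {"r", "lora_alpha", "target_modules", "lora_dropout"}


def split_config(flat_config):
    """Separate adapter settings from training/data settings."""
    adapter = {k: v for k, v in flat_config.items() if k in ADAPTER_KEYS}
    training = {k: v for k, v in flat_config.items() if k not in ADAPTER_KEYS}
    return adapter, training


flat = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "lora_dropout": 0.05,
    "learning_rate": 1e-4,
    "num_epochs": 2,
    "dataset_size": 10000,
}

adapter_kwargs, training_kwargs = split_config(flat)

# With peft installed, the adapter settings could then be passed as:
#   from peft import LoraConfig
#   config = LoraConfig(task_type="CAUSAL_LM", **adapter_kwargs)
print(adapter_kwargs)
print(training_kwargs)
```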

Prevention Strategies

1. Check Dataset Size FIRST

Before you tune a single hyperparameter:

def estimate_min_dataset_size(task_complexity, domain):
    base_minimum = 5000

    complexity_multipliers = {
        "simple": 1.0,      # Binary classification
        "medium": 1.5,      # Multi-class, QA
        "complex": 2.0      # Generation, reasoning
    }

    domain_multipliers = {
        "general": 1.0,
        "technical": 1.2,   # Code, legal
        "medical": 1.5      # Specialized knowledge
    }

    return int(base_minimum *
               complexity_multipliers[task_complexity] *
               domain_multipliers[domain])

# Medical QA is medium complexity, medical domain
min_size = estimate_min_dataset_size("medium", "medical")
print(f"Minimum dataset size: {min_size}")  # 7,500 examples

2. Test Generation Early

Don't wait for full evaluation. Test during training:

def smoke_test_generation(model, tokenizer):
    """Quick check that the model can generate anything"""
    test_prompts = [
        "Hello, how are you?",
        "What is 2+2?",
        "Explain photosynthesis in one sentence."
    ]

    for prompt in test_prompts:
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
        output = model.generate(input_ids, max_new_tokens=50)
        # output holds prompt + continuation, so subtract the prompt length
        new_tokens = output.shape[1] - input_ids.shape[1]

        if new_tokens == 0:
            return False, f"Failed on: {prompt}"

    return True, "All tests passed"

# Run after each epoch, inside the training loop so `break` can stop training
for epoch in range(num_epochs):
    ...  # train for one epoch here
    success, message = smoke_test_generation(model, tokenizer)
    if not success:
        print(f"⚠️  WARNING: Catastrophic forgetting detected: {message}")
        break  # Stop training

3. Monitor Training Loss

While not sufficient alone, training loss gives signals:

# During training
if epoch_loss > 2.0:
    print("⚠️  High training loss - model struggling to learn")
    print("   Possible causes:")
    print("   1. Learning rate too high/low")
    print("   2. Dataset too small")
    print("   3. Task mismatch with model capabilities")
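
A single noisy epoch above 2.0 is less informative than a loss that stays there. A small sketch of tracking persistence follows; the class name `LossWatch` and the `patience` parameter are my own, and the 2.0 threshold is the rule of thumb from this post, not a universal constant. Keep in mind that Experiment 1 finished at 1.99 and still generated nothing, so this check complements the smoke test rather than replacing it.

```python
class LossWatch:
    """Flag when epoch loss stays above a threshold for several epochs in a row."""

    def __init__(self, threshold=2.0, patience=2):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0  # consecutive epochs above the threshold

    def update(self, epoch_loss):
        """Record one epoch's loss; return True once the streak reaches patience."""
        if epoch_loss > self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.patience


watch = LossWatch(threshold=2.0, patience=2)
flags = [watch.update(loss) for loss in [2.4, 1.9, 2.3, 2.2]]
print(flags)  # [False, False, False, True]
```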

4. Use LoRA Over Full Fine-Tuning

LoRA's "learns less, forgets less" property helps:

  • Trains only 0.1-1% of parameters
  • Preserves pretrained knowledge better
  • Reduces catastrophic forgetting risk
  • But still needs sufficient data (5k+)

LoRA Is Not a Magic Bullet

LoRA reduces forgetting compared to full fine-tuning, but it cannot overcome fundamentally insufficient data. With 200 examples, even LoRA will fail.

Think of LoRA as a seatbelt: it improves safety, but you still shouldn't drive off a cliff.
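
To see where the "0.1-1% of parameters" figure comes from: a LoRA adapter on a weight of shape d_out × d_in adds rank × (d_in + d_out) parameters. The sketch below applies that to all seven projection modules. The layer shapes are assumptions for illustration (a 7B-class decoder without grouped-query attention); this post does not name the base model, so plug in your own numbers.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Assumed shapes for illustration only; not taken from the experiments above.
HIDDEN, INTERMEDIATE, LAYERS, BASE_PARAMS = 4096, 11008, 32, 6.7e9
RANK = 16

per_layer = (
    4 * lora_params(HIDDEN, HIDDEN, RANK)            # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(HIDDEN, INTERMEDIATE, RANK)    # gate_proj, up_proj
    + lora_params(INTERMEDIATE, HIDDEN, RANK)        # down_proj
)
total = per_layer * LAYERS
fraction = total / BASE_PARAMS
print(f"{total:,} trainable parameters, {100 * fraction:.2f}% of the base model")
```

Under these assumptions the adapters come to roughly 40 million parameters, which sits inside the 0.1-1% range quoted above.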

Advanced Solutions (When Basic Approaches Fail)

If you've hit the dataset size threshold and still see forgetting:

Forgetting-Aware Methods

FIP (Forgetting-Aware Instance Prioritization):

  • Prioritizes examples that prevent forgetting
  • Requires identifying "anchor" instances

I-LoRA (Importance-based LoRA):

  • Weights LoRA parameters by importance
  • Reduces forgetting on critical capabilities

EWC (Elastic Weight Consolidation):

  • Constrains updates to preserve important weights
  • Classic approach from continual learning

Experience Replay:

  • Mix in examples from pretraining distribution
  • Maintains general capabilities while specializing
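
For reference, the EWC idea in the list above is usually written as a quadratic penalty added to the new-task loss (the form given in Kirkpatrick et al., 2017). The symbol names here are the conventional ones, not something defined elsewhere in this post:

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\text{new}}(\theta)
  + \frac{\lambda}{2} \sum_i F_i \left(\theta_i - \theta^{*}_i\right)^2
```

Here \theta^{*} are the weights before fine-tuning, F_i is the diagonal Fisher information estimated on the old task (a proxy for how important weight i is), and \lambda sets how strongly important weights are held in place.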

Multi-Dataset Blending

Instead of overfitting to a single domain:

# Blend datasets to preserve general capabilities
train_data = {
    "medical_qa": 10000,      # Your target task
    "general_qa": 2000,       # Preserve general QA
    "instruction": 1000        # Preserve instruction following
}

# 13,000 total, 77% medical focus, 23% preservation
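
The counts above can be turned into an actual training mix with a few lines. This is a minimal sketch under my own assumptions: `blend` is a hypothetical helper, and the pools are placeholder strings standing in for real datasets.

```python
import random


def blend(sources, counts, seed=0):
    """Sample counts[name] examples from each source and shuffle them together."""
    rng = random.Random(seed)
    mixed = []
    for name, n in counts.items():
        pool = sources[name]
        if n > len(pool):
            raise ValueError(f"{name}: requested {n}, only {len(pool)} available")
        mixed.extend((name, example) for example in rng.sample(pool, n))
    rng.shuffle(mixed)  # avoid training on one source at a time
    return mixed


# Placeholder pools (hypothetical data), larger than the requested counts
sources = {
    "medical_qa": [f"med-{i}" for i in range(20_000)],
    "general_qa": [f"gen-{i}" for i in range(5_000)],
    "instruction": [f"ins-{i}" for i in range(5_000)],
}
counts = {"medical_qa": 10_000, "general_qa": 2_000, "instruction": 1_000}

mixed = blend(sources, counts)
medical_share = sum(1 for name, _ in mixed if name == "medical_qa") / len(mixed)
print(len(mixed), f"{medical_share:.0%}")  # 13000 77%
```

Keeping the source name on each example makes it easy to report per-source metrics later.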

Real-World Case Studies

Med42 (April 2024)

  • Task: Medical benchmark including PubMedQA
  • Method: LoRA with r=8, 8 epochs
  • Result: Matched full fine-tuning quality
  • Key: Used full PubMedQA dataset, not tiny subset

QLoRA on PubMedQA (June 2024)

  • Setup: 4-bit quantization, single A100 GPU
  • Result: +1.0% to +8.0% accuracy improvement
  • Key: Proper dataset size, standard LoRA config

Both succeeded because they used adequate data.

The Path Forward: A Research-Backed Plan

Based on 2024-2025 consensus, here's the systematic approach:

Phase 1: Establish Baseline (Week 1)

  • Dataset: 10,000 examples (2x minimum threshold)
  • Config: Industry-standard LoRA (r=16, alpha=32, LR=1e-4)
  • Goal: Prove fine-tuning CAN work (>50% accuracy)
  • Validation: Smoke tests every epoch

Phase 2: Optimize (Week 2)

  • Hyperparameter sweep: Rank (8/16/32), LR (5e-5 to 3e-4), Epochs (1-4)
  • Goal: Find optimal config for your domain
  • Validation: Compare on held-out test set
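
The Phase 2 sweep can be enumerated up front so every run is accounted for. The specific learning-rate points are my assumption (the plan only gives the range 5e-5 to 3e-4), and alpha follows the 2×rank convention used earlier in this post.

```python
from itertools import product

ranks = [8, 16, 32]
learning_rates = [5e-5, 1e-4, 2e-4, 3e-4]   # assumed points within the stated range
epochs = [1, 2, 3, 4]

grid = [
    {"r": r, "lora_alpha": 2 * r, "learning_rate": lr, "num_epochs": e}
    for r, lr, e in product(ranks, learning_rates, epochs)
]
print(len(grid))  # 48
```

Forty-eight full runs may be more than a week of compute allows, so sampling a subset of the grid is a reasonable alternative.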

Phase 3: Scale Data (Week 3)

  • Dataset sizes: Test 5k, 10k, 25k, 50k, 100k
  • Goal: Plot accuracy vs dataset size curve
  • Finding: Identify your sweet spot (cost vs performance)
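
For the accuracy-versus-size curve to be meaningful, it helps if each smaller training set is contained in the next larger one, so the only thing changing between runs is the amount of data. A sketch of building such nested subsets (`nested_subsets` is a hypothetical helper, and the pool is a list of placeholder IDs):

```python
import random


def nested_subsets(examples, sizes, seed=0):
    """Return {size: subset} where each smaller subset is a prefix of the larger ones."""
    if max(sizes) > len(examples):
        raise ValueError("largest size exceeds available examples")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # one fixed shuffle shared by all sizes
    return {n: shuffled[:n] for n in sorted(sizes)}


pool = list(range(100_000))  # placeholder IDs standing in for real examples
subsets = nested_subsets(pool, [5_000, 10_000, 25_000, 50_000, 100_000])
print({n: len(s) for n, s in subsets.items()})
```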

Phase 4: Production (Week 4)

  • Multi-dataset blend: Add general QA for robustness
  • QLoRA comparison: Memory vs accuracy tradeoff
  • Goal: Deploy-ready model with verified quality

Key Takeaways

The Hidden Nature of Catastrophic Forgetting:

  1. Silent failure (no errors or warnings)
  2. Looks like working code (model loads, inference runs)
  3. Only detected by output inspection (empty strings, 0% accuracy)

The Real Solution:

  1. Dataset size >>> hyperparameters (we were 25x below the minimum)
  2. Check data BEFORE tuning (5k+ minimum)
  3. Test generation early (smoke tests during training)
  4. Use LoRA, but don't rely on it alone (still needs sufficient data)

The Expensive Lesson:

  • 5 failed experiments
  • 2-3 weeks of GPU time
  • Could have been avoided with literature review upfront

The One-Minute Check

Before starting ANY fine-tuning project:

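import sys

dataset_size = len(train_dataset)  # `train_dataset` is a placeholder for your training set

if dataset_size < 5000:
    print("⚠️  STOP: Insufficient data")
    print(f"   You have: {dataset_size} examples")
    print("   You need: 5,000+ minimum")
    print("   Get more data or risk 100% failure")
    sys.exit(1)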

This simple check would have saved us weeks.

The Emerging Challenge

Catastrophic forgetting is becoming more visible as fine-tuning moves from research to production:

  • More teams fine-tuning LLMs for specialized domains
  • Pressure to use small datasets (cost, privacy, availability)
  • Silent failures in production (models deployed without proper validation)
  • Industry learning what actually works (2024-2025 consensus emerging)

The gap between "it trained successfully" and "it actually works" is wider than many realize.

What This Means for You

If you're planning to fine-tune:

  • Budget for data FIRST, compute second
  • 5k examples minimum, aim for 10k+
  • Test generation capability during training
  • Read recent literature (2024-2025) before experimenting

If you've seen mysterious failures:

  • Check your dataset size immediately
  • Test generation with simple prompts
  • Don't assume metrics are valid without inspecting outputs
  • Small improvements in hyperparameters won't fix insufficient data

If you're deploying fine-tuned models:

  • Validate generation works before production
  • Monitor for empty outputs or degraded quality
  • Have rollback plan if forgetting emerges post-deployment
  • Consider multi-dataset blending for robustness
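
Monitoring for empty outputs in production can be as simple as a sliding-window counter. The sketch below is one possible shape for it; the class name, window size, and alert threshold are all assumptions to adapt to your traffic.

```python
from collections import deque


class EmptyOutputMonitor:
    """Track the share of empty responses over a sliding window of requests."""

    def __init__(self, window=100, max_empty_rate=0.02):
        self.recent = deque(maxlen=window)  # True for each empty response
        self.max_empty_rate = max_empty_rate

    def empty_rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def record(self, response):
        """Record one response; return True if the window's empty rate is too high."""
        self.recent.append(not response.strip())
        return self.empty_rate() > self.max_empty_rate


monitor = EmptyOutputMonitor(window=10, max_empty_rate=0.2)
alerts = [monitor.record(r) for r in ["yes", "", "no", "", ""]]
print(alerts)  # [False, True, True, True, True]
```

An alert from a monitor like this is a natural trigger for the rollback plan mentioned above.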

Looking Ahead

The field is rapidly evolving:

  • Better forgetting-aware methods (FIP, I-LoRA gaining traction)
  • Automated dataset size estimation (per-task recommendations)
  • Improved monitoring tools (detect forgetting during training)
  • Multi-dataset best practices (preservation + specialization)

But the fundamental lesson remains: you cannot tune your way out of insufficient data.


Try It Yourself

Before your next fine-tuning project:

  1. Check dataset size against 5k minimum threshold
  2. Review 2024-2025 LoRA best practices (papers + industry guides)
  3. Add smoke tests to your training loop (test generation every epoch)
  4. Monitor training loss and investigate if >2.0 persists
  5. Inspect outputs manually before trusting accuracy metrics

Save yourself the painful lesson of catastrophic forgetting.



About the Author: Justin Johnson builds AI systems and writes about practical AI development.
