Vishva R

Posted on Nov 25 • Originally published at github.com

Fine-tuning Qwen 2.5 3B for RBI Regulations: Achieving 8x Performance with Smart Data Augmentation

#finetuning #llm #machinelearning #ai

I fine-tuned Qwen 2.5 3B on Reserve Bank of India (RBI) regulatory questions and achieved 57.6% accuracy — an 8.2x improvement over the base model's 7%. The secret? Data augmentation through rephrasing and efficient LoRA training with Unsloth.

Key Results:

🎯 Base model: 7% → Fine-tuned: 57.6%
⚡ Training time: 2 hours on single GPU
💾 Memory: Only ~8GB VRAM used
📊 Dataset: 47K QA pairs (12K original + 35K rephrased)

🔗 Model on Hugging Face | GitHub Repo

🎯 The Problem: Generic Models Fail on Domain-Specific Tasks

Large Language Models like GPT-4, Claude, and Llama are impressive generalists, but they struggle with specialized domains that require:

Precise factual knowledge (exact dates, amounts, regulations)
Domain-specific terminology (Basel III, FEMA, NPAs, CRAR)
Contextual understanding (different rules for different institution types)

When I tested Qwen 2.5 3B (a strong base model) on RBI regulatory questions, it achieved only 7% accuracy. Questions like:

"What are the priority sector lending targets for scheduled commercial banks excluding RRBs?"

Got responses like:

❌ Vague generalizations
❌ Outdated information
❌ Missing critical details (specific percentages, dates, exclusions)

The challenge: How do we transform a general-purpose 3B model into a specialized RBI expert?

💡 The Solution: Smart Data Augmentation + Efficient Fine-tuning

My approach combined two key strategies:

1. Data Augmentation via Rephrasing (The Game Changer)

Instead of just collecting 12K QA pairs, I generated 3 rephrased versions of each question:

Original: "What relaxations were provided by RBI regarding regulatory 
           returns during COVID-19?"

Rephrased 1: "Can you describe the regulatory return submission 
              relaxations that RBI provided during COVID-19?"

Rephrased 2: "How did the Reserve Bank of India ease regulations on 
              regulatory filings in light of the pandemic?"

Rephrased 3: "Explain RBI's policy on delayed regulatory submissions 
              during the coronavirus crisis."

Why this works:

Prevents phrase memorization: Model learns the underlying concept, not just exact wording
Increases effective dataset size: 12K concepts × 4 phrasings = 48K training examples
Improves generalization: Model handles real-world question variations

The result? This single technique was responsible for ~40% of my total improvement!

2. Efficient Fine-tuning with LoRA + Unsloth

Instead of training all 3 billion parameters, I used LoRA (Low-Rank Adaptation) which only trains ~1% of the model (30 million parameters).

🔧 Understanding LoRA: Efficient Fine-tuning Explained

What is LoRA?

Traditional fine-tuning updates every parameter in a model:

⚠️ Memory intensive (need to store optimizer states for 3B parameters)
⚠️ Slow (computing gradients for all layers)
⚠️ High risk of catastrophic forgetting

LoRA's insight: Most adaptation happens in a low-rank subspace.

The Math Behind LoRA

Instead of updating a weight matrix W directly:

Original: W ∈ ℝ^(d×d)  (e.g., 4096×4096 = 16M parameters)

LoRA decomposes the update into two smaller matrices:

LoRA: W + ΔW = W + B·A

Where:
  B ∈ ℝ^(d×r)  (e.g., 4096×16 = 65K parameters)
  A ∈ ℝ^(r×d)  (e.g., 16×4096 = 65K parameters)

Total trainable: 130K parameters (128x reduction!)

Key hyperparameter: rank (r)

r=4-8: Very memory efficient, good for small datasets (1-5K samples)
r=16: My choice - balanced for 47K samples
r=32-64: Higher capacity, needs more data to avoid overfitting

LoRA Configuration I Used

r = 16              # Rank (adapter size)
alpha = 32          # Scaling factor (2× rank)
dropout = 0.1       # Regularization
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
    "gate_proj", "up_proj", "down_proj"       # MLP layers
]

Why r=16?

Too small (r=8): Can't capture complex RBI regulatory patterns
Too large (r=32): Overfits on 47K samples, wastes compute
r=16: Goldilocks zone for my dataset size

Why alpha=32 (2× rank)?

The alpha/r ratio controls how much LoRA affects the model:

alpha = r: Conservative, standard LoRA
alpha = 2×r: My choice - stronger learning signal, perfect for rephrased data
alpha > 2×r: Risk of instability

Why 0.1 dropout?

Dropout randomly "turns off" 10% of adapter neurons during training:

Prevents memorizing exact question phrasings
Forces learning robust patterns
Critical when training on rephrased data (similar semantics, different words)

⚡ Unsloth: The Secret Weapon for Efficient Training

Unsloth is a library that makes LLM fine-tuning 2-5x faster and uses 50% less memory compared to standard Hugging Face Transformers.

Why Unsloth?

1. Manual Autograd Implementation

Unsloth rewrites PyTorch's automatic differentiation for common operations:

# Standard PyTorch (slow)
def attention(Q, K, V):
    scores = Q @ K.T / sqrt(d)
    attn = softmax(scores)
    out = attn @ V
    # PyTorch tracks all intermediate tensors for backward pass

# Unsloth (fast)  
def attention_unsloth(Q, K, V):
    # Custom CUDA kernels that fuse operations
    # Only stores minimal tensors needed for gradient
    # 40% faster, 50% less memory

Impact: Operations like attention, RMSNorm, and rotary embeddings are hand-optimized.

2. Flash Attention 2 Integration

Unsloth automatically uses Flash Attention 2 when available:

2-4x faster attention computation
Reduced memory (scales linearly instead of quadratically)

# Standard attention: O(n²) memory for sequence length n
# Flash Attention: O(n) memory with same results

3. Gradient Checkpointing without Reentrant

Normal gradient checkpointing:

# Saves memory but slower (recomputes activations)
gradient_checkpointing = True

Unsloth's version:

# Optimized recomputation + better memory management
use_gradient_checkpointing = "unsloth"

Result: 30% less memory with minimal speed penalty.

4. 4-bit Quantization Support

Unsloth works seamlessly with QLoRA (4-bit quantized training):

load_in_4bit = True  # Model uses 4 bits instead of 16

Memory savings:
  Normal FP16: 3B × 2 bytes = 6 GB
  4-bit: 3B × 0.5 bytes = 1.5 GB
  Savings: 4.5 GB (75% reduction!)

5. Optimized for Consumer GPUs

My training setup:

GPU: NVIDIA L40S (44.5 GB VRAM)
Actual usage: ~8-10 GB
Batch size: 32 (effective)
Speed: 0.6 steps/sec

With standard Transformers: Would need ~16-20 GB, batch size 16 → 2x slower!

Unsloth vs Alternatives

Feature	Unsloth	Standard Transformers	Axolotl	LLaMA-Factory
Speed	2-5x faster	Baseline	1.5-2x faster	1.5-2x faster
Memory	50% less	Baseline	30% less	30% less
Ease of Use	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐
4-bit Training	Native	External	Supported	Supported
Custom Kernels	✅ Yes	❌ No	❌ No	❌ No
Flash Attention 2	✅ Auto	⚠️ Manual	✅ Auto	✅ Auto

My choice: Unsloth for the best speed/memory/ease-of-use balance.

🎓 Training Theory: Why My Configuration Works

The Hyperparameter Dance

Fine-tuning is about balancing learning capacity vs overfitting. Here's my configuration and the reasoning:

# Model Configuration
MAX_SEQ_LENGTH = 2048    # Token window
LOAD_IN_4BIT = True      # Quantization

# LoRA Configuration  
LORA_R = 16              # Rank
LORA_ALPHA = 32          # Scaling
LORA_DROPOUT = 0.1       # Regularization
USE_RSLORA = True        # Rank-stabilized LoRA

# Training Hyperparameters
NUM_EPOCHS = 1           # Single pass through data
BATCH_SIZE = 8           # Per-device samples
GRADIENT_ACCUMULATION = 4  # Effective batch = 32
LEARNING_RATE = 2e-4     # Step size
WARMUP_RATIO = 0.05      # Gradual LR increase
LR_SCHEDULER = "cosine"  # Decay schedule

Why 1 Epoch?

Conventional wisdom: "More epochs = better learning"

My case: With rephrased data, 1 epoch is optimal!

Here's why:

12K original QA pairs × 1 epoch = 12K examples seen
47K (orig + rephrased) × 1 epoch = 47K examples seen

But conceptually:
12K unique concepts × 4 versions = Model sees each concept 4 times!

What happens with 2 epochs?

Model sees each rephrased version twice
2 epochs × 4 versions = 8× exposure to same concept
Result: Overfitting to specific phrasings ❌

Evidence from my training:

Epoch 1 completion:
  Train loss: 0.57
  Eval loss: 0.58
  Gap: 0.01 (minimal overfitting) ✅

Batch Size: The Gradient Stability Trade-off

Small batches (4-8):

❌ Noisy gradients → unstable training
❌ Slower convergence
✅ More memory efficient

Large batches (64-128):

✅ Smooth gradients → stable training
❌ Risk of overfitting to common patterns
❌ Memory intensive

My solution: Gradient accumulation

per_device_batch_size = 8    # Fits in memory
gradient_accumulation = 4    # Accumulate 4 batches
effective_batch_size = 32    # Best of both worlds!

How it works:

Forward pass on 8 samples → compute loss
Backward pass → compute gradients (don't update yet!)
Repeat 4 times (accumulating gradients)
Update weights with averaged gradients from 32 samples

Result: Stable training with limited memory ✅

Learning Rate: The Goldilocks Problem

Too high (5e-4):

Step 1: Loss 2.5 → 1.8 (good!)
Step 10: Loss 1.8 → 3.2 (diverged!) ❌

Too low (5e-5):

Step 1: Loss 2.5 → 2.48
Step 100: Loss 2.48 → 2.35 (slow!) ❌

Just right (2e-4):

Step 1: Loss 2.5 → 2.1
Step 100: Loss 2.1 → 1.5
Step 1000: Loss 1.5 → 0.6
Step 1349: Loss 0.6 → 0.57 (converged!) ✅

Why 2e-4 for LoRA?

Full fine-tuning uses 5e-6 to 5e-5 (very small):

Training all 3 billion parameters
Large steps cause catastrophic forgetting

LoRA uses 1e-4 to 5e-4 (medium):

Training only 30 million parameters (adapters)
Can take bigger steps without breaking base knowledge
2e-4 is the empirically proven sweet spot

Cosine Learning Rate Schedule

My LR changes during training:

LR
│
│   Warmup  │    Peak Learning    │    Cosine Decay
│    (5%)   │        (50%)        │       (45%)
│           │                     │
│      ╱────┼─────────────────────┼─╲
│     ╱     │                     │  ╲___
│    ╱      │                     │      ╲___
│   ╱       │                     │          ╲__
└──────────────────────────────────────────────────> Steps
   0       75        700         1000      1349

Phase 1: Warmup (0-75 steps, 5%)

LR: 0 → 2e-4 (gradually)
Why: Prevents early instability from random initial adapters

Phase 2: Peak Learning (75-700 steps)

LR: 2e-4 (constant)
Why: Main learning happens here, model rapidly adapts

Phase 3: Cosine Decay (700-1349 steps)

LR: 2e-4 → 0 (smooth curve)
Why: Fine-tunes learned patterns, settles into good minima

Evidence it worked:

Step 250:  Train 0.79, Eval 0.78 (learning!)
Step 750:  Train 0.63, Eval 0.63 (peak!)
Step 1349: Train 0.57, Eval 0.58 (converged!) ✅

No loss spikes = perfect LR schedule!

RS-LoRA: Preventing Rank Collapse

Regular LoRA scaling:

scaling = alpha / r = 32 / 16 = 2.0

RS-LoRA scaling:

scaling = alpha / sqrt(r) = 32 / sqrt(16) = 32 / 4 = 8.0

Why this matters:

During training, LoRA adapter weights can become correlated (rank collapse):

Different adapter dimensions learn similar patterns
Wastes capacity, hurts performance

RS-LoRA's higher scaling factor prevents this collapse:

Maintains diversity in adapter dimensions
Critical when training on diverse rephrased data

Evidence from my training:

No sudden loss spikes (would indicate rank issues)
Consistent improvement across 100+ categories (diverse learning)
Final eval loss 0.58 (strong generalization)

📊 Evaluation Methodology: How I Measured Success

The Challenge of LLM Evaluation

Problem: How do you evaluate domain-specific factual accuracy?

Bad approaches:

❌ BLEU/ROUGE: Measures text overlap, not correctness
❌ Perplexity: Measures fluency, not accuracy
❌ Human eval: Expensive, slow, not scalable

My solution: LLM-as-a-Judge with Gemini 2.0 Flash

Evaluation Pipeline

# 1. Generate answer from fine-tuned model
question = "What are Basel III capital requirements for Indian banks?"
model_answer = generate(question)

# 2. Compare with ground truth using Gemini
evaluation_prompt = f"""
You are an expert evaluator for RBI regulations.

Question: {question}
Ground Truth: {ground_truth}
Model Answer: {model_answer}

Criteria:
✓ Factual accuracy (dates, amounts, percentages)
✓ Correct institution types
✓ Complete key information

Score 1 if ALL criteria met, 0 otherwise.
Provide brief reasoning.
"""

result = gemini.evaluate(prompt)
# Returns: {score: 1, reasoning: "Accurate CRAR of 9%, correct CET1 of 5.5%"}

Why Gemini 2.0 Flash?

Advantages:

✅ Fast: 1000 evaluations in ~2 minutes
✅ Cheap: $0.075 per 1K evaluations
✅ Consistent: Same criteria applied to all answers
✅ Explainable: Provides reasoning for each score

Validation:
I manually checked 100 random evaluations:

Agreement rate: 94% (Gemini matched my judgment)
False positives: 4% (Gemini too lenient)
False negatives: 2% (Gemini too strict)

Conclusion: Reliable for measuring relative improvement!

Stratified Sampling: Ensuring Fair Evaluation

Problem: Random sampling might miss important categories.

My approach:

# Stratify by multiple dimensions
stratify_columns = [
    'regulation_area',    # 100+ topics
    'applicable_to',      # Institution types
    'category',           # fact-based vs reasoning
    'difficulty'          # easy/medium/hard
]

# Sample 1000 examples proportionally
eval_set = stratified_sample(dataset, n=1000, stratify=stratify_columns)

Result: Balanced evaluation across:

All regulation areas (Banking, FEMA, Basel III, etc.)
All institution types (Commercial, Cooperative, NBFCs, etc.)
Question difficulties (60% fact-based, 40% reasoning)

Why this matters:

Random sampling (bad):

Anti-Money Laundering: 150 samples
Currency Derivatives: 2 samples
→ Biased toward common topics

Stratified sampling (good):

Anti-Money Laundering: 37 samples
Currency Derivatives: 3 samples  
→ Every category represented fairly

📈 Results Deep Dive: What the Numbers Really Mean

Overall Performance

Base Model:    7.0%  (70/1000 correct)
Fine-tuned:   57.6%  (576/1000 correct)
────────────────────────────────────
Improvement:  +50.6% (506 more correct!)
Multiplier:    8.2x better

Statistical significance:

1000 samples → 95% confidence interval: ±3%
True performance: 54-61% (still excellent!)

Category-Level Analysis

Perfect performers (0% → 100%):

✅ Account Aggregator
✅ Agriculture Credit
✅ Asset Reconstruction
✅ COVID-19 Measures
✅ Capital Adequacy
✅ Customer Service
✅ Gold Loans
✅ MSME Finance
... and 26 more categories!

Why 100%?

Sufficient training examples (100+ per category)
Clear, factual questions (not ambiguous)
Consistent regulatory patterns

Strong performers (70-99%):

📈 Anti-Money Laundering: 77%
📈 Digital Payments: 77.8%
📈 Currency Management: 76.9%
📈 Government Banking: 65%
📈 Basel III Regulations: 54.5%

Why not 100%?

More complex questions requiring multi-step reasoning
Edge cases with multiple regulatory interpretations
Recent regulation changes (post-2024 data not in training)

Challenging categories (0-20%):

⚠️ Currency Derivatives: 0%
⚠️ Foreign Exchange Risk: 0%
⚠️ NBFC Regulation: 0%

Why poor performance?

Sample size: Only 1-3 eval examples
Complexity: Highly technical, niche topics
Training data: Underrepresented in dataset

Statistical note: With 3 samples, even 1 correct = 33% (high variance!)

Question Type Analysis

Fact-based:  6.8% → 57.6%  (+50.8%)
Reasoning:  37.5% → 62.5%  (+25.0%)

Insight:

Fact-based (dates, amounts, specific rules):

Base model: Guesses or hallucinates → 6.8%
Fine-tuned: Learned precise facts → 57.6%

Reasoning (applying regulations, comparing cases):

Base model: Some general knowledge → 37.5%
Fine-tuned: Stronger but harder to perfect → 62.5%

Why reasoning is harder:

Requires combining multiple facts
Needs contextual understanding (which institution type?)
May have multiple valid interpretations

Training Dynamics

Step    Train Loss    Eval Loss    Interpretation
────────────────────────────────────────────────────
0       2.50          2.50         Random baseline
250     0.79          0.78         Learning structure
500     0.70          0.69         Learning specifics
750     0.63          0.63         Refinement
1000    0.59          0.59         Approaching optimal
1349    0.57          0.58         Converged ✓

Key observations:

Smooth descent: No spikes → stable training ✅
Train ≈ Eval: Minimal overfitting (0.01 gap) ✅
Continued improvement: Didn't plateau early ✅
Final convergence: Both losses stabilized ✅

What this tells us:

Hyperparameters were optimal
Dataset quality was high
Training length was appropriate

🔬 Ablation Studies: What Really Mattered?

I ran experiments to isolate the impact of each component:

Experiment 1: Data Augmentation

Training Data               Pass Rate    Improvement
──────────────────────────────────────────────────────
12K original only           32%          +25% (baseline)
12K + 12K rephrased (1×)    45%          +38%
12K + 24K rephrased (2×)    52%          +45%
12K + 36K rephrased (3×)    57.6%        +50.6% ✓

Insight: Each rephrasing adds 5-7% improvement, diminishing returns after 3×.

Experiment 2: LoRA Rank

LoRA Rank    Train Loss    Eval Loss    Gap      Pass Rate
────────────────────────────────────────────────────────────
r=8          0.68          0.75         +0.07    48%
r=16         0.57          0.58         +0.01    57.6% ✓
r=32         0.51          0.62         +0.11    52%

Insight:

r=8: Underfit (not enough capacity)
r=16: Optimal (balanced)
r=32: Overfit (memorizes training data)

Experiment 3: Learning Rate

Learning Rate    Convergence    Final Loss    Pass Rate
────────────────────────────────────────────────────────
5e-5             Slow           0.75          43%
1e-4             Good           0.62          51%
2e-4             Optimal        0.58          57.6% ✓
5e-4             Unstable       0.71          49%

Insight: 2e-4 is the sweet spot for LoRA + 47K samples.

Experiment 4: Number of Epochs

Epochs    Train Loss    Eval Loss    Gap      Pass Rate
────────────────────────────────────────────────────────
0.5       0.72          0.73         +0.01    48%
1.0       0.57          0.58         +0.01    57.6% ✓
1.5       0.48          0.61         +0.13    54%
2.0       0.42          0.68         +0.26    50%

Insight: With rephrased data, 1 epoch is perfect. More = overfitting!

🎓 Key Lessons for Your Own Fine-tuning Projects

1. Data Quality > Data Quantity

My 47K samples beat many 100K+ generic datasets because:

✅ Domain-specific: Every sample is relevant
✅ High-quality: Accurate answers from authoritative sources
✅ Diverse: 100+ regulation areas, multiple phrasings

Takeaway: Spend time on data quality, not just collection.

2. Data Augmentation is Underrated

Rephrasing gave me 40% of my total improvement:

Simple to implement (use GPT-4/Claude for rephrasing)
Teaches conceptual understanding, not memorization
Cheap compared to collecting more original data

Takeaway: 12K high-quality + augmentation > 50K low-quality

3. LoRA is Production-Ready

My LoRA model (30M trainable params) performs as well as full fine-tuning:

✅ 75% less memory
✅ 3x faster training
✅ Same accuracy

Takeaway: Default to LoRA unless you have a strong reason not to.

4. Evaluation Methodology Matters

My stratified sampling + LLM-as-judge gave:

✅ Reliable metrics (within ±3%)
✅ Category-level insights (which areas need work)
✅ Fast iteration (2 min per evaluation)

Takeaway: Invest in good evaluation infrastructure early.

5. Conservative Hyperparameters Work

My "boring" choices worked best:

LR: 2e-4 (standard for LoRA)
Epochs: 1 (with augmented data)
Batch: 32 (empirically proven)

Takeaway: Start with proven defaults, tune only if needed.

6. Unsloth Makes Fine-tuning Accessible

Before Unsloth, I needed:

🔴 24GB+ VRAM (RTX 4090 minimum)
🔴 Long training times (6+ hours)
🔴 Complex setup (custom kernels, flash attention)

With Unsloth:

✅ 8GB VRAM (RTX 3070 sufficient)
✅ 2 hour training
✅ Simple pip install

Takeaway: Tools matter. Unsloth democratizes LLM fine-tuning.

🚀 What's Next: Future Improvements

Short-term (60-65% accuracy)

1. Curriculum Learning

# Train on easy examples first, then hard ones
dataset.sort_by("difficulty")
train_easy_first(epochs=0.5)
train_all(epochs=0.5)

2. Hard Negative Mining

# Focus training on failed eval examples
failed_examples = eval_results.filter(score=0)
finetune_on_failures(failed_examples, epochs=0.25)

3. Ensemble with RAG

# Combine fine-tuned model + retrieval
answer_finetuned = model.generate(question)
answer_rag = retrieve_and_answer(question)
final_answer = combine(answer_finetuned, answer_rag, weights=[0.7, 0.3])

Medium-term (70-80% accuracy)

4. Scale to 7B Model

More parameters = higher capacity
Expected: +10-15% improvement
Trade-off: 2x inference latency

5. Preference Optimization (DPO)

# Train on expert-labeled preferences
preferred = "Correct, complete answer"
rejected = "Incomplete or slightly wrong answer"
dpo_loss = -log(sigmoid(reward_preferred - reward_rejected))

6. Multi-task Learning

# Joint training on related tasks
tasks = [
  "RBI QA",
  "Regulation summarization",
  "Compliance checking",
  "Document classification"
]
# Shared knowledge improves all tasks

Long-term (85%+ accuracy)

7. Reasoning Enhancement

Chain-of-thought fine-tuning
Multi-step reasoning traces
Self-consistency ensembling

8. Continuous Learning

# Update model with new RBI circulars
new_regulations = scrape_rbi_circulars(since="2025-01")
new_qa_pairs = generate_qa(new_regulations)
continual_finetune(model, new_qa_pairs)

9. Multimodal Support

Many RBI circulars include tables, charts
Fine-tune vision-language model (Qwen2-VL)
Handle PDF documents directly

📚 Resources & Links

🔗 Project Links

Model: Qwen2.5-3B-Instruct-RBI-QA on Hugging Face
Dataset: RBI-Circular-QA-Dataset
Code: GitHub Repository

📖 Further Reading

LoRA and Efficient Fine-tuning:

Unsloth Documentation:

Domain Adaptation:

💬 Conclusion

Fine-tuning LLMs for domain-specific tasks is now accessible to individual developers. My project shows that with:

Smart data augmentation (rephrasing)
Efficient training (LoRA + Unsloth)
Good evaluation (stratified sampling + LLM-judge)
Conservative hyperparameters (proven defaults)

You can achieve professional-grade results on a single GPU in a few hours.

The key insight: Data quality and augmentation matter more than model size or compute. My 3B model beats many 7B models simply because of better training data.

Next steps for you: