
Rukshan J. Senanayaka

I Fine-Tuned Gemma 4 for LaTeX OCR. The Success Was the Problem.

A fine-tuning post-mortem, and three tests that showed me what my model actually learned.

Today I fine-tuned Google's gemma-4-E2B-it on the unsloth/LaTeX_OCR dataset using LoRA on a RunPod RTX 3090. Nine hours of training, about $2 in GPU cost, and an adapter uploaded to Hugging Face.

The training loss dropped from 13.66 to 0.018. On the test set, the outputs were near-perfect.

Then I ran three tests that, taken together, told me exactly what the model had, and hadn't, learned:

  1. An image from the training dataset: correct output.
  2. The same image, with only the colors changed (white-on-blue instead of black-on-white): notation went wrong.
  3. A handwritten equation: complete hallucination. The model invented Hebrew letters and fractions that weren't there.

The interesting lesson wasn't any one of those tests. It was how the model's failures grew worse as the input moved further from its training data.


What I Did

| Item | Value |
| --- | --- |
| Base model | unsloth/gemma-4-E2B-it (vision-language, ~9.5 GB) |
| Dataset | unsloth/LaTeX_OCR (image → LaTeX pairs, 68.7k rows) |
| Method | LoRA (rank 8, alpha 8, all linear layers, vision + language) |
| Hardware | RunPod, RTX 3090 (24 GB VRAM) |
| Framework | Unsloth Studio + TRL SFTTrainer |
| Epochs | 1.0 (8,586 steps) |
| Batch size | 8 (effective 8, no gradient accumulation) |
| Peak LR | 2.0e-4, linear decay, 5-step warmup |
| Precision | bf16 (not 4-bit) |
| Training time | 8h 53m, ~$1.96 |

Final adapter size: 58 MB. That's the whole point of LoRA: a tiny, portable delta instead of a new 9 GB model.
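The arithmetic behind that tiny delta is easy to check. A sketch, using an illustrative hidden size (not the actual Gemma dimensions) and the rank-8 setting from this run:

```python
# LoRA stores two thin matrices per adapted weight instead of a full delta.
d, r = 4096, 8                     # illustrative hidden size; rank 8 as in this run
full_delta = d * d                 # parameters in a full-rank update of one d x d weight
lora_delta = 2 * d * r             # A is (r x d), B is (d x r)
ratio = lora_delta / full_delta    # = 2r/d, about 0.4% of the parameters here
```

Multiply that saving across every adapted layer and the adapter lands in the tens of megabytes instead of gigabytes.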


What the Training Looked Like

Training loss behaved the way you hope for: fast descent, then a long tail.

| Step | Train Loss | Eval Loss |
| --- | --- | --- |
| 1 | 13.66 | – |
| 100 | 0.94 | – |
| 500 | 0.17 | – |
| 859 (10%) | ~0.13 | 2.21 |
| 4,295 (50%) | ~0.04 | 2.15 |
| 6,013 (70%) | ~0.02 | 2.14 (best) |
| 8,586 (100%) | 0.018 | 2.18 |

Final training loss: 0.018. Best eval loss: 2.14 at step 6,013.

If you stopped reading here, you'd conclude the model learned the task. That's what I concluded.


The Problem I Missed Until I Ran Real Inference

Look at eval loss more carefully.

  • Step 859 (10% through training): eval loss 2.21
  • Step 8,586 (100% through training): eval loss 2.18

Eval loss barely moved for 90% of training. Meanwhile training loss went from 0.13 to 0.018. That gap is overfitting. Best eval loss was at step 6,013. The final checkpoint is worse than an earlier one.

I read this as "I should have used early stopping" and moved on. The real lesson was bigger, and showed up only when I actually tested the model on images.


Three Tests, Getting Worse

Three tests, one dataset image as the starting point. For each test I show the input, the model's LaTeX output, and what that output actually renders as. How the model breaks down, step by step, is the whole story.

Test 1: An image from the training set

I grabbed what's shown as the first row in the unsloth/LaTeX_OCR dataset viewer:


Prompt: Transcribe the LaTeX from this image.
Output: \frac{N}{M} \in \mathbf{Z} , \frac{M}{P} \in \mathbf{Z} , \frac{P}{Q} \in \mathbf{Z}

Clean, valid LaTeX. \mathbf{Z} is the modern equivalent of \bf Z, which is exactly how the dataset labels this image:

Training label: { \frac { N } { M } } \in { \bf Z } , { \frac { M } { P } } \in { \bf Z } , { \frac { P } { Q } } \in { \bf Z }

Verdict: correct. But almost certainly because this image was in the training set. With 1 epoch on 68.7k examples and visible overfitting in the loss curve, the model has seen this image and learned to reproduce its label. High performance on the training distribution is not evidence the task was learned; it's often evidence the distribution was memorized.

Test 2: The same image, only the colors changed

I took the same image and did one thing: linearly remapped pixel colors so that black became white and white became a clear blue. Anti-aliased edges preserved exactly. No content, resolution, font, or layout changes.
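A minimal sketch of that remap in NumPy. The exact target blue is my assumption; the original tooling isn't specified:

```python
import numpy as np

def to_white_on_blue(gray: np.ndarray, blue=(40, 80, 200)) -> np.ndarray:
    """Linearly remap a grayscale equation image so black ink becomes
    white and the white background becomes blue; anti-aliased edge
    pixels interpolate between the two endpoints."""
    t = 1.0 - gray.astype(np.float32) / 255.0   # 1 on ink, 0 on background
    ink = np.array([255.0, 255.0, 255.0])       # ink -> white
    bg = np.asarray(blue, dtype=np.float32)     # background -> blue
    out = t[..., None] * ink + (1.0 - t[..., None]) * bg
    return out.astype(np.uint8)
```

Glyph shapes, resolution, and layout pass through untouched; only the color statistics change.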


Prompt: Transcribe the LaTeX from this image.
Output: \frac{N}{M} \in \mathbb{Z}, \frac{M}{P} \in $\mathbb{Z}$, $\frac{P}{Q}$ \in $\mathbb{Z}$

Verdict: malformed LaTeX. The fractions and \in are still there, but look at those stray $ signs. $ is a math-mode delimiter in LaTeX: it closes math mode when you're in it and opens it when you're in text. Scattered mid-expression like this, they break the parser. If you fed this output to a LaTeX compiler, the \frac{P}{Q} inside $...$ gets treated as text, the \mathbb{Z} outside math mode fails to render correctly, and the decoded image (see the visual above) comes out visibly broken.

Nothing changed except color. The adapter's learned mapping (image pattern → LaTeX token) partially collapsed. The model went from confident math-mode prediction on the white version to inserting $ delimiters as if it's uncertain whether it's in math or text mode, because the pixel-statistics that anchored its learned mapping aren't there anymore.
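Because the dataset's labels contain no `$` at all (every example is bare math-mode content), a one-line check could have flagged this output automatically. A sketch:

```python
def looks_like_pure_math(latex: str) -> bool:
    """Labels in unsloth/LaTeX_OCR are bare math-mode LaTeX, so any
    '$' delimiter in the output is a drift signal, not a style choice."""
    return "$" not in latex

clean = r"\frac{N}{M} \in \mathbf{Z} , \frac{M}{P} \in \mathbf{Z}"
drifted = r"\frac{N}{M} \in \mathbb{Z}, \frac{M}{P} \in $\mathbb{Z}$"
# looks_like_pure_math(clean) is True; looks_like_pure_math(drifted) is False
```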

There's also a notation shift: on white bg the model produces \mathbf{Z} (matching training labels). On blue bg it outputs \mathbb{Z} (the blackboard-bold that the original image actually shows, but which doesn't appear in the training labels). The model is pulling tokens from a different part of its vocabulary on this input.

This is shortcut learning: the adapter latched onto the easiest signal that correlated with labels on the training set (pixel brightness statistics) instead of the right signal (glyph shapes). Flip the brightness, and the shortcut stops firing.

Test 3: A handwritten equation

I snapped a photo of a handwritten fraction and fed it in.


Prompt: Transcribe the LaTeX from this image.
Output: \frac { \aleph } { \mathcal{M}} \ \in \ \zeta , \ \frac { \mathcal{M}} { \varphi } \ \in \ \zeta , \ \frac { \mathcal{L}} { \mathcal{Q} } \ \in \ \zeta

Verdict: complete hallucination. The output has nothing to do with the image. The model invents Hebrew letters (\aleph), script letters (\mathcal{M}, \mathcal{L}, \mathcal{Q}), Greek (\zeta, \varphi), and hallucinates a three-fraction structure that isn't in the picture at all.

The adapter has no mapping for this visual input. It falls back on the surface-level shape of its training outputs ("this is usually about three fractions joined by \in") and produces plausible-looking LaTeX that is entirely decoupled from what's in the image.

The pattern

| Test | Distribution shift | Output quality | Failure mode |
| --- | --- | --- | --- |
| Training-set image | None | Correct (matches training label) | – |
| Same content, blue bg | Colors only | Malformed LaTeX, wrong notation | Shortcut on pixel stats |
| Handwritten | Everything about rendering | Total nonsense | Domain collapse |

The failure mode gets worse as distribution shift increases. There's no clean line between "works" and "doesn't work". The model gets worse in steps.

What this actually shows

The model didn't learn "recognize LaTeX symbols." It learned "when I see images that look like unsloth/LaTeX_OCR's rendering style, produce outputs that look like unsloth/LaTeX_OCR's labels." Move the image away from that style, and the output degrades in a predictable way: first it loses confidence in its math-mode markers (and starts inserting $ delimiters mid-expression), then notation drifts, then structure breaks completely and it hallucinates.

Eval loss never caught any of this. The eval split comes from the same narrow distribution as training, so it only measures generalization within that narrow world. A stable eval loss can mask the fact that the model is learning shortcuts that will fall apart the moment conditions change.

Eval loss tells you if you're overfitting the training set. It doesn't tell you if you're overfitting the dataset itself.

Early stopping at step 2,000 wouldn't have fixed any of this. It's not a training-duration problem. It's a dataset-diversity and missing-augmentation problem.


One More Thing That Almost Fooled Me: The Prompt

Before I understood the pattern above, I had one more thing to rule out.

The unsloth/LaTeX_OCR dataset has no "prompt" column, only image and text. So how does the model know what instruction it's responding to?

Unsloth Studio auto-generates one. It inspects the dataset name and content, matches a task pattern, and wraps every training example with a default instruction. For unsloth/LaTeX_OCR, that instruction is:

"Convert this image to LaTeX notation."

(In the unslothai/unsloth repo, at /studio/backend/utils/datasets/vlm_processing.py, line 79. Unsloth Studio's code auto-matches the "latex" task pattern via dataset-name and content heuristics.)

My inference code used "Transcribe the LaTeX from this image.", a different prompt. That one-line difference pushes the adapter off-distribution on the text side, independent of any image-side issues. The stray $ delimiters on Test 2, the hallucinated \aleph and \zeta on Test 3, even some of the notation drift: all of these are likely worse than they'd be under the correct prompt.

But here's the thing. I used the same wrong prompt for all three tests. That means the prompt is held constant across Test 1, Test 2, and Test 3. If the prompt were the dominant cause of failure, Test 1 (the dataset image) should fail too. It doesn't: Test 1 produces clean, correct LaTeX that matches the training label exactly.

So the variable that differs between working and broken output is not the prompt. It's the image. The image-side failure is real regardless of the prompt. The prompt choice changes the exact shape of the failure but not whether it happens.

The lesson generalizes: the prompt is part of the training distribution. An adapter is tuned for specific (image, instruction) pairs; change the instruction and you've shifted the distribution. Prompt mismatch looks exactly like a model weakness (degraded output, inconsistent notation, odd hallucinations) when really you're just asking a question the model wasn't trained to answer.

Practical rule: before testing a fine-tuned adapter, find out what instruction it was trained with and use that exact string at inference. For automated tools like Unsloth Studio, check the preprocessing code. For custom scripts, check your own training loop. Never assume.


What the Numbers Actually Told Me (In Hindsight)

The overfitting curve

| Progress | Train Loss | Eval Loss | What this meant |
| --- | --- | --- | --- |
| 10% | ~0.13 | 2.21 | Real learning |
| 50% | ~0.04 | 2.15 | Marginal gains |
| 70% | ~0.02 | 2.14 | Best generalization |
| 100% | 0.018 | 2.18 | Eval got worse |

The cost of not knowing

| Scenario | Steps | Time | Cost ($0.22/hr) |
| --- | --- | --- | --- |
| What I did | 8,586 | 8h 53m | ~$1.96 |
| Stop at best eval | 6,013 | ~6h 13m | ~$1.37 |
| Aggressive early stop | 2,000 | ~2h 4m | ~$0.46 |
| 2K + packing + 4-bit | ~2,000 | ~1h (est.) | ~$0.22 |

Nearly the same quality model, same memorization of the same narrow distribution, roughly 9x cheaper if I'd known what to watch for. Note: "same quality" here means same behavior on the dataset. The handwriting hallucination would be identical regardless of training length; that failure isn't solvable by spending more hours.


What I'd Do Differently

1. Watch eval loss and stop when it plateaus

Configure early stopping, or at least evaluate more frequently than every 10% of training. Eval had stopped improving by ~step 2,000. I trained four times longer than needed.
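In `transformers`/TRL terms that's roughly the following configuration. A sketch with illustrative values; the model and dataset wiring is elided:

```python
from transformers import EarlyStoppingCallback
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    eval_strategy="steps",
    eval_steps=500,                   # evaluate far more often than every 10%
    save_strategy="steps",
    save_steps=500,                   # must align with eval_steps for best-model tracking
    load_best_model_at_end=True,      # keep the best checkpoint, not the last
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=model,                      # plus train/eval datasets, elided here
    args=args,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```

With patience 3 and 500-step evals, this run would have halted around step 3,500–4,000 instead of 8,586.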

2. Don't trust eval loss alone. Run out-of-distribution inference during training

Mid-training, grab a few images deliberately outside your dataset style and run inference. If those fail while eval loss looks fine, your dataset is too narrow. The three tests I ran at the end are what I should have been running every few thousand steps.
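A framework-agnostic sketch of such a probe; `generate` stands in for whatever inference wrapper you have (nothing here is Unsloth API):

```python
def run_ood_probes(generate, probes, prompt):
    """generate: any callable mapping (image, prompt) -> LaTeX string.
    probes: list of (name, image) pairs chosen at increasing distance
    from the training distribution (recolored, rescanned, handwritten)."""
    report = {}
    for name, image in probes:
        out = generate(image, prompt)
        # cheap drift signal: the dataset's labels never contain '$'
        report[name] = {"output": out, "suspicious": "$" in out}
    return report
```

Run it every few thousand steps; if the suspicious flags light up while eval loss sits flat, the dataset is too narrow.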

3. Diversify the dataset

One narrow dataset → one narrow model. For a real LaTeX OCR model I'd mix:

  • unsloth/LaTeX_OCR
  • Images rendered with different LaTeX engines, fonts, DPIs
  • Scanned or photographed equations with noise and skew
  • A small handwritten sample

4. Augment aggressively

A single line of data augmentation would have forced the model to learn shape instead of brightness. Standard in computer vision, not on by default in most LLM fine-tuning pipelines. For vision-language tasks, at minimum:

  • Random color inversion and channel swap
  • Brightness and contrast jitter
  • Small rotations (±5°) and scale jitter
  • Gaussian noise and JPEG compression artifacts
  • Random background color / hue shift

Any one of these would have prevented the Test 2 failure. Together they turn a narrow dataset into something meaningfully broader.
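A minimal sketch of the first two items in plain NumPy; a real pipeline would use torchvision or albumentations, and the jitter ranges here are illustrative:

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Pixel-level jitter that breaks the brightness shortcut.
    img: uint8 array, grayscale or RGB."""
    x = img.astype(np.float32)
    if rng.random() < 0.5:               # random color inversion
        x = 255.0 - x
    gain = rng.uniform(0.8, 1.2)         # contrast jitter
    bias = rng.uniform(-20.0, 20.0)      # brightness jitter
    return np.clip(x * gain + bias, 0.0, 255.0).astype(np.uint8)
```

Applied on the fly during training, the inversion branch alone would have forced the model to read glyph shapes rather than pixel brightness.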

5. Enable 4-bit quantization and packing

I trained in bf16 with packing disabled. Both are free wins for most LoRA setups:

  • QLoRA (4-bit) roughly halves VRAM use with negligible quality impact for LoRA.
  • Packing concatenates short sequences into one batch, cutting padding waste. For variable-length LaTeX strings, likely 1.5–2x faster.

6. Larger effective batch size

With 4-bit + packing there's room to push batch size from 8 to 16 or 32. Fewer steps, more stable gradients.
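Roughly, in Unsloth/TRL terms. A sketch: argument names follow the libraries' docs at the time of writing, and whether packing composes cleanly with a vision collator is worth verifying on your version:

```python
from unsloth import FastVisionModel
from trl import SFTConfig

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/gemma-4-E2B-it",
    load_in_4bit=True,                  # QLoRA: roughly halves VRAM vs bf16
)

args = SFTConfig(
    packing=True,                       # concatenate short sequences, cut padding waste
    per_device_train_batch_size=16,     # up from 8, enabled by the VRAM headroom
    gradient_accumulation_steps=2,      # effective batch 32
)
```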

7. Use a longer warmup

My warmup was 5 steps, which is about 0.06% of total training. The grad norm peaked at 19.6 on step 1. That kind of early spike is what warmup is supposed to absorb, so next run I'd use a few hundred warmup steps (~1–5% of total) instead. A cosine decay schedule is a common alternative to linear and might train more smoothly, but I didn't test it on this run, so I can't say either way.

8. Save fewer, better checkpoints

Saving every 30 steps produced 288 checkpoints, ~34 GB of disk. Every 200–500 steps is plenty.

9. Deploy the best checkpoint, not the last one

Step 6,013 (eval 2.14) is what I'd actually use. The reflex of "load the final checkpoint" is wrong whenever training overfits, and for LoRA on narrow data, that's most of the time.

10. Always match the inference prompt to training

Before using the adapter, find the exact instruction the dataset was trained with. Save it next to the model files. Never guess.


Takeaways

  1. Training loss always goes down. That's gravity, not a signal. Eval loss is only meaningful if your eval set represents what you'll actually use the model for.
  2. Success on training-style images is not evidence of learning. It's often evidence of memorization. Test at different levels of difficulty (in-distribution, small shift, large shift) and watch how the output degrades.
  3. Failure happens in steps, not all at once. Small shifts produce wrong notation. Bigger shifts produce hallucination. How it breaks tells you what shortcut the model learned.
  4. Run a one-variable stress test. Change only one thing (color, bg, font) in a training image. If output changes, the model is tracking that variable as a shortcut.
  5. The prompt is part of the training distribution. Find the training prompt, use it verbatim at inference. Prompt mismatch is silent distribution shift and looks exactly like a bad model.
  6. The dataset's styling becomes load-bearing. Color, contrast, resolution, rendering: all get baked in unless augmentation forces otherwise.
  7. Fine-tuning cost is mostly avoidable. Early stopping + packing + 4-bit can cut a run ~9x with no meaningful quality loss. But "quality" here means "behavior on the training distribution." Out-of-distribution failure isn't something more training fixes.
