An empirical note on what synthetic invoice data taught a Gemma fine-tune, what it hid, and how one real document exposed the gap.
I fine-tuned a small Gemma model to parse Indian invoices because I wanted a path that was cheaper, more private, and easier to deploy than calling a hosted API for every document.
The training metrics looked excellent.
Then I ran the model on one real invoice.
It got the total right, the supplier right, the address right, and still failed in four ways that would make the output unusable in a real finance workflow.
That invoice was more useful than another few hundred synthetic examples.
None of the headline conclusions here are new to anyone with ML experience:
- synthetic data has domain gap
- synthetic validation can be overly optimistic
- real data changes what you trust
What felt worth documenting was the concrete shape of the failure:
- which fields broke first
- which assumptions in the synthetic distribution caused it
- what the training curves looked like before and after instability
- and which lessons were actually about data, not models
The setup
I did not have a large labeled invoice corpus, so I started with synthetic data.
The extraction target was a strict 22-field JSON schema, and the synthetic dataset was large enough to build a real training pipeline. It was not large enough to tell me whether the model understood real invoices.
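To make "strict 22-field JSON schema" concrete, here is a minimal validator sketch. The field names below are illustrative stand-ins, not the project's actual schema, and the real contract has 22 fields rather than this subset:

```python
# Illustrative subset of the 22-field contract -- the real field
# names come from the project's schema; these are stand-ins.
REQUIRED_FIELDS = {
    "supplier_name": str,
    "supplier_gstin": str,
    "invoice_number": str,
    "invoice_date": str,
    "taxable_value": float,
    "igst_rate": float,
    "reverse_charge": str,   # strict enum: "Yes" / "No"
    "total_invoice": float,
}

def validate_extraction(record: dict) -> list:
    """Return a list of contract violations (empty means valid)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors
```

A check like this runs in the evaluation loop, so type-level failures surface before any real invoice does.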
Why validation looked so good
The final stable training run used:
- model: google/gemma-4-E2B-it
- framework: MLX-LM 0.31.2
- trainable params: 7.291M / 4647.450M (0.157%)
- iterations: 300
- learning rate: 5e-5
- num_layers: 8
- batch_size: 1
- grad_accumulation_steps: 8
- max_seq_length: 1536
It trained cleanly on a Mac with peak memory of about 13.677 GB.
Validation loss improved almost monotonically:
- Iter 1: 0.552
- Iter 50: 0.084
- Iter 100: 0.056
- Iter 150: 0.046
- Iter 200: 0.044
- Iter 250: 0.029
- Iter 300: 0.024
If all I had looked at was the validation curve, I would have said the model was basically ready.
That would have been wrong.
That sentence is obvious in the abstract. It only becomes useful when you can point to the exact fields and failure modes that made it wrong.
One real invoice broke four assumptions
The invoice came from Jon Doe Print.
The model output looked plausible enough to pass a quick skim:
- supplier name: Jon Doe Print
- supplier GSTIN: correct format and state code
- supplier address: mostly correct
- invoice number: a plausible-looking integer value
- invoice date: correctly extracted
- total invoice: captured correctly
But the failure table tells the real story:
| Field | Model output | Correct | Impact |
|---|---|---|---|
| description | 3D Printed Prototype | 3D Printed Prototype (Pre filter) | Wrong item identity in downstream categorization |
| taxable_value | line-item amount | invoice subtotal | Wrong amount booked to accounts |
| igst_rate | 0.09 | 0.0 | Wrong tax treatment and downstream GST logic |
| reverse_charge | 0 | No | Type mismatch that can break strict downstream parsers |
The model also captured some things correctly:
- the total invoice amount
- the tax amounts
That is what made the failure interesting.
The model was not random. It had learned enough invoice structure to look useful. It just had not learned enough real invoice variance to be trustworthy.
That distinction is the center of the project.
The problem was not that the model failed to learn invoice extraction at all.
The problem was that it learned the synthetic version of invoice extraction more faithfully than the real one.
The four assumptions that invoice broke
1. I assumed subtotal rows would be easy to identify
The invoice had multiple line items.
The model extracted a line-level amount as taxable_value instead of the invoice subtotal row.
In a synthetic dataset, subtotal rows are easy to standardize:
- same position
- same label family
- same spacing
In real invoices, subtotal rows compete with:
- unit prices
- per-line totals
- tax-inclusive values
- noisy formatting
The model had learned “there is a number near the items.” It had not learned “this is the subtotal row that should override the line-level values.”
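This failure mode is cheap to catch after extraction, assuming the line-item amounts are extracted alongside the subtotal (the real schema may or may not do this). A rough consistency check:

```python
def check_taxable_value(line_amounts, taxable_value, tol=0.01):
    """Flag the subtotal/line-item confusion described above.

    If the extracted taxable_value equals one line amount but not the
    sum of all lines, the model probably grabbed a line-level value.
    """
    subtotal = sum(line_amounts)
    if abs(taxable_value - subtotal) <= tol:
        return "ok"
    if any(abs(taxable_value - amount) <= tol for amount in line_amounts):
        return "line_item_confusion"
    return "mismatch"
```

A check like this would have flagged the Jon Doe Print extraction before it reached a finance workflow.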
2. I assumed the model would map visible tax rates to the right field
The supplier and the place of supply were in the same state: the supplier GSTIN's state code matched the place-of-supply state code. So this was an intra-state invoice.
That means:
- CGST > 0
- SGST > 0
- IGST = 0
The model still output:
igst_rate = 0.09
This is a subtle but important failure.
It saw a printed 18% tax context on the invoice and mapped that rate into the wrong slot.
That is not an arithmetic problem. It is a field-to-concept mapping problem.
Synthetic data had taught the model what tax fields exist. It had not sufficiently taught it how to disambiguate them when the invoice layout was less explicit.
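That disambiguation can be enforced outside the model. A sketch, using the fact that the first two digits of a GSTIN are the state code, and ignoring zero-rated or exempt edge cases where all rates are legitimately zero:

```python
def classify_supply(supplier_gstin: str, place_of_supply_code: str) -> str:
    """Intra-state if the GSTIN's leading state code matches the
    place-of-supply state code; inter-state otherwise."""
    return "intra" if supplier_gstin[:2] == place_of_supply_code else "inter"

def tax_slots_consistent(supply_type, cgst_rate, sgst_rate, igst_rate):
    """Enforce the invariant from the text:
    intra-state => CGST > 0, SGST > 0, IGST == 0
    inter-state => IGST > 0, CGST == SGST == 0
    """
    if supply_type == "intra":
        return cgst_rate > 0 and sgst_rate > 0 and igst_rate == 0
    return igst_rate > 0 and cgst_rate == 0 and sgst_rate == 0
```

Against this check, the model's igst_rate = 0.09 on an intra-state invoice fails immediately, without any reference answer.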
3. I assumed missing fields would default safely
The model returned:
reverse_charge = 0
The correct value was:
reverse_charge = "No"
This looks small until you think about how these systems get deployed.
If the downstream consumer expects:
- a strict string enum
and gets:
- a number
you now have:
- broken JSON contracts
- parser failures
- brittle rule-engine behavior
The model did not just guess the wrong value. It guessed the wrong type.
That is a very different category of failure.
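One way to treat it as such is to validate the enum explicitly and refuse to coerce. This is a policy sketch, not the project's actual code; the design choice is to fail loudly rather than silently repair a type error:

```python
ALLOWED_REVERSE_CHARGE = {"Yes", "No"}

def normalize_reverse_charge(value):
    """Reject anything outside the enum instead of passing it through.

    Deliberately does NOT coerce 0 -> "No": a type error should surface
    as a validation failure, not be silently repaired downstream.
    """
    if value in ALLOWED_REVERSE_CHARGE:
        return value
    raise ValueError(
        f"reverse_charge must be one of {sorted(ALLOWED_REVERSE_CHARGE)}, "
        f"got {value!r}"
    )
```

The alternative, coercing 0 to "No", hides exactly the kind of model behavior this invoice exposed.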
4. I assumed synthetic layout diversity was enough
The invoice format differed from the synthetic training distribution in small ways:
- weaker or alternate labels
- less structured spacing
- no clean field presentation for some values
- multi-line item complexity
None of those differences are dramatic in isolation.
Together, they were enough to push the model into the wrong extraction path.
That is the real problem with synthetic validation:
You can cover many business scenarios while still under-covering format variance.
The model learns the contract of the synthetic world very well.
Then one real document shows you which parts of the world your contract forgot to mention.
Synthetic data did help. Just not in the way validation suggested.
This is the part that matters most.
The synthetic data was not a waste.
It gave me:
- a working training loop
- a rendered dataset
- stable checkpointing
- a measurable extraction task
- a way to iterate cheaply
And it gave the model enough structure to learn the task.
The training curve from the final stable run proves that:
Val loss 0.552 -> 0.024
That is real learning.
But the real invoice test showed what that learning actually meant:
- the model learned the schema
- it did not yet learn the full shape of real-world invoices
That distinction is the whole article.
Synthetic data was useful because it taught the model the contract.
The real invoice exposed the parts of the contract that were underspecified.
The two failed runs were part of the lesson too
Before the stable run, I had two failed runs that made the later result more believable.
Run 1: the overfit run
The first successful run had a strong early checkpoint and then degraded badly.
Validation loss:
- Iter 1: 0.552
- Iter 200: 0.022
- Iter 400: 1.397
- Iter 500: 0.122
The model got to a very good point by iter 200, then drifted away from it.
That run taught me:
- the best checkpoint is not necessarily the last checkpoint
- a constant aggressive learning rate on a small synthetic dataset can destroy a good run after it already succeeded
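Picking the best checkpoint instead of the last one is a one-line policy. A sketch against Run 1's own numbers:

```python
def best_checkpoint(val_history):
    """val_history: list of (iteration, val_loss) pairs.

    Returns the iteration with the lowest validation loss -- which,
    as Run 1 showed, need not be the final one.
    """
    return min(val_history, key=lambda pair: pair[1])[0]
```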
Run 2: the NaN run
The second run looked healthier until sequence-length issues showed up.
At iter 150, the log warned:
[WARNING] Some sequences are longer than 1536 tokens. The longest sentence 1973 will be truncated to 1536.

Immediately afterward:

- Train loss: nan
- Val loss: nan
- the rest of the run stayed corrupted
The last clean checkpoint in that run was iter 100.
That run taught me:
- token limits are not just throughput constraints
- one bad sample can invalidate the rest of a training run
- “the run finished” is not the same as “the run is usable”
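The same lesson can be enforced as a pre-flight filter before training. The token counter here is a placeholder; a whitespace split undercounts subword tokens, so the real check should use the model's own tokenizer:

```python
def filter_overlong(samples, count_tokens, max_seq_length=1536):
    """Split samples into (keep, drop) before training starts.

    count_tokens is whatever tokenizer the training stack uses; a
    whitespace split is only a rough stand-in for subword counts.
    """
    keep, drop = [], []
    for sample in samples:
        (keep if count_tokens(sample) <= max_seq_length else drop).append(sample)
    return keep, drop
```

Running this over the dataset before iter 1 would have surfaced the 1973-token sample that later poisoned the run.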
Those failures are worth mentioning because they stop the final result from sounding cleaner than it really was.
They also explain why the stable v3 run is more believable than it would be in isolation. The earlier runs failed in concrete, diagnosable ways.
What the project actually proved
It did not prove that synthetic data is enough.
It proved three narrower things:
1. Synthetic data is excellent for bootstrapping a structured extraction task
It gave me scale, perfect labels, and scenario coverage fast.
2. Validation on synthetic data can dramatically overstate readiness
The model’s synthetic metrics looked excellent before the real-invoice test exposed field-mapping failures.
3. A small real corpus is disproportionately valuable
The single real invoice I tested taught me more about generalization than another hundred synthetic invoices would have.
That is not because synthetic data is bad.
It is because synthetic data and real data teach different things:
| Synthetic data teaches | Real data teaches |
|---|---|
| schema | variance |
| business scenarios | layout ambiguity |
| output format discipline | how documents actually break |
| scale | trust |
What I would change next
Only three things matter now:
- Build a small real-invoice gold set and make it part of evaluation immediately. The main gap here was format variance, not business-rule coverage.
- Add real invoices into training earlier instead of trying to synthesize my way out of layout variance. The real-invoice failure was a distribution problem, not a parameter-count problem.
- Strengthen JSON and type constraints so missing fields fail safely instead of defaulting to 0. The reverse_charge = 0 output is the kind of bug that looks cosmetic in a notebook and expensive in a real pipeline.
Everything else is secondary.
The lesson I am keeping
Synthetic data got the model to the point where it knew what an invoice parser is supposed to do.
One real invoice showed me what it still did not understand.
That is the difference.
Synthetic data teaches the task.
Real data teaches the world.