An empirical note on what synthetic invoice data taught a Gemma fine-tune, what it hid, and how one real document exposed the gap.
I fine-tuned a small Gemma model to parse Indian invoices because I wanted a path that was cheaper, more private, and easier to deploy than calling a hosted API for every document.
The training metrics looked excellent.
Then I ran the model on one real invoice.
It got the total right, the supplier right, the address right, and still failed in four ways that would make the output unusable in a real finance workflow.
That invoice was more useful than another few hundred synthetic examples.
None of the headline conclusions here are new to anyone with ML experience:
- synthetic data has domain gap
- synthetic validation can be overly optimistic
- real data changes what you trust
What felt worth documenting was the concrete shape of the failure:
- which fields broke first
- which assumptions in the synthetic distribution caused it
- what the training curves looked like before and after instability
- and which lessons were actually about data, not models
The setup
I did not have a large labeled invoice corpus, so I started with synthetic data.
The extraction target was a strict 22-field JSON schema, and the synthetic dataset was large enough to build a real training pipeline. It was not large enough to tell me whether the model understood real invoices.
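To make "strict 22-field JSON schema" concrete, here is a minimal validator sketch. The field names below are illustrative stand-ins, not the project's actual schema, and the real contract has 22 fields rather than this subset:

```python
# Illustrative subset of the 22-field contract -- the real field
# names come from the project's schema; these are stand-ins.
REQUIRED_FIELDS = {
    "supplier_name": str,
    "supplier_gstin": str,
    "invoice_number": str,
    "invoice_date": str,
    "taxable_value": float,
    "igst_rate": float,
    "reverse_charge": str,   # strict enum: "Yes" / "No"
    "total_invoice": float,
}

def validate_extraction(record: dict) -> list:
    """Return a list of contract violations (empty means valid)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors
```

A check like this runs in the evaluation loop, so type-level failures surface before any real invoice does.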
Why validation looked so good
The final stable training run used:
- model: google/gemma-4-E2B-it
- framework: MLX-LM 0.31.2
- trainable params: 7.291M / 4647.450M (0.157%)
- iterations: 300
- learning rate: 5e-5
- num_layers: 8
- batch_size: 1
- grad_accumulation_steps: 8
- max_seq_length: 1536
It trained cleanly on a Mac with peak memory of about 13.677 GB.
Validation loss improved almost monotonically:
- Iter 1: 0.552
- Iter 50: 0.084
- Iter 100: 0.056
- Iter 150: 0.046
- Iter 200: 0.044
- Iter 250: 0.029
- Iter 300: 0.024
If all I had looked at was the validation curve, I would have said the model was basically ready.
That would have been wrong.
That sentence is obvious in the abstract. It only becomes useful when you can point to the exact fields and failure modes that made it wrong.
One real invoice broke four assumptions
The invoice came from Jon Doe Print.
The model output looked plausible enough to pass a quick skim:
- supplier name: Jon Doe Print
- supplier GSTIN: correct format and state code
- supplier address: mostly correct
- invoice number: a plausible-looking integer value
- invoice date: correctly extracted
- total invoice: captured correctly
But the failure table tells the real story:
| Field | Model output | Correct | Impact |
|---|---|---|---|
| description | 3D Printed Prototype | 3D Printed Prototype (Pre filter) | Wrong item identity in downstream categorization |
| taxable_value | line-item amount | invoice subtotal | Wrong amount booked to accounts |
| igst_rate | 0.09 | 0.0 | Wrong tax treatment and downstream GST logic |
| reverse_charge | 0 | No | Type mismatch that can break strict downstream parsers |
The model also captured some things correctly:
- the total invoice amount
- the tax amounts
That is what made the failure interesting.
The model was not random. It had learned enough invoice structure to look useful. It just had not learned enough real invoice variance to be trustworthy.
That distinction is the center of the project.
The problem was not that the model failed to learn invoice extraction at all.
The problem was that it learned the synthetic version of invoice extraction more faithfully than the real one.
The four assumptions that invoice broke
1. I assumed subtotal rows would be easy to identify
The invoice had multiple line items.
The model extracted a line-level amount as taxable_value instead of the invoice subtotal row.
In a synthetic dataset, subtotal rows are easy to standardize:
- same position
- same label family
- same spacing
In real invoices, subtotal rows compete with:
- unit prices
- per-line totals
- tax-inclusive values
- noisy formatting
The model had learned “there is a number near the items.” It had not learned “this is the subtotal row that should override the line-level values.”
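This failure mode is cheap to catch after extraction, assuming the line-item amounts are extracted alongside the subtotal (the real schema may or may not do this). A rough consistency check:

```python
def check_taxable_value(line_amounts, taxable_value, tol=0.01):
    """Flag the subtotal/line-item confusion described above.

    If the extracted taxable_value equals one line amount but not the
    sum of all lines, the model probably grabbed a line-level value.
    """
    subtotal = sum(line_amounts)
    if abs(taxable_value - subtotal) <= tol:
        return "ok"
    if any(abs(taxable_value - amount) <= tol for amount in line_amounts):
        return "line_item_confusion"
    return "mismatch"
```

A check like this would have flagged the Jon Doe Print extraction before it reached a finance workflow.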
2. I assumed the model would map visible tax rates to the right field
The supplier and the place of supply were in the same state: the supplier GSTIN's state code matched the place-of-supply state code. So this was an intra-state invoice.
That means:
- CGST > 0
- SGST > 0
- IGST = 0
The model still output:
igst_rate = 0.09
This is a subtle but important failure.
It saw a printed 18% tax context on the invoice and mapped that rate into the wrong slot.
That is not an arithmetic problem. It is a field-to-concept mapping problem.
Synthetic data had taught the model what tax fields exist. It had not sufficiently taught it how to disambiguate them when the invoice layout was less explicit.
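That disambiguation can be enforced outside the model. A sketch, using the fact that the first two digits of a GSTIN are the state code, and ignoring zero-rated or exempt edge cases where all rates are legitimately zero:

```python
def classify_supply(supplier_gstin: str, place_of_supply_code: str) -> str:
    """Intra-state if the GSTIN's leading state code matches the
    place-of-supply state code; inter-state otherwise."""
    return "intra" if supplier_gstin[:2] == place_of_supply_code else "inter"

def tax_slots_consistent(supply_type, cgst_rate, sgst_rate, igst_rate):
    """Enforce the invariant from the text:
    intra-state => CGST > 0, SGST > 0, IGST == 0
    inter-state => IGST > 0, CGST == SGST == 0
    """
    if supply_type == "intra":
        return cgst_rate > 0 and sgst_rate > 0 and igst_rate == 0
    return igst_rate > 0 and cgst_rate == 0 and sgst_rate == 0
```

Against this check, the model's igst_rate = 0.09 on an intra-state invoice fails immediately, without any reference answer.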
3. I assumed missing fields would default safely
The model returned:
reverse_charge = 0
The correct value was:
reverse_charge = "No"
This looks small until you think about how these systems get deployed.
If the downstream consumer expects:
- a strict string enum
and gets:
- a number
you now have:
- broken JSON contracts
- parser failures
- brittle rule-engine behavior
The model did not just guess the wrong value. It guessed the wrong type.
That is a very different category of failure.
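One way to treat it as such is to validate the enum explicitly and refuse to coerce. This is a policy sketch, not the project's actual code; the design choice is to fail loudly rather than silently repair a type error:

```python
ALLOWED_REVERSE_CHARGE = {"Yes", "No"}

def normalize_reverse_charge(value):
    """Reject anything outside the enum instead of passing it through.

    Deliberately does NOT coerce 0 -> "No": a type error should surface
    as a validation failure, not be silently repaired downstream.
    """
    if value in ALLOWED_REVERSE_CHARGE:
        return value
    raise ValueError(
        f"reverse_charge must be one of {sorted(ALLOWED_REVERSE_CHARGE)}, "
        f"got {value!r}"
    )
```

The alternative, coercing 0 to "No", hides exactly the kind of model behavior this invoice exposed.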
4. I assumed synthetic layout diversity was enough
The invoice format differed from the synthetic training distribution in small ways:
- weaker or alternate labels
- less structured spacing
- no clean field presentation for some values
- multi-line item complexity
None of those differences are dramatic in isolation.
Together, they were enough to push the model into the wrong extraction path.
That is the real problem with synthetic validation:
You can cover many business scenarios while still under-covering format variance.
The model learns the contract of the synthetic world very well.
Then one real document shows you which parts of the world your contract forgot to mention.
Synthetic data did help. Just not in the way validation suggested.
This is the part that matters most.
The synthetic data was not a waste.
It gave me:
- a working training loop
- a rendered dataset
- stable checkpointing
- a measurable extraction task
- a way to iterate cheaply
And it gave the model enough structure to learn the task.
The training curve from the final stable run proves that:
Val loss 0.552 -> 0.024
That is real learning.
But the real invoice test showed what that learning actually meant:
- the model learned the schema
- it did not yet learn the full shape of real-world invoices
That distinction is the whole article.
Synthetic data was useful because it taught the model the contract.
The real invoice exposed the parts of the contract that were underspecified.
The two failed runs were part of the lesson too
Before the stable run, I had two failed runs that made the later result more believable.
Run 1: the overfit run
The first successful run had a strong early checkpoint and then degraded badly.
Validation loss:
- Iter 1: 0.552
- Iter 200: 0.022
- Iter 400: 1.397
- Iter 500: 0.122
The model got to a very good point by iter 200, then drifted away from it.
That run taught me:
- the best checkpoint is not necessarily the last checkpoint
- a constant aggressive learning rate on a small synthetic dataset can destroy a good run after it already succeeded
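Picking the best checkpoint instead of the last one is a one-line policy. A sketch against Run 1's own numbers:

```python
def best_checkpoint(val_history):
    """val_history: list of (iteration, val_loss) pairs.

    Returns the iteration with the lowest validation loss -- which,
    as Run 1 showed, need not be the final one.
    """
    return min(val_history, key=lambda pair: pair[1])[0]
```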
Run 2: the NaN run
The second run looked healthier until sequence-length issues showed up.
At iter 150, the log warned:
[WARNING] Some sequences are longer than 1536 tokens. The longest sentence 1973 will be truncated to 1536.

Immediately afterward:

- Train loss: nan
- Val loss: nan
- the rest of the run stayed corrupted
The last clean checkpoint in that run was iter 100.
That run taught me:
- token limits are not just throughput constraints
- one bad sample can invalidate the rest of a training run
- “the run finished” is not the same as “the run is usable”
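The same lesson can be enforced as a pre-flight filter before training. The token counter here is a placeholder; a whitespace split undercounts subword tokens, so the real check should use the model's own tokenizer:

```python
def filter_overlong(samples, count_tokens, max_seq_length=1536):
    """Split samples into (keep, drop) before training starts.

    count_tokens is whatever tokenizer the training stack uses; a
    whitespace split is only a rough stand-in for subword counts.
    """
    keep, drop = [], []
    for sample in samples:
        (keep if count_tokens(sample) <= max_seq_length else drop).append(sample)
    return keep, drop
```

Running this over the dataset before iter 1 would have surfaced the 1973-token sample that later poisoned the run.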
Those failures are worth mentioning because they stop the final result from sounding cleaner than it really was.
They also explain why the stable v3 run is more believable than it would be in isolation. The earlier runs failed in concrete, diagnosable ways.
What the project actually proved
It did not prove that synthetic data is enough.
It proved three narrower things:
1. Synthetic data is excellent for bootstrapping a structured extraction task
It gave me scale, perfect labels, and scenario coverage fast.
2. Validation on synthetic data can dramatically overstate readiness
The model’s synthetic metrics looked excellent before the real-invoice test exposed field-mapping failures.
3. A small real corpus is disproportionately valuable
The single real invoice I tested taught me more about generalization than another hundred synthetic invoices would have.
That is not because synthetic data is bad.
It is because synthetic data and real data teach different things:
| Synthetic data teaches | Real data teaches |
|---|---|
| schema | variance |
| business scenarios | layout ambiguity |
| output format discipline | how documents actually break |
| scale | trust |
What I would change next
Only three things matter now:
- Build a small real-invoice gold set and make it part of evaluation immediately. The main gap here was format variance, not business-rule coverage.
- Add real invoices into training earlier instead of trying to synthesize my way out of layout variance. The real-invoice failure was a distribution problem, not a parameter-count problem.
- Strengthen JSON and type constraints so missing fields fail safely instead of defaulting to 0. The reverse_charge = 0 output is the kind of bug that looks cosmetic in a notebook and expensive in a real pipeline.
Everything else is secondary.
The lesson I am keeping
Synthetic data got the model to the point where it knew what an invoice parser is supposed to do.
One real invoice showed me what it still did not understand.
That is the difference.
Synthetic data teaches the task.
Real data teaches the world.