This is a submission for the Gemma 4 Challenge: Write About Gemma 4
How I fine-tuned a small Gemma 4 model on a Mac to extract 22 invoice fields privately, and why the data strategy mattered more than the prompt.
I needed to read Indian GST invoices without sending them to an external API every time.
Gemma 4 E2B is an open multimodal model designed for local and edge deployment, with a 128K context window, native system prompt support, and an instruction-tuned variant that is usable without a giant serving stack. Google positions the small Gemma 4 models as practical for on-device and local workflows, not just as miniatures of the larger models. That made it a good fit for a problem I care about: structured invoice extraction where privacy, cost, and control matter as much as raw quality.
At my document volume, a hosted model would have been simple to prototype but expensive to normalize around. Roughly speaking, a model like GPT-4o lands around a cent per invoice at this prompt and output length. A local Gemma 4 setup costs time up front, but effectively $0 per call after that.
My goal was simple:
- take OCR text from Indian GST invoices
- extract 22 fields into strict JSON
- fine-tune locally instead of paying per document to a hosted model
This was not a benchmark project. It was a practical test of what a small local Gemma 4 model can actually learn.
Why Gemma 4 E2B was the right model to try
I did not need a general-purpose assistant. I needed a model that could learn a narrow, structured task and run locally.
That made Gemma 4 E2B interesting for three reasons:
- It is small enough to experiment with on local hardware.
- It is capable enough to handle long, messy invoice OCR.
- It is open enough to fine-tune and evaluate honestly.
The instruction-tuned google/gemma-4-E2B-it model gave me a real starting point, not just a base model that needed a large GPU cluster to become useful.
I ran LoRA fine-tuning with:
- model: google/gemma-4-E2B-it
- framework: MLX-LM
- trainable params: 7.291M / 4647.450M
- trainable fraction: 0.157%
- hardware: Mac
- peak memory during stable runs: roughly 12.4 GB to 13.0 GB
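For context, a run like that is launched through MLX-LM's LoRA entry point. The sketch below is illustrative rather than my exact command: flag names can differ between mlx-lm versions, and the iteration count, batch size, and paths are placeholders.

```python
# Illustrative launcher for an MLX-LM LoRA run (check your mlx-lm version for exact flags).
# Assumes a data/ directory containing train.jsonl and valid.jsonl.
import subprocess

subprocess.run(
    [
        "python", "-m", "mlx_lm.lora",
        "--model", "google/gemma-4-E2B-it",
        "--train",
        "--data", "data",              # directory with train.jsonl / valid.jsonl
        "--iters", "300",              # placeholder; my runs varied
        "--batch-size", "1",
        "--steps-per-eval", "25",      # dense eval checkpoints (see the overfitting section)
        "--adapter-path", "adapters",  # where the LoRA adapter weights are written
    ],
    check=True,
)
```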
That was the first encouraging sign. This was not theory. A small Gemma 4 model could be trained locally on a real business-shaped extraction task.
The task
The extraction target was a strict 22-field JSON schema for Indian GST invoices:
- supplier identity
- buyer identity
- invoice number and dates
- place of supply
- HSN or SAC
- description
- taxable value
- tax rates and amounts
- total invoice
- reverse charge
- e-invoice IRN
The downstream requirement was not "answer roughly correctly." It was:
- valid JSON
- stable field typing
- exact field mapping
That is a much harder and more useful task than general summarization.
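To make "valid JSON, stable field typing, exact field mapping" concrete, here is a minimal check that can be run over every model output. The field names shown are a representative subset of the 22-field schema, and the helper is my own illustration, not code from the project.

```python
import json

# Illustrative subset of the 22-field schema: field name -> expected JSON type(s).
# The full schema covers supplier/buyer identity, dates, HSN/SAC, taxable value,
# tax rates and amounts, totals, reverse charge, and the e-invoice IRN.
EXPECTED_TYPES = {
    "supplier_name": str,
    "supplier_gstin": str,
    "invoice_no": str,
    "invoice_date": str,
    "cgst_rate": (int, float),
    "cgst_amt": (int, float),
    "sgst_amt": (int, float),
    "igst_amt": (int, float),
    "total_invoice": (int, float),
    "reverse_charge": str,
}

def check_output(raw: str) -> list[str]:
    """Return a list of problems: invalid JSON, missing fields, or wrong types."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    for field, expected in EXPECTED_TYPES.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    return problems
```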
What fine-tuning actually changed
The fastest way to see the difference between a "generic capable model" and a "task-adapted model" is to look at one invoice.
Before fine-tuning, the baseline model was capable of understanding the document but not disciplined enough to behave like an extractor. In some runs, it produced malformed JSON, mixed reasoning-style text into the answer, or mapped totals into the wrong fields.
After fine-tuning, the same invoice started producing compact, structured outputs with the fields it had learned reliably:
"json``
{
"supplier_name": "Sample Supplier Pvt Ltd",
“supplier_gstin”: “27XXXXXXXXXX1ZX”,
“invoice_no”: “INV-001”,
“invoice_date”: “16-02-2026”,
“cgst_rate”: 0.09,
"cgst_amt": 285.3,
"sgst_amt": 285.3,
"total_invoice": 3741,
“igst_rate”: 0.0,
"igst_amt": 0,
"reverse_charge": "No"
}
That did not mean the model was finished. Fields like taxable_value still need more real training examples to get right, which is where the project is heading next. But it had crossed the line from "general model guessing at documents" to "specialized extractor that can be improved with data."
The first version: synthetic data was enough to build the pipeline
I started with synthetic data because I did not have a large labeled corpus of invoices.
That synthetic pipeline gave me:
- OCR-like invoice text
- paired 22-field targets
- tax arithmetic coverage
- repeatable training exports
- a way to debug LoRA and evaluation locally
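To illustrate what that pipeline produces, here is a stripped-down sketch of one synthetic example with consistent GST arithmetic. It is my own simplification: the values and helper names are invented, and the real generator covered far more of the schema and layout variation.

```python
import random

def make_synthetic_example() -> dict:
    """Generate one OCR-like invoice text paired with a (partial) target JSON."""
    taxable = round(random.uniform(1_000, 50_000), 2)
    rate = random.choice([0.05, 0.12, 0.18])   # common GST slabs
    # Intra-state supply splits the tax into CGST + SGST; inter-state supply uses IGST.
    intra_state = random.random() < 0.5
    cgst = round(taxable * rate / 2, 2) if intra_state else 0.0
    sgst = cgst
    igst = round(taxable * rate, 2) if not intra_state else 0.0
    total = round(taxable + cgst + sgst + igst, 2)

    ocr_text = (
        f"TAX INVOICE\nInvoice No: INV-{random.randint(1, 999):03d}\n"
        f"Taxable Value: {taxable}\nCGST: {cgst}  SGST: {sgst}  IGST: {igst}\n"
        f"Total: {total}\nReverse Charge: No"
    )
    target = {
        "taxable_value": taxable,
        "cgst_amt": cgst,
        "sgst_amt": sgst,
        "igst_amt": igst,
        "total_invoice": total,
        "reverse_charge": "No",
    }
    return {"ocr_text": ocr_text, "target": target}
```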
The first clean synthetic-only run looked excellent.
Validation loss on the synthetic holdout improved from:
- 0.552 at iteration 1
- to 0.024 at iteration 300
On paper, that looked close to done.
But synthetic validation was measuring whether the model understood the synthetic world I had created, not whether it understood real invoices from real suppliers.
That distinction ended up shaping the whole project.
The real work was data engineering
The model was not the hard part.
The hard part was teaching the model what real invoice variance looks like.
I eventually built the dataset in layers:
Layer 1: generic synthetic invoices
These were useful for:
- schema coverage
- GST arithmetic patterns
- JSON output discipline
- basic extraction behavior
Layer 2: real annotated invoices
I merged and cleaned real invoice annotations into a single CSV:
- 28 real invoices
- 22 unique suppliers
- a mix of PDF and image invoices
Before retraining, I split them into:
- 20 real train invoices
- 8 real holdout invoices
That was the first time I had a real evaluation set that could tell me something meaningful.
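The split itself is trivial; the point is to freeze it before looking at results. A minimal sketch, assuming a CSV with one row per annotated invoice (the file and column layout here are hypothetical):

```python
import pandas as pd

# Hypothetical file; the real CSV held 28 annotated invoices from 22 suppliers.
df = pd.read_csv("real_invoices.csv")

# Shuffle once with a fixed seed, then freeze the split before any training run.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
holdout = df.iloc[:8]   # 8 real holdout invoices, never used for training
train = df.iloc[8:]     # 20 real train invoices

holdout.to_csv("real_holdout.csv", index=False)
train.to_csv("real_train.csv", index=False)
```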
Layer 3: Archive-derived layout variants
This was the most important change to the dataset.
Instead of generating more generic synthetic invoices, I reused the structure of real invoice layouts from an Archive/ folder:
- alternate labels like No., Bill No., GST No
- weakly labeled subtotal rows
- dense table layouts
- multiline descriptions
- inconsistent spacing
- subtotal-only item blocks
From those real layouts, I generated synthetic OCR variants that preserved layout difficulty while changing values and identities.
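Conceptually, a variant generator like that keeps the labels, ordering, and spacing of a real layout and substitutes only the entities and amounts. A minimal sketch of the idea, with an invented template and helper name rather than the project's actual generator:

```python
import random
import re

def make_layout_variant(layout_template: str, new_values: dict[str, str]) -> str:
    """Fill a real layout's placeholders with fresh synthetic values.

    The template keeps the original labels, ordering, and spacing (the hard parts
    of the layout); only the values change between variants.
    """
    return re.sub(r"\{(\w+)\}", lambda m: new_values.get(m.group(1), m.group(0)), layout_template)

# Hypothetical template derived from a dense real layout, values replaced by placeholders:
template = "Bill No. {invoice_no}    GST No {supplier_gstin}\nSub Total {taxable_value}"

variant = make_layout_variant(template, {
    "invoice_no": f"BILL-{random.randint(100, 999)}",
    "supplier_gstin": "27ABCDE1234F1Z5",
    "taxable_value": "12,450.00",
})
print(variant)
```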
The final hybrid training mix was:
- 250 generic synthetic examples
- 360 Archive-layout variants
- 8 exact real OCR train examples matched back to source documents
Validation used:
- 8 held-out real invoices
That was the first dataset composition that looked like a real fine-tuning strategy rather than a synthetic demo.
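Mechanically, assembling the mix is just concatenating the three pools and writing them in the format the trainer expects. A minimal sketch, assuming the plain {"text": ...} JSONL format that MLX-LM's LoRA data loader accepts (chat-style formats also exist); the pool contents and paths below are placeholders:

```python
import json
import random

# Placeholder pools; in the project these held 250 generic synthetic examples,
# 360 Archive-layout variants, and 8 real OCR train examples respectively.
generic_synthetic = [{"prompt": "Extract the 22 fields from:\nTAX INVOICE ...",
                      "target": {"invoice_no": "INV-001"}}]
archive_variants = []
real_train_examples = []
real_holdout_examples = []

def write_jsonl(path, examples):
    """Write one {"text": ...} object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            # Each line is the full prompt (instructions + OCR text) followed by
            # the gold 22-field JSON the model should learn to emit.
            text = ex["prompt"] + "\n" + json.dumps(ex["target"], ensure_ascii=False)
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

train_mix = generic_synthetic + archive_variants + real_train_examples
random.shuffle(train_mix)
write_jsonl("data/train.jsonl", train_mix)
write_jsonl("data/valid.jsonl", real_holdout_examples)  # the 8 held-out real invoices
```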
What Gemma 4 E2B actually learned
This is the part I think matters most for anyone considering Gemma 4 for domain adaptation.
Gemma 4 E2B clearly learned the task.
Not in the vague sense of "it sounded good," but in the operational sense:
- JSON stayed structurally stable
- The model learned invoice field boundaries
- It handled many supplier layouts
- It converged reliably on a Mac with a tiny trainable parameter budget
The most meaningful run was the first hybrid real-holdout run.
Its validation loss on the real holdout set improved from:
- 0.786 at iteration 1
- to 0.132 at iteration 250
Then I fixed issues in the Archive-variant generator:
- removed annotation-format leakage like @ ₹xxx = ₹xxx
- parsed real per-line amounts instead of splitting totals evenly
- forced igst = 0 for intra-state invoices in the target JSON
That produced a slightly better run:
- final real-holdout validation loss: 0.130
I also tried a prompt change specifically aimed at multiline extraction. That got worse, not better:
- final validation loss: 0.147
I would not overstate that result. On an 8-invoice real holdout, that difference is not strong enough to claim that prompt engineering is harmful in general.
What it did show me is something narrower and more useful:
- The prompt tweak did not clearly beat the best data-only run
- The biggest gains in this project came from dataset composition, not instruction wording
What this taught me about Gemma 4
The headline lesson was not that Gemma 4 needed heroic prompt engineering.
It was that Gemma 4 E2B was already capable enough that the next bottleneck was dataset quality.
That is a good sign for the model.
Small models become interesting when they are strong enough that your time moves from "can this model learn the task at all?" to "what data do I need to make it trustworthy?"
That is where this project ended up.
What went wrong, and why it was useful
I do not think a useful Gemma 4 write-up should pretend everything worked on the first run.
Two failures were especially instructive.
Failure 1: overfitting after a great checkpoint
In an earlier run, validation loss got very low and then degraded badly later:
- 0.552 at iteration 1
- 0.022 at iteration 200
- 1.397 at iteration 400
That taught me:
- The best checkpoint is not necessarily the last checkpoint
- Dense evaluation checkpoints matter
- Small local runs can still overtrain quickly
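The fix is mundane: evaluate often and keep the checkpoint from the best validation point, not the last one. A minimal sketch of that bookkeeping over a training log (the numbers mirror the run above; the log format is my own illustration):

```python
# Validation points from a training log: (iteration, validation loss).
eval_points = [(1, 0.552), (200, 0.022), (400, 1.397)]

best_iter, best_loss = min(eval_points, key=lambda p: p[1])
print(f"Best checkpoint is iteration {best_iter} (val loss {best_loss}), not the final one.")
# If the trainer saves adapter weights periodically, the adapter from that
# iteration can be reloaded instead of the one written at the end of the run.
```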
Failure 2: NaN training from sequence problems
Another run looked healthy until long examples were truncated. After that:
- train loss became nan
- validation loss became nan
- the rest of the run was unusable
That forced me to treat dataset export and sequence control as first-class parts of the training pipeline, not cleanup tasks.
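The guard I would now put in front of every export is a simple token-length check, so overlong examples are dropped or shortened before they can destabilize a run. A hedged sketch using a Hugging Face tokenizer (the threshold, file name, and model identifier are illustrative and should match your own training config):

```python
import json
from transformers import AutoTokenizer

MAX_TOKENS = 2048  # illustrative; should match the trainer's max sequence length

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")

kept, dropped = [], 0
with open("data/train.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        n_tokens = len(tokenizer.encode(record["text"]))
        if n_tokens > MAX_TOKENS:
            dropped += 1   # better to drop or shorten than to truncate mid-target
        else:
            kept.append(record)

print(f"kept {len(kept)} examples, dropped {dropped} overlong ones")
```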
Both failures improved the eventual Gemma 4 setup more than another round of prompt edits would have.
The practical result
I would summarize the outcome this way:
- Gemma 4 E2B was strong enough to learn a real structured extraction task locally
- LoRA fine-tuning worked within a very small trainable parameter budget
- The model benefited more from better data composition than from prompt tweaking
- Synthetic data was useful for bootstrapping, but real layout variance determined what I could trust
That is a very good place for a small, open model.
If I were starting again
I would do three things earlier:
- Create a real holdout set before the first "good" run.
- Build layout-derived synthetic variants before scaling generic synthetic data.
- Evaluate field-level errors on real invoices sooner instead of trusting synthetic validation curves.
Those are not just invoice lessons. They are reusable lessons for anyone fine-tuning Gemma 4 on domain-specific extraction tasks.
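The third point is the easiest to act on: compare predicted and gold JSON field by field instead of watching a single loss curve. A minimal sketch of that kind of field-level scoring (my own illustration with invented values, not the project's evaluation code):

```python
from collections import Counter

def field_errors(predictions: list[dict], gold: list[dict]) -> Counter:
    """Count, per field, how often the predicted value differs from the gold value."""
    errors = Counter()
    for pred, ref in zip(predictions, gold):
        for field, expected in ref.items():
            if pred.get(field) != expected:
                errors[field] += 1
    return errors

# Tiny illustrative run over two invoices.
gold = [{"invoice_no": "INV-001", "taxable_value": 3171.0},
        {"invoice_no": "INV-002", "taxable_value": 980.0}]
pred = [{"invoice_no": "INV-001", "taxable_value": 3741.0},   # wrong taxable_value
        {"invoice_no": "INV-002", "taxable_value": 980.0}]

print(field_errors(pred, gold))  # Counter({'taxable_value': 1})
```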
Why this made me more optimistic about Gemma 4, not less
The most interesting thing about this project was not that Gemma 4 E2B solved everything immediately.
It was that a local, open, small model got far enough that the real work shifted to data design, evaluation discipline, and layout coverage.
That is exactly the kind of capability I want from an open model.
Not a toy.
Not a benchmark artifact.
A model that is small enough to run locally, but capable enough to deserve serious data engineering.
For this task, Gemma 4 E2B crossed that line.