This is a submission for the Gemma 4 Challenge: Write About Gemma 4
How I fine-tuned a small Gemma 4 model on a Mac to extract 22 invoice fields privately, and why the data strategy mattered more than the prompt.
I needed to read Indian GST invoices without sending them to an external API every time.
Gemma 4 E2B is an open multimodal model designed for local and edge deployment, with a 128K context window, native system prompt support, and an instruction-tuned variant that is usable without a giant serving stack. Google positions the small Gemma 4 models as practical for on-device and local workflows, not just as miniatures of the larger models. That made it a good fit for a problem I care about: structured invoice extraction where privacy, cost, and control matter as much as raw quality.
At my document volume, a hosted model would have been simple to prototype but expensive to normalize around. Roughly speaking, a model like GPT-4o lands around a cent per invoice at this prompt and output length. A local Gemma 4 setup costs time up front, but effectively $0 per call after that.
My goal was simple:
- take OCR text from Indian GST invoices
- extract 22 fields into strict JSON
- fine-tune locally instead of paying per document to a hosted model
This was not a benchmark project. It was a practical test of what a small local Gemma 4 model can actually learn.
Why Gemma 4 E2B was the right model to try
I did not need a general-purpose assistant. I needed a model that could learn a narrow, structured task and run locally.
That made Gemma 4 E2B interesting for three reasons:
- It is small enough to experiment with on local hardware.
- It is capable enough to handle long, messy invoice OCR.
- It is open enough to fine-tune and evaluate honestly.
The instruction-tuned google/gemma-4-E2B-it model gave me a real starting point, not just a base model that needed a large GPU cluster to become useful.
I ran LoRA fine-tuning with:
- model: google/gemma-4-E2B-it
- framework: MLX-LM
- trainable params: 7.291M / 4647.450M
- trainable fraction: 0.157%
- hardware: Mac
- peak memory during stable runs: roughly 12.4 GB to 13.0 GB
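For context, a run like that is launched through MLX-LM's LoRA entry point. The sketch below is illustrative rather than my exact command: flag names can differ between mlx-lm versions, and the iteration count, batch size, and paths are placeholders.

```python
# Illustrative launcher for an MLX-LM LoRA run (check your mlx-lm version for exact flags).
# Assumes a data/ directory containing train.jsonl and valid.jsonl.
import subprocess

subprocess.run(
    [
        "python", "-m", "mlx_lm.lora",
        "--model", "google/gemma-4-E2B-it",
        "--train",
        "--data", "data",              # directory with train.jsonl / valid.jsonl
        "--iters", "300",              # placeholder; my runs varied
        "--batch-size", "1",
        "--steps-per-eval", "25",      # dense eval checkpoints (see the overfitting section)
        "--adapter-path", "adapters",  # where the LoRA adapter weights are written
    ],
    check=True,
)
```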
That was the first encouraging sign. This was not theory. A small Gemma 4 model could be trained locally on a real business-shaped extraction task.
The task
The extraction target was a strict 22-field JSON schema for Indian GST invoices:
- supplier identity
- buyer identity
- invoice number and dates
- place of supply
- HSN or SAC
- description
- taxable value
- tax rates and amounts
- total invoice
- reverse charge
- e-invoice IRN
The downstream requirement was not "answer roughly correctly." It was:
- valid JSON
- stable field typing
- exact field mapping
That is a much harder and more useful task than general summarization.
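To make "valid JSON, stable field typing, exact field mapping" concrete, here is a minimal check that can be run over every model output. The field names shown are a representative subset of the 22-field schema, and the helper is my own illustration, not code from the project.

```python
import json

# Illustrative subset of the 22-field schema: field name -> expected JSON type(s).
# The full schema covers supplier/buyer identity, dates, HSN/SAC, taxable value,
# tax rates and amounts, totals, reverse charge, and the e-invoice IRN.
EXPECTED_TYPES = {
    "supplier_name": str,
    "supplier_gstin": str,
    "invoice_no": str,
    "invoice_date": str,
    "cgst_rate": (int, float),
    "cgst_amt": (int, float),
    "sgst_amt": (int, float),
    "igst_amt": (int, float),
    "total_invoice": (int, float),
    "reverse_charge": str,
}

def check_output(raw: str) -> list[str]:
    """Return a list of problems: invalid JSON, missing fields, or wrong types."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    for field, expected in EXPECTED_TYPES.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    return problems
```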
What fine-tuning actually changed
The fastest way to see the difference between a "generic capable model" and a "task-adapted model" is to look at one invoice.
Before fine-tuning, the baseline model was capable of understanding the document but not disciplined enough to behave like an extractor. In some runs, it produced malformed JSON, mixed reasoning-style text into the answer, or mapped totals into the wrong fields.
After fine-tuning, the same invoice started producing compact, structured outputs with the fields it had learned reliably:
"json``
{
"supplier_name": "Sample Supplier Pvt Ltd",
“supplier_gstin”: “27XXXXXXXXXX1ZX”,
“invoice_no”: “INV-001”,
“invoice_date”: “16-02-2026”,
“cgst_rate”: 0.09,
"cgst_amt": 285.3,
"sgst_amt": 285.3,
"total_invoice": 3741,
“igst_rate”: 0.0,
"igst_amt": 0,
"reverse_charge": "No"
}
That did not mean the model was finished. Fields like taxable_value still need more real training examples to get right, which is where the project is heading next. But it had crossed the line from "general model guessing at documents" to "specialized extractor that can be improved with data."
The first version: synthetic data was enough to build the pipeline
I started with synthetic data because I did not have a large labeled corpus of invoices.
That synthetic pipeline gave me:
- OCR-like invoice text
- paired 22-field targets
- tax arithmetic coverage
- repeatable training exports
- a way to debug LoRA and evaluation locally
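To illustrate what that pipeline produces, here is a stripped-down sketch of one synthetic example with consistent GST arithmetic. It is my own simplification: the values and helper names are invented, and the real generator covered far more of the schema and layout variation.

```python
import random

def make_synthetic_example() -> dict:
    """Generate one OCR-like invoice text paired with a (partial) target JSON."""
    taxable = round(random.uniform(1_000, 50_000), 2)
    rate = random.choice([0.05, 0.12, 0.18])   # common GST slabs
    # Intra-state supply splits the tax into CGST + SGST; inter-state supply uses IGST.
    intra_state = random.random() < 0.5
    cgst = round(taxable * rate / 2, 2) if intra_state else 0.0
    sgst = cgst
    igst = round(taxable * rate, 2) if not intra_state else 0.0
    total = round(taxable + cgst + sgst + igst, 2)

    ocr_text = (
        f"TAX INVOICE\nInvoice No: INV-{random.randint(1, 999):03d}\n"
        f"Taxable Value: {taxable}\nCGST: {cgst}  SGST: {sgst}  IGST: {igst}\n"
        f"Total: {total}\nReverse Charge: No"
    )
    target = {
        "taxable_value": taxable,
        "cgst_amt": cgst,
        "sgst_amt": sgst,
        "igst_amt": igst,
        "total_invoice": total,
        "reverse_charge": "No",
    }
    return {"ocr_text": ocr_text, "target": target}
```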
The first clean synthetic-only run looked excellent.
Validation loss on the synthetic holdout improved from:
- 0.552 at iteration 1
- to 0.024 at iteration 300
On paper, that looked close to done.
But synthetic validation was measuring whether the model understood the synthetic world I had created, not whether it understood real invoices from real suppliers.
That distinction ended up shaping the whole project.
The real work was data engineering
The model was not the hard part.
The hard part was teaching the model what real invoice variance looks like.
I eventually built the dataset in layers:
Layer 1: generic synthetic invoices
These were useful for:
- schema coverage
- GST arithmetic patterns
- JSON output discipline
- basic extraction behavior
Layer 2: real annotated invoices
I merged and cleaned real invoice annotations into a single CSV:
- 28 real invoices
- 22 unique suppliers
- a mix of PDF and image invoices
Before retraining, I split them into:
- 20 real train invoices
- 8 real holdout invoices
That was the first time I had a real evaluation set that could tell me something meaningful.
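The split itself is trivial; the point is to freeze it before looking at results. A minimal sketch, assuming a CSV with one row per annotated invoice (the file and column layout here are hypothetical):

```python
import pandas as pd

# Hypothetical file; the real CSV held 28 annotated invoices from 22 suppliers.
df = pd.read_csv("real_invoices.csv")

# Shuffle once with a fixed seed, then freeze the split before any training run.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
holdout = df.iloc[:8]   # 8 real holdout invoices, never used for training
train = df.iloc[8:]     # 20 real train invoices

holdout.to_csv("real_holdout.csv", index=False)
train.to_csv("real_train.csv", index=False)
```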
Layer 3: Archive-derived layout variants
This was the most important change to the dataset.
Instead of generating more generic synthetic invoices, I reused the structure of real invoice layouts from an Archive/ folder:
- alternate labels like No., Bill No., GST No
- weakly labeled subtotal rows
- dense table layouts
- multiline descriptions
- inconsistent spacing
- subtotal-only item blocks
From those real layouts, I generated synthetic OCR variants that preserved layout difficulty while changing values and identities.
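Conceptually, a variant generator like that keeps the labels, ordering, and spacing of a real layout and substitutes only the entities and amounts. A minimal sketch of the idea, with an invented template and helper name rather than the project's actual generator:

```python
import random
import re

def make_layout_variant(layout_template: str, new_values: dict[str, str]) -> str:
    """Fill a real layout's placeholders with fresh synthetic values.

    The template keeps the original labels, ordering, and spacing (the hard parts
    of the layout); only the values change between variants.
    """
    return re.sub(r"\{(\w+)\}", lambda m: new_values.get(m.group(1), m.group(0)), layout_template)

# Hypothetical template derived from a dense real layout, values replaced by placeholders:
template = "Bill No. {invoice_no}    GST No {supplier_gstin}\nSub Total {taxable_value}"

variant = make_layout_variant(template, {
    "invoice_no": f"BILL-{random.randint(100, 999)}",
    "supplier_gstin": "27ABCDE1234F1Z5",
    "taxable_value": "12,450.00",
})
print(variant)
```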
The final hybrid training mix was:
- 250 generic synthetic examples
- 360 Archive-layout variants
- 8 exact real OCR train examples matched back to source documents
Validation used:
- 8 held-out real invoices
That was the first dataset composition that looked like a real fine-tuning strategy rather than a synthetic demo.
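Mechanically, assembling the mix is just concatenating the three pools and writing them in the format the trainer expects. A minimal sketch, assuming the plain {"text": ...} JSONL format that MLX-LM's LoRA data loader accepts (chat-style formats also exist); the pool contents and paths below are placeholders:

```python
import json
import random

# Placeholder pools; in the project these held 250 generic synthetic examples,
# 360 Archive-layout variants, and 8 real OCR train examples respectively.
generic_synthetic = [{"prompt": "Extract the 22 fields from:\nTAX INVOICE ...",
                      "target": {"invoice_no": "INV-001"}}]
archive_variants = []
real_train_examples = []
real_holdout_examples = []

def write_jsonl(path, examples):
    """Write one {"text": ...} object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            # Each line is the full prompt (instructions + OCR text) followed by
            # the gold 22-field JSON the model should learn to emit.
            text = ex["prompt"] + "\n" + json.dumps(ex["target"], ensure_ascii=False)
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

train_mix = generic_synthetic + archive_variants + real_train_examples
random.shuffle(train_mix)
write_jsonl("data/train.jsonl", train_mix)
write_jsonl("data/valid.jsonl", real_holdout_examples)  # the 8 held-out real invoices
```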
What Gemma 4 E2B actually learned
This is the part I think matters most for anyone considering Gemma 4 for domain adaptation.
Gemma 4 E2B clearly learned the task.
Not in the vague sense of "it sounded good," but in the operational sense:
- JSON stayed structurally stable
- The model learned invoice field boundaries
- It handled many supplier layouts
- It converged reliably on a Mac with a tiny trainable parameter budget
The most meaningful run was the first hybrid real-holdout run.
Its validation loss on the real holdout set improved from:
- 0.786 at iteration 1
- to 0.132 at iteration 250
Then I fixed issues in the Archive-variant generator:
- removed annotation-format leakage like @ ₹xxx = ₹xxx
- parsed real per-line amounts instead of splitting totals evenly
- forced igst = 0 for intra-state invoices in the target JSON
That produced a slightly better run:
- final real-holdout validation loss: 0.130
I also tried a prompt change specifically aimed at multiline extraction. That got worse, not better:
- final validation loss: 0.147
I would not overstate that result. On an 8-invoice real holdout, that difference is not strong enough to claim that prompt engineering is harmful in general.
What it did show me is something narrower and more useful:
- The prompt tweak did not clearly beat the best data-only run
- The biggest gains in this project came from dataset composition, not instruction wording
What this taught me about Gemma 4
The headline lesson was not that Gemma 4 needed heroic prompt engineering.
It was that Gemma 4 E2B was already capable enough that the next bottleneck was dataset quality.
That is a good sign for the model.
Small models become interesting when they are strong enough that your time moves from "can this model learn the task at all?" to "what data do I need to make it trustworthy?"
That is where this project ended up.
What went wrong, and why it was useful
I do not think a useful Gemma 4 write-up should pretend everything worked on the first run.
Two failures were especially instructive.
Failure 1: overfitting after a great checkpoint
In an earlier run, validation loss got very low and then degraded badly later:
- 0.552 at iteration 1
- 0.022 at iteration 200
- 1.397 at iteration 400
That taught me:
- The best checkpoint is not necessarily the last checkpoint
- Dense evaluation checkpoints matter
- Small local runs can still overtrain quickly
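The fix is mundane: evaluate often and keep the checkpoint from the best validation point, not the last one. A minimal sketch of that bookkeeping over a training log (the numbers mirror the run above; the log format is my own illustration):

```python
# Validation points from a training log: (iteration, validation loss).
eval_points = [(1, 0.552), (200, 0.022), (400, 1.397)]

best_iter, best_loss = min(eval_points, key=lambda p: p[1])
print(f"Best checkpoint is iteration {best_iter} (val loss {best_loss}), not the final one.")
# If the trainer saves adapter weights periodically, the adapter from that
# iteration can be reloaded instead of the one written at the end of the run.
```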
Failure 2: NaN training from sequence problems
Another run looked healthy until long examples were truncated. After that:
- train loss became nan
- validation loss became nan
- the rest of the run was unusable
That forced me to treat dataset export and sequence control as first-class parts of the training pipeline, not cleanup tasks.
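The guard I would now put in front of every export is a simple token-length check, so overlong examples are dropped or shortened before they can destabilize a run. A hedged sketch using a Hugging Face tokenizer (the threshold, file name, and model identifier are illustrative and should match your own training config):

```python
import json
from transformers import AutoTokenizer

MAX_TOKENS = 2048  # illustrative; should match the trainer's max sequence length

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")

kept, dropped = [], 0
with open("data/train.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        n_tokens = len(tokenizer.encode(record["text"]))
        if n_tokens > MAX_TOKENS:
            dropped += 1   # better to drop or shorten than to truncate mid-target
        else:
            kept.append(record)

print(f"kept {len(kept)} examples, dropped {dropped} overlong ones")
```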
Both failures improved the eventual Gemma 4 setup more than another round of prompt edits would have.
The practical result
I would summarize the outcome this way:
- Gemma 4 E2B was strong enough to learn a real structured extraction task locally
- LoRA fine-tuning worked within a very small trainable parameter budget
- The model benefited more from better data composition than from prompt tweaking
- Synthetic data was useful for bootstrapping, but real layout variance determined what I could trust
That is a very good place for a small, open model.
If I were starting again
I would do three things earlier:
- Create a real holdout set before the first "good" run.
- Build layout-derived synthetic variants before scaling generic synthetic data.
- Evaluate field-level errors on real invoices sooner instead of trusting synthetic validation curves.
Those are not just invoice lessons. They are reusable lessons for anyone fine-tuning Gemma 4 on domain-specific extraction tasks.
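The third point is the easiest to act on: compare predicted and gold JSON field by field instead of watching a single loss curve. A minimal sketch of that kind of field-level scoring (my own illustration with invented values, not the project's evaluation code):

```python
from collections import Counter

def field_errors(predictions: list[dict], gold: list[dict]) -> Counter:
    """Count, per field, how often the predicted value differs from the gold value."""
    errors = Counter()
    for pred, ref in zip(predictions, gold):
        for field, expected in ref.items():
            if pred.get(field) != expected:
                errors[field] += 1
    return errors

# Tiny illustrative run over two invoices.
gold = [{"invoice_no": "INV-001", "taxable_value": 3171.0},
        {"invoice_no": "INV-002", "taxable_value": 980.0}]
pred = [{"invoice_no": "INV-001", "taxable_value": 3741.0},   # wrong taxable_value
        {"invoice_no": "INV-002", "taxable_value": 980.0}]

print(field_errors(pred, gold))  # Counter({'taxable_value': 1})
```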
Why this made me more optimistic about Gemma 4, not less
The most interesting thing about this project was not that Gemma 4 E2B solved everything immediately.
It was that a local, open, small model got far enough that the real work shifted to data design, evaluation discipline, and layout coverage.
That is exactly the kind of capability I want from an open model.
Not a toy.
Not a benchmark artifact.
A model that is small enough to run locally, but capable enough to deserve serious data engineering.
For this task, Gemma 4 E2B crossed that line.