DEV Community

angu10


The model isn’t the hard part: the data pipeline I built to teach Gemma 4 E2B to read Indian GST invoices.


This is a submission for the Gemma 4 Challenge: Write About Gemma 4

How I fine-tuned a small Gemma 4 model on a Mac to extract 22 invoice fields privately, and why the data strategy mattered more than the prompt.

I needed to read Indian GST invoices without sending them to an external API every time.

Gemma 4 E2B is an open multimodal model designed for local and edge deployment, with a 128K context window, native system prompt support, and an instruction-tuned variant that is usable without a giant serving stack. Google positions the small Gemma 4 models as practical for on-device and local workflows, not just as miniatures of the larger models. That made it a good fit for a problem I care about: structured invoice extraction where privacy, cost, and control matter as much as raw quality.

At my document volume, a hosted model would have been simple to prototype with but expensive to standardize on. Roughly speaking, a model like GPT-4o lands around a cent per invoice at this prompt and output length. A local Gemma 4 setup costs time up front, but effectively $0 per call after that.

My goal was simple:

  • take OCR text from Indian GST invoices
  • extract 22 fields into strict JSON
  • fine-tune locally instead of paying per document to a hosted model

This was not a benchmark project. It was a practical test of what a small local Gemma 4 model can actually learn.

Why Gemma 4 E2B was the right model to try

I did not need a general-purpose assistant. I needed a model that could learn a narrow, structured task and run locally.

That made Gemma 4 E2B interesting for three reasons:

  1. It is small enough to experiment with on local hardware.
  2. It is capable enough to handle long, messy invoice OCR.
  3. It is open enough to fine-tune and evaluate honestly.

The instruction-tuned google/gemma-4-E2B-it model gave me a real starting point, not just a base model that needed a large GPU cluster to become useful.

I ran LoRA fine-tuning with:

  • model: google/gemma-4-E2B-it
  • framework: MLX-LM
  • trainable params: 7.291M / 4647.450M
  • trainable fraction: 0.157%
  • hardware: Mac
  • peak memory during stable runs: roughly 12.4 GB to 13.0 GB
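A run with this shape can be launched through MLX-LM's LoRA entry point. The command below is a sketch, not my exact invocation: the data path, batch size, and iteration count are illustrative values, and flag names can vary between MLX-LM versions, so check `mlx_lm.lora --help` before copying it.

```shell
# Hedged sketch of a local LoRA run with MLX-LM.
# Assumes ./data contains train.jsonl and valid.jsonl in the format
# your installed mlx_lm version expects; all numbers are placeholders.
python -m mlx_lm.lora \
  --model google/gemma-4-E2B-it \
  --train \
  --data ./data \
  --batch-size 1 \
  --iters 300 \
  --steps-per-eval 25 \
  --adapter-path ./adapters
```

The small batch size is what keeps peak memory in the 12–13 GB range on a Mac; the adapter weights land in `./adapters` rather than modifying the base model.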

That was the first encouraging sign. This was not theory. A small Gemma 4 model could be trained locally on a real business-shaped extraction task.

The task

The extraction target was a strict 22-field JSON schema for Indian GST invoices:

  • supplier identity
  • buyer identity
  • invoice number and dates
  • place of supply
  • HSN or SAC
  • description
  • taxable value
  • tax rates and amounts
  • total invoice
  • reverse charge
  • e-invoice IRN

The downstream requirement was not "answer roughly correctly." It was:

  • valid JSON
  • stable field typing
  • exact field mapping

That is a much harder and more useful task than general summarization.
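Those three requirements are mechanical enough to check in code. Here is a minimal sketch of the kind of validator I mean; the field names and types are a hypothetical subset of the 22-field schema, not the full real one:

```python
import json

# Hypothetical subset of the 22-field schema: field name -> expected type(s).
# The real schema has 22 fields; these names are illustrative.
SCHEMA = {
    "supplier_name": str,
    "supplier_gstin": str,
    "invoice_no": str,
    "cgst_rate": (int, float),
    "cgst_amt": (int, float),
    "total_invoice": (int, float),
    "reverse_charge": str,
}

def validate(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    problems = []
    for field, expected in SCHEMA.items():
        if field not in obj:
            problems.append(f"missing field: {field}")
        elif not isinstance(obj[field], expected):
            problems.append(f"wrong type for {field}: {type(obj[field]).__name__}")
    return problems
```

A check like this is what turns "roughly correct" into a pass/fail signal you can count across a holdout set.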

What fine-tuning actually changed

The fastest way to see the difference between a "generic capable model" and a "task-adapted model" is to look at one invoice.

Before fine-tuning, the baseline model was capable of understanding the document but not disciplined enough to behave like an extractor. In some runs, it produced malformed JSON, mixed reasoning-style text into the answer, or mapped totals into the wrong fields.

After fine-tuning, the same invoice started producing compact, structured outputs with the fields it had learned reliably:

```json
{
  "supplier_name": "Sample Supplier Pvt Ltd",
  "supplier_gstin": "27XXXXXXXXXX1ZX",
  "invoice_no": "INV-001",
  "invoice_date": "16-02-2026",
  "cgst_rate": 0.09,
  "cgst_amt": 285.3,
  "sgst_amt": 285.3,
  "total_invoice": 3741,
  "igst_rate": 0.0,
  "igst_amt": 0,
  "reverse_charge": "No"
}
```

That did not mean the model was finished. Fields like taxable_value still need more real training examples to get right, which is where the project is heading next. But it had crossed the line from "general model guessing at documents" to "specialized extractor that can be improved with data."

The first version: synthetic data was enough to build the pipeline

I started with synthetic data because I did not have a large labeled corpus of invoices.

That synthetic pipeline gave me:

  • OCR-like invoice text
  • paired 22-field targets
  • tax arithmetic coverage
  • repeatable training exports
  • a way to debug LoRA and evaluation locally
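As a sketch of what "tax arithmetic coverage" means in practice: generate a taxable value and a GST rate, then derive CGST/SGST so the OCR text and the target JSON are internally consistent. The field names and OCR template below are illustrative, not my actual generator:

```python
import random

def make_synthetic_example(seed: int) -> dict:
    """Generate one OCR-text / target-JSON pair with consistent GST arithmetic.

    Illustrative sketch of an intra-state invoice: the tax splits evenly
    into CGST and SGST, and IGST is zero by construction.
    """
    rng = random.Random(seed)                      # deterministic per seed
    taxable = round(rng.uniform(500, 50_000), 2)
    rate = rng.choice([0.05, 0.12, 0.18, 0.28])    # common GST slabs
    half = round(taxable * rate / 2, 2)            # CGST = SGST = rate / 2
    total = round(taxable + 2 * half, 2)
    target = {
        "taxable_value": taxable,
        "cgst_rate": rate / 2,
        "sgst_rate": rate / 2,
        "cgst_amt": half,
        "sgst_amt": half,
        "igst_amt": 0.0,
        "total_invoice": total,
    }
    ocr_text = (
        f"Invoice No: INV-{seed:04d}\n"
        f"Taxable Value: {taxable}\n"
        f"CGST @ {rate * 50:.1f}%: {half}\n"
        f"SGST @ {rate * 50:.1f}%: {half}\n"
        f"Total: {total}"
    )
    return {"prompt": ocr_text, "target": target}
```

Because the arithmetic is derived rather than sampled independently, every training example teaches the model that totals must reconcile with line taxes.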

The first clean synthetic-only run looked excellent.

Validation loss on the synthetic holdout improved from:

  • 0.552 at iteration 1
  • to 0.024 at iteration 300

On paper, that looked close to done.

But synthetic validation was measuring whether the model understood the synthetic world I had created, not whether it understood real invoices from real suppliers.

That distinction ended up shaping the whole project.

The real work was data engineering

The model was not the hard part.

The hard part was teaching the model what real invoice variance looks like.

I eventually built the dataset in layers:

Layer 1: generic synthetic invoices

These were useful for:

  • schema coverage
  • GST arithmetic patterns
  • JSON output discipline
  • basic extraction behavior

Layer 2: real annotated invoices

I merged and cleaned real invoice annotations into a single CSV:

  • 28 real invoices
  • 22 unique suppliers
  • a mix of PDF and image invoices

Before retraining, I split them into:

  • 20 real train invoices
  • 8 real holdout invoices
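A split like that is worth pinning down in code so it never silently changes between runs. A minimal sketch, assuming one id per invoice; a real version might also group by supplier so no supplier appears on both sides:

```python
import random

def split_holdout(invoice_ids: list[str], holdout_size: int, seed: int = 0):
    """Deterministically split invoice ids into (train, holdout).

    Sorting before a seeded shuffle makes the split reproducible, so the
    real holdout set is fixed once and never leaks into training.
    """
    ids = sorted(invoice_ids)              # canonical order before shuffling
    random.Random(seed).shuffle(ids)
    return ids[holdout_size:], ids[:holdout_size]
```

With 28 invoice ids and `holdout_size=8`, this yields the 20/8 split used here, and rerunning it reproduces the same partition byte for byte.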

That was the first time I had a real evaluation set that could tell me something meaningful.

Layer 3: Archive-derived layout variants

This was the most important change to the dataset.

Instead of generating more generic synthetic invoices, I reused the structure of real invoice layouts from an Archive/ folder:

  • alternate labels like No., Bill No., GST No
  • weakly labeled subtotal rows
  • dense table layouts
  • multiline descriptions
  • inconsistent spacing
  • subtotal-only item blocks

From those real layouts, I generated synthetic OCR variants that preserved layout difficulty while changing values and identities.
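The mechanism is simple to sketch: treat a real layout as a template whose labels, spacing, and line structure are frozen, and substitute only the values. The template below is a toy stand-in for a real Archive layout, and the field names are illustrative:

```python
import random
import string

# Toy stand-in for one real layout: the odd labels ("Bill No.", "GST No")
# and spacing are kept verbatim; only the {placeholders} change.
LAYOUT = (
    "Bill No. : {invoice_no}\n"
    "GST No   : {gstin}\n"
    "   Sub Total          {subtotal}"
)

def make_variant(seed: int) -> tuple[str, dict]:
    """Fill one frozen layout with fresh values; return (ocr_text, target)."""
    rng = random.Random(seed)
    values = {
        "invoice_no": f"B-{rng.randint(100, 999)}",
        # GSTIN shape: 2-digit state code + 13 further characters.
        "gstin": "27" + "".join(
            rng.choices(string.ascii_uppercase + string.digits, k=13)
        ),
        "subtotal": f"{rng.uniform(1000, 9000):.2f}",
    }
    return LAYOUT.format(**values), values
```

The point is that the hard part of the example, its layout, is inherited from a real document, while the values stay synthetic and label-complete.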

The final hybrid training mix was:

  • 250 generic synthetic examples
  • 360 Archive-layout variants
  • 8 exact real OCR train examples matched back to source documents

Validation used:

  • 8 held-out real invoices

That was the first dataset composition that looked like a real fine-tuning strategy rather than a synthetic demo.

What Gemma 4 E2B actually learned

This is the part I think matters most for anyone considering Gemma 4 for domain adaptation.

Gemma 4 E2B clearly learned the task.

Not in the vague sense of "it sounded good," but in the operational sense:

  • JSON stayed structurally stable
  • The model learned invoice field boundaries
  • It handled many supplier layouts
  • It converged reliably on a Mac with a tiny trainable parameter budget

The most meaningful run was the first hybrid real-holdout run.

Its validation loss on the real holdout set improved from:

  • 0.786 at iteration 1
  • to 0.132 at iteration 250

Then I fixed issues in the Archive-variant generator:

  • removed annotation-format leakage like @ ₹xxx = ₹xxx
  • parsed real per-line amounts instead of splitting totals evenly
  • forced igst = 0 for intra-state invoices in the target JSON
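The intra-state rule is derivable from the GSTINs themselves: the first two characters of a GSTIN encode the state, so when supplier and buyer share a state code the supply is intra-state and IGST must be zero. A sketch of that normalization, with field names assumed to match the example output above:

```python
def normalize_tax_fields(target: dict) -> dict:
    """Force igst = 0 on intra-state invoices in a target JSON.

    GSTIN format: the first two characters are the state code, so matching
    prefixes mean an intra-state supply (CGST + SGST, no IGST).
    Field names mirror the example output; adjust to your schema.
    """
    fixed = dict(target)                      # do not mutate the input
    supplier = fixed.get("supplier_gstin", "")
    buyer = fixed.get("buyer_gstin", "")
    if len(supplier) >= 2 and supplier[:2] == buyer[:2]:
        fixed["igst_rate"] = 0.0
        fixed["igst_amt"] = 0.0
    return fixed
```

Applying this at export time means the trainer never sees a target that contradicts GST rules, which is exactly the kind of leakage the Archive-variant fixes removed.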

That produced a slightly better run:

  • final real-holdout validation loss: 0.130

I also tried a prompt change specifically aimed at multiline extraction. That got worse, not better:

  • final validation loss: 0.147

I would not overstate that result. On an 8-invoice real holdout, that difference is not strong enough to claim that prompt engineering is harmful in general.

What it did show me is something narrower and more useful:

  • The prompt tweak did not clearly beat the best data-only run
  • The biggest gains in this project came from dataset composition, not instruction wording

What this taught me about Gemma 4

The headline lesson was not that Gemma 4 needed heroic prompt engineering.

It was that Gemma 4 E2B was already capable enough that the next bottleneck was dataset quality.

That is a good sign for the model.

Small models become interesting when they are strong enough that your time moves from "can this model learn the task at all?" to "what data do I need to make it trustworthy?"

That is where this project ended up.

What went wrong, and why it was useful

I do not think a useful Gemma 4 write-up should pretend everything worked on the first run.

Two failures were especially instructive.

Failure 1: overfitting after a great checkpoint

In an earlier run, validation loss got very low and then degraded badly later:

  • 0.552 at iteration 1
  • 0.022 at iteration 200
  • 1.397 at iteration 400

That taught me:

  • The best checkpoint is not necessarily the last checkpoint
  • Dense evaluation checkpoints matter
  • Small local runs can still overtrain quickly
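The fix is mechanical: record validation loss at every checkpoint and keep the argmin, not the final one. A sketch, assuming a list of (iteration, val_loss) pairs like the curve above:

```python
def best_checkpoint(history: list[tuple[int, float]]) -> tuple[int, float]:
    """Return the (iteration, val_loss) pair with the lowest validation loss.

    On the overfit run's curve, this picks iteration 200,
    not the final iteration 400.
    """
    return min(history, key=lambda point: point[1])
```

Paired with dense eval checkpoints, this is what rescues a run that overtrains after a good minimum.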

Failure 2: NaN training from sequence problems

Another run looked healthy until long examples were truncated. After that:

  • train loss became NaN
  • validation loss became NaN
  • The rest of the run was unusable

That forced me to treat dataset export and sequence control as first-class parts of the training pipeline, not cleanup tasks.
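The guard I mean is a pre-export length check: drop or flag any example that would be truncated, instead of letting the trainer silently clip it mid-run. Whitespace splitting is a crude proxy for the real tokenizer here; a real pipeline should count tokens with the same tokenizer the trainer uses:

```python
def filter_by_length(examples: list[dict], max_tokens: int):
    """Split examples into (kept, dropped) by an approximate token count.

    Whitespace splitting is a stand-in for the model tokenizer; swap in
    the trainer's tokenizer so truncation can never surprise you mid-run.
    Each example is assumed to carry its full text under a "text" key.
    """
    kept, dropped = [], []
    for ex in examples:
        n = len(ex["text"].split())
        (kept if n <= max_tokens else dropped).append(ex)
    return kept, dropped
```

Logging `dropped` at export time also tells you which real invoices are too long to train on at all, which is its own data-coverage signal.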

Both failures improved the eventual Gemma 4 setup more than another round of prompt edits would have.

The practical result

I would summarize the outcome this way:

  • Gemma 4 E2B was strong enough to learn a real structured extraction task locally
  • LoRA fine-tuning worked within a very small trainable parameter budget
  • The model benefited more from better data composition than from prompt tweaking
  • Synthetic data was useful for bootstrapping, but real layout variance determined what I could trust

That is a very good place for a small, open model.

If I were starting again

I would do three things earlier:

  1. Create a real holdout set before the first "good" run.
  2. Build layout-derived synthetic variants before scaling generic synthetic data.
  3. Evaluate field-level errors on real invoices sooner instead of trusting synthetic validation curves.
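Field-level evaluation is cheap to sketch: compare predicted and gold JSON per field instead of relying on one loss number. Field names below are illustrative:

```python
from collections import Counter

def field_accuracy(pairs: list[tuple[dict, dict]]) -> dict:
    """Per-field exact-match accuracy over (predicted, gold) JSON pairs.

    A single loss hides which fields fail; this surfaces, for example,
    a lagging taxable_value even when the loss curve looks flat.
    """
    hits, totals = Counter(), Counter()
    for pred, gold in pairs:
        for field, value in gold.items():
            totals[field] += 1
            hits[field] += int(pred.get(field) == value)
    return {field: hits[field] / totals[field] for field in totals}
```

Run against the 8-invoice real holdout, a table like this tells you exactly which of the 22 fields need more training examples.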

Those are not just invoice lessons. They are reusable lessons for anyone fine-tuning Gemma 4 on domain-specific extraction tasks.

Why this made me more optimistic about Gemma 4, not less

The most interesting thing about this project was not that Gemma 4 E2B solved everything immediately.

It was that a local, open, small model got far enough that the real work shifted to data design, evaluation discipline, and layout coverage.

That is exactly the kind of capability I want from an open model.

Not a toy.
Not a benchmark artifact.
A model that is small enough to run locally, but capable enough to deserve serious data engineering.

For this task, Gemma 4 E2B crossed that line.
