Introduction
Businesses receive hundreds of documents daily — invoices, receipts, contracts — arriving as PDFs, images, and spreadsheets, all mixed together. Someone has to open each one, figure out what it is, and manually pull the relevant data into a database. It's slow, error-prone, and doesn't scale.
In this tutorial I'll show you how to build a pipeline using the Gemini Vision API that automatically classifies an incoming document and extracts its key fields in a single API call, with no preprocessing required.
By the end you'll have working Python code that takes any PDF or image as input and returns clean structured JSON, ready to pipe into a database or downstream workflow.
Why Gemini Vision?
When it comes to reading documents, the first instinct is usually OCR. In my case, OCR fell apart quickly for two main reasons:
Inconsistent layouts — there was no standardized column structure or consistent dimensions across documents. Add to that fields that existed outside the main table entirely, and you have OCR's worst nightmare. A 5-column table with a full name or date sitting outside it — OCR could detect the text but had no reliable way to place it correctly.
Approval signatures — one of the most critical data points was whether a document had been signed off by an authorized signatory. OCR sees pixels, not context. It had no reliable way to detect this.
This is where Gemini Vision changes the equation. It understands document structure and semantics, not just characters on a page.
Gemini Vision approaches documents the way a human would. It reads context, not just characters. It understands that a signature in the bottom right corner represents approval, that a value sitting outside a table still belongs to the document's data, and that two columns with different headers can contain the same type of information. This makes it particularly powerful for real world documents where no two look exactly alike.
Setup
To get started you'll need a Gemini API key — grab one from Google AI Studio. Pricing is pay-per-use and negligible for development.
Install the SDK:
pip install google-generativeai
Then set up your imports and configure the client:
import os
import json
import base64
from pathlib import Path
import google.generativeai as genai
genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
model = genai.GenerativeModel("gemini-2.5-pro")
Storing the API key as an environment variable keeps it out of your codebase — never hardcode it directly.
Format Normalization
Before passing anything to Gemini, all incoming documents are normalized to PNG. The logic is straightforward — PDFs convert directly via PyMuPDF, Excel files go through LibreOffice first, and images convert in place. Everything that reaches Gemini is a PNG.
The pseudocode here is enough to understand the architecture before we get to the Gemini call.
function normalize_to_png(file_path):
detect mime_type of file
if mime_type is PDF:
convert pages to PNG using PyMuPDF
elif mime_type is Excel (.xlsx/.xls):
convert to PDF via LibreOffice
then convert PDF to PNG
elif mime_type is Image:
convert to PNG directly
else:
flag for manual review
return png_path
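The routing logic itself needs nothing beyond the standard library. Here's a minimal sketch of the dispatcher, with the actual converters left as labels (the PyMuPDF and LibreOffice steps are Part 2 material); `route_document` is an illustrative name, not code from the production pipeline:

```python
import mimetypes
from pathlib import Path

def route_document(file_path: str) -> str:
    """Decide which conversion path a file should take.
    Returns a label; the real converters are wired in separately."""
    suffix = Path(file_path).suffix.lower()
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type == "application/pdf":
        return "pdf_to_png"            # render pages with PyMuPDF
    if suffix in (".xlsx", ".xls"):
        return "excel_to_pdf_to_png"   # LibreOffice first, then PyMuPDF
    if mime_type and mime_type.startswith("image/"):
        return "image_to_png"          # direct in-place conversion
    return "manual_review"             # unknown type: flag it
```

Anything the dispatcher can't classify falls through to manual review rather than being guessed at, which mirrors the pipeline's overall philosophy.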
The full production implementation including LibreOffice on Cloud Run and Docker configuration is covered in Part 2 of this series.
Loading the Document
To pass an image to the Gemini API, we first encode it as a base64 string. The file is opened in binary mode ("rb") and encoded into a UTF-8 string that the API can receive in the request body.
def encode_image(self, image_path: str) -> str:
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')
The returned string gets passed directly into the data field of the image part in the next step.
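Concretely, the image part is a dict with `mime_type` and `data` keys, which is one of the inline-data shapes `generate_content` accepts. A small helper (the name `build_image_part` is mine, assuming the file has already been normalized to PNG) makes this explicit:

```python
import base64

def build_image_part(image_path: str) -> dict:
    """Read a PNG from disk and wrap it in the inline-data dict
    shape that generate_content accepts for image parts."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {"mime_type": "image/png", "data": encoded}
```

The resulting dict is then passed alongside the prompt, e.g. `model.generate_content([prompt, image_part])`.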
Classify and Extract in One Prompt
The process is built on retry logic whose strictness you can tune to the margin of error you're willing to accept. Greater leniency means more documents pass through automatically, but with a higher chance of error; stricter settings flag more for manual review but improve accuracy. The deciding input is a confidence score reflecting how certain the model is about each extracted value.
A key design decision here is combining classification and extraction into a single API call rather than two separate ones. The naive approach is to classify first, then extract in a second call. This doubles your API costs, doubles your latency, and introduces a failure point between the two calls. A single well structured prompt handles both in one shot, and the confidence score covers both the classification certainty and extraction quality simultaneously.
The prompt is the heart of the application. It has to be as detailed and explicit as possible. A well-structured prompt looks like this:
DOCUMENT EXTRACTION PROMPT
Step 1 — Classify: is this an invoice, receipt, contract, or other?
If other → flag for manual review, stop.
Step 2 — Check format: does it contain the required fields?
Step 3 — Extract non-tabular data: vendor name, date, document number
Step 4 — Extract line items and individual values
Step 5 — Sum all values, cross-reference with reported total
If mismatch → reduce confidence score
Step 6 — Check for approval status
Return ONLY this JSON structure, do not guess any values:
{
"doc_type": "",
"confidence": 0.0,
"fields": {...},
"approval_status": {},
"needs_review": false
}
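Since the prompt pins down an exact JSON contract, it's worth enforcing that contract in code rather than trusting the reply. A minimal validator (my own helper, not part of any SDK) raises on violations so the retry loop can catch them:

```python
import json

# The keys the extraction prompt instructs the model to return.
REQUIRED_KEYS = {"doc_type", "confidence", "fields", "approval_status", "needs_review"}

def validate_extraction(raw: str) -> dict:
    """Parse the model's reply and confirm the expected contract is met.
    Raises ValueError so a caller's retry loop can treat it as a failure."""
    result = json.loads(raw)
    missing = REQUIRED_KEYS - result.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(result["confidence"], (int, float)):
        raise ValueError("confidence must be numeric")
    return result
```

A reply that parses but lacks a required field is just as unusable as malformed JSON, so both get funneled into the same retry path.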
Production Hardening
Sometimes Gemini doesn't return bare JSON and instead wraps the object in markdown code fences. A quick check strips them before parsing:
if '```json' in content:
    json_str = content.split('```json')[1].split('```')[0].strip()
This strips the markdown fences before parsing, preventing a JSON decode error.
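Wrapped as a small helper (an illustrative refactor, not verbatim pipeline code), the same logic also covers fences without a language tag and passes clean replies through untouched:

````python
def strip_markdown_fences(content: str) -> str:
    """Remove ```json ... ``` (or bare ```) fences if present;
    return the content unchanged otherwise."""
    if "```json" in content:
        return content.split("```json")[1].split("```")[0].strip()
    if "```" in content:
        return content.split("```")[1].split("```")[0].strip()
    return content.strip()
````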
Even with clean JSON, parsing can still fail entirely. A max retry loop handles this well:
for attempt in range(1, self.max_retries + 1):
    try:
        response = self.model.generate_content([prompt, image_part])
        content = response.text
        if '```json' in content:
            content = content.split('```json')[1].split('```')[0].strip()
        result = json.loads(content)
        return result
    except json.JSONDecodeError:
        if attempt < self.max_retries:
            time.sleep(self.retry_delay)  # requires `import time`
return {"error": "failed after max retries", "confidence": 0.0}
If parsing fails, the pipeline waits and retries rather than returning bad data.
Finally, if the confidence level falls below the threshold we route it to manual review:
if result.get("confidence", 0) < self.confidence_threshold:
result["needs_review"] = True
return result
Anything below the threshold is immediately flagged for manual review instead of auto-processing.
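Pulled together, the routing decision is a few lines. This sketch uses a hypothetical `route_result` helper and an illustrative 0.85 default threshold; tune the number to your own error tolerance:

```python
def route_result(result: dict, threshold: float = 0.85) -> str:
    """Decide the downstream path for an extraction result."""
    if result.get("error"):
        return "manual_review"   # parsing never succeeded
    if result.get("confidence", 0.0) < threshold:
        return "manual_review"   # the model was unsure
    return "auto_process"        # clean, confident extraction
```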
Without these checks, the pipeline fails silently in production and returns wrong values instead of flagging for review. I'd also recommend integrating Google Cloud Logging here — it makes isolating failures in production significantly faster than print statements.
Real Output Example
Once the pipeline processes a document successfully, here's what the output looks like for an invoice:
{
"doc_type": "invoice",
"confidence": 0.92,
"fields": {
"vendor": "Acme Solutions Ltd",
"date": "2024-03-15",
"invoice_number": "INV-2024-0892",
"line_items": [
{ "description": "Software consultation", "quantity": 3, "unit_price": 850.00, "total": 2550.00 },
{ "description": "Infrastructure setup", "quantity": 1, "unit_price": 1700.00, "total": 1700.00 }
],
"subtotal": 4250.00,
"tax": 382.50,
"total_amount": 4632.50
},
"approval_status": {
"is_approved": true,
"approver": "J. Smith",
"approval_date": "2024-03-16"
},
"needs_review": false
}
A few things worth noting here. The confidence score of 0.92 clears the threshold, so this gets auto-processed with no human needed. The line items are extracted individually, so they can be inserted directly into a database row by row. The approval status comes from detecting a signature or stamp on the document: explicitly found, not guessed. And `needs_review: false` means this JSON goes straight into the downstream workflow without interruption.
A lower confidence score, say 0.61, would flip `needs_review` to `true` and route the document to a manual review queue instead.
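Step 5 of the prompt asks the model to cross-check its own totals, but you can also verify the arithmetic after parsing as a cheap guard against hallucinated numbers. A sketch, assuming the field names from the sample output above:

```python
def totals_consistent(fields: dict, tolerance: float = 0.01) -> bool:
    """Recompute line-item sums and compare against reported totals."""
    line_sum = sum(item["total"] for item in fields.get("line_items", []))
    subtotal_ok = abs(line_sum - fields["subtotal"]) <= tolerance
    total_ok = abs(fields["subtotal"] + fields["tax"] - fields["total_amount"]) <= tolerance
    return subtotal_ok and total_ok
```

A failed check is another good reason to flip `needs_review` to `true` regardless of the model's reported confidence.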
Limitations and What's Next
Despite the retry logic and confidence scoring, some edge cases will always slip through. A sender splitting one invoice across two attachments will get flagged — the pipeline sees two incomplete documents, not one complete one. The opposite is also true: multiple invoices merged into a single file will confuse the extraction and likely land in manual review.
At scale, API costs add up — check the latest pricing on Google AI Studio before deploying to production.
This pipeline assumes your input is already a PNG. In practice, documents arrive as PDFs, Excel sheets, Word files, and images all mixed together. The next article covers building a format normalization layer that handles all of these and deploys on GCP Cloud Run — so this extraction pipeline always gets a clean input regardless of what the sender attached.