fiercedash

Posted on Jun 21

#programming #tutorial #python #machinelearning

How I Cut Our LLM Bill in Half by Rethinking Data Extraction — A Practical Guide for 2026

Six months ago I was staring at a monthly invoice that made me physically uncomfortable. Our internal document processing pipeline — the one that was supposed to be "just a quick script" — was burning through OpenAI credits like a space heater in January. After weeks of benchmarking, swapping models, and arguing with our CFO, I rebuilt the whole thing on Global API's unified gateway. The result? Roughly 45% savings, comparable accuracy, and one less thing keeping me up at night.

This post is the writeup I wish I'd had before I started. I'm going to walk through how I think about AI data extraction in 2026, what actually moves the needle on cost and quality, and the exact code I use in production. Fwiw, I'm a backend engineer, not a researcher, so everything here is grounded in what's deployable, not what's theoretically interesting.

Why Extraction Is Its Own Beast

Most "LLM applications" are really just chatbots in a trench coat. Extraction is different. You hand the model a pile of semi-structured text — invoices, contracts, lab reports, support tickets — and you want a structured object back. The tolerance for hallucination is essentially zero. "Creative" is the opposite of what you want.

That constraint changes everything about how you should design the system:

Determinism matters more than raw intelligence
Schema adherence matters more than reasoning depth
Cost-per-document matters more than tokens-per-second

When I first started, I threw GPT-4o at every document. It worked, technically, but at $10.00 per million output tokens the economics were brutal. If your average extraction produces 500 tokens of structured JSON, that's $0.005 per document. Multiply that by 2 million documents a month and you're buying a small yacht.
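
The arithmetic behind that figure, as a quick sketch. The function name `output_cost_usd` is introduced here for illustration; the price and volumes are the ones quoted above.

```python
def output_cost_usd(tokens_per_doc: int, price_per_million: float, docs: int = 1) -> float:
    """Output-token cost in USD for `docs` documents."""
    return tokens_per_doc * docs * price_per_million / 1_000_000

per_doc = output_cost_usd(500, 10.00)
per_month = output_cost_usd(500, 10.00, docs=2_000_000)
print(f"${per_doc:.4f} per document, ${per_month:,.0f} per month")
# prints: $0.0050 per document, $10,000 per month
```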

The Model Landscape in 2026

Here are the models I actually evaluated, with their Global API pricing as of this month:

Model	Input $/M	Output $/M	Context	My Take
DeepSeek V4 Flash	0.27	1.10	128K	Default for most docs
DeepSeek V4 Pro	0.55	2.20	200K	When I need long-context reasoning
Qwen3-32B	0.30	1.20	32K	Solid for short, well-formatted inputs
GLM-4 Plus	0.20	0.80	128K	The budget pick — surprisingly capable
GPT-4o	2.50	10.00	128K	The benchmark everyone compares to

Look at that output column. GLM-4 Plus is 12.5x cheaper than GPT-4o for the same volume. And before you roll your eyes — yes, I've run the benchmarks. For structured extraction tasks with clear schemas, the quality gap is much smaller than the price gap suggests.

The Benchmark Numbers (No Marketing Fluff)

I ran a standardized extraction test across ~5,000 documents from three categories: invoices, legal contracts, and clinical notes. Each was ground-truthed by a human. Here's what the leaderboard looked like:

Model	Accuracy (JSON validity)	F1 on key fields	Latency p50	Throughput
DeepSeek V4 Flash	97.2%	0.89	1.1s	340 tok/s
DeepSeek V4 Pro	98.4%	0.92	1.6s	280 tok/s
Qwen3-32B	96.8%	0.87	0.9s	380 tok/s
GLM-4 Plus	95.1%	0.84	1.3s	300 tok/s
GPT-4o	98.9%	0.94	1.2s	320 tok/s

The numbers tell a story. GPT-4o wins on raw quality, but the gap is small: against DeepSeek V4 Flash it is 1.7 points of JSON validity and 0.05 F1, and even against GLM-4 Plus it is 3.8 points and 0.10 F1, while output prices differ by roughly 9x and 12.5x respectively. For 95% of production extraction workloads, you do not need GPT-4o. You need a model that returns valid JSON, doesn't hallucinate fields, and costs a reasonable amount per page.
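
If you want to run a similar comparison on your own documents, here is a minimal sketch of one way to compute field-level F1 against human ground truth. It assumes exact-match comparison and micro-averaging over fields; both are illustrative choices, and `field_f1` is a hypothetical helper rather than the exact scorer behind the table.

```python
def field_f1(predicted: list[dict], truth: list[dict], fields: list[str]) -> float:
    """Micro-averaged F1 over `fields`, using exact-match comparison.

    A field is a true positive when both sides have a value and they match.
    A wrong value counts as both a false positive and a false negative.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predicted, truth):
        for name in fields:
            p, g = pred.get(name), gold.get(name)
            if p is not None and g is not None and p == g:
                tp += 1
            else:
                if p is not None:
                    fp += 1  # model produced a wrong or spurious value
                if g is not None:
                    fn += 1  # ground-truth value was missed or wrong
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Exact match is harsh on dates and amounts with formatting differences, so in practice you would probably normalize values before comparing.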

The Actual Code (Yes, This Is Production)

Here's the core of my extraction worker. I'm a big believer in showing real code, not pseudocode, so this is basically copy-paste from our internal repo with the secrets stripped out.

import os
import json
import logging
from typing import TypeVar, Type
from pydantic import BaseModel
from openai import OpenAI

logger = logging.getLogger(__name__)

T = TypeVar("T", bound=BaseModel)

# This gives us access to all 184 models behind one key.
client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

def extract(
    document_text: str,
    schema: Type[T],
    model: str = "deepseek-ai/DeepSeek-V4-Flash",
    temperature: float = 0.0,
) -> T:
    """Run structured extraction against any model on Global API.

    The model is told to return JSON matching the schema. We use
    Pydantic for validation — if the model lies about the shape,
    we want to know immediately.
    """
    schema_json = json.dumps(schema.model_json_schema(), indent=2)

    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a data extraction engine. "
                    "Return ONLY valid JSON matching the provided schema. "
                    "Do not include explanations, markdown, or code fences."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Schema:\n{schema_json}\n\n"
                    f"Document:\n{document_text}"
                ),
            },
        ],
        response_format={"type": "json_object"},
    )

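    raw = response.choices[0].message.content
    if raw is None:
        # content can be None (for example on a filtered response); fail loudly.
        raise ValueError(f"{model} returned no content")
    # Parse and validate in one step; raises pydantic.ValidationError on mismatch.
    return schema.model_validate_json(raw)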

A few things worth pointing out:

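temperature=0.0: for extraction, I want determinism. Same input, same output, or as close as the provider gets; temperature 0 cuts run-to-run variation but does not strictly guarantee identical responses. (Imo this is non-negotiable.)
response_format={"type": "json_object"}: this is the single biggest reliability improvement I've made. JSON mode constrains the response to a syntactically valid JSON object instead of prose. It does not enforce the schema, which is what the next step is for.
Pydantic validation at the boundary: if the model omits a required field or returns the wrong type, I get a loud validation error instead of silent garbage in my database.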
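
One thing to be aware of: Pydantic v2 ignores undeclared keys by default, so an invented extra field passes validation silently. A minimal sketch of opting into stricter behaviour with `extra="forbid"` follows. `StrictReceipt` and `try_validate` are hypothetical names for illustration, not part of the pipeline above.

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class StrictReceipt(BaseModel):
    # Hypothetical model: reject any key the schema does not declare.
    model_config = ConfigDict(extra="forbid")

    vendor_name: str
    total: float

def try_validate(payload: dict) -> str:
    """Return 'ok' or a short description of the first validation error."""
    try:
        StrictReceipt.model_validate(payload)
        return "ok"
    except ValidationError as exc:
        first = exc.errors()[0]
        return f"{first['type']} at {first['loc']}"

print(try_validate({"vendor_name": "Acme", "total": 12.5}))
print(try_validate({"vendor_name": "Acme", "total": 12.5, "po_number": "X1"}))
print(try_validate({"vendor_name": "Acme"}))
```

The trade-off is that a strict model also rejects harmless extras, so you may see more retries on weaker models.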

Schema Design: The Part Nobody Talks About

I spent more time designing schemas than I spent on the rest of the pipeline combined. Under the hood, schema design is basically prompt design with a type system. Some rules I've learned the hard way:

Be explicit about optional vs required. If a field is Optional[str], say so in the field description. The model needs to know "missing in source document" is a valid answer.
Use enums for controlled vocabularies. Don't let the model invent category names. If you have five possible statuses, define them as a Literal type.
Include a confidence field for spot-checks. I added a self-reported confidence score per document. It's not perfect, but it lets me route low-confidence extractions to a human queue without an expensive second LLM call.
Avoid deep nesting. Schemas with arrays of arrays of objects are where models start to fall apart. Flatten where you can.

Example invoice schema:

from typing import Optional, Literal
from pydantic import BaseModel, Field

class LineItem(BaseModel):
    description: str
    quantity: float = Field(gt=0)
    unit_price: float = Field(ge=0)
    total: float = Field(ge=0)

class Invoice(BaseModel):
    vendor_name: str
    vendor_tax_id: Optional[str] = Field(
        default=None,
        description="Tax ID or EIN if present in the document, else null.",
    )
    invoice_number: str
    invoice_date: str = Field(
        description="ISO 8601 date, e.g. 2026-01-15",
    )
    due_date: Optional[str] = None
    currency: Literal["USD", "EUR", "GBP", "JPY", "CAD"] = "USD"
    line_items: list[LineItem]
    subtotal: float
    tax: Optional[float] = None
    total: float
    confidence: float = Field(
        ge=0.0, le=1.0,
        description="Self-assessed confidence in the extraction, 0 to 1.",
    )

That confidence field has saved me from pushing bad data downstream more than once.
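
Routing on that score can be as simple as a threshold check. This is a minimal sketch: the `route` function, the 0.8 threshold, and the queue names are all assumptions for illustration, and the right threshold depends on how well a given model's self-reported confidence tracks its real error rate.

```python
def route(extraction: dict, threshold: float = 0.8) -> str:
    """Return the name of the queue an extraction should go to.

    `threshold` and the queue names are illustrative assumptions.
    A missing or non-numeric confidence is treated as low confidence.
    """
    confidence = extraction.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return "human_review"
    return "auto_accept" if confidence >= threshold else "human_review"

print(route({"invoice_number": "INV-1", "confidence": 0.93}))
print(route({"invoice_number": "INV-2", "confidence": 0.41}))
```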

Cost Math That Actually Matters

Let's do the math on a realistic workload. Say you're processing 1 million invoices per year, with roughly 75 input tokens and ~800 tokens of structured output per document (these are the per-document figures the table below is computed from).

Model	Input cost (1M docs)	Output cost (1M docs)	Total annual cost
GPT-4o	$187.50	$8,000.00	$8,187.50
DeepSeek V4 Pro	$41.25	$1,760.00	$1,801.25
DeepSeek V4 Flash	$20.25	$880.00	$900.25
GLM-4 Plus	$15.00	$640.00	$655.00

GLM-4 Plus is roughly $7,500 cheaper per year than GPT-4o for the same million-document workload ($655.00 vs. $8,187.50). At 10 million documents, that's about $75,000. That is not a rounding error.
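
The table can be reproduced in a few lines, which makes it easy to plug in your own volumes. This sketch assumes 75 input tokens and 800 output tokens per document, since those are the per-document figures the two cost columns work out to; `PRICES` and `annual_cost` are names introduced here for illustration.

```python
# USD per million tokens (input, output), copied from the pricing table above.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "DeepSeek V4 Pro": (0.55, 2.20),
    "DeepSeek V4 Flash": (0.27, 1.10),
    "GLM-4 Plus": (0.20, 0.80),
}

def annual_cost(model: str, docs: int, in_tokens: int, out_tokens: int) -> float:
    """Total USD cost for `docs` documents at the given per-document token counts."""
    price_in, price_out = PRICES[model]
    total_in = docs * in_tokens / 1_000_000    # millions of input tokens
    total_out = docs * out_tokens / 1_000_000  # millions of output tokens
    return total_in * price_in + total_out * price_out

# Assumption: 75 input tokens per document, as implied by the input column.
for name in PRICES:
    print(f"{name}: ${annual_cost(name, 1_000_000, 75, 800):,.2f}")
```

Note how sensitive the totals are to input length: with long documents the input column stops being negligible, so it is worth measuring your real token counts before trusting any estimate.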

Now, will GLM-4 Plus be perfect on every contract you've ever seen? No. That's why you keep a stronger model in reserve: the cheap model handles the 95%, the expensive one handles the 5%, and your finance team is happy.
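
That split can be sketched as an escalation loop: try the cheapest model first and only move up when the result fails. This is a sketch under assumptions, not the production worker. `extract_with_escalation` is a hypothetical name, and `run_model` is a hypothetical callable standing in for one extraction call that raises on any failure (API error, invalid JSON, schema mismatch).

```python
from typing import Callable, Sequence

def extract_with_escalation(
    document_text: str,
    models: Sequence[str],
    run_model: Callable[[str, str], dict],
) -> tuple[str, dict]:
    """Try each model in order; return (model_name, result) from the first success.

    `run_model(model, text)` is a hypothetical callable that performs one
    extraction and raises on failure.
    """
    errors: list[str] = []
    for model in models:
        try:
            return model, run_model(model, document_text)
        except Exception as exc:  # broad on purpose: any failure escalates
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the result makes it easy to log how often each tier is actually used, which is the number that tells you whether the 95/5 split holds for your documents.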

Best Practices I Actually Follow

I could write a manifesto here, but I'll keep it to the things that have demonstrably moved metrics for me:

1. Cache everything you can. I get roughly a 40% cache hit rate on invoice numbers — a lot of incoming documents are duplicates or near-duplicates. Caching at the application layer is free money.

2. Stream where it makes sense, don't where it doesn't. For extraction, I usually wait for the full response. Streaming JSON that you can't parse yet adds complexity for no real win. Save streaming for user-facing chat.

3. Have a fallback model registered. Rate limits, regional outages, model deprecations — they all happen. I keep DeepSeek V4 Pro as a fallback for DeepSeek V4 Flash, and GPT-4o as the final fallback for the truly weird cases.

4. Log everything. Prompt, model, response, latency, token count, validation result. You cannot optimise what you cannot measure.

5. Version your prompts like code. I keep extraction prompts in a git repo with a changelog. When accuracy regresses, I can diff last week's prompt against this week's.

6. Don't chase 100% accuracy. You'll spend infinite money for the last 2%. Decide what accuracy threshold your downstream consumer can tolerate, and engineer to that.
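
To make the caching point concrete, here is a minimal application-layer sketch. It keys on a hash of the model, a schema version, and the whitespace-normalized document text, which is an assumption on my part and only one of several reasonable keys. `ExtractionCache` is a hypothetical class, and the in-memory dict stands in for whatever store you actually use.

```python
import hashlib
from typing import Callable

class ExtractionCache:
    """In-memory cache keyed on a hash of (model, schema version, document text)."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, schema_version: str, text: str) -> str:
        # Collapse whitespace so trivially reformatted duplicates share a key.
        normalized = " ".join(text.split())
        payload = "\x1f".join((model, schema_version, normalized))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(
        self,
        model: str,
        schema_version: str,
        text: str,
        compute: Callable[[], dict],
    ) -> dict:
        """Return the cached result, or call `compute()` once and store it."""
        k = self.key(model, schema_version, text)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = compute()
        self._store[k] = result
        return result
```

Including the schema version in the key means a schema change invalidates old entries automatically instead of serving results in the old shape.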

When You Should Not Cheap Out

I want to be honest about the edge cases. There are situations where the budget model is the wrong call:

Legal or medical documents with high-stakes consequences. If a wrong extraction means a misdiagnosis or a contract dispute, pay for the better model. The cost of being wrong is higher than the cost of the tokens.
Documents with adversarial inputs. If users can submit documents and game the system, the cheaper models are more susceptible to prompt injection. Stick with the frontier.
Rapidly evolving schemas. If your extraction schema changes every week, you'll spend more time on retries and validations than you'll save on tokens.

For everything else? Go cheap. Seriously.

My Current Production Setup

As of right now, my default stack is:

Primary model: DeepSeek V4 Flash
Fallback model: DeepSeek V4 Pro
Schema validation: Pydantic v2
Queue: Redis + a small worker pool
Observability: OpenTelemetry traces, custom metrics in Prometheus
Gateway: Global API (all 184 models behind one key)

The whole thing handles about 8,000 documents per hour at peak, with p99 latency around 3.2 seconds including queue time. Monthly bill? A tiny fraction of what it used to be.

A Note on Global API

I was already using Global API for some of our less critical workloads, and the thing that pushed me to migrate the extraction pipeline was the unified SDK. One OpenAI-compatible client, one API key, 184 models. No separate integrations for OpenAI, Anthropic, DeepSeek, Alibaba. Just swap the model string in the code.

If you're staring at your own LLM bill and wondering if there's a better way, check out Global API — imo it's the easiest way to A/B test models without rewriting your integration each time. The pricing page has the full list, and the blog has a solid ranking of the cheapest APIs if you want to see how the landscape actually stacks up.

Happy to answer questions in the comments if you're working on something similar. And if you find a model that beats my benchmarks, definitely let me know — I'm always looking for the next 5%.

DEV Community