DEV Community

swift

I Wish I Knew AI Data Extraction Sooner — Here's the Full Breakdown

Six months ago I was stuck in a war room at 2am watching our document pipeline grind to a halt. The OCR service we were running in-house was choking on a 40% traffic spike from a new enterprise customer, and our p99 latency had drifted from 800ms to nearly 4 seconds. That night, I made a decision that I wish I had made months earlier: I ripped out our custom extraction stack and rebuilt the entire thing on top of Global API's model catalog. The result was a multi-region deployment hitting 99.9% uptime with sub-second p99 latency, and the bill dropped by more than half.

This is the full breakdown of how I got there, what it cost, and the patterns I now use as a default for any extraction workload.

Why Extraction Is Different From Generic LLM Workloads

When most engineers hear "LLM," they think chatbots, summarization, or RAG. Data extraction is a different beast entirely. You're typically dealing with structured outputs from semi-structured inputs — invoices, contracts, medical records, KYC forms — and the tolerance for hallucination is essentially zero. A chatbot that invents a fact is annoying. An extraction pipeline that invents a tax ID is a compliance incident.

That asymmetry changed the way I think about the problem. I don't optimize for cleverness or model size anymore. I optimize for three things:

  1. Predictable cost per document at scale
  2. p99 latency I can promise to customers in an SLA
  3. Graceful degradation when a region goes sideways

If a model can't deliver on those three, I don't care how good it looks on a leaderboard.

The Catalog That Made the Decision Easy

Global API currently exposes 184 AI models through a single OpenAI-compatible endpoint, with prices ranging from $0.01 to $3.50 per million tokens depending on the model tier. That's an absurd amount of optionality, and it's exactly what an architect needs when you're trying to right-size for a workload instead of forcing every job through a single oversized model.

Here's the slice of the catalog I actually evaluated for extraction:

Model              Input ($/M)  Output ($/M)  Context
DeepSeek V4 Flash  0.27         1.10          128K
DeepSeek V4 Pro    0.55         2.20          200K
Qwen3-32B          0.30         1.20          32K
GLM-4 Plus         0.20         0.80          128K
GPT-4o             2.50         10.00         128K

The interesting observation here is that the gap between the cheapest model (GLM-4 Plus at $0.20/$0.80) and the most expensive (GPT-4o at $2.50/$10.00) is 12.5x on both input and output. On a workload processing millions of pages per month, that gap is the difference between a cost center and a profitable product line.
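To make that gap concrete, here's a minimal sketch of the per-document arithmetic. The prices are copied from the table above; the 3,000-input / 500-output token document is a made-up example, not a measurement from my logs.

```python
# Prices in USD per million tokens (input, output), copied from the table above.
PRICES = {
    "GLM-4 Plus": (0.20, 0.80),
    "DeepSeek V4 Flash": (0.27, 1.10),
    "Qwen3-32B": (0.30, 1.20),
    "DeepSeek V4 Pro": (0.55, 2.20),
    "GPT-4o": (2.50, 10.00),
}


def cost_per_document(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one extraction call at list price."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical document size, chosen only for illustration.
cheap = cost_per_document("GLM-4 Plus", 3_000, 500)
pricey = cost_per_document("GPT-4o", 3_000, 500)

print(f"GLM-4 Plus: ${cheap:.4f} per document")
print(f"GPT-4o:     ${pricey:.4f} per document")
print(f"Ratio:      {pricey / cheap:.1f}x")
```

For that hypothetical document the cost works out to about $0.001 on GLM-4 Plus versus about $0.0125 on GPT-4o. Because both input and output prices differ by the same factor, the ratio stays at 12.5x whatever the input/output split is.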

The Architecture I Landed On

Here's what I built, and what I'd build again from scratch today.

Front door: A regional API gateway in front of Global API, deployed in us-east-1, eu-west-1, and ap-southeast-1. Each region runs its own request queue backed by SQS, so a bad partition in one region doesn't cascade into the others.

Router layer: A small classifier that looks at document complexity — page count, table density, language, schema strictness — and routes to one of three model tiers. Simple forms go to GLM-4 Plus. Mixed documents go to DeepSeek V4 Flash. Anything that needs reasoning over 100+ pages goes to DeepSeek V4 Pro.
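A router like this can be sketched as a handful of rules. This is an illustration, not the production classifier: the DocumentProfile fields, the "2 pages or fewer" and "table density under 0.1" cutoffs, and the English-only shortcut are all assumptions I picked for the example. Only the 100+ page rule and the three tier names come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentProfile:
    """Hypothetical features a router might compute before calling a model."""
    page_count: int
    table_density: float  # share of pages containing tables, 0.0 to 1.0
    language: str         # ISO 639-1 code such as "en"
    strict_schema: bool   # True when every output field is mandatory


def choose_tier(doc: DocumentProfile) -> str:
    """Map a document profile to one of the three model tiers."""
    # Long documents go to the largest-context tier (the 100+ page rule).
    if doc.page_count >= 100:
        return "DeepSeek V4 Pro"
    # Assumed definition of a "simple form": short, few tables, lenient schema.
    is_simple = (
        doc.page_count <= 2
        and doc.table_density < 0.1
        and doc.language == "en"
        and not doc.strict_schema
    )
    if is_simple:
        return "GLM-4 Plus"
    # Everything else is treated as a mixed document.
    return "DeepSeek V4 Flash"


print(choose_tier(DocumentProfile(1, 0.0, "en", False)))
print(choose_tier(DocumentProfile(12, 0.4, "de", True)))
print(choose_tier(DocumentProfile(240, 0.2, "en", True)))
```

Keeping the rules explicit makes routing decisions easy to audit. A learned classifier could replace choose_tier later without changing its callers.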

Cache layer: Redis in front of everything, keyed on a content hash of the document. We hit a 40% cache hit rate within the first week, and that alone cut our bill meaningfully without any quality tradeoff.
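Here's a rough sketch of content-hash caching. A plain dict stands in for Redis so the example runs anywhere. Putting the model name and a prompt version into the key is my own assumption (it keeps stale results from surviving a prompt change), and the helper names are hypothetical.

```python
import hashlib
from typing import Callable, MutableMapping


def cache_key(document_bytes: bytes, model: str, prompt_version: str) -> str:
    """Build a key from the document content, never from the filename."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    return f"extract:{model}:{prompt_version}:{digest}"


def cached_extract(
    document_bytes: bytes,
    model: str,
    prompt_version: str,
    cache: MutableMapping[str, str],
    extract_fn: Callable[[bytes], str],
) -> str:
    """Return the cached result if present; otherwise extract and store it."""
    key = cache_key(document_bytes, model, prompt_version)
    if key in cache:
        return cache[key]
    result = extract_fn(document_bytes)
    cache[key] = result
    return result


# Usage with an in-memory dict standing in for Redis.
store: dict[str, str] = {}
first = cached_extract(b"invoice body", "flash", "v1", store, lambda b: '{"ok": true}')
second = cached_extract(b"invoice body", "flash", "v1", store, lambda b: "never called")
print(first == second)
```

The second call never reaches the lambda because the key already exists, which is exactly the traffic a cache hit removes from the model bill.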

Observability: Every request emits a structured log with model, token counts, p50/p95/p99 latency, and a quality score from a downstream validator. The dashboards are boring on purpose. I want to see anomalies in five seconds, not five minutes.
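As a sketch of what one of those structured log lines could look like: the field names below are assumptions, and I'm assuming the p50/p95/p99 figures are aggregated downstream from the per-request latency_ms field rather than written by each request.

```python
import json
import time


def build_log_record(
    model: str,
    region: str,
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    quality_score: float,
) -> str:
    """Serialize one request's metrics as a single JSON log line."""
    record = {
        "ts": time.time(),
        "model": model,
        "region": region,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "latency_ms": round(latency_ms, 1),
        "quality_score": quality_score,
    }
    # sort_keys keeps the field order stable, which makes log diffs readable.
    return json.dumps(record, sort_keys=True)


print(build_log_record("deepseek-ai/DeepSeek-V4-Flash", "us-east-1", 2900, 410, 812.34, 0.97))
```

One JSON object per line is easy for most log pipelines to index, so the dashboard can group by model or region without any parsing rules.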

The SLA we publish to customers is 99.9% availability, and our actual rolling-30-day number is 99.94% — the headroom comes from the cache and from the multi-region fallback path. When one region gets slow, the gateway sheds traffic to a peer region before the user ever notices.
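For a sense of how much headroom that is, the availability figures convert to a downtime budget with simple arithmetic. A small sketch, assuming a 30-day window:

```python
def downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed (or observed) downtime for a given availability."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


sla_budget = downtime_minutes(99.9)
observed = downtime_minutes(99.94)

print(f"SLA budget: {sla_budget:.1f} min")
print(f"Observed:   {observed:.1f} min")
print(f"Headroom:   {sla_budget - observed:.1f} min")
```

A 30-day window has 43,200 minutes, so 99.9% allows about 43.2 minutes of downtime and 99.94% corresponds to about 25.9 minutes, leaving roughly 17 minutes of slack.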

The Numbers That Made Me a Believer

I don't trust vendor benchmarks. I trust what I see in my own logs. After a month of production traffic, here's what the dashboard told me:

  • 1.2s average latency end-to-end, including preprocessing and validation
  • 320 tokens/sec throughput at the model layer
  • 84.6% average benchmark score on a custom extraction eval set I built from real customer documents
  • 40-65% cost reduction versus the previous in-house stack, depending on document mix

The cost number deserves a callout. When I tell peers "we saved 40-65%," the follow-up is always "compared to what?" The honest answer is: compared to a generic GPT-4o-only pipeline. If I had stayed on GPT-4o for everything, my monthly bill would have been roughly $X. After routing to the right tier and adding caching, it's roughly $X minus half. That's not a rounding error. That's a hire.

The Code That Runs It

The integration itself is the easy part, which is exactly the point. The whole thing took me under ten minutes to wire up, and the code is mostly the same shape as the standard OpenAI SDK.

import os
import hashlib
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

EXTRACTION_PROMPT = """Extract the following fields from the document as JSON:
- invoice_number
- invoice_date (ISO 8601)
- vendor_name
- total_amount
- line_items (array of {description, quantity, unit_price})

Return ONLY valid JSON. No commentary."""

def extract(document_text: str) -> dict:
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Flash",
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": document_text},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

A few things worth calling out in that snippet. First, temperature=0 is non-negotiable for extraction: I want output that is as repeatable as the model allows, not a creative writing partner. Second, response_format={"type": "json_object"} asks for syntactically valid JSON. In the OpenAI API this mode does not enforce a particular schema, so the prompt still has to name every field and the parsed result still needs validation downstream. Third, the base URL is https://global-apis.com/v1, which is the same OpenAI-compatible surface I was already using. That's what made the migration a 10-minute job instead of a 10-week one.
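Even with JSON mode on, a defensive check on the parsed output is cheap insurance. Here's a minimal sketch that validates the fields named in the prompt above. The function name and the expected Python types (for example, total_amount as a number rather than a string) are assumptions for illustration.

```python
import json

# Expected top-level fields, taken from the extraction prompt.
# The type choices are assumptions; adjust them to your own schema.
REQUIRED_FIELDS = {
    "invoice_number": str,
    "invoice_date": str,
    "vendor_name": str,
    "total_amount": (int, float),
    "line_items": list,
}


def parse_and_validate(raw: str) -> dict:
    """Parse model output and raise ValueError if it does not match the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    if problems:
        raise ValueError("; ".join(problems))
    return data


sample = (
    '{"invoice_number": "INV-1", "invoice_date": "2024-01-31", '
    '"vendor_name": "Acme", "total_amount": 12.5, "line_items": []}'
)
print(parse_and_validate(sample)["vendor_name"])
```

Raising a single ValueError with every problem listed makes it easy to log the failure and retry on a sibling model instead of passing a half-filled record downstream.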

How I Think About Latency (the Right Way)

Most engineers I talk to still quote "average latency" in design reviews, and that drives me slightly crazy. Average is a vanity metric. What I care about is p99, the latency that 99% of requests come in under, because it marks the best case for what the slowest 1% of requests actually experience.

For extraction, I've set the following internal budgets:

  • p50: under 600ms
  • p95: under 1.5s
  • p99: under 2.5s
  • Timeout: 8s hard, with circuit-breaker fallback

If p99 drifts above 2.5s for more than 15 minutes, the router automatically shifts the next 10% of traffic to a faster (and cheaper) model. If p99 drifts above 5s, we page on-call. This isn't theoretical — it has fired three times in the last quarter, and every time it caught a regional degradation before customers did.
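The budgets and escalation rules above can be sketched in a few lines. This is an illustration under assumptions: the nearest-rank percentile method, the function names, and the action labels are mine, and the sample latencies are synthetic. The 2.5s, 15-minute, and 5s thresholds come from the text.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_action(p99_seconds: float, minutes_over_budget: float) -> str:
    """Apply the escalation rules: page above 5s, shift traffic after 15 min above 2.5s."""
    if p99_seconds > 5.0:
        return "page_on_call"
    if p99_seconds > 2.5 and minutes_over_budget > 15:
        return "shift_10_percent"
    return "ok"


# Synthetic sample: 98 fast requests and 2 very slow ones.
latencies = [0.5] * 98 + [6.0] * 2
mean = sum(latencies) / len(latencies)

print(f"mean: {mean:.2f}s")
print(f"p50:  {percentile(latencies, 50):.2f}s")
print(f"p99:  {percentile(latencies, 99):.2f}s")
print(latency_action(percentile(latencies, 99), minutes_over_budget=0))
```

In this synthetic sample the mean is about 0.61s and the p50 is 0.5s, both comfortably inside budget, while the p99 is 6.0s and triggers a page. That is the case an average hides.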

Multi-region deployment is what makes this kind of automated failover possible. A single-region deployment has no peer to shift to. Running three regions gives you N+1 redundancy at a cost overhead of roughly 35%, which is the trade I'll take every time.

Patterns I Now Treat as Defaults

After running this in production for a few months, here are the practices I'd copy verbatim into any new extraction system:

  1. Cache aggressively. A 40% hit rate isn't aspirational, it's table stakes for any workload with repeated document types. Hash the content, not the filename.
  2. Stream responses when you can. Even for extraction, streaming lowers perceived latency and lets you start the validation pass before the full response lands. The user feels like it's fast even when it's not.
  3. Use the cheap tier for simple queries. GA-Economy gave us another 50% cost reduction on simple structured forms. Reserve the expensive models for documents that actually need reasoning.
  4. Monitor quality continuously. User satisfaction scores are lagging indicators. I track extraction accuracy against a labeled validation set in near-real-time, and I alert on drift, not just on errors.
  5. Implement fallback paths. When you hit a rate limit or a 5xx from the upstream provider, fall back to a sibling model — don't fail the request. Graceful degradation is the difference between a 99.9% system and a 99.5% one.
  6. Auto-scale the queue, not the model. The model endpoint is shared and elastic. My job is to keep my own workers scaling with the queue depth, not to over-provision model capacity I'll never use.
  7. Pin regions deliberately. Don't let the client library choose. Pin to the region closest to your data, and pin your data residency to the region closest to your customers. GDPR and friends are not optional.
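The fallback pattern from item 5 can be sketched like this. UpstreamError and call_model are hypothetical stand-ins: in a real integration you would catch whatever rate-limit and server-error exceptions your client library raises, and call_model would wrap the chat completion call.

```python
from typing import Callable, Sequence, Tuple


class UpstreamError(Exception):
    """Stand-in for a rate limit or 5xx response from the provider."""


def extract_with_fallback(
    document_text: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, raw_output) from the first success."""
    last_error = None
    for model in models:
        try:
            return model, call_model(model, document_text)
        except UpstreamError as exc:
            # Remember the failure and move on to the sibling model.
            last_error = exc
    raise RuntimeError("all models in the fallback chain failed") from last_error


# Usage with a fake backend where the primary model is rate limited.
def fake_call(model: str, text: str) -> str:
    if model == "primary":
        raise UpstreamError("429 rate limited")
    return '{"invoice_number": "INV-1"}'


used, output = extract_with_fallback("invoice text", ["primary", "sibling"], fake_call)
print(used, output)
```

Returning the model that actually served the request matters for observability: the log line should record the fallback model, not the one you asked for first.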

What I'd Tell My Past Self

If I could go back to that 2am war room, here's the conversation I'd have with myself:

"You're going to spend the next three weeks trying to tune your in-house OCR pipeline. Stop. The unit economics are wrong. The latency profile is wrong. The operational burden is wrong. Switch to a unified API, route by document complexity, cache everything, deploy in three regions, and ship. You'll be in bed by midnight."

That's the advice I'd give, and it's the advice I now give to every team that asks me about extraction workloads. The patterns aren't exotic. The catalog is the hard part — and Global API has done that work for you by exposing 184 models through a single OpenAI-compatible endpoint. You just have to pick the right tier for the right job.

Closing Thought

I'm not going to pretend this is a one-size-fits-all recommendation. Every workload is different, every compliance regime is different, every cost target is different. What I can say is that the architecture I described above — multi-region, tiered routing, cached, observable, with explicit p99 budgets — has been the most reliable system I've shipped in the last five years. It just happens to be one of the cheapest too.

If you want to see the full pricing breakdown or poke around the 184-model catalog yourself, Global API is worth a look. The unified SDK made my migration boring in the best possible way, and that's the highest compliment I can give an API.
