Adamo Software

How we reduced AI inference costs by 60% without sacrificing accuracy

Running ML models in production is expensive. When we deployed a document classification pipeline for a fintech client last year, our inference costs hit $12,000/month within the first quarter. The models were accurate, but the economics did not scale. Over 4 months, we brought that number down to $4,500/month while keeping accuracy above 95%. Here is exactly how we did it.

The starting point

The client needed to classify and extract data from financial documents: invoices, bank statements, tax forms, and contracts. We built a pipeline using a fine-tuned BERT model for classification and a GPT-based model for entity extraction.

The stack:

  • Classification: Fine-tuned BERT-large (340M params) on AWS SageMaker
  • Extraction: GPT-4 API calls for structured data extraction
  • Volume: ~50,000 documents/month
  • Infra: SageMaker real-time endpoints, always-on

It worked well functionally. But the cost breakdown was brutal:

SageMaker endpoints (24/7):    $4,200/month
GPT-4 API calls:               $6,800/month
S3 + data transfer:            $1,000/month
Total:                         $12,000/month

Step 1: Model distillation for classification

BERT-large was overkill for our classification task. We had 12 document categories, and after analyzing confusion matrices, most categories were clearly separable.

We distilled BERT-large into a DistilBERT model (66M params) using the standard knowledge distillation approach:

from transformers import (
    DistilBertForSequenceClassification,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments
)
import torch
import torch.nn.functional as F

class DistillationTrainer(Trainer):
    def __init__(self, teacher_model, temperature=4.0, alpha=0.5, **kwargs):
        super().__init__(**kwargs)
        self.teacher = teacher_model
        self.teacher.eval()  # freeze the teacher: no dropout, no gradient updates
        self.temperature = temperature
        self.alpha = alpha

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        student_logits = outputs.logits

        with torch.no_grad():
            teacher_outputs = self.teacher(**inputs)
            teacher_logits = teacher_outputs.logits

        # Soft target loss
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=-1),
            F.softmax(teacher_logits / self.temperature, dim=-1),
            reduction="batchmean"
        ) * (self.temperature ** 2)

        # Hard target loss
        hard_loss = F.cross_entropy(student_logits, inputs["labels"])

        loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss
        return (loss, outputs) if return_outputs else loss
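For intuition on the `temperature` knob above: dividing logits by T > 1 flattens the teacher's distribution, so the student also learns which wrong classes the teacher considers plausible (the "dark knowledge" that makes distillation work). A stdlib-only illustration with made-up logits:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 1.0]       # teacher strongly favors class 0
sharp = softmax(logits)        # T=1: essentially one-hot, little signal about other classes
soft = softmax(logits, 4.0)    # T=4: secondary classes become visible targets
```

With T=1 the top class takes >99% of the mass; at T=4 it drops to roughly 70%, leaving a meaningful gradient signal on the remaining classes.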

Results after distillation:

  • Accuracy: 97.2% → 95.8% (a 1.4-point drop, acceptable for our use case)
  • Inference speed: 3.2x faster
  • Model size: 5.1x smaller
  • SageMaker cost: Could now run on ml.c5.xlarge instead of ml.g4dn.xlarge

This single change cut SageMaker costs from $4,200 to $1,400/month.

Step 2: Replace GPT-4 with targeted smaller models

GPT-4 was our biggest cost driver. We were sending full document text to GPT-4 for entity extraction, which was like using a sledgehammer to hang a picture frame.

We analyzed our extraction tasks and found three categories:

  1. Structured fields (invoice numbers, dates, amounts): These follow predictable patterns
  2. Semi-structured fields (line items, payment terms): Some variation but bounded
  3. Unstructured fields (contract clauses, special conditions): Actually needs LLM reasoning

For category 1, we replaced GPT-4 with regex + a small NER model:

import re

INVOICE_PATTERNS = [
    r"(?:invoice|inv)[\s#.:]*([A-Z0-9-]{4,20})",
    r"(?:bill|receipt)[\s#.:]*([A-Z0-9-]{4,20})",
]

DATE_PATTERNS = [
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b",
    r"\b(\d{4}[/-]\d{1,2}[/-]\d{1,2})\b",
    r"\b(\w+ \d{1,2},? \d{4})\b",
]

def extract_structured_fields(text: str) -> dict:
    results = {}

    for pattern in INVOICE_PATTERNS:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            results["invoice_number"] = match.group(1)
            break

    for pattern in DATE_PATTERNS:
        match = re.search(pattern, text)
        if match:
            results["date"] = match.group(1)
            break

    # Amount extraction with currency handling
    amount_match = re.search(
        r"(?:total|amount|due)[\s:]*[\$€£]?\s*([\d,]+\.?\d*)",
        text, re.IGNORECASE
    )
    if amount_match:
        results["amount"] = amount_match.group(1).replace(",", "")

    return results
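A quick sanity check of the patterns above against a made-up invoice snippet (the sample text is illustrative, not client data):

```python
import re

sample = "Invoice #INV-2024-0391\nIssued: 03/15/2024\nTotal due: $1,250.00"

# Same three patterns as in extract_structured_fields
inv = re.search(r"(?:invoice|inv)[\s#.:]*([A-Z0-9-]{4,20})", sample, re.IGNORECASE)
date = re.search(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b", sample)
amt = re.search(r"(?:total|amount|due)[\s:]*[\$€£]?\s*([\d,]+\.?\d*)", sample, re.IGNORECASE)

print(inv.group(1))                   # INV-2024-0391
print(date.group(1))                  # 03/15/2024
print(amt.group(1).replace(",", ""))  # 1250.00
```

Note the word boundaries in the date pattern keep it from latching onto the digits inside the invoice number.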

For category 2, we fine-tuned a smaller model (Llama 3 8B quantized to 4-bit) hosted on a single GPU instance.
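The "single GPU instance" part follows from rough memory math (a sketch that ignores KV cache, activations, and quantization scales): a g5.xlarge carries one 24 GB A10G, and 4-bit weights shrink the model enough to fit with room to spare.

```python
PARAMS = 8e9       # Llama 3 8B parameter count
GB = 1024 ** 3

fp16_gb = PARAMS * 2 / GB    # 2 bytes/param   -> ~14.9 GB: tight on a 24 GB A10G
int4_gb = PARAMS * 0.5 / GB  # 0.5 bytes/param -> ~3.7 GB: comfortable headroom

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```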

For category 3 only (about 15% of documents), we kept GPT-4 but switched to GPT-4o-mini where possible.
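Putting the three tiers together, the routing logic amounts to a cheap dispatch on document type. A minimal sketch with hypothetical category sets and tier names (the real pipeline's identifiers differ):

```python
# Hypothetical mapping of document types to extraction tiers;
# ours was derived from the extraction-task analysis described above.
STRUCTURED = {"invoice", "bank_statement"}   # category 1: predictable patterns
SEMI_STRUCTURED = {"tax_form"}               # category 2: bounded variation

def route_extraction(doc_type: str) -> str:
    """Pick the cheapest extraction tier that can handle this document type."""
    if doc_type in STRUCTURED:
        return "regex+ner"        # ~$0, runs on existing infra
    if doc_type in SEMI_STRUCTURED:
        return "llama3-8b-4bit"   # self-hosted on one GPU instance
    return "gpt-4o-mini"          # open-ended reasoning, ~15% of traffic
```

Keeping the router a pure function makes it trivial to unit-test and to re-tier a document type later if its extraction quality drifts.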

The cost shift:

Before:
  GPT-4 for all docs:          $6,800/month

After:
  Regex + NER (category 1):    ~$0 (runs on existing infra)
  Llama 3 8B on g5.xlarge:     $900/month
  GPT-4o-mini (category 3):    $400/month
  Total extraction:            $1,300/month

Step 3: Batch processing and auto-scaling

The original setup ran SageMaker endpoints 24/7, but document uploads were heavily concentrated during business hours (8 AM to 6 PM local time). Nights and weekends had near-zero traffic.
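The utilization math makes the case by itself (assuming traffic only in the stated 8 AM to 6 PM weekday window):

```python
busy_hours_per_week = 10 * 5      # 8 AM-6 PM, Monday-Friday
total_hours_per_week = 24 * 7

utilization = busy_hours_per_week / total_hours_per_week
print(f"{utilization:.0%} busy")  # ~30%: the always-on endpoint sat idle ~70% of the time
```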

We switched to:

  • Async inference endpoints with auto-scaling (min instances: 0, scale to demand)
  • Batch transform jobs for bulk uploads (client uploaded batches of 500+ documents every Monday)
  • Spot instances for batch jobs (70% cheaper than on-demand)

# SageMaker async endpoint config with auto-scaling
import boto3

endpoint_name = "doc-classifier-async"  # placeholder: your async endpoint's name
scaling_client = boto3.client("application-autoscaling")

scaling_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=4,
)

scaling_client.put_scaling_policy(
    PolicyName="scale-on-queue-depth",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 60,
    },
)

This dropped SageMaker costs another $600/month by eliminating idle compute during off-hours.

Final numbers

                        Before      After       Savings
Classification:         $4,200      $800        -81%
Entity extraction:      $6,800      $1,300      -81%
Infrastructure:         $1,000      $400        -60%
Batch processing:       $0          $200        (new)
Monitoring/logging:     $0          $100        (new)
                        -------     -------
Total:                  $12,000     $2,800      -77%

We actually exceeded the 60% target and landed at 77% reduction. Accuracy stayed at 95.4% overall (down from 97.2%), which the client considered a worthwhile tradeoff.

Key takeaways

Audit before optimizing. We spent two weeks just instrumenting costs per API call, per model, per document type. Without that data, we would have optimized the wrong things.

Not every task needs your biggest model. The single highest-impact change was pulling structured field extraction out of GPT-4. That regex script took a day to write and saved $5,000/month.

Distillation is underrated for production workloads. If your classification accuracy is already high (96%+), a distilled model will likely maintain acceptable performance at a fraction of the cost.

Auto-scaling to zero is powerful. If your workload is not truly 24/7, do not pay for 24/7 compute.

Wrapping up

The common instinct when AI costs spike is to negotiate API pricing or switch providers. In our experience, the bigger wins come from rethinking which model handles which task. Most production pipelines have a mix of simple and complex work, and matching model capability to task complexity is where the real savings are.

If you are running into similar cost issues with your AI-powered pipelines, start with the cost audit. You will almost certainly find that a large chunk of your spend goes to tasks that do not need your most expensive model.


I'm a software engineer at Adamo Software, where we build AI and data pipelines for clients in fintech and healthcare.
