Adamo Software

How we reduced AI inference costs by 60% without sacrificing accuracy

Running ML models in production is expensive. When we deployed a document classification pipeline for a fintech client last year, our inference costs hit $12,000/month within the first quarter. The models were accurate, but the economics did not scale. Over 4 months, we brought that number down to $4,500/month while keeping accuracy above 95%. Here is exactly how we did it.

The starting point

The client needed to classify and extract data from financial documents: invoices, bank statements, tax forms, and contracts. We built a pipeline using a fine-tuned BERT model for classification and a GPT-based model for entity extraction.

The stack:

  • Classification: Fine-tuned BERT-large (340M params) on AWS SageMaker
  • Extraction: GPT-4 API calls for structured data extraction
  • Volume: ~50,000 documents/month
  • Infra: SageMaker real-time endpoints, always-on

It worked well functionally. But the cost breakdown was brutal:

SageMaker endpoints (24/7):    $4,200/month
GPT-4 API calls:               $6,800/month
S3 + data transfer:            $1,000/month
Total:                         $12,000/month

Step 1: Model distillation for classification

BERT-large was overkill for our classification task. We had 12 document categories, and after analyzing confusion matrices, most categories were clearly separable.

We distilled BERT-large into a DistilBERT model (66M params) using the standard knowledge distillation approach:

from transformers import (
    DistilBertForSequenceClassification,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments
)
import torch
import torch.nn.functional as F

class DistillationTrainer(Trainer):
    def __init__(self, teacher_model, temperature=4.0, alpha=0.5, **kwargs):
        super().__init__(**kwargs)
        self.teacher = teacher_model
        self.teacher.eval()  # freeze the teacher: no dropout, no gradient updates
        self.temperature = temperature
        self.alpha = alpha

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        student_logits = outputs.logits

        with torch.no_grad():
            teacher_outputs = self.teacher(**inputs)
            teacher_logits = teacher_outputs.logits

        # Soft target loss
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=-1),
            F.softmax(teacher_logits / self.temperature, dim=-1),
            reduction="batchmean"
        ) * (self.temperature ** 2)

        # Hard target loss
        hard_loss = F.cross_entropy(student_logits, inputs["labels"])

        loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss
        return (loss, outputs) if return_outputs else loss
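For intuition on the `temperature` knob above: dividing logits by T > 1 flattens the teacher's distribution, so the student also learns which wrong classes the teacher considers plausible (the "dark knowledge" that makes distillation work). A stdlib-only illustration with made-up logits:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 1.0]       # teacher strongly favors class 0
sharp = softmax(logits)        # T=1: essentially one-hot, little signal about other classes
soft = softmax(logits, 4.0)    # T=4: secondary classes become visible targets
```

With T=1 the top class takes >99% of the mass; at T=4 it drops to roughly 70%, leaving a meaningful gradient signal on the remaining classes.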

Results after distillation:

  • Accuracy: 97.2% → 95.8% (a 1.4-point drop, acceptable for our use case)
  • Inference speed: 3.2x faster
  • Model size: 5.1x smaller
  • SageMaker cost: Could now run on ml.c5.xlarge instead of ml.g4dn.xlarge

This single change cut SageMaker costs from $4,200 to $1,400/month.

Step 2: Replace GPT-4 with targeted smaller models

GPT-4 was our biggest cost driver. We were sending full document text to GPT-4 for entity extraction, which was like using a sledgehammer to hang a picture frame.

We analyzed our extraction tasks and found three categories:

  1. Structured fields (invoice numbers, dates, amounts): These follow predictable patterns
  2. Semi-structured fields (line items, payment terms): Some variation but bounded
  3. Unstructured fields (contract clauses, special conditions): Actually needs LLM reasoning

For category 1, we replaced GPT-4 with regex + a small NER model:

import re

INVOICE_PATTERNS = [
    r"(?:invoice|inv)[\s#.:]*([A-Z0-9-]{4,20})",
    r"(?:bill|receipt)[\s#.:]*([A-Z0-9-]{4,20})",
]

DATE_PATTERNS = [
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b",
    r"\b(\d{4}[/-]\d{1,2}[/-]\d{1,2})\b",
    r"\b(\w+ \d{1,2},? \d{4})\b",
]

def extract_structured_fields(text: str) -> dict:
    results = {}

    for pattern in INVOICE_PATTERNS:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            results["invoice_number"] = match.group(1)
            break

    for pattern in DATE_PATTERNS:
        match = re.search(pattern, text)
        if match:
            results["date"] = match.group(1)
            break

    # Amount extraction with currency handling
    amount_match = re.search(
        r"(?:total|amount|due)[\s:]*[\$€£]?\s*([\d,]+\.?\d*)",
        text, re.IGNORECASE
    )
    if amount_match:
        results["amount"] = amount_match.group(1).replace(",", "")

    return results
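A quick sanity check of the patterns above against a made-up invoice snippet (the sample text is illustrative, not client data):

```python
import re

sample = "Invoice #INV-2024-0391\nIssued: 03/15/2024\nTotal due: $1,250.00"

# Same three patterns as in extract_structured_fields
inv = re.search(r"(?:invoice|inv)[\s#.:]*([A-Z0-9-]{4,20})", sample, re.IGNORECASE)
date = re.search(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b", sample)
amt = re.search(r"(?:total|amount|due)[\s:]*[\$€£]?\s*([\d,]+\.?\d*)", sample, re.IGNORECASE)

print(inv.group(1))                   # INV-2024-0391
print(date.group(1))                  # 03/15/2024
print(amt.group(1).replace(",", ""))  # 1250.00
```

Note the word boundaries in the date pattern keep it from latching onto the digits inside the invoice number.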

For category 2, we fine-tuned a smaller model (Llama 3 8B quantized to 4-bit) hosted on a single GPU instance.
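The "single GPU instance" part follows from rough memory math (a sketch that ignores KV cache, activations, and quantization scales): a g5.xlarge carries one 24 GB A10G, and 4-bit weights shrink the model enough to fit with room to spare.

```python
PARAMS = 8e9       # Llama 3 8B parameter count
GB = 1024 ** 3

fp16_gb = PARAMS * 2 / GB    # 2 bytes/param   -> ~14.9 GB: tight on a 24 GB A10G
int4_gb = PARAMS * 0.5 / GB  # 0.5 bytes/param -> ~3.7 GB: comfortable headroom

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```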

For category 3 only (about 15% of documents), we kept GPT-4 but switched to GPT-4o-mini where possible.
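Putting the three tiers together, the routing logic amounts to a cheap dispatch on document type. A minimal sketch with hypothetical category sets and tier names (the real pipeline's identifiers differ):

```python
# Hypothetical mapping of document types to extraction tiers;
# ours was derived from the extraction-task analysis described above.
STRUCTURED = {"invoice", "bank_statement"}   # category 1: predictable patterns
SEMI_STRUCTURED = {"tax_form"}               # category 2: bounded variation

def route_extraction(doc_type: str) -> str:
    """Pick the cheapest extraction tier that can handle this document type."""
    if doc_type in STRUCTURED:
        return "regex+ner"        # ~$0, runs on existing infra
    if doc_type in SEMI_STRUCTURED:
        return "llama3-8b-4bit"   # self-hosted on one GPU instance
    return "gpt-4o-mini"          # open-ended reasoning, ~15% of traffic
```

Keeping the router a pure function makes it trivial to unit-test and to re-tier a document type later if its extraction quality drifts.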

The cost shift:

Before:
  GPT-4 for all docs:          $6,800/month

After:
  Regex + NER (category 1):    ~$0 (runs on existing infra)
  Llama 3 8B on g5.xlarge:     $900/month
  GPT-4o-mini (category 3):    $400/month
  Total extraction:            $1,300/month

Step 3: Batch processing and auto-scaling

The original setup ran SageMaker endpoints 24/7, but document uploads were heavily concentrated during business hours (8 AM to 6 PM local time). Nights and weekends had near-zero traffic.
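The utilization math makes the case by itself (assuming traffic only in the stated 8 AM to 6 PM weekday window):

```python
busy_hours_per_week = 10 * 5      # 8 AM-6 PM, Monday-Friday
total_hours_per_week = 24 * 7

utilization = busy_hours_per_week / total_hours_per_week
print(f"{utilization:.0%} busy")  # ~30%: the always-on endpoint sat idle ~70% of the time
```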

We switched to:

  • Async inference endpoints with auto-scaling (min instances: 0, scale to demand)
  • Batch transform jobs for bulk uploads (client uploaded batches of 500+ documents every Monday)
  • Spot instances for batch jobs (70% cheaper than on-demand)

# SageMaker async endpoint config with auto-scaling
import boto3

endpoint_name = "doc-classifier-async"  # placeholder: your async endpoint's name
scaling_client = boto3.client("application-autoscaling")

scaling_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=4,
)

scaling_client.put_scaling_policy(
    PolicyName="scale-on-queue-depth",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 60,
    },
)

This dropped SageMaker costs another $600/month by eliminating idle compute during off-hours.

Final numbers

                        Before      After       Savings
Classification:         $4,200      $800        -81%
Entity extraction:      $6,800      $1,300      -81%
Infrastructure:         $1,000      $400        -60%
Batch processing:       $0          $200        (new)
Monitoring/logging:     $0          $100        (new)
                        -------     -------
Total:                  $12,000     $2,800      -77%

We actually exceeded the 60% target and landed at 77% reduction. Accuracy stayed at 95.4% overall (down from 97.2%), which the client considered a worthwhile tradeoff.

Key takeaways

Audit before optimizing. We spent two weeks just instrumenting costs per API call, per model, per document type. Without that data, we would have optimized the wrong things.

Not every task needs your biggest model. The single highest-impact change was pulling structured field extraction out of GPT-4. That regex script took a day to write and saved $5,000/month.

Distillation is underrated for production workloads. If your classification accuracy is already high (96%+), a distilled model will likely maintain acceptable performance at a fraction of the cost.

Auto-scaling to zero is powerful. If your workload is not truly 24/7, do not pay for 24/7 compute.

Wrapping up

The common instinct when AI costs spike is to negotiate API pricing or switch providers. In our experience, the bigger wins come from rethinking which model handles which task. Most production pipelines have a mix of simple and complex work, and matching model capability to task complexity is where the real savings are.

If you are running into similar cost issues with your AI-powered pipelines, start with the cost audit. You will almost certainly find that a large chunk of your spend goes to tasks that do not need your most expensive model.


I'm a software engineer at Adamo Software, where we build AI and data pipelines for clients in fintech and healthcare.
