Anthropic Message Batching: When 50% Off Is Worth the Latency

#ai #llm #anthropic #python

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You have a 1,200-prompt eval set. It runs every night. You hit the regular Messages API in a tight asyncio loop. You manage retries. The rate limiter slaps you halfway through and you wake up to a half-finished CSV. The job had until standup, not five minutes.

That is the case the Anthropic Message Batches API was built for. You hand it up to 100,000 requests in one POST. The docs describe most batches finishing in less than 1 hour, with a hard 24-hour expiration on anything that does not. You pay 50% of the standard token rate for everything in the batch. Same model, same outputs. Different endpoint.

The trade is latency. If anyone is staring at a screen waiting for the result, batching is the wrong tool. Otherwise, you are leaving money on the table by not using it.

Where the discount earns its keep

A short list of jobs where batching is the obvious move:

Nightly eval runs. A few hundred to a few thousand prompts against a frozen test set. You queue at 02:00, results are sitting in S3 by 06:00, the morning report job picks them up.
Embeddings or summary backfills. You have a million support tickets and you want a one-paragraph summary on each. The user does not see the result; a search index does.
Document classification cron jobs. Categorise yesterday's uploads, tag last night's email queue, score the transcript backlog. The cron that runs at 03:00 does not care about p95 latency.
Offline data labelling for training or evals. You are building a synthetic dataset. The pipeline runs once, the cost is the constraint, the wall clock is not.
Bulk content rewrites. Re-render 50,000 product descriptions in a new tone, then schedule alt-text regeneration across a media library. Walk away until morning.

Where batching is the wrong call:

Anything a user is waiting on. Chat, autocomplete, an agent loop, a code-action button: none of it.
Anything with a strict SLA tighter than "by tomorrow morning." A 24-hour expiration is a 24-hour expiration. Most batches finish in under an hour, but the cap is what you plan against.
Tiny jobs. If your batch is 30 requests, the plumbing is more code than the savings can pay back.

The script: 1,000 eval prompts, end to end

Here is a runnable script that submits an eval set, polls until the batch ends, and writes one JSONL line per result. It is the actual shape of code I would put in a cron job.

First, build the requests. Each item needs a custom_id (1–64 chars, ^[a-zA-Z0-9_-]{1,64}$) so you can match results back to inputs. The custom_id is the only key in the result file; lose it and you cannot reconcile.

import json
import time
from pathlib import Path

import anthropic
from anthropic.types.message_create_params import (
    MessageCreateParamsNonStreaming,
)
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()

EVAL_FILE = Path("evals/dataset.jsonl")
OUTPUT_FILE = Path("evals/results.jsonl")
MODEL = "claude-opus-4-7"
SYSTEM = "You are an evaluator. Reply with PASS or FAIL only."


def load_evals(path: Path) -> list[dict]:
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]


def build_requests(rows: list[dict]) -> list[Request]:
    out = []
    for row in rows:
        out.append(
            Request(
                custom_id=row["id"],
                params=MessageCreateParamsNonStreaming(
                    model=MODEL,
                    max_tokens=64,
                    system=SYSTEM,
                    messages=[
                        {"role": "user", "content": row["prompt"]},
                    ],
                ),
            )
        )
    return out

MessageCreateParamsNonStreaming is the typed dict the SDK exposes — you cannot stream inside a batch, so the type signals it. custom_id should be the eval row id, not an autoincrement, so re-runs and partial failures stay reproducible.

Now the submit-and-poll loop. The docs are explicit that processing_status starts as in_progress and ends as ended; ended does not mean every request succeeded, just that the batch is done.

def submit(requests: list[Request]) -> str:
    batch = client.messages.batches.create(requests=requests)
    print(f"Submitted batch {batch.id} with {len(requests)} requests")
    return batch.id


def wait_for(batch_id: str, poll_seconds: int = 60) -> None:
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            counts = batch.request_counts
            print(
                f"ended: {counts.succeeded} ok, "
                f"{counts.errored} err, "
                f"{counts.expired} expired, "
                f"{counts.canceled} canceled"
            )
            return
        time.sleep(poll_seconds)

retrieve is idempotent and cheap; polling once a minute is fine for a batch that runs in an hour. Do not poll once a second. The rate limit on Batches API HTTP requests is real and a tight loop will burn through it for nothing.

Pulling results is the part most teams get wrong on the first try. The result stream gives you one entry per custom_id, and each entry has a result.type of succeeded, errored, canceled, or expired. Only succeeded is billed; the other three are free, which matters more than it sounds when you accidentally submit a batch with a bad system prompt.

def write_results(batch_id: str, out_path: Path) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w") as f:
        for entry in client.messages.batches.results(batch_id):
            row = {"custom_id": entry.custom_id}
            r = entry.result
            if r.type == "succeeded":
                msg = r.message
                row["status"] = "ok"
                row["text"] = msg.content[0].text
                row["input_tokens"] = msg.usage.input_tokens
                row["output_tokens"] = msg.usage.output_tokens
            elif r.type == "errored":
                err = r.error.error
                row["status"] = "error"
                row["error"] = err.type
            elif r.type == "expired":
                row["status"] = "expired"
            elif r.type == "canceled":
                row["status"] = "canceled"
            f.write(json.dumps(row) + "\n")


def main() -> None:
    rows = load_evals(EVAL_FILE)
    requests = build_requests(rows)
    batch_id = submit(requests)
    wait_for(batch_id)
    write_results(batch_id, OUTPUT_FILE)


if __name__ == "__main__":
    main()

The output is a JSONL file your scoring step reads. One line per eval row, every status accounted for. If errored is non-zero, you know which custom_ids to re-queue. If expired is non-zero, your batch ran into the 24-hour wall and you should split it next time.

Cost math, with the numbers we know

The discount is 50% off the standard input and output token rates. The exact dollar figure depends on which Claude model you target, and Anthropic's rate card moves; check the pricing page before you build a forecast.

The shape of the saving is what matters. If your nightly job is 10,000 prompts at an average of 800 input tokens and 200 output tokens each, that is 8M input and 2M output tokens per run, every night. Multiply by 30 to get a monthly bill. Cut that bill in half. For most teams, the batch endpoint moves the line item from invisible to obvious on the finance review.

The other quiet win: the discount stacks with prompt caching. If your eval prompts share a long system prompt or a fixed instruction block, cache it, and the cached portion is billed at the cache hit rate, then the 50% comes off whatever is left. One thing to watch when stacking the two: the default prompt-cache TTL is 5 minutes, which expires mid-batch on anything but the smallest jobs. Use the 1-hour cache duration so the cached prefix survives the run. The pricing page walks through the multipliers.

A few things the docs are precise about

Worth keeping in your head when you write the cron job:

Batch caps. 100,000 requests or 256 MB total, whichever you hit first. Above that, split.
24-hour expiration. If processing has not completed within 24 hours, unfinished requests come back as expired. You are not billed for them, but you also do not have results.
No streaming inside a batch. You get the final message per request, never deltas.
custom_id is the only join key. Use a stable id; do not invent one at submit time you cannot regenerate.
Pricing and SLA are subject to change. The numbers in this post reflect the Anthropic batch processing docs at the time of writing. Re-read them before quoting a saving figure to your finance team.

When in doubt

The decision is one question: who is waiting? If someone is, use the regular Messages API and pay full rate for the latency. If no one is, batch it and pocket the discount.

If this was useful

The LLM Observability Pocket Guide covers the rest of the eval-and-cost stack: which signals to attach to batch runs so a regression shows up in your dashboards, how to structure result files so re-runs are cheap, and how to size a batch against your rate-limit budget without the trial-and-error tax.