Mukunda Rao Katta

Posted on May 25

Cut Your Anthropic Bill in Half with the Batch API

#hermeschallenge #ai #python #agents

The $800 bill

It was a product catalog job. The task: generate a two-sentence marketing description for each of 50,000 SKUs. Standard stuff. Pipe the data in, call the API, write results to a database.

The code worked fine. The model did a good job. The pipeline ran overnight, finished by morning.

Then the invoice came in. $800.

That was for one run. The catalog was going to be refreshed monthly. And there were plans to run the same kind of job across additional product lines.

One week later, a colleague mentioned the Anthropic Message Batches API.

Same model. Same prompts. Same token counts. The only difference: you submit the requests as a batch instead of one at a time, and you wait up to 24 hours for results instead of getting them in real time.

The discount: 50%.

That same 50,000-SKU job would have cost $400.

The batch API is not obscure. It is documented. The discount is real. But nothing in the standard SDK workflow points you toward it, and when you are in deadline mode, you reach for the simplest thing that works. The simplest thing was the synchronous API. That cost an extra $400.

The shape of the fix

Install the library:

pip install anthropic-batch-kit

Submit a batch:

from anthropic_batch_kit import BatchKit

kit = BatchKit(api_key="your-key")

prompts = [
    "Write a two-sentence product description for a waterproof hiking boot.",
    "Write a two-sentence product description for a titanium camping mug.",
    # ... up to 10,000 prompts per batch
]

batch_id = kit.submit(
    prompts=prompts,
    model="claude-sonnet-4-6",
    max_tokens=200,
)
print(f"Submitted batch: {batch_id}")

Poll until the batch is done:

kit.wait(batch_id)
# Polls at increasing intervals. Blocks until status is "ended".

Retrieve results with cost tally:

results = kit.retrieve(batch_id)

for item in results.items:
    print(item.custom_id, item.content)

print(f"Input tokens:  {results.tally.input_tokens}")
print(f"Output tokens: {results.tally.output_tokens}")
print(f"Batch cost:    ${results.tally.batch_cost_usd:.4f}")
print(f"Standard cost: ${results.tally.standard_cost_usd:.4f}")
print(f"Savings:       ${results.tally.savings_usd:.4f}")

Or do it in one call:

results = kit.run(prompts=prompts, model="claude-sonnet-4-6", max_tokens=200)

run() is submit + wait + retrieve in sequence. Use it when you just want results and do not need to track the batch ID separately.

What it does NOT do

It does not make the batch API synchronous. Results come back asynchronously. If you need a response in under a second, the batch API is the wrong tool.
It does not parallelize across multiple batches automatically. If you have 30,000 prompts and want to split across three batches, you do that in your calling code.
It does not retry individual failed items within a batch. The Batch API returns a per-item error status for failures. You can filter those out from results.items and resubmit.
It does not give you streaming output. You get final results after the batch is complete, not partial tokens as they are generated.

Inside the lib: the cost tally

This is the design detail worth explaining.

When you call the standard API, the SDK gives you usage.input_tokens and usage.output_tokens per request. To get total cost for a job, you sum those up and multiply by the per-token price.

With the Batch API, the per-token price is half the standard rate. The token counts are the same. The pricing is different.

The tally in anthropic-batch-kit tracks both numbers at once:

@dataclass
class CostTally:
    input_tokens: int
    output_tokens: int
    batch_cost_usd: float      # what you actually pay
    standard_cost_usd: float   # what you would have paid via real-time API
    savings_usd: float         # the difference

savings_usd is not a marketing number. It is a real field derived from the actual token counts for your actual job. If you run 10 batches over a month, you can sum the savings_usd fields and see what the batch strategy saved you vs. the alternative.

The pricing table used for the calculations is embedded in the library and covers the current Anthropic model lineup. It is versioned separately from the Anthropic SDK so you can pin it and audit it. If a model's price changes, you update the library.

When this is useful

The Batch API is the right choice when:

Latency does not matter. You are processing data overnight, on a schedule, or as a background job. No user is waiting for the response.
Volume is high. The discount only pays off at scale. If you are making 20 calls, the savings are cents. If you are making 20,000 calls, the savings are real money.
Prompts are independent. The Batch API processes items in parallel and returns them in arbitrary order. If item 500 depends on the output of item 200, you need a different approach.
You can tolerate up to 24 hours. The SLA for the Batch API is 24 hours. In practice it is usually faster. But you cannot count on it for time-sensitive jobs.

Common use cases: catalog enrichment, classification pipelines, report generation, embedding-adjacent summarization, QA scoring at scale.

When NOT to use this

Real-time user-facing responses. If a person is waiting at a screen, use the synchronous API.
Streaming use cases. The Batch API does not support streaming.
Short jobs. If you have fewer than a few hundred prompts, the overhead of submitting, polling, and retrieving is not worth it. Just use the synchronous API.
Dependent pipelines. If step 2 needs the output from step 1, the Batch API requires two separate batch submissions with a wait in between.

Install

pip install anthropic-batch-kit

No extra dependencies. The library uses only the Python standard library plus the anthropic SDK (which you already have if you are using the Batch API).

Source and tests: github.com/MukundaKatta/anthropic-batch-kit

Siblings

These libraries in the same agent-stack address adjacent problems:

Lib	Boundary	Repo
llm-batch-coalesce	Deduplicates concurrent identical in-flight requests (different from batch submission). For when 100 users ask the same thing at the same time.	MukundaKatta/llm-batch-coalesce
claude-cost	Per-token pricing table for Anthropic models. The pricing source anthropic-batch-kit uses to calculate the 50% discount.	MukundaKatta/claude-cost
token-budget-py	Thread-safe shared token/USD cap. Tracks cumulative spend across multiple batches.	MukundaKatta/token-budget-py
agenttrace	Cost and latency rollup per agent run. Pairs with the batch tally to give a full picture of where your spend is going.	MukundaKatta/agenttrace

The distinction between anthropic-batch-kit and llm-batch-coalesce comes up often. They solve different problems. llm-batch-coalesce is for collapsing concurrent real-time requests into one in-flight call. anthropic-batch-kit is for submitting large offline workloads to the Anthropic Batch API and getting results back asynchronously. They can coexist in the same system without conflict.

What's next

A few things are on the list:

Multi-batch orchestration. A helper that splits a list of more than 10,000 prompts across multiple batches automatically and collects all results into a single response.
Retry-failed helper. A convenience method that filters failed items from a completed batch and resubmits them as a new batch in one call.
CSV/JSONL input. Direct support for submitting from a file so catalog-style pipelines do not need a custom loader.
Live pricing updates. A mechanism for pulling the latest Anthropic pricing without a library update, for environments where the price table needs to stay current between releases.

The library is at version 0.1.0. Ten tests, zero deps beyond anthropic. If you run high-volume offline jobs against the Anthropic API and have not switched to the Batch API yet, check your last invoice and do the math.

DEV Community