Mukunda Rao Katta

Posted on May 25

When and How to Use the Anthropic Batch API in Your Agent


The Anthropic Message Batches API gives you a 50% discount on both input and output tokens. That is not a small number. If you are running nightly evals, bulk document processing, or A/B tests on prompt variants, this discount changes the economics significantly.

The catch is that batch processing is async. You submit a batch, and results come back within 24 hours. For most agent use cases, that is a dealbreaker. But for the cases where it is not, the savings are real and easy to capture.

This post is about knowing when to use it, when not to, and how to wire it up cleanly.

What the Batch API Actually Offers

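The Message Batches API lets you submit up to 100,000 requests (or 256 MB of payload, whichever limit is reached first) in a single batch. Anthropic processes them asynchronously on a best-effort schedule. Most batches finish within an hour, but a batch can take up to 24 hours, after which any unprocessed requests expire.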

The pricing is straightforward: 50% off the standard per-token rate for both input and output. At current Sonnet pricing, that is roughly $1.50 per million input tokens instead of $3.00, and $7.50 per million output tokens instead of $15.00.

Results are delivered as a JSONL stream. Each result is keyed to the custom ID you assigned when you submitted the request. Failed requests are included in the results with an error field so you can identify and retry them.

There is no streaming. You get the full response or an error. There is no way to get partial results mid-batch.
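
Because each batch has a cap on the number of requests, a larger job has to be split before it is submitted. Here is a minimal sketch of that split. The helper name chunk_requests is hypothetical (it is not part of the Anthropic SDK or anthropic-batch-kit), and the cap is passed in by the caller rather than hard-coded:

```python
from typing import Dict, Iterator, List


def chunk_requests(requests: List[Dict], max_per_batch: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `requests`, each at most `max_per_batch` long."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    for start in range(0, len(requests), max_per_batch):
        yield requests[start:start + max_per_batch]


# Illustration with a tiny cap; pass the real per-batch limit in practice.
example_requests = [{"custom_id": f"doc-{i}"} for i in range(7)]
chunks = list(chunk_requests(example_requests, max_per_batch=3))
print([len(c) for c in chunks])  # [3, 3, 1]
```

This only counts requests. If a payload size limit also applies to your batches, that needs a separate check.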

When NOT to Use the Batch API

The Batch API is the wrong tool when:

The user is waiting. Any agent loop that has a human waiting for a response cannot tolerate latency of up to 24 hours. Real-time agents, chatbots, and interactive tools need synchronous API calls.

You need streaming output. If your UI shows tokens as they arrive, the Batch API is not an option. Batch responses always arrive complete, never streamed.

One request depends on another. If request B needs the result of request A, you cannot put them in the same batch. The Batch API is for independent requests only, and results are matched to requests by custom ID, not by position.

You need to react to failures quickly. Batch results come back as a group. If 200 of your 1,000 requests failed, you find out when the batch completes, not when the first failure happens. If fast failure detection matters, use synchronous calls with proper retry logic.

Your batch is small. The overhead of submitting and polling a batch is not worth it for fewer than 50-100 requests. Just call the API synchronously.
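
The checklist above can be written down as a small pre-flight check. This is only a sketch: the Workload fields and the reasons_to_avoid_batch helper are hypothetical names, and the default threshold of 50 requests is the lower end of the rough range mentioned above, not a hard rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    user_is_waiting: bool
    needs_streaming: bool
    requests_depend_on_each_other: bool
    needs_fast_failure_detection: bool
    num_requests: int


def reasons_to_avoid_batch(w: Workload, min_requests: int = 50) -> List[str]:
    """Return why a workload is a poor fit; an empty list means batch is viable."""
    reasons = []
    if w.user_is_waiting:
        reasons.append("a user is waiting on the response")
    if w.needs_streaming:
        reasons.append("the UI needs streamed tokens")
    if w.requests_depend_on_each_other:
        reasons.append("requests depend on each other's results")
    if w.needs_fast_failure_detection:
        reasons.append("failures must be detected quickly")
    if w.num_requests < min_requests:
        reasons.append(f"fewer than {min_requests} requests")
    return reasons


nightly_eval = Workload(False, False, False, False, num_requests=500)
chat_turn = Workload(True, True, False, True, num_requests=1)
print(reasons_to_avoid_batch(nightly_eval))    # []
print(len(reasons_to_avoid_batch(chat_turn)))  # 4
```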

Good Use Cases

Nightly eval runs. You have a test suite of 500 prompts that you run against every new model or system prompt variant. These run overnight. The 24-hour SLA is fine. The 50% discount makes eval runs significantly cheaper.

Bulk document processing. You need to summarize, classify, or extract structured data from 2,000 documents. The documents are already in your system. You do not need results immediately. Batch API handles this naturally.

A/B prompt testing. You have two or three system prompt variants and 300 representative user inputs. You want to run all variants against all inputs and compare results. This is embarrassingly parallel and async-friendly.

Model migration testing. You are evaluating a new model version and want to compare outputs on your full golden set. Submit the golden set as a batch. Wait. Compare.

Bulk classification or tagging. You need to run a classifier prompt over a product catalog, a ticket backlog, or a set of user reviews. No urgency. High volume. Batch API is the right call.
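
For the A/B case, the work is mostly building one request per (variant, input) pair and encoding the pair in the custom ID so the results can be grouped afterwards. Here is a sketch of that, using the flat request shape from the examples in this post. The build_ab_requests helper is hypothetical, and passing a "system" key through to the API in this shape is an assumption:

```python
from itertools import product
from typing import Dict, List


def build_ab_requests(
    system_prompts: List[str],
    user_inputs: List[str],
    model: str,
    max_tokens: int = 256,
) -> List[Dict]:
    """Build one request per (variant, input) pair, with the pair encoded in custom_id."""
    requests = []
    for (v, system), (i, text) in product(enumerate(system_prompts), enumerate(user_inputs)):
        requests.append({
            "custom_id": f"v{v}-i{i}",  # variant index, then input index
            "model": model,
            "max_tokens": max_tokens,
            "system": system,           # assumption: the client accepts a system prompt here
            "messages": [{"role": "user", "content": text}],
        })
    return requests


variants = ["You are terse.", "You are thorough."]
inputs = ["Explain DNS.", "Explain TLS.", "Explain BGP."]
reqs = build_ab_requests(variants, inputs, model="claude-sonnet-4-6")
print(len(reqs))                                     # 6
print(reqs[0]["custom_id"], reqs[-1]["custom_id"])   # v0-i0 v1-i2
```

Three variants against 300 inputs would produce 900 requests this way, all independent of each other.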

The anthropic-batch-kit Pattern

anthropic-batch-kit wraps the Anthropic Python SDK's batch endpoints with a simpler submit/poll/retrieve pattern and handles the JSONL result parsing for you.

from anthropic_batch_kit import BatchKit

kit = BatchKit(api_key="your-key")

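# Example input; replace with your own documents
documents = ["First document text.", "Second document text."]

# Build requests
requests = []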
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "model": "claude-sonnet-4-6",
        "max_tokens": 256,
        "messages": [
            {"role": "user", "content": f"Summarize in one sentence: {doc}"}
        ]
    })

# Submit batch
batch_id = kit.submit(requests)
print(f"Submitted batch {batch_id}")

Then, separately (in a cron job, a scheduled task, or a later script run):

from anthropic_batch_kit import BatchKit

kit = BatchKit(api_key="your-key")

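# The batch ID printed by the submit script (persist it in a file, database, or env var)
batch_id = "your-batch-id"

# Poll until complete (blocks with sleep intervals)
results = kit.poll_and_retrieve(batch_id, poll_interval_seconds=60)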

# results is a dict keyed by custom_id
for doc_id, result in results.items():
    if result["type"] == "succeeded":
        print(f"{doc_id}: {result['content']}")
    else:
        print(f"{doc_id}: FAILED - {result['error']}")

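The poll_and_retrieve method handles the polling loop internally. It checks the batch status every N seconds and returns when the batch reaches a terminal state (succeeded, errored, or expired). In the underlying API, the batch-level processing status moves from in_progress to ended, while succeeded, errored, canceled, and expired are the per-request result types.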
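
To make the idea concrete, here is roughly what a polling loop of this kind looks like. This is a generic sketch, not the actual implementation inside anthropic-batch-kit: poll_until_done, its parameters, and the "running" and "done" status strings are all placeholders, and the status check is injected as a callable so that the loop can be exercised without a network call.

```python
import time
from typing import Callable, Iterable


def poll_until_done(
    get_status: Callable[[], str],
    terminal_states: Iterable[str],
    poll_interval_seconds: float = 60.0,
    timeout_seconds: float = 24 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status() until it returns a terminal state or the timeout passes."""
    terminal = set(terminal_states)
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        sleep(poll_interval_seconds)


# Fake status source standing in for a real "retrieve batch" call.
fake_statuses = iter(["running", "running", "done"])
final = poll_until_done(
    get_status=lambda: next(fake_statuses),
    terminal_states={"done"},
    poll_interval_seconds=0,
)
print(final)  # done
```

The explicit timeout matters in a cron job: without it, a batch that never reaches a terminal state would block the script indefinitely.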

Cost Calculation

Here is how to think about the cost difference for a typical eval run.

Assume 500 prompts, each with 1,200 input tokens and 400 output tokens.

At standard Sonnet pricing: 600,000 input tokens cost $1.80 and 200,000 output tokens cost $3.00, for $4.80 total.
At batch pricing (50% off): $2.40 total. That is a $2.40 saving per run.

Run that eval twice a day and you save $1,752 per year on eval costs alone. For teams running large eval suites, the savings are significant.

def estimate_batch_cost(
    num_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "claude-sonnet-4-6"
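) -> dict:
    """Estimate standard vs. batch cost in USD for a set of similar requests."""
    # Standard pricing in USD per million tokens
    pricing = {
        "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    }

    # Unknown models fall back to Sonnet pricing; add entries for other models
    p = pricing.get(model, {"input": 3.00, "output": 15.00})

    total_input = num_requests * avg_input_tokens
    total_output = num_requests * avg_output_tokens

    standard_cost = (total_input / 1_000_000 * p["input"] +
                     total_output / 1_000_000 * p["output"])
    # Batch requests are billed at 50% of the standard rate
    batch_cost = standard_cost * 0.50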

    return {
        "standard_usd": round(standard_cost, 4),
        "batch_usd": round(batch_cost, 4),
        "savings_usd": round(standard_cost - batch_cost, 4),
    }
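
The yearly figure quoted earlier follows from the per-run saving. Here is a short, self-contained check of that arithmetic. The annual_savings helper is a hypothetical name, and the prices are the Sonnet rates used throughout this post:

```python
def annual_savings(savings_per_run_usd: float, runs_per_day: float,
                   days_per_year: int = 365) -> float:
    """Project a per-run saving over a year, rounded to cents."""
    return round(savings_per_run_usd * runs_per_day * days_per_year, 2)


# 500 prompts, 1,200 input and 400 output tokens each, at $3 / $15 per million tokens
standard_per_run = 500 * 1_200 / 1_000_000 * 3.00 + 500 * 400 / 1_000_000 * 15.00
savings_per_run = standard_per_run * 0.50  # batch pricing is 50% off

print(round(savings_per_run, 2))                        # 2.4
print(annual_savings(savings_per_run, runs_per_day=2))  # 1752.0
```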

Comparing With llm-batch-coalesce

These are two different things that can be confused.

llm-batch-coalesce is about single-flighting concurrent requests to the same prompt. If ten different parts of your code call the model with the same prompt at the same time, llm-batch-coalesce detects the duplicate and makes one API call, sharing the result with all ten callers. This is a synchronous optimization, not the Batch API.

The Anthropic Message Batches API is about submitting many independent requests in one HTTP call and getting results back asynchronously. Different mechanism, different use case.

Use llm-batch-coalesce when you have concurrent code making duplicate synchronous calls. Use the Batch API when you have independent work that does not need to complete within seconds.
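
To show what single-flighting means in practice, here is a minimal asyncio sketch. It illustrates the general technique only and is not the llm-batch-coalesce implementation. The SingleFlight class and the fake model call are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


class SingleFlight:
    """Share one in-flight call among concurrent callers that use the same key."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, "asyncio.Task[str]"] = {}

    async def do(self, key: str, fn: Callable[[], Awaitable[str]]) -> str:
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the real call and register it.
            task = asyncio.ensure_future(fn())
            self._in_flight[key] = task
            # Forget the key once the call finishes so later calls run fresh.
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # Every caller, first or not, waits on the same task.
        return await task


async def demo() -> Tuple[List[str], int]:
    call_count = 0

    async def fake_model_call() -> str:
        nonlocal call_count
        call_count += 1
        await asyncio.sleep(0.01)  # stand-in for network latency
        return "shared answer"

    flight = SingleFlight()
    results = await asyncio.gather(
        *(flight.do("same prompt", fake_model_call) for _ in range(10))
    )
    return list(results), call_count


results, call_count = asyncio.run(demo())
print(call_count, len(results))  # 1 10
```

Ten concurrent callers produce one underlying call. Nothing here is deferred or discounted, which is the key difference from the Batch API.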

Feature	llm-batch-coalesce	Anthropic Batch API
Latency	Synchronous (seconds)	Async (up to 24h)
Use case	Dedup concurrent calls	High-volume offline work
Cost benefit	No discount	50% discount
Max requests	N/A (per-request)	100,000 per batch
Streaming	Supported	Not supported
Results available	Immediately	On batch completion

Tradeoffs to Know

Debugging is harder. When a synchronous call fails, you find out immediately. When a batch request fails, you find out when the batch completes. If you submitted 1,000 requests and 50 failed, you need to identify which ones and why. Build retry logic to handle partial batch failures.

No completion guarantee under 24 hours. Anthropic processes batches on a best-effort basis. Most small batches complete within an hour, but a batch can run for up to 24 hours before it expires. Do not use it when you need a result within the hour.

Results need parsing. The Batch API returns JSONL, not a simple list. Each line is a JSON object with a custom_id, a result type, and either content or an error. anthropic-batch-kit handles this parsing for you, but if you are rolling your own client, budget time for it.

Context windows still apply. Each request in the batch is still subject to the model's context window limit. You cannot use the Batch API to send a larger context than the model supports.
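
If you do roll your own client, the parsing and the partial-failure handling described above fit in two small functions. This is a sketch under assumptions: the record shape (a custom_id plus a nested result object with a type field) is modeled on the description in this post and should be checked against real output, and parse_results and requests_to_retry are hypothetical helper names.

```python
import json
from typing import Dict, List


def parse_results(jsonl_text: str) -> Dict[str, Dict]:
    """Map custom_id -> result object for every non-empty JSONL line."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        results[record["custom_id"]] = record["result"]
    return results


def requests_to_retry(requests: List[Dict], results: Dict[str, Dict]) -> List[Dict]:
    """Return the original requests whose result is missing or not 'succeeded'."""
    return [
        r for r in requests
        if results.get(r["custom_id"], {}).get("type") != "succeeded"
    ]


# Assumed record shape, for illustration only.
sample_jsonl = "\n".join([
    json.dumps({"custom_id": "doc-0", "result": {"type": "succeeded", "message": {"content": "A summary."}}}),
    json.dumps({"custom_id": "doc-1", "result": {"type": "errored", "error": {"message": "overloaded"}}}),
])
submitted = [{"custom_id": "doc-0"}, {"custom_id": "doc-1"}, {"custom_id": "doc-2"}]

parsed = parse_results(sample_jsonl)
retry = requests_to_retry(submitted, parsed)
print([r["custom_id"] for r in retry])  # ['doc-1', 'doc-2']
```

Retrying every failed request blindly is not always right. A request that failed validation will fail again, so a real client should inspect the error before resubmitting.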

Quick Start

pip install anthropic-batch-kit

from anthropic_batch_kit import BatchKit

kit = BatchKit()  # reads ANTHROPIC_API_KEY from env

# Submit
batch_id = kit.submit([
    {
        "custom_id": "item-1",
        "model": "claude-sonnet-4-6",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "What is 2+2?"}]
    }
])

# Later: retrieve
results = kit.poll_and_retrieve(batch_id)
print(results["item-1"]["content"])

Related Tools

Tool	Purpose
anthropic-batch-kit	Submit, poll, retrieve Anthropic batches
llm-batch-coalesce	Single-flight dedup for concurrent sync calls
llm-cost-cap	Pre-flight USD gate for synchronous calls
agenttrace	Per-run cost tracking for synchronous agents
llm-fallback-router	Provider failover for synchronous calls

What Is Next

The Batch API is one of the more underused cost levers available. Most teams default to synchronous calls everywhere even when the use case is async-friendly. If you have eval runs or bulk processing jobs, measure your current monthly spend, apply the 50% discount, and decide if the async complexity is worth it.

Source and examples are at MukundaKatta/anthropic-batch-kit on GitHub.
