Ravi Patel

Posted on Jun 13 • Originally published at ssimplifi.com

Batch API vs real-time OpenAI: the 50% discount, the 24-hour latency tolerance, and the workloads that should switch

#openai #batchapi #costoptimization #asyncprocessing

OpenAI's Batch API is one of the highest-ROI cost levers in the catalog, and one of the least-used. The mechanic: submit a JSONL file of chat completions to the Batch endpoint, pay 50% of the normal rate, accept up to 24 hours of processing latency, retrieve the results when ready. For any workload that doesn't need real-time response — and most companies have at least one — this is a free 50% cut on that slice. The reason it's under-used is that "Batch API" sounds intimidating compared to a single synchronous call, and most teams default to the chat completions endpoint reflexively. This post walks through the mechanic, the integration pattern, the realistic workload classification (which slice should batch, which shouldn't), the cost math, and the operational gotchas that surface in production deployments.

The parent guide OpenAI cost optimization covers OpenAI-specific cost techniques generally; this article is the Batch-API-specific deep dive.

What it is, mechanically

OpenAI's Batch API is a different endpoint from chat completions. Instead of a single request-response over HTTP, the batch flow is:

Compose a JSONL file where each line is one chat completion request (structurally similar to what you'd POST to /v1/chat/completions, with a custom custom_id field per request).
Upload the file via the Files API.
Create a batch by POSTing to /v1/batches with the uploaded file ID + the endpoint to call (/v1/chat/completions) + the completion window (24h).
Poll for batch status. Batches transition through validating → in_progress → completed. Typical end-to-end time is 30 minutes to a few hours; the 24-hour window is a guarantee, not a typical wait.
Download the results when the batch completes. Output is a JSONL file with one line per request, matched to the input via the custom_id field.

The pricing trade: 50% off the equivalent chat completions pricing in exchange for the up-to-24-hour processing window. Same model. Same response shape per request. Same usage block (including cached_tokens for prompt caching — Batch + prompt caching stack cleanly).

The full Batch API reference lives in OpenAI's docs; the rest of this article assumes the mechanic and focuses on when and how to use it.

The pricing math

For a representative offline workload — 100,000 chat completions per day, average 1,000 input tokens + 300 output tokens, on GPT-5.4:

Path	Per-day cost	Monthly cost
Real-time chat completions	100K × (1,000 × $2.50 + 300 × $15) / 1M = $700/day	~$21,000/month
Batch API	$700 × 0.5 = $350/day	~$10,500/month

Net saving: $350/day, ~$10,500/month, 50% of the workload's spend. The numbers scale linearly with volume; a 1M-requests-per-day batch-eligible workload saves $3,500/day.

VERIFY (founder): replace the example with a representative real-customer workload at current pricing. The illustrative numbers above are reasonable but worth grounding in real production data.

The math doesn't care about workload shape — the 50% discount applies to all chat completions through the Batch endpoint, regardless of model. GPT-5.4-mini batch is 50% off mini pricing; GPT-5 batch is 50% off GPT-5 pricing. The discount is uniform.

The bottom line: for any workload running ≥$1K/month through chat completions that can tolerate up to 24-hour latency, Batch API is a no-engineering-time-required 50% cut.

Which workloads actually qualify

This is where most teams stumble — not on the mechanic but on the classification of which workloads can move to Batch.

The "obviously yes" workloads

Offline analytics on logged data. Re-running an LLM analysis on yesterday's logs, generating insights for a weekly report, classifying historical content. No user is waiting on the result; the consumer is a batch report or a dashboard refresh. Move to Batch.

Bulk content moderation. Reviewing flagged content from the past 24 hours; the moderation decision feeds a queue or a follow-up workflow, not a user-facing block. Move to Batch.

Evaluation runs. Running a 1,000-prompt eval set against a new prompt version, computing aggregate scores, deciding whether to roll out. No user-facing latency requirement. Move to Batch.

Dataset generation / labeling. Generating synthetic training data, labeling unannotated examples, summarizing long-form content for downstream processing. Async by nature. Move to Batch.

Content generation pipelines that aren't time-critical. Generating product descriptions for an e-commerce catalog refresh; producing meta-descriptions for SEO content; bulk-translating documentation. The consumer waits for the batch to complete and processes the results. Move to Batch.

The "depends on the requirement" workloads

Customer support back-office. If "the AI summary of this ticket appears in the support agent's dashboard within an hour" is acceptable, move to Batch. If "the agent expects the summary the moment they open the ticket," stay real-time.

Email-content generation. If emails are sent on a daily cron, move to Batch. If they're triggered by user action and sent immediately, stay real-time.

Notification generation. Same shape — daily-digest notifications batch fine; transactional notifications need real-time.

Document processing pipelines. Often-batchable; depends on whether the user is waiting for the document to complete (real-time) or whether the document feeds a downstream queue (batchable).

The decision pattern: does a human (or a time-sensitive consumer) explicitly wait for this LLM response? If yes, stay real-time. If no, Batch is on the table.

The "obviously no" workloads

Interactive chat UIs. User typed, expects response in seconds. Real-time only.

Real-time agents responding to user actions. Same shape — user action triggers LLM call, user sees result. Real-time only.

Code completion. Inline tokens appearing as the user types. Real-time only.

Anything with user-facing latency SLAs under an hour. Batch latency is "up to 24 hours" — even the typical 30-minute-to-few-hours window is wrong for any sub-hour SLA.

The realistic split for most production deployments

The interesting finding when teams actually classify their workloads: 20-40% of total LLM spend is batch-eligible. Most teams have at least one offline analytics workflow, one content-generation pipeline, one evaluation cadence — and the cumulative volume across these is meaningful.

The first time a team does the audit, the typical reaction is "we've been overpaying for a third of our spend by routing it through real-time when it didn't need to be." The audit itself takes about half a day; the migration is another half day to a day per workload.

The integration pattern

The architectural shape that holds up in production:

# Pseudo-code for the canonical Batch integration

def submit_batch_job(workload_name: str, requests: list[dict]) -> str:
    """Submit a batch and return the batch ID."""
    # 1. Compose JSONL with custom_id per request
    jsonl_content = "\n".join(json.dumps({
        "custom_id": f"{workload_name}-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": request,
    }) for i, request in enumerate(requests))

    # 2. Upload as a file
    file = openai.files.create(
        file=("batch.jsonl", jsonl_content),
        purpose="batch",
    )

    # 3. Create the batch
    batch = openai.batches.create(
        input_file_id=file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )

    return batch.id

def poll_and_retrieve(batch_id: str) -> list[dict]:
    """Poll until the batch completes, then return results."""
    while True:
        batch = openai.batches.retrieve(batch_id)
        if batch.status == "completed":
            output_file = openai.files.content(batch.output_file_id)
            return [json.loads(line) for line in output_file.text.split("\n") if line]
        elif batch.status in ("failed", "expired", "cancelled"):
            raise BatchFailure(f"Batch {batch_id} ended in {batch.status}")
        time.sleep(60)  # poll every minute; tune for your cadence

The pattern in production deployments typically wraps the above in:

A job queue (Celery, Sidekiq, or whatever async-job system you use) that handles the submission + polling + result distribution.
Result correlation by custom_id — your application matches batch outputs back to the original work items via the IDs you assigned.
Failure handling for the rare cases where a batch fails (typically: a malformed input line, a model unavailable for batching, an account-level batch quota hit).
Cost tracking that attributes batch spend to the originating workload, since the JSONL file aggregates multiple requests under one batch ID.

The submission code itself is small; the operational wrapping is where the engineering investment lives. Most teams need 2-3 days of focused work to move a single workload from real-time to Batch with proper error handling and observability.

When Batch + prompt caching combine

A common gotcha worth flagging: Batch API and prompt caching stack cleanly. If your batch requests share a stable system prompt (which they typically do — same workload, same prompt structure across batch entries), prompt caching engages within the batch, and the discount lands on top of the 50% Batch discount.

The effective math: 50% Batch discount × ~85% effective input price after prompt-caching discount = ~42.5% of the original price on the input-token portion of batched requests with stable prefixes. The headline 50% discount understates the real saving on well-structured workloads.

This is a feature, not a workaround. OpenAI explicitly supports both at once.

Failure modes and operational gotchas

The patterns that trip up production deployments:

Latency variability. Batches don't always take close to 24 hours. Most complete in 30 minutes to 4 hours; some take longer; the SLA is just the worst-case guarantee. Design your downstream processing to tolerate batch-time variability — don't hard-code "1 hour" assumptions.

Account-level batch quotas. OpenAI imposes per-account quotas on batch token volume and batch count. For high-volume workloads, you may need to break a single conceptual batch into multiple submitted batches to stay under the limit. Production code should check quota state before submitting and queue if necessary.

Malformed input lines. A single malformed JSONL line can fail the whole batch (depending on the failure mode). Validate input before submission — pydantic models or equivalent type-checking on each request before serialising to JSONL.

Result file expiration. Batch output files expire after some period (typically 7 days). Download and process them promptly; don't leave results sitting on OpenAI's side as your durable storage.

Cost attribution complexity. A single batch ID covers N requests. Per-feature attribution requires propagating the custom_id through the batch flow and recording per-request cost separately. Worth wiring properly the first time.

Cancellation timing. Batches can be cancelled before completion, but the time to actually stop charging is bounded by how much processing already happened. Cancellation is best-effort, not instantaneous.

How Prism handles Batch API

Prism doesn't currently proxy Batch API calls — the v1.0-v1.8 product surface focuses on the real-time /v1/chat/completions endpoint. Batch workloads typically call OpenAI directly (or through a different infrastructure layer that's purpose-built for batch processing).

The strategic call: customers running batch workloads typically also have real-time workloads, and Prism captures the cost-engineering value on the real-time slice (caching, routing, savings tracking). The batch slice runs in parallel with no Prism involvement; the customer gets the 50% Batch discount directly from OpenAI.

VERIFY (founder): confirm Prism currently doesn't proxy Batch API — accurate as of v1.7-B / v1.8 product scope. If Batch proxy is on the v2.0 roadmap or has been added, update accordingly.

For applications that mix real-time + batch:

Real-time requests go through Prism (api.ssimplifi.com/v1/chat/completions) for the cost-engineering layer.
Batch requests go directly to OpenAI's Batch endpoint. Same API keys; same model selection; same usage accounting (your OpenAI dashboard shows both).
Per-feature spend attribution requires correlating both streams — your application's usage logs aggregate Prism-side data + OpenAI-side batch data.

This is the standard pattern for AI gateway + Batch coexistence. Not every cost-engineering tool needs to live behind one product surface; the Batch API is sufficiently standalone that direct OpenAI integration is the right shape for that slice.

Decision framework

If you're evaluating whether to move a workload to Batch:

Audit your current real-time LLM calls. Classify each by "is a human/time-sensitive consumer waiting on this response in real time." Yes → stays real-time. No → batch candidate.
Quantify the eligible spend. What fraction of your monthly LLM bill comes from the batch-eligible workloads? If <10%, the engineering investment isn't worth it; >20%, it's a clear win.
Pick one workload to migrate first. Usually the highest-volume offline analytics or content-generation pipeline. The first migration is 2-3 days of focused engineering; subsequent ones are faster.
Wire the operational pieces. Job queue, result correlation, failure handling, cost attribution.
Validate the result quality. Batch responses should be identical to real-time responses for the same model + parameters; verify on a sample before flipping production.
Roll out, monitor, expand to additional workloads.

The audit step is the most underrated. Most teams skip it because Batch sounds like a niche feature; the audit usually reveals a substantial slice of "we've been overpaying for X% of our spend" that justifies the investment by itself.

Where to go next

For the parent OpenAI cost-optimization context: OpenAI cost optimization. For the broader cost-reduction playbook this sits inside: LLM cost reduction. For the prompt caching that stacks with Batch: OpenAI prompt caching explained and provider-native caching.

For modelling cost impact on your specific workload: savings calculator.

FAQ

Does the Batch API support all OpenAI models?

Most chat-completion models are supported. Some specialised models (audio, vision-specific endpoints, real-time-specific models) aren't available via Batch. Check the OpenAI Batch documentation for the current supported-models list before assuming compatibility.

What's the typical batch turnaround time?

Empirically 30 minutes to 4 hours for most batches; the 24-hour guarantee is the worst case. The actual time depends on OpenAI's current batch-processing capacity, the size of your batch, and the model. Don't hard-code "1 hour" assumptions; design your downstream processing for variability.

Can I use prompt caching in batch requests?

Yes — batch and prompt caching stack cleanly. If your batch requests share a stable system prompt, prompt caching engages on the cached portions, with the 50% caching discount applied on top of the 50% Batch discount. The combined effective price on input tokens with both engaged is ~42.5% of the original.

What about Batch API for Anthropic?

Anthropic offers a similar batch endpoint with comparable economics (~50% discount, 24-hour processing window). The integration pattern is different from OpenAI's; check Anthropic's documentation for the specifics. Other providers (Google, Mistral, etc.) vary in batch support — some have it, some don't.

How do I do per-feature cost attribution on batch requests?

Propagate the custom_id field through your application. Set custom_id to encode the feature identifier (e.g. "<feature>-<workload>-<index>"). When you download batch results, parse the custom_id to attribute each completion to the correct feature. Aggregate per-feature spend offline by joining batch outputs with your usage-tracking system.

What if a batch fails partway through?

Typically the failed batch leaves you with no usable output (you don't get partial results). The mitigation: validate your input thoroughly before submission, monitor batch status, and design your application to handle the re-submission case. Most batch failures are due to malformed input lines or quota issues — both preventable with proper pre-submission checks.

Can I cancel a batch after submitting?

Yes, but cancellation is best-effort. Once you call cancel, OpenAI stops processing new items from the batch, but items already in-flight may complete and be charged. You don't get charged for items that haven't started yet. Plan for partial charges if you cancel mid-flight.

Does Batch + speculative routing make sense?

No — speculative routing is a real-time latency-hedging technique; Batch processing isn't latency-bound in the same way. The two patterns target different problems and don't combine. Use speculative for real-time-critical workloads; use Batch for latency-tolerant ones.

OpenAI's Batch API is one of the easiest cost wins in the catalog — 50% off, no quality regression, applies to any workload that can tolerate up to 24-hour processing latency. The OpenAI cost optimization pillar covers the broader OpenAI-specific cost-reduction stack; the LLM cost reduction pillar covers the cross-provider techniques.

DEV Community