The Cheaper API Was 2.5x Cheaper. It Cost 1.6x More.

#python #ai #agents #llm

AI-disclosure: AI-assisted draft, human-reviewed. The demo numbers are the verbatim stdout of a deterministic, stdlib-only Python script included in full below — re-run it and you get the same bytes. The attempt counts in that script are a SYNTHETIC fixture I chose to exercise the accounting mechanism, calibrated to the retry skew I see in my own scraper logs (run counts from my Apify history). It is NOT a benchmark of any named vendor's API or prices. The one external claim (the cost-per-successful-task formula) is attributed and linked.

The cheaper API was 2.5x cheaper per call. The monthly bill came in higher anyway.

Not by a rounding error. The "cheap" option cost 1.63x more per successful task than the one with the bigger sticker price. Same workload. The price page never showed me that number, because the price page doesn't know your success rate. You do — after you've already paid.

This is the arithmetic the per-call price hides. And it's a decision you make before you spend, not a cap you bolt on after.

TL;DR

You compare two API tiers by per-call (or per-token) price and pick the cheaper one. That ranking can be wrong.
You pay for every attempt — success or fail. The denominator that pays the bills is successful tasks, not calls.
True cost = price_per_attempt × attempts ÷ successes. A cheap tier with a low success rate burns its discount on retries.
In the run below: cheap tier $0.0020/call but $0.0096/success; robust tier $0.0050/call but $0.0059/success. The sticker winner loses.
For anyone choosing an API, model, or tier for an agent: log attempts and successes for a week, divide, then decide. The full script is included below; drop in your numbers.

The price page is a sticker, not a bill

Here's the trap, stated plainly. The number on the pricing page is per call. The number on your invoice is per call too — but the value you got is per successful task. Those are different denominators, and the gap between them is exactly the work that failed.

Every attempt is billed. The one that timed out and got retried: billed. The one that came back malformed and you re-prompted: billed. The one that succeeded on the fourth try: billed four times. If a tier fails 35% of its tasks and burns three to six attempts chasing each hard one, you are paying for a lot of calls that produced nothing you can use.

So the real question isn't "which tier is cheaper per call." It's "which tier is cheaper per task I actually completed." Those can point at different tiers. When they do, ranking by the sticker picks the loser.

The formula is small enough to fit in a sentence:

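cost per successful task = cost per attempt ÷ success rate per attempt

Here the success rate is successful tasks per billed attempt (successes ÷ attempts), not the share of tasks that eventually succeed. With one attempt per task the two are the same. Once retries enter, they are not, and only the per-attempt rate makes this formula agree with price × attempts ÷ successes.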

Codebridge put it the same way in a February 2026 write-up titled, literally, Real Cost per Successful Task: "a model that costs $0.01 per attempt but succeeds only 50% of the time effectively costs $0.02 per success," and "the gap between attempted tasks and completed outcomes contains the bulk of real-world cost." (codebridge.tech) Same mechanism. My contribution here isn't the formula — it's showing the ranking flip with a number you can reproduce, and where the realistic retry shape comes from.
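As a quick check on that formula, here is a minimal sketch that applies it to the quoted example and to the "succeeded on the fourth try" case from earlier. The function name `cost_per_success` is my own label, not something from Codebridge or any SDK, and it assumes every attempt is billed at the same price.

```python
def cost_per_success(price_per_attempt, attempts, successes):
    """Total spend divided by successful tasks; every attempt is billed."""
    if successes == 0:
        return float("inf")  # nothing usable was produced at any price
    return price_per_attempt * attempts / successes


# Quoted example: $0.01 per attempt at 50% success means 2 attempts per success.
quoted = cost_per_success(0.01, attempts=2, successes=1)

# One task that succeeds on the fourth try is billed four times.
fourth_try = cost_per_success(0.0020, attempts=4, successes=1)

print(f"quoted example:  ${quoted:.4f} per success")      # $0.0200
print(f"fourth-try task: ${fourth_try:.4f} per success")  # $0.0080
```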

Measuring the flip

I wrote a small script to make the flip concrete. It's deterministic — stdlib only, no network, no random, no clock. Two tiers, 40 tasks each. For every task it records how many billed attempts it took and whether it ultimately succeeded. Then it computes three numbers per tier: per call, per task (spend spread over all tasks), and per successful task.

One honest caveat up front, because it matters: the attempt counts are a synthetic fixture I wrote by hand — numbers I chose to exercise the mechanism. They are not a measurement of any named vendor. What makes them realistic rather than arbitrary is that I shaped the skew to mirror what I see in my own scraper production logs across 2,190 lifetime runs: the cheap, flaky source eats far more retries per success than the stable one. The mechanism is real. The specific cells are illustrative. Swap in your own and the script does the same arithmetic.

#!/usr/bin/env python3
# cost_per_successful_task.py
# Deterministic, stdlib-only, no network. Fixture is inlined below.
#
# Question this answers:
#   You pick the option with the cheaper per-call price. Is it actually cheaper
#   PER SUCCESSFUL TASK once you pay for the failed attempts and retries?
#
# Mechanism (the whole point):
#   true cost-per-success = (price_per_attempt * attempts_spent) / successes
#   A cheap-per-attempt option with a low success rate makes you pay for the
#   wasted attempts on every retry. The headline price lies. The denominator
#   that matters is *successful tasks*, not calls.
#
# This is NOT an LLM benchmark. It is a stdlib simulation of the accounting
# mechanism. The attempt counts are a fixed, hand-written fixture (no RNG),
# chosen to mirror the retry skew we see in our own scraper production logs
# (2,190 lifetime runs): the "cheap" tier eats far more retries per success.

PRICE = {
    # price charged PER ATTEMPT (every attempt is billed, success or fail)
    "cheap_tier":  0.0020,   # looks 2.5x cheaper per call
    "robust_tier": 0.0050,
}

# Fixture: for each task we record how many BILLED attempts it took, and
# whether it ultimately SUCCEEDED. Deterministic, written out by hand so the
# run is fully reproducible. 40 tasks per tier.
#   - cheap_tier: low success rate, heavy retrying (mirrors our flaky-source logs:
#     the cheap option fails ~35% of tasks and burns 3-6 billed attempts chasing
#     each one before giving up or limping to a success)
#   - robust_tier: high success rate, almost always first-try
#
# Each entry = (attempts_billed, succeeded)
TASKS = {
    "cheap_tier": [
        (6, False), (1, True), (5, False), (2, True), (1, True),
        (6, False), (5, False), (2, True), (1, True), (4, True),
        (6, False), (3, True), (1, True), (2, True), (5, False),
        (1, True), (6, False), (2, True), (5, False), (1, True),
        (2, True), (3, True), (6, False), (1, True), (2, True),
        (6, False), (1, True), (4, True), (5, False), (1, True),
        (6, False), (2, True), (1, True), (5, False), (2, True),
        (1, True), (3, True), (6, False), (2, True), (1, True),
    ],
    "robust_tier": [
        (1, True), (1, True), (1, True), (2, True), (1, True),
        (1, True), (1, True), (1, True), (2, True), (1, True),
        (1, True), (1, False), (1, True), (1, True), (1, True),
        (2, True), (1, True), (1, True), (1, True), (1, True),
        (1, True), (1, True), (2, True), (1, True), (1, True),
        (1, True), (1, True), (1, True), (1, True), (2, True),
        (1, True), (1, True), (1, True), (1, True), (1, True),
        (1, True), (2, True), (1, True), (1, True), (1, True),
    ],
}


def summarize(tier):
    rows = TASKS[tier]
    price = PRICE[tier]
    n_tasks = len(rows)
    attempts = sum(a for a, _ in rows)
    successes = sum(1 for _, ok in rows if ok)
    spend = attempts * price
    success_rate = successes / n_tasks
    naive_per_call = price                      # the sticker price you compare
    naive_per_task = spend / n_tasks            # spend spread over ALL tasks
    true_per_success = spend / successes        # the number that pays the bills
    return {
        "tier": tier,
        "n_tasks": n_tasks,
        "attempts": attempts,
        "successes": successes,
        "success_rate": success_rate,
        "spend": spend,
        "naive_per_call": naive_per_call,
        "naive_per_task": naive_per_task,
        "true_per_success": true_per_success,
    }


def main():
    cheap = summarize("cheap_tier")
    robust = summarize("robust_tier")

    print("=" * 64)
    print("COST PER SUCCESSFUL TASK — sticker price vs the real bill")
    print("(stdlib simulation of the accounting mechanism; not an LLM bench)")
    print("=" * 64)
    print()
    hdr = "{:<12} {:>8} {:>9} {:>9} {:>12} {:>16}"
    print(hdr.format("tier", "per-call", "tasks", "success%", "per-task", "per-SUCCESS-task"))
    for r in (cheap, robust):
        print(hdr.format(
            r["tier"].replace("_tier", ""),
            f"${r['naive_per_call']:.4f}",
            r["n_tasks"],
            f"{r['success_rate']*100:.0f}%",
            f"${r['naive_per_task']:.4f}",
            f"${r['true_per_success']:.4f}",
        ))
    print()

    # Who wins on the sticker (per-call) price?
    sticker_winner = min((cheap, robust), key=lambda r: r["naive_per_call"])
    # Who wins on the number that actually pays the bills?
    real_winner = min((cheap, robust), key=lambda r: r["true_per_success"])

    ratio = cheap["true_per_success"] / robust["true_per_success"]

    print(f"Sticker price says cheapest: {sticker_winner['tier'].replace('_tier','')} "
          f"(${sticker_winner['naive_per_call']:.4f}/call)")
    print(f"Cost-per-SUCCESS says cheapest: {real_winner['tier'].replace('_tier','')} "
          f"(${real_winner['true_per_success']:.4f}/success)")
    print()
    print(f"The 'cheap' tier is {robust['naive_per_call']/cheap['naive_per_call']:.1f}x "
          f"cheaper per call,")
    print(f"but {ratio:.2f}x MORE EXPENSIVE per successful task.")
    print()
    print(f"Why: cheap tier burned {cheap['attempts']} attempts for "
          f"{cheap['successes']} successes "
          f"({cheap['attempts']/cheap['successes']:.2f} attempts/success);")
    print(f"     robust tier burned {robust['attempts']} attempts for "
          f"{robust['successes']} successes "
          f"({robust['attempts']/robust['successes']:.2f} attempts/success).")
    print()
    print("VERDICT: the per-call price flipped the winner. The decision is made")
    print("BEFORE you spend — on cost-per-success, not on the sticker.")

    # ---- asserts: lock the invariants that make the article true ----
    # 1) cheap really is cheaper per call
    assert cheap["naive_per_call"] < robust["naive_per_call"]
    # 2) ...but the winner FLIPS on cost-per-success
    assert cheap["true_per_success"] > robust["true_per_success"]
    assert sticker_winner["tier"] == "cheap_tier"
    assert real_winner["tier"] == "robust_tier"
    # 3) the flip is material (cheap is >1.5x worse per success)
    assert ratio > 1.5
    print()
    print("All asserts passed.")


if __name__ == "__main__":
    main()

Run it with python3 -I cost_per_successful_task.py. Here is the exact output:

================================================================
COST PER SUCCESSFUL TASK — sticker price vs the real bill
(stdlib simulation of the accounting mechanism; not an LLM bench)
================================================================

tier         per-call     tasks  success%     per-task per-SUCCESS-task
cheap         $0.0020        40       65%      $0.0063          $0.0096
robust        $0.0050        40       98%      $0.0057          $0.0059

Sticker price says cheapest: cheap ($0.0020/call)
Cost-per-SUCCESS says cheapest: robust ($0.0059/success)

The 'cheap' tier is 2.5x cheaper per call,
but 1.63x MORE EXPENSIVE per successful task.

Why: cheap tier burned 125 attempts for 26 successes (4.81 attempts/success);
     robust tier burned 46 attempts for 39 successes (1.18 attempts/success).

VERDICT: the per-call price flipped the winner. The decision is made
BEFORE you spend — on cost-per-success, not on the sticker.

All asserts passed.

Read the table once. Per call, cheap is $0.0020 and robust is $0.0050 — exactly the 2.5x discount the sticker promises. Per successful task, cheap is $0.0096 and robust is $0.0059. The ranking flips. The discount didn't disappear; it got spent on the 14 tasks that never succeeded and the retries chasing them.
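To see where the discount went, one way to split the cheap tier's billed attempts is into three buckets: attempts on tasks that never succeeded, retries that preceded an eventual success, and the one attempt per successful task that actually delivered. A minimal sketch, reusing the cheap_tier fixture from the script (the bucket names are mine):

```python
# cheap_tier fixture copied from the script: (attempts_billed, succeeded)
CHEAP = [
    (6, False), (1, True), (5, False), (2, True), (1, True),
    (6, False), (5, False), (2, True), (1, True), (4, True),
    (6, False), (3, True), (1, True), (2, True), (5, False),
    (1, True), (6, False), (2, True), (5, False), (1, True),
    (2, True), (3, True), (6, False), (1, True), (2, True),
    (6, False), (1, True), (4, True), (5, False), (1, True),
    (6, False), (2, True), (1, True), (5, False), (2, True),
    (1, True), (3, True), (6, False), (2, True), (1, True),
]
PRICE_PER_ATTEMPT = 0.0020

# Every attempt on a task that never succeeded is pure waste.
failed_task_attempts = sum(a for a, ok in CHEAP if not ok)
# On a successful task, all attempts except the last one were failed retries.
retries_before_success = sum(a - 1 for a, ok in CHEAP if ok)
# Exactly one attempt per successful task produced the usable result.
productive = sum(1 for _, ok in CHEAP if ok)

total = failed_task_attempts + retries_before_success + productive
for label, n in (
    ("attempts on failed tasks", failed_task_attempts),
    ("retries before a success", retries_before_success),
    ("productive attempts", productive),
):
    print(f"{label:<26} {n:>4}  ${n * PRICE_PER_ATTEMPT:.4f}  {n / total:.0%}")
```

On this fixture that comes to 78 attempts on failed tasks, 21 retries before a success, and 26 productive attempts, so roughly four in five billed calls returned nothing usable.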

Why it flips: count attempts per success, not calls per dollar

The line that explains everything is the "Why" line near the end of the output: cheap burned 125 attempts for 26 successes, or 4.81 attempts per success. Robust burned 46 attempts for 39 successes, or 1.18 attempts per success.

That's roughly a 4x difference (4.81 ÷ 1.18 ≈ 4.1) in how many billed calls it takes to get one usable result. A 2.5x price discount cannot survive a 4x attempt penalty. The math isn't close. 0.0020 × 4.81 ≈ 0.0096. 0.0050 × 1.18 ≈ 0.0059. The cheap tier is cheaper at the unit you don't ship and more expensive at the unit you do.
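The same comparison can be phrased as a break-even test: the cheap tier only wins if its attempts-per-success penalty is smaller than its per-call discount. A minimal sketch using the figures from the run above (the helper name `cheap_tier_wins` is hypothetical, and the second call uses a made-up attempt count purely for contrast):

```python
def cheap_tier_wins(cheap_price, robust_price, cheap_aps, robust_aps):
    """True if the cheap tier is still cheaper per successful task.

    aps = billed attempts per successful task.
    """
    price_discount = robust_price / cheap_price   # how many times cheaper per call
    attempt_penalty = cheap_aps / robust_aps      # how many times more calls per success
    return attempt_penalty < price_discount


# Figures from the run above: 125/26 and 46/39 attempts per success.
print(cheap_tier_wins(0.0020, 0.0050, 125 / 26, 46 / 39))  # False: penalty ~4.08 vs discount 2.5

# Hypothetical: if the cheap tier needed only 2 attempts per success, it would win.
print(cheap_tier_wins(0.0020, 0.0050, 2.0, 46 / 39))       # True: penalty ~1.70 vs discount 2.5
```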

Notice the per-task column too. Spreading spend over all 40 tasks, the two tiers look almost tied: $0.0063 vs $0.0057. That column is a trap of its own. It counts the failed tasks in the denominator as if they were worth something. They weren't. Divide only by what succeeded and the real gap shows up.
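The per-task and per-success figures are related by one division: per-success cost equals per-task cost divided by the share of tasks that succeeded. A small sketch with the totals from the run above (the function name is illustrative):

```python
def per_task_and_per_success(spend, tasks, successes):
    """Return (cost per task, cost per successful task) for one tier."""
    per_task = spend / tasks
    task_success_rate = successes / tasks
    # Dividing by the task success rate removes failed tasks from the denominator.
    per_success = per_task / task_success_rate
    return per_task, per_success


# Totals from the run above: attempts x price, 40 tasks, successes.
cheap = per_task_and_per_success(125 * 0.0020, 40, 26)
robust = per_task_and_per_success(46 * 0.0050, 40, 39)

for name, (per_task, per_success) in (("cheap", cheap), ("robust", robust)):
    print(f"{name:<7} per-task ${per_task:.4f}   per-success ${per_success:.4f}")
```

For the robust tier the two figures barely differ, because almost every task succeeds. For the cheap tier, dividing by a 65% task success rate raises the figure by a little over half.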

"But my success rates are basically the same"

You might be thinking: my two options aren't 65% vs 98%, they're more like 92% vs 95%, so this doesn't apply to me. Maybe. That's exactly the point, though — you don't know until you count, and you can't eyeball a 4x attempt ratio from a pricing table.

A small gap in success rate matters more than it looks when one tier also retries harder. Two things compound: the fraction that never succeeds (pure waste) and the attempts-per-success on the ones that do. A tier can have a "fine" 90% success rate and still burn three attempts on every hard task, and that second factor never shows up as a failure in your dashboard — it shows up as a bigger bill. So don't guess the gap. Log it.
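To make the compounding concrete, here is a sketch with two hypothetical 100-task workloads. These are made-up numbers chosen to illustrate the point, not measurements of any tier; the grouping format `(task_count, attempts_each, succeeded)` is an assumption for brevity.

```python
def attempts_per_success(task_groups):
    """task_groups: list of (task_count, attempts_each, succeeded).

    Returns (billed attempts per successful task, task success rate).
    """
    attempts = sum(n * a for n, a, _ in task_groups)
    successes = sum(n for n, _, ok in task_groups if ok)
    tasks = sum(n for n, _, _ in task_groups)
    return attempts / successes, successes / tasks


# Hypothetical tiers, 100 tasks each.
tier_a = [(70, 1, True), (20, 3, True), (10, 3, False)]  # 90% success, 3 attempts on hard tasks
tier_b = [(90, 1, True), (5, 2, True), (5, 2, False)]    # 95% success, light retrying

aps_a, rate_a = attempts_per_success(tier_a)
aps_b, rate_b = attempts_per_success(tier_b)

print(f"tier_a: {rate_a:.0%} success, {aps_a:.2f} attempts/success")  # 90%, 1.78
print(f"tier_b: {rate_b:.0%} success, {aps_b:.2f} attempts/success")  # 95%, 1.16
print(f"attempt penalty: {aps_a / aps_b:.2f}x")                       # 1.54x
```

In this made-up case a five-point gap in success rate sits alongside a roughly 1.5x gap in billed attempts per success, and most of that gap comes from retries on tasks that a dashboard would count as successes.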

Here's the honest limit of my own claim, since I'm asking you to log yours: I haven't run this exact A/B across two named LLM APIs in production. The retry skew is real and comes from my scraper logs, where flaky sources have always cost multiples more per usable record than stable ones. The two-tier flip in the script is a clean illustration of that pattern, not a vendor benchmark. If you run it for real and the gap is small — great, you just bought certainty for the price of one week of logging.

What to do Monday

The change is procedural, not technical, and it happens before you commit to a tier — not as a spending cap you add after the bill scares you.

Log two counters per option: billed attempts, and successful tasks. Some SDKs surface attempt or retry counts; if yours does not, increment a counter in your retry wrapper.
Run a week on each tier (or a fair sample of the same workload through both). You need real tasks, not a synthetic ping — your hard cases are where the cheap tier bleeds.
Rank by total_spend ÷ successes, not by the price page. That single division is the whole decision.
Then choose. If the cheap tier still wins on cost-per-success, take it with confidence. If it flips, you just avoided paying 1.6x more for the privilege of a smaller sticker.
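The steps above can be sketched as a small retry wrapper that keeps the two counters. This is a sketch under stated assumptions, not any SDK's API: `AttemptLedger` and `make_flaky` are hypothetical names, a failed attempt is assumed to raise an exception, and every attempt is assumed to be billed at a flat price.

```python
class AttemptLedger:
    """Counts billed attempts and successful tasks for one tier."""

    def __init__(self, price_per_attempt):
        self.price_per_attempt = price_per_attempt
        self.attempts = 0
        self.successes = 0

    def run(self, task, max_attempts=3):
        """Call task() until it succeeds or attempts run out; count every call."""
        for _ in range(max_attempts):
            self.attempts += 1      # billed whether or not it works
            try:
                result = task()
            except Exception:
                continue            # failed attempt is already counted; try again
            self.successes += 1
            return result
        return None                 # the task never succeeded

    def cost_per_success(self):
        if self.successes == 0:
            return float("inf")
        return self.price_per_attempt * self.attempts / self.successes


def make_flaky(fail_first_n):
    """Stand-in for a real API call: raises on its first fail_first_n calls."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= fail_first_n:
            raise TimeoutError("simulated timeout")
        return "ok"

    return task


ledger = AttemptLedger(price_per_attempt=0.0020)
ledger.run(make_flaky(0))  # succeeds first try: 1 attempt
ledger.run(make_flaky(2))  # succeeds on the third try: 3 attempts
ledger.run(make_flaky(5))  # never succeeds within 3 attempts: 3 attempts
print(ledger.attempts, ledger.successes, f"${ledger.cost_per_success():.4f}")  # 7 2 $0.0070
```

Keep one ledger per tier, run the same tasks through both, and compare `cost_per_success()` at the end of the week. If your provider bills by tokens rather than by call, the counter would need to accumulate actual spend per attempt instead of a flat price.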

This is upstream of every budget guardrail. A spending cap stops you after you've chosen wrong and started bleeding. Choosing on cost-per-success means there's less to cap, because you picked the tier that wastes fewer attempts in the first place. (If you do want the downstream guardrail too, the HTTP 402 budget piece is the other half of this — that one's about capping spend during a run; this one's about which option you pick before the run.)

The price page sells you a per-call number because it's the number that makes the vendor look cheapest. The number that pays your invoice is per successful task. Compute the second one yourself, with your own logs, before you switch.

What's the widest gap you've seen between the sticker price and the real cost-per-success once you counted the retries? I'm collecting the worst flips — drop yours in the comments. 👇

Follow for the next batch of cost-per-success numbers from production. I read every comment.