Alex Spinov

Posted on Jun 29 • Originally published at blog.spinov.online

Your Agent Success Rate Counts Only the Survivors

#ai #llm #observability #python

Your agent dashboard says 90% success. It is wrong, and not because the math is sloppy. It is wrong because of which runs it forgot to count. Every run that timed out, got aborted, or is still stuck in RUNNING three hours later has quietly slipped out of the denominator. A run that FAILED is the honest one. It raised its hand, it sits in your error logs, it is already dragging the number down where it belongs. The run you should be scared of is the one that never came back to tell you anything.

That is survivorship bias, and it lives in almost every reliability number I have looked at.

TL;DR

A naive success rate divides wins by "runs that returned a clean pass or fail." That set excludes timed-out, aborted, and hung runs.
Excluded runs leave the denominator, so they inflate the rate by being invisible. The metric looks better the more runs disappear.
The fix is not better error handling. A FAILED run is already counted. The dangerous run has no terminal verdict at all.
One-line change: count runs that started, not runs that finished. On the synthetic numbers below, 90.0% becomes 72.0%.

The plane that came back

In 1943 the US military looked at bombers coming home from Europe and mapped where they were taking the most damage. Wings. Fuselage. Tail. The obvious move was to bolt armor onto those spots. Abraham Wald, a statistician at the Statistical Research Group, argued the opposite. Armor the engines, the one place with almost no holes on the planes that landed. The planes hit in the engines were not in his sample. They never made it home to be measured. The damage you do not see is the damage that kills.

Your run ledger has the same shape. You measure the runs that came home.

How the rate goes wrong

Most success-rate code I have seen, mine included, looks like this in spirit: take the count of SUCCEEDED, divide by SUCCEEDED plus FAILED, multiply by a hundred. Clean. It reads like a pass rate on an exam. The trouble hides in the words "plus FAILED," because that is the entire denominator. You are dividing wins by the runs that came back with a clear yes or a clear no.

Plenty of runs never come back with either.

A long crawl's worker drops off the network at row 9,000 and never reports back. A run hits a wall-clock limit and the platform marks it TIMED-OUT. Someone kills a wedged job by hand. And the worst case of all: a run that simply hangs. No exit code, no terminal status, no log line after 14:02. It is still listed as RUNNING days later because nothing ever wrote the ending.

None of those are SUCCEEDED. None of them are FAILED either. So a "succeeded over succeeded-plus-failed" rate does not rate them low. It deletes them from the question. The denominator shrinks and the rate climbs. The more runs that vanish into a non-terminal limbo, the healthier the dashboard looks. The metric rewards exactly the failure mode that should scare you most.
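To watch the rate climb as runs vanish, here is a minimal sketch. The numbers are invented for illustration (100 runs started, 70 succeeded, and the other 30 either report FAILED or disappear without a verdict), and the function name naive_rate is mine, not from any library.

```python
# Hypothetical numbers for illustration only: 100 runs started, 70 succeeded.
# The remaining 30 either report FAILED or vanish without a terminal verdict.
STARTED = 100
SUCCEEDED = 70


def naive_rate(succeeded: int, failed: int) -> float:
    """Succeeded over succeeded-plus-failed, as a percentage."""
    return round(100 * succeeded / (succeeded + failed), 1)


started_based = round(100 * SUCCEEDED / STARTED, 1)

for vanished in (0, 10, 20, 30):
    # Every run that vanishes is one fewer FAILED row in the denominator.
    failed = STARTED - SUCCEEDED - vanished
    print(f"vanished={vanished:>2}  failed={failed:>2}  "
          f"naive={naive_rate(SUCCEEDED, failed)}%  "
          f"started-based={started_based}%")
```

Each run that moves from FAILED to vanished pushes the naive number up, while the started-based number does not move, because the same 70 runs succeeded out of the same 100 that began.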

The fix is not error handling

Here is the part that took me embarrassingly long to see. I spent a couple of days hardening error handling. Tighter try/except boundaries, retries with backoff, cleaner FAILED records. None of it moved the real number, because FAILED was never the problem.

A FAILED run is the honest citizen of your ledger. It threw an exception you could catch. It is in your error logs, it is in your alerts, and it is already inside the denominator. When you polish error handling you are improving the runs that already report themselves.

The runs that corrupt the metric are the ones with no clean verdict. Timed out. Aborted. Stuck in a transitional status that never resolves. They do not throw anything, because from your code's point of view nothing happened. The process just stopped existing. You cannot try/except a worker whose node died mid-run without ever writing a final status. There is no stack trace for a run that is still, technically, "running." So the bug is not in your handler. The bug is in your denominator.

Three buckets, not two

It helps to borrow a vocabulary that already names this. Apify, the platform our actors run on, documents every actor run as carrying one status from a small fixed set, grouped into three kinds (verified against their docs, link at the end):

Initial: READY, "Started but not allocated to any worker yet."
Transitional: RUNNING, TIMING-OUT, ABORTING. The run is in motion.
Terminal: SUCCEEDED, FAILED, TIMED-OUT, ABORTED. The run is done.

Their docs put it plainly: a run begins in the initial state, progresses through one or more transitional phases, and concludes in one of the terminal states. That is the whole lifecycle. Our own run ledgers, across 2,190 production runs on 32 actors, live entirely inside this vocabulary. The Trustpilot review scraper alone holds 962 runs in that table, and the long ones, the crawls that grind for an hour, are exactly the runs that flirt with the memory ceiling and the timeout. They are the most likely to end up TIMED-OUT or wedged in a transitional state. So the runs a naive rate silently drops are the same runs that were hardest to keep alive. The metric goes blind precisely where the work is hardest.

A naive pass rate uses two of those terminal statuses and throws away the other two terminals plus every transitional run. Three buckets flattened into a yes or no.
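One way to keep the three buckets from collapsing is to write them down in code. The sketch below is my own arrangement of the statuses listed above; the names RunStatus, bucket, and NAIVE_VISIBLE are hypothetical and not part of any Apify client.

```python
from enum import Enum


class RunStatus(Enum):
    """Run statuses as listed above; values keep the hyphenated spelling."""
    READY = "READY"
    RUNNING = "RUNNING"
    TIMING_OUT = "TIMING-OUT"
    ABORTING = "ABORTING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED-OUT"
    ABORTED = "ABORTED"


INITIAL = frozenset({RunStatus.READY})
TRANSITIONAL = frozenset(
    {RunStatus.RUNNING, RunStatus.TIMING_OUT, RunStatus.ABORTING}
)
TERMINAL = frozenset(
    {RunStatus.SUCCEEDED, RunStatus.FAILED, RunStatus.TIMED_OUT, RunStatus.ABORTED}
)

# The only two statuses a succeeded / (succeeded + failed) rate ever sees.
NAIVE_VISIBLE = frozenset({RunStatus.SUCCEEDED, RunStatus.FAILED})


def bucket(status: RunStatus) -> str:
    """Return which of the three kinds a status belongs to."""
    if status in INITIAL:
        return "initial"
    if status in TRANSITIONAL:
        return "transitional"
    return "terminal"


# Statuses that drop out of the naive denominator entirely.
invisible = [s.value for s in RunStatus if s not in NAIVE_VISIBLE]
print(invisible)
```

Six of the eight statuses never reach the naive denominator, which is the flattening described above.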

The arithmetic, on numbers you can run

Here is a tiny script. No imports, no network, no randomness, no clock. A dictionary of run counts and three ways to divide it. The ledger is synthetic, hand-built to isolate the mechanism, not a measurement of any single actor. I will come back to why that distinction matters.

"""
survivorship_success_rate.py - your agent's success rate is measured on the
runs that survived long enough to report a verdict.

A run ledger (an Apify actor's run list, a CI job table) carries one status per
run. Apify documents them as initial (READY), transitional (RUNNING, TIMING-OUT,
ABORTING) and terminal (SUCCEEDED, FAILED, TIMED-OUT, ABORTED). A run "goes
through one or more transitional statuses to one of the terminal statuses".

A dashboard that divides SUCCEEDED by "clean pass/fail" drops every run that
timed out, was aborted, or is still transitional. Those runs leave the
denominator, so they inflate the rate by being invisible.

Counter-take: the fix is NOT better error handling. A FAILED run is honest - it
already sits in your error logs, in the denominator. The runs that wreck the
metric are the ones with no clean verdict (timed out / aborted / still RUNNING).
The one-line fix is to change the denominator from "runs that finished" to
"runs that started". The single most dangerous run is the one stuck in a
transitional status forever: it has no terminal record at all.

This ledger is SYNTHETIC, hand-built to isolate the mechanism. The 90.0 -> 72.0
gap is illustrative of the arithmetic, not a measured rate from any one actor.

Run: python3 -I survivorship_success_rate.py
stdlib only, 0 imports, 0 network / 0 RNG / 0 clock -> identical stdout, always.
"""

# Run counts by status. RUNNING here = a run that began but never reached a
# terminal status: it hung, was OOM-killed mid-stream, or infra dropped it.
LEDGER = {
    "SUCCEEDED": 36,   # terminal
    "FAILED":     4,   # terminal
    "TIMED_OUT":  5,   # terminal
    "ABORTED":    3,   # terminal
    "RUNNING":    2,   # transitional - never resolved
}

PASS_FAIL = ("SUCCEEDED", "FAILED")                       # naive "clean verdict" set
TERMINAL = ("SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED")  # all terminal statuses

succeeded = LEDGER["SUCCEEDED"]
attempts = sum(LEDGER.values())                  # every run that STARTED
passfail_denom = sum(LEDGER[s] for s in PASS_FAIL)
terminal_denom = sum(LEDGER[s] for s in TERMINAL)

naive_rate = round(100 * succeeded / passfail_denom, 1)    # succeeded / clean pass+fail
terminal_rate = round(100 * succeeded / terminal_denom, 1)  # succeeded / all terminals
honest_rate = round(100 * succeeded / attempts, 1)          # succeeded / runs that started
hidden = attempts - passfail_denom

print("=== run ledger (every run that wrote a start record) ===")
for status, n in LEDGER.items():
    kind = "transitional" if status == "RUNNING" else "terminal"
    print(f"  {status:<10} {n:>3}   ({kind})")
print(f"  {'-'*10} {'-'*3}")
print(f"  {'ATTEMPTS':<10} {attempts:>3}")
print()
print(f"=== three denominators, one numerator (succeeded = {succeeded}) ===")
print(f"  NAIVE    succeeded / pass+fail      : {succeeded}/{passfail_denom} = {naive_rate}%")
print(f"  TERMINAL succeeded / all terminals  : {succeeded}/{terminal_denom} = {terminal_rate}%")
print(f"  HONEST   succeeded / runs that began: {succeeded}/{attempts} = {honest_rate}%")
# Pull the breakdown from LEDGER so the message cannot drift from the data.
print(f"  the naive rate hides {hidden} runs ({LEDGER['TIMED_OUT']} timed out, "
      f"{LEDGER['ABORTED']} aborted, {LEDGER['RUNNING']} never resolved)")
print(f"  -> a dashboard reading {naive_rate}% is really running at {honest_rate}%")
print()
print("=== ceiling (where this fix stops) ===")
print("  1. HONEST is still an upper bound: it counts only runs that managed to")
print("     write a start record. Runs killed before their first log line (OOM at")
print(f"     spawn, infra drop) are in NO ledger. True rate <= {honest_rate}%.")
print("  2. SUCCEEDED is trusted as-is. A run that exits 0 but returns empty or")
print("     partial data still counts as a win here. Fixing the denominator does")
print("     not fix the definition of success - that is a separate gate.")
print(f"  3. Synthetic ledger. The {naive_rate} -> {honest_rate} gap shows the arithmetic, not a")
print("     measured rate. Your real gap is whatever your RUNNING column is.")

assert attempts == 50
assert passfail_denom == 40
assert terminal_denom == 48
assert naive_rate == 90.0
assert terminal_rate == 75.0
assert honest_rate == 72.0
assert hidden == 10

Run it:

=== run ledger (every run that wrote a start record) ===
  SUCCEEDED   36   (terminal)
  FAILED       4   (terminal)
  TIMED_OUT    5   (terminal)
  ABORTED      3   (terminal)
  RUNNING      2   (transitional)
  ---------- ---
  ATTEMPTS    50

=== three denominators, one numerator (succeeded = 36) ===
  NAIVE    succeeded / pass+fail      : 36/40 = 90.0%
  TERMINAL succeeded / all terminals  : 36/48 = 75.0%
  HONEST   succeeded / runs that began: 36/50 = 72.0%
  the naive rate hides 10 runs (5 timed out, 3 aborted, 2 never resolved)
  -> a dashboard reading 90.0% is really running at 72.0%

=== ceiling (where this fix stops) ===
  1. HONEST is still an upper bound: it counts only runs that managed to
     write a start record. Runs killed before their first log line (OOM at
     spawn, infra drop) are in NO ledger. True rate <= 72.0%.
  2. SUCCEEDED is trusted as-is. A run that exits 0 but returns empty or
     partial data still counts as a win here. Fixing the denominator does
     not fix the definition of success - that is a separate gate.
  3. Synthetic ledger. The 90.0 -> 72.0 gap shows the arithmetic, not a
     measured rate. Your real gap is whatever your RUNNING column is.

One numerator, succeeded = 36. Three denominators.

NAIVE divides by pass plus fail, 36 over 40, and reports 90.0%. This is the number most dashboards put on the big screen.

TERMINAL divides by all four terminal statuses, 36 over 48, and reports 75.0%. This is what you get the moment you stop pretending the timeouts and aborts did not happen. Fifteen points, gone, just by counting every run that ended badly in any way and not only the ones that raised an error.

HONEST divides by every run that started, 36 over 50, and reports 72.0%. The last three points are the two runs still stuck in RUNNING. They never resolved. They carry no terminal record at all, and they are the ones I would lose sleep over, because a run with no ending is a run nobody is watching.

Eighteen points of spread between the first number and the last. Same successes. Same ledger. The only thing that moved is what I was willing to count.

Where this stops working

I put the limits in the program's own output, because a fix that oversells itself is just a fancier kind of lying metric. Three things this does not do.

First, HONEST is still an upper bound, not the truth. It counts runs that managed to write a start record. A run killed before its first log line (an OOM at spawn, a node that fell off the network) is in no ledger at all. It never got a row. So the real success rate is at most 72.0%, and probably under it. You cannot count what was never written down.

Second, SUCCEEDED is taken on faith. A run that exits zero but returns an empty array or half a dataset still scores as a win in this script. Fixing the denominator does not fix the definition of success. That is a separate gate, and I have written about that other half before: a run can pass and still hand you garbage, like a clean row that was quietly wrong. Counting outcomes honestly and judging whether an outcome was actually good are two different jobs.

Third, the ledger is synthetic. The jump from 90.0% to 72.0% shows you the arithmetic, not a benchmark. Your real gap is whatever share of your runs ends without a clean pass or fail: the TIMED_OUT and ABORTED rows plus everything still RUNNING. If almost nothing ever times out, gets aborted, or hangs, your naive and honest rates sit close together, and good for you. If those columns are fat, your dashboard is off by a margin you have never measured.

This is not the data-quality problem

It would be easy to file this next to the "clean row that was wrong" post above. They are not the same bug. That one is about the value inside a single run: a row that parsed fine and still held junk, a rating of 7 on a five-star site. This one sits a level up. It says nothing about whether any individual run's output is correct. It is about how you count runs across the whole population. A run can succeed with flawless data, and if its neighbor hung in silence, your aggregate rate is still wrong about the fleet.

It is also not the eval problem. When you write a regression gate for an agent's final answer, you are judging the quality of one response against a rubric. Useful, necessary, and orthogonal to this. A success rate counts how runs ended, not what they produced. You can own a flawless eval suite and a success rate that is still inflated by survivorship, because the eval only ever sees the runs that returned something to grade. Same blind spot, one floor up.

What I changed

The actual fix is almost insultingly small. Change the denominator. Count runs that started, not runs that finished. If your run table gets a row the instant a run is created, then the denominator is just that row count, full stop, including everything still marked RUNNING.

One caveat I owe you here. A run that started ninety seconds ago and is still RUNNING is not a failure, it just has not finished. That run is right-censored, not lost, and counting it against you on a live snapshot biases the rate the other way, pessimistically, by lumping healthy in-flight work in with the dead. So the honest denominator is a settled one: count over a window that has already drained, or age-gate the transitional runs with the rule described below (RUNNING for more than three times the median duration). Younger than that threshold, a run is still pending, not a loss. The synthetic ledger above sidesteps this by definition; its two RUNNING rows are the long-dead kind. On a live dashboard you draw that line yourself.
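A settled denominator can be sketched in a few lines. This is an illustration under stated assumptions, not the code we run: the function name settled_rate, the (status, age) tuple shape, the one-hour threshold, and every age below are hypothetical. The counts reuse the synthetic ledger, plus three young in-flight runs.

```python
TERMINAL = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def settled_rate(runs, max_age_s):
    """Success rate over settled runs only.

    runs: iterable of (status, age_seconds) pairs.
    A non-terminal run younger than max_age_s is pending (right-censored)
    and is left out; an older one is counted as a run that never came back.
    Returns (rate_percent_or_None, counted, pending).
    """
    counted = succeeded = pending = 0
    for status, age_s in runs:
        if status not in TERMINAL and age_s < max_age_s:
            pending += 1  # too young to judge either way
            continue
        counted += 1
        if status == "SUCCEEDED":
            succeeded += 1
    rate = round(100 * succeeded / counted, 1) if counted else None
    return rate, counted, pending


# Synthetic ledger from the script, with made-up ages in seconds.
runs = (
    [("SUCCEEDED", 4_000)] * 36
    + [("FAILED", 4_000)] * 4
    + [("TIMED_OUT", 4_000)] * 5
    + [("ABORTED", 4_000)] * 3
    + [("RUNNING", 200_000)] * 2  # long dead, never resolved
    + [("RUNNING", 90)] * 3       # started ninety seconds ago
)

print(settled_rate(runs, max_age_s=3_600))  # age-gated
print(settled_rate(runs, max_age_s=0))      # live snapshot, nothing excused
```

With the gate, the three young runs wait as pending and the rate matches the 72.0% from the ledger. With no gate, they count as losses and the rate drops to 67.9%, which is the pessimistic bias described above.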

Two things I added around it turned out to be worth more than the metric itself.

I started alerting on the age of transitional runs. A run that has been RUNNING for three times its median duration is not running. It is dead and lying about it. That alert caught more real problems than the success rate ever did, because it points straight at the runs the rate was hiding.
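The age check itself is small. Here is a minimal sketch, assuming the median is taken over the durations of runs that already finished and that in-flight runs are available as a mapping from run id to age; the name stale_runs, the run ids, and all the numbers are hypothetical.

```python
from statistics import median


def stale_runs(finished_durations_s, in_flight_ages_s, multiple=3.0):
    """Return ids of in-flight runs older than `multiple` x median duration.

    finished_durations_s: durations (seconds) of runs that reached a
        terminal status; assumed non-empty.
    in_flight_ages_s: dict of run id -> seconds since the run started.
    """
    threshold_s = multiple * median(finished_durations_s)
    return sorted(
        run_id
        for run_id, age_s in in_flight_ages_s.items()
        if age_s > threshold_s
    )


# Made-up data: median finished duration is 60 s, so the threshold is 180 s.
finished = [40, 55, 60, 70, 300]
in_flight = {"run-a": 90, "run-b": 181, "run-c": 7_200}

print(stale_runs(finished, in_flight))
```

Which set of finished runs feeds the median is a design choice this sketch leaves open; a single global median is the cheapest starting point.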

And I put the denominator next to the rate on the dashboard. "94% of 312 terminal" and "94% of 1,040 started" are two very different sentences, and showing both makes the gap impossible to scroll past. When the started count and the terminal count drift apart, that drift is your survivorship tax, written in plain numbers.
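Rendering the pair is a one-liner per label. The sketch below uses the synthetic ledger's counts rather than real dashboard numbers, and the helper name rate_label is hypothetical.

```python
def rate_label(succeeded: int, denominator: int, basis: str) -> str:
    """Format a rate together with the denominator it was computed over."""
    pct = round(100 * succeeded / denominator)
    return f"{pct}% of {denominator:,} {basis}"


# Counts from the synthetic ledger: 36 succeeded, 48 terminal, 50 started.
succeeded, terminal, started = 36, 48, 50

print(rate_label(succeeded, terminal, "terminal"))
print(rate_label(succeeded, started, "started"))
print(f"unresolved: {started - terminal} runs started but never reached a terminal status")
```

Printing the difference between the two counts on its own line keeps the drift visible even when the two percentages happen to round to similar values.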

I am not going to quote you the percentage of our runs that hang, because the honest answer is that for a long stretch I was not measuring it, which is the entire point of this post. The number you cannot see is the number that gets you. Wald armored the engines. Count what started, not what finished.

Written by Aleksei Spinov. I run production scrapers and agents, currently 2,190 runs across 32 actors. The code here is stdlib-only and was run and verified (python3 -I, identical output, asserts green) before publishing; the ledger numbers are synthetic and labelled as such in the script. Drafted with an AI assistant, fact-checked and edited by me.

Follow for the next teardown from the run ledger, one fix at a time. Genuine question for the comments: what is the longest a run has sat in your dashboard still marked RUNNING long after it was actually dead, and what finally made you notice? I read every reply.

Source: Apify Actor run lifecycle statuses.

Top comments (4)

Alice • Jun 29

The 'still RUNNING three hours later' case is the one I'd flag hardest, because it isn't just missing from the denominator — it's often actively doing damage while it's gone: holding a lock, a session, a browser tab, a half-applied state change. A clean failure is recoverable; a silent hang can corrupt the thing you'll need later. So it should count as worse than a failure, not get dropped.

What's helped me is making 'didn't reach a terminal state' impossible to ignore: every started task gets a deadline and must resolve to one of {completed, failed, timed-out, abandoned}. No terminal state by T means auto-classified as failure AND its held resources get reclaimed. You don't just count timeouts, you enforce them — so a hang can't quietly persist as a phantom 'success in progress.'

There's also a twin of this on the numerator side: a 'clean pass' is itself a survivor if the agent declared success without verifying the outcome. An agent that says 'done' but never checked the result is a win that shouldn't count. Honest reliability needs both — count every start (honest denominator) and verify the claimed wins (honest numerator).

Alex Spinov • Jul 1

This is the case that turned it from a metrics fix into an ops problem for us. A run stuck in RUNNING isn't just uncounted. It's usually still holding a worker slot and a proxy lease, sometimes a DB connection, and billing the whole time. We ended up alerting on "RUNNING > 3x median duration" as its own signal, separate from the success-rate math, because by the time it settles into the denominator the damage is already done. Do you auto-kill those or page a human? We went back and forth on that one.

Alice • Jul 1

That RUNNING > 3x median alarm as its own signal is the right instinct — the leak (worker slot, proxy lease, a DB conn, and the meter running the whole time) is the real damage, and it's orthogonal to the success-rate math. The one place a global-median threshold bit me: agent task durations are heavy-tailed — a few legitimately run 10x median, so a fixed multiple either pages on the long-but-healthy ones or sleeps through a task that hung at 2x. What's worked better for me is a deadline the task DECLARES at start (from its own class / expected work), so stuck is judged against what THIS run should take, not the fleet median — and that same deadline is what triggers resource reclamation, so the alert and the cleanup fire off one commitment. Median is a great cheap default though; the per-task deadline is just the next rung when the tail gets wide.

Alex Spinov • Jul 1

Yeah, the heavy tail is what broke the global multiple for us too. A full Trustpilot crawl legitimately runs an order of magnitude longer than a single-page fetch, so one fleet-wide "3x median" either pages on every big crawl or sleeps through a small job that wedged.

Two things helped, and I think they sit right next to your declared deadline. First, we moved the baseline per class — bucket by actor/job type, compute the median inside the bucket, so a big crawl is judged against other big crawls, not the fleet. That's the cheap version of what you're describing; the class carries the expectation, which was easier to retrofit onto jobs that don't know their own size up front.

But even a per-class deadline sleeps through the hang-at-2x when 2x is still inside the class budget. What actually caught those was watching progress instead of total time: a stuck run stops emitting — no new log lines, zero new rows for N minutes — so it fires even on a legitimately-long job that's only halfway done. Cheap to wire if you're already streaming rows or heartbeating.

No single clean threshold, really. Per-class deadline kills the false pages on the long-but-healthy runs; the no-progress watchdog catches the ones that wedge early inside budget. We run both — they fail differently.