A support ticket came in last month with the subject line "the plan generator is broken." It was not, in fact, broken. The Celery task was running. The downstream service had accepted the job. The database row was sitting there with generation_status = 'in_progress' exactly as designed. From the server's point of view, the system was healthy.
From the user's point of view, they had clicked a button fifteen minutes ago and nothing had happened since.
I run Codens, a small AI dev harness, mostly solo. We have a product called Green Codens that turns Product Requirements Documents into actionable dev plans. The plan generation is a long-running AI job. It can take 30 seconds for a tiny repo or 30+ minutes for a sprawling one. We had built two completion paths: a webhook for the happy case and a polling fallback for when the webhook missed. The webhook had silently failed during a deploy. The polling fallback was scheduled to make its first call fifteen minutes after submission.
We changed two numbers. The same workflow now feels roughly fifteen times faster. Total compute is basically unchanged. This post is about why those two numbers mattered so much, and what they imply about designing async UX in AI products in general.
What was actually happening
Green Codens does the PRD authoring side. A separate service we call Purple Codens does the heavier lifting: cloning the repo, reading code, running an analysis agent, producing a structured task list. When a user converts a PRD into a dev plan, Green submits an analyze job to Purple, gets a 202 Accepted and a job id back, and then has to wait for the result.
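To make the flow concrete, here is a minimal sketch of that submit step. The endpoint path, payload fields, and helper name are illustrative, not Purple's real API:

```python
import httpx

def submit_analyze_job(repo_url: str, prd_id: str) -> str:
    """Submit an analysis job to Purple and return its job id.

    Hypothetical endpoint and payload shape, shown only to illustrate
    the 202-then-wait handshake described above.
    """
    response = httpx.post(
        "https://purple.internal/api/analyze-jobs",  # illustrative URL
        json={"repo_url": repo_url, "prd_id": prd_id},
        timeout=30,
    )
    response.raise_for_status()        # Purple answers 202 Accepted
    return response.json()["job_id"]   # illustrative response field
```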
There are two completion paths.
The first is a webhook, which is just the server-to-server "I'm done" callback. When Purple finishes, it POSTs the result back to Green with a signature, and Green applies it to the plan row. This is the happy path and it usually works.
The second path is a polling fallback. Webhooks miss for boring reasons. A receiver might be mid-deploy and bouncing 503s for thirty seconds. A signing key rotation might leave one side temporarily unable to verify the other. A network blip might drop the request and the sender's retry policy might give up before the receiver is back. None of these are exotic. All of them happen in real production systems. So Green also runs a Celery task that wakes up periodically, asks Purple "hey, what's the status of job X?", and applies the result if the job is done.
The polling task is idempotent. If the webhook already applied the result, the polling task sees generation_status = 'completed' and is a no-op. If the webhook missed, the polling task is the safety net that catches the dropped result.
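Both paths converge on the same apply step. A minimal sketch of that guard, assuming a DB-API connection and illustrative table and column names:

```python
import json

def apply_analyze_result(conn, plan_id: str, result: dict) -> bool:
    """Apply Purple's result to the plan row exactly once.

    Called by both the webhook handler and the polling fallback; whichever
    runs second matches zero rows and does nothing. Table and column names
    are illustrative stand-ins for our real schema.
    """
    cur = conn.cursor()
    # The WHERE clause is the idempotency guard: only an in-progress plan
    # can transition to completed.
    cur.execute(
        """
        UPDATE dev_plans
           SET generation_status = 'completed', result = ?
         WHERE id = ? AND generation_status = 'in_progress'
        """,
        (json.dumps(result), plan_id),
    )
    conn.commit()
    return cur.rowcount == 1
```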
Here is what the original schedule looked like:
```python
# Original (bad)
_INITIAL_COUNTDOWN = 900  # wait 15 minutes before first poll
_RETRY_COUNTDOWN = 300    # then poll every 5 minutes, up to 12 times
```
Total polling window: 15 + 12 × 5 = 75 minutes. The reasoning was server-side and superficially sensible. Most analyses on real customer repos finish somewhere in the 10 to 20 minute range. Polling earlier than 15 minutes "wastes" API calls on jobs that are obviously still running. Polite. Considerate. Reasonable in isolation.
The problem was that the user does not live in the server's frame of reference. The user clicks the button, sees a "your plan is being analyzed..." spinner, and then the front-end is silent. If the webhook fires, great, the spinner becomes a result. If the webhook does not fire, the user sits with that spinner for a full fifteen minutes before any other code path even tries to discover the truth. They reload the page. They check the network tab. They contact us. By the time the polling fallback fires its first request, the user has already decided we are broken.
The retry-design trap
When you reach for retry logic in any system, the default mental model most engineers grab is "start short, double each time, give up at some bound." If you have ever written time.sleep(2 ** attempt) you have used it. It is taught early, it appears in HTTP client libraries, it ships in AWS SDKs by default. It is the right answer to a real problem.
But it is the right answer to a specific problem: you are calling something that is probably failing, and you do not want to hammer it while it is on fire. Each retry is a fresh attempt at the same operation. You assume the remote side might be temporarily unable to serve you, you give it space to recover, and you increase the wait between attempts so that if the outage is long, you are not piling on. The pattern protects the server from you.
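If you have not seen it spelled out in a while, that classic shape looks roughly like this:

```python
import random
import time

def call_with_backoff(make_request, max_attempts: int = 5):
    """Classic exponential backoff: retry a call that is probably failing,
    widening the gap each time so the struggling server gets room to recover."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # 2, 4, 8, 16... seconds, with jitter so callers do not sync up.
            time.sleep(2 ** (attempt + 1) + random.random())
```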
The polling fallback in Green is doing something different. The job we are checking on is, in the overwhelming majority of cases, completely healthy. It started running a few minutes ago. It is going to finish on its own. The only reason we are polling at all is to catch the rare case where Purple finished, told us about it, and the message did not get through. We are not retrying a failing call. We are scanning for a missed event.
Once you frame it that way, the standard retry shape becomes obviously wrong. Starting short and lengthening makes sense when "short" means "give the failing thing a moment to recover." That is not what we are doing. We are saying "did the message arrive yet?" There is no recovery happening on the other side, because the other side is fine. Waiting longer between checks does not help anyone. It just delays the moment we notice the missed message.
If you stay with the standard shape and just shorten the initial wait, you end up over-polling at the tail. A job that legitimately takes 35 minutes does not need someone tapping it on the shoulder every 30 seconds for the back half of its run. That actually does spend API calls and Celery worker capacity for no information gain.
The shape we wanted was something the standard pattern does not provide a good vocabulary for. Aggressive at the start. Calmer at the end. Inverted from the usual instinct. Every framing I tried for it (front-loaded, decaying, head-heavy) sounded jargony and made the actual idea harder to talk about than it deserved. So I will skip the label entirely and just describe the shape.
We want the first poll within roughly a minute of submitting the job, because the cost of a missed webhook is measured in the user's emotional clock. We want a tight cluster of polls in the first five minutes, because that is the window in which essentially every kind of webhook failure manifests. Then we want to space out, because once you are ten minutes into a healthy job, the user has already accepted that this is going to take a while, and quick polling buys nothing.
The numbers after the change
Here is the new schedule, lifted from poll_purple_analyze_job.py:
```python
# Polling window = 60s initial + sum(_RETRY_BACKOFFS) ≈ 73 min total.
# Front-loaded so a missed webhook is noticed within ~2 minutes.
_RETRY_BACKOFFS = [60, 60, 120, 240, 480, 480, 480, 480, 480, 480, 480, 480]
_MAX_RETRIES = len(_RETRY_BACKOFFS)
```
The submitting task schedules the first poll with countdown=60 instead of countdown=900. Each retry uses the next entry in the array as its countdown. Once the array is exhausted, the task gives up and marks the plan as failed so the UI can exit the loading state.
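The task body is shaped roughly like this. It is a simplified sketch rather than the file verbatim; fetch_purple_job, get_db_connection, and the two helpers it hands off to are illustrative stand-ins for Green's internal code:

```python
from celery import shared_task

_RETRY_BACKOFFS = [60, 60, 120, 240, 480, 480, 480, 480, 480, 480, 480, 480]
_MAX_RETRIES = len(_RETRY_BACKOFFS)

@shared_task
def poll_purple_analyze_job(plan_id, analyze_job_id, organization_id, project_id, retry_count):
    # Illustrative helpers: ask Purple for the job status, get a DB handle.
    status, result = fetch_purple_job(analyze_job_id, organization_id)
    conn = get_db_connection()

    if status == "completed":
        apply_analyze_result(conn, plan_id, result)  # idempotent, sketched earlier
        return
    if status == "failed" or retry_count >= _MAX_RETRIES:
        mark_plan_failed(conn, plan_id)              # sketched further down
        return

    # Job is still running: reschedule this same task, using the next
    # entry in the array as the countdown.
    poll_purple_analyze_job.apply_async(
        kwargs={
            "plan_id": plan_id,
            "analyze_job_id": analyze_job_id,
            "organization_id": organization_id,
            "project_id": project_id,
            "retry_count": retry_count + 1,
        },
        countdown=_RETRY_BACKOFFS[retry_count],
    )
```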
The total budget is almost identical to the old design. Old: 15 + 12 × 5 = 75 minutes. New: 1 + (1 + 1 + 2 + 4 + 8 × 8) = 73 minutes, counting the initial one-minute countdown. Both cover the long tail of legitimately long analyses with room to spare. Both stop somewhere around the 70-minute mark, which is where we have decided that further waiting is not actually going to produce a useful result and the right move is to surface the failure and let the user retry from the PRD page.
What changed is the distribution of those minutes.
| Metric | Before | After |
|---|---|---|
| Time to first poll | 15 min | 1 min |
| Worst-case missed-webhook detection | 15 min | 2 min |
| Polls in the first 5 minutes | 0 | 4 |
| Polls in the first 10 minutes | 0 | 5 |
| Total polling budget | 75 min | 73 min |
| Polling interval at the long tail | 5 min | 8 min |
The most important row in that table is the second one. Worst case detection went from fifteen minutes to two. That is a roughly 7.5× improvement in the time it takes the system to notice that a webhook went missing. For users who hit this path, that translates directly into how long they sit watching nothing happen.
Why is two minutes the right ceiling for missed-webhook detection? It comes from looking at how webhook failures actually present in our environment. Configuration errors and signature mismatches surface on the very first request, because the verification step is deterministic and the same key is used every time. Network blips, deploy bounces, and 5xx storms are short-lived. We have never seen a webhook failure pattern in production that took more than a couple of minutes to show up. So if we have not heard back within the first five-ish minutes of polling, the failure is one of the loud, immediate kinds, and it is already in our logs. If the webhook does eventually arrive late, the polling task is idempotent and skips out as soon as it sees the plan resolved.
Conversely, the long tail is where polite polling actually pays off. Once a job has been running for ten minutes and is still in_progress, you are probably looking at one of the genuinely slow analyses. Polling it every 30 seconds does nothing useful and just clutters logs. Eight-minute intervals at the tail give the job room to finish on its own while still checking in occasionally.
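If you want to see that distribution concretely, the cumulative clock times fall straight out of the array:

```python
from itertools import accumulate

_RETRY_BACKOFFS = [60, 60, 120, 240, 480, 480, 480, 480, 480, 480, 480, 480]

# The first poll fires 60 seconds after submission; each retry adds the
# next entry, so accumulating gives every poll's position on the clock.
poll_times_min = [t / 60 for t in accumulate([60] + _RETRY_BACKOFFS)]
print(poll_times_min)
# [1.0, 2.0, 3.0, 5.0, 9.0, 17.0, 25.0, 33.0, 41.0, 49.0, 57.0, 65.0, 73.0]
```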
The dispatch in the submitting task is a single apply_async call:
```python
poll_purple_analyze_job.apply_async(
    kwargs={
        "plan_id": str(plan_id),
        "analyze_job_id": analyze_job_id,
        "organization_id": str(purple_org_id),
        "project_id": str(project_id),
        "retry_count": 0,
    },
    countdown=60,  # was 900
)
```
That single change, from 900 to 60, is most of the user-facing improvement. The array reshape is what protects the server from the consequences.
The deeper lesson
The thing I keep coming back to after this change is how much of "this product feels good" turns out to be set in the first sixty to ninety seconds of any long async operation.
A user clicking "generate plan" is making a small bet. They believe, tentatively, that this is going to work. They are willing to wait. But they need the system to keep that belief warm, and the way you keep it warm is by giving them a sign of life early. It can be a progress bar that moves. It can be a status string that updates. It can be, in our case, a backend that quickly notices when something has gone wrong and surfaces the truth instead of letting the spinner spin.
What the system absolutely cannot do is stay silent for fifteen minutes. By minute three the user has already started constructing a story about what is broken. By minute five they are looking for a way to cancel. By minute ten they have moved on and the next time they come back they will arrive expecting failure. Even if the webhook eventually fires at minute twelve and everything works, the experience has been spent.
The original 15-minute initial wait was reasoning about the wrong thing. It was optimizing the API call profile against the modal completion time of the underlying job. That is a real number and it is a real consideration, but it is not the constraint that should drive the polling cadence. The constraint that should drive the polling cadence is "how long can the user sit in front of a silent screen before they conclude we are broken." For our users, that number is somewhere between 60 and 90 seconds. Past that, you are losing them.
This generalizes. Any time you have a long-running async AI task, somewhere in the system there is a piece of code that decides how often the rest of the system asks "is it done yet." That code is a UX decision, not a backend decision. Treat it that way.
The framing I now use when reviewing this kind of code is to separate two distinct questions and answer them separately. Question one: how quickly do we need to detect that the happy path failed? That governs the early polling cadence. The answer is almost always "faster than you think," because the happy path failing silently is the worst experience the system can produce. Question two: how patiently can we wait for the work to finish on its own? That governs the late polling cadence. The answer is usually "more patiently than you think," because once the user has accepted the wait, polling more often does not buy anyone anything.
Server politeness is a real cost, and I do not want to pretend otherwise. Hammering an internal API every five seconds for an hour wastes capacity and clutters dashboards. But you weigh it against the perception cost. For a small B2B SaaS like ours, a single user concluding the product is broken and ghosting is far more expensive than any conceivable amount of well-bounded internal polling traffic. We are on a private API to our own service. The economics are not even close.
We added a single line to our internal design checklist as a result of this work: "First poll inside 60 seconds." When we review any new long-running async flow, that line gets checked. If we are scheduling the first liveness check more than a minute after submission, we have to justify it explicitly, in writing, against the user-perception cost. So far we have not had a single case that survived that justification.
What else got fixed along the way
A couple of things came along for the ride in the same PR, because once you start looking at one polling task you tend to notice the things around it.
The polling task now has an explicit "give up" path that marks the plan as failed when the retry array is exhausted. The original code logged a warning and exited. The plan row stayed in in_progress forever, which meant the UI loading state never resolved and the user could not even retry generation, because the front-end refused to start a new job while the previous one was supposedly still running. The fix is small but important: when retries hit the wall, write an explanatory error message to the plan, mark it failed, and publish a status-change event so the UI exits the spinner. The error message tells the user how long we waited and suggests retrying from the PRD detail page. It is also idempotent, so if the webhook arrives late and resolves the plan as completed, the give-up path sees generation_status is no longer in_progress and does nothing.
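The give-up path, sketched with the same illustrative schema as before (the real task publishes through our internal event machinery; publish_status_change stands in for that):

```python
def mark_plan_failed(conn, plan_id: str, waited_minutes: int = 73) -> None:
    """Terminal path once the retry array is exhausted.

    The WHERE clause keeps it idempotent: if a late webhook already resolved
    the plan, zero rows match and nothing is written or published.
    Table, column, and helper names are illustrative.
    """
    message = (
        f"Analysis did not complete within {waited_minutes} minutes. "
        "Please retry generation from the PRD detail page."
    )
    cur = conn.cursor()
    cur.execute(
        """
        UPDATE dev_plans
           SET generation_status = 'failed', generation_error = ?
         WHERE id = ? AND generation_status = 'in_progress'
        """,
        (message, plan_id),
    )
    conn.commit()
    if cur.rowcount == 1:
        publish_status_change(plan_id, "failed")  # illustrative event-bus call
```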
We also added an admin recovery endpoint for the case where a plan does get stuck in some unexpected state, usually because of a bug we have not seen yet. It manually transitions a plan back to a state where the user can retry. This sits in our admin tools and is not user-facing, but it has been useful exactly twice in the month since we shipped it, both for cases that taught us about new failure modes we then fixed properly. Operational tools earn their keep.
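The recovery logic itself is little more than a guarded state transition; a sketch along these lines, again with illustrative names:

```python
def admin_reset_stuck_plan(conn, plan_id: str) -> bool:
    """Admin-only: move a plan stuck in an unexpected state back to 'failed',
    which is the state the UI already knows how to retry from.

    Deliberately narrow so it cannot clobber a healthy or completed row.
    Table and column names are illustrative.
    """
    cur = conn.cursor()
    cur.execute(
        """
        UPDATE dev_plans
           SET generation_status = 'failed',
               generation_error = 'Reset by an administrator. Please retry generation.'
         WHERE id = ? AND generation_status NOT IN ('completed', 'failed')
        """,
        (plan_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```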
Neither of these changes was the headline of the PR. They were both downstream consequences of taking the polling task seriously enough to read it line by line. That is its own lesson. Polling tasks tend to be the bit of code nobody reads. They are scheduled once when the feature is built and then they quietly run forever. The next time you find yourself in a polling fallback that nobody has touched in months, it is worth half an hour of your time to read the whole thing and ask whether the cadence still matches what users actually need.
Wrap
The principle, in one sentence: poll the way the user feels the product, not the way the server feels the load. Almost everything else falls out of that.
If you want to see what the rest of Codens looks like, the English landing page is at https://www.codens.ai/en/ and our help docs (which include a lot more about how Green and Purple talk to each other) live at https://help.codens.ai/en/. The polling task discussed in this post lives in the open part of our backend; if you happen to spot a different case where this same trade-off applies, I would genuinely like to hear about it.