Most async APIs commit to one thing: starting your job. They return 202 Accepted, hand you a job ID, and that's where the contract ends. The rest is your problem.
I do something different. I make one promise:
When your job is done, I'll tell you accurately. Until then, I'll keep retrying.
That's the entire contract for everything I've ever shipped. It sounds small. In practice, it's the only thing I actually do.
The shape every job in my system shares
You hand me work.
You wait.
I retry as hard as I can.
I report when it's done.
That's it. Whether the job is OCR on a scanned PDF, structured extraction from a long document, or refining the translation of an XLIFF file — the shape is identical. You give me an input. You don't watch the screen. I come back when I have something honest to report.
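If you want that shape written down, here's roughly what it looks like as an interface. The names below (JobStatus, JobHandle, submit, poll) are stand-ins invented for this post, not the surface of any service I actually ship.

// Illustrative only. These names are made up for the post; the real services
// don't share code, just this contract.
enum JobStatus { RUNNING, SUCCEEDED, FAILED }

record JobHandle(String jobId) {}

record JobReport(JobStatus status, String detail) {}

interface JobContract {
    // You hand me work. You get an ID back. The retrying is my problem now.
    JobHandle submit(byte[] input);

    // You ask how it's going. The answer stays RUNNING until I have something
    // honest to report: a real result, or a failure I actually believe.
    JobReport poll(JobHandle handle);
}

The only property worth anything in there is that a terminal status means what it says.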
This sounds obvious until you try to actually deliver it.
Why "started" is easier than "finished"
Returning 202 Accepted is easy. The hard part starts right after that.
Real jobs hit things like:
- Vendor APIs that occasionally throw 503. No reason. Just sometimes.
- Native binaries that core dump. Twice in a row, then fine for a week.
- Subprocesses that go zombie. Not crashed. Not finished. Just defunct. The kernel keeps the entry around until someone reaps it.
- Disks that fill up with stale debug files because something somewhere wrote them and forgot.
If you ship "started, here's a job ID, good luck" and call that an API, you're outsourcing all of the above to your user.
I'm not willing to do that. So I take the work back inside.
What that looks like in code
I'm not going to name any vendor. They don't matter. What matters is the shape. The code below is a simplified sketch — the production version handles a lot more (PDF library version quirks, fallback engines when the first one rejects the input, demo-mode page limits, and a long list of vendor-specific error codes that mean "retry," "skip," or "stop"). The shape is what survives.
Here's a sketch of the inside of one of my conversion services:
public JobResult runJob(Input input) throws Exception {
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        Process child = new ProcessBuilder(
                "java", "-cp", classpath, EngineMain.class.getName())
                .redirectErrorStream(true)
                .start();
        passInputToStdin(child, input);

        long started = System.currentTimeMillis();
        while (child.isAlive()) {
            if (System.currentTimeMillis() - started > MAX_RUNTIME_MS) {
                child.destroyForcibly();
            }
            if (isDefunct(child)) {
                reap(child);
                break;
            }
            sweepStaleCoreFiles(workDir, MAX_CORE_AGE_MS);
            Thread.sleep(POLL_INTERVAL_MS);
        }

        ChildOutcome outcome = readOutcome(child);
        if (outcome.isTransientError()) continue; // retry
        if (outcome.isIrrelevantError()) {
            log.info("irrelevant error, treating as success: {}", outcome);
            return outcome.toSuccessResult();
        }
        if (outcome.hasResult()) return outcome.toResult();
    }
    return JobResult.failedAfterRetries(MAX_RETRIES);
}
A few things in there are worth pointing at.
new ProcessBuilder("java", ..., EngineMain.class.getName()). Not "call a library function." Not "use the SDK." I literally re-enter main from another process. The reason is that the underlying engine, in its native form, is unreliable enough that I want process-level isolation. If it dies, only the child dies.
if (isDefunct(child)) { reap(child); break; }. Native binaries don't always exit cleanly. Sometimes they're not crashed and not running — they're stuck. The parent has to notice, decide, and clean up.
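How does the parent notice? One way, on Linux, is to read the child's state straight out of /proc, where a zombie shows up as state Z. This is a sketch under that assumption; it doesn't work anywhere without a /proc filesystem.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Field 3 of /proc/<pid>/stat is the process state; "Z" means defunct.
static boolean isDefunct(Process child) {
    try {
        String stat = Files.readString(Path.of("/proc/" + child.pid() + "/stat"));
        // Format is: pid (comm) state ...; comm can contain spaces,
        // so take the first token after the closing parenthesis.
        String state = stat.substring(stat.lastIndexOf(')') + 1).trim().split("\\s+")[0];
        return "Z".equals(state);
    } catch (IOException e) {
        return false; // no /proc entry: the child is already fully reaped
    }
}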
sweepStaleCoreFiles(workDir, MAX_CORE_AGE_MS). When a child crashes hard, the OS dumps a core file. That file is huge. If you don't sweep it, the disk fills up. There is no clever solution here. You sweep.
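The sweep itself is as dumb as it sounds. Something roughly like this, assuming workDir is a java.nio.file.Path and the dumps follow the usual core / core.<pid> naming:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Delete anything that looks like a core dump and is older than the cutoff.
static void sweepStaleCoreFiles(Path workDir, long maxAgeMs) throws IOException {
    long cutoff = System.currentTimeMillis() - maxAgeMs;
    try (DirectoryStream<Path> files = Files.newDirectoryStream(workDir, "core*")) {
        for (Path file : files) {
            if (Files.getLastModifiedTime(file).toMillis() < cutoff) {
                Files.deleteIfExists(file);
            }
        }
    }
}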
outcome.isTransientError() → continue. Some vendor errors come and go. The fix is to wait and try again. If you don't try again, your user sees failed. If you do try again, your user sees "took a bit longer." I pick the second one.
outcome.isIrrelevantError() → log and return success. This is the part that surprises people. Some errors aren't actually errors for the use case. They're noise the engine emits. Knowing which is which takes years, and is most of the actual product.
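What does that classification look like in code? A lookup table, earned one bad afternoon at a time. The codes and strings below are invented, because the real ones are vendor-specific and I'm not naming vendors; only the shape is real.

// Invented examples. The real lists are long, vendor-specific, and the actual product.
enum OutcomeKind { OK, TRANSIENT, IRRELEVANT, FATAL }

static OutcomeKind classify(int exitCode, String stderr) {
    if (exitCode == 0) return OutcomeKind.OK;
    // Comes and goes on its own: wait and try the whole child again.
    if (stderr.contains("503") || stderr.contains("connection reset")) {
        return OutcomeKind.TRANSIENT;
    }
    // Noise that doesn't affect this use case: log it, keep the result.
    if (stderr.contains("unknown optional metadata")) {
        return OutcomeKind.IRRELEVANT;
    }
    return OutcomeKind.FATAL;
}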
None of this is elegant. None of it shows up in an architecture diagram. It all lives in the gap between "the job was submitted" and "the job is done, here's the result."
That gap is what I do.
What I gave up
I don't promise low latency. I can't. The thing I'm waiting on isn't predictable.
I don't promise the job will always succeed. Sometimes the input is genuinely broken. Then I report that, accurately, instead of pretending.
I don't promise streaming partial results. I keep the user out of the loop until I have something stable to hand back. The cost is they wait. The benefit is they don't see noise.
These trade-offs aren't sophisticated. They're just consistent.
I didn't design this. It survived.
Looking back, this is how every job-shaped API I've ever built has worked. I didn't sit down one day and decide on a contract. I kept ending up here.
Each time I tried to ship something where the API said started and stopped caring, the user came back asking what happened. So I started caring. Each time I tried to surface every transient error to the user, the user got scared. So I started absorbing them. Each time I tried to make jobs faster by skipping the cleanup, the disks filled up. So I started sweeping.
After enough years of this, what's left is a single rule:
When the job is done, I'll tell you accurately. Until then, I'll keep retrying.
Whether that's the right contract for your system, I genuinely don't know. It's just the only one I've found that survives.
Earlier in this series: The Accordion Pattern: Why I stopped writing one fat LLM prompt