George Kioko

What I shipped after the $540 silent churn postmortem

Yesterday I posted a postmortem on losing $540 a month to silent user churn. Some folks asked what the actual fix was. This is that post. Less drama, more code, three concrete patches that went live today.

If you missed yesterday: https://theaientrepreneur.hashnode.dev/two-agency-users-were-83-of-my-revenue-they-left-and-i-noticed-29-days-later

When I started digging into why my LinkedIn employee scraper was bleeding compute on real user runs, I found it was not one bug. It was three, layered.

Bug one: push first, charge second

Every Apify pay per event tutorial shows you Actor.charge({ eventName: 'event-name', count: 1 }). Easy. What none of them stress is what happens when the charge call fails.

There are at least three live failure modes:

  1. The user set maxTotalChargeUsd on the run. They hit it. Charge returns chargedCount: 0.
  2. Apify itself returns eventChargeLimitReached: true mid run.
  3. The platform throws a transient error your try/catch swallows.

My actor's loop was structured like this:

await Actor.pushData(record);
try {
    await Actor.charge({ eventName, count: 1 });
} catch (e) {
    log.warning(`charge failed: ${e.message}`);
}
// the loop keeps going regardless

Push first, charge second, swallow errors, keep looping. So if charge stopped working halfway through a 100 profile run, the actor cheerfully output the remaining 50 for free while still spending real proxy and SERP money. The user got 100 profiles. I got billed for 50.

That is exactly the kind of leak you only notice when you stare at a per run cost graph and wonder why your revenue line is growing slower than your cost line.

The fix: a charge gate that fails closed

The new code calls a small helper before every emit. If charge fails for any reason, the gate refuses every subsequent call without even trying.

export function createProfileChargeGate({ isPPE, eventName, actorCharge, logger, stats }) {
    let chargeLimitReached = false;

    return {
        hasChargeLimitReached: () => chargeLimitReached,
        async chargeForNextProfile() {
            if (!isPPE) return { canEmit: true, charged: false, reason: 'not-ppe' };
            if (chargeLimitReached) return { canEmit: false, charged: false, reason: 'charge-limit-reached' };

            try {
                const result = await actorCharge({ eventName, count: 1 });
                if (result?.eventChargeLimitReached) {
                    chargeLimitReached = true;
                    return { canEmit: false, charged: false, reason: 'charge-limit-reached', result };
                }
                const charged = Number(result?.chargedCount || 0);
                if (charged <= 0) {
                    chargeLimitReached = true;
                    return { canEmit: false, charged: false, reason: 'not-charged', result };
                }
                stats.totalCharges = (stats.totalCharges || 0) + charged;
                return { canEmit: true, charged: true, reason: 'charged', result };
            } catch (error) {
                chargeLimitReached = true;
                logger?.warning(`charge failed, closing gate: ${error.message}`);
                return { canEmit: false, charged: false, reason: 'charge-error', error };
            }
        },
    };
}

The main loop now does:

const gate = createProfileChargeGate({ isPPE, eventName, actorCharge: Actor.charge.bind(Actor), logger: log, stats }); // bind so charge keeps its Actor context

for (const profile of profiles) {
    const verdict = await gate.chargeForNextProfile();
    if (!verdict.canEmit) {
        log.info(`Stopping at ${stats.totalCharges} profiles, gate refused: ${verdict.reason}`);
        break;
    }
    await Actor.pushData(profile);
}

Once the gate has refused, every call short circuits without trying to charge again. The run wraps up gracefully instead of bleeding compute on uncharged output.
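To sanity check the fail-closed behavior without touching the platform, here is a stripped-down simulation. createGate is a condensed stand-in for createProfileChargeGate above, and stubCharge is a hypothetical charge function that succeeds twice and then reports the event limit, mimicking failure mode two.

```javascript
// Condensed stand-in for the charge gate: once any charge attempt
// fails, every later call is refused without trying again.
function createGate(actorCharge) {
    let closed = false;
    return async () => {
        if (closed) return { canEmit: false, reason: 'charge-limit-reached' };
        try {
            const result = await actorCharge();
            if (result?.eventChargeLimitReached || !(Number(result?.chargedCount) > 0)) {
                closed = true;
                return { canEmit: false, reason: 'not-charged' };
            }
            return { canEmit: true, reason: 'charged' };
        } catch {
            closed = true;
            return { canEmit: false, reason: 'charge-error' };
        }
    };
}

// Stub platform: accepts two charges, then hits the event limit.
let calls = 0;
const stubCharge = async () => (++calls <= 2
    ? { chargedCount: 1 }
    : { eventChargeLimitReached: true, chargedCount: 0 });

async function main() {
    const gate = createGate(stubCharge);
    const emitted = [];
    for (const profile of ['a', 'b', 'c', 'd', 'e']) {
        const verdict = await gate();
        if (!verdict.canEmit) break; // fail closed: no free output
        emitted.push(profile);
    }
    return { emitted, chargeAttempts: calls };
}

const resultPromise = main();
resultPromise.then(({ emitted, chargeAttempts }) => {
    console.log(emitted, chargeAttempts); // ['a','b'] after exactly 3 charge attempts
});
```

The key property: the third charge attempt fails, so profiles c, d, and e are never emitted, and the stub is never called a fourth time.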

Bug two: jobs that should never start

The second class of bug is the job that should not have run at all. A user sets companyCount 200, targetTitles 30, maxEmployees 1, hits Run, and watches my actor burn proxy and verification cost while emitting almost nothing.

The math is simple. Per company you pay for roughly basePages + targetTitleCount SERP requests at about $0.0025 each, plus a verification attempt budget at about $0.0004 per attempt. On the revenue side you collect actorStartPriceUsd once per run plus shortProfilePriceUsd per profile emitted, and Apify takes a 20% platform share off the top.

So a preflight estimator can compute estimatedPlatformCostUsd and estimatedCreatorRevenueUsd before any compute happens.

export function buildMarginPreflightEstimate({
    companyCount, targetTitleCount, maxEmployees, verifyEnabled,
    actorStartPriceUsd, shortProfilePriceUsd, creatorRevenueShare,
    serpCostUsd, verificationAttemptCostUsd, maxCostToCreatorRevenueRatio = 0.75,
}) {
    const basePages = getBaseSerpPagesPerCompany({ maxEmployees, verifyEnabled });
    const serpRequests = companyCount * (basePages + targetTitleCount);
    const verificationAttempts = verifyEnabled
        ? companyCount * getInitialVerificationCandidateLimit({ companyCount, maxEmployees })
        : 0;
    const estimatedProfiles = companyCount * maxEmployees;

    const estimatedPlatformCostUsd = (serpRequests * serpCostUsd) + (verificationAttempts * verificationAttemptCostUsd);
    const estimatedCreatorRevenueUsd = (actorStartPriceUsd + estimatedProfiles * shortProfilePriceUsd) * creatorRevenueShare;

    const ratio = estimatedCreatorRevenueUsd > 0
        ? estimatedPlatformCostUsd / estimatedCreatorRevenueUsd
        : Infinity;

    const exceedsMarginBudget = ratio > maxCostToCreatorRevenueRatio;
    return {
        estimatedPlatformCostUsd, estimatedCreatorRevenueUsd, ratio,
        exceedsMarginBudget,
        warning: exceedsMarginBudget
            ? `Not profitable enough for verified mode. Reduce companies, reduce targetTitles, or increase maxEmployees per company.`
            : '',
    };
}

In main, before any real work:

if (estimate.exceedsMarginBudget) {
    throw new Error(`Input rejected before run start: ${estimate.warning}. No PPE events will be charged.`);
}

The throw happens before Actor.charge has been called once. The user gets a clear refusal at submit time and pays nothing. They can resubmit with parameters that actually make sense.

The estimator tests confirm it accepts normal small runs (2 companies, 25 employees each, verified) and rejects the title-heavy 1-employee runs that were the worst offenders.
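To get a feel for why those inputs land on opposite sides of the threshold, here is the same arithmetic inlined with illustrative constants. basePages, candidateLimit, and every price below are assumptions for the demo, not my real configuration.

```javascript
// Same preflight math as the estimator, inlined with assumed constants.
function estimateRatio({ companyCount, targetTitleCount, maxEmployees }) {
    const basePages = 2;                  // assumed SERP pages per company
    const candidateLimit = 5;             // assumed verification candidates per company
    const serpCostUsd = 0.0025;
    const verificationAttemptCostUsd = 0.0004;
    const actorStartPriceUsd = 0.05;      // assumed PPE prices
    const shortProfilePriceUsd = 0.01;
    const creatorRevenueShare = 0.8;      // Apify keeps 20%

    const costUsd =
        companyCount * (basePages + targetTitleCount) * serpCostUsd +
        companyCount * candidateLimit * verificationAttemptCostUsd;
    const revenueUsd =
        (actorStartPriceUsd + companyCount * maxEmployees * shortProfilePriceUsd) *
        creatorRevenueShare;
    return costUsd / revenueUsd;
}

// The worst-offender shape: many companies and titles, one employee each.
const bad = estimateRatio({ companyCount: 200, targetTitleCount: 30, maxEmployees: 1 });
// A normal small verified run.
const good = estimateRatio({ companyCount: 2, targetTitleCount: 3, maxEmployees: 25 });

console.log(bad.toFixed(2), good.toFixed(2)); // 10.00 0.07
console.log(bad > 0.75, good > 0.75);         // true false
```

With these assumed prices, the title-heavy run spends roughly ten dollars of compute per dollar of creator revenue, while the normal run spends about seven cents. Any reasonable ratio threshold separates them cleanly.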

Bug three: a default that was too generous

The third change is product taste. Default maxEmployees was 100. That is too many for verified scraping with current LinkedIn block rates. Most users wanted 10 to 20 anyway and just left the default. The new default is 25.

If you really want 100 verified profiles per company, set maxEmployees to 100 explicitly. Opting in costs you a few keystrokes instead of nothing.
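On Apify the default normally lives in the actor's input schema; in code it is just a nullish fallback. A sketch, with resolveMaxEmployees as a hypothetical helper name:

```javascript
// Hypothetical input resolution: an omitted value falls back to the
// new default, but an explicit 100 still works.
const DEFAULT_MAX_EMPLOYEES = 25;
const resolveMaxEmployees = (input) => input.maxEmployees ?? DEFAULT_MAX_EMPLOYEES;

console.log(resolveMaxEmployees({}));                    // 25
console.log(resolveMaxEmployees({ maxEmployees: 100 })); // 100
```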

Small change, real impact. The new default protected at least one user yesterday from accidentally triggering the margin preflight refusal.

What this means if you use the actor

Three things you will notice in the build that went live today:

You will not get billed for partial runs where charging stopped working. Either the run completes and you pay for everything you got, or it stops mid run and you pay nothing past the point the gate closed. Output and billing now stop together.

You will get rejected at submit time if your input is structurally unprofitable. The error message tells you exactly which knob to turn.

You will be defaulted into a smaller, faster run. Big jobs are still possible, you just have to opt in.

What I am doing next

The same three patterns apply to most of my PPE actors. The charge gate is already a module. I will be rolling it across the rest of the portfolio over the next week:

  • AI Content Detector
  • Email Validator API
  • URL Metadata Extractor
  • Domain WHOIS Lookup
  • Company Enrichment API
  • Website Intelligence API

These already shipped a fix yesterday for a different leak (GPT Store action pings hitting the standby actor with test payloads). The billing gate is the next layer.

If you build pay per event actors on Apify, take an hour and add a similar gate. The savings show up immediately. Your users start trusting your billing numbers because the numbers actually match the work.

Try the actor

The fixes are live as of today's build:

https://apify.com/george.the.developer/linkedin-company-employees-scraper

Pass companies as a list of LinkedIn URLs, set maxEmployees explicitly if you want more than 25, and watch the run console. The new guards should make the cost line predictable for the first time since I shipped this thing.

Yesterday was the diagnosis. Today is the fix. Tomorrow, I find out if anyone other than me actually cares.

Top comments (3)

toshihiro shishido

@the_aientrepreneur_7ae85, the "push first, charge second, swallow errors" anti-pattern hit me on a different platform but the failure mode is identical — and the fix shape (a gate that fails closed) is exactly right. Three things I'd add from the observability side, since the 29-day detection lag was as expensive as the bug itself:

  1. Per-run unit-economics graph beats absolute revenue/cost graph. Plotting revenue_per_run / cost_per_run over time would have shown the divergence within days instead of weeks. Absolute lines hide the bug because both grow; the ratio cliffs immediately when the charge gate breaks.

  2. Synthetic canary runs as a probe. A scheduled cron that runs your own actor with a known-good account, ~5 profiles, and asserts chargedCount === emittedCount. If it ever drifts, page yourself. Cheap, deterministic, catches every variant of bug 1/2/3 before a real user does.

  3. "User behavior changed" alerts are bug signals, not user-segment signals. "Top 2 users dropped 83% of revenue" reads as churn — but in your case it was your code that changed, not their behavior. Worth wiring anomaly detection on per-customer revenue to also surface "did the actor logic change near the drop date?" before you assume customer-side cause.

Curious — has the charge-gate refactor changed your retry/backoff strategy on the proxy side too, or is that still independent? That's where I'd guess the next bug lives.

George Kioko

Toshihiro, this is one of the sharpest post-incident takes I have read on this. The ratio graph point is a bullseye. I was watching absolute lines, and both drifted in the same direction, so the cliff was hidden. Switching the dashboard to per run revenue divided by cost tonight.

On your question, the retry and backoff is still independent for now. The proxy layer retries on its own counter and the charge gate fires off a separate counter that only ticks after a successful push. So a retry storm on the proxy side cannot inflate charges, and it also cannot deflate them, which I think is the right separation. The next bug you predicted is probably in the assumption that all retries return the same record shape. I have seen the verifier hand back a stub when the SERP cache is cold and that stub silently passes the gate. Adding a schema check on the gate input is the next patch.
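For concreteness, that schema check would be something like the sketch below. The field names (profileUrl, fullName) are placeholders; the real record shape is whatever the verifier emits.

```javascript
// Hypothetical shape check in front of the gate: a cold-cache stub
// missing real fields should be refused instead of getting charged.
function isChargeableProfile(record) {
    return Boolean(
        record &&
        typeof record.profileUrl === 'string' && record.profileUrl.length > 0 &&
        typeof record.fullName === 'string' && record.fullName.length > 0
    );
}

console.log(isChargeableProfile({ profileUrl: 'https://linkedin.com/in/x', fullName: 'X' })); // true
console.log(isChargeableProfile({ profileUrl: '' })); // false, a cold-cache stub
```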

Canary cron is going on the list. revenuescope.jp is bookmarked.

toshihiro shishido

Glad if any of it was useful and thanks for bookmarking RevenueScope!!
The schema check on the gate input is the kind of thing that only bites after you've trusted the counter for a month; a stub silently passing is worse than a hard failure.

Curious what threshold ends up actually paging you on the canary cron.
That's the part I never get right first try....