HelperX

Posted on Jun 25

Designing an Author-Filter Pipeline: Short-Circuiting From Cheap to Expensive Checks

#automation #javascript #architecture #node

At HelperX, the Reply (Search) and Reply (List) modules have to decide, for every tweet they encounter, whether the author is worth replying to. A tweet might match a keyword, but the author could be a bot, a brand-new account, or in a country our operator doesn't want to target.

The naive approach, fetching the full author profile and running every check against it, works, but it's slow and expensive. Most candidates fail the cheapest checks. Fetching geo data and computing an engagement score for accounts that fail the minimum-follower filter is wasted work.

This article is about the pipeline we built to evaluate authors: ordered, short-circuiting filters that run from cheapest to most expensive, so we spend API budget only on candidates that pass everything cheaper first.

The filters, and what they cost

Our reply modules support four author filters, configurable per slot:

Minimum followers — author must have at least N followers.
Verification — author must (or must not) be verified.
X-Score — a proprietary engagement-quality score we compute per author.
Country blacklist — author must not be in a list of blocked countries.

These have wildly different costs, and the cost isn't always obvious from the name:

Filter	Data needed	Relative cost	Why
Verification	Verified flag	Lowest	Almost always in the tweet payload
Minimum followers	Follower count	Low	Usually in the tweet payload
Country blacklist	Geo (IP or profile)	Medium	Requires geo lookup
X-Score	Engagement history	Highest	Computed from multiple fetched data points

The key insight: the payload that comes with a tweet already contains some author metadata (follower count, verified flag, sometimes geo). Anything in the payload is essentially free — we already have it. Anything not in the payload requires a fetch, and the X-Score requires several fetches plus computation.

This means the optimal order isn't "the most selective filter first." It's "the cheapest filter first, then progressively more expensive ones."

The short-circuit principle

A pipeline that short-circuits stops at the first failing filter. If an author has 12 followers and our minimum is 1,000, we reject them on the cheap follower check and never pay for the expensive X-Score computation.

The expected cost of evaluating one author through the full pipeline is:

E[cost] = cost(f1)
        + P(pass f1) * cost(f2)
        + P(pass f1) * P(pass f2) * cost(f3)
        + ...

Here f1, f2, ... are the filters in order, and the formula assumes each filter passes or fails independently of the others. Each filter's cost is weighted by the probability that a candidate reaches it at all, so cheap filters that reject many candidates belong at the front, even if a more selective but more expensive filter exists.
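The formula is easy to evaluate for any candidate order. The sketch below does that with invented numbers: the costs (arbitrary units) and pass rates are assumptions for illustration, not measurements from our system, and the filters are treated as independent.

```javascript
// Hypothetical per-filter numbers, for illustration only.
// cost is in arbitrary units; passRate is P(pass) for that filter.
const filters = {
  verified:  { cost: 0.5, passRate: 0.5 },
  followers: { cost: 1,   passRate: 0.4 },
  country:   { cost: 20,  passRate: 0.5 },
  xScore:    { cost: 200, passRate: 0.4 },
};

// E[cost] = cost(f1) + P(pass f1) * cost(f2) + P(pass f1) * P(pass f2) * cost(f3) + ...
function expectedCost(order) {
  let reachProbability = 1; // probability that a candidate reaches the current filter
  let total = 0;
  for (const name of order) {
    const { cost, passRate } = filters[name];
    total += reachProbability * cost;
    reachProbability *= passRate;
  }
  return total;
}

const cheapFirst = expectedCost(['verified', 'followers', 'country', 'xScore']);
const expensiveFirst = expectedCost(['xScore', 'country', 'followers', 'verified']);
console.log({ cheapFirst, expensiveFirst });
```

With these made-up numbers the cheap-first order costs about 25 units per candidate and the expensive-first order about 208, because the expensive stage is paid for only by the small share of candidates that survive the cheap ones.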

In practice: verification and follower-count checks come from the payload (nearly free), so they always go first. The X-Score computation is expensive, so it always goes last.

Building the pipeline

Each filter is a small function with a uniform signature: take an author context, return a decision.

// A filter returns one of three decisions.
const DECISION = {
  PASS: 'pass',     // passes this filter; continue to the next
  REJECT: 'reject', // fails this filter; reject the author
  SKIP: 'skip',     // filter not configured; continue to the next
};

// Each factory returns a *named* function so that filter.name is meaningful
// in traces. An anonymous arrow function would report an empty name.
function makeMinFollowersFilter(minFollowers) {
  return function minFollowersFilter(ctx) {
    if (!minFollowers || minFollowers <= 0) return DECISION.SKIP;
    if (ctx.author.followers == null) return DECISION.SKIP; // can't evaluate
    return ctx.author.followers >= minFollowers ? DECISION.PASS : DECISION.REJECT;
  };
}

function makeVerifiedFilter(requireVerified) {
  return function verifiedFilter(ctx) {
    if (!requireVerified) return DECISION.SKIP;
    return ctx.author.verified ? DECISION.PASS : DECISION.REJECT;
  };
}

function makeCountryBlacklistFilter(blacklist) {
  return async function countryBlacklistFilter(ctx) {
    if (!blacklist || blacklist.length === 0) return DECISION.SKIP;
    // Geo isn't in the payload, so this is a fetch
    const country = await getAuthorCountry(ctx.author.id);
    ctx.author.country = country; // cache for downstream reuse
    return blacklist.includes(country) ? DECISION.REJECT : DECISION.PASS;
  };
}

Note the three-decision model. SKIP matters: it distinguishes "this filter passed" from "this filter isn't in use." A pipeline where every filter returns SKIP lets everything through — correct behavior when the operator configured no filters.

Filters can be async (the country filter fetches geo). The pipeline awaits each in turn.
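The pipeline also uses a makeXScoreFilter factory that this article doesn't show. A minimal sketch of what such a filter could look like follows, under stated assumptions: the scoring function is injected as a second parameter named computeXScore (hypothetical, since the real scoring logic is proprietary), and a missing score is treated as SKIP rather than REJECT.

```javascript
const DECISION = { PASS: 'pass', REJECT: 'reject', SKIP: 'skip' };

// computeXScore is a hypothetical async function (author -> number or null).
// It stands in for the real scoring logic, which is not shown in this article.
function makeXScoreFilter(minXScore, computeXScore) {
  // Named function expression, so filter.name is usable in traces.
  return async function xScoreFilter(ctx) {
    if (minXScore == null) return DECISION.SKIP; // filter not configured
    const score = await computeXScore(ctx.author);
    if (score == null) return DECISION.SKIP; // can't evaluate; don't reject on a data gap
    ctx.author.xScore = score; // keep the score on the context for later inspection
    return score >= minXScore ? DECISION.PASS : DECISION.REJECT;
  };
}

// Usage with a stand-in scorer invented for this example.
const fakeScorer = async (author) => Math.min(100, author.followers / 100);
const demoFilter = makeXScoreFilter(50, fakeScorer);
demoFilter({ author: { id: 'a1', followers: 8000 } })
  .then((decision) => console.log(decision));
```

Injecting the scorer keeps the filter testable without network access; the uniform signature means the runner doesn't need to know this stage is the expensive one.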

The pipeline runner

async function evaluateAuthor(author, filters) {
  const ctx = { author: { ...author }, traces: [] };

  for (const filter of filters) {
    const decision = await filter(ctx);
    ctx.traces.push({ filter: filter.name, decision });

    if (decision === DECISION.REJECT) {
      return { accepted: false, reason: filter.name, ctx };
    }
    // PASS or SKIP -> continue
  }

  return { accepted: true, ctx };
}

Short-circuiting is the return on REJECT. The traces array is gold for debugging: when an operator asks "why didn't we reply to @someone?", we can show them exactly which filter rejected them and on what data.
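To show how a trace can answer that operator question, here is a small sketch that turns a runner result into a one-line explanation. The helper name explainResult and the sample filter names are invented for this example; it assumes the result shape returned by evaluateAuthor and falls back to a placeholder when a filter function has no name (an anonymous arrow function reports an empty one).

```javascript
// Formats an evaluateAuthor()-style result: { accepted, reason, ctx: { traces } }.
function explainResult(handle, result) {
  const label = (name) => name || '(unnamed)';
  const steps = result.ctx.traces
    .map((t) => `${label(t.filter)}=${t.decision}`)
    .join(', ');
  const verdict = result.accepted
    ? 'accepted'
    : `rejected by ${label(result.reason)}`;
  return `@${handle}: ${verdict} [${steps}]`;
}

// Hand-written sample result, for illustration only.
const sample = {
  accepted: false,
  reason: 'minFollowersFilter',
  ctx: {
    author: { id: '42', followers: 12, verified: true },
    traces: [
      { filter: 'verifiedFilter', decision: 'skip' },
      { filter: 'minFollowersFilter', decision: 'reject' },
    ],
  },
};

console.log(explainResult('someone', sample));
// @someone: rejected by minFollowersFilter [verifiedFilter=skip, minFollowersFilter=reject]
```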

Ordering the pipeline for minimum cost

The operator configures which filters to enable, but we control the order. We hard-code the order by cost, not by what the operator enabled:

function buildPipeline(slotConfig) {
  // Every factory always returns a function. An unconfigured filter returns
  // SKIP at run time, so there is nothing to drop from this array.
  return [
    makeVerifiedFilter(slotConfig.requireVerified),          // cheapest (payload)
    makeMinFollowersFilter(slotConfig.minFollowers),         // cheap (payload)
    makeCountryBlacklistFilter(slotConfig.countryBlacklist), // medium (1 fetch)
    makeXScoreFilter(slotConfig.minXScore),                  // expensive (compute)
  ];
}

Notice we don't let the operator choose the order. That's intentional. An operator might think "X-Score is my most important filter, put it first" — but running an expensive filter first means paying for it on every candidate, including the 70% that would have been rejected by the free follower check. The pipeline cost is our problem, not theirs, so we control the order.

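The ordering principle generalizes: in a multi-stage validation pipeline, order stages from cheapest to most expensive rather than from most selective to least. Selectivity still matters: for independent filters, the exact optimum sorts stages by cost divided by rejection rate. When costs differ by orders of magnitude, as they do between a payload check and an X-Score computation, that rule and plain cost ordering give the same answer.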
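One way to sanity-check a cost-based order is the standard rule for independent filters: sort by cost divided by the probability of rejection, ascending. The sketch below applies it to invented numbers (the costs and pass rates are assumptions, and every pass rate is assumed to be below 1) and arrives at the same order we hard-code.

```javascript
// Hypothetical stage numbers, for illustration only.
const stages = [
  { name: 'xScore',    cost: 200, passRate: 0.4 },
  { name: 'country',   cost: 20,  passRate: 0.5 },
  { name: 'followers', cost: 1,   passRate: 0.4 },
  { name: 'verified',  cost: 0.5, passRate: 0.5 },
];

// Rank = cost / P(reject). Lower rank means "cheap per candidate removed",
// so those stages run first. Assumes passRate < 1 for every stage.
function orderByRank(list) {
  const rank = (s) => s.cost / (1 - s.passRate);
  return [...list].sort((a, b) => rank(a) - rank(b)); // copy, don't mutate the input
}

const ordered = orderByRank(stages).map((s) => s.name);
console.log(ordered);
```

Because the costs here span more than two orders of magnitude while the rejection rates are all of similar size, the rank order and the plain cost order coincide. They can diverge when two stages cost about the same but reject very different shares of candidates.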

The caching layer that makes it cheap

Even with short-circuiting, the medium-cost filters (country) and expensive filters (X-Score) would be costly if re-evaluated for every encounter with the same author. An author who tweets 5 times a day shouldn't trigger 5 geo fetches.

We cache filter results per author with a TTL:

const authorCache = new Map(); // authorId -> { data, fetchedAt }
const CACHE_TTL = 6 * 60 * 60 * 1000; // 6 hours

async function getAuthorCountry(authorId) {
  const cached = authorCache.get(authorId);
  if (cached && Date.now() - cached.fetchedAt < CACHE_TTL) {
    return cached.data.country;
  }
  const country = await fetchAuthorCountry(authorId); // the real fetch
  authorCache.set(authorId, {
    data: { country },
    fetchedAt: Date.now(),
  });
  return country;
}

The TTL is a judgment call. Too short and we re-fetch constantly; too long and we miss changes (an author who moves countries, or whose follower count crosses a threshold). Six hours is our balance — long enough to amortize fetches across a day's tweets, short enough to catch real changes.

Caching interacts with the pipeline order in a subtle way. The country filter, when cached, becomes almost as cheap as the payload filters. So in steady state (after warmup), the effective order is: payload filters, then cached country, then X-Score. The expensive stage stays expensive, but the medium stage amortizes to near-free.

Edge cases worth handling

A few that bit us:

1. Missing data in the payload. Sometimes follower count isn't in the tweet payload (API changes, partial fetches). The follower filter returns SKIP instead of failing. We'd rather over-include an author we can't evaluate than reject one for a data gap. Over-inclusion is recoverable; false rejection is invisible.

2. Filter that fetches, then fails. If the country filter fetches geo and then rejects, we've paid the fetch cost and gained nothing — except we've cached the geo, so the next time we see this author, the country filter is free. The trace records the fetch so cost accounting is honest.

3. The X-Score dependency on other filters. Our X-Score computation uses follower count as an input. If the follower filter ran first (it did) and the author passed, the follower count is already in ctx.author — no re-fetch needed. The pipeline context object is the mechanism for passing cheap-filter results to expensive filters.

4. Race conditions on cache. Two modules evaluating the same author simultaneously can double-fetch. We use a simple in-flight promise dedup (cache the promise of the fetch, not just the result) to collapse concurrent fetches into one.
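The in-flight dedup from the last item can be sketched as follows. This is an illustration under assumptions, not our production code: the names inFlight and getAuthorCountryDeduped are invented, and the geo lookup is passed in as a parameter so the example stays self-contained.

```javascript
const CACHE_TTL = 6 * 60 * 60 * 1000; // 6 hours
const authorCache = new Map(); // authorId -> { country, fetchedAt }
const inFlight = new Map();    // authorId -> Promise resolving to the country

// fetchAuthorCountry stands for whatever performs the real geo lookup.
function getAuthorCountryDeduped(authorId, fetchAuthorCountry) {
  const cached = authorCache.get(authorId);
  if (cached && Date.now() - cached.fetchedAt < CACHE_TTL) {
    return Promise.resolve(cached.country);
  }

  // A fetch for this author is already running: share its promise.
  const pending = inFlight.get(authorId);
  if (pending) return pending;

  const promise = Promise.resolve()
    .then(() => fetchAuthorCountry(authorId))
    .then((country) => {
      authorCache.set(authorId, { country, fetchedAt: Date.now() });
      return country;
    })
    .finally(() => inFlight.delete(authorId)); // clear on success or failure

  inFlight.set(authorId, promise);
  return promise;
}
```

Clearing the in-flight entry in finally matters: if the fetch fails, the rejected promise is not left behind for later callers, and the next encounter with that author can retry.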

Measuring the pipeline

We log every filter decision. Over a representative week:

78% of candidates rejected by the payload filters (verified, followers) — essentially free.
14% rejected by the country filter — one fetch each, mostly cached on repeat encounters.
5% rejected by X-Score — expensive, but only the 8% that passed everything cheaper got there.
3% accepted — the X-Score fetch for these is the cost of a real reply target, money well spent.

If we'd run the expensive X-Score first, we'd have computed it for 100% of candidates instead of 8%: a 12.5x increase in compute cost for identical results. That's the value of ordering by cost, not by selectivity.
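The arithmetic behind that comparison can be reproduced from the rejection shares listed above. The helper name reachPerStage is invented for this sketch; the percentages are the ones from our logs.

```javascript
// Rejection shares from the measurements above, in percent of all candidates.
const measured = [
  { name: 'payload filters', rejected: 78 },
  { name: 'country',         rejected: 14 },
  { name: 'X-Score',         rejected: 5 },
];

// Share of candidates that reach a stage = 100 minus everything rejected earlier.
function reachPerStage(list) {
  let remaining = 100;
  return list.map(({ name, rejected }) => {
    const reached = remaining;
    remaining -= rejected;
    return { name, reached };
  });
}

const reach = reachPerStage(measured);
const xScoreReach = reach.find((s) => s.name === 'X-Score').reached;

console.log(reach.map((s) => `${s.name}: ${s.reached}%`).join(', '));
// payload filters: 100%, country: 22%, X-Score: 8%
console.log(100 / xScoreReach); // 12.5
```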

What we learned

1. Order by cost, not by importance. The most "important" filter (X-Score, for us) goes last, because it's the most expensive. Operators don't see the order; they see correct results and fast, cheap operation.

2. Short-circuit early and often. The pipeline's value is almost entirely in not running the expensive stages. Every cheap reject is free performance.

3. Cache aggressively, TTL honestly. Filter inputs change, but slowly. A 6-hour cache captures most of the benefit of caching without masking real changes.

4. Trace everything. When an operator asks why their account isn't replying to a specific author, the filter trace answers instantly. Without it, you're debugging blind.

5. Distinguish "pass" from "not configured." The three-decision model (PASS / REJECT / SKIP) prevents a subtle bug where an unconfigured filter accidentally rejects or accepts everything.

The pattern generalizes beyond author filtering. Any system that validates candidates through multiple checks — user input, transaction risk, content moderation — benefits from the same structure: uniform filter signatures, cost-ordered execution, short-circuit on reject, and a trace for every decision.

HelperX runs ordered, short-circuiting author-filter pipelines behind every reply module — so operators target the right accounts without paying for expensive checks on candidates that fail the cheap ones. Free 30-day trial.
