
Alex Spinov

Posted on • Originally published at blog.spinov.online

5 Apify webhook patterns that turn one-off scrapers into reliable data pipelines


This article was drafted with AI assistance and edited by a human author. All metrics and patterns come from actors I personally maintain on Apify.

I maintain 79 Apify actors (32 public on the Store). The Trustpilot review scraper alone has 954 runs across 3 paying users; the Reddit scraper has 84; the email extractor has 109. The runs themselves work — proxies rotate, pagination terminates, output lands in the dataset. The hard part is what happens after the run.

Most actors I see in the wild treat the dataset as the final destination. The user has to remember to log into Apify, click into the run, and download a CSV. That works for one-off jobs. It does not work for a customer who wants the data in their Postgres at 06:00 UTC every morning, or in a Slack channel the moment a competitor's price changes, or piped into a vector store for retrieval-augmented search.

Webhooks are how you bridge "the actor finished" with "the data is now where the customer needs it." But Apify webhooks are easy to get wrong. Below are the 5 patterns I wish every actor I owned had on day one. None of them require a new dependency. All of them survive the move from "free actor with 10 runs/month" to "paying customer with 1000 runs/month."


Pattern 1: Fire one webhook per ACTOR.RUN.SUCCEEDED and one per ACTOR.RUN.FAILED — never share a handler

The instinct when you wire up your first webhook is to make it generic: one endpoint that receives "the run is done" and figures out the rest from the payload. This breaks the day a bug in a customer's success handler silently swallows failure events too.

Wrong:

{
  "eventTypes": ["ACTOR.RUN.SUCCEEDED", "ACTOR.RUN.FAILED", "ACTOR.RUN.TIMED_OUT", "ACTOR.RUN.ABORTED"],
  "requestUrl": "https://customer.example.com/apify-hook"
}

Right:

[
  { "eventTypes": ["ACTOR.RUN.SUCCEEDED"], "requestUrl": "https://customer.example.com/hooks/run-success" },
  { "eventTypes": ["ACTOR.RUN.FAILED", "ACTOR.RUN.TIMED_OUT", "ACTOR.RUN.ABORTED"], "requestUrl": "https://customer.example.com/hooks/run-failure" }
]

The split is worth the extra config row because a success handler that processes data and a failure handler that pages on-call are different services. They have different retry policies, different secrets, and different blast radii. When one breaks, you do not want it to take the other down.
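If you provision webhooks per customer, a small helper keeps the split from drifting between accounts. A sketch, assuming the `/hooks/run-success` and `/hooks/run-failure` endpoint paths from the config above (`buildWebhookConfigs` is a hypothetical name, not an Apify API):

```javascript
// Build the split webhook configs for a customer's base URL.
// Success and failure deliberately go to different endpoints.
const FAILURE_EVENTS = ['ACTOR.RUN.FAILED', 'ACTOR.RUN.TIMED_OUT', 'ACTOR.RUN.ABORTED'];

function buildWebhookConfigs(baseUrl) {
  return [
    { eventTypes: ['ACTOR.RUN.SUCCEEDED'], requestUrl: `${baseUrl}/hooks/run-success` },
    { eventTypes: FAILURE_EVENTS, requestUrl: `${baseUrl}/hooks/run-failure` },
  ];
}

const configs = buildWebhookConfigs('https://customer.example.com');
```

Feed each entry of `configs` to however you create webhooks (console, API, or actor code); the point is that the success/failure split is encoded once, not copy-pasted.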

Pattern 2: Use payloadTemplate to send only what the receiver actually needs

The default Apify webhook payload includes the full run object — about 4 KB of JSON with fields like meta, stats, usage, containerUrl, buildId. Most receivers care about three fields: runId, defaultDatasetId, and a status enum.

Sending 4 KB when 200 bytes will do means more bandwidth on the receiver's side, more parsing time, and (the real problem) more accidental coupling — your customer's code starts depending on stats.requestsFinished, and the day Apify renames or removes that field, the integration breaks.

Use payloadTemplate to flatten:

{
  "runId": "{{resource.id}}",
  "actorId": "{{resource.actId}}",
  "datasetId": "{{resource.defaultDatasetId}}",
  "status": "{{resource.status}}",
  "startedAt": "{{resource.startedAt}}",
  "finishedAt": "{{resource.finishedAt}}",
  "actorVersion": "{{resource.buildNumber}}"
}

Now your receiver gets a stable, minimal contract. When you need a new field, you add it explicitly — never by accident.
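On the receiving side, the contract is small enough to validate explicitly at the door. A sketch of a hypothetical `parseRunPayload` helper, assuming the template above; it rejects malformed payloads and deliberately drops any extra fields so nothing downstream can couple to them:

```javascript
// Validate and extract the minimal webhook contract from Pattern 2.
// Throws on a missing field instead of failing deep in the pipeline.
const REQUIRED_FIELDS = ['runId', 'datasetId', 'status'];

function parseRunPayload(body) {
  for (const field of REQUIRED_FIELDS) {
    if (typeof body[field] !== 'string' || body[field].length === 0) {
      throw new Error(`webhook payload missing required field: ${field}`);
    }
  }
  // Return only the contract fields, discarding anything else.
  const { runId, datasetId, status } = body;
  return { runId, datasetId, status };
}
```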

Pattern 3: HMAC-sign every webhook so the receiver can prove it came from Apify

A webhook URL that leaks once into a customer's load-balancer logs lives forever. Anyone who finds it can POST your payload format and trigger downstream work. The defense is not "rotate the URL"; it is "sign every payload."

Apify's webhook configuration accepts custom headers. Use one to send an HMAC of the payload:

Sender side (your actor's webhook config or a small wrapper):

import crypto from 'node:crypto';

const secret = process.env.WEBHOOK_HMAC_SECRET;
const payload = JSON.stringify({ runId, datasetId, status });
const signature = crypto.createHmac('sha256', secret).update(payload).digest('hex');

await fetch(customerUrl, {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
    'x-apify-signature': `sha256=${signature}`,
  },
  body: payload,
});

Receiver side:

const expected = crypto
  .createHmac('sha256', process.env.WEBHOOK_HMAC_SECRET)
  .update(req.rawBody)
  .digest('hex');

const provided = (req.headers['x-apify-signature'] || '').replace(/^sha256=/, '');
const expectedBuf = Buffer.from(expected);
const providedBuf = Buffer.from(provided);

// timingSafeEqual throws if the buffers differ in length, so guard that first
if (expectedBuf.length !== providedBuf.length || !crypto.timingSafeEqual(expectedBuf, providedBuf)) {
  return res.status(401).end();
}

Two notes that catch people: (1) compute the HMAC on the raw request body, before any JSON parsing; middleware that mutates the body silently breaks signatures. (2) Use timingSafeEqual, not ===. String comparison short-circuits on the first mismatched character, which lets attackers brute-force the signature one byte at a time. And check that the two buffers have the same length before calling timingSafeEqual, because it throws on a length mismatch.
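Both gotchas are easier to get right when verification lives in one pure function that takes the raw body. A sketch (the `sha256=` prefix matches the sender above; `verifySignature` is a hypothetical name):

```javascript
import crypto from 'node:crypto';

// Verify an HMAC-SHA256 signature header against the raw request body.
// Returns false (never throws) for missing, malformed, or wrong-length signatures.
function verifySignature(rawBody, signatureHeader, secret) {
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
  const provided = (signatureHeader || '').replace(/^sha256=/, '');
  const expectedBuf = Buffer.from(expected);
  const providedBuf = Buffer.from(provided);
  // timingSafeEqual throws on length mismatch, so guard first
  if (expectedBuf.length !== providedBuf.length) return false;
  return crypto.timingSafeEqual(expectedBuf, providedBuf);
}
```

A pure function like this is also trivially unit-testable, which the inline middleware version is not.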

Pattern 4: Make the receiver idempotent — Apify retries failed webhooks

Apify retries a webhook up to 11 times with exponential backoff if your endpoint returns a non-2xx. That's a feature: it means transient receiver-side failures don't lose data. It's also a footgun: it means your receiver can be called multiple times for the same run.

If your handler does anything stateful — inserts into Postgres, posts to Slack, sends an email — you must dedupe by runId.

The minimum viable dedupe table:

CREATE TABLE apify_webhook_seen (
  run_id TEXT PRIMARY KEY,
  received_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Handler:

const { runId, datasetId, status } = req.body;

const { rowCount } = await pg.query(
  'INSERT INTO apify_webhook_seen (run_id) VALUES ($1) ON CONFLICT DO NOTHING',
  [runId]
);

if (rowCount === 0) {
  return res.status(200).json({ ok: true, already_processed: true });
}

await processRun(runId, datasetId, status);
res.status(200).end();

The INSERT ... ON CONFLICT DO NOTHING is the heart of the pattern. The second call for the same run inserts zero rows, the handler exits early, and nothing downstream sees a duplicate. This is the same pattern Stripe documents for its own webhooks; it is the de facto standard for at-least-once delivery.
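The same first-write-wins shape works against any store with an atomic "insert if absent" operation. A minimal in-memory sketch of the control flow (a real deployment needs shared, persistent storage like the Postgres table above, so treat this purely as an illustration):

```javascript
// First-write-wins dedupe keyed by runId.
// markSeen returns true exactly once per runId, mirroring
// INSERT ... ON CONFLICT DO NOTHING returning rowCount 1, then 0.
const seenRuns = new Set();

function markSeen(runId) {
  if (seenRuns.has(runId)) return false;
  seenRuns.add(runId);
  return true;
}
```

The handler shape stays identical: if `markSeen(runId)` is false, return 200 immediately without touching anything downstream.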

Pattern 5: Don't process the dataset inside the webhook — enqueue and return 200 fast

Apify expects a webhook response in 30 seconds. If your handler downloads a 200 MB dataset, parses it, runs deduplication, and writes to Postgres before returning, you will hit timeouts on every large run. Apify will then retry — and your half-finished writes will compound.

The correct shape: webhook = enqueue, worker = process.

// Webhook handler
app.post('/hooks/run-success', async (req, res) => {
  // verify HMAC (Pattern 3) + dedupe (Pattern 4) first
  await jobQueue.add('process-apify-run', {
    runId: req.body.runId,
    datasetId: req.body.datasetId,
  });
  res.status(200).json({ ok: true, queued: true });
});

// Worker (separate process)
jobQueue.process('process-apify-run', async (job) => {
  const { runId, datasetId } = job.data;
  for await (const item of streamDataset(datasetId)) {
    await upsertItem(item);
  }
});

Your queue can be BullMQ on Redis, AWS SQS, Postgres LISTEN/NOTIFY, or a single row in a jobs table polled by a cron. The technology doesn't matter. What matters is that the webhook handler does only: verify, dedupe, enqueue, return. Heavy work happens in the worker, where you have hours, not seconds, and where retries are your own controlled retries — not Apify's blind 11.
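The "jobs table polled by a cron" variant is the smallest of those options. A sketch of the enqueue/claim split, using an in-memory array where the real thing would be a Postgres table (`enqueueJob` and `claimNextJob` are hypothetical names):

```javascript
// Minimal enqueue/claim split: the webhook handler only appends a job;
// a separate worker later claims the oldest pending job and processes it.
const jobs = [];

function enqueueJob(type, data) {
  jobs.push({ type, data, status: 'pending' });
}

function claimNextJob(type) {
  // Oldest-first; marking 'claimed' keeps a crashed worker's job visible for inspection.
  const job = jobs.find((j) => j.type === type && j.status === 'pending');
  if (job) job.status = 'claimed';
  return job || null;
}
```

In Postgres the claim step would be a single UPDATE with a row lock (so concurrent workers never grab the same job), but the shape — append fast in the webhook, claim and process slowly in the worker — is the same.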


Why these 5 and not 15

Webhooks have dozens of edge cases. SSL termination, IP allowlists, header size limits, timezone-naive timestamps, payload-too-large errors, dead-letter queues, exponential-backoff jitter. I picked these 5 because they are the ones that will bite a paying customer in their first month, and the ones that take 30 minutes each to wire in if you do them on day one — versus 3 days each to retrofit after the integration is in production.

If you maintain Apify actors with paying users, audit each of yours against this list. The Trustpilot scraper and the email extractor both run all 5 patterns; the older actors do not, and every customer support thread I have ever handled traces back to one of these 5 failures.



Disclosure: I maintain Apify actors related to this topic; the apify.com link below directs to my Apify Store profile.

I write about Apify production patterns and scraping engineering on blog.spinov.online and on Telegram @scraping_ai. Author profile: apify.com/knotless_cadence (79 actors, 32 public). Questions about a specific actor or webhook pattern — open an issue on the actor page or email spinov001@gmail.com.
