Artyom Rabzonov

I built an open dataset of 1,119 SaaS webhook events. Here's what surprised me.

If you have ever tried to wire up webhooks from more than three SaaS apps, you already know the punchline: every vendor invented their own conventions, and none of them are wrong, but none of them agree either.

I was building agent tooling that had to understand many of them. Halfway through, I stopped trying to keep it in my head and started writing it down. Then I kept writing it down. The result is now an open dataset.

1,119 webhook events. 30 platforms. One schema. Free, CC-BY-4.0.

What it is

A normalized catalog covering Stripe, GitHub, Slack, Notion, Linear, Jira, HubSpot, Salesforce, Zendesk, Intercom, Discord, Twilio, Calendly, Mailchimp, Zoom, Microsoft Teams, PagerDuty, Pipedrive, Asana, ClickUp, Front, Help Scout, Loom, Greenhouse, Ashby, BambooHR, Gusto, Attio, Close, and Freshdesk.

For every event, you get:

  • event_name and trigger_description
  • payload_schema as JSON Schema (draft 2020-12)
  • auth_method, signature_header, and the exact signing algorithm
  • delivery_guarantees and retry_policy
  • idempotency_key_header (if the vendor provides one)
  • docs_url back to the canonical vendor docs

The data ships as JSONL per vendor, plus Parquet for the full set.

A sample row

Here is what a single Stripe event looks like, trimmed:

{
  "vendor": "stripe",
  "category": "payments",
  "event_name": "account.application.authorized",
  "trigger_description": "Fires when a user authorizes a Stripe application.",
  "auth_method": "hmac-sha256",
  "signature_header": "Stripe-Signature",
  "signature_algorithm_detail": "HMAC-SHA256 with versioned scheme; header contains timestamp and v1 hash. Verify timestamp to prevent replay attacks.",
  "delivery_guarantees": "at-least-once",
  "retry_policy": {
    "max_attempts": null,
    "backoff": "Exponential backoff over multiple hours.",
    "total_retry_window": "PT72H"
  },
  "payload_schema": { "$schema": "https://json-schema.org/draft/2020-12/schema", "type": "object", "properties": { "...": "..." } },
  "docs_url": "https://docs.stripe.com/api/events/types"
}

Same shape across all 30 vendors. That is the entire point.
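
Because every row shares the same keys, one small loader can cover all vendors. Here is a minimal sketch, assuming a JSONL export with one event per line. The `WebhookEvent` class and `parse_events` helper are hypothetical names for illustration, and only fields shown in the sample row above are used.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class WebhookEvent:
    """Hypothetical typed view over a subset of a row's fields."""
    vendor: str
    event_name: str
    auth_method: Optional[str]
    signature_header: Optional[str]
    delivery_guarantees: Optional[str]


def parse_events(lines: Iterable[str]) -> Iterator[WebhookEvent]:
    """Yield one WebhookEvent per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        yield WebhookEvent(
            vendor=row["vendor"],
            event_name=row["event_name"],
            # .get() because a vendor may not document every field.
            auth_method=row.get("auth_method"),
            signature_header=row.get("signature_header"),
            delivery_guarantees=row.get("delivery_guarantees"),
        )


# One trimmed row, taken from the Stripe sample above.
sample = (
    '{"vendor": "stripe", "event_name": "account.application.authorized", '
    '"auth_method": "hmac-sha256", "signature_header": "Stripe-Signature", '
    '"delivery_guarantees": "at-least-once"}'
)
events = list(parse_events([sample, ""]))
print(events[0].signature_header)
```

In practice you would pass an open file handle instead of a list of strings, since a file iterates line by line.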

The surprising stuff

A few things jumped out once everything was in one schema:

Signing is not standardized beyond the choice of algorithm. A small sampler:

Vendor  | Auth        | Header                | Detail
Stripe  | HMAC-SHA256 | Stripe-Signature      | Timestamp + v1 hash, replay-window enforced
GitHub  | HMAC-SHA256 | X-Hub-Signature-256   | Plus legacy SHA1 on X-Hub-Signature
Slack   | HMAC-SHA256 | X-Slack-Signature     | 5-minute timestamp window
Shopify | HMAC-SHA256 | X-Shopify-Hmac-Sha256 | Base64 of HMAC of raw body
Linear  | HMAC-SHA256 | Linear-Signature      | Hex digest only

Same algorithm, five different envelopes. Anyone writing one verifier and trying to reuse it has a bad afternoon ahead.
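
To make the envelope differences concrete, here is a minimal sketch of three verifiers, written from my reading of each vendor's documented scheme. The function names are hypothetical, the 300-second Stripe tolerance is an assumption rather than a value from the dataset, and Slack and Linear are left out for brevity.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def _hmac_sha256(secret: bytes, message: bytes) -> bytes:
    return hmac.new(secret, message, hashlib.sha256).digest()


def _equal(a: str, b: str) -> bool:
    # Constant-time comparison; encode first so non-ASCII input cannot raise.
    return hmac.compare_digest(a.encode(), b.encode())


def verify_github(secret: bytes, body: bytes, header: str) -> bool:
    """X-Hub-Signature-256: 'sha256=' followed by the hex HMAC of the raw body."""
    return _equal("sha256=" + _hmac_sha256(secret, body).hex(), header)


def verify_shopify(secret: bytes, body: bytes, header: str) -> bool:
    """X-Shopify-Hmac-Sha256: base64 of the HMAC of the raw body."""
    return _equal(base64.b64encode(_hmac_sha256(secret, body)).decode(), header)


def verify_stripe(secret: bytes, body: bytes, header: str,
                  tolerance_s: int = 300, now: Optional[float] = None) -> bool:
    """Stripe-Signature: 't=<unix ts>,v1=<hex>' signed over '<ts>.<raw body>'."""
    parts = [p.strip().split("=", 1) for p in header.split(",") if "=" in p]
    timestamp = next((v for k, v in parts if k == "t"), None)
    candidates = [v for k, v in parts if k == "v1"]
    if timestamp is None or not timestamp.isdigit():
        return False
    # Reject stale deliveries to limit replay (tolerance is an assumption).
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False
    expected = _hmac_sha256(secret, timestamp.encode() + b"." + body).hex()
    return any(_equal(expected, c) for c in candidates)
```

All three compute the same HMAC. What differs is the message that gets signed, the encoding of the digest, and the prefix or key-value wrapper around it, which is why a single shared verifier does not work. Always verify against the raw request bytes, not a re-serialized JSON body.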

Retry policies are wildly different. Stripe retries for up to 72 hours with exponential backoff. GitHub does not retry failed deliveries by default (the behavior depends on app type). Slack retries up to 3 times. Some vendors do not publish a retry policy in their docs, which means you should not rely on one.
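
The sample row stores `total_retry_window` as an ISO 8601 duration (`PT72H`). Python's standard library has no parser for these, so here is a minimal sketch that handles only the day, hour, minute, and second components. `parse_retry_window` is a hypothetical helper, and treating a missing value as "no documented window" is my assumption.

```python
import re
from datetime import timedelta
from typing import Optional

# Covers forms like PT72H, P3D, P1DT30M. Years, months, and weeks are not handled.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_retry_window(value: Optional[str]) -> Optional[timedelta]:
    """Return the retry window as a timedelta, or None if none is documented."""
    if not value:
        return None
    match = _DURATION.match(value)
    parts = {}
    if match is not None:
        parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    if not parts:
        raise ValueError(f"unsupported duration: {value!r}")
    return timedelta(**parts)


window = parse_retry_window("PT72H")
print(window)
```

A consumer can use the result to decide how long to keep deduplication state around: once the window has passed, a redelivery of the same event is no longer expected.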

Idempotency support is hit-or-miss. GitHub gives you X-GitHub-Delivery. Stripe gives you the event id. Several vendors give you nothing, which means you either dedupe by payload hash or accept duplicates.
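
The fallback logic can be sketched in a few lines. This is a minimal in-memory version; `dedupe_key` and `SeenDeliveries` are hypothetical names, and `id_header` stands for whatever the row's `idempotency_key_header` field holds (or `None`).

```python
import hashlib
from typing import Mapping, Optional, Set


def dedupe_key(vendor: str, headers: Mapping[str, str], body: bytes,
               id_header: Optional[str]) -> str:
    """Prefer the vendor's delivery id; fall back to a hash of the raw body."""
    if id_header:
        # HTTP header names are case-insensitive, so normalize before lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        delivery_id = lowered.get(id_header.lower())
        if delivery_id:
            return f"{vendor}:id:{delivery_id}"
    return f"{vendor}:sha256:{hashlib.sha256(body).hexdigest()}"


class SeenDeliveries:
    """In-memory set; a real service would want a shared store with expiry."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def first_time(self, key: str) -> bool:
        """Return True the first time a key is seen, False on repeats."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


seen = SeenDeliveries()
key = dedupe_key("github", {"x-github-delivery": "abc-123"}, b"{}", "X-GitHub-Delivery")
print(seen.first_time(key), seen.first_time(key))
```

One caveat with the hash fallback: two legitimately distinct deliveries with byte-identical bodies will collapse into one, so it is only safe when payloads carry something unique, such as a timestamp or an object id.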

Max payload sizes are mostly undocumented. GitHub publishes 25 MB. Most vendors do not say.

These are not opinions. They are facts I would rather not have had to learn the hard way.

Why this exists

I am building agents that integrate with many SaaS products. Two things have to be true for an agent to call any webhook tool correctly:

  1. The agent needs the payload schema to know what fields it can rely on.
  2. The agent needs the auth contract to know how to validate inbound deliveries.

If that information is scattered across 30 different docs sites in 30 different shapes, the agent cannot use it. Once it is in a single schema, the agent can.

The same logic applies to anything that has to interoperate with many vendors at once: an integration platform, a security scanner that audits webhook configurations, a docs site that needs to render comparison tables, a research notebook.

Use it however

from datasets import load_dataset

ds = load_dataset("automatelab/saas-webhooks")

stripe_events = ds["train"].filter(lambda r: r["vendor"] == "stripe")
print(stripe_events[0]["payload_schema"])

The dataset is CC-BY-4.0. No email gate, no sign-up. The license requires attribution, so please credit the dataset when you ship something on top of it.

It updates monthly from source. If you find a missing event or a vendor that should be there, the GitHub issue tracker is the right place to drop it.

Links

If you build something on top of it, I would genuinely like to know. The whole point of putting it under CC-BY is that the work compounds.
