When an application or API goes down, customers are usually affected because operations break. Engineers do their job by fixing the underlying problem, but a big question that often comes up in engineering teams is how to tell users exactly what is happening without spamming them or sending duplicate messages. That's where a well-structured incident notification pipeline comes in.
In this article, we'll explore how to build an event pipeline for real-time notifications using Windmill. Monitoring tools like Sentry, New Relic, and even Grafana are good at generating raw incidents, but turning those raw incidents into clear, reliable alerts for users means you need to orchestrate them, clean them up, and make sure they always get delivered.
We will show how to use Windmill to gather incident events from a monitoring tool, apply routing rules, and deliver notifications through various channels. By the end of this piece, you'll know how to orchestrate a production-ready pattern that ensures updates reach users without duplicates or delays.
This is not an alert from Windmill, please. 😂
System Architecture: From Incidents to Alerts
Now that we’ve set the context and had a little laugh with that caption 😅, let’s get straight to the point and look at how this whole system fits together. The goal of this workflow is simple: when something breaks within an application, users should be able to get real-time alerts only once. To make this work, we will integrate a monitoring tool with Windmill for orchestration and notification delivery. Now, let's have a deeper look at the major components of this workflow:
Monitoring Tool: While there are many tools such as Grafana, Sentry, and Prometheus, we will be using Checkly to detect when something goes wrong within an application. These tools can fire a webhook or alert when things break.
Windmill: Windmill sits in the middle of the workflow as the orchestrator. It ingests incident events from the monitoring tool (Checkly, in this context) and normalizes them into a consistent schema so every alert reaches the right destination. Windmill also decides the destination of each type of alert (whether it's critical, a warning, or minor). Lastly, it's where we'll ensure a high level of reliability and deduplication for alerts.
Notification Channels: Once the orchestration with Windmill is done, it can deliver notifications through various channels like Slack, email, SMS, or webhook endpoints to other services.
Feedback Loop: Some alerts might not appear at the expected destination the first time. Windmill allows you to implement retry flows to reattempt delivery or escalate the issue to another channel.
Gathering Events From Checkly in Windmill
The first step of this workflow is collecting incident events from the monitoring tool. Since we're using Checkly as our monitoring tool, we'll integrate Checkly with Windmill for orchestration and notification delivery whenever an error or downtime occurs within an application.
Step 1: Create a Webhook in Windmill
Windmill lets you expose a flow as a webhook, which Checkly can call whenever something breaks. We'll configure the webhook trigger in Windmill first, since Checkly's webhook configuration will need the endpoint URL of our incident pipeline.
- Create a Windmill account and log into your Windmill workspace via Windmill Cloud or the self-hosted versions.
- In your Windmill dashboard, create a new Flow from the Home tab.
- In the flow editor, give the flow a name and add a trigger.
- Select the Webhook Trigger option.
- This automatically generates an endpoint that accepts POST requests.
- Next, inspect the incoming data and add a step to log the payload (so you can see what you want Checkly to send).
- Add a preprocessor module to handle incoming webhook events (click the "+" button after the trigger node), then select TypeScript as the language.
- In the preprocessor editor, you can access the webhook’s payload. In this case, we will replace the template with the following code so the webhook payload is processed into the fields we need:
export async function preprocessor(event: {
  kind: "webhook";
  body: any;
  headers: Record<string, string>;
}) {
  // Extract the fields we care about from Checkly's payload
  const payload = event.body || {};
  return {
    check_name: payload.check_name,
    status: payload.status,
    error: payload.error,
    region: payload.region,
    run_id: payload.run_id,
    timestamp: payload.timestamp || new Date().toISOString()
  };
}
The script above receives the JSON data from Checkly and returns only the fields we need downstream. If you test the trigger with a sample payload, the step should return an object containing check_name, status, error, region, run_id, and timestamp.
- Copy your Windmill webhook endpoint: reopen the Trigger node, and you’ll see a URL like this:
https://app.windmill.dev/api/w/[workspace]/jobs/run_wait_result/[flow_path]
Step 2: Configure Checkly to Send Incidents
- Create a Checkly account and log into Checkly’s dashboard.
- Head over to the Alert Channels tab.
- Click the Add more channels button and select the Webhook option.
- Give the webhook instance a name and paste the Windmill webhook URL you created earlier (be sure to use the POST method).
- Save the webhook instance on Checkly. From now on, whenever one of your monitored checks fails, Checkly will immediately send its JSON payload to your Windmill flow.
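For reference, the fields our preprocessor reads would come from a payload shaped roughly like this. The exact field names Checkly sends depend on how you configure the webhook body template, so treat this as an illustrative shape rather than Checkly's exact schema:

```json
{
  "check_name": "coderoflagos-check",
  "status": "failed",
  "error": "Timeout on /register route",
  "region": "eu-west-1",
  "run_id": "run_8472",
  "timestamp": "2024-05-01T12:00:00Z"
}
```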
Orchestrating Incidents in Windmill
Raw incidents aren’t enough; they’re often noisy and inconsistent, which is why they need to be cleaned up and routed properly. In Windmill, flow steps act as building blocks for routing and for adding context to alerts. Once Checkly pushes an incident payload, Windmill takes over to make sense of it.
A basic orchestration pattern includes the following components:
- Normalization: Convert Checkly’s raw JSON payload into a consistent schema. Each monitoring tool has its own payload format; normalization ensures every incident follows a predictable shape like this:
{
  "service": "coderoflagos-check",
  "severity": "critical",
  "message": "Timeout on /register route",
  "region": "eu-west-1",
  "incident_id": "run_8472"
}
This approach makes it easier for downstream steps, like Slack notifications, to process alerts without extra parsing logic.
- Deduplication: Incidents can be noisy. For example, if the same check fails 10 times within a minute, you don't want users to receive 10 alerts about it. In Windmill, you can store or reference previous incidents in a temporary key-value store or cache and check whether an identical incident already exists before sending another one.
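As a minimal sketch of that deduplication idea, the check can be a small function keyed on the service and error message. Here a module-level dict stands in for whatever shared store you'd actually use in Windmill (script state, Redis, or a resource); the names `should_notify` and `DEDUP_WINDOW_SECONDS` are ours, not part of any Windmill API:

```python
import time

# In Windmill you'd back this with a shared store (script state, Redis, etc.);
# a module-level dict is enough to illustrate the logic.
_seen = {}
DEDUP_WINDOW_SECONDS = 300  # suppress repeats of the same incident for 5 minutes

def should_notify(incident: dict, now=None) -> bool:
    """Return True only if this incident hasn't been alerted on recently."""
    now = time.time() if now is None else now
    # Key on service + message so distinct failures still alert separately
    key = f"{incident.get('service')}:{incident.get('message')}"
    last_sent = _seen.get(key)
    if last_sent is not None and now - last_sent < DEDUP_WINDOW_SECONDS:
        return False  # duplicate within the window: skip it
    _seen[key] = now
    return True
```

A gate step like this, placed before the notification steps, is what keeps 10 identical failures from becoming 10 identical alerts.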
- Routing Rules: Not all incidents are the same; they fall into different classes. You can build conditional logic in your Windmill flow to route alerts based on severity, for example:
- Critical: Send immediately via SMS and urgent channels.
- Warning: Forward via email or Slack only.
- Info: Log silently.
This approach ensures developers and end users get the right level of attention when something occurs.
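The routing rules above can be sketched as a small function that maps severity to delivery channels. The channel names here are illustrative; in a real flow, each would correspond to a branch or notification step you've built:

```python
def route_incident(incident: dict) -> list:
    """Decide which channels an incident should go to, based on severity."""
    severity = incident.get("severity", "info")
    if severity == "critical":
        # Page people immediately on every urgent channel
        return ["sms", "slack", "email"]
    if severity == "warning":
        # Less urgent: asynchronous channels only
        return ["slack", "email"]
    # Everything else is informational: log it, notify nobody
    return []
```

In Windmill, this logic would typically live in a branch node whose conditions inspect the normalized incident's severity field.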
- Retries and Recovery: Delivery sometimes fails; for example, a notification service might return a network timeout. Windmill allows you to implement retry flows to reattempt delivery or escalate the issue to another channel.
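Windmill flow steps have built-in retry settings, but it helps to see the retry-then-escalate logic spelled out. This is a hedged sketch in plain Python: `deliver_with_retries` is our own helper, and `send` stands in for whatever delivery function you wire up:

```python
import time

def deliver_with_retries(send, payload: dict, attempts: int = 3, base_delay: float = 1.0) -> bool:
    """Try a delivery function several times with exponential backoff.

    Returns True on success; False means the caller should escalate
    to a fallback channel (e.g. SMS after Slack keeps failing).
    """
    for attempt in range(attempts):
        try:
            send(payload)
            return True
        except Exception:
            # Back off before the next try: 1s, 2s, 4s, ...
            time.sleep(base_delay * (2 ** attempt))
    return False
```

If this returns False, the flow can branch to an escalation step instead of silently dropping the alert.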
Example normalization step in Windmill
Once your flow receives a payload from Checkly, the next step is to normalize it into your schema and prepare it for notification delivery. Here's how to achieve this:
After your preprocessor, add a new code step that handles normalization:
- Click the "+" button after your preprocessor.
- Select Code option and choose Python as the language.
- Give the step a unique name.
- Then paste the following code into the step:
def main(flow_input: dict) -> dict:
    """
    Normalize Checkly data into a consistent incident schema
    """
    # Normalize the incident data
    normalized = {
        "service": flow_input.get("check_name", "Unknown Service"),
        "severity": "critical" if flow_input.get("status") == "failed" else "info",
        "message": flow_input.get("error", "Check completed"),
        "region": flow_input.get("region", "unknown"),
        "incident_id": flow_input.get("run_id"),
        "timestamp": flow_input.get("timestamp"),
        "source": "checkly"
    }
    print(f"Normalized incident: {normalized}")
    return normalized
This step takes the preprocessed Checkly data and normalizes it into a consistent format that can be used by downstream notification steps.
Delivering Notifications
After normalization, you can add steps to deliver notifications through various channels. Here's how Windmill transforms your Checkly incidents into actionable notifications:
Step 3: Add Notification Delivery Steps
You can add multiple notification steps based on your needs. Each step receives the normalized incident data and delivers it through a specific channel:
Slack Notification Step:
- Add another code step after your normalization step
- Choose Python as the language
- Add slack_webhook_url as a parameter.
- Here’s an example of what the code step should contain:
import requests

def main(incident: dict, slack_webhook_url: str) -> dict:
    """
    Send incident notification to Slack
    """
    severity_emoji = "🚨" if incident["severity"] == "critical" else "⚠️"
    slack_message = {
        "text": f"{severity_emoji} Incident Alert",
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*{incident['service']}* is experiencing issues\n*Error:* {incident['message']}\n*Severity:* {incident['severity']}\n*Region:* {incident['region']}\n*Incident ID:* {incident['incident_id']}"
                }
            }
        ]
    }
    # Post to Slack's incoming webhook; non-200 responses are reported as errors
    response = requests.post(slack_webhook_url, json=slack_message)
    return {
        "status": "success" if response.status_code == 200 else "error",
        "channel": "slack",
        "incident_id": incident["incident_id"],
        "message_sent": f"Notified team about {incident['service']} incident"
    }
This approach ensures that even non-critical incidents still reach your engineering team without flooding the channel with duplicate messages.
Conclusion
Windmill makes it easy to send alerts from Checkly to your team or product users. You can use it to receive alerts, clean up the data, and deliver notifications via Slack, email, or any other destination you prefer. Everything runs automatically, so your team can focus on fixing issues instead of managing notifications manually. That said, there is still room to make this setup work even more efficiently.
To get the best results, try using clear alert formats and test your flow often to make sure messages are sent correctly. You can also explore other Windmill features like custom logic and extra integrations to make your system stronger.
Before you go… 😂
Thanks for reading! If this article helped you, check out Windmill’s docs or join Windmill’s community to learn more. Sharing your feedback or building something cool with Windmill helps everyone in the developer community grow together.