DEV Community

DevOps VN

The Open-Source On-Call Integration

This guide walks through integrating Versus Incident with AWS Incident Manager to provide on-call escalation. With the integration in place, alerts that are not acknowledged within a specified time are escalated automatically to the on-call team.

Prerequisites

Before you begin, ensure you have:

  • An AWS account with access to AWS Incident Manager.
  • Versus Incident deployed (instructions provided later).
  • Prometheus Alert Manager set up to monitor your systems.

Setting Up AWS Incident Manager for On-Call

AWS Incident Manager requires several components to be configured before it can manage on-call workflows. The example below uses six contacts split into two teams and a two-stage response plan. Use the AWS Console to set these up.

Contacts

Contacts are individuals who will be notified during an incident.

  1. In the AWS Console, navigate to Systems Manager > Incident Manager > Contacts.
  2. Click Create contact.
  3. For each contact:
     • Enter a Name (e.g., "Natsu Dragneel").
     • Add Contact methods (e.g., SMS: +1-555-123-4567, Email: natsu@devopsvn.tech).
     • Save the contact.

Repeat to create 6 contacts (e.g., Natsu, Zeref, Igneel, Gray, Gajeel, Laxus).

Escalation Plan

An escalation plan defines the order in which contacts are engaged.

  1. Go to Incident Manager > Escalation plans > Create escalation plan.
  2. Name it (e.g., TeamA_Escalation).
  3. Add contacts (e.g., Natsu, Zeref, and Igneel) and set them to engage simultaneously or sequentially.
  4. Save the plan.
  5. Create a second plan (e.g., TeamB_Escalation) for Gray, Gajeel, and Laxus.

Runbook (Optional)

Runbooks automate incident resolution steps. This guide skips runbook creation, but you can define one in AWS Systems Manager Automation if needed.

Response Plan

A response plan ties contacts and escalation plans into a structured response.

  1. Go to Incident Manager > Response plans > Create response plan.
  2. Name it (e.g., CriticalIncidentResponse).
  3. Define two stages:
     • Stage 1: Engage TeamA_Escalation (Natsu, Zeref, and Igneel) with a 5-minute timeout.
     • Stage 2: If unacknowledged, engage TeamB_Escalation (Gray, Gajeel, and Laxus).
  4. Save the plan and note its ARN (e.g., arn:aws:ssm-incidents::111122223333:response-plan/CriticalIncidentResponse).
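
The two-stage flow above can be modelled as a small timeline. The sketch below is illustrative only: it does not call AWS, and the function name, the team lists, and the assumption that Stage 2 starts exactly when the 5-minute Stage 1 timeout expires are ours, not part of the Incident Manager API.

```python
from typing import List, Optional

TEAM_A = ["Natsu", "Zeref", "Igneel"]   # Stage 1 (TeamA_Escalation)
TEAM_B = ["Gray", "Gajeel", "Laxus"]    # Stage 2 (TeamB_Escalation)
STAGE_1_TIMEOUT_MIN = 5                 # timeout from the response plan above


def engaged_contacts(minutes_since_start: float,
                     acked_at: Optional[float] = None) -> List[str]:
    """Return the contacts paged so far in this simplified two-stage model.

    acked_at is the minute at which someone acknowledged, or None.
    """
    contacts = list(TEAM_A)  # Stage 1 is engaged as soon as the incident starts
    stage_2_due = minutes_since_start >= STAGE_1_TIMEOUT_MIN
    acked_in_time = acked_at is not None and acked_at < STAGE_1_TIMEOUT_MIN
    if stage_2_due and not acked_in_time:
        contacts += TEAM_B   # nobody acknowledged within the timeout
    return contacts


print(engaged_contacts(2))               # only Team A so far
print(engaged_contacts(6))               # Team A, then Team B
print(engaged_contacts(6, acked_at=3))   # acknowledged in time: Team A only
```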

Define IAM Role for Versus

Versus needs permissions to interact with AWS Incident Manager.

  1. In the AWS Console, go to IAM > Roles > Create role.
  2. Choose AWS service as the trusted entity and select EC2 (or your deployment type, e.g., ECS).
  3. Attach a custom policy with these permissions:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm-incidents:StartIncident",
                "ssm-incidents:GetResponsePlan"
            ],
            "Resource": "*"
        }
    ]
}
  4. Name the role (e.g., VersusIncidentRole) and create it.
  5. Note the Role ARN (e.g., arn:aws:iam::111122223333:role/VersusIncidentRole).
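
The policy above allows both actions on every resource. If you would rather scope it to a single response plan, the policy document can be generated as in the sketch below. It assumes that both actions accept a response plan ARN in the Resource field, which you should confirm in the IAM documentation for Incident Manager before relying on it. The helper name build_versus_policy is hypothetical.

```python
import json


def build_versus_policy(response_plan_arn: str) -> dict:
    """Build an IAM policy document limited to one response plan.

    Assumption: both actions support resource-level permissions
    on response plans.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ssm-incidents:StartIncident",
                    "ssm-incidents:GetResponsePlan",
                ],
                "Resource": response_plan_arn,
            }
        ],
    }


policy = build_versus_policy(
    "arn:aws:ssm-incidents::111122223333:response-plan/CriticalIncidentResponse"
)
print(json.dumps(policy, indent=4))
```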

Deploy Versus Incident

Deploy Versus using Docker or Kubernetes.

Docker Deployment

Create a directory for your configuration files:

mkdir -p ./config

Create config/config.yaml with the following content:

name: versus
host: 0.0.0.0
port: 3000
public_host: https://your-ack-host.example

alert:
  debug_body: true

  slack:
    enable: true
    token: ${SLACK_TOKEN}
    channel_id: ${SLACK_CHANNEL_ID}
    template_path: "config/slack_message.tmpl"

oncall:
  enable: true
  wait_minutes: 3

  aws_incident_manager:
    response_plan_arn: ${AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN}

redis: # Required for on-call functionality
  insecure_skip_verify: true # dev only
  host: ${REDIS_HOST}
  port: ${REDIS_PORT}
  password: ${REDIS_PASSWORD}
  db: 0

Create the Slack template config/slack_message.tmpl:

🔥 *{{ .commonLabels.severity | upper }} Alert: {{ .commonLabels.alertname }}*

🌐 *Instance*: `{{ .commonLabels.instance }}`  
🚨 *Status*: `{{ .status }}`

{{ range .alerts }}
📝 {{ .annotations.description }}  
{{ end }}
{{ if .AckURL }}
----------
<{{.AckURL}}|Click here to acknowledge>
{{ end }}

ACK URL Generation

  • When an incident is created (e.g., via a POST to /api/incidents), Versus generates an acknowledgment URL if on-call is enabled.
  • The URL is constructed from the public_host value in config.yaml, typically in the format: https://your-ack-host.example/api/incidents/ack/<incident-id>.
  • This URL is injected into the alert data as a field (e.g., .AckURL) and becomes available for use in templates.
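
As a rough illustration of the format described above, the URL can be assembled like this. The function name and the trailing-slash handling are our own assumptions; Versus may build the URL differently internally.

```python
def build_ack_url(public_host: str, incident_id: str) -> str:
    """Join public_host and the incident ID into the acknowledgment URL.

    The path /api/incidents/ack/<incident-id> is the format described above;
    stripping a trailing slash from public_host is our own assumption.
    """
    return f"{public_host.rstrip('/')}/api/incidents/ack/{incident_id}"


# "1234" is a made-up incident ID used only for this example.
print(build_ack_url("https://your-ack-host.example", "1234"))
```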

Create the docker-compose.yml file:

version: '3.8'

services:
  versus:
    image: ghcr.io/versuscontrol/versus-incident
    ports:
      - "3000:3000"
    environment:
      - SLACK_TOKEN=your_slack_token
      - SLACK_CHANNEL_ID=your_channel_id
      - AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN=arn:aws:ssm-incidents::111122223333:response-plan/CriticalIncidentResponse
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - REDIS_PASSWORD=your_redis_password
    depends_on:
      - redis

  redis:
    image: redis:6.2-alpine
    command: redis-server --requirepass your_redis_password
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

volumes:
  redis_data:

Note: Versus needs AWS credentials to start incidents. Either add the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables to the versus service, or attach the IAM role created earlier to your deployment environment.

Run Docker Compose:

docker-compose up -d

Alert Rules

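Create a prometheus.yml file that defines a scrape target, loads the alerting rules, and points Prometheus at Alert Manager:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'server'
    static_configs:
      - targets: ['localhost:9090']

rule_files:
  - 'alert_rules.yml'

# Without this block Prometheus evaluates the rules but never forwards alerts.
# The address below is an assumption; replace it with your Alert Manager host.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']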

Create alert_rules.yml to define an alert:

groups:
- name: rate
  rules:

  - alert: HighErrorRate
    expr: rate(http_requests_total{status="500"}[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate detected in {{ $labels.service }}"
      description: "{{ $labels.service }} has an error rate above 0.1 for 5 minutes."

  - alert: HighErrorRate
    expr: rate(http_requests_total{status="500"}[5m]) > 0.5
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Very high error rate detected in {{ $labels.service }}"
      description: "{{ $labels.service }} has an error rate above 0.5 for 2 minutes."

  - alert: HighErrorRate
    expr: rate(http_requests_total{status="500"}[5m]) > 0.8
    for: 1m
    labels:
      severity: urgent
    annotations:
      summary: "Extremely high error rate detected in {{ $labels.service }}"
      description: "{{ $labels.service }} has an error rate above 0.8 for 1 minute."
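
Because the three rules share one expression and differ only in threshold, a high enough error rate fires all of them at once, so Alert Manager receives a warning, a critical, and an urgent alert together. The sketch below mirrors the thresholds from alert_rules.yml to make that overlap visible. It is a simplification that ignores the hold durations set with for, and the function name is ours.

```python
# (threshold, severity) pairs copied from alert_rules.yml above
THRESHOLDS = [(0.1, "warning"), (0.5, "critical"), (0.8, "urgent")]


def firing_severities(error_rate: float) -> list:
    """Return every severity whose threshold the rate exceeds.

    Simplification: the hold durations (for: 5m, 2m, 1m) are ignored.
    """
    return [severity for limit, severity in THRESHOLDS if error_rate > limit]


print(firing_severities(0.05))  # []
print(firing_severities(0.3))   # ['warning']
print(firing_severities(0.9))   # ['warning', 'critical', 'urgent']
```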

Alert Manager Routing Configuration

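Configure Alert Manager to route alerts to Versus. Versus reads its on-call options from the query string of the webhook URL, so each receiver below selects a different behavior.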

Send Alert Only (No On-Call)

receivers:
- name: 'versus-no-oncall'
  webhook_configs:
  - url: 'http://versus-service:3000/api/incidents?oncall_enable=false'
    send_resolved: false

route:
  receiver: 'versus-no-oncall'
  group_by: ['alertname', 'service']
  routes:
  - match:
      severity: warning
    receiver: 'versus-no-oncall'

Send Alert with Acknowledgment Wait

receivers:
- name: 'versus-with-ack'
  webhook_configs:
  - url: 'http://versus-service:3000/api/incidents?oncall_wait_minutes=5'
    send_resolved: false

route:
  routes:
  - match:
      severity: critical
    receiver: 'versus-with-ack'

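With this receiver, Versus waits 5 minutes for someone to click the acknowledgment link it posts to Slack. If nobody does, it triggers the AWS Incident Manager response plan. The route shown here is a fragment; a complete Alert Manager configuration also needs a default receiver on the top-level route, as in the first example.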

Send Alert with Immediate On-Call Trigger

receivers:
- name: 'versus-immediate'
  webhook_configs:
  - url: 'http://versus-service:3000/api/incidents?oncall_wait_minutes=0'
    send_resolved: false

route:
  routes:
  - match:
      severity: urgent
    receiver: 'versus-immediate'

This triggers the response plan immediately without waiting.
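
The three receivers differ only in the query string of the webhook URL. The sketch below shows one way those parameters could be interpreted. It is a model of the behavior described in this guide, not the actual Versus code: the function name, the returned tuple, and the assumption that the query string overrides the wait_minutes value from config.yaml are ours.

```python
from urllib.parse import urlparse, parse_qs

CONFIG_WAIT_MINUTES = 3  # oncall.wait_minutes from config.yaml above


def oncall_behaviour(webhook_url: str):
    """Return (oncall_enabled, wait_minutes) for a webhook URL."""
    params = parse_qs(urlparse(webhook_url).query)
    enabled = params.get("oncall_enable", ["true"])[0].lower() != "false"
    if not enabled:
        return False, None  # alert only, no escalation
    # Assumption: the query parameter takes precedence over the config default.
    wait = int(params.get("oncall_wait_minutes", [CONFIG_WAIT_MINUTES])[0])
    return True, wait


base = "http://versus-service:3000/api/incidents"
print(oncall_behaviour(base + "?oncall_enable=false"))     # (False, None)
print(oncall_behaviour(base + "?oncall_wait_minutes=5"))   # (True, 5)
print(oncall_behaviour(base + "?oncall_wait_minutes=0"))   # (True, 0)
print(oncall_behaviour(base))                              # (True, 3)
```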

Testing the Integration

  1. Trigger an Alert: Simulate a critical alert in Prometheus to match the Alert Manager rule.
  2. Verify Versus: Check that Versus receives the alert and sends it to configured channels (e.g., Slack).
  3. Check Escalation:
     • Wait 5 minutes without acknowledging the alert.
     • In Incident Manager > Incidents, verify that an incident starts and Team A is engaged.
     • After 5 more minutes, confirm Team B is engaged.
  4. Immediate Trigger Test: Send an urgent alert and confirm the response plan triggers instantly.
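
One way to simulate an alert without waiting for Prometheus is to post a hand-built payload to Versus. The sketch below builds a payload in the shape of an Alert Manager webhook message, with the fields the Slack template reads (commonLabels, status, and the description annotation of each alert). The host versus-service:3000 comes from the routing examples; the label values and helper names are made up for illustration, and the send function is defined but not called here.

```python
import json
import urllib.request


def build_test_alert(severity: str = "critical") -> dict:
    """Build a minimal Alert Manager style webhook payload (sample values)."""
    labels = {
        "alertname": "HighErrorRate",
        "severity": severity,
        "instance": "localhost:9090",
        "service": "demo-service",  # made-up value
    }
    annotations = {
        "description": "demo-service has an error rate above 0.5 for 2 minutes."
    }
    return {
        "version": "4",
        "status": "firing",
        "receiver": "versus-with-ack",
        "groupLabels": {"alertname": "HighErrorRate"},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "alerts": [
            {"status": "firing", "labels": labels, "annotations": annotations}
        ],
    }


def send_test_alert(url: str, payload: dict) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


payload = build_test_alert()
print(json.dumps(payload, indent=2))
# To send it (requires a running Versus instance):
# send_test_alert(
#     "http://versus-service:3000/api/incidents?oncall_wait_minutes=5", payload
# )
```

Passing severity="urgent" to build_test_alert gives a payload for the immediate-trigger test; point it at the URL with oncall_wait_minutes=0.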

Conclusion

You’ve now integrated Versus Incident with AWS Incident Manager for on-call management! Alerts from Prometheus Alert Manager can trigger notifications via Versus, with escalations handled by AWS Incident Manager based on your response plan. Adjust configurations as needed for your environment.

If you encounter any issues or have further questions, feel free to reach out!
