<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: IncidentHub</title>
    <description>The latest articles on DEV Community by IncidentHub (@incidenthub).</description>
    <link>https://dev.to/incidenthub</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8820%2Fc5942688-b4f6-41e9-a945-de7ef68f4906.png</url>
      <title>DEV Community: IncidentHub</title>
      <link>https://dev.to/incidenthub</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/incidenthub"/>
    <language>en</language>
    <item>
      <title>Top 6 Reasons Why You Need a Status Page Aggregator</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Sun, 06 Apr 2025 17:09:10 +0000</pubDate>
      <link>https://dev.to/incidenthub/top-6-reasons-why-you-need-a-status-page-aggregator-5a2m</link>
      <guid>https://dev.to/incidenthub/top-6-reasons-why-you-need-a-status-page-aggregator-5a2m</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Your business depends on the reliability of the third-party services you use. Monitoring the status pages of these services is the best way of keeping track of their outages and maintenances. Although some status pages let you subscribe to alerts, there is no standard way of doing this. Service providers can change their status page providers, disable subscriptions, or not support the same notification options.&lt;/p&gt;

&lt;p&gt;A status page aggregator is a tool that solves all these problems by aggregating the status pages of multiple services in one place. &lt;br&gt;
If you depend on only 2-3 third-party services, you can probably get away without a status page aggregator. Beyond that, it becomes harder to stay on top of third-party service outages and maintenances, leaving gaps in your monitoring.&lt;/p&gt;

&lt;p&gt;Let's look at the top 6 reasons why you need a status page aggregator.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
Top 6 Reasons Why You Need a Status Page Aggregator

&lt;ul&gt;
&lt;li&gt;Services Can Change Status Page Providers&lt;/li&gt;
&lt;li&gt;Not All Status Pages Let You Subscribe to Specific Components and Regions&lt;/li&gt;
&lt;li&gt;There Can Be Too Many Status Pages To Track&lt;/li&gt;
&lt;li&gt;Status Page URLs Can Change&lt;/li&gt;
&lt;li&gt;Some Status Pages Don't Have Any Way of Subscribing to Outages&lt;/li&gt;
&lt;li&gt;Home-Grown Status Page Monitoring Tools Are Hard To Maintain&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Top 6 Reasons Why You Need a Status Page Aggregator
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Services Can Change Status Page Providers
&lt;/h3&gt;

&lt;p&gt;Businesses use a &lt;a href="https://blog.incidenthub.cloud/Best-Practices-Choosing-Status-Page-Provider" rel="noopener noreferrer"&gt;status page provider&lt;/a&gt; to create a managed status page that they can use to communicate with their customers and users. Depending on business needs, provider reliability, integration options, and more, businesses can change their status page provider. The status page URL usually remains the same, but the page format and subscription options change.&lt;/p&gt;

&lt;p&gt;A recent example of such a move is OpenAI's status page. In Jan 2025, OpenAI was using Atlassian Statuspage. You can check it at the &lt;a href="https://web.archive.org/web/20250101055627/https://status.openai.com/" rel="noopener noreferrer"&gt;Wayback Machine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxoo60ngxrpch3jjm4gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxoo60ngxrpch3jjm4gb.png" alt="OpenAI's previous status page" width="800" height="811"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://status.openai.com/" rel="noopener noreferrer"&gt;current OpenAI status page&lt;/a&gt; as of this writing is managed by Incident.io. The URL remains the same.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbookbn5j59a9u5no9jc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbookbn5j59a9u5no9jc6.png" alt="OpenAI's current status page" width="700" height="959"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The subscription options have changed. If you were previously subscribed using webhooks, that option is no longer available. What's more, you would not even know that this happened. Once you set up the webhook subscription, you would not visit the status page except to check for details of outages and maintenances. If the subscription were removed, you would be blissfully unaware of any future outages. That is, until the outages start affecting your applications, and by extension, your business. You can end up with angry customers, lost revenue, and stressed SRE/Ops teams.&lt;/p&gt;

&lt;p&gt;IncidentHub - a status page aggregator - automatically detects such changes. Using an aggregator shifts the responsibility of outage notifications to the aggregator, which can smooth over any differences in the status page providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Not All Status Pages Let You Subscribe to Specific Components and Regions
&lt;/h3&gt;

&lt;p&gt;Your third-party cloud and SaaS dependencies are likely globally distributed, with many regions of operation. Your applications use only a subset of these services. Why receive alerts for everything?&lt;/p&gt;

&lt;p&gt;Some status pages, like &lt;a href="https://stabilityai.instatus.com/" rel="noopener noreferrer"&gt;Stability.ai's&lt;/a&gt;, let you subscribe to specific components and regions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsl5zvd8qay26bwlg71aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsl5zvd8qay26bwlg71aw.png" alt="Subscribe to specific components on the status page itself" width="800" height="760"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Others, like &lt;a href="https://status.litellm.ai/" rel="noopener noreferrer"&gt;LiteLLM's status page&lt;/a&gt;, have an RSS feed only. If you connect the feed to your Slack channel using the &lt;a href="https://slack.com/intl/en-in/help/articles/218688467-Add-RSS-feeds-to-Slack" rel="noopener noreferrer"&gt;&lt;code&gt;/feed&lt;/code&gt;&lt;/a&gt; command, you will get notified of each and every outage in LiteLLM. There is no way to subscribe to a specific LiteLLM service from its status page.&lt;/p&gt;
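&lt;p&gt;For reference, Slack's built-in command takes the feed URL directly. A minimal sketch (the URL is a placeholder, not LiteLLM's actual feed path):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/feed subscribe https://status.example.com/history.rss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;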

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnivxjuy943e1mh5wecxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnivxjuy943e1mh5wecxt.png" alt="LiteLLM's status page" width="800" height="845"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A status page aggregator like IncidentHub lets you monitor &lt;a href="https://blog.incidenthub.cloud/Monitoring-Specific-Components-and-Regions-in-Your-Third-Party-Services" rel="noopener noreferrer"&gt;specific components and regions&lt;/a&gt; as long as the information is on the status page. This is true even when the originating status page does not offer component-specific subscriptions.&lt;/p&gt;

&lt;h3&gt;
  
  
  There Can Be Too Many Status Pages To Track
&lt;/h3&gt;

&lt;p&gt;According to the &lt;a href="https://www.bettercloud.com/resources/state-of-saas/" rel="noopener noreferrer"&gt;State of SaaSOps Report 2024&lt;/a&gt;, organizations use an average of 112 SaaS tools. Even for smaller organizations and startups, most operations are outsourced to SaaS and Cloud vendors. 100+ tools means 100+ chances of unnoticed disruptions.&lt;/p&gt;

&lt;p&gt;Monitoring all these services manually by tracking their status pages is not only tedious, it simply does not scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Status Page URLs Can Change
&lt;/h3&gt;

&lt;p&gt;For various reasons, a third-party vendor can change its status page URL.&lt;/p&gt;

&lt;p&gt;Cloudflare acquired Area 1 Security, which previously had its own &lt;a href="https://web.archive.org/web/20250108114141/https://status.area1security.com/" rel="noopener noreferrer"&gt;status page&lt;/a&gt;.&lt;br&gt;
A few months ago, they removed that status page, and Area 1's products are now part of the &lt;a href="https://www.cloudflarestatus.com/" rel="noopener noreferrer"&gt;Cloudflare status page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzp4ap88s3m9ub2sj879.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzp4ap88s3m9ub2sj879.png" alt="Cloudflare" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you were previously monitoring Area 1's status page directly using just RSS feeds or email notifications, you might not have known about this change, leaving you exposed to undetected outages.&lt;/p&gt;

&lt;p&gt;Another example is Railway's status page which moved from &lt;a href="https://web.archive.org/web/20240728002854/https://railway.app/" rel="noopener noreferrer"&gt;&lt;code&gt;status.railway.app&lt;/code&gt;&lt;/a&gt; to &lt;a href="https://status.railway.com/" rel="noopener noreferrer"&gt;&lt;code&gt;status.railway.com&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;IncidentHub detects such changes and auto-adjusts its monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Some Status Pages Don't Have Any Way of Subscribing to Outages
&lt;/h3&gt;

&lt;p&gt;Most status pages have at least an RSS or Atom feed. However, some status pages don't have any visible means of subscribing to outages.&lt;br&gt;
You need to keep refreshing the status page. This is just not feasible if you have a lot of dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Home-Grown Status Page Monitoring Tools Are Hard To Maintain
&lt;/h3&gt;

&lt;p&gt;Some engineering and IT teams choose to &lt;a href="https://blog.incidenthub.cloud/Monitoring-Third-Party-Vendors-As-An-Ops-Engineer-SRE" rel="noopener noreferrer"&gt;build their own tooling&lt;/a&gt; to get around the above problems. After all, why pay for a status page aggregator when you can build your own? Any self-respecting Ops Engineer/SRE would probably want to whip up a script and write this tool themselves (a sketch of such a script follows the list below). However, such a homegrown solution requires a lot of upfront development and ongoing maintenance effort. The technical challenges themselves are significant. In addition, there are other costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any software you write needs maintenance. E.g. when your organization starts using a new service that cannot be monitored using your existing tooling, you need to add support for it.&lt;/li&gt;
&lt;li&gt;Somebody has to ensure reliability and uptime of the homegrown solution.&lt;/li&gt;
&lt;li&gt;It becomes an additional burden on your already overburdened SRE/Ops teams.&lt;/li&gt;
&lt;/ul&gt;
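&lt;p&gt;To make the maintenance burden concrete, here is a minimal sketch of the kind of homegrown poller teams typically start with. Everything in it is illustrative: the feed URL, the state file path, and the Slack webhook are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/sh
# Naive status feed poller: fetch the RSS feed, alert on any change.
# Run from cron every few minutes - and note you need one of these per service.
FEED_URL="https://status.example.com/history.rss"  # placeholder
STATE="/var/tmp/statusfeed-example.xml"

# Fails silently if the URL moves or the provider changes - one of the gaps above
curl -sf "$FEED_URL" -o "$STATE.new" || exit 1

if ! cmp -s "$STATE.new" "$STATE"; then
  # Placeholder Slack incoming-webhook URL
  curl -s -X POST -H 'Content-Type: application/json' \
    --data '{"text":"Status feed changed: example.com"}' \
    "https://hooks.slack.com/services/T000/B000/XXXX"
fi
mv "$STATE.new" "$STATE"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Even this toy version hints at the costs above: it breaks quietly when the feed URL changes, it alerts on every feed edit rather than only real incidents, and it needs a copy, and an owner, per service.&lt;/p&gt;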

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The challenges in monitoring status pages yourself or using home-grown solutions are real. A status page aggregator like IncidentHub solves these problems by providing a reliable and scalable solution.&lt;br&gt;
IncidentHub continuously adapts to status page quirks, URL changes, and more, where more basic tools falter.&lt;/p&gt;

&lt;p&gt;Try out the free (forever) tier of &lt;a href="https://incidenthub.cloud/#pricing" rel="noopener noreferrer"&gt;IncidentHub&lt;/a&gt; to never miss an outage again.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;IncidentHub is not affiliated with any of the services and vendors mentioned in this article.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This article was originally published on the &lt;a href="https://blog.incidenthub.cloud/top-six-reasons-why-you-need-a-status-page-aggregator" rel="noopener noreferrer"&gt;IncidentHub blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>statuspage</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>sitereliabilityengineering</category>
    </item>
    <item>
      <title>How to Configure a Remote Data Store for Prometheus</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Sat, 21 Dec 2024 17:03:03 +0000</pubDate>
      <link>https://dev.to/incidenthub/how-to-configure-a-remote-data-store-for-prometheus-2g27</link>
      <guid>https://dev.to/incidenthub/how-to-configure-a-remote-data-store-for-prometheus-2g27</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The Prometheus &lt;a href="https://dev.to/tags/monitoring"&gt;monitoring&lt;/a&gt; tool can store its metrics either locally or remotely. You can configure a remote data store using the &lt;code&gt;remote_write&lt;/code&gt; configuration. This article describes the various data store options available as well as how to set up a remote store.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of Remote Storage
&lt;/h2&gt;

&lt;p&gt;By default, Prometheus stores data locally wherever it is installed. The data directory can be configured by using the &lt;code&gt;--storage.tsdb.path&lt;/code&gt; command line option when starting Prometheus. &lt;br&gt;
In practice, you can attach a separate disk to the machine where Prometheus is running for higher performance.&lt;/p&gt;

&lt;p&gt;However, this may not be possible or optimal in all situations: you might want a data store that is better suited to time series data and has more storage capacity for longer retention. Prometheus usually runs in a standalone VM, a Kubernetes pod, or a Docker container, and would not have access to such data stores by default.&lt;/p&gt;

&lt;p&gt;A remote store can add these capabilities to Prometheus. The remote storage option can be set by using the &lt;code&gt;remote_write&lt;/code&gt; key in the Prometheus configuration YAML file. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Overview of Remote Storage&lt;/li&gt;
&lt;li&gt;Remote Store Architecture&lt;/li&gt;
&lt;li&gt;
Remote Store Configuration

&lt;ul&gt;
&lt;li&gt;Basic Syntax&lt;/li&gt;
&lt;li&gt;Security and Authentication&lt;/li&gt;
&lt;li&gt;Remote Write Protocol Configuration&lt;/li&gt;
&lt;li&gt;Network Configuration&lt;/li&gt;
&lt;li&gt;Metrics Configuration&lt;/li&gt;
&lt;li&gt;Queue Configuration&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Remote Storage Options&lt;/li&gt;
&lt;li&gt;
Troubleshooting

&lt;ul&gt;
&lt;li&gt;Prometheus failing to write to the remote storage&lt;/li&gt;
&lt;li&gt;Network connectivity between Prometheus and the remote store&lt;/li&gt;
&lt;li&gt;If there is a proxy in between, it might be dropping packets or might not be running&lt;/li&gt;
&lt;li&gt;Requests are timing out due to network issues&lt;/li&gt;
&lt;li&gt;Requests are timing out due to the remote store not being able to keep up&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Best Practices&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;li&gt;FAQ&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Remote Store Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4lrlg5y9ej5ag6l7mvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4lrlg5y9ej5ag6l7mvc.png" alt="Prometheus remote write architecture" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Remote Store Configuration
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Basic Syntax
&lt;/h3&gt;

&lt;p&gt;A very simple configuration for a remote store that accepts unauthenticated connections would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.23.4/api/v1/write"&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production-metrics"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can have multiple entries under &lt;code&gt;remote_write&lt;/code&gt; in the same Prometheus configuration.&lt;/p&gt;
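&lt;p&gt;For example, a minimal sketch with two destinations (the second URL is a hypothetical long-term store):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;remote_write:
- url: "http://192.168.23.4/api/v1/write"
  name: "production-metrics"
- url: "https://longterm-store.example.com/api/v1/write"
  name: "long-term-archive"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;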

&lt;p&gt;Based on your requirements and the features supported by the remote write server you can configure other options. Let us look at them one by one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security and Authentication
&lt;/h3&gt;

&lt;p&gt;To protect your metrics data in transit, whether it travels over your internal network or through the internet, you can enable both TLS and authentication. The remote store server must&lt;br&gt;
support these options.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Remote write configuration for Prometheus&lt;/span&gt;
&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://prometheus-data-store.mydb.io/api/v1/write"&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production-metrics"&lt;/span&gt;

  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;token&amp;gt;"&lt;/span&gt;

  &lt;span class="na"&gt;basic_auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prometheus"&lt;/span&gt;
    &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secret-password"&lt;/span&gt;

  &lt;span class="na"&gt;tls_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;insecure_skip_verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;ca_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/ca.pem"&lt;/span&gt;
    &lt;span class="na"&gt;cert_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/cert.pem"&lt;/span&gt;
    &lt;span class="na"&gt;key_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/key.pem"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sample configuration does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds a &lt;code&gt;Bearer&lt;/code&gt; token for authentication, as well as basic auth options. In practice you would use only one of these.&lt;/li&gt;
&lt;li&gt;Adds a &lt;code&gt;tls_config&lt;/code&gt; assuming you have a custom CA which has issued the certificates for the remote store's server. If it's a certificate issued by a well-known CA, you would not have to configure this. This option would come in handy when you have a private CA.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also create a separate &lt;code&gt;authorization&lt;/code&gt; section for more options while setting the &lt;code&gt;Authorization&lt;/code&gt; header. Note that the options below are mutually exclusive - the example is only for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example 1: Default Bearer type with direct credentials&lt;/span&gt;
&lt;span class="na"&gt;authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bearer&lt;/span&gt;
  &lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eyJhbGciOiJIPoI1NiIsInR5cCI6IkpXVCJ9..."&lt;/span&gt;

&lt;span class="c1"&gt;# Example 2: Bearer type with credentials from file. This is mutually exclusive with credentials_file&lt;/span&gt;
&lt;span class="na"&gt;authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bearer&lt;/span&gt;
  &lt;span class="na"&gt;credentials_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/etc/prometheus/token.txt"&lt;/span&gt;

&lt;span class="c1"&gt;# Example 3: Custom type with direct credentials&lt;/span&gt;
&lt;span class="na"&gt;authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CustomAuth&lt;/span&gt;
  &lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secret-token-123"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Remote Write Protocol Configuration
&lt;/h3&gt;

&lt;p&gt;As of this writing, the &lt;a href="https://prometheus.io/docs/specs/remote_write_spec/" rel="noopener noreferrer"&gt;remote write&lt;/a&gt; specification is undergoing &lt;a href="https://prometheus.io/docs/specs/remote_write_spec_2_0/" rel="noopener noreferrer"&gt;a change&lt;/a&gt;. &lt;br&gt;
You probably don't have to worry about this section unless you are optimizing for very specific cases. You can configure the &lt;code&gt;protobuf_message&lt;/code&gt; object that Prometheus uses when sending metrics. &lt;br&gt;
This depends on what your remote store server supports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.23.4/api/v1/write"&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production-metrics"&lt;/span&gt;

  &lt;span class="na"&gt;protobuf_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus.WriteRequest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Network Configuration
&lt;/h3&gt;

&lt;p&gt;Based on the properties of your remote store server, you can tune some functional settings.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;remote_timeout&lt;/code&gt; key sets the timeout for requests to the remote write endpoint. The default value is 30s. You would not need to set this unless you have a noisy network, or there are shorter timeouts in the network path between your Prometheus server and the remote store server.&lt;/p&gt;

&lt;p&gt;If your remote store is behind a proxy server, you can configure the proxy details in the YAML.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.23.4/api/v1/write"&lt;/span&gt;
  &lt;span class="na"&gt;remote_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;45s&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production-metrics"&lt;/span&gt;

  &lt;span class="c1"&gt;# Proxy configuration&lt;/span&gt;
  &lt;span class="na"&gt;proxy_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://proxy.internal:4200"&lt;/span&gt;
  &lt;span class="na"&gt;proxy_connect_header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Proxy-Authorization"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;xxxxxxxxxxxxxxxxxxxx"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Custom-Proxy-Header"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app1"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app2"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;proxy_from_environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

  &lt;span class="na"&gt;follow_redirects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;enable_http2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Metrics Configuration
&lt;/h3&gt;

&lt;p&gt;You can use the &lt;code&gt;write_relabel_configs&lt;/code&gt; key to modify or drop specific metrics before they are written to the remote store. The &lt;a href="https://dev.to/incidenthub/a-beginners-guide-to-service-discovery-in-prometheus-3366#target-relabeling-and-filtering"&gt;relabel syntax&lt;/a&gt; is identical to that used in the &lt;code&gt;scrape_configs&lt;/code&gt; section. You might want to do this if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have multiple remote stores and want specific metrics to go to specific stores to avoid unnecessary storage costs.&lt;/li&gt;
&lt;li&gt;You have one remote store but don't want certain metrics written there; those remain only in Prometheus's local storage.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;write_relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__name__&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test_metric.*'&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;drop&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;staging'&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;drop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Queue Configuration
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;queue_config&lt;/code&gt; section has settings to fine-tune the queue that is used to write to remote storage. Prometheus creates an internal queue for each remote write server. As it collects metrics, Prometheus maintains a &lt;a href="https://en.wikipedia.org/wiki/Write-ahead_logging" rel="noopener noreferrer"&gt;write-ahead log&lt;/a&gt; (WAL) that it can replay if there's a crash. Each remote destination queue picks up metrics data from the WAL and sends it to the remote store server. Each queue can also have multiple shards, which control the amount of parallelism for each queue.&lt;/p&gt;

&lt;p&gt;You will have to tune the queue settings only if you have a very high volume of data and/or are facing issues with the remote store struggling to keep up with your Prometheus server.&lt;/p&gt;
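&lt;p&gt;For orientation, here is a sketch of the commonly tuned &lt;code&gt;queue_config&lt;/code&gt; keys, shown with their default values (defaults as documented for recent Prometheus releases; verify against your version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;remote_write:
- url: "http://192.168.23.4/api/v1/write"
  queue_config:
    capacity: 10000             # samples buffered per shard
    min_shards: 1               # minimum number of parallel write shards
    max_shards: 50              # upper bound on parallelism
    max_samples_per_send: 2000  # batch size per request
    batch_send_deadline: 5s     # send a partial batch after waiting this long
    min_backoff: 30ms           # retry backoff on failed sends
    max_backoff: 5s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;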

&lt;p&gt;You can check out these great writeups on tuning the queue settings for &lt;code&gt;remote_write&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://grafana.com/blog/2021/04/12/how-to-troubleshoot-remote-write-issues-in-prometheus/" rel="noopener noreferrer"&gt;https://grafana.com/blog/2021/04/12/how-to-troubleshoot-remote-write-issues-in-prometheus/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://last9.io/blog/how-to-scale-prometheus-remote-write/" rel="noopener noreferrer"&gt;https://last9.io/blog/how-to-scale-prometheus-remote-write/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Remote Storage Options
&lt;/h2&gt;

&lt;p&gt;A non-exhaustive list of software that supports the Prometheus remote write protocol includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thanos&lt;/li&gt;
&lt;li&gt;VictoriaMetrics&lt;/li&gt;
&lt;li&gt;Splunk&lt;/li&gt;
&lt;li&gt;OpenTSDB&lt;/li&gt;
&lt;li&gt;Kafka&lt;/li&gt;
&lt;li&gt;InfluxDB&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prometheus failing to write to the remote storage
&lt;/h3&gt;

&lt;p&gt;This can be caused by a number of issues:&lt;/p&gt;

&lt;h4&gt;
  
  
  Network connectivity between Prometheus and the remote store
&lt;/h4&gt;

&lt;p&gt;Check if you can reach the remote store using ping or curl.&lt;/p&gt;
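&lt;p&gt;A quick first check from the Prometheus host, using the endpoint from the earlier examples (adjust to yours):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# ICMP reachability - may be blocked even when HTTP works
ping -c 3 192.168.23.4

# HTTP reachability of the write endpoint; an empty POST should at least
# return an HTTP error response (often a 400) rather than a connection
# error or a timeout
curl -v -X POST http://192.168.23.4/api/v1/write
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;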

&lt;h4&gt;
  
  
  If there is a proxy in between, it might be dropping packets or might not be running
&lt;/h4&gt;

&lt;p&gt;Check if the proxy is running. Verify that the proxy configuration as well as the Prometheus &lt;code&gt;remote_write&lt;/code&gt; proxy settings are correct. Check the proxy server's logs for any errors. The proxy might be blocking large packets.&lt;/p&gt;

&lt;h4&gt;
  
  
  Requests are timing out due to network issues
&lt;/h4&gt;

&lt;p&gt;Run a traceroute from your Prometheus server to the remote store to see if packets are being dropped.&lt;/p&gt;
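&lt;p&gt;For example (hostname illustrative; &lt;code&gt;mtr&lt;/code&gt;, if installed, gives a continuously updated view of per-hop packet loss):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;traceroute prometheus-data-store.mydb.io
# or
mtr --report prometheus-data-store.mydb.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;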

&lt;h4&gt;
  
  
  Requests are timing out due to the remote store not being able to keep up
&lt;/h4&gt;

&lt;p&gt;Tune the queue configuration. If this happens suddenly, it's important to find out the root cause.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of metrics might have increased due to autoscaling events or an increase in cardinality.&lt;/li&gt;
&lt;li&gt;The remote store might have disk issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Back up your data in the remote store.&lt;/li&gt;
&lt;li&gt;Add security and authentication between your Prometheus and the remote store server. If your remote store does not support this natively, you can add a proxy like nginx in between and configure it to have
TLS and authentication.&lt;/li&gt;
&lt;li&gt;Monitor your remote store metrics for indications of trouble (see the sample metrics after this list).&lt;/li&gt;
&lt;li&gt;If you are in a regulated industry, ensure that your remote store is compliant with your requirements. E.g. if it's managed by a cloud vendor, ascertain that their security credentials are sufficient for your needs.&lt;/li&gt;
&lt;/ul&gt;
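&lt;p&gt;On the Prometheus side, a few of its built-in remote write metrics are worth alerting on. A non-authoritative sample (metric names can vary across versions; verify against your Prometheus):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Samples waiting in the queues - sustained growth means the store is falling behind
prometheus_remote_storage_samples_pending

# Samples dropped after exhausting retries - should stay at zero
prometheus_remote_storage_samples_failed_total

# Current shard count - pegged at max_shards suggests a tuning or capacity problem
prometheus_remote_storage_shards
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;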

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The remote store functionality in Prometheus offers a scalable and flexible way of adding a dedicated storage backend for Prometheus metrics. You can use the remote store for longer data retention,&lt;br&gt;
better durability, and offline data analysis.&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Deploying Prometheus With Docker</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Tue, 10 Dec 2024 16:28:11 +0000</pubDate>
      <link>https://dev.to/incidenthub/deploying-prometheus-with-docker-b5a</link>
      <guid>https://dev.to/incidenthub/deploying-prometheus-with-docker-b5a</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;There are several ways to deploy the Prometheus monitoring tool in your environment. One of the fastest ways to get started is to deploy it as a Docker container. This guide shows you how to quickly set up a minimal Prometheus on your laptop. You can then extend that setup to add a monitoring dashboard, alerting, and authentication.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
Deploying Prometheus in a Docker Container

&lt;ul&gt;
&lt;li&gt;Basic Setup&lt;/li&gt;
&lt;li&gt;Separating the Configuration&lt;/li&gt;
&lt;li&gt;Making the Data Storage Persistent&lt;/li&gt;
&lt;li&gt;Further Configuration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;li&gt;References&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Deploying Prometheus in a Docker Container
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Basic Setup
&lt;/h3&gt;

&lt;p&gt;Running Prometheus in Docker is very simple with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 9000:9090 prometheus prom/prometheus:v3.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will pull and run the latest version (as of this writing) of Prometheus. You can access the Prometheus UI at localhost:9000. Note that the container port 9090 is forwarded to the localhost port 9000. &lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Useful Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Docker commands, mappings from a resource on your local machine to one in the container follow the order &lt;code&gt;local-resource&lt;/code&gt;:&lt;code&gt;container-resource&lt;/code&gt;. In the command above it's &lt;code&gt;local-port&lt;/code&gt;:&lt;code&gt;container-port&lt;/code&gt;. You can see a similar example in the volume setup below.&lt;/p&gt;




&lt;p&gt;Note that we will be stopping and starting the container many times during this tutorial. Since each &lt;code&gt;docker run&lt;/code&gt; starts a fresh container, any storage and configuration inside the old one is gone. To get around this, we will move the following out of the container onto our local machine, i.e., our laptop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics storage location&lt;/li&gt;
&lt;li&gt;Configuration file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Create a directory called &lt;code&gt;prometheus&lt;/code&gt; with a &lt;code&gt;config&lt;/code&gt; directory inside it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;prometheus
&lt;span class="nb"&gt;cd
mkdir &lt;/span&gt;config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Separating the Configuration
&lt;/h3&gt;

&lt;p&gt;Now create a file called &lt;code&gt;prometheus.yml&lt;/code&gt; inside the &lt;code&gt;config&lt;/code&gt; directory and put this content inside it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;
  &lt;span class="na"&gt;evaluation_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;

&lt;span class="na"&gt;alerting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;alertmanagers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="c1"&gt;# - alertmanager:9093&lt;/span&gt;

&lt;span class="na"&gt;rule_files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# - "alert_rules.yml"&lt;/span&gt;

&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prometheus"&lt;/span&gt;

    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:9090"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;prometheus.yml&lt;/code&gt; file is also available online in the &lt;a href="https://github.com/prometheus/prometheus/blob/main/documentation/examples/prometheus.yml" rel="noopener noreferrer"&gt;Prometheus repo&lt;/a&gt;. It's a bare-bones configuration that scrapes the Prometheus process itself for metrics and nothing more.&lt;/p&gt;

&lt;p&gt;Be careful about YAML formatting. You can use an online tool like &lt;a href="https://www.yamllint.com/" rel="noopener noreferrer"&gt;YAML Lint&lt;/a&gt; to format your YAML file.&lt;/p&gt;

&lt;p&gt;Your directory will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;code/prometheus  &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; tree
&lt;span class="nb"&gt;.&lt;/span&gt;
└── config
    └── prometheus.yml

1 directory, 1 file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Making the Data Storage Persistent
&lt;/h3&gt;

&lt;p&gt;To make the metrics data persistent across container restarts, we will create a Docker volume:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker volume create prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let us run our container again so that it uses the two artifacts that we just created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 9000:9090 &lt;span class="nt"&gt;-v&lt;/span&gt; /home/talonx/code/prometheus/config:/etc/prometheus &lt;span class="nt"&gt;-v&lt;/span&gt; /home/talonx/code/prometheus/data:/prometheus  prom/prometheus:v3.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the config dir is mounted as &lt;code&gt;/etc/prometheus&lt;/code&gt; and the &lt;code&gt;prometheus&lt;/code&gt; volume as &lt;code&gt;/prometheus&lt;/code&gt; inside the running container. Prometheus assumes &lt;code&gt;/etc/prometheus/prometheus.yml&lt;/code&gt; as the default config file location and &lt;code&gt;/prometheus&lt;/code&gt; as the default data directory, so we don't have to do any further configuration here. Note that you have to provide the full path of the config directory on your machine for the bind mount.&lt;/p&gt;

&lt;p&gt;You can verify that this config is working by visiting the UI at &lt;a href="http://localhost:9000" rel="noopener noreferrer"&gt;http://localhost:9000&lt;/a&gt;. To verify that the data volume is working, do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Let the container run for 5 minutes.&lt;/li&gt;
&lt;li&gt;Stop the container.&lt;/li&gt;
&lt;li&gt;Start the container again.&lt;/li&gt;
&lt;li&gt;Visit the UI at &lt;a href="http://localhost:9000/query" rel="noopener noreferrer"&gt;http://localhost:9000/query&lt;/a&gt; and search for a metric, say, &lt;code&gt;process_cpu_seconds_total&lt;/code&gt;. Click &lt;code&gt;Execute&lt;/code&gt; and then select the &lt;code&gt;Graph&lt;/code&gt; tab. If the Docker volume is mounted correctly, you should be able to see metrics going back 5 minutes and more. (The stop/start commands are sketched just after this list.)&lt;/li&gt;
&lt;/ul&gt;
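&lt;p&gt;For reference, a minimal stop/start cycle looks like this (the container ID comes from &lt;code&gt;docker ps&lt;/code&gt;; yours will differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# find the running container's ID or name
docker ps

# stop it, then start the same container again
docker stop &amp;lt;container-id&amp;gt;
docker start &amp;lt;container-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;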

&lt;p&gt;This completes our basic setup of a Prometheus container.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Configuration
&lt;/h3&gt;

&lt;p&gt;You can make further changes to the configuration by editing the &lt;code&gt;config/prometheus.yml&lt;/code&gt; file and restarting your Prometheus container. I recommend committing this file into your source code repository.&lt;/p&gt;
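&lt;p&gt;Before restarting, it is worth validating the edited file. Here is a sketch using the &lt;code&gt;promtool&lt;/code&gt; binary that ships in the same image (the paths assume the directory layout used above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# run promtool from the Prometheus image against the mounted config
docker run --rm -v /home/talonx/code/prometheus/config:/etc/prometheus \
  --entrypoint promtool prom/prometheus:v3.0.0 check config /etc/prometheus/prometheus.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;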

&lt;p&gt;You can run the container in the background by using the &lt;code&gt;-d&lt;/code&gt; flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 9000:9090 &lt;span class="nt"&gt;-v&lt;/span&gt; /home/talonx/code/prometheus/config:/etc/prometheus &lt;span class="nt"&gt;-v&lt;/span&gt; prometheus:/prometheus  prom/prometheus:v3.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Prometheus is an easy-to-set-up metrics collection and monitoring tool. You can try it out using a Docker container. Using a container allows rapid iteration when changing and testing your configuration. In other articles in this series, we will explore how to add authentication, external dashboards, and integrate Prometheus with other alerting systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/prometheus/3.0/getting_started/" rel="noopener noreferrer"&gt;Prometheus 3.0.0 documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/reference/cli/docker/container/" rel="noopener noreferrer"&gt;Docker Container Management commands&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/reference/cli/docker/volume/" rel="noopener noreferrer"&gt;Docker Volume commands&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.yamllint.com/" rel="noopener noreferrer"&gt;YAML Validator&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Social share photo credits: &lt;a href="https://unsplash.com/@theshubhamdhage?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Shubham Dhage&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-group-of-cubes-that-are-connected-to-each-other-R2HtYWs5-QA?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>prometheus</category>
      <category>docker</category>
    </item>
    <item>
      <title>A Beginner's Guide To Service Discovery in Prometheus</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Thu, 05 Dec 2024 04:17:15 +0000</pubDate>
      <link>https://dev.to/incidenthub/a-beginners-guide-to-service-discovery-in-prometheus-3366</link>
      <guid>https://dev.to/incidenthub/a-beginners-guide-to-service-discovery-in-prometheus-3366</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Service discovery (SD) is a mechanism by which the Prometheus tool can discover monitorable targets automatically. Instead of listing every target to be scraped in the Prometheus configuration, service discovery acts as a source of targets that Prometheus can query at runtime.&lt;/p&gt;

&lt;p&gt;Service discovery becomes crucial when there are dynamically changing hosts, especially in microservices architectures and environments like Kubernetes. In Prometheus parlance, service discovery is a way of discovering "scrape targets". &lt;/p&gt;

&lt;p&gt;For example, pods are created dynamically in Kubernetes as a result of new services being deployed and undeployed, autoscaling events, and errors causing pods to crash and go away. If you are using Prometheus to scrape pods in such an environment, Prometheus has to know which pods are running and scrapable at any given point in time. The Kubernetes service discovery plugin enables this. Similarly, there are SD plugins for other common environments.&lt;/p&gt;

&lt;p&gt;You can use service discovery in Prometheus with the predefined plugins, or write your own custom mechanism using file-based or HTTP-based discovery, depending on the situation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
Types of Prometheus Service Discovery

&lt;ul&gt;
&lt;li&gt;Predefined Mechanisms in Prometheus&lt;/li&gt;
&lt;li&gt;Custom Service Discovery in Prometheus, or Writing Your Own&lt;/li&gt;
&lt;li&gt;HTTP based service discovery&lt;/li&gt;
&lt;li&gt;File based service discovery&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Configuring Service Discovery in Prometheus

&lt;ul&gt;
&lt;li&gt;Basic Syntax&lt;/li&gt;
&lt;li&gt;Target Relabeling and Filtering&lt;/li&gt;
&lt;li&gt;Verifying Your Configuration&lt;/li&gt;
&lt;li&gt;Handling Secrets&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Combining Multiple Service Discovery Mechanisms&lt;/li&gt;

&lt;li&gt;

Troubleshooting Service Discovery

&lt;ul&gt;
&lt;li&gt;Prometheus failing to scrape some or all targets&lt;/li&gt;
&lt;li&gt;Target list is not refreshed, or Prometheus is not scraping new targets, or Prometheus is attempting to scrape dead targets&lt;/li&gt;
&lt;li&gt;Wrong labels are showing up in metrics, or not showing up at all&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Types of Prometheus Service Discovery
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Predefined Mechanisms in Prometheus
&lt;/h3&gt;

&lt;p&gt;Prometheus has out of the box support for discovering scrape targets for many popular environments, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Web Services (EC2 instances)&lt;/li&gt;
&lt;li&gt;Azure (Azure VMs)&lt;/li&gt;
&lt;li&gt;Consul&lt;/li&gt;
&lt;li&gt;Digital Ocean&lt;/li&gt;
&lt;li&gt;DNS&lt;/li&gt;
&lt;li&gt;Google Cloud Platform (Google Compute Engine VMs)&lt;/li&gt;
&lt;li&gt;Hetzner&lt;/li&gt;
&lt;li&gt;Kubernetes&lt;/li&gt;
&lt;li&gt;Linode&lt;/li&gt;
&lt;li&gt;OpenStack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This list is not exhaustive. For the full list, see the &lt;a href="https://github.com/prometheus/prometheus/tree/main/discovery" rel="noopener noreferrer"&gt;Prometheus GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Custom Service Discovery in Prometheus, or Writing Your Own
&lt;/h3&gt;

&lt;p&gt;You may have infrastructure or application endpoints that cannot be discovered by the standard mechanisms. In such cases you can write your own. There are two options available.&lt;/p&gt;

&lt;h4&gt;
  
  
  HTTP based service discovery
&lt;/h4&gt;

&lt;p&gt;You can write an HTTP-based mechanism and return the scrape target information in response to Prometheus' GET requests. Prometheus will perform a GET request periodically - by default every minute. This periodic request is made so that Prometheus has the latest list of targets. You can see this as a configurable parameter in the standard SD configurations of AWS and others, and you can also include it in your SD configuration as &lt;code&gt;refresh_interval&lt;/code&gt;. Note that this interval is different from the &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration-file" rel="noopener noreferrer"&gt;scrape_interval&lt;/a&gt;, which is used by Prometheus to scrape the targets themselves.&lt;/p&gt;

&lt;p&gt;There are a few basic requirements for HTTP service discovery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Response should be in JSON with the correct HTTP &lt;code&gt;Content-Type&lt;/code&gt; header.&lt;/li&gt;
&lt;li&gt;The content must be in &lt;code&gt;UTF-8&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If authentication is required, it can be Basic auth, an Authorization header, or OAuth 2.0. You would typically not need authentication if the endpoint is on your internal network or part of your own applications.&lt;/li&gt;
&lt;li&gt;If there are no scrape targets, the endpoint should return an empty list.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A sample configuration for an HTTP service discovery mechanism can look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;http_sd_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://192.168.2.34/api/internal/hosts'&lt;/span&gt;
  &lt;span class="na"&gt;refresh_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
  &lt;span class="na"&gt;http_headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Purpose"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prometheus-scraper"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Internally, your HTTP endpoint would query a database or inventory to fetch the list of targets and return them.&lt;/p&gt;
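&lt;p&gt;The response body is a JSON list of target groups. A minimal sketch (addresses and labels illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;[
  {
    "targets": ["10.0.1.12:9100", "10.0.1.13:9100"],
    "labels": {
      "__meta_datacenter": "dc1",
      "env": "production"
    }
  }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;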

&lt;h4&gt;
  
  
  File based service discovery
&lt;/h4&gt;

&lt;p&gt;File-based service discovery is another alternative if you need to provide a custom list of scrape targets. To do this, you can create a file and list your scrape targets in it.&lt;br&gt;
It is important to note that this is also a dynamic mechanism, like HTTP service discovery. Prometheus will check for changes to the file at periodic intervals. This interval is configured with the &lt;code&gt;refresh_interval&lt;/code&gt; key, just as with the other mechanisms. The default is 5 minutes.&lt;/p&gt;

&lt;p&gt;Requirements for file-based service discovery (a sample targets file follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Files can be in JSON or YAML.&lt;/li&gt;
&lt;li&gt;You can specify a pattern to match multiple files. This is helpful if you wish to keep your scrape targets grouped logically across separate files.&lt;/li&gt;
&lt;li&gt;Malformed JSON or YAML files are ignored, so ensure that they conform to the &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#file_sd_config" rel="noopener noreferrer"&gt;required format&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;
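&lt;p&gt;Putting those requirements together, a targets file in YAML form might look like this (path and labels illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# /etc/prometheus/external/targets/web.yml (hypothetical path)
- targets:
  - "10.0.1.12:9100"
  - "10.0.1.13:9100"
  labels:
    env: "production"
    team: "web"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;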

&lt;p&gt;In the Prometheus configuration, you can specify it as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;file_sd_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/etc/prometheus/external/targets/*.yml"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/opt/monitoring/targets/prod-*.yml"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/data/dynamic-targets-[0-9]*.yaml"&lt;/span&gt;

  &lt;span class="na"&gt;refresh_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuring Service Discovery in Prometheus
&lt;/h2&gt;

&lt;p&gt;Like everything else, service discovery configurations go into the configuration file, which is &lt;code&gt;prometheus.yml&lt;/code&gt; by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Syntax
&lt;/h3&gt;

&lt;p&gt;For predefined SD mechanisms, the YAML key is &lt;code&gt;x_sd_configs&lt;/code&gt; (a list), where x is the environment name; it goes under a scrape job in &lt;code&gt;scrape_configs&lt;/code&gt;. You can find the complete list in the &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration-file" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;br&gt;
Each mechanism has a set of common keys like &lt;code&gt;refresh_interval&lt;/code&gt;, and then keys specific to the environment.&lt;/p&gt;

&lt;p&gt;Here is an example AWS config which generates a dynamic list of node exporter scrape targets for EC2 VMs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AWS Region Configuration&lt;/span&gt;
&lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2"&lt;/span&gt;
&lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://ec2.us-west-2.amazonaws.com"&lt;/span&gt;

&lt;span class="c1"&gt;# AWS Authentication (using role ARN in this example)&lt;/span&gt;
&lt;span class="na"&gt;role_arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789012:role/PrometheusServiceDiscovery"&lt;/span&gt;

&lt;span class="na"&gt;refresh_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;300s&lt;/span&gt;
&lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9100&lt;/span&gt;  &lt;span class="c1"&gt;# Default port for node_exporter&lt;/span&gt;

&lt;span class="c1"&gt;# EC2 Instance Filters&lt;/span&gt;
&lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag:Environment"&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instance-state-name"&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag:Service"&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vpc-id"&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vpc-0abc123def456789"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;follow_redirects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;enable_http2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the HTTP- and file-based mechanisms, the syntax is similar and much simpler. Refer to the sections above for samples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Target Relabeling and Filtering
&lt;/h3&gt;

&lt;p&gt;Target relabeling is a technique applied to the labels of a target (machine, pod, endpoint, etc.) before it is scraped. Labels are key-value pairs attached to a metric that let us categorize the metric. Note that target relabeling can be used for static scrape configurations too, not just SD-based ones.&lt;/p&gt;

&lt;p&gt;E.g. &lt;code&gt;code&lt;/code&gt; is the label in the following metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;promhttp_metric_handler_requests_total&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since target relabeling is applied before scraping happens, we can use it to filter out targets we don't care about, and also to modify labels.&lt;/p&gt;

&lt;p&gt;An example use case of modifying labels in AWS is to scrape the public IP address of the instance instead of the private one. By default, the private IP address is used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ec2-instances'&lt;/span&gt;
    &lt;span class="na"&gt;ec2_sd_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-west-2&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9100&lt;/span&gt;
        &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instance-state-name"&lt;/span&gt;
            &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

    &lt;span class="na"&gt;relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Drop targets without public IP addresses&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_ec2_public_ip&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;drop&lt;/span&gt;

      &lt;span class="c1"&gt;# Use public IP instead of private IP&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_ec2_public_ip&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;__address__&lt;/span&gt;
        &lt;span class="na"&gt;replacement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${1}:9100'&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;replace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lists running instances only.&lt;/li&gt;
&lt;li&gt;Drops instances without a public IP.&lt;/li&gt;
&lt;li&gt;Sets the &lt;code&gt;__address__&lt;/code&gt; label on the target to point to the public IP and the node exporter port (9100).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;__address__&lt;/code&gt; is a special label used by Prometheus to determine the final address and port to scrape for a target.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;__meta_&lt;/code&gt; prefix indicates special labels provided by the SD plugin. It's a way of bringing metadata from your cloud provider (or other environment) into your metric labels.&lt;/p&gt;

&lt;p&gt;Here is another example for Google Cloud illustrating the second point about &lt;code&gt;__meta_&lt;/code&gt; labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;
    &lt;span class="s"&gt;honor_labels&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="s"&gt;gce_sd_configs&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-platform-a&lt;/span&gt;
        &lt;span class="na"&gt;zone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-eastl1-a&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9100&lt;/span&gt;
  &lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="na"&gt;relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_gce_label_cloud_provider&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloud_provider&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_gce_label_cloud_zone&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloud_zone&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_gce_label_cloud_app&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloud_app&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_gce_label_team&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_gce_instance_name&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;instance&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verifying Your Configuration
&lt;/h3&gt;

&lt;p&gt;Run your configuration using a YAML linter first. If there are no errors, run Prometheus with the configuration and check for the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are you seeing metrics from the intended targets?&lt;/li&gt;
&lt;li&gt;Do the metrics have the correct labels?&lt;/li&gt;
&lt;li&gt;When you add or remove a target (pod, host, etc), does it reflect in your metrics?&lt;/li&gt;
&lt;/ul&gt;
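
&lt;p&gt;Beyond a plain YAML lint, Prometheus ships with the promtool utility, which validates the semantics of the whole configuration, not just the syntax. A minimal check, assuming the common config location, looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Validates scrape configs, SD configs, and referenced rule files
promtool check config /etc/prometheus/prometheus.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;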

&lt;h3&gt;
  
  
  Handling Secrets
&lt;/h3&gt;

&lt;p&gt;In the above AWS example, we could have used AWS API keys instead of a role ARN. However, your configuration file should be stored in a source code repository, and you obviously don't want to commit the keys along with it. There are different options for handling this, depending on your deployment infrastructure.&lt;/p&gt;

&lt;p&gt;E.g. If you are using Kubernetes, you can use Helm with the &lt;a href="https://github.com/jkroepke/helm-secrets" rel="noopener noreferrer"&gt;helm-secrets&lt;/a&gt; plugin to deploy Prometheus. Helm will seamlessly decrypt the secrets and place them in the final rendered version of your Prometheus deployment.&lt;/p&gt;
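
&lt;p&gt;As a rough illustration, a deployment command could look like the following. This is a minimal sketch: the SOPS-encrypted secrets.yaml and the prometheus-community chart are assumptions, not part of any particular setup.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# secrets.yaml is SOPS-encrypted; the helm-secrets plugin decrypts it
# transparently before rendering the chart
helm secrets upgrade --install prometheus prometheus-community/prometheus \
  -f values.yaml -f secrets.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;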

&lt;h2&gt;
  
  
  Combining Multiple Service Discovery Mechanisms
&lt;/h2&gt;

&lt;p&gt;You can add as many SD configurations as you want to a single Prometheus configuration. Some example setups could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple cloud vendors: Watch out for cross-cloud access in such cases, where you have to deal with both encryption of in-transit data and authentication. A better option here is to run one Prometheus in each cloud account or environment.&lt;/li&gt;
&lt;li&gt;Multiple regions or zones with the same cloud vendor: Here too, you might find yourself dealing with data transfer costs between regions. A full discussion of this topic is beyond the scope of this article.&lt;/li&gt;
&lt;li&gt;Hybrid environments, such as your on-premises VMs alongside your cloud vendor's instances (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Cloud-native deployments, such as Kubernetes alongside virtual machines from the same cloud vendor.&lt;/li&gt;
&lt;/ul&gt;
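
&lt;p&gt;As a sketch of the hybrid case, a single configuration can run one job per environment. The job names and the file path below are illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape_configs:
  # Cloud VMs discovered via the EC2 plugin
  - job_name: 'ec2-nodes'
    ec2_sd_configs:
      - region: us-west-2
        port: 9100

  # On-premises VMs listed in files maintained by your inventory tooling
  - job_name: 'onprem-nodes'
    file_sd_configs:
      - files:
          - "/etc/prometheus/onprem-targets/*.yml"
        refresh_interval: 120s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;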

&lt;h2&gt;
  
  
  Troubleshooting Service Discovery
&lt;/h2&gt;

&lt;p&gt;The first sign that your SD configuration is not working - either partially or at all - is missing metrics. Let's look at a few common issues and how to troubleshoot them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus failing to scrape some or all targets
&lt;/h3&gt;

&lt;p&gt;Check for any error messages on the targets page of your Prometheus dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://prom-ip:prom-port/targets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Failed targets will be marked "Down" in red. The error message should give you an idea of why a target could not be scraped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut6q02yi5jrqyti68ytg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut6q02yi5jrqyti68ytg.png" alt="Prometheus scrape error" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Target list is not refreshed, or Prometheus is not scraping new targets, or Prometheus is attempting to scrape dead targets
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If it's a custom SD mechanism like HTTP or file, check whether the endpoint is able to fetch data from your database or inventory systems (see the check after this list). Prometheus can only scrape what your SD endpoint provides.&lt;/li&gt;
&lt;li&gt;If it's an inbuilt SD mechanism like AWS or GCP, check whether your cloud credentials are correct and whether the refresh_interval is reasonable.&lt;/li&gt;
&lt;li&gt;Check that your filters are correct and not dropping valid targets.&lt;/li&gt;
&lt;/ul&gt;
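
&lt;p&gt;For a custom HTTP SD endpoint, a quick sanity check is to fetch it directly and confirm it returns the JSON target list you expect. The endpoint URL here is an illustrative assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Should return a JSON array of {"targets": [...], "labels": {...}} objects
curl -s http://sd.internal:8080/targets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;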

&lt;h3&gt;
  
  
  Wrong labels are showing up in metrics, or not showing up at all
&lt;/h3&gt;

&lt;p&gt;This is usually a problem with the &lt;code&gt;relabel_configs&lt;/code&gt; section. If you have multiple relabeling rules, remove all of them except the first one. If that works, add them back one by one until you hit the problematic rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Service discovery in Prometheus is a powerful way of discovering scrape targets in dynamic environments. It gives you the flexibility to use in-built plugins for common cloud providers and environments, or to write your own custom plugin for your systems.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>observability</category>
      <category>devops</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>The Ultimate List of Incident Management Tools in 2024</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Sun, 27 Oct 2024 11:13:04 +0000</pubDate>
      <link>https://dev.to/incidenthub/the-ultimate-list-of-incident-management-tools-in-2024-4m16</link>
      <guid>https://dev.to/incidenthub/the-ultimate-list-of-incident-management-tools-in-2024-4m16</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Incident management tools are important for organizations to handle service outages effectively. With so many tools around, each with a different feature set, it's often difficult to find the one that is right for your needs. In this article, we list incident management software available in 2024, along with their features, to help you arrive at the right one.&lt;/p&gt;

&lt;p&gt;We have focused mostly on tools that offer incident management capabilities - which include at least incident lifecycle management, on-call scheduling, and third-party integrations. &lt;/p&gt;

&lt;p&gt;There are many good tools which are focused only on incident response, or on monitoring and generating alerts, or on the ticketing aspect of incidents. We have not included those to avoid cluttering this article. &lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Using an Incident Management Tool
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An incident management tool streamlines the incident management process by helping to define and automate workflows. It can help you create runbooks, alerting and escalation policies, and define and manage on-call schedules.&lt;/li&gt;
&lt;li&gt;Incident management software often comes with integrations for your observability stack, which is a key source of incidents. It can also integrate with your existing &lt;a href="https://blog.incidenthub.cloud/The-Rising-Role-of-Slack-in-Incident-Management" rel="noopener noreferrer"&gt;communication&lt;/a&gt; and collaboration tools to provide real-time updates.&lt;/li&gt;
&lt;li&gt;Some incident management tools add context to your incident analysis by pulling in data from your infrastructure, applications, and observability systems. 
This can help in narrowing down the root cause.&lt;/li&gt;
&lt;li&gt;Incident management tools can provide analytics which can be used to gain insights into patterns and performance to create a culture of continuous improvement.&lt;/li&gt;
&lt;li&gt;An incident management tool can also provide audit trails and standardized documentation for compliance requirements.&lt;/li&gt;
&lt;li&gt;Some tools have public and private &lt;a href="https://blog.incidenthub.cloud/Best-Practices-Choosing-Status-Page-Provider" rel="noopener noreferrer"&gt;status pages&lt;/a&gt; so that stakeholders can get more visibility into the process.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  List of Incident Management Tools in 2024
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;a href="https://www.pagerduty.com" rel="noopener noreferrer"&gt;PagerDuty&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alerting over multiple channels including phone, app, email&lt;/li&gt;
&lt;li&gt;On-call management - scheduling, roster management, overrides&lt;/li&gt;
&lt;li&gt;Rule definitions for alert routing&lt;/li&gt;
&lt;li&gt;Integrations with most common tools&lt;/li&gt;
&lt;li&gt;APIs for incident lifecycle management&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;li&gt;Support for teams with role-based permissions&lt;/li&gt;
&lt;li&gt;Integration with ITSM tools&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;li&gt;Single sign-on&lt;/li&gt;
&lt;li&gt;Maintenance mode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PagerDuty is best for large enterprises requiring comprehensive incident management, although it can be used by smaller teams too.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;a href="https://www.atlassian.com/software/opsgenie/features" rel="noopener noreferrer"&gt;Opsgenie&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alerting over multiple channels including phone, app, email&lt;/li&gt;
&lt;li&gt;On-call scheduling, management, overrides, and escalation policies&lt;/li&gt;
&lt;li&gt;Ability to add contextual information to alerts&lt;/li&gt;
&lt;li&gt;Custom actions for alerts like executing a script&lt;/li&gt;
&lt;li&gt;Automatic actions like running playbooks&lt;/li&gt;
&lt;li&gt;Third-party integrations&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;li&gt;Single sign-on&lt;/li&gt;
&lt;li&gt;Maintenance mode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Opsgenie is suited for ops teams that need sophisticated alerting.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;a href="https://www.splunk.com/en_us/products/on-call.html" rel="noopener noreferrer"&gt;Splunk On-Call&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call schedules and overrides&lt;/li&gt;
&lt;li&gt;Role-based permissions&lt;/li&gt;
&lt;li&gt;Rules engine for triggering custom actions&lt;/li&gt;
&lt;li&gt;Incident waiting rooms to reduce alert fatigue&lt;/li&gt;
&lt;li&gt;Maintenance mode&lt;/li&gt;
&lt;li&gt;Notifications via email, phone, SMS, and app push&lt;/li&gt;
&lt;li&gt;Third-party integrations with many common tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Splunk On-Call, formerly VictorOps, is best suited for teams already using Splunk for monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;a href="https://grafana.com/products/cloud/oncall/" rel="noopener noreferrer"&gt;Grafana OnCall&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open source and also has a managed solution&lt;/li&gt;
&lt;li&gt;Alert grouping&lt;/li&gt;
&lt;li&gt;Escalation policies&lt;/li&gt;
&lt;li&gt;Alert routing&lt;/li&gt;
&lt;li&gt;Calendar-based on-call schedule and roster&lt;/li&gt;
&lt;li&gt;Maintenance mode&lt;/li&gt;
&lt;li&gt;Integrations with common third-party tools&lt;/li&gt;
&lt;li&gt;Role based access control&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grafana OnCall works seamlessly with other Grafana Cloud products, so it is best suited for teams already using Grafana for monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;a href="https://www.servicenow.com/products/incident-management.html" rel="noopener noreferrer"&gt;ServiceNow&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling with overrides&lt;/li&gt;
&lt;li&gt;Supports multiple notification channels&lt;/li&gt;
&lt;li&gt;Automated ticket routing&lt;/li&gt;
&lt;li&gt;SLA tracking&lt;/li&gt;
&lt;li&gt;Compliance and governance features&lt;/li&gt;
&lt;li&gt;Integrations with many third-party tools&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's best suited for organizations using ServiceNow products like ITSM.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. &lt;a href="https://www.ilert.com/product/on-call-management-escalations" rel="noopener noreferrer"&gt;iLert&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call schedules and escalation policies&lt;/li&gt;
&lt;li&gt;Notifications using SMS, push, voice call&lt;/li&gt;
&lt;li&gt;Maintenance support&lt;/li&gt;
&lt;li&gt;Critical phone call routing using customizable multi-language IVR&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;li&gt;Integrations with MS Teams and Slack for chatops-based incident management&lt;/li&gt;
&lt;li&gt;Integrates with most common tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;iLert is best suited for mid-sized Ops teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. &lt;a href="https://incident.io/" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling and escalations, with overrides&lt;/li&gt;
&lt;li&gt;Notifications with app push, phone, email, Slack, MS Teams&lt;/li&gt;
&lt;li&gt;Incident lifecycle management from within Slack&lt;/li&gt;
&lt;li&gt;Private incidents support&lt;/li&gt;
&lt;li&gt;API for integration and data access&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;li&gt;Third-party integrations&lt;/li&gt;
&lt;li&gt;Integrates with CRM systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;incident.io focuses on being an incident management platform with a Slack-first approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. &lt;a href="https://firehydrant.com/" rel="noopener noreferrer"&gt;FireHydrant&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call management&lt;/li&gt;
&lt;li&gt;Notifications on app push, Slack, Whatsapp&lt;/li&gt;
&lt;li&gt;Runbooks&lt;/li&gt;
&lt;li&gt;Service catalog&lt;/li&gt;
&lt;li&gt;Incident retrospectives&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;li&gt;Integrates with most common tools&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FireHydrant with its strong incident workflows and retrospectives is best suited for SRE teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. &lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling, escalation policies, and overrides&lt;/li&gt;
&lt;li&gt;Integrations with common tools&lt;/li&gt;
&lt;li&gt;Live call routing to connect to on-call folks directly&lt;/li&gt;
&lt;li&gt;Alert classification and routing rules&lt;/li&gt;
&lt;li&gt;Auto-pause flapping alerts&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;li&gt;Manage incidents directly from Slack&lt;/li&gt;
&lt;li&gt;Runbooks&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Squadcast is meant for modern SRE and Ops teams with its alert routing, post-mortem support, and chatops features.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. &lt;a href="https://betterstack.com/" rel="noopener noreferrer"&gt;Better Stack&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling and escalation policies&lt;/li&gt;
&lt;li&gt;Incident grouping&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;li&gt;Integrations with common tools&lt;/li&gt;
&lt;li&gt;Single-sign on&lt;/li&gt;
&lt;li&gt;Teams support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Better Stack is a suite of products that also includes monitoring and logging, but we felt it belongs in this list because of its integrated on-call features.&lt;/p&gt;

&lt;h3&gt;
  
  
  11. &lt;a href="https://rootly.com/on-call" rel="noopener noreferrer"&gt;Rootly&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling, escalation policies, and overrides&lt;/li&gt;
&lt;li&gt;Alert grouping based on time-window and on content&lt;/li&gt;
&lt;li&gt;Integrates with many third-party tools&lt;/li&gt;
&lt;li&gt;Playbooks&lt;/li&gt;
&lt;li&gt;Support for managing the incident lifecycle directly from Slack&lt;/li&gt;
&lt;li&gt;Retrospectives with automatic data capture and sync with Jira&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rootly specializes in automating incident workflows with strong integration capabilities and customizable playbooks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing an incident management tool involves looking at many aspects including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Features - Instead of looking at the number of features, list down the ones you actually need for your team and evaluate based on that.&lt;/li&gt;
&lt;li&gt;Cost - Incident Management is a key part of your business operations, so you also need to forecast future costs if your team or infrastructure is growing.&lt;/li&gt;
&lt;li&gt;Customer support - Your incident management systems' reliability needs to be top-notch. However, incidents happen, even in incident management software, so make sure they have great customer support.&lt;/li&gt;
&lt;li&gt;Integration capabilities - Your team might be using &lt;a href="https://blog.incidenthub.cloud/The-Benefits-of-a-Single-Incident-Management-System" rel="noopener noreferrer"&gt;multiple observability tools&lt;/a&gt;, either third-party or custom or both. Any incident management tool should be able to integrate well with your existing stack as well as with your communication and collaboration tools.&lt;/li&gt;
&lt;li&gt;Reports - Metrics and analytics are invaluable for figuring out trends in your outages and where to focus for improvement.&lt;/li&gt;
&lt;li&gt;Flexibility in scheduling - Easy roster setup and overrides are a must.&lt;/li&gt;
&lt;li&gt;Alignment with your regulatory requirements, if any.&lt;/li&gt;
&lt;li&gt;Documentation/knowledge base integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose the tool that is right for you and your team - which may not necessarily be the one that everybody else is using because it's the "best".&lt;/p&gt;

&lt;p&gt;Photo credits: &lt;a href="https://unsplash.com/photos/a-control-room-with-a-desk-and-two-chairs-p7Bfwn_VKRQ?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Miha Meglic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published on the &lt;a href="https://blog.incidenthub.cloud/The-Ultimate-List-of-Incident-Management-Tools-in-2024" rel="noopener noreferrer"&gt;IncidentHub blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sitereliabilityengineering</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>incidentmanagement</category>
    </item>
    <item>
      <title>Best Practices for Choosing a Status Page Provider</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Tue, 15 Oct 2024 03:04:07 +0000</pubDate>
      <link>https://dev.to/incidenthub/best-practices-for-choosing-a-status-page-provider-3p4c</link>
      <guid>https://dev.to/incidenthub/best-practices-for-choosing-a-status-page-provider-3p4c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Downtime is inevitable but what sets successful businesses apart is how they handle it. A key part of incident management is incident communication with both internal and external stakeholders. A status page is a crucial tool for maintaining clear communication with users during outages or service interruptions. There are numerous status page providers available with different features. This article will guide you through best practices for selecting a provider that suits your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Importance of a Status Page
&lt;/h2&gt;

&lt;p&gt;An internal status page allows your colleagues and stakeholders in your organization to get a &lt;a href="https://incident.io/blog/internal-status-pages" rel="noopener noreferrer"&gt;snapshot of the current status&lt;/a&gt;. It can help reduce unnecessary back and forth between teams, and help people prioritize their work better. It also creates internal transparency and trust between teams.&lt;/p&gt;

&lt;p&gt;An external status page is crucial if you are committed to open communication with your end users or customers. Whether you are B2B or B2C, a public status page is the first thing people will check if they face issues. Being open about incidents and your efforts to mitigate them builds user trust. It can also decrease support ticket volume during incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Factors to Consider When Choosing a Status Page Provider
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Reliability
&lt;/h3&gt;

&lt;p&gt;Your status page needs to be accessible especially when your main services are down. Your provider should offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A reasonable uptime SLA&lt;/li&gt;
&lt;li&gt;Globally distributed infrastructure for high availability&lt;/li&gt;
&lt;li&gt;Redundant systems to ensure failover and availability&lt;/li&gt;
&lt;li&gt;Scalability to handle increased traffic during major incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Customization Options
&lt;/h3&gt;

&lt;p&gt;Prioritize providers that offer customization options.&lt;/p&gt;

&lt;h4&gt;
  
  
  Functional customization
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Support for components - This is important if your product/platform has many services and is served from many independent locations. Each such service/location should be a component in the status page so that you can publish incident updates only against the affected components.&lt;/li&gt;
&lt;li&gt;Support for different types of events - At least maintenance events, informational events, and incidents should be supported.&lt;/li&gt;
&lt;li&gt;Localization options - If you have customers distributed across the globe, you will want to serve locale-specific pages in different languages.&lt;/li&gt;
&lt;li&gt;Ability to update older entries - As new information flows in during an incident, you might want to update previously published information like the title or the affected components for completeness.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Branding
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Your status page should reflect your brand. Look for a provider that allows you to customize your status page with your brand's logo and color scheme.&lt;/li&gt;
&lt;li&gt;Custom domain support - Instead of serving the status page from the provider's domain you should be able to host it on your own domain - e.g. status.mydomain.com&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Integration Capabilities
&lt;/h3&gt;

&lt;p&gt;Efficient incident management requires easy tool integration. At the very least you should look for&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.statuspal.io/blog/why-use-a-status-page-api-and-best-alternatives" rel="noopener noreferrer"&gt;API access&lt;/a&gt; for automating the incident management updates that you will publish&lt;/li&gt;
&lt;li&gt;Integration with your &lt;a href="https://blog.incidenthub.cloud/The-Benefits-of-a-Single-Incident-Management-System" rel="noopener noreferrer"&gt;monitoring and alerting&lt;/a&gt; tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the consumer end, i.e. for the people who will view your status page, it's good to have integration capabilities like webhooks, REST APIs, Slack, text messages, etc., so that they can plug updates into the systems they want.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Reporting and Analytics
&lt;/h3&gt;

&lt;p&gt;Data-driven insights can help improve your incident response and post-mortem sessions. Choose a provider which offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed incident history with configurable retention. The entire history need not be displayed on the page, but it should be available to your internal teams for analysis.&lt;/li&gt;
&lt;li&gt;Metrics and trends - Metrics can help you pinpoint services that need extra attention from your teams.&lt;/li&gt;
&lt;li&gt;Customizable reports for stakeholders. This is mostly useful for internal stakeholders in your organization.&lt;/li&gt;
&lt;li&gt;Page traffic - Some providers offer analytics to help you understand how often users check your status page and what they're viewing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. User Management and Permissions
&lt;/h3&gt;

&lt;p&gt;For larger organizations, granular access control is important. Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Role-based access control (RBAC).&lt;/li&gt;
&lt;li&gt;Multi-user support.&lt;/li&gt;
&lt;li&gt;Audit logs for accountability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Mobile Support
&lt;/h3&gt;

&lt;p&gt;In our mobile-first world, ensure your provider offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Responsive design for all devices.&lt;/li&gt;
&lt;li&gt;SMS and email notification options.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7. Customer Support
&lt;/h3&gt;

&lt;p&gt;When issues arise with the status page, prompt support is essential. Choose providers that have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear SLA - Review the provider's SLA to ensure they meet your uptime and response time expectations.&lt;/li&gt;
&lt;li&gt;24/7 customer support.&lt;/li&gt;
&lt;li&gt;Multiple support channels (chat, email, phone).&lt;/li&gt;
&lt;li&gt;Comprehensive documentation and notifications about updates to the status page format or APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Best Practices for Implementing Your Status Page
&lt;/h2&gt;

&lt;p&gt;Once you've chosen a provider, follow these best practices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Timely updates: Keep your status page updated with correct information. An internal status page should be the first reference point for other teams to check the current status.&lt;/li&gt;
&lt;li&gt;Be proactive: Communicate scheduled maintenance in advance and note which systems will be affected.&lt;/li&gt;
&lt;li&gt;Use plain language: Avoid technical jargon in your updates as much as possible.&lt;/li&gt;
&lt;li&gt;Provide context: Explain the impact of incidents on the end user experience. Users are interested in how an incident affects them or their work before anything else.&lt;/li&gt;
&lt;li&gt;Offer workarounds if available.&lt;/li&gt;
&lt;li&gt;Learn: Use incident data to enhance your systems and processes by feeding incident metrics and trends back into your post-mortems. This can help in building a culture of continuous improvement.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  A Note About Internal vs External Status Pages
&lt;/h2&gt;

&lt;p&gt;Internal status pages are available for viewing only by your organization's members. External status pages are available for viewing by everybody, including your customers, users, and the general public.&lt;/p&gt;

&lt;p&gt;If it's an internal status page, the kind of updates you publish would be different from that of an external status page. Your internal stakeholders are part of the same organization, so you can &lt;br&gt;
publish more internal, technical details. Although it's important to include specific technical details in the post mortem report for public pages also, you have to be careful not to publish internal system details which might compromise security. Also note that publishing expected times of resolution &lt;a href="https://firehydrant.com/blog/hot-take-dont-provide-incident-resolution-estimates/" rel="noopener noreferrer"&gt;can backfire&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing the right status page provider is a key decision that will affect your communication strategy during critical moments. Select a provider that not only meets your current needs but can also grow with your business. A status page reflects your commitment to transparency, so make sure you invest time in choosing the provider that is right for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ivbeg/awesome-status-pages" rel="noopener noreferrer"&gt;Here is a list&lt;/a&gt; of status page related software and services.&lt;/p&gt;

&lt;p&gt;This article was originally published on the &lt;a href="https://blog.incidenthub.cloud/Best-Practices-Choosing-Status-Page-Provider" rel="noopener noreferrer"&gt;IncidentHub blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>statuspage</category>
      <category>sre</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>When Alerts Don’t Mean Downtime - Preventing SRE Fatigue</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Thu, 12 Sep 2024 02:44:55 +0000</pubDate>
      <link>https://dev.to/incidenthub/when-alerts-dont-mean-downtime-preventing-sre-fatigue-4ne7</link>
      <guid>https://dev.to/incidenthub/when-alerts-dont-mean-downtime-preventing-sre-fatigue-4ne7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A recent question in an SRE forum triggered this train of thought.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do I deal with alerts that are triggered by internal patching/release activities but don't actually cause a downtime? If we react to these alerts we might not have time to react to actual alerts that are affecting customers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've paraphrased the question to reflect its essence. There is plenty to unravel here.&lt;/p&gt;

&lt;p&gt;My first reaction to this question was that the SRE who posted this is in a difficult place with systemic issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Systemic Issues
&lt;/h2&gt;

&lt;p&gt;Without knowing more about the org and their alerting policies, let's look at what we can dig out from this question alone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Patches/deployments trigger alerts&lt;/li&gt;
&lt;li&gt;The team does not react to such alerts to avoid spending valuable time that can be directed towards solving downtime that is affecting customers&lt;/li&gt;
&lt;li&gt;There is cognitive overhead of selectively reacting to some alerts, and ignoring others&lt;/li&gt;
&lt;li&gt;The knowledge of which alerts to react to is something only the SRE team knows&lt;/li&gt;
&lt;li&gt;Any MTTx data from such a setup are useless&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The eventual impact is sub-optimal incident management, which in turn affects SLAs and burns out on-call folks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improving the SRE Experience
&lt;/h2&gt;

&lt;p&gt;How would you approach fixing something like this?&lt;/p&gt;

&lt;p&gt;Some thoughts, in no particular order:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Setting the correct priority for alerts - Anything that affects customer perception of uptime, or can lead to data loss, is a P1. In larger organizations with independent teams responsible for their own microservices, I would extend the &lt;a href="https://www.linkedin.com/pulse/your-first-customer-team-hrishikesh-barua/" rel="noopener noreferrer"&gt;definition of customer&lt;/a&gt; to any team in your org that depends on your service(s). If you are responsible for an API used by a downstream service, they are your customers too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Zero-downtime deployments - This is not as hard as it sounds if you design your systems with this goal in mind. For stateless web applications it is trivial to switch to a new version behind a load balancer. For stateful applications it can take a bit more work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintenance mode - This can fall into two categories - maintenance mode that has to be communicated to the customer, and maintenance mode that is internal - affecting other teams who consume your service. At the alerting level, you temporarily silence the specific alerts that will get triggered by the rollout (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Investigate all alerts and disable useless ones - Not looking at an alert creates indeterminism and can lead to alert fatigue. The &lt;a href="https://blog.incidenthub.cloud/The-Benefits-of-a-Single-Incident-Management-System" rel="noopener noreferrer"&gt;alerting system&lt;/a&gt; should be the single source of truth.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
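
&lt;p&gt;As an illustration of silencing alerts for a rollout window: if your stack uses Prometheus Alertmanager, you can create a silence ahead of time with amtool. This is a minimal sketch; the matcher labels and the URL are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Silence alerts matching these labels for one hour during the deploy
amtool silence add alertname="InstanceDown" service="checkout" \
  --duration="1h" --comment="planned rollout" \
  --alertmanager.url="http://localhost:9093"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;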

&lt;p&gt;Solving such issues has to be a team effort that involves the dev teams too. You can start by making customer-facing uptime and a sustainable on-call process the priorities.&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@cdc?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;CDC&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/man-in-black-and-white-checkered-dress-shirt-using-computer-_XLJy3h77cw?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>monitoring</category>
      <category>incidentresponse</category>
    </item>
    <item>
      <title>14 Monitoring Tools for Full-Stack Developers</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Sat, 31 Aug 2024 08:47:12 +0000</pubDate>
      <link>https://dev.to/incidenthub/14-monitoring-tools-for-full-stack-developers-4nkf</link>
      <guid>https://dev.to/incidenthub/14-monitoring-tools-for-full-stack-developers-4nkf</guid>
      <description>&lt;p&gt;Whether you are a solo full-stack developer or a member of a team, your toolkit needs to have software that monitors your applications, infrastructure, managed services, and third-party dependencies.&lt;/p&gt;

&lt;p&gt;This is a list of 14 monitoring tools you can use to gain insights into your applications’ performance, reliability, and uptime. Some of these are managed, and others can be self-hosted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache SkyWalking
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://skywalking.apache.org/" rel="noopener noreferrer"&gt;Apache SkyWalking&lt;/a&gt; is an open-source APM tool meant for distributed systems. It has support for distributed tracing, agents in multiple languages, and support for an eBPF agent. &lt;/p&gt;

&lt;p&gt;SkyWalking has its own native APM database called BanyanDB which can ingest and store telemetry and observability data. It also allows you to parse logs and extract metrics from log entries.&lt;/p&gt;

&lt;p&gt;One of the important features of SkyWalking is its ability to ingest data from other sources in well-known formats like OpenTelemetry. It can also forward data to external services like alerting systems. This allows you to plug in SkyWalking without replacing your other tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Stack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://betterstack.com/" rel="noopener noreferrer"&gt;Better Stack&lt;/a&gt; is a managed log aggregation system that can ingest logs from your sources, run search queries, and set up alerts on queries. It also comes with hosted status pages. &lt;/p&gt;

&lt;p&gt;The alerting feature of Better Stack has support for multiple team members as well as integration with third-party tools like PagerDuty and ZenDesk. You can also pull data from external cloud services like GCP, AWS, and Azure to create incidents in Better Stack. &lt;/p&gt;

&lt;p&gt;In addition, Better Stack also supports website monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  ELK (Elasticsearch/Logstash/Kibana)
&lt;/h2&gt;

&lt;p&gt;This stack consists of three components - the Elasticsearch log ingestion and processing engine, the Logstash log processor, and the Kibana UI. &lt;/p&gt;

&lt;p&gt;Elasticsearch supports advanced log aggregation features with support for indexing, sharding, and clustering. It also comes with a REST API. Elasticsearch and Kibana can work seamlessly together. It's easy to set up this stack with Docker images but it can take considerably more work to install, configure, and maintain a scalable ELK stack. &lt;/p&gt;

&lt;p&gt;As of this writing, &lt;a href="https://github.com/elastic/elasticsearch" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt; is again open-source.&lt;/p&gt;

&lt;h2&gt;
  
  
  GlitchTip
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://gitlab.com/glitchtip" rel="noopener noreferrer"&gt;GlitchTip&lt;/a&gt; is an open-source error, uptime, and performance monitoring tool which also has a &lt;a href="https://glitchtip.com/" rel="noopener noreferrer"&gt;managed version&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;GlitchTip supports &lt;a href="https://glitchtip.com/sdkdocs" rel="noopener noreferrer"&gt;multiple languages&lt;/a&gt; and frameworks. Its &lt;a href="https://glitchtip.com/documentation/uptime-monitoring" rel="noopener noreferrer"&gt;uptime monitoring&lt;/a&gt; includes URL and heartbeat monitoring. It is also compatible with Sentry's API, thus you can use it to push data anywhere that supports Sentry's API. It has basic alerting support via email.&lt;/p&gt;

&lt;p&gt;They are also pretty open about their &lt;a href="https://glitchtip.com/documentation/hosted-architecture" rel="noopener noreferrer"&gt;hosted architecture&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grafana
&lt;/h2&gt;

&lt;p&gt;Grafana is an analytics and data visualization tool that can create dashboards of charts and graphs. It supports many different data sources via an extensive plugin ecosystem, so you can look at and correlate metrics from different systems in the same dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/grafana/grafana" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; is open-source and also has a &lt;a href="https://grafana.com/products/cloud/" rel="noopener noreferrer"&gt;managed version&lt;/a&gt;. You can   query both metrics and logs. It has a very active community. You can set up and try Grafana on your local machine easily using Docker.&lt;/p&gt;

&lt;p&gt;Grafana's alerting feature supports sending alerts to external services like PagerDuty, OpsGenie, Slack, etc. &lt;/p&gt;

&lt;h2&gt;
  
  
  IncidentHub
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://incidenthub.cloud/" rel="noopener noreferrer"&gt;IncidentHub&lt;/a&gt; monitors third-party Cloud and SaaS services and alerts you when they have an outage. It supports monitoring hundreds of cloud platforms like &lt;a href="https://incidenthub.cloud/status/googlecloudplatform" rel="noopener noreferrer"&gt;GCP&lt;/a&gt;, AWS, &lt;a href="https://incidenthub.cloud/status/digitalocean" rel="noopener noreferrer"&gt;Digital Ocean&lt;/a&gt;, communication/collaboration tools like &lt;a href="https://incidenthub.cloud/status/slack" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;, &lt;a href="https://incidenthub.cloud/status/zoom" rel="noopener noreferrer"&gt;Zoom&lt;/a&gt;, &lt;a href="https://incidenthub.cloud/status/microsoft365" rel="noopener noreferrer"&gt;Office365&lt;/a&gt;, payment services like &lt;a href="https://incidenthub.cloud/status/paypal" rel="noopener noreferrer"&gt;PayPal&lt;/a&gt; and &lt;a href="https://incidenthub.cloud/status/stripe" rel="noopener noreferrer"&gt;Stripe&lt;/a&gt;, and dev tooling like &lt;a href="https://incidenthub.cloud/status/github" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://incidenthub.cloud/status/gitlab" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt;, and &lt;a href="https://incidenthub.cloud/status/circleci" rel="noopener noreferrer"&gt;CircleCI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;IncidentHub periodically checks public data sources like status pages. It can notify you using channels like email, PagerDuty, Discord, Slack, Webhooks etc.&lt;/p&gt;

&lt;p&gt;If you are a developer, you can use IncidentHub to monitor your external dependencies like &lt;a href="https://incidenthub.cloud/status/googlecloudplatform" rel="noopener noreferrer"&gt;cloud services&lt;/a&gt;, &lt;a href="https://incidenthub.cloud/status/akamai" rel="noopener noreferrer"&gt;CDNs&lt;/a&gt;, and &lt;a href="https://incidenthub.cloud/status/github" rel="noopener noreferrer"&gt;CI/CD&lt;/a&gt; and deployment platforms. As of this writing, it supports 20 free monitors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parseable
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.parseable.com/" rel="noopener noreferrer"&gt;Parseable&lt;/a&gt; is a managed log analytics solution that also has an &lt;a href="https://github.com/parseablehq/parseable" rel="noopener noreferrer"&gt;open-source version&lt;/a&gt;. It's written in Rust. Parseable can use either Parquet or the Arrow format for storage. Both Arrow and Parquet are Apache open-source column-oriented data storage formats.&lt;/p&gt;

&lt;p&gt;Parseable supports OpenTelemetry and common &lt;a href="https://www.parseable.com/docs/category/log-agents" rel="noopener noreferrer"&gt;log collectors&lt;/a&gt; like Fluent Bit and LogStash for ingestion. You can also &lt;a href="https://www.parseable.com/docs/category/applications" rel="noopener noreferrer"&gt;send logs programmatically&lt;/a&gt;. It has built-in support for alerting and can push alerts into webhooks, Prometheus Alertmanager, and Slack.&lt;/p&gt;

&lt;p&gt;Parseable also has &lt;a href="https://www.parseable.com/docs/integrations/llm-based-sql-generation" rel="noopener noreferrer"&gt;LLM-based&lt;/a&gt; SQL generation for querying logs, Role-based Access Control, and OpenID Connect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pinpoint
&lt;/h2&gt;

&lt;p&gt;This is an &lt;a href="https://pinpoint-apm.github.io/pinpoint/" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; application performance management (APM) tool that is written in Java. Pinpoint can help understand how components in distributed systems interact with each other. Its UI can show you the topology of your system visually.&lt;/p&gt;

&lt;p&gt;Pinpoint works on the agent model, where an agent hooks into your applications. You can integrate with Pinpoint either by calling its APIs or by using bytecode instrumentation; the second approach does not require you to change any code.&lt;/p&gt;

&lt;p&gt;Pinpoint supports &lt;a href="https://github.com/pinpoint-apm/pinpoint?tab=readme-ov-file#supported-modules" rel="noopener noreferrer"&gt;common Java software&lt;/a&gt; out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prometheus
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/prometheus" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is an open-source metrics collection and monitoring tool written in Go. It has a very active developer and user community. Originally developed at SoundCloud, it is now an independently managed CNCF project. &lt;/p&gt;

&lt;p&gt;Prometheus supports time series metrics ingestion and has a native query language PromQL. It works via the pull model where it collects metrics from "exporters", which collect data from different sources. The list of exporters is &lt;a href="https://prometheus.io/docs/instrumenting/exporters/" rel="noopener noreferrer"&gt;extensive&lt;/a&gt;, and you can also instrument your application to either expose metrics to be collected or send them directly to Prometheus.&lt;/p&gt;

&lt;p&gt;Prometheus has a service discovery feature where it can automatically detect nodes to monitor. It can push metrics data into and read from &lt;a href="https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage" rel="noopener noreferrer"&gt;external data stores&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Using PromQL you can define alerting rules in your Prometheus configuration. Prometheus comes with its own Alertmanager, which handles grouping, deduplication, and routing of the alerts those rules fire. Alerts emitted by Prometheus can be sent to third-party systems like Slack and PagerDuty through Alertmanager.&lt;/p&gt;
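
&lt;p&gt;For a flavor of what this looks like, here is a minimal sketch of an alerting rule file; the rule name, threshold, and labels are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: availability
    rules:
      # Fire if a scrape target has been unreachable for five minutes
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;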

&lt;h2&gt;
  
  
  Sentry
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://sentry.io/welcome/" rel="noopener noreferrer"&gt;Sentry&lt;/a&gt; is an &lt;a href="https://github.com/getsentry/sentry" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; error tracking and performance monitoring tool that also has a managed version.&lt;/p&gt;

&lt;p&gt;Sentry has support for many &lt;a href="https://github.com/getsentry/sentry?tab=readme-ov-file#official-sentry-sdks" rel="noopener noreferrer"&gt;languages&lt;/a&gt; and frameworks. It supports session replay and end-to-end tracing. You can dig into the root cause of slow requests by tracing requests across function calls and services.&lt;/p&gt;

&lt;p&gt;Sentry's alerting feature supports both metrics-based checks and URL monitoring.&lt;/p&gt;

&lt;p&gt;Sentry also integrates with a lot of &lt;a href="https://sentry.io/integrations/" rel="noopener noreferrer"&gt;popular developer tools&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  SigNoz
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://signoz.io/" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt; positions itself as an "open-source DataDog alternative".  You can &lt;a href="https://github.com/SigNoz/signoz" rel="noopener noreferrer"&gt;host it yourself&lt;/a&gt; or use the commercial cloud version. &lt;/p&gt;

&lt;p&gt;SigNoz collects metrics, traces, and logs and presents them in one dashboard. It can track external API calls which is useful when your application uses third-party APIs. You can look at common metrics like p95/p99 and trace the root cause of slow requests - whether they are because of external API response times or slow DB queries. SigNoz also lets you filter out traces by tags, service name, errors, and latency. &lt;/p&gt;

&lt;p&gt;SigNoz supports OpenTelemetry as its instrumentation library - which means that any language and framework supported by OpenTelemetry is also &lt;a href="https://github.com/SigNoz/signoz?tab=readme-ov-file#languages-supported" rel="noopener noreferrer"&gt;supported by SigNoz&lt;/a&gt;. SigNoz also has built-in alerting. &lt;/p&gt;

&lt;h2&gt;
  
  
  UptimeRobot
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://uptimerobot.com/" rel="noopener noreferrer"&gt;UptimeRobot&lt;/a&gt; is a website monitoring service that checks if your website is accessible periodically and alerts you. &lt;/p&gt;

&lt;p&gt;It supports different types of monitoring like HTTP/S, checking for keywords, cron jobs, TLS certificate expiry, and domain monitoring. It integrates with different services like Slack, PagerDuty, Telegram, Email, ZenDesk, etc. It also gives you a status page that you can share with your team.&lt;/p&gt;

&lt;p&gt;As of this writing the service supports 50 free monitors, making it useful for solo devs and small teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Victoria Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://victoriametrics.com/" rel="noopener noreferrer"&gt;VictoriaMetrics&lt;/a&gt; is a monitoring tool and time series database. It is &lt;a href="https://github.com/VictoriaMetrics/VictoriaMetrics" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; and has a managed version.&lt;/p&gt;

&lt;p&gt;VictoriaMetrics can integrate with other monitoring tools. E.g. with Prometheus, it can function as a storage backend for long-term data retention. It can ingest data in all well-known formats, including OpenTelemetry.&lt;/p&gt;

&lt;p&gt;You can query VictoriaMetrics using either PromQL or its native MetricsQL. It's also straightforward to back up VictoriaMetrics data using its snapshots feature to any cloud storage like Amazon S3 or Google Cloud Storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  WireShark
&lt;/h2&gt;

&lt;p&gt;Now we are getting a bit low-level. &lt;a href="https://www.wireshark.org/" rel="noopener noreferrer"&gt;WireShark&lt;/a&gt; is a network protocol analyzer that has been around for a long time. &lt;/p&gt;

&lt;p&gt;WireShark is ideal if you have to inspect network traffic at the packet level. It supports many protocols with filtering capabilities. You can capture and inspect data live, or do offline analysis. &lt;/p&gt;

&lt;p&gt;Wireshark runs on multiple OSs, including Windows, Linux, FreeBSD, and macOS.&lt;/p&gt;
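
&lt;p&gt;If you want to script this kind of packet analysis, the third-party pyshark library wraps Wireshark's tshark dissectors. A rough sketch, assuming pyshark and tshark are installed - the capture file name and filter are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pyshark  # pip install pyshark; needs tshark on the PATH

# Offline analysis: walk a saved capture and keep only TCP retransmissions.
capture = pyshark.FileCapture(
    "trace.pcap",  # illustrative file name
    display_filter="tcp.analysis.retransmission",
)
for packet in capture:
    print(packet.sniff_time, packet.ip.src, packet.ip.dst)
capture.close()
&lt;/code&gt;&lt;/pre&gt;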

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing the right monitoring tool can be daunting with so many options. A checklist for choosing what is right for your needs could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are your top 5 feature requirements? This list can change over time.&lt;/li&gt;
&lt;li&gt;What is your budget?&lt;/li&gt;
&lt;li&gt;Do you prefer managing your own tools, or using a hosted solution? As your applications mature and your observability data volume grows, the scalability of your tool becomes important.&lt;/li&gt;
&lt;li&gt;Does your organization have regulatory requirements?&lt;/li&gt;
&lt;li&gt;Does your chosen tool do multiple things well? E.g. Does it handle logs and metrics equally well?&lt;/li&gt;
&lt;li&gt;Does the tool integrate with your existing toolkit?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might end up with 2-3 or even more tools, each in its specialized niche, and that's OK. In that case, integration features become important. You might choose a distributed tracing tool that sends alerts to a separate alerting tool. Or you might have an uptime monitor that sends informational alerts to your Slack, and critical ones to PagerDuty. As your project needs change, so will your tools. &lt;/p&gt;

&lt;p&gt;This is by no means an exhaustive list, and there are many other tools out there. Try out some of these and let others know what you think in the comments.&lt;/p&gt;

&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@martz90" rel="noopener noreferrer"&gt;Martin Martz&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-blue-background-with-wavy-shapes-vy6eb3gscTk" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>fullstack</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Benefits of a Single Incident Management System</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Thu, 29 Aug 2024 01:17:31 +0000</pubDate>
      <link>https://dev.to/incidenthub/the-benefits-of-a-single-incident-management-system-1k07</link>
      <guid>https://dev.to/incidenthub/the-benefits-of-a-single-incident-management-system-1k07</guid>
      <description>&lt;p&gt;How many monitoring tools do you have?&lt;/p&gt;

&lt;p&gt;Chances are at least 2-3. One tool usually does not cover all cases, and it’s usually a combination of self-managed and managed tools. Self-managed gives you more control over custom configurations and cost. Managed ones take away the headache of running them yourself.&lt;/p&gt;

&lt;p&gt;Prometheus is the de-facto standard for monitoring these days if you have a modern application stack and you want to manage your own monitoring. It is metrics-based, i.e., it uses metrics as the source of data from all the monitored systems. There are &lt;a href="https://prometheus.io/docs/instrumenting/exporters/" rel="noopener noreferrer"&gt;ready-made exporters&lt;/a&gt; for almost all popular infrastructure components. You can send your application and business metrics to Prometheus too with &lt;a href="https://opentelemetry.io/docs/specs/otel/metrics/sdk_exporters/prometheus/" rel="noopener noreferrer"&gt;OpenTelemetry exporters&lt;/a&gt;.&lt;/p&gt;
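
&lt;p&gt;Exposing your own application metrics takes only a few lines with the official client libraries. A minimal Python sketch - the metric name and port are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

from prometheus_client import Counter, start_http_server

# Hypothetical business metric; Prometheus scrapes it from :8000/metrics.
ORDERS_PROCESSED = Counter("orders_processed_total", "Total orders processed")

start_http_server(8000)
while True:
    ORDERS_PROCESSED.inc()
    time.sleep(1)  # stand-in for real work
&lt;/code&gt;&lt;/pre&gt;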

&lt;p&gt;This model does not work for all aspects of your service. E.g., if you want to monitor external properties like your website, or use synthetic monitoring to check your customer-facing APIs from global locations, you could use something like Pingdom or UptimeRobot. This becomes another source of data about your service's uptime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Many Monitors, One Incident Management System
&lt;/h2&gt;

&lt;p&gt;A downside of having more than one monitoring system in place, regardless of the need, is that you have multiple sources of data. You have to consult multiple systems if you want to know the overall status. It is therefore important that all alerts land in one single incident and on-call management system - a single place from which your on-call teams get paged.&lt;/p&gt;

&lt;p&gt;So ensuring that all your monitoring tools can integrate with your on-call system is crucial.&lt;/p&gt;
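
&lt;p&gt;Most on-call systems accept events over a simple HTTP API, so even tools without a native integration can usually be wired in. A sketch using PagerDuty's Events API v2 - the routing key and payload values are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

# Trigger an incident via PagerDuty's Events API v2.
event = {
    "routing_key": "YOUR_INTEGRATION_KEY",  # from a PagerDuty service integration
    "event_action": "trigger",
    "payload": {
        "summary": "Synthetic check failing for checkout API",  # illustrative
        "source": "uptime-monitor-eu-west",
        "severity": "critical",
    },
}
resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=10)
resp.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;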

&lt;p&gt;A typical Prometheus setup might look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9cnopt3s4fotjb9db07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9cnopt3s4fotjb9db07.png" alt="Prometheus monitoring setup" width="800" height="205"&gt;&lt;/a&gt;&lt;br&gt;
If you have other monitoring systems, you should be able to route those alerts into your on-call/incident response system. Most tools support this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2fh0zg50qrep3jqlr5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2fh0zg50qrep3jqlr5f.png" alt="Prometheus and other external monitoring tools" width="800" height="219"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://incidenthub.cloud/" rel="noopener noreferrer"&gt;IncidentHub&lt;/a&gt; monitors your external SaaS and cloud providers and notifies you when they have incidents. It can easily integrate into your existing incident management system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9ld1hwyc8eh7sitjjsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9ld1hwyc8eh7sitjjsb.png" alt="IncidentHub fits into your existing monitoring ecosystem" width="800" height="285"&gt;&lt;/a&gt;&lt;br&gt;
If you’re using PagerDuty, just add a PagerDuty channel and you’re good to go. Check out the &lt;a href="https://docs.incidenthub.cloud/channels" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for more.&lt;/p&gt;

&lt;p&gt;Cover image credits - &lt;a href="https://unsplash.com/@lukechesser" rel="noopener noreferrer"&gt;Luke Chesser&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/graphs-of-performance-analytics-on-a-laptop-screen-JKUTrJ4vK00" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>Monitoring Third Party Vendors as an Ops Engineer/SRE</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Mon, 26 Aug 2024 08:27:02 +0000</pubDate>
      <link>https://dev.to/incidenthub/monitoring-third-party-vendors-as-an-ops-engineersre-41j1</link>
      <guid>https://dev.to/incidenthub/monitoring-third-party-vendors-as-an-ops-engineersre-41j1</guid>
      <description>&lt;p&gt;Why should you monitor your third-party Cloud and SaaS vendors if you are in SRE/Ops?&lt;/p&gt;

&lt;p&gt;As part of an SRE team, your primary responsibility is ensuring the reliability of your applications. What makes you responsible for monitoring services that you don't even manage? Third-party services are just like yours - with SLAs. And outages happen, affecting you as well as many others who depend on them.&lt;/p&gt;

&lt;p&gt;It's a no-brainer that you should know when such outages happen, so you can stay on top of things if and when they affect your running applications.&lt;/p&gt;

&lt;p&gt;Most of your third-party dependencies will have a public status page or a Twitter account where they publish updates on their outages. Here are some seemingly easy ways to monitor these pages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subscribe to the RSS feed of these pages&lt;/li&gt;
&lt;li&gt;Follow the Twitter account&lt;/li&gt;
&lt;li&gt;Sign up for Slack, Email, SMS notifications on the status page itself if the page supports these&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But if you have tried this, you know it's not that easy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not all pages have RSS feeds&lt;/li&gt;
&lt;li&gt;Some have Slack, Email, SMS integration - some don't&lt;/li&gt;
&lt;li&gt;Some don't have a Twitter account&lt;/li&gt;
&lt;li&gt;You need to sign up on all of these pages one by one, and not all services support the same notification channels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can easily end up doing this one by one for 10-15 or more service providers. Let's do a quick check: which services in the list below do you use in your stack?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;    DNS - &lt;a href="https://incidenthub.cloud/status/googlecloudplatform" rel="noopener noreferrer"&gt;GCP&lt;/a&gt;/GoDaddy/UltraDNS/Route53&lt;/li&gt;
&lt;li&gt;    Cloud/PaaS - GCP/AWS/Azure/&lt;a href="https://incidenthub.cloud/status/digitalocean" rel="noopener noreferrer"&gt;DigitalOcean&lt;/a&gt;/Heroku/Render/&lt;a href="https://incidenthub.cloud/status/railway" rel="noopener noreferrer"&gt;Railway&lt;/a&gt;/&lt;a href="https://incidenthub.cloud/status/hetzner" rel="noopener noreferrer"&gt;Hetzner&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Monitoring - Grafana Cloud/&lt;a href="https://incidenthub.cloud/status/datadog" rel="noopener noreferrer"&gt;DataDog&lt;/a&gt;/&lt;a href="https://incidenthub.cloud/status/newrelic" rel="noopener noreferrer"&gt;New Relic&lt;/a&gt;/SolarWinds&lt;/li&gt;
&lt;li&gt;    On-call management - PagerDuty/OpsGenie&lt;/li&gt;
&lt;li&gt;    Email - &lt;a href="https://incidenthub.cloud/status/googleworkspace" rel="noopener noreferrer"&gt;Google Workspace&lt;/a&gt;/Zoho&lt;/li&gt;
&lt;li&gt;    Communication - &lt;a href="https://incidenthub.cloud/status/zoom" rel="noopener noreferrer"&gt;Zoom&lt;/a&gt;/&lt;a href="https://incidenthub.cloud/status/slack" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Collaboration - &lt;a href="https://incidenthub.cloud/status/jira" rel="noopener noreferrer"&gt;Atlassian Jira&lt;/a&gt;/Confluence&lt;/li&gt;
&lt;li&gt;    Source code - &lt;a href="https://incidenthub.cloud/status/gitlab" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt;/&lt;a href="https://incidenthub.cloud/status/github" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    CI/CD/GitOps - TravisCI/&lt;a href="https://incidenthub.cloud/status/circleci" rel="noopener noreferrer"&gt;CircleCI&lt;/a&gt;/CodeFresh&lt;/li&gt;
&lt;li&gt;    CDN/Content delivery - &lt;a href="https://incidenthub.cloud/status/cloudflare" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt;/CDNJS/Fastly/&lt;a href="https://incidenthub.cloud/status/akamai" rel="noopener noreferrer"&gt;Akamai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    SMTP providers - SMTP.com/&lt;a href="https://incidenthub.cloud/status/sendgrid" rel="noopener noreferrer"&gt;SendGrid&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Payments - &lt;a href="https://incidenthub.cloud/status/paypal" rel="noopener noreferrer"&gt;PayPal&lt;/a&gt;/&lt;a href="https://incidenthub.cloud/status/stripe" rel="noopener noreferrer"&gt;Stripe&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Artifact Repo - Maven/&lt;a href="https://incidenthub.cloud/status/dockerhub" rel="noopener noreferrer"&gt;DockerHub&lt;/a&gt;/Quay.io&lt;/li&gt;
&lt;li&gt;    Others - &lt;a href="https://incidenthub.cloud/status/openai" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;/Apple Dev Platform/Meta Platform/&lt;a href="https://incidenthub.cloud/status/anthropic" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Marketing - MailChimp/&lt;a href="https://incidenthub.cloud/status/hubspot" rel="noopener noreferrer"&gt;Hubspot&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Auth - Okta/Clerk/Auth0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a small list. You may not use all of these, or you may use others, but you get the point.&lt;/p&gt;

&lt;p&gt;Like any self-respecting Ops Engineer/SRE, you would probably want to whip up a script and write this check-pages-and-notify-in-one-place tool yourself (a sketch of what that script looks like follows the list below). I know, because I've worked in Ops/SRE roles for the better part of my career, and NIH (not-invented-here) is a very real thing. Here's why it's not a great idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any software you write has to be maintained. Say your org starts using a new service which does not have an RSS feed on the status page. What now?&lt;/li&gt;
&lt;li&gt;Who monitors the monitor? How do you know when your script is not running?&lt;/li&gt;
&lt;li&gt;You probably have better uses for your time&lt;/li&gt;
&lt;/ul&gt;
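
&lt;p&gt;To make the maintenance point concrete, here's roughly what that home-grown script looks like - a sketch using the feedparser library, with illustrative feed URLs. It works right up until a provider drops its RSS feed or changes its status page provider:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

import feedparser  # pip install feedparser

# Illustrative feed URLs - every provider differs, and some have no feed at all.
STATUS_FEEDS = {
    "github": "https://www.githubstatus.com/history.rss",
    "cloudflare": "https://www.cloudflarestatus.com/history.rss",
}

seen = set()

while True:
    for name, url in STATUS_FEEDS.items():
        for entry in feedparser.parse(url).entries:
            key = (name, entry.get("id") or entry.get("link"))
            if key not in seen:
                seen.add(key)
                print(f"[{name}] {entry.title}")  # stand-in for a Slack/PagerDuty call
    time.sleep(300)  # and who alerts you when this loop dies?
&lt;/code&gt;&lt;/pre&gt;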

&lt;p&gt;IncidentHub was built to solve precisely these problems - so you can focus on what's important, and hand off monitoring third-party services to something that was built with that goal in mind. So stop hacking together scripts to monitor public status pages, and &lt;a href="https://incidenthub.cloud/" rel="noopener noreferrer"&gt;try it out for free&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Image credits : &lt;a href="https://unsplash.com/@dulhiier" rel="noopener noreferrer"&gt;Nastya Dulhiier&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/lighted-city-at-night-aerial-photo-OKOOGO578eo" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>cloud</category>
      <category>saas</category>
      <category>uptime</category>
    </item>
  </channel>
</rss>
