<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: IncidentHub</title>
    <description>The latest articles on DEV Community by IncidentHub (@incidenthub).</description>
    <link>https://dev.to/incidenthub</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8820%2Fc5942688-b4f6-41e9-a945-de7ef68f4906.png</url>
      <title>DEV Community: IncidentHub</title>
      <link>https://dev.to/incidenthub</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/incidenthub"/>
    <language>en</language>
    <item>
      <title>Top 6 Reasons Why You Need a Status Page Aggregator</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Sun, 06 Apr 2025 17:09:10 +0000</pubDate>
      <link>https://dev.to/incidenthub/top-6-reasons-why-you-need-a-status-page-aggregator-5a2m</link>
      <guid>https://dev.to/incidenthub/top-6-reasons-why-you-need-a-status-page-aggregator-5a2m</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Your business depends on the reliability of the third-party services you use. Monitoring the status pages of these services is the best way of keeping track of their outages and maintenances. Although some status pages let you subscribe to alerts, there is no standard way of doing this. Service providers can change their status page providers, disable subscriptions, or not support the same notification options.&lt;/p&gt;

&lt;p&gt;A status page aggregator is a tool that solves all these problems by aggregating the status pages of multiple services in one place. &lt;br&gt;
If you depend on only 2-3 third-party services, you can probably get away without a status page aggregator. Beyond that, it becomes harder to stay on top of third-party service outages and maintenances, leaving gaps in your monitoring.&lt;/p&gt;

&lt;p&gt;Let's look at the top 6 reasons why you need a status page aggregator.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
Top 6 Reasons Why You Need a Status Page Aggregator

&lt;ul&gt;
&lt;li&gt;Services Can Change Status Page Providers&lt;/li&gt;
&lt;li&gt;Not All Status Pages Let You Subscribe to Specific Components and Regions&lt;/li&gt;
&lt;li&gt;There Can Be Too Many Status Pages To Track&lt;/li&gt;
&lt;li&gt;Status Page URLs Can Change&lt;/li&gt;
&lt;li&gt;Some Status Pages Don't Have Any Way of Subscribing to Outages&lt;/li&gt;
&lt;li&gt;Home-Grown Status Page Monitoring Tools Are Hard To Maintain&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Top 6 Reasons Why You Need a Status Page Aggregator
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Services Can Change Status Page Providers
&lt;/h3&gt;

&lt;p&gt;Businesses use a &lt;a href="https://blog.incidenthub.cloud/Best-Practices-Choosing-Status-Page-Provider" rel="noopener noreferrer"&gt;status page provider&lt;/a&gt; to create a managed status page that they can use to communicate with their customers and users. Depending on business needs, provider reliability, integration options, and more, businesses can change their status page provider. The status page URL usually remains the same, but the page format and subscription options change.&lt;/p&gt;

&lt;p&gt;A recent example of such a move is OpenAI's status page. In Jan 2025, OpenAI was using Atlassian Statuspage. You can check it at the &lt;a href="https://web.archive.org/web/20250101055627/https://status.openai.com/" rel="noopener noreferrer"&gt;Wayback Machine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxoo60ngxrpch3jjm4gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxoo60ngxrpch3jjm4gb.png" alt="OpenAI's previous status page" width="800" height="811"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://status.openai.com/" rel="noopener noreferrer"&gt;current OpenAI status page&lt;/a&gt; as of this writing is managed by Incident.io. The URL remains the same.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbookbn5j59a9u5no9jc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbookbn5j59a9u5no9jc6.png" alt="OpenAI's current status page" width="700" height="959"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The subscription options have changed. If you were previously subscribed using webhooks, that option is no longer available. What's more, you would not even know that this happened. Once you set up the webhook subscription, you would not visit the status page except to check for details of outages and maintenances. If the subscription were removed, you would be blissfully unaware of any future outages. That is, until the outages start affecting your applications, and by extension, your business. You can end up with angry customers, lost revenue, and stressed SRE/Ops teams.&lt;/p&gt;

&lt;p&gt;IncidentHub - a status page aggregator - automatically detects such changes. Using an aggregator shifts the responsibility of outage notifications to the aggregator, which can smooth over any differences in the status page providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Not All Status Pages Let You Subscribe to Specific Components and Regions
&lt;/h3&gt;

&lt;p&gt;Your third-party cloud and SaaS dependencies are likely globally distributed, with many regions of operation. Your applications use only a subset of these services. Why receive alerts for everything?&lt;/p&gt;

&lt;p&gt;Some status pages, like &lt;a href="https://stabilityai.instatus.com/" rel="noopener noreferrer"&gt;Stability.ai's&lt;/a&gt;, let you subscribe to specific components and regions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsl5zvd8qay26bwlg71aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsl5zvd8qay26bwlg71aw.png" alt="Subscribe to specific components on the status page itself" width="800" height="760"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Others, like &lt;a href="https://status.litellm.ai/" rel="noopener noreferrer"&gt;LiteLLM's status page&lt;/a&gt;, have an RSS feed only. If you connect the feed to your Slack channel using the &lt;a href="https://slack.com/intl/en-in/help/articles/218688467-Add-RSS-feeds-to-Slack" rel="noopener noreferrer"&gt;&lt;code&gt;/feed&lt;/code&gt;&lt;/a&gt; command, you will get notified of each and every outage in LiteLLM. There is no way to subscribe to a specific LiteLLM service from its status page.&lt;/p&gt;
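&lt;p&gt;For reference, Slack's built-in command takes the feed URL directly. A minimal sketch (the URL is a placeholder, not LiteLLM's actual feed path):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/feed subscribe https://status.example.com/history.rss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;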

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnivxjuy943e1mh5wecxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnivxjuy943e1mh5wecxt.png" alt="LiteLLM's status page" width="800" height="845"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A status page aggregator like IncidentHub lets you monitor &lt;a href="https://blog.incidenthub.cloud/Monitoring-Specific-Components-and-Regions-in-Your-Third-Party-Services" rel="noopener noreferrer"&gt;specific components and regions&lt;/a&gt; as long as the information is on the status page. This is true even when the originating status page does not offer component-specific subscriptions.&lt;/p&gt;

&lt;h3&gt;
  
  
  There Can Be Too Many Status Pages To Track
&lt;/h3&gt;

&lt;p&gt;According to the &lt;a href="https://www.bettercloud.com/resources/state-of-saas/" rel="noopener noreferrer"&gt;State of SaaSOps Report 2024&lt;/a&gt;, organizations use an average of 112 SaaS tools. Even for smaller organizations and startups, most operations are outsourced to SaaS and Cloud vendors. 100+ tools means 100+ chances of unnoticed disruptions.&lt;/p&gt;

&lt;p&gt;Monitoring all these services manually by tracking their status pages is not only tedious, it simply does not scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Status Page URLs Can Change
&lt;/h3&gt;

&lt;p&gt;For various reasons, a third-party vendor can change its status page URL.&lt;/p&gt;

&lt;p&gt;Cloudflare acquired Area 1 Security, which previously had its own &lt;a href="https://web.archive.org/web/20250108114141/https://status.area1security.com/" rel="noopener noreferrer"&gt;status page&lt;/a&gt;.&lt;br&gt;
A few months ago, they removed that status page, and Area 1's products are now part of the &lt;a href="https://www.cloudflarestatus.com/" rel="noopener noreferrer"&gt;Cloudflare status page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzp4ap88s3m9ub2sj879.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzp4ap88s3m9ub2sj879.png" alt="Cloudflare" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you were previously monitoring Area 1's status page directly using just RSS feeds or email notifications, you might not have known about this change, leaving you exposed to undetected outages.&lt;/p&gt;

&lt;p&gt;Another example is Railway's status page which moved from &lt;a href="https://web.archive.org/web/20240728002854/https://railway.app/" rel="noopener noreferrer"&gt;&lt;code&gt;status.railway.app&lt;/code&gt;&lt;/a&gt; to &lt;a href="https://status.railway.com/" rel="noopener noreferrer"&gt;&lt;code&gt;status.railway.com&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;IncidentHub detects such changes and auto-adjusts its monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Some Status Pages Don't Have Any Way of Subscribing to Outages
&lt;/h3&gt;

&lt;p&gt;Most status pages have at least an RSS or Atom feed. However, some status pages don't have any visible means of subscribing to outages.&lt;br&gt;
You need to keep refreshing the status page. This is just not feasible if you have a lot of dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Home-Grown Status Page Monitoring Tools Are Hard To Maintain
&lt;/h3&gt;

&lt;p&gt;Some engineering and IT teams choose to &lt;a href="https://blog.incidenthub.cloud/Monitoring-Third-Party-Vendors-As-An-Ops-Engineer-SRE" rel="noopener noreferrer"&gt;build their own tooling&lt;/a&gt; to get around the above problems. After all, why pay for a status page aggregator when you can build your own? Any self-respecting Ops Engineer/SRE would probably want to whip up a script and write this tool themselves (a sketch of such a script follows the list below). However, such a homegrown solution requires a lot of upfront development and ongoing maintenance effort. The technical challenges themselves are significant. In addition, there are other costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any software you write needs maintenance. E.g. when your organization starts using a new service that cannot be monitored using your existing tooling, you need to add support for it.&lt;/li&gt;
&lt;li&gt;Somebody has to ensure reliability and uptime of the homegrown solution.&lt;/li&gt;
&lt;li&gt;It becomes an additional burden on your already overburdened SRE/Ops teams.&lt;/li&gt;
&lt;/ul&gt;
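&lt;p&gt;To make the maintenance burden concrete, here is a minimal sketch of the kind of homegrown poller teams typically start with. Everything in it is illustrative: the feed URL, the state file path, and the Slack webhook are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/sh
# Naive status feed poller: fetch the RSS feed, alert on any change.
# Run from cron every few minutes - and note you need one of these per service.
FEED_URL="https://status.example.com/history.rss"  # placeholder
STATE="/var/tmp/statusfeed-example.xml"

# Fails silently if the URL moves or the provider changes - one of the gaps above
curl -sf "$FEED_URL" -o "$STATE.new" || exit 1

if ! cmp -s "$STATE.new" "$STATE"; then
  # Placeholder Slack incoming-webhook URL
  curl -s -X POST -H 'Content-Type: application/json' \
    --data '{"text":"Status feed changed: example.com"}' \
    "https://hooks.slack.com/services/T000/B000/XXXX"
fi
mv "$STATE.new" "$STATE"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Even this toy version hints at the costs above: it breaks quietly when the feed URL changes, it alerts on every feed edit rather than only real incidents, and it needs a copy, and an owner, per service.&lt;/p&gt;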

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The challenges in monitoring status pages yourself or using home-grown solutions are real. A status page aggregator like IncidentHub solves these problems by providing a reliable and scalable solution.&lt;br&gt;
IncidentHub continuously adapts to status page quirks, URL changes, and more, where more basic tools falter.&lt;/p&gt;

&lt;p&gt;Try out the free (forever) tier of &lt;a href="https://incidenthub.cloud/#pricing" rel="noopener noreferrer"&gt;IncidentHub&lt;/a&gt; to never miss an outage again.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;IncidentHub is not affiliated with any of the services and vendors mentioned in this article.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This article was originally published on the &lt;a href="https://blog.incidenthub.cloud/top-six-reasons-why-you-need-a-status-page-aggregator" rel="noopener noreferrer"&gt;IncidentHub blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>statuspage</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>sitereliabilityengineering</category>
    </item>
    <item>
      <title>How to Configure a Remote Data Store for Prometheus</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Sat, 21 Dec 2024 17:03:03 +0000</pubDate>
      <link>https://dev.to/incidenthub/how-to-configure-a-remote-data-store-for-prometheus-2g27</link>
      <guid>https://dev.to/incidenthub/how-to-configure-a-remote-data-store-for-prometheus-2g27</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The Prometheus &lt;a href="https://dev.to/tags/monitoring"&gt;monitoring&lt;/a&gt; tool can store its metrics either locally or remotely. You can configure a remote data store using the &lt;code&gt;remote_write&lt;/code&gt; configuration. This article describes the various data store options available as well as how to set up a remote store.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of Remote Storage
&lt;/h2&gt;

&lt;p&gt;By default, Prometheus stores data locally wherever it is installed. The data directory can be configured by using the &lt;code&gt;--storage.tsdb.path&lt;/code&gt; command line option when starting Prometheus. &lt;br&gt;
In practice, you can attach a separate disk to the machine where Prometheus is running for higher performance.&lt;/p&gt;

&lt;p&gt;However, this may not be possible or optimal in all situations: you might want a data store that is better suited to time series data and has more storage capacity for longer retention. Prometheus usually runs in a standalone VM, a Kubernetes pod, or a Docker container, and would not have access to such data stores by default.&lt;/p&gt;

&lt;p&gt;A remote store can add these capabilities to Prometheus. The remote storage option can be set by using the &lt;code&gt;remote_write&lt;/code&gt; key in the Prometheus configuration YAML file. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Overview of Remote Storage&lt;/li&gt;
&lt;li&gt;Remote Store Architecture&lt;/li&gt;
&lt;li&gt;
Remote Store Configuration

&lt;ul&gt;
&lt;li&gt;Basic Syntax&lt;/li&gt;
&lt;li&gt;Security and Authentication&lt;/li&gt;
&lt;li&gt;Remote Write Protocol Configuration&lt;/li&gt;
&lt;li&gt;Network Configuration&lt;/li&gt;
&lt;li&gt;Metrics Configuration&lt;/li&gt;
&lt;li&gt;Queue Configuration&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Remote Storage Options&lt;/li&gt;
&lt;li&gt;
Troubleshooting

&lt;ul&gt;
&lt;li&gt;Prometheus failing to write to the remote storage&lt;/li&gt;
&lt;li&gt;Network connectivity between Prometheus and the remote store&lt;/li&gt;
&lt;li&gt;If there is a proxy in between, it might be dropping packets or might not be running&lt;/li&gt;
&lt;li&gt;Requests are timing out due to network issues&lt;/li&gt;
&lt;li&gt;Requests are timing out due to the remote store not being able to keep up&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Best Practices&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;li&gt;FAQ&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Remote Store Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4lrlg5y9ej5ag6l7mvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4lrlg5y9ej5ag6l7mvc.png" alt="Prometheus remote write architecture" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Remote Store Configuration
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Basic Syntax
&lt;/h3&gt;

&lt;p&gt;A very simple configuration for a remote store that accepts unauthenticated connections would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.23.4/api/v1/write"&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production-metrics"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can have multiple entries under &lt;code&gt;remote_write&lt;/code&gt; in the same Prometheus configuration.&lt;/p&gt;
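&lt;p&gt;For example, a minimal sketch with two destinations (the second URL is a hypothetical long-term store):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;remote_write:
- url: "http://192.168.23.4/api/v1/write"
  name: "production-metrics"
- url: "https://longterm-store.example.com/api/v1/write"
  name: "long-term-archive"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;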

&lt;p&gt;Based on your requirements and the features supported by the remote write server you can configure other options. Let us look at them one by one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security and Authentication
&lt;/h3&gt;

&lt;p&gt;To protect your metrics data in transit, whether it travels over your internal network or through the internet, you can enable both TLS and authentication. The remote store server must&lt;br&gt;
support these options.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Remote write configuration for Prometheus&lt;/span&gt;
&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://prometheus-data-store.mydb.io/api/v1/write"&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production-metrics"&lt;/span&gt;

  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;token&amp;gt;"&lt;/span&gt;

  &lt;span class="na"&gt;basic_auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prometheus"&lt;/span&gt;
    &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secret-password"&lt;/span&gt;

  &lt;span class="na"&gt;tls_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;insecure_skip_verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;ca_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/ca.pem"&lt;/span&gt;
    &lt;span class="na"&gt;cert_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/cert.pem"&lt;/span&gt;
    &lt;span class="na"&gt;key_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/key.pem"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sample configuration does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds a &lt;code&gt;Bearer&lt;/code&gt; token for authentication, as well as basic auth options. In practice you would use only one of these.&lt;/li&gt;
&lt;li&gt;Adds a &lt;code&gt;tls_config&lt;/code&gt; assuming you have a custom CA which has issued the certificates for the remote store's server. If it's a certificate issued by a well-known CA, you would not have to configure this. This option would come in handy when you have a private CA.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also create a separate &lt;code&gt;authorization&lt;/code&gt; section for more options while setting the &lt;code&gt;Authorization&lt;/code&gt; header. Note that the options below are mutually exclusive - the example is only for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example 1: Default Bearer type with direct credentials&lt;/span&gt;
&lt;span class="na"&gt;authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bearer&lt;/span&gt;
  &lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eyJhbGciOiJIPoI1NiIsInR5cCI6IkpXVCJ9..."&lt;/span&gt;

&lt;span class="c1"&gt;# Example 2: Bearer type with credentials from file. This is mutually exclusive with credentials_file&lt;/span&gt;
&lt;span class="na"&gt;authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bearer&lt;/span&gt;
  &lt;span class="na"&gt;credentials_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/etc/prometheus/token.txt"&lt;/span&gt;

&lt;span class="c1"&gt;# Example 3: Custom type with direct credentials&lt;/span&gt;
&lt;span class="na"&gt;authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CustomAuth&lt;/span&gt;
  &lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secret-token-123"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Remote Write Protocol Configuration
&lt;/h3&gt;

&lt;p&gt;As of this writing, the &lt;a href="https://prometheus.io/docs/specs/remote_write_spec/" rel="noopener noreferrer"&gt;remote write&lt;/a&gt; specification is undergoing &lt;a href="https://prometheus.io/docs/specs/remote_write_spec_2_0/" rel="noopener noreferrer"&gt;a change&lt;/a&gt;. &lt;br&gt;
You probably don't have to worry about this section unless you are optimizing for very specific cases. You can configure the &lt;code&gt;protobuf_message&lt;/code&gt; object that Prometheus uses when sending metrics. &lt;br&gt;
This depends on what your remote store server supports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.23.4/api/v1/write"&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production-metrics"&lt;/span&gt;

  &lt;span class="na"&gt;protobuf_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus.WriteRequest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Network Configuration
&lt;/h3&gt;

&lt;p&gt;Based on the properties of your remote store server, you can tune some functional settings.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;remote_timeout&lt;/code&gt; key sets the timeout for requests to the remote write endpoint. The default value is 30s. You would not need to set this unless you have a noisy network, or there are shorter timeouts in the network path between your Prometheus server and the remote store server.&lt;/p&gt;

&lt;p&gt;If your remote store is behind a proxy server, you can configure the proxy details in the YAML.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.23.4/api/v1/write"&lt;/span&gt;
  &lt;span class="na"&gt;remote_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;45s&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production-metrics"&lt;/span&gt;

  &lt;span class="c1"&gt;# Proxy configuration&lt;/span&gt;
  &lt;span class="na"&gt;proxy_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://proxy.internal:4200"&lt;/span&gt;
  &lt;span class="na"&gt;proxy_connect_header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Proxy-Authorization"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Basic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;xxxxxxxxxxxxxxxxxxxx"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Custom-Proxy-Header"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app1"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app2"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;proxy_from_environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

  &lt;span class="na"&gt;follow_redirects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;enable_http2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Metrics Configuration
&lt;/h3&gt;

&lt;p&gt;You can use the &lt;code&gt;write_relabel_configs&lt;/code&gt; key to modify or drop specific metrics before they are written to the remote store. The &lt;a href="https://dev.to/incidenthub/a-beginners-guide-to-service-discovery-in-prometheus-3366#target-relabeling-and-filtering"&gt;relabel syntax&lt;/a&gt; is identical to that used in the &lt;code&gt;scrape_configs&lt;/code&gt; section. You might want to do this if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have multiple remote stores and want specific metrics to go to specific stores to avoid unnecessary storage costs.&lt;/li&gt;
&lt;li&gt;You have one remote store but don't want certain metrics written there; those remain only in Prometheus's local storage.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;write_relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__name__&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test_metric.*'&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;drop&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;staging'&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;drop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Queue Configuration
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;queue_config&lt;/code&gt; section has settings to fine-tune the queue that is used to write to remote storage. Prometheus creates an internal queue for each remote write server. As it collects metrics, Prometheus maintains a &lt;a href="https://en.wikipedia.org/wiki/Write-ahead_logging" rel="noopener noreferrer"&gt;write-ahead log&lt;/a&gt; (WAL) that it can replay if there's a crash. Each remote destination queue picks up metrics data from the WAL and sends it to the remote store server. Each queue can also have multiple shards, which control the amount of parallelism for each queue.&lt;/p&gt;

&lt;p&gt;You will have to tune the queue settings only if you have a very high volume of data and/or are facing issues with the remote store struggling to keep up with your Prometheus server.&lt;/p&gt;
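&lt;p&gt;For orientation, here is a sketch of the commonly tuned &lt;code&gt;queue_config&lt;/code&gt; keys, shown with their default values (defaults as documented for recent Prometheus releases; verify against your version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;remote_write:
- url: "http://192.168.23.4/api/v1/write"
  queue_config:
    capacity: 10000             # samples buffered per shard
    min_shards: 1               # minimum number of parallel write shards
    max_shards: 50              # upper bound on parallelism
    max_samples_per_send: 2000  # batch size per request
    batch_send_deadline: 5s     # send a partial batch after waiting this long
    min_backoff: 30ms           # retry backoff on failed sends
    max_backoff: 5s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;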

&lt;p&gt;You can check out these great writeups on tuning the queue settings for &lt;code&gt;remote_write&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://grafana.com/blog/2021/04/12/how-to-troubleshoot-remote-write-issues-in-prometheus/" rel="noopener noreferrer"&gt;https://grafana.com/blog/2021/04/12/how-to-troubleshoot-remote-write-issues-in-prometheus/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://last9.io/blog/how-to-scale-prometheus-remote-write/" rel="noopener noreferrer"&gt;https://last9.io/blog/how-to-scale-prometheus-remote-write/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Remote Storage Options
&lt;/h2&gt;

&lt;p&gt;A non-exhaustive list of software that supports the Prometheus remote write protocol includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thanos&lt;/li&gt;
&lt;li&gt;VictoriaMetrics&lt;/li&gt;
&lt;li&gt;Splunk&lt;/li&gt;
&lt;li&gt;OpenTSDB&lt;/li&gt;
&lt;li&gt;Kafka&lt;/li&gt;
&lt;li&gt;InfluxDB&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prometheus failing to write to the remote storage
&lt;/h3&gt;

&lt;p&gt;This can be caused by a number of issues:&lt;/p&gt;

&lt;h4&gt;
  
  
  Network connectivity between Prometheus and the remote store
&lt;/h4&gt;

&lt;p&gt;Check if you can reach the remote store using ping or curl.&lt;/p&gt;
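&lt;p&gt;A quick first check from the Prometheus host, using the endpoint from the earlier examples (adjust to yours):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# ICMP reachability - may be blocked even when HTTP works
ping -c 3 192.168.23.4

# HTTP reachability of the write endpoint; an empty POST should at least
# return an HTTP error response (often a 400) rather than a connection
# error or a timeout
curl -v -X POST http://192.168.23.4/api/v1/write
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;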

&lt;h4&gt;
  
  
  If there is a proxy in between, it might be dropping packets or might not be running
&lt;/h4&gt;

&lt;p&gt;Check if the proxy is running. Verify that the proxy configuration as well as the Prometheus &lt;code&gt;remote_write&lt;/code&gt; proxy settings are correct. Check the proxy server's logs for any errors. The proxy might be blocking large packets.&lt;/p&gt;

&lt;h4&gt;
  
  
  Requests are timing out due to network issues
&lt;/h4&gt;

&lt;p&gt;Run a traceroute from your Prometheus server to the remote store to see if packets are being dropped.&lt;/p&gt;
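&lt;p&gt;For example (hostname illustrative; &lt;code&gt;mtr&lt;/code&gt;, if installed, gives a continuously updated view of per-hop packet loss):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;traceroute prometheus-data-store.mydb.io
# or
mtr --report prometheus-data-store.mydb.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;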

&lt;h4&gt;
  
  
  Requests are timing out due to the remote store not being able to keep up
&lt;/h4&gt;

&lt;p&gt;Tune the queue configuration. If this happens suddenly, it's important to find out the root cause.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of metrics might have increased due to autoscaling events or an increase in cardinality.&lt;/li&gt;
&lt;li&gt;The remote store might have disk issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Back up your data in the remote store.&lt;/li&gt;
&lt;li&gt;Add security and authentication between your Prometheus and the remote store server. If your remote store does not support this natively, you can add a proxy like nginx in between and configure it to have
TLS and authentication.&lt;/li&gt;
&lt;li&gt;Monitor your remote store metrics for indications of trouble (see the sample metrics after this list).&lt;/li&gt;
&lt;li&gt;If you are in a regulated industry, ensure that your remote store is compliant with your requirements. E.g. if it's managed by a cloud vendor, ascertain that their security credentials are sufficient for your needs.&lt;/li&gt;
&lt;/ul&gt;
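&lt;p&gt;On the Prometheus side, a few of its built-in remote write metrics are worth alerting on. A non-authoritative sample (metric names can vary across versions; verify against your Prometheus):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Samples waiting in the queues - sustained growth means the store is falling behind
prometheus_remote_storage_samples_pending

# Samples dropped after exhausting retries - should stay at zero
prometheus_remote_storage_samples_failed_total

# Current shard count - pegged at max_shards suggests a tuning or capacity problem
prometheus_remote_storage_shards
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;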

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The remote store functionality in Prometheus offers a scalable and flexible way of adding a dedicated storage backend for Prometheus metrics. You can use the remote store for longer data retention,&lt;br&gt;
better durability, and offline data analysis.&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Deploying Prometheus With Docker</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Tue, 10 Dec 2024 16:28:11 +0000</pubDate>
      <link>https://dev.to/incidenthub/deploying-prometheus-with-docker-b5a</link>
      <guid>https://dev.to/incidenthub/deploying-prometheus-with-docker-b5a</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;There are several ways to deploy the Prometheus monitoring tool in your environment. One of the fastest ways to get started is to deploy it as a Docker container. This guide shows you how to quickly set up a minimal Prometheus on your laptop. You can then extend that setup to add a monitoring dashboard, alerting, and authentication.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
Deploying Prometheus in a Docker Container

&lt;ul&gt;
&lt;li&gt;Basic Setup&lt;/li&gt;
&lt;li&gt;Separating the Configuration&lt;/li&gt;
&lt;li&gt;Making the Data Storage Persistent&lt;/li&gt;
&lt;li&gt;Further Configuration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;li&gt;References&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Deploying Prometheus in a Docker Container
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Basic Setup
&lt;/h3&gt;

&lt;p&gt;Running Prometheus in Docker is very simple with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 9000:9090 prometheus prom/prometheus:v3.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will pull and run the latest version (as of this writing) of Prometheus. You can access the Prometheus UI at localhost:9000. Note that the container port 9090 is forwarded to the localhost port 9000. &lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Useful Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Docker commands, mappings from a resource on your local machine to one in the container follow the order &lt;code&gt;local-resource&lt;/code&gt;:&lt;code&gt;container-resource&lt;/code&gt;. In the command above it's &lt;code&gt;local-port&lt;/code&gt;:&lt;code&gt;container-port&lt;/code&gt;. You can see a similar example in the volume setup below.&lt;/p&gt;




&lt;p&gt;Note that we will be stopping and starting the container many times during this tutorial. Since each &lt;code&gt;docker run&lt;/code&gt; starts a fresh container, any storage and configuration inside the old one is gone. To get around this, we will move the following out of the container onto our local machine, i.e., our laptop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics storage location&lt;/li&gt;
&lt;li&gt;Configuration file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Create a directory called &lt;code&gt;prometheus&lt;/code&gt; with a &lt;code&gt;config&lt;/code&gt; directory inside it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;prometheus
&lt;span class="nb"&gt;cd
mkdir &lt;/span&gt;config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Separating the Configuration
&lt;/h3&gt;

&lt;p&gt;Now create a file called &lt;code&gt;prometheus.yml&lt;/code&gt; inside the &lt;code&gt;config&lt;/code&gt; directory and put this content inside it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;
  &lt;span class="na"&gt;evaluation_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;

&lt;span class="na"&gt;alerting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;alertmanagers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="c1"&gt;# - alertmanager:9093&lt;/span&gt;

&lt;span class="na"&gt;rule_files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# - "alert_rules.yml"&lt;/span&gt;

&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prometheus"&lt;/span&gt;

    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:9090"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;prometheus.yml&lt;/code&gt; file is also available online in the &lt;a href="https://github.com/prometheus/prometheus/blob/main/documentation/examples/prometheus.yml" rel="noopener noreferrer"&gt;Prometheus repo&lt;/a&gt;. It's a bare-bones configuration that scrapes the Prometheus process itself for metrics and nothing more.&lt;/p&gt;

&lt;p&gt;Be careful about YAML formatting. You can use an online tool like &lt;a href="https://www.yamllint.com/" rel="noopener noreferrer"&gt;YAML Lint&lt;/a&gt; to format your YAML file.&lt;/p&gt;

&lt;p&gt;Your directory will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;code/prometheus  &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; tree
&lt;span class="nb"&gt;.&lt;/span&gt;
└── config
    └── prometheus.yml

1 directory, 1 file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Making the Data Storage Persistent
&lt;/h3&gt;

&lt;p&gt;To make the metrics data persistent across container restarts, we will create a Docker volume:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker volume create prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let us run our container again so that it uses the two artifacts that we just created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 9000:9090 &lt;span class="nt"&gt;-v&lt;/span&gt; /home/talonx/code/prometheus/config:/etc/prometheus &lt;span class="nt"&gt;-v&lt;/span&gt; /home/talonx/code/prometheus/data:/prometheus  prom/prometheus:v3.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the config dir is mounted as &lt;code&gt;/etc/prometheus&lt;/code&gt; and the &lt;code&gt;prometheus&lt;/code&gt; volume as &lt;code&gt;/prometheus&lt;/code&gt; inside the running container. Prometheus assumes &lt;code&gt;/etc/prometheus/prometheus.yml&lt;/code&gt; as the default config file location and &lt;code&gt;/prometheus&lt;/code&gt; as the default data directory, so we don't have to do any further configuration here. Note that you have to provide the full path of the config directory on your machine for the bind mount.&lt;/p&gt;

&lt;p&gt;You can verify that this config is working by visiting the UI at &lt;a href="http://localhost:9000" rel="noopener noreferrer"&gt;http://localhost:9000&lt;/a&gt;. To verify that the data volume is working, do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Let the container run for 5 minutes.&lt;/li&gt;
&lt;li&gt;Stop the container.&lt;/li&gt;
&lt;li&gt;Start the container again.&lt;/li&gt;
&lt;li&gt;Visit the UI at &lt;a href="http://localhost:9000/query" rel="noopener noreferrer"&gt;http://localhost:9000/query&lt;/a&gt; and search for a metric, say, &lt;code&gt;process_cpu_seconds_total&lt;/code&gt;. Click &lt;code&gt;Execute&lt;/code&gt; and then select the &lt;code&gt;Graph&lt;/code&gt; tab. If the Docker volume is mounted correctly, you should be able to see metrics going back 5 minutes and more. (The stop/start commands are sketched just after this list.)&lt;/li&gt;
&lt;/ul&gt;
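&lt;p&gt;For reference, a minimal stop/start cycle looks like this (the container ID comes from &lt;code&gt;docker ps&lt;/code&gt;; yours will differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# find the running container's ID or name
docker ps

# stop it, then start the same container again
docker stop &amp;lt;container-id&amp;gt;
docker start &amp;lt;container-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;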

&lt;p&gt;This completes our basic setup of a Prometheus container.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Configuration
&lt;/h3&gt;

&lt;p&gt;You can make further changes to the configuration by editing the &lt;code&gt;config/prometheus.yml&lt;/code&gt; file and restarting your Prometheus container. I recommend committing this file into your source code repository.&lt;/p&gt;
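&lt;p&gt;Before restarting, it is worth validating the edited file. Here is a sketch using the &lt;code&gt;promtool&lt;/code&gt; binary that ships in the same image (the paths assume the directory layout used above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# run promtool from the Prometheus image against the mounted config
docker run --rm -v /home/talonx/code/prometheus/config:/etc/prometheus \
  --entrypoint promtool prom/prometheus:v3.0.0 check config /etc/prometheus/prometheus.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;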

&lt;p&gt;You can run the container in the background by using the &lt;code&gt;-d&lt;/code&gt; flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 9000:9090 &lt;span class="nt"&gt;-v&lt;/span&gt; /home/talonx/code/prometheus/config:/etc/prometheus &lt;span class="nt"&gt;-v&lt;/span&gt; prometheus:/prometheus  prom/prometheus:v3.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Prometheus is an easy-to-set-up metrics collection and monitoring tool. You can try it out using a Docker container. Using a container allows rapid iteration when changing and testing your configuration. In other articles in this series, we will explore how to add authentication, external dashboards, and integrate Prometheus with other alerting systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/prometheus/3.0/getting_started/" rel="noopener noreferrer"&gt;Prometheus 3.0.0 documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/reference/cli/docker/container/" rel="noopener noreferrer"&gt;Docker Container Management commands&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/reference/cli/docker/volume/" rel="noopener noreferrer"&gt;Docker Volume commands&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.yamllint.com/" rel="noopener noreferrer"&gt;YAML Validator&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Social share photo credits: &lt;a href="https://unsplash.com/@theshubhamdhage?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Shubham Dhage&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-group-of-cubes-that-are-connected-to-each-other-R2HtYWs5-QA?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>prometheus</category>
      <category>docker</category>
    </item>
    <item>
      <title>A Beginner's Guide To Service Discovery in Prometheus</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Thu, 05 Dec 2024 04:17:15 +0000</pubDate>
      <link>https://dev.to/incidenthub/a-beginners-guide-to-service-discovery-in-prometheus-3366</link>
      <guid>https://dev.to/incidenthub/a-beginners-guide-to-service-discovery-in-prometheus-3366</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Service discovery (SD) is a mechanism by which the Prometheus tool can discover monitorable targets automatically. Instead of listing every target to be scraped in the Prometheus configuration, service discovery acts as a source of targets that Prometheus can query at runtime.&lt;/p&gt;

&lt;p&gt;Service discovery becomes crucial when there are dynamically changing hosts, especially in microservices architectures and environments like Kubernetes. In Prometheus parlance, service discovery is a way of discovering "scrape targets". &lt;/p&gt;

&lt;p&gt;For example, pods are created dynamically in Kubernetes as a result of new services being deployed and undeployed, autoscaling events, and errors causing pods to crash and go away. If you are using Prometheus to scrape pods in such an environment, Prometheus has to know which pods are running and scrapable at any given point in time. The Kubernetes service discovery plugin enables this. Similarly, there are SD plugins for other common environments.&lt;/p&gt;

&lt;p&gt;You can use service discovery in Prometheus with the predefined plugins, or write your own custom mechanism using file-based or HTTP-based discovery, depending on the situation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
Types of Prometheus Service Discovery

&lt;ul&gt;
&lt;li&gt;Predefined Mechanisms in Prometheus&lt;/li&gt;
&lt;li&gt;Custom Service Discovery in Prometheus, or Writing Your Own&lt;/li&gt;
&lt;li&gt;HTTP based service discovery&lt;/li&gt;
&lt;li&gt;File based service discovery&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Configuring Service Discovery in Prometheus

&lt;ul&gt;
&lt;li&gt;Basic Syntax&lt;/li&gt;
&lt;li&gt;Target Relabeling and Filtering&lt;/li&gt;
&lt;li&gt;Verifying Your Configuration&lt;/li&gt;
&lt;li&gt;Handling Secrets&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Combining Multiple Service Discovery Mechanisms&lt;/li&gt;

&lt;li&gt;

Troubleshooting Service Discovery

&lt;ul&gt;
&lt;li&gt;Prometheus failing to scrape some or all targets&lt;/li&gt;
&lt;li&gt;Target list is not refreshed, or Prometheus is not scraping new targets, or Prometheus is attempting to scrape dead targets&lt;/li&gt;
&lt;li&gt;Wrong labels are showing up in metrics, or not showing up at all&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Types of Prometheus Service Discovery
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Predefined Mechanisms in Prometheus
&lt;/h3&gt;

&lt;p&gt;Prometheus has out of the box support for discovering scrape targets for many popular environments, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Web Services (EC2 instances)&lt;/li&gt;
&lt;li&gt;Azure (Azure VMs)&lt;/li&gt;
&lt;li&gt;Consul&lt;/li&gt;
&lt;li&gt;Digital Ocean&lt;/li&gt;
&lt;li&gt;DNS&lt;/li&gt;
&lt;li&gt;Google Cloud Platform (Google Compute Engine VMs)&lt;/li&gt;
&lt;li&gt;Hetzner&lt;/li&gt;
&lt;li&gt;Kubernetes&lt;/li&gt;
&lt;li&gt;Linode&lt;/li&gt;
&lt;li&gt;OpenStack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This list is not exhaustive. For the full list, see the &lt;a href="https://github.com/prometheus/prometheus/tree/main/discovery" rel="noopener noreferrer"&gt;Prometheus GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Custom Service Discovery in Prometheus, or Writing Your Own
&lt;/h3&gt;

&lt;p&gt;You may have infrastructure or application endpoints that cannot be discovered by the standard mechanisms. In such cases you can write your own. There are two options available.&lt;/p&gt;

&lt;h4&gt;
  
  
  HTTP based service discovery
&lt;/h4&gt;

&lt;p&gt;You can write an HTTP-based mechanism and return the scrape target information in response to Prometheus' GET requests. Prometheus will perform a GET request periodically - by default every minute. This periodic request is made so that Prometheus has the latest list of targets. You can see this as a configurable parameter in the standard SD configurations of AWS and others, and you can also include it in your SD configuration as &lt;code&gt;refresh_interval&lt;/code&gt;. Note that this interval is different from the &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration-file" rel="noopener noreferrer"&gt;scrape_interval&lt;/a&gt;, which is used by Prometheus to scrape the targets themselves.&lt;/p&gt;

&lt;p&gt;There are a few basic requirements for HTTP service discovery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Response should be in JSON with the correct HTTP &lt;code&gt;Content-Type&lt;/code&gt; header.&lt;/li&gt;
&lt;li&gt;The content must be in &lt;code&gt;UTF-8&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If authentication is required, it can be Basic auth, an Authorization header, or OAuth 2.0. You would typically not need authentication if the endpoint is on your internal network or part of your own applications.&lt;/li&gt;
&lt;li&gt;If there are no scrape targets, the endpoint should return an empty list.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A sample configuration for an HTTP service discovery mechanism can look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;http_sd_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://192.168.2.34/api/internal/hosts'&lt;/span&gt;
  &lt;span class="na"&gt;refresh_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
  &lt;span class="na"&gt;http_headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Purpose"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prometheus-scraper"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Internally, your HTTP endpoint would query a database or inventory to fetch the list of targets and return them.&lt;/p&gt;
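&lt;p&gt;The response body is a JSON list of target groups. A minimal sketch (addresses and labels illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;[
  {
    "targets": ["10.0.1.12:9100", "10.0.1.13:9100"],
    "labels": {
      "__meta_datacenter": "dc1",
      "env": "production"
    }
  }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;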

&lt;h4&gt;
  
  
  File based service discovery
&lt;/h4&gt;

&lt;p&gt;File-based service discovery is another alternative if you need to provide a custom list of scrape targets. To do this, you can create a file and list your scrape targets in it.&lt;br&gt;
It is important to note that this is also a dynamic mechanism, like HTTP service discovery. Prometheus will check for changes to the file at periodic intervals. This interval is configured with the &lt;code&gt;refresh_interval&lt;/code&gt; key, just as with the other mechanisms. The default is 5 minutes.&lt;/p&gt;

&lt;p&gt;Requirements for file-based service discovery (a sample targets file follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Files can be in JSON or YAML.&lt;/li&gt;
&lt;li&gt;You can specify a pattern to match multiple files. This is helpful if you wish to keep your scrape targets grouped logically across separate files.&lt;/li&gt;
&lt;li&gt;Malformed JSON or YAML files are ignored, so ensure that they conform to the &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#file_sd_config" rel="noopener noreferrer"&gt;required format&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;
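&lt;p&gt;Putting those requirements together, a targets file in YAML form might look like this (path and labels illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# /etc/prometheus/external/targets/web.yml (hypothetical path)
- targets:
  - "10.0.1.12:9100"
  - "10.0.1.13:9100"
  labels:
    env: "production"
    team: "web"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;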

&lt;p&gt;In the Prometheus configuration, you can specify it as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;file_sd_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/etc/prometheus/external/targets/*.yml"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/opt/monitoring/targets/prod-*.yml"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/data/dynamic-targets-[0-9]*.yaml"&lt;/span&gt;

  &lt;span class="na"&gt;refresh_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuring Service Discovery in Prometheus
&lt;/h2&gt;

&lt;p&gt;Like everything else, service discovery configurations go into the configuration file, which is &lt;code&gt;prometheus.yml&lt;/code&gt; by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Syntax
&lt;/h3&gt;

&lt;p&gt;For predefined SD mechanisms, the YAML key is &lt;code&gt;x_sd_configs&lt;/code&gt; (a list), where x is the environment name; it goes under a scrape job in &lt;code&gt;scrape_configs&lt;/code&gt;. You can find the complete list in the &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration-file" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;br&gt;
Each mechanism has a set of common keys like &lt;code&gt;refresh_interval&lt;/code&gt;, and then keys specific to the environment.&lt;/p&gt;

&lt;p&gt;Here is an example AWS config which generates a dynamic list of node exporter scrape targets for EC2 VMs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AWS Region Configuration&lt;/span&gt;
&lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2"&lt;/span&gt;
&lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://ec2.us-west-2.amazonaws.com"&lt;/span&gt;

&lt;span class="c1"&gt;# AWS Authentication (using role ARN in this example)&lt;/span&gt;
&lt;span class="na"&gt;role_arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789012:role/PrometheusServiceDiscovery"&lt;/span&gt;

&lt;span class="na"&gt;refresh_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;300s&lt;/span&gt;
&lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9100&lt;/span&gt;  &lt;span class="c1"&gt;# Default port for node_exporter&lt;/span&gt;

&lt;span class="c1"&gt;# EC2 Instance Filters&lt;/span&gt;
&lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag:Environment"&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instance-state-name"&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag:Service"&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vpc-id"&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vpc-0abc123def456789"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;follow_redirects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;enable_http2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the HTTP- and file-based mechanisms, the syntax is similar and much simpler. Refer to the sections above for samples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Target Relabeling and Filtering
&lt;/h3&gt;

&lt;p&gt;Target relabeling is a technique applied to the labels of a target (machine, pod, endpoint, etc.) before it is scraped. Labels are key-value pairs attached to a metric that let us categorize the metric. Note that target relabeling can be used for static scrape configurations too, not just SD-based ones.&lt;/p&gt;

&lt;p&gt;E.g. &lt;code&gt;code&lt;/code&gt; is the label in the following metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;promhttp_metric_handler_requests_total&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"200"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since target relabeling is applied before scraping happens, we can use it to filter out targets we don't care about, and also to modify labels.&lt;/p&gt;

&lt;p&gt;An example use case of modifying labels in AWS is to scrape the public IP address of the instance instead of the private one. By default, the private IP address is used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ec2-instances'&lt;/span&gt;
    &lt;span class="na"&gt;ec2_sd_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-west-2&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9100&lt;/span&gt;
        &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instance-state-name"&lt;/span&gt;
            &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

    &lt;span class="na"&gt;relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Drop targets without public IP addresses&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_ec2_public_ip&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;drop&lt;/span&gt;

      &lt;span class="c1"&gt;# Use public IP instead of private IP&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_ec2_public_ip&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;__address__&lt;/span&gt;
        &lt;span class="na"&gt;replacement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${1}:9100'&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;replace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lists running instances only.&lt;/li&gt;
&lt;li&gt;Drops instances without a public IP.&lt;/li&gt;
&lt;li&gt;Sets the &lt;code&gt;__address__&lt;/code&gt; label on the target to point to the public IP and the node exporter port (9100).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;__address__&lt;/code&gt; is a special label used by Prometheus to determine the final address and port to scrape for a target.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;__meta_&lt;/code&gt; prefix indicates special labels provided by the SD plugin. It's a way of bringing metadata from your cloud provider (or other environment) into your metric labels.&lt;/p&gt;

&lt;p&gt;Here is another example for Google Cloud illustrating the second point about &lt;code&gt;__meta_&lt;/code&gt; labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;
    &lt;span class="s"&gt;honor_labels&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="s"&gt;gce_sd_configs&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-platform-a&lt;/span&gt;
        &lt;span class="na"&gt;zone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-eastl1-a&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9100&lt;/span&gt;
  &lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="na"&gt;relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_gce_label_cloud_provider&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloud_provider&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_gce_label_cloud_zone&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloud_zone&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_gce_label_cloud_app&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloud_app&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_gce_label_team&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_gce_instance_name&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;instance&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verifying Your Configuration
&lt;/h3&gt;

&lt;p&gt;Run your configuration using a YAML linter first. If there are no errors, run Prometheus with the configuration and check for the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are you seeing metrics from the intended targets?&lt;/li&gt;
&lt;li&gt;Do the metrics have the correct labels?&lt;/li&gt;
&lt;li&gt;When you add or remove a target (pod, host, etc), does it reflect in your metrics?&lt;/li&gt;
&lt;/ul&gt;
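
&lt;p&gt;Beyond a plain YAML lint, Prometheus ships with the promtool utility, which validates the semantics of the whole configuration, not just the syntax. A minimal check, assuming the common config location, looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Validates scrape configs, SD configs, and referenced rule files
promtool check config /etc/prometheus/prometheus.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;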

&lt;h3&gt;
  
  
  Handling Secrets
&lt;/h3&gt;

&lt;p&gt;In the above AWS example, we could have used AWS API keys instead of a role ARN. However, your configuration file should be stored in a source code repository, and you obviously don't want to commit the keys along with it. There are different options for handling this, depending on your deployment infrastructure.&lt;/p&gt;

&lt;p&gt;E.g. If you are using Kubernetes, you can use Helm with the &lt;a href="https://github.com/jkroepke/helm-secrets" rel="noopener noreferrer"&gt;helm-secrets&lt;/a&gt; plugin to deploy Prometheus. Helm will seamlessly decrypt the secrets and place them in the final rendered version of your Prometheus deployment.&lt;/p&gt;
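
&lt;p&gt;As a rough illustration, a deployment command could look like the following. This is a minimal sketch: the SOPS-encrypted secrets.yaml and the prometheus-community chart are assumptions, not part of any particular setup.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# secrets.yaml is SOPS-encrypted; the helm-secrets plugin decrypts it
# transparently before rendering the chart
helm secrets upgrade --install prometheus prometheus-community/prometheus \
  -f values.yaml -f secrets.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;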

&lt;h2&gt;
  
  
  Combining Multiple Service Discovery Mechanisms
&lt;/h2&gt;

&lt;p&gt;You can add as many SD configurations as you want to a single Prometheus configuration. Some example setups could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple cloud vendors: Watch out for cross-cloud access in such cases, where you have to deal with both encryption of in-transit data and authentication. A better option here is to run one Prometheus in each cloud account or environment.&lt;/li&gt;
&lt;li&gt;Multiple regions or zones with the same cloud vendor: Here too, you might find yourself dealing with data transfer costs between regions. A full discussion of this topic is beyond the scope of this article.&lt;/li&gt;
&lt;li&gt;Hybrid environments, such as your on-premises VMs alongside your cloud vendor's instances (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Cloud-native deployments, such as Kubernetes alongside virtual machines from the same cloud vendor.&lt;/li&gt;
&lt;/ul&gt;
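
&lt;p&gt;As a sketch of the hybrid case, a single configuration can run one job per environment. The job names and the file path below are illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape_configs:
  # Cloud VMs discovered via the EC2 plugin
  - job_name: 'ec2-nodes'
    ec2_sd_configs:
      - region: us-west-2
        port: 9100

  # On-premises VMs listed in files maintained by your inventory tooling
  - job_name: 'onprem-nodes'
    file_sd_configs:
      - files:
          - "/etc/prometheus/onprem-targets/*.yml"
        refresh_interval: 120s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;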

&lt;h2&gt;
  
  
  Troubleshooting Service Discovery
&lt;/h2&gt;

&lt;p&gt;The first sign that your SD configuration is not working - either partially or at all - is missing metrics. Let's look at a few common issues and how to troubleshoot them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus failing to scrape some or all targets
&lt;/h3&gt;

&lt;p&gt;Check for any error messages on the targets page of your Prometheus dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://prom-ip:prom-port/targets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Failed targets will be marked "Down" in red. The error message should give you an idea of why a target could not be scraped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut6q02yi5jrqyti68ytg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut6q02yi5jrqyti68ytg.png" alt="Prometheus scrape error" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Target list is not refreshed, or Prometheus is not scraping new targets, or Prometheus is attempting to scrape dead targets
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If it's a custom SD mechanism like HTTP or file, check whether the endpoint is able to fetch data from your database or inventory systems (see the check after this list). Prometheus can only scrape what your SD endpoint provides.&lt;/li&gt;
&lt;li&gt;If it's an inbuilt SD mechanism like AWS or GCP, check whether your cloud credentials are correct and whether the refresh_interval is reasonable.&lt;/li&gt;
&lt;li&gt;Check that your filters are correct and not dropping valid targets.&lt;/li&gt;
&lt;/ul&gt;
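
&lt;p&gt;For a custom HTTP SD endpoint, a quick sanity check is to fetch it directly and confirm it returns the JSON target list you expect. The endpoint URL here is an illustrative assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Should return a JSON array of {"targets": [...], "labels": {...}} objects
curl -s http://sd.internal:8080/targets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;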

&lt;h3&gt;
  
  
  Wrong labels are showing up in metrics, or not showing up at all
&lt;/h3&gt;

&lt;p&gt;This is usually a problem with the &lt;code&gt;relabel_configs&lt;/code&gt; section. If you have multiple relabeling rules, remove all of them except the first one. If that works, add them back one by one until you hit the problematic rule.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Service discovery in Prometheus is a powerful way of discovering scrape targets in dynamic environments. It gives you the flexibility to use in-built plugins for common cloud providers and environments, or to write your own custom plugin for your systems.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>observability</category>
      <category>devops</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>The Ultimate List of Incident Management Tools in 2024</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Sun, 27 Oct 2024 11:13:04 +0000</pubDate>
      <link>https://dev.to/incidenthub/the-ultimate-list-of-incident-management-tools-in-2024-4m16</link>
      <guid>https://dev.to/incidenthub/the-ultimate-list-of-incident-management-tools-in-2024-4m16</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Incident management tools are important for organizations to handle service outages effectively. With so many tools around, each with a different feature set, it's often difficult to find the one that is right for your needs. In this article, we list incident management software available in 2024, along with their features, to help you arrive at the right one.&lt;/p&gt;

&lt;p&gt;We have focused mostly on tools that offer incident management capabilities - which include at least incident lifecycle management, on-call scheduling, and third-party integrations. &lt;/p&gt;

&lt;p&gt;There are many good tools which are focused only on incident response, or on monitoring and generating alerts, or on the ticketing aspect of incidents. We have not included those to avoid cluttering this article. &lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Using an Incident Management Tool
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An incident management tool streamlines the incident management process by helping to define and automate workflows. It can help you create runbooks, alerting and escalation policies, and define and manage on-call schedules.&lt;/li&gt;
&lt;li&gt;Incident management software often comes with integrations for your observability stack, which is a key source of incidents. It can also integrate with your existing &lt;a href="https://blog.incidenthub.cloud/The-Rising-Role-of-Slack-in-Incident-Management" rel="noopener noreferrer"&gt;communication&lt;/a&gt; and collaboration tools to provide real-time updates.&lt;/li&gt;
&lt;li&gt;Some incident management tools add context to your incident analysis by pulling in data from your infrastructure, applications, and observability systems. 
This can help in narrowing down the root cause.&lt;/li&gt;
&lt;li&gt;Incident management tools can provide analytics which can be used to gain insights into patterns and performance to create a culture of continuous improvement.&lt;/li&gt;
&lt;li&gt;An incident management tool can also provide audit trails and standardized documentation for compliance requirements.&lt;/li&gt;
&lt;li&gt;Some tools have public and private &lt;a href="https://blog.incidenthub.cloud/Best-Practices-Choosing-Status-Page-Provider" rel="noopener noreferrer"&gt;status pages&lt;/a&gt; so that stakeholders can get more visibility into the process.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  List of Incident Management Tools in 2024
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;a href="https://www.pagerduty.com" rel="noopener noreferrer"&gt;PagerDuty&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alerting over multiple channels including phone, app, email&lt;/li&gt;
&lt;li&gt;On-call management - scheduling, roster management, overrides&lt;/li&gt;
&lt;li&gt;Rule definitions for alert routing&lt;/li&gt;
&lt;li&gt;Integrations with most common tools&lt;/li&gt;
&lt;li&gt;APIs for incident lifecycle management&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;li&gt;Support for teams with role-based permissions&lt;/li&gt;
&lt;li&gt;Integration with ITSM tools&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;li&gt;Single sign-on&lt;/li&gt;
&lt;li&gt;Maintenance mode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PagerDuty is best for large enterprises requiring comprehensive incident management, although it can be used by smaller teams too.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;a href="https://www.atlassian.com/software/opsgenie/features" rel="noopener noreferrer"&gt;Opsgenie&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alerting over multiple channels including phone, app, email&lt;/li&gt;
&lt;li&gt;On-call scheduling, management, overrides, and escalation policies&lt;/li&gt;
&lt;li&gt;Ability to add contextual information to alerts&lt;/li&gt;
&lt;li&gt;Custom actions for alerts like executing a script&lt;/li&gt;
&lt;li&gt;Automatic actions like running playbooks&lt;/li&gt;
&lt;li&gt;Third-party integrations&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;li&gt;Single sign-on&lt;/li&gt;
&lt;li&gt;Maintenance mode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Opsgenie is suited for ops teams that need sophisticated alerting.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;a href="https://www.splunk.com/en_us/products/on-call.html" rel="noopener noreferrer"&gt;Splunk On-Call&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call schedules and overrides&lt;/li&gt;
&lt;li&gt;Role-based permissions&lt;/li&gt;
&lt;li&gt;Rules engine for triggering custom actions&lt;/li&gt;
&lt;li&gt;Incident waiting rooms to reduce alert fatigue&lt;/li&gt;
&lt;li&gt;Maintenance mode&lt;/li&gt;
&lt;li&gt;Notifications via email, phone, SMS, and app push&lt;/li&gt;
&lt;li&gt;Third-party integrations with many common tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Splunk On-Call, formerly VictorOps, is best suited for teams already using Splunk for monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;a href="https://grafana.com/products/cloud/oncall/" rel="noopener noreferrer"&gt;Grafana OnCall&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open source and also has a managed solution&lt;/li&gt;
&lt;li&gt;Alert grouping&lt;/li&gt;
&lt;li&gt;Escalation policies&lt;/li&gt;
&lt;li&gt;Alert routing&lt;/li&gt;
&lt;li&gt;Calendar-based on-call schedule and roster&lt;/li&gt;
&lt;li&gt;Maintenance mode&lt;/li&gt;
&lt;li&gt;Integrations with common third-party tools&lt;/li&gt;
&lt;li&gt;Role based access control&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grafana OnCall works seamlessly with other Grafana Cloud products, so it is best suited for teams already using Grafana for monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;a href="https://www.servicenow.com/products/incident-management.html" rel="noopener noreferrer"&gt;ServiceNow&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling with overrides&lt;/li&gt;
&lt;li&gt;Supports multiple notification channels&lt;/li&gt;
&lt;li&gt;Automated ticket routing&lt;/li&gt;
&lt;li&gt;SLA tracking&lt;/li&gt;
&lt;li&gt;Compliance and governance features&lt;/li&gt;
&lt;li&gt;Integrations with many third-party tools&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's best suited for organizations using ServiceNow products like ITSM.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. &lt;a href="https://www.ilert.com/product/on-call-management-escalations" rel="noopener noreferrer"&gt;iLert&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call schedules and escalation policies&lt;/li&gt;
&lt;li&gt;Notifications using SMS, push, voice call&lt;/li&gt;
&lt;li&gt;Maintenance support&lt;/li&gt;
&lt;li&gt;Critical phone call routing using customizable multi-language IVR&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;li&gt;Integrations with MS Teams and Slack for chatops-based incident management&lt;/li&gt;
&lt;li&gt;Integrates with most common tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;iLert is best suited for mid-sized Ops teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. &lt;a href="https://incident.io/" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling and escalations, with overrides&lt;/li&gt;
&lt;li&gt;Notifications with app push, phone, email, Slack, MS Teams&lt;/li&gt;
&lt;li&gt;Incident lifecycle management from within Slack&lt;/li&gt;
&lt;li&gt;Private incidents support&lt;/li&gt;
&lt;li&gt;API for integration and data access&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;li&gt;Third-party integrations&lt;/li&gt;
&lt;li&gt;Integrates with CRM systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;incident.io focuses on being an incident management platform with a Slack-first approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. &lt;a href="https://firehydrant.com/" rel="noopener noreferrer"&gt;FireHydrant&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call management&lt;/li&gt;
&lt;li&gt;Notifications on app push, Slack, Whatsapp&lt;/li&gt;
&lt;li&gt;Runbooks&lt;/li&gt;
&lt;li&gt;Service catalog&lt;/li&gt;
&lt;li&gt;Incident retrospectives&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;li&gt;Integrates with most common tools&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FireHydrant with its strong incident workflows and retrospectives is best suited for SRE teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. &lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling, escalation policies, and overrides&lt;/li&gt;
&lt;li&gt;Integrations with common tools&lt;/li&gt;
&lt;li&gt;Live call routing to connect to on-call folks directly&lt;/li&gt;
&lt;li&gt;Alert classification and routing rules&lt;/li&gt;
&lt;li&gt;Auto-pause flapping alerts&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;li&gt;Manage incidents directly from Slack&lt;/li&gt;
&lt;li&gt;Runbooks&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Squadcast is meant for modern SRE and Ops teams with its alert routing, post-mortem support, and chatops features.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. &lt;a href="https://betterstack.com/" rel="noopener noreferrer"&gt;Better Stack&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling and escalation policies&lt;/li&gt;
&lt;li&gt;Incident grouping&lt;/li&gt;
&lt;li&gt;Status pages&lt;/li&gt;
&lt;li&gt;Integrations with common tools&lt;/li&gt;
&lt;li&gt;Single-sign on&lt;/li&gt;
&lt;li&gt;Teams support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Better Stack is a suite of products that also includes monitoring and logging, but we felt it belongs in this list because of its integrated on-call features.&lt;/p&gt;

&lt;h3&gt;
  
  
  11. &lt;a href="https://rootly.com/on-call" rel="noopener noreferrer"&gt;Rootly&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Key Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call scheduling, escalation policies, and overrides&lt;/li&gt;
&lt;li&gt;Alert grouping based on time-window and on content&lt;/li&gt;
&lt;li&gt;Integrates with many third-party tools&lt;/li&gt;
&lt;li&gt;Playbooks&lt;/li&gt;
&lt;li&gt;Support for managing the incident lifecycle directly from Slack&lt;/li&gt;
&lt;li&gt;Retrospectives with automatic data capture and sync with Jira&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rootly specializes in automating incident workflows with strong integration capabilities and customizable playbooks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing an incident management tool involves looking at many aspects including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Features - Instead of looking at the number of features, list down the ones you actually need for your team and evaluate based on that.&lt;/li&gt;
&lt;li&gt;Cost - Incident Management is a key part of your business operations, so you also need to forecast future costs if your team or infrastructure is growing.&lt;/li&gt;
&lt;li&gt;Customer support - Your incident management systems' reliability needs to be top-notch. However, incidents happen, even in incident management software, so make sure they have great customer support.&lt;/li&gt;
&lt;li&gt;Integration capabilities - Your team might be using &lt;a href="https://blog.incidenthub.cloud/The-Benefits-of-a-Single-Incident-Management-System" rel="noopener noreferrer"&gt;multiple observability tools&lt;/a&gt;, either third-party or custom or both. Any incident management tool should be able to integrate well with your existing stack as well as with your communication and collaboration tools.&lt;/li&gt;
&lt;li&gt;Reports - Metrics and analytics are invaluable for figuring out trends in your outages and where to focus for improvement.&lt;/li&gt;
&lt;li&gt;Flexibility in scheduling - Easy roster setup and overrides are a must.&lt;/li&gt;
&lt;li&gt;Alignment with your regulatory requirements, if any.&lt;/li&gt;
&lt;li&gt;Documentation/knowledge base integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose the tool that is right for you and your team - which may not necessarily be the one that everybody else is using because it's the "best".&lt;/p&gt;

&lt;p&gt;Photo credits: &lt;a href="https://unsplash.com/photos/a-control-room-with-a-desk-and-two-chairs-p7Bfwn_VKRQ?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Miha Meglic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published on the &lt;a href="https://blog.incidenthub.cloud/The-Ultimate-List-of-Incident-Management-Tools-in-2024" rel="noopener noreferrer"&gt;IncidentHub blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sitereliabilityengineering</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>incidentmanagement</category>
    </item>
    <item>
      <title>Best Practices for Choosing a Status Page Provider</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Tue, 15 Oct 2024 03:04:07 +0000</pubDate>
      <link>https://dev.to/incidenthub/best-practices-for-choosing-a-status-page-provider-3p4c</link>
      <guid>https://dev.to/incidenthub/best-practices-for-choosing-a-status-page-provider-3p4c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Downtime is inevitable but what sets successful businesses apart is how they handle it. A key part of incident management is incident communication with both internal and external stakeholders. A status page is a crucial tool for maintaining clear communication with users during outages or service interruptions. There are numerous status page providers available with different features. This article will guide you through best practices for selecting a provider that suits your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Importance of a Status Page
&lt;/h2&gt;

&lt;p&gt;An internal status page allows your colleagues and stakeholders in your organization to get a &lt;a href="https://incident.io/blog/internal-status-pages" rel="noopener noreferrer"&gt;snapshot of the current status&lt;/a&gt;. It can help reduce unnecessary back and forth between teams, and help people prioritize their work better. It also creates internal transparency and trust between teams.&lt;/p&gt;

&lt;p&gt;An external status page is crucial if you are committed to open communication with your end users or customers. Whether you are B2B or B2C, a public status page is the first thing people will check if they face issues. Being open about incidents and your efforts to mitigate them builds user trust. It can also decrease support ticket volume during incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Factors to Consider When Choosing a Status Page Provider
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Reliability
&lt;/h3&gt;

&lt;p&gt;Your status page needs to be accessible especially when your main services are down. Your provider should offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A reasonable uptime SLA&lt;/li&gt;
&lt;li&gt;Globally distributed infrastructure for high availability&lt;/li&gt;
&lt;li&gt;Redundant systems to ensure failover and availability&lt;/li&gt;
&lt;li&gt;Scalability to handle increased traffic during major incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Customization Options
&lt;/h3&gt;

&lt;p&gt;Prioritize providers that offer customization options.&lt;/p&gt;

&lt;h4&gt;
  
  
  Functional customization
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Support for components - This is important if your product/platform has many services and is served from many independent locations. Each such service/location should be a component in the status page so that you can publish incident updates only against the affected components.&lt;/li&gt;
&lt;li&gt;Support for different types of events - At least maintenance events, informational events, and incidents should be supported.&lt;/li&gt;
&lt;li&gt;Localization options - If you have customers distributed across the globe, you will want to serve locale-specific pages in different languages.&lt;/li&gt;
&lt;li&gt;Ability to update older entries - As new information flows in during an incident, you might want to update previously published information like the title or the affected components for completeness.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Branding
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Your status page should reflect your brand. Look for a provider that allows you to customize your status page with your brand's logo and color scheme.&lt;/li&gt;
&lt;li&gt;Custom domain support - Instead of serving the status page from the provider's domain you should be able to host it on your own domain - e.g. status.mydomain.com&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Integration Capabilities
&lt;/h3&gt;

&lt;p&gt;Efficient incident management requires easy tool integration. At the very least you should look for&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.statuspal.io/blog/why-use-a-status-page-api-and-best-alternatives" rel="noopener noreferrer"&gt;API access&lt;/a&gt; for automating the incident management updates that you will publish&lt;/li&gt;
&lt;li&gt;Integration with your &lt;a href="https://blog.incidenthub.cloud/The-Benefits-of-a-Single-Incident-Management-System" rel="noopener noreferrer"&gt;monitoring and alerting&lt;/a&gt; tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the consumer end, i.e. for the people who will view your status page, it's good to have integration capabilities like webhooks, REST APIs, Slack, text messages, etc., so that they can plug updates into the systems they want.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Reporting and Analytics
&lt;/h3&gt;

&lt;p&gt;Data-driven insights can help improve your incident response and post-mortem sessions. Choose a provider which offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed incident history with configurable retention. The entire history need not be displayed on the page, but it should be available to your internal teams for analysis.&lt;/li&gt;
&lt;li&gt;Metrics and trends - Metrics can help you pinpoint services that need extra attention from your teams.&lt;/li&gt;
&lt;li&gt;Customizable reports for stakeholders. This is mostly useful for internal stakeholders in your organization.&lt;/li&gt;
&lt;li&gt;Page traffic - Some providers offer analytics to help you understand how often users check your status page and what they're viewing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. User Management and Permissions
&lt;/h3&gt;

&lt;p&gt;For larger organizations, granular access control is important. Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Role-based access control (RBAC).&lt;/li&gt;
&lt;li&gt;Multi-user support.&lt;/li&gt;
&lt;li&gt;Audit logs for accountability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Mobile Support
&lt;/h3&gt;

&lt;p&gt;In our mobile-first world, ensure your provider offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Responsive design for all devices.&lt;/li&gt;
&lt;li&gt;SMS and email notification options.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7. Customer Support
&lt;/h3&gt;

&lt;p&gt;When issues arise with the status page, prompt support is essential. Choose providers that have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear SLA - Review the provider's SLA to ensure they meet your uptime and response time expectations.&lt;/li&gt;
&lt;li&gt;24/7 customer support.&lt;/li&gt;
&lt;li&gt;Multiple support channels (chat, email, phone).&lt;/li&gt;
&lt;li&gt;Comprehensive documentation and notifications about updates to the status page format or APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Best Practices for Implementing Your Status Page
&lt;/h2&gt;

&lt;p&gt;Once you've chosen a provider, follow these best practices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Timely updates: Keep your status page updated with correct information. An internal status page should be the first reference point for other teams to check the current status.&lt;/li&gt;
&lt;li&gt;Be proactive: Communicate scheduled maintenance in advance and note which systems will be affected.&lt;/li&gt;
&lt;li&gt;Use plain language: Avoid technical jargon in your updates as much as possible.&lt;/li&gt;
&lt;li&gt;Provide context: Explain the impact of incidents on the end user experience. Users are interested in how an incident affects them or their work before anything else.&lt;/li&gt;
&lt;li&gt;Offer workarounds if available.&lt;/li&gt;
&lt;li&gt;Learn: Use incident data to enhance your systems and processes by feeding incident metrics and trends back into your post-mortems. This can help in building a culture of continuous improvement.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  A Note About Internal vs External Status Pages
&lt;/h2&gt;

&lt;p&gt;Internal status pages are available for viewing only by your organization's members. External status pages are available for viewing by everybody, including your customers, users, and the general public.&lt;/p&gt;

&lt;p&gt;If it's an internal status page, the kind of updates you publish would be different from that of an external status page. Your internal stakeholders are part of the same organization, so you can &lt;br&gt;
publish more internal, technical details. Although it's important to include specific technical details in the post mortem report for public pages also, you have to be careful not to publish internal system details which might compromise security. Also note that publishing expected times of resolution &lt;a href="https://firehydrant.com/blog/hot-take-dont-provide-incident-resolution-estimates/" rel="noopener noreferrer"&gt;can backfire&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing the right status page provider is a key decision that will affect your communication strategy during critical moments. Select a provider that not only meets your current needs but can also grow with your business. A status page reflects your commitment to transparency, so make sure you invest time in choosing the provider that is right for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ivbeg/awesome-status-pages" rel="noopener noreferrer"&gt;Here is a list&lt;/a&gt; of status page related software and services.&lt;/p&gt;

&lt;p&gt;This article was originally published on the &lt;a href="https://blog.incidenthub.cloud/Best-Practices-Choosing-Status-Page-Provider" rel="noopener noreferrer"&gt;IncidentHub blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>statuspage</category>
      <category>sre</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>When Alerts Don’t Mean Downtime - Preventing SRE Fatigue</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Thu, 12 Sep 2024 02:44:55 +0000</pubDate>
      <link>https://dev.to/incidenthub/when-alerts-dont-mean-downtime-preventing-sre-fatigue-4ne7</link>
      <guid>https://dev.to/incidenthub/when-alerts-dont-mean-downtime-preventing-sre-fatigue-4ne7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A recent question in an SRE forum triggered this train of thought.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do I deal with alerts that are triggered by internal patching/release activities but don't actually cause a downtime? If we react to these alerts we might not have time to react to actual alerts that are affecting customers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've paraphrased the question to reflect its essence. There is plenty to unravel here.&lt;/p&gt;

&lt;p&gt;My first reaction to this question was that the SRE who posted this is in a difficult place with systemic issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Systemic Issues
&lt;/h2&gt;

&lt;p&gt;Without knowing more about the org and their alerting policies, let's look at what we can dig out from this question alone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Patches/deployments trigger alerts&lt;/li&gt;
&lt;li&gt;The team does not react to such alerts to avoid spending valuable time that can be directed towards solving downtime that is affecting customers&lt;/li&gt;
&lt;li&gt;There is cognitive overhead of selectively reacting to some alerts, and ignoring others&lt;/li&gt;
&lt;li&gt;The knowledge of which alerts to react to is something only the SRE team knows&lt;/li&gt;
&lt;li&gt;Any MTTx data from such a setup are useless&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The eventual impact is sub-optimal incident management, which in turn affects SLAs and burns out on-call folks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improving the SRE Experience
&lt;/h2&gt;

&lt;p&gt;How would you approach fixing something like this?&lt;/p&gt;

&lt;p&gt;Some thoughts, in no particular order:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Setting the correct priority for alerts - Anything that affects customer perception of uptime, or can lead to data loss, is a P1. In larger organizations with independent teams responsible for their own microservices, I would extend the &lt;a href="https://www.linkedin.com/pulse/your-first-customer-team-hrishikesh-barua/" rel="noopener noreferrer"&gt;definition of customer&lt;/a&gt; to any team in your org that depends on your service(s). If you are responsible for an API used by a downstream service, they are your customers too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Zero-downtime deployments - This is not as hard as it sounds if you design your systems with this goal in mind. For stateless web applications it is trivial to switch to a new version behind a load balancer. For stateful applications it can take a bit more work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maintenance mode - This can fall into two categories - maintenance mode that has to be communicated to the customer, and maintenance mode that is internal - affecting other teams who consume your service. At the alerting level, you temporarily silence the specific alerts that will get triggered by the rollout (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Investigate all alerts and disable useless ones - Not looking at an alert creates indeterminism and can lead to alert fatigue. The &lt;a href="https://blog.incidenthub.cloud/The-Benefits-of-a-Single-Incident-Management-System" rel="noopener noreferrer"&gt;alerting system&lt;/a&gt; should be the single source of truth.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
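
&lt;p&gt;As an illustration of silencing alerts for a rollout window: if your stack uses Prometheus Alertmanager, you can create a silence ahead of time with amtool. This is a minimal sketch; the matcher labels and the URL are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Silence alerts matching these labels for one hour during the deploy
amtool silence add alertname="InstanceDown" service="checkout" \
  --duration="1h" --comment="planned rollout" \
  --alertmanager.url="http://localhost:9093"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;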

&lt;p&gt;Solving such issues has to be a team effort that involves the dev teams too. You can start by making customer-facing uptime and a sustainable on-call process the priorities.&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@cdc?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;CDC&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/man-in-black-and-white-checkered-dress-shirt-using-computer-_XLJy3h77cw?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>monitoring</category>
      <category>incidentresponse</category>
    </item>
    <item>
      <title>14 Monitoring Tools for Full-Stack Developers</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Sat, 31 Aug 2024 08:47:12 +0000</pubDate>
      <link>https://dev.to/incidenthub/14-monitoring-tools-for-full-stack-developers-4nkf</link>
      <guid>https://dev.to/incidenthub/14-monitoring-tools-for-full-stack-developers-4nkf</guid>
      <description>&lt;p&gt;Whether you are a solo full-stack developer or a member of a team, your toolkit needs to have software that monitors your applications, infrastructure, managed services, and third-party dependencies.&lt;/p&gt;

&lt;p&gt;This is a list of 14 monitoring tools you can use to gain insights into your applications’ performance, reliability, and uptime. Some of these are managed, and others can be self-hosted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache SkyWalking
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://skywalking.apache.org/" rel="noopener noreferrer"&gt;Apache SkyWalking&lt;/a&gt; is an open-source APM tool meant for distributed systems. It has support for distributed tracing, agents in multiple languages, and support for an eBPF agent. &lt;/p&gt;

&lt;p&gt;SkyWalking has its own native APM database called BanyanDB which can ingest and store telemetry and observability data. It also allows you to parse logs and extract metrics from log entries.&lt;/p&gt;

&lt;p&gt;One of the important features of SkyWalking is its ability to ingest data from other sources in well-known formats like OpenTelemetry. It can also forward data to external services like alerting systems. This allows you to plug in SkyWalking without replacing your other tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Stack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://betterstack.com/" rel="noopener noreferrer"&gt;Better Stack&lt;/a&gt; is a managed log aggregation system that can ingest logs from your sources, run search queries, and set up alerts on queries. It also comes with hosted status pages. &lt;/p&gt;

&lt;p&gt;The alerting feature of Better Stack has support for multiple team members as well as integration with third-party tools like PagerDuty and ZenDesk. You can also pull data from external cloud services like GCP, AWS, and Azure to create incidents in Better Stack. &lt;/p&gt;

&lt;p&gt;In addition, Better Stack also supports website monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  ELK (Elasticsearch/Logstash/Kibana)
&lt;/h2&gt;

&lt;p&gt;This stack consists of three components - the Elasticsearch log ingestion and processing engine, the Logstash log processor, and the Kibana UI. &lt;/p&gt;

&lt;p&gt;Elasticsearch supports advanced log aggregation features with support for indexing, sharding, and clustering. It also comes with a REST API. Elasticsearch and Kibana can work seamlessly together. It's easy to set up this stack with Docker images but it can take considerably more work to install, configure, and maintain a scalable ELK stack. &lt;/p&gt;

&lt;p&gt;As of this writing, &lt;a href="https://github.com/elastic/elasticsearch" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt; is again open-source.&lt;/p&gt;

&lt;h2&gt;
  
  
  GlitchTip
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://gitlab.com/glitchtip" rel="noopener noreferrer"&gt;GlitchTip&lt;/a&gt; is an open-source error, uptime, and performance monitoring tool which also has a &lt;a href="https://glitchtip.com/" rel="noopener noreferrer"&gt;managed version&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;GlitchTip supports &lt;a href="https://glitchtip.com/sdkdocs" rel="noopener noreferrer"&gt;multiple languages&lt;/a&gt; and frameworks. Its &lt;a href="https://glitchtip.com/documentation/uptime-monitoring" rel="noopener noreferrer"&gt;uptime monitoring&lt;/a&gt; includes URL and heartbeat monitoring. It is also compatible with Sentry's API, thus you can use it to push data anywhere that supports Sentry's API. It has basic alerting support via email.&lt;/p&gt;

&lt;p&gt;They are also pretty open about their &lt;a href="https://glitchtip.com/documentation/hosted-architecture" rel="noopener noreferrer"&gt;hosted architecture&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grafana
&lt;/h2&gt;

&lt;p&gt;Grafana is an analytics and data visualization tool that can create dashboards of charts and graphs. It supports many different data sources via an extensive plugin ecosystem, so you can look at and correlate metrics from different systems in the same dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/grafana/grafana" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; is open-source and also has a &lt;a href="https://grafana.com/products/cloud/" rel="noopener noreferrer"&gt;managed version&lt;/a&gt;. You can   query both metrics and logs. It has a very active community. You can set up and try Grafana on your local machine easily using Docker.&lt;/p&gt;

&lt;p&gt;Grafana's alerting feature supports sending alerts to external services like PagerDuty, OpsGenie, Slack, etc. &lt;/p&gt;

&lt;h2&gt;
  
  
  IncidentHub
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://incidenthub.cloud/" rel="noopener noreferrer"&gt;IncidentHub&lt;/a&gt; monitors third-party Cloud and SaaS services and alerts you when they have an outage. It supports monitoring hundreds of cloud platforms like &lt;a href="https://incidenthub.cloud/status/googlecloudplatform" rel="noopener noreferrer"&gt;GCP&lt;/a&gt;, AWS, &lt;a href="https://incidenthub.cloud/status/digitalocean" rel="noopener noreferrer"&gt;Digital Ocean&lt;/a&gt;, communication/collaboration tools like &lt;a href="https://incidenthub.cloud/status/slack" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;, &lt;a href="https://incidenthub.cloud/status/zoom" rel="noopener noreferrer"&gt;Zoom&lt;/a&gt;, &lt;a href="https://incidenthub.cloud/status/microsoft365" rel="noopener noreferrer"&gt;Office365&lt;/a&gt;, payment services like &lt;a href="https://incidenthub.cloud/status/paypal" rel="noopener noreferrer"&gt;PayPal&lt;/a&gt; and &lt;a href="https://incidenthub.cloud/status/stripe" rel="noopener noreferrer"&gt;Stripe&lt;/a&gt;, and dev tooling like &lt;a href="https://incidenthub.cloud/status/github" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://incidenthub.cloud/status/gitlab" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt;, and &lt;a href="https://incidenthub.cloud/status/circleci" rel="noopener noreferrer"&gt;CircleCI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;IncidentHub periodically checks public data sources like status pages. It can notify you using channels like email, PagerDuty, Discord, Slack, Webhooks etc.&lt;/p&gt;

&lt;p&gt;If you are a developer, you can use IncidentHub to monitor your external dependencies like &lt;a href="https://incidenthub.cloud/status/googlecloudplatform" rel="noopener noreferrer"&gt;cloud services&lt;/a&gt;, &lt;a href="https://incidenthub.cloud/status/akamai" rel="noopener noreferrer"&gt;CDNs&lt;/a&gt;, and &lt;a href="https://incidenthub.cloud/status/github" rel="noopener noreferrer"&gt;CI/CD&lt;/a&gt; and deployment platforms. As of this writing, it supports 20 free monitors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parseable
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.parseable.com/" rel="noopener noreferrer"&gt;Parseable&lt;/a&gt; is a managed log analytics solution that also has an &lt;a href="https://github.com/parseablehq/parseable" rel="noopener noreferrer"&gt;open-source version&lt;/a&gt;. It's written in Rust. Parseable can use either Parquet or the Arrow format for storage. Both Arrow and Parquet are Apache open-source column-oriented data storage formats.&lt;/p&gt;

&lt;p&gt;Parseable supports OpenTelemetry and common &lt;a href="https://www.parseable.com/docs/category/log-agents" rel="noopener noreferrer"&gt;log collectors&lt;/a&gt; like Fluent Bit and LogStash for ingestion. You can also &lt;a href="https://www.parseable.com/docs/category/applications" rel="noopener noreferrer"&gt;send logs programmatically&lt;/a&gt;. It has built-in support for alerting and can push alerts into webhooks, Prometheus Alertmanager, and Slack.&lt;/p&gt;

&lt;p&gt;Parseable also has &lt;a href="https://www.parseable.com/docs/integrations/llm-based-sql-generation" rel="noopener noreferrer"&gt;LLM-based&lt;/a&gt; SQL generation for querying logs, Role-based Access Control, and OpenID Connect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pinpoint
&lt;/h2&gt;

&lt;p&gt;This is an &lt;a href="https://pinpoint-apm.github.io/pinpoint/" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; application performance management (APM) tool that is written in Java. Pinpoint can help understand how components in distributed systems interact with each other. Its UI can show you the topology of your system visually.&lt;/p&gt;

&lt;p&gt;Pinpoint works on the agent model, where an agent hooks into your applications. You can integrate with Pinpoint either by calling its APIs or by using bytecode instrumentation; the second approach does not require you to change any code.&lt;/p&gt;

&lt;p&gt;Pinpoint supports &lt;a href="https://github.com/pinpoint-apm/pinpoint?tab=readme-ov-file#supported-modules" rel="noopener noreferrer"&gt;common Java software&lt;/a&gt; out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prometheus
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/prometheus" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is an open-source metrics collection and monitoring tool written in Go. It has a very active developer and user community. Originally developed at SoundCloud, it is now an independently managed CNCF project. &lt;/p&gt;

&lt;p&gt;Prometheus supports time series metrics ingestion and has a native query language PromQL. It works via the pull model where it collects metrics from "exporters", which collect data from different sources. The list of exporters is &lt;a href="https://prometheus.io/docs/instrumenting/exporters/" rel="noopener noreferrer"&gt;extensive&lt;/a&gt;, and you can also instrument your application to either expose metrics to be collected or send them directly to Prometheus.&lt;/p&gt;

&lt;p&gt;Prometheus has a service discovery feature where it can automatically detect nodes to monitor. It can push metrics data into and read from &lt;a href="https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage" rel="noopener noreferrer"&gt;external data stores&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Using PromQL you can define alerting rules in your Prometheus configuration. Prometheus comes with its own Alertmanager, which handles grouping, deduplication, and routing of the alerts those rules fire. Alerts emitted by Prometheus can be sent to third-party systems like Slack and PagerDuty through Alertmanager.&lt;/p&gt;
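
&lt;p&gt;For a flavor of what this looks like, here is a minimal sketch of an alerting rule file; the rule name, threshold, and labels are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: availability
    rules:
      # Fire if a scrape target has been unreachable for five minutes
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;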

&lt;h2&gt;
  
  
  Sentry
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://sentry.io/welcome/" rel="noopener noreferrer"&gt;Sentry&lt;/a&gt; is an &lt;a href="https://github.com/getsentry/sentry" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; error tracking and performance monitoring tool that also has a managed version.&lt;/p&gt;

&lt;p&gt;Sentry has support for many &lt;a href="https://github.com/getsentry/sentry?tab=readme-ov-file#official-sentry-sdks" rel="noopener noreferrer"&gt;languages&lt;/a&gt; and frameworks. It supports session replay and end-to-end tracing. You can dig into the root cause of slow requests by tracing requests across function calls and services.&lt;/p&gt;

&lt;p&gt;Sentry's alerting feature supports both metrics-based checks and URL monitoring.&lt;/p&gt;

&lt;p&gt;Sentry also integrates with a lot of &lt;a href="https://sentry.io/integrations/" rel="noopener noreferrer"&gt;popular developer tools&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  SigNoz
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://signoz.io/" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt; positions itself as an "open-source DataDog alternative".  You can &lt;a href="https://github.com/SigNoz/signoz" rel="noopener noreferrer"&gt;host it yourself&lt;/a&gt; or use the commercial cloud version. &lt;/p&gt;

&lt;p&gt;SigNoz collects metrics, traces, and logs and presents them in one dashboard. It can track external API calls which is useful when your application uses third-party APIs. You can look at common metrics like p95/p99 and trace the root cause of slow requests - whether they are because of external API response times or slow DB queries. SigNoz also lets you filter out traces by tags, service name, errors, and latency. &lt;/p&gt;

&lt;p&gt;SigNoz supports OpenTelemetry as its instrumentation library - which means that any language and framework supported by OpenTelemetry is also &lt;a href="https://github.com/SigNoz/signoz?tab=readme-ov-file#languages-supported" rel="noopener noreferrer"&gt;supported by SigNoz&lt;/a&gt;. SigNoz also has built-in alerting. &lt;/p&gt;

&lt;h2&gt;
  
  
  UptimeRobot
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://uptimerobot.com/" rel="noopener noreferrer"&gt;UptimeRobot&lt;/a&gt; is a website monitoring service that checks if your website is accessible periodically and alerts you. &lt;/p&gt;

&lt;p&gt;It supports different types of monitoring like HTTP/S, checking for keywords, cron jobs, TLS certificate expiry, and domain monitoring. It integrates with different services like Slack, PagerDuty, Telegram, Email, ZenDesk, etc. It also gives you a status page that you can share with your team.&lt;/p&gt;

&lt;p&gt;As of this writing the service supports 50 free monitors, making it useful for solo devs and small teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Victoria Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://victoriametrics.com/" rel="noopener noreferrer"&gt;VictoriaMetrics&lt;/a&gt; is a monitoring tool and time series database. It is &lt;a href="https://github.com/VictoriaMetrics/VictoriaMetrics" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; and has a managed version.&lt;/p&gt;

&lt;p&gt;VictoriaMetrics can integrate with other monitoring tools. E.g. with Prometheus, it can function as a storage backend for long-term data retention. It can ingest data in all well-known formats, including OpenTelemetry.&lt;/p&gt;

&lt;p&gt;You can query VictoriaMetrics using either PromQL or its native MetricsQL. It's also straightforward to back up VictoriaMetrics data using its snapshots feature to any cloud storage like Amazon S3 or Google Cloud Storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  WireShark
&lt;/h2&gt;

&lt;p&gt;Now we are getting a bit low-level. &lt;a href="https://www.wireshark.org/" rel="noopener noreferrer"&gt;WireShark&lt;/a&gt; is a network protocol analyzer that has been around for a long time. &lt;/p&gt;

&lt;p&gt;WireShark is ideal if you have to inspect network traffic at the packet level. It supports many protocols with filtering capabilities. You can capture and inspect data live, or do offline analysis. &lt;/p&gt;

&lt;p&gt;Wireshark runs on multiple OSs, including Windows, Linux, FreeBSD, and macOS.&lt;/p&gt;
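
&lt;p&gt;If you want to script this kind of packet analysis, the third-party pyshark library wraps Wireshark's tshark dissectors. A rough sketch, assuming pyshark and tshark are installed - the capture file name and filter are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pyshark  # pip install pyshark; needs tshark on the PATH

# Offline analysis: walk a saved capture and keep only TCP retransmissions.
capture = pyshark.FileCapture(
    "trace.pcap",  # illustrative file name
    display_filter="tcp.analysis.retransmission",
)
for packet in capture:
    print(packet.sniff_time, packet.ip.src, packet.ip.dst)
capture.close()
&lt;/code&gt;&lt;/pre&gt;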

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing the right monitoring tool can be daunting with so many options. A checklist for choosing what is right for your needs could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are your top 5 feature requirements? This list can change over time.&lt;/li&gt;
&lt;li&gt;What is your budget?&lt;/li&gt;
&lt;li&gt;Do you prefer managing your own tools, or using a hosted solution? As your applications mature and your observability data volume grows, the scalability of your tool becomes important.&lt;/li&gt;
&lt;li&gt;Does your organization have regulatory requirements?&lt;/li&gt;
&lt;li&gt;Does your chosen tool do multiple things well? E.g. Does it handle logs and metrics equally well?&lt;/li&gt;
&lt;li&gt;Does the tool integrate with your existing toolkit?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might end up with 2-3 or even more tools, each in its specialized niche, and that's OK. In that case, integration features become important. You might choose a distributed tracing tool that sends alerts to a separate alerting tool. Or you might have an uptime monitor that sends informational alerts to your Slack, and critical ones to PagerDuty. As your project needs change, so will your tools. &lt;/p&gt;

&lt;p&gt;This is by no means an exhaustive list, and there are many other tools out there. Try out some of these and let others know what you think in the comments.&lt;/p&gt;

&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@martz90" rel="noopener noreferrer"&gt;Martin Martz&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-blue-background-with-wavy-shapes-vy6eb3gscTk" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>fullstack</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Benefits of a Single Incident Management System</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Thu, 29 Aug 2024 01:17:31 +0000</pubDate>
      <link>https://dev.to/incidenthub/the-benefits-of-a-single-incident-management-system-1k07</link>
      <guid>https://dev.to/incidenthub/the-benefits-of-a-single-incident-management-system-1k07</guid>
      <description>&lt;p&gt;How many monitoring tools do you have?&lt;/p&gt;

&lt;p&gt;Chances are at least 2-3. One tool usually does not cover all cases, and it’s usually a combination of self-managed and managed tools. Self-managed gives you more control over custom configurations and cost. Managed ones take away the headache of running them yourself.&lt;/p&gt;

&lt;p&gt;Prometheus is the de-facto standard for monitoring these days if you have a modern application stack and you want to manage your own monitoring. It is metrics-based, i.e., it uses metrics as the source of data from all the monitored systems. There are &lt;a href="https://prometheus.io/docs/instrumenting/exporters/" rel="noopener noreferrer"&gt;ready-made exporters&lt;/a&gt; for almost all popular infrastructure components. You can send your application and business metrics to Prometheus too with &lt;a href="https://opentelemetry.io/docs/specs/otel/metrics/sdk_exporters/prometheus/" rel="noopener noreferrer"&gt;OpenTelemetry exporters&lt;/a&gt;.&lt;/p&gt;
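
&lt;p&gt;Exposing your own application metrics takes only a few lines with the official client libraries. A minimal Python sketch - the metric name and port are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

from prometheus_client import Counter, start_http_server

# Hypothetical business metric; Prometheus scrapes it from :8000/metrics.
ORDERS_PROCESSED = Counter("orders_processed_total", "Total orders processed")

start_http_server(8000)
while True:
    ORDERS_PROCESSED.inc()
    time.sleep(1)  # stand-in for real work
&lt;/code&gt;&lt;/pre&gt;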

&lt;p&gt;This model does not work for all aspects of your service. E.g., if you want to monitor external properties like your website, or use synthetic monitoring to check your customer-facing APIs from global locations, you could use something like Pingdom or UptimeRobot. This becomes another source of data about your service's uptime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Many Monitors, One Incident Management System
&lt;/h2&gt;

&lt;p&gt;A downside of having more than one monitoring system in place, regardless of the need, is that you have multiple sources of data. You have to consult multiple systems if you want to know the overall status. It is therefore important that all alerts land in one single incident and on-call management system - a single place from which your on-call teams get paged.&lt;/p&gt;

&lt;p&gt;So ensuring that all your monitoring tools can integrate with your on-call system is crucial.&lt;/p&gt;
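
&lt;p&gt;Most on-call systems accept events over a simple HTTP API, so even tools without a native integration can usually be wired in. A sketch using PagerDuty's Events API v2 - the routing key and payload values are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

# Trigger an incident via PagerDuty's Events API v2.
event = {
    "routing_key": "YOUR_INTEGRATION_KEY",  # from a PagerDuty service integration
    "event_action": "trigger",
    "payload": {
        "summary": "Synthetic check failing for checkout API",  # illustrative
        "source": "uptime-monitor-eu-west",
        "severity": "critical",
    },
}
resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=10)
resp.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;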

&lt;p&gt;A typical Prometheus setup might look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9cnopt3s4fotjb9db07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9cnopt3s4fotjb9db07.png" alt="Prometheus monitoring setup" width="800" height="205"&gt;&lt;/a&gt;&lt;br&gt;
If you have other monitoring systems, you should be able to route those alerts into your on-call/incident response system. Most tools support this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2fh0zg50qrep3jqlr5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2fh0zg50qrep3jqlr5f.png" alt="Prometheus and other external monitoring tools" width="800" height="219"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://incidenthub.cloud/" rel="noopener noreferrer"&gt;IncidentHub&lt;/a&gt; monitors your external SaaS and cloud providers and notifies you when they have incidents. It can easily integrate into your existing incident management system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9ld1hwyc8eh7sitjjsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9ld1hwyc8eh7sitjjsb.png" alt="IncidentHub fits into your existing monitoring ecosystem" width="800" height="285"&gt;&lt;/a&gt;&lt;br&gt;
If you’re using PagerDuty, just add a PagerDuty channel and you’re good to go. Check out the &lt;a href="https://docs.incidenthub.cloud/channels" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for more.&lt;/p&gt;

&lt;p&gt;Cover image credits - &lt;a href="https://unsplash.com/@lukechesser" rel="noopener noreferrer"&gt;Luke Chesser&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/graphs-of-performance-analytics-on-a-laptop-screen-JKUTrJ4vK00" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>Monitoring Third Party Vendors as an Ops Engineer/SRE</title>
      <dc:creator>Hrish B</dc:creator>
      <pubDate>Mon, 26 Aug 2024 08:27:02 +0000</pubDate>
      <link>https://dev.to/incidenthub/monitoring-third-party-vendors-as-an-ops-engineersre-41j1</link>
      <guid>https://dev.to/incidenthub/monitoring-third-party-vendors-as-an-ops-engineersre-41j1</guid>
      <description>&lt;p&gt;Why should you monitor your third-party Cloud and SaaS vendors if you are in SRE/Ops?&lt;/p&gt;

&lt;p&gt;As part of an SRE team, your primary responsibility is ensuring the reliability of your applications. What makes you responsible for monitoring services that you don't even manage? Third-party services are just like yours - with SLAs. And outages happen, affecting you as well as many others who depend on them.&lt;/p&gt;

&lt;p&gt;It's a no-brainer that you should know when such outages happen, so you can stay on top of things if and when they affect your running applications.&lt;/p&gt;

&lt;p&gt;Most of your third-party dependencies will have a public status page or a Twitter account where they publish updates on their outages. Here are some seemingly easy ways to monitor these pages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subscribe to the RSS feed of these pages&lt;/li&gt;
&lt;li&gt;Follow the Twitter account&lt;/li&gt;
&lt;li&gt;Sign up for Slack, Email, SMS notifications on the status page itself if the page supports these&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But if you have tried this, you know it's not that easy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not all pages have RSS feeds&lt;/li&gt;
&lt;li&gt;Some have Slack, Email, SMS integration - some don't&lt;/li&gt;
&lt;li&gt;Some don't have a Twitter account&lt;/li&gt;
&lt;li&gt;You need to sign up on all of these pages one by one, and not all services support the same notification channels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can easily end up doing this one by one for 10-15 or more service providers. Let's do a quick check: which services in the list below do you use in your stack?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;    DNS - &lt;a href="https://incidenthub.cloud/status/googlecloudplatform" rel="noopener noreferrer"&gt;GCP&lt;/a&gt;/GoDaddy/UltraDNS/Route53&lt;/li&gt;
&lt;li&gt;    Cloud/PaaS - GCP/AWS/Azure/&lt;a href="https://incidenthub.cloud/status/digitalocean" rel="noopener noreferrer"&gt;DigitalOcean&lt;/a&gt;/Heroku/Render/&lt;a href="https://incidenthub.cloud/status/railway" rel="noopener noreferrer"&gt;Railway&lt;/a&gt;/&lt;a href="https://incidenthub.cloud/status/hetzner" rel="noopener noreferrer"&gt;Hetzner&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Monitoring - Grafana Cloud/&lt;a href="https://incidenthub.cloud/status/datadog" rel="noopener noreferrer"&gt;DataDog&lt;/a&gt;/&lt;a href="https://incidenthub.cloud/status/newrelic" rel="noopener noreferrer"&gt;New Relic&lt;/a&gt;/SolarWinds&lt;/li&gt;
&lt;li&gt;    On-call management - PagerDuty/OpsGenie&lt;/li&gt;
&lt;li&gt;    Email - &lt;a href="https://incidenthub.cloud/status/googleworkspace" rel="noopener noreferrer"&gt;Google Workspace&lt;/a&gt;/Zoho&lt;/li&gt;
&lt;li&gt;    Communication - &lt;a href="https://incidenthub.cloud/status/zoom" rel="noopener noreferrer"&gt;Zoom&lt;/a&gt;/&lt;a href="https://incidenthub.cloud/status/slack" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Collaboration - &lt;a href="https://incidenthub.cloud/status/jira" rel="noopener noreferrer"&gt;Atlassian Jira&lt;/a&gt;/Confluence&lt;/li&gt;
&lt;li&gt;    Source code - &lt;a href="https://incidenthub.cloud/status/gitlab" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt;/&lt;a href="https://incidenthub.cloud/status/github" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    CI/CD/GitOps - TravisCI/&lt;a href="https://incidenthub.cloud/status/circleci" rel="noopener noreferrer"&gt;CircleCI&lt;/a&gt;/CodeFresh&lt;/li&gt;
&lt;li&gt;    CDN/Content delivery - &lt;a href="https://incidenthub.cloud/status/cloudflare" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt;/CDNJS/Fastly/&lt;a href="https://incidenthub.cloud/status/akamai" rel="noopener noreferrer"&gt;Akamai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    SMTP providers - SMTP.com/&lt;a href="https://incidenthub.cloud/status/sendgrid" rel="noopener noreferrer"&gt;SendGrid&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Payments - &lt;a href="https://incidenthub.cloud/status/paypal" rel="noopener noreferrer"&gt;PayPal&lt;/a&gt;/&lt;a href="https://incidenthub.cloud/status/stripe" rel="noopener noreferrer"&gt;Stripe&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Artifact Repo - Maven/&lt;a href="https://incidenthub.cloud/status/dockerhub" rel="noopener noreferrer"&gt;DockerHub&lt;/a&gt;/Quay.io&lt;/li&gt;
&lt;li&gt;    Others - &lt;a href="https://incidenthub.cloud/status/openai" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;/Apple Dev Platform/Meta Platform/&lt;a href="https://incidenthub.cloud/status/anthropic" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Marketing - MailChimp/&lt;a href="https://incidenthub.cloud/status/hubspot" rel="noopener noreferrer"&gt;Hubspot&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;    Auth - Okta/Clerk/Auth0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a small list. You may not use all of these, or you may use others, but you get the point.&lt;/p&gt;

&lt;p&gt;Like any self-respecting Ops Engineer/SRE, you would probably want to whip up a script and write this check-pages-and-notify-in-one-place tool yourself (a sketch of what that script looks like follows the list below). I know, because I've worked in Ops/SRE roles for the better part of my career, and NIH (not-invented-here) is a very real thing. Here's why it's not a great idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any software you write has to be maintained. Say your org starts using a new service which does not have an RSS feed on the status page. What now?&lt;/li&gt;
&lt;li&gt;Who monitors the monitor? How do you know when your script is not running?&lt;/li&gt;
&lt;li&gt;You probably have better uses for your time&lt;/li&gt;
&lt;/ul&gt;
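
&lt;p&gt;To make the maintenance point concrete, here's roughly what that home-grown script looks like - a sketch using the feedparser library, with illustrative feed URLs. It works right up until a provider drops its RSS feed or changes its status page provider:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

import feedparser  # pip install feedparser

# Illustrative feed URLs - every provider differs, and some have no feed at all.
STATUS_FEEDS = {
    "github": "https://www.githubstatus.com/history.rss",
    "cloudflare": "https://www.cloudflarestatus.com/history.rss",
}

seen = set()

while True:
    for name, url in STATUS_FEEDS.items():
        for entry in feedparser.parse(url).entries:
            key = (name, entry.get("id") or entry.get("link"))
            if key not in seen:
                seen.add(key)
                print(f"[{name}] {entry.title}")  # stand-in for a Slack/PagerDuty call
    time.sleep(300)  # and who alerts you when this loop dies?
&lt;/code&gt;&lt;/pre&gt;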

&lt;p&gt;IncidentHub was built to solve precisely these problems - so you can focus on what's important, and hand off monitoring third-party services to something that was built with that goal in mind. So stop hacking together scripts to monitor public status pages, and &lt;a href="https://incidenthub.cloud/" rel="noopener noreferrer"&gt;try it out for free&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Image credits : &lt;a href="https://unsplash.com/@dulhiier" rel="noopener noreferrer"&gt;Nastya Dulhiier&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/lighted-city-at-night-aerial-photo-OKOOGO578eo" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>cloud</category>
      <category>saas</category>
      <category>uptime</category>
    </item>
  </channel>
</rss>
