<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ajeesh </title>
    <description>The latest articles on DEV Community by Ajeesh  (@ajeesh).</description>
    <link>https://dev.to/ajeesh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1395734%2F675f7dab-0e2e-45ec-ad79-02e1ad6ee14e.jpg</url>
      <title>DEV Community: Ajeesh </title>
      <link>https://dev.to/ajeesh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ajeesh"/>
    <language>en</language>
    <item>
      <title>Don't Get Backlogged: Effective Monitoring for Healthy Pub/Sub</title>
      <dc:creator>Ajeesh </dc:creator>
      <pubDate>Sun, 31 Mar 2024 15:37:28 +0000</pubDate>
      <link>https://dev.to/ajeesh/dont-get-backlogged-effective-monitoring-for-healthy-pubsub-1p0</link>
      <guid>https://dev.to/ajeesh/dont-get-backlogged-effective-monitoring-for-healthy-pubsub-1p0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Google Pub/Sub is a powerful event messaging service, but even the most robust system needs monitoring to ensure smooth operation. Basic metrics such as message count and acknowledgment rate provide a starting point, but real optimization requires digging deeper.&lt;/p&gt;

&lt;p&gt;Relying on basic monitoring alone restricts our visibility into Pub/Sub's health. We might miss critical issues like message backlog, slow delivery, and data loss. In this post, I take a deeper look at the next level of Pub/Sub monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Level Up Your Monitoring with Pub/Sub Metrics
&lt;/h2&gt;

&lt;p&gt;In this post, we'll examine Pub/Sub metrics that offer valuable insights into subscriber performance and message flow efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor Healthy Subscribers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Monitor Message Backlog&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subscription/num_undelivered_messages:&lt;/strong&gt; This metric reports the number of unacknowledged messages in a subscription. A steadily growing value indicates a backlog.&lt;br&gt;
&lt;strong&gt;Subscription/oldest_unacked_message_age:&lt;/strong&gt; This metric reports the age, in seconds, of the oldest unacknowledged message, which helps pinpoint stuck or slow subscribers.&lt;/p&gt;
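
&lt;p&gt;As a minimal sketch, a backlog check could look like the PromQL query below. It assumes the metric is exposed as "subscription_num_undelivered_messages", following the naming used in the other queries in this post, and the threshold of 1000 messages is a hypothetical value to tune to your own traffic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Assumption: metric name follows this post's naming; 1000 is a hypothetical threshold.
subscription_num_undelivered_messages{subscription_id="order_subscription_v1"} &amp;gt; 1000


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;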

&lt;p&gt;&lt;strong&gt;Monitor Delivery Health&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subscription/delivery_latency_health_score:&lt;/strong&gt; This metric offers a holistic view of message delivery health based on latency. It considers factors like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seek requests:&lt;/strong&gt; Frequent seeking indicates potential message delivery issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Negatively acknowledged (NACKed) messages:&lt;/strong&gt; Messages rejected by the subscriber due to errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expired acknowledgment deadlines:&lt;/strong&gt; When a subscriber fails to acknowledge a message within the deadline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acknowledgment latencies:&lt;/strong&gt; Time taken by a subscriber to acknowledge a message.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low utilization:&lt;/strong&gt; Underutilized subscriptions might not be scaling efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The delivery latency health score assigns a 0 (unhealthy) or 1 (healthy) to each of the criteria above, giving a quick overview of your subscription's health.&lt;/p&gt;

&lt;p&gt;The following screenshot shows the metric plotted over a one-hour period as a stacked area chart. Four of the criteria score 1, so the combined health score reaches 4, while the utilization score drops to 0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1zslbjmclsgvlzkmm15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1zslbjmclsgvlzkmm15.png" alt="Delivery Health Score"&gt;&lt;/a&gt;&lt;/p&gt;
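
&lt;p&gt;The per-criterion scores can also be checked with a query. The sketch below assumes the metric is exposed as "subscription_delivery_latency_health_score", following the naming used in the other queries in this post, with one series per criterion. It should return only the criteria that currently score 0 for the subscription.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Assumption: metric name follows this post's naming; one series per health criterion.
subscription_delivery_latency_health_score{subscription_id="order_subscription_v1"} == 0


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;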

&lt;p&gt;&lt;strong&gt;Subscription/ack_latencies:&lt;/strong&gt; This metric shows how long subscribers take to acknowledge messages, which gives insight into subscriber performance. It is reported as a histogram, so you can analyze the latency distribution at different percentiles.&lt;/p&gt;

&lt;p&gt;Here is an example of a PromQL query that calculates the 95th percentile acknowledgment latency for a specific Pub/Sub subscription ("order_subscription_v1") over the past 5 minutes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

histogram_quantile(0.95, sum by (le)(rate(subscription_ack_latencies_bucket{ subscription_id="order_subscription_v1"}[5m])))


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm8p54gtjeidlku9wprz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm8p54gtjeidlku9wprz.png" alt="Ack Latency"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor Acknowledgment Deadline Expiration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subscription/expired_ack_deadlines_count:&lt;/strong&gt; This metric lets us proactively identify situations where messages are redelivered repeatedly because their acknowledgment deadlines expired, which can lead to duplicates.&lt;/p&gt;
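
&lt;p&gt;A minimal sketch of such a check, assuming the counter is exposed as "subscription_expired_ack_deadlines_count" in line with the naming used in the other queries in this post, is to look at the per-second rate of expirations over the past 5 minutes and flag any non-zero value:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Assumption: counter name follows this post's naming convention.
sum(rate(subscription_expired_ack_deadlines_count{subscription_id="order_subscription_v1"}[5m])) &amp;gt; 0


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;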

&lt;p&gt;&lt;strong&gt;Monitor Undelivered Messages&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subscription/dead_letter_message_count:&lt;/strong&gt; This metric tracks messages that Pub/Sub deemed undeliverable and forwarded to a dead-letter topic for further investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor Healthy Publishers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Topic/send_request_count (grouped by response_code):&lt;/strong&gt; Break down the publish requests sent by publishers by response code to identify errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topic/send_request_count:&lt;/strong&gt; Without grouping, this metric reveals the overall volume of publish requests sent by publishers.&lt;/p&gt;
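
&lt;p&gt;As a rough sketch, the grouped view can be expressed in PromQL as shown below. The metric name "topic_send_request_count" is an assumption based on the naming used in the other queries in this post, and the topic "order_topic_v1" is a hypothetical example. The query returns the per-second publish request rate over the past 5 minutes, with one series per response code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Assumption: metric name follows this post's naming; "order_topic_v1" is a hypothetical topic.
sum by (response_code)(rate(topic_send_request_count{topic_id="order_topic_v1"}[5m]))


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;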

&lt;p&gt;&lt;strong&gt;Topic/message_sizes:&lt;/strong&gt; Monitor the size of individual messages to ensure efficient message transmission.&lt;/p&gt;
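
&lt;p&gt;Assuming the message size distribution is exposed as a histogram named "topic_message_sizes_bucket", in line with the naming used in the other queries in this post, a sketch of a query for the 99th percentile message size of a hypothetical topic "order_topic_v1" over the past 5 minutes could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Assumption: histogram name follows this post's naming; "order_topic_v1" is a hypothetical topic.
histogram_quantile(0.99, sum by (le)(rate(topic_message_sizes_bucket{topic_id="order_topic_v1"}[5m])))


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;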

&lt;h2&gt;
  
  
  Beyond the Basics: Combining Metrics for Enhanced Monitoring
&lt;/h2&gt;

&lt;p&gt;Combining the metrics above unlocks even more powerful insights. For instance, tracking both the oldest unacknowledged message age and processing latency can help identify potential backpressure situations.&lt;/p&gt;

&lt;p&gt;Here is an example of a PromQL query that checks the specified subscription for both a message backlog (oldest unacknowledged message more than one hour old) and slow processing (95th percentile acknowledgment latency above 60000 ms, that is, 60 seconds).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# on() is needed because the aggregated right-hand side carries no labels to match on
subscription_oldest_unacked_message_age{subscription_id="order_subscription_v1"} &amp;gt; 60*60 and on() histogram_quantile(0.95, sum by (le)(rate(subscription_ack_latencies_bucket{subscription_id="order_subscription_v1"}[5m]))) &amp;gt; 60000


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

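&lt;p&gt;As another example, tracking the dead letter message count together with the oldest unacknowledged message age can help prevent data loss. The query below flags a subscription whose oldest unacknowledged message is more than a day old while messages are also being sent to the dead-letter topic.&lt;/p&gt;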

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# on() is needed because sum() drops all labels from the right-hand side
subscription_oldest_unacked_message_age{subscription_id="order_subscription_v1"} &amp;gt; 60*60*24 and on() sum(subscription_dead_letter_message_count{subscription_id="order_subscription_v1"}) &amp;gt; 1


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By going beyond basic Pub/Sub monitoring and leveraging these valuable metrics, you can ensure a healthy and efficient event messaging system. This translates to robust applications, reliable message delivery, and a scalable infrastructure for your needs.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
