Don't Get Backlogged: Effective Monitoring for Healthy Pub/Sub

Ajeesh — Sun, 31 Mar 2024 15:37:28 +0000

Introduction

Google Pub/Sub is a powerful event messaging service, but even the most robust system needs monitoring to run smoothly. Basic monitoring, such as message count and acknowledgment rate, provides a starting point, but real optimization requires looking deeper.

Basic monitoring alone limits our visibility into Pub/Sub's health. We might miss critical issues such as message backlog, slow delivery, and data loss. This blog takes a deeper look at the next level of Pub/Sub monitoring.

Level Up Your Monitoring with Pub/Sub Metrics

In this post, we'll examine Pub/Sub metrics that offer valuable insights into subscriber performance and message flow efficiency.

Monitor Healthy Subscribers

Monitor Message Backlog

Subscription/num_undelivered_messages: This metric reveals the number of unacknowledged messages, a potential backlog indicator.
Subscription/oldest_unacked_message_age: This metric identifies the age of the oldest unacknowledged message to pinpoint potential bottlenecks.
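
The two backlog metrics are most useful when read together: a large count with a young oldest message usually means a busy but moving queue, while an old message points to something stuck. Here is a minimal Python sketch of that reasoning. The BacklogSample type, the status labels, and both thresholds are assumptions made for illustration and should be tuned to your own workload.

```python
from dataclasses import dataclass


@dataclass
class BacklogSample:
    """One reading of the two backlog metrics for a subscription."""
    num_undelivered_messages: int
    oldest_unacked_message_age_s: float  # age in seconds


def backlog_status(sample, max_messages=10_000, max_age_s=600.0):
    """Classify a sample. Both thresholds are illustrative assumptions."""
    too_many = sample.num_undelivered_messages > max_messages
    too_old = sample.oldest_unacked_message_age_s > max_age_s
    if too_many and too_old:
        return "large and stale backlog"
    if too_old:
        return "stuck messages"
    if too_many:
        return "large but moving backlog"
    return "healthy"


if __name__ == "__main__":
    print(backlog_status(BacklogSample(25_000, 1_800.0)))
```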

Monitor Delivery Health

Subscription/delivery_latency_health_score: This metric offers a holistic view of message delivery health based on latency. It considers factors like:

Seek requests: Frequent seeking indicates potential message delivery issues.
Negatively acknowledged (NACKed) messages: Messages rejected by the subscriber due to errors.
Expired acknowledgment deadlines: When a subscriber fails to acknowledge a message within the deadline.
Acknowledgment latencies: Time taken by a subscriber to acknowledge a message.
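Low utilization: Whether the subscription has enough spare capacity to keep up with the publish throughput. Consistently low utilization is the healthy state; a subscription running close to its capacity may see higher delivery latency.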

The delivery latency health score assigns 0 (unhealthy) or 1 (healthy) to each of the criteria above, providing a quick overview of your subscription's health.
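
To make the scoring concrete, here is a small Python sketch that sums the per-criterion scores and reports which criteria are failing. The criterion names in CRITERIA are hypothetical labels chosen for this example; check the label values your monitoring backend actually reports.

```python
# Hypothetical names for the five criteria described above.
CRITERIA = (
    "seek_requests",
    "nacked_messages",
    "expired_ack_deadlines",
    "ack_latency",
    "low_utilization",
)


def combined_health(scores):
    """Return (total score, failing criteria) from per-criterion 0/1 scores."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total = sum(scores[c] for c in CRITERIA)
    failing = [c for c in CRITERIA if scores[c] == 0]
    return total, failing


if __name__ == "__main__":
    sample = {c: 1 for c in CRITERIA}
    sample["low_utilization"] = 0  # only the utilization criterion is unhealthy
    print(combined_health(sample))
```

A stacked chart shows the same thing visually: each healthy criterion contributes one unit of height, so a dip tells you how many criteria are failing, and the missing band tells you which one.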

The following is a screenshot of the metric plotted for a one-hour period using a stacked area chart. Each criterion contributes a score of 1 when healthy, so the combined score is the sum across criteria. In this chart the utilization score drops to 0, which leaves a combined score of 4.

Subscription/ack_latencies: This metric shows how long subscribers take to acknowledge messages, which gives insight into subscriber performance. It is reported as a histogram, so we can analyze the latency distribution at different percentiles.

Here is an example of a PromQL query that calculates the 95th percentile acknowledgment latency for a specific Pub/Sub subscription ("order_subscription_v1") over the past 5 minutes:



histogram_quantile(0.95, sum by (le)(rate(subscription_ack_latencies_bucket{ subscription_id="order_subscription_v1"}[5m])))
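
It helps to know what histogram_quantile does with the buckets: it finds the bucket that contains the requested rank and interpolates linearly inside it, so the result is an estimate whose precision depends on the bucket boundaries. The Python sketch below reproduces that interpolation for classic cumulative buckets. It is a simplified illustration, not the Prometheus implementation, and the bucket values in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified version of the linear interpolation PromQL applies to
    classic histograms. The last bucket must have upper bound +Inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev = buckets[index - 1] if index > 0 else (0.0, 0.0)
    if count == prev:
        return lower
    return lower + (upper - lower) * (rank - prev) / (count - prev)


if __name__ == "__main__":
    # Made-up cumulative counts per latency bound in milliseconds.
    sample = [(100, 50), (500, 90), (1000, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

In the example, the 95th-percentile rank lands halfway through the 500 to 1000 ms bucket, so the estimate is the midpoint of that bucket. Finer buckets around the latencies you care about give tighter estimates.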

Monitor Acknowledgment Deadline Expiration

Subscription/expired_ack_deadlines_count: This metric counts messages whose acknowledgment deadline expired. It lets us proactively identify situations where messages are redelivered repeatedly because of expired deadlines, which can lead to duplicates.

Monitor Undelivered Messages

Subscription/dead_letter_message_count: This metric tracks messages that Pub/Sub deemed undeliverable and forwarded to a dead-letter topic for further investigation.

Monitor Healthy Publishers

Topic/send_request_count (grouped by response_code): Analyze the volume of publish requests and identify errors indicated by the response codes.

Topic/send_request_count: Without grouping, this metric reveals the overall volume of publish requests from publishers. Because a single request can carry a batch of messages, it counts requests rather than individual messages.
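
Grouping by response_code is most useful when turned into an error ratio, because a raw error count is hard to judge without the total. Here is a minimal Python sketch. It assumes the successful code is reported as "success"; that value and the other codes in the example are assumptions to verify against your own data.

```python
def publish_error_ratio(counts_by_code, success_code="success"):
    """Share of publish requests that did not succeed.

    counts_by_code maps a response_code value to a request count.
    The "success" label value is an assumption for this sketch.
    """
    total = sum(counts_by_code.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in counts_by_code.items() if code != success_code)
    return errors / total


if __name__ == "__main__":
    counts = {"success": 980, "unavailable": 15, "deadline_exceeded": 5}
    print(f"{publish_error_ratio(counts):.1%} of publish requests failed")
```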

Topic/message_sizes: Monitor the size of individual messages to ensure efficient message transmission.

Beyond the Basics: Combining Metrics for Enhanced Monitoring

Combining the metrics above unlocks even more powerful insights. For instance, tracking the oldest unacked message age together with acknowledgment latency can help identify potential backpressure situations.

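Here is an example of a PromQL query that checks for both a message backlog (oldest unacknowledged message more than one hour old, with the age measured in seconds) and slow processing (95th percentile acknowledgment latency above 60 seconds, with latencies measured in milliseconds) for the specified subscription. The on () modifier is needed because the aggregation on the right-hand side drops all labels, so a plain and would find no matching label sets and return nothing.

subscription_oldest_unacked_message_age{subscription_id="order_subscription_v1"} > 60*60 and on () histogram_quantile(0.95, sum by (le)(rate(subscription_ack_latencies_bucket{subscription_id="order_subscription_v1"}[5m]))) > 60000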
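
If you check several subscriptions, it can be convenient to generate this expression and send it to the Prometheus HTTP API (/api/v1/query) from a script. The Python sketch below only builds the expression and the request URL. The function names, default thresholds, and the localhost address are assumptions for illustration. It joins the two conditions with and on () so that the labelled left-hand side can match the right-hand side, whose labels are removed by the aggregation.

```python
from urllib.parse import urlencode


def backlog_and_slow_acks(subscription_id, max_age_s=3600, max_p95_ms=60000):
    """PromQL for: oldest unacked message too old AND p95 ack latency too high."""
    selector = f'{{subscription_id="{subscription_id}"}}'
    p95 = (
        "histogram_quantile(0.95, sum by (le)"
        f"(rate(subscription_ack_latencies_bucket{selector}[5m])))"
    )
    return (
        f"subscription_oldest_unacked_message_age{selector} > {max_age_s} "
        f"and on () {p95} > {max_p95_ms}"
    )


def instant_query_url(base_url, expr):
    """Build an instant-query URL for a Prometheus-compatible HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    expr = backlog_and_slow_acks("order_subscription_v1")
    # Hypothetical endpoint; replace with your own Prometheus-compatible URL.
    print(instant_query_url("http://localhost:9090", expr))
```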

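As another example, tracking the dead-letter message count together with the oldest unacked message age can help prevent data loss. The query below matches when the oldest unacknowledged message is more than 24 hours old and more than one message has been dead-lettered. As sum() drops all labels, on () is required for the two sides of and to match.

subscription_oldest_unacked_message_age{subscription_id="order_subscription_v1"} > 60*60*24 and on () sum(subscription_dead_letter_message_count{subscription_id="order_subscription_v1"}) > 1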

Conclusion