War Story: How a Sentry 24 Bug Failed to Alert Us to a 3-Hour Production Outage
It was 9:00 AM on a Tuesday when our support team started flooding the engineering Slack channel with screenshots of failed checkout attempts. For three hours, our payment service had been returning 500 errors to users, but our primary alerting system, Sentry 24, had sent exactly zero notifications. What followed was a frantic 30-minute scramble to diagnose an outage that should never have gone unnoticed, and a post-mortem that revealed a critical flaw in the tool we trusted to keep our production environment safe.
Incident Timeline
- 6:02 AM: A routine config update to our payment microservice introduced a regression that caused the service to return raw, non-UTF-8 encoded error messages from a third-party tax API.
- 6:02 AM – 9:00 AM: The payment service generated over 12,000 HTTP 500 errors, all of which were forwarded to Sentry 24 for error tracking and alerting.
- 9:00 AM: First customer support ticket about failed checkouts is filed.
- 9:05 AM: On-call engineers check the Sentry 24 dashboard, which shows 0 errors for the payment service. They pivot to load balancer logs, which confirm a 98% error rate for payment endpoints.
- 9:15 AM: Engineers dig into Sentry 24’s internal ingestion logs and find thousands of entries reading "dropped event: invalid UTF-8 payload". The events had been silently discarded instead of processed.
- 9:30 AM: The config change is rolled back, and payment functionality is restored. Total outage duration: 3 hours 28 minutes.
Root Cause Analysis
We quickly identified two primary contributors to the missed alert: a bug in Sentry 24, and gaps in our own monitoring setup.
Sentry 24 v24.1.0, the version we were running, had an unpatched regression in its event ingestion pipeline. When an error payload contained non-UTF-8 bytes, the system silently dropped the event instead of sanitizing it, logging a warning, or incrementing an error metric. Because all 12,000+ error events from the payment service carried the tax API’s non-UTF-8 error strings in their stack traces, none were processed, so our alert rule (trigger on >10 errors in 5 minutes) never fired.
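To make the failure mode concrete, here is a minimal Python sketch. The payload bytes are an assumption (any non-UTF-8 sequence behaves the same way); what it shows is that a strict UTF-8 decode raises, and v24.1.0 dropped the event at exactly that point rather than sanitizing it:

```python
# Hypothetical payload: 0xE1 is "á" in Latin-1 but is not valid UTF-8.
raw = b"Impuesto inv\xe1lido: tax rate lookup failed"

try:
    raw.decode("utf-8")  # strict decode, as the buggy pipeline effectively did
except UnicodeDecodeError:
    # Sentry 24 v24.1.0 gave up here: no sanitization, no warning, no metric.
    pass

# Sanitizing instead of dropping keeps the event (with a replacement
# character), so downstream alert rules can still count it.
safe = raw.decode("utf-8", errors="replace")
print(safe)  # Impuesto inv�lido: tax rate lookup failed
```

Decoding with errors="replace" is only one way to sanitize; the point is that any lossy-but-loud handling would have let our >10-errors-in-5-minutes rule fire.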
Compounding the issue was our over-reliance on Sentry 24 as a single source of truth. We had no redundant alerting for critical payment flows, and Sentry 24’s built-in health checks did not monitor ingestion drop rates. The dropped event logs were only accessible to Sentry 24 admins, and no alerts were configured for ingestion failures.
Remediation Steps
Within 48 hours of the incident, we implemented the following fixes:
- Upgraded Sentry 24 to v24.1.1, which fixed the UTF-8 handling bug, added explicit metrics for dropped events, and surfaced ingestion errors to the main dashboard.
- Configured a new alert rule in Sentry 24 to trigger when ingestion drop rate exceeds 0.1% over 1 minute.
- Set up redundant alerting via Prometheus and Grafana to monitor 500 error rates for all critical services, independent of Sentry 24.
- Added a pre-processing step to all error payloads sent to Sentry 24 to sanitize non-UTF-8 characters before ingestion (a sketch follows this list).
- Lowered alert thresholds for critical services to 5 errors in 5 minutes, down from 10.
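For the pre-processing step above, here is a minimal sketch assuming a Python service instrumented with sentry_sdk; the DSN is a placeholder and the recursive sanitizer is illustrative, not our exact production code:

```python
import sentry_sdk

def _sanitize(value):
    """Recursively force every string in an event to valid UTF-8."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    if isinstance(value, str):
        # Round-trip with errors="replace" to strip unencodable surrogates.
        return value.encode("utf-8", errors="replace").decode("utf-8")
    if isinstance(value, dict):
        return {key: _sanitize(val) for key, val in value.items()}
    if isinstance(value, list):
        return [_sanitize(item) for item in value]
    return value

def before_send(event, hint):
    # Runs client-side on every event before it is sent; returning None
    # here would drop the event, so always return the sanitized dict.
    return _sanitize(event)

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder
    before_send=before_send,
)
```

Sanitizing in the SDK means malformed bytes never reach the ingestion pipeline at all, so we stay protected even if a similar server-side regression reappears.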
Lessons Learned
This incident taught us hard lessons about trusting monitoring tools without verifying their health:
- Never rely on a single alerting system for business-critical user flows. Redundancy is key.
- Monitor your monitoring tools: track ingestion rates, drop rates, and latency for all observability platforms (see the instrumentation sketch after this list).
- Test error scenarios with edge cases (malformed payloads, non-UTF-8 characters, high volume) during regular disaster recovery drills.
- Patch observability tools promptly, even if they appear stable — a bug in your monitoring system is far more dangerous than a bug in a non-critical feature.
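As a sketch of what "monitor your monitoring tools" can look like in practice, the snippet below counts received and dropped events with prometheus_client; the metric names and ingest function are hypothetical, not Sentry 24 internals:

```python
from prometheus_client import Counter, start_http_server

# Illustrative metric names; real pipelines will differ.
EVENTS_RECEIVED = Counter("ingest_events_received_total", "Events received")
EVENTS_DROPPED = Counter("ingest_events_dropped_total", "Events dropped", ["reason"])

def ingest(payload: bytes) -> None:
    EVENTS_RECEIVED.inc()
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        # Count the drop instead of discarding silently: this counter is
        # exactly the signal our original setup was missing.
        EVENTS_DROPPED.labels(reason="invalid_utf8").inc()
        return
    ...  # process the event

# Expose /metrics for Prometheus. An alert on
#   rate(ingest_events_dropped_total[1m])
#     / rate(ingest_events_received_total[1m]) > 0.001
# mirrors the 0.1%-over-1-minute rule described in the remediation steps.
start_http_server(8000)
```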
Three months later, Sentry 24’s dropped-event alert has fired twice (both times for non-critical services), and our redundant Prometheus alerts have caught two additional payment service issues before they impacted users. The 3-hour outage was a painful reminder that even the tools we trust most need to be verified, not just deployed.