On December 12, 2024, a major Amazon Cognito outage struck the critical us-east-1 (N. Virginia) region. Cognito is one of several AWS services used for user authentication, and its outage severely affected the auth processes of countless applications. This operational issue, triggered by a configuration change deployment, resulted in widespread “TooManyRequestsException” errors that persisted for several hours. Numerous Amazon Cognito users were left in the lurch, trying to determine why their sites were down, why users were unable to authenticate, and how to restore functionality. Despite the severity, the official Amazon Cognito status page initially failed to report the disruption, leaving many users in the dark.
During the initial phase of the outage, while IT teams scrambled to devise recovery strategies, Amazon remained silent, with their official status site indicating “No recent issues”.
However, StatusGator customers received a notification just minutes after the widespread Amazon Cognito outage began, a full 28 minutes before Amazon officially acknowledged the issue on their status page.
The AWS Cognito Outage Timeline
At 02:24 UTC, StatusGator notified our users of authentication issues with Amazon Cognito, 28 minutes before AWS officially acknowledged the problem on their status page. Our early warning signal was powered by reports and patterns we observed starting at 02:17 UTC, which let us notify customers while they were still triaging the issue. This lead time enabled proactive troubleshooting and communication to end users, minimizing the impact of the outage.
Timeline of Amazon Cognito Authentication Errors
Here's the timeline, pieced together from public reports on the internet, our own monitoring, and the official AWS status page. According to Amazon's official status page, the issue in the us-east-1 (N. Virginia) region began at 00:35 UTC (4:35 PM PST) due to a deployment within Amazon Cognito. The full breakdown:
00:35 UTC: Amazon identifies a rise in error rates within Cognito in the us-east-1 region. The issue isn't yet widespread, and Amazon chooses not to publicly announce the increase in error rates.
01:14 UTC: Amazon engineers initiate an investigation and begin working on a resolution, yet the status page remains unchanged at this point. AWS has not yet acknowledged the issue with Cognito in US East.
02:00 UTC: Amazon identifies two primary causes for the surge in error rates but has yet to disclose this investigation on their status page.
02:17 UTC: The scope of the issue expands and early reports of authentication errors in US East 1 start surfacing across the internet.
02:24 UTC: StatusGator customers receive our Early Warning Signals alert about problems with Amazon Cognito.
02:52 UTC: AWS updates its health portal to acknowledge they are aware of an issue with Cognito in US East. This is 28 minutes after StatusGator customers were alerted.
02:55 UTC: StatusGator detects the change on AWS's status page and updates the official status, reflecting the outage in US East 1.
03:17 UTC: AWS confirms the increase in error rates and isolates the issue to one of two root causes, pledging to continue investigating and hoping to resolve the issue within 60 minutes.
03:37 UTC: AWS updates its status page to state that they have implemented a fix and are seeing signs of recovery.
04:01 UTC: AWS confirms full recovery, as noted in their retrospective analysis.
04:38 UTC: AWS releases the final incident summary on their status page.
There are two critical moments in this timeline. At 9:17 PM ET / 6:17 PM PT (02:17 UTC), the issue became more widespread, and StatusGator notified its customers 7 minutes later. Amazon did not notify its customers for a further 28 minutes. This timeline highlights the critical gap between when problems first emerge and when providers acknowledge them. StatusGator bridges that gap, giving its users an edge.
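The gaps quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the UTC timestamps listed in the timeline; the dictionary keys and the `minutes_between` helper are names made up for this illustration.

```python
from datetime import datetime

# UTC timestamps copied from the timeline above.
# The key names are illustrative labels, not official event names.
TIMELINE = {
    "aws_first_errors": "00:35",
    "widespread_reports": "02:17",
    "statusgator_alert": "02:24",
    "aws_acknowledgement": "02:52",
    "aws_full_recovery": "04:01",
}


def minutes_between(start_key: str, end_key: str) -> int:
    """Return whole minutes between two timeline events on the same UTC day."""
    fmt = "%H:%M"
    start = datetime.strptime(TIMELINE[start_key], fmt)
    end = datetime.strptime(TIMELINE[end_key], fmt)
    return int((end - start).total_seconds() // 60)


# Time from the first widespread reports to the StatusGator alert.
detection_delay = minutes_between("widespread_reports", "statusgator_alert")

# Head start StatusGator customers had over the official acknowledgement.
lead_time = minutes_between("statusgator_alert", "aws_acknowledgement")

# Time between AWS first seeing errors and publicly acknowledging them.
silent_period = minutes_between("aws_first_errors", "aws_acknowledgement")

print(detection_delay, lead_time, silent_period)
```

By the same arithmetic, more than two hours passed between the first rise in error rates and the first public acknowledgement.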
How StatusGator Beats the Amazon Cognito Official Status Page
Our platform continuously monitors an extensive array of data sources, gathering early warning signals from multiple channels. This distinctive capability enables StatusGator to identify and report issues before they become widely recognized. As the leading cloud monitoring platform with more than 10 years of continued service, StatusGator has more data than any other provider.
During this incident, we captured signals such as:
User reports of “TooManyRequestsException” errors submitted to our public website.
Reports of issues with Amazon Web Services from StatusGator customers' internal status pages.
A sudden spike in interest in the current Amazon Cognito status in US East.
Reports of authentication-related issues on other services' status pages.
By analyzing these signals in real time, StatusGator provides faster alerts and actionable insights that can help organizations respond quickly. We answer that critical question “Is it everyone or just us?” and help teams react to outages in real time.
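To make the idea of combining signals concrete, here is a deliberately simplified sketch in Python. It is not StatusGator's actual detection logic: the signal names, weights, and threshold below are all hypothetical, chosen only to show how several weak indicators might add up to one alert decision.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    source: str  # hypothetical signal type, e.g. "user_report"
    count: int   # occurrences observed in the current time window


# Hypothetical weights and threshold, for illustration only.
WEIGHTS = {
    "user_report": 1.0,                 # error reports from the public website
    "customer_status_page": 3.0,        # incidents on customers' status pages
    "status_lookup_spike": 0.5,         # surge in lookups of the service status
    "dependent_service_incident": 2.0,  # related issues on other status pages
}
ALERT_THRESHOLD = 10.0


def outage_score(signals: list[Signal]) -> float:
    """Weighted sum of observed signals; unknown sources contribute nothing."""
    return sum(WEIGHTS.get(s.source, 0.0) * s.count for s in signals)


def should_alert(signals: list[Signal]) -> bool:
    """Alert once the combined score reaches the (assumed) threshold."""
    return outage_score(signals) >= ALERT_THRESHOLD


# A quiet window: a couple of isolated user reports.
quiet = [Signal("user_report", 2)]

# A busy window: several independent signal types firing together.
busy = [
    Signal("user_report", 4),
    Signal("customer_status_page", 1),
    Signal("status_lookup_spike", 6),
    Signal("dependent_service_incident", 1),
]

print(should_alert(quiet), should_alert(busy))
```

The point of the sketch is that no single signal has to be conclusive. A handful of user reports alone stays below the threshold, while the same reports corroborated by other sources crosses it.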
Learnings and Takeaways
This incident underscores the importance of independent monitoring for critical services. While provider status pages play a crucial role, they are often reactive, leaving customers to navigate a major outage on their own until official updates appear. StatusGator's capability to detect issues early empowers teams to remain proactive, respond swiftly, and uphold customer trust.
Stay Ahead with StatusGator
Outages are inevitable, but you don't have to be unprepared. With StatusGator, you benefit from early detection and actionable insights. Whether overseeing a vital application or a global IT infrastructure, StatusGator ensures you stay informed and ready. Discover more about our capabilities and join the hundreds of IT teams that depend on StatusGator for essential monitoring by signing up for a free trial.