Amazon Cognito outage: How StatusGator notified customers 30 minutes before Amazon did

Published: December 13, 2024

Updated: December 13, 2024

On December 12, 2024, Amazon Cognito experienced a significant outage in the US-EAST-1 (N. Virginia) region, impacting authentication for numerous applications. This operational issue, caused by a configuration change deployment, led to widespread “TooManyRequestsException” errors for several hours. Many Amazon Cognito users were left scrambling to figure out why their application was down, why users could not authenticate, and how to get back up and running.

In the early minutes of the outage, as IT teams were struggling to figure out how to recover, Amazon was silent on the issue, and its status page still proclaimed “No recent issues”.

For StatusGator customers, however, the story unfolded quite differently. They were alerted minutes after the widespread outage began and nearly 30 minutes before Amazon officially acknowledged the issue on its status page.

The AWS Cognito Outage Timeline

At 02:24 UTC, StatusGator notified our users of authentication issues with Amazon Cognito, 28 minutes before AWS officially acknowledged the problem on its status page. Our early warning signal was powered by reports and patterns we observed starting at 02:17 UTC, allowing us to alert customers before their applications were deeply affected. This crucial lead time enabled proactive troubleshooting and communication to end users, minimizing the impact of the outage.

What Happened?

According to AWS’s postmortem, the issue began at 00:35 UTC (4:35 PM PST) due to a change deployment within Amazon Cognito. Here’s a full breakdown of the timeline:

00:35 UTC: Amazon detects increased error rates in Cognito in the US-EAST-1 region. However, the issue is not yet widespread and Amazon does not publicly disclose the error rate increase.
01:14 UTC: Amazon engineers begin investigating and working on a resolution, but the status page is not yet updated.
02:00 UTC: Amazon identifies two root causes for the increase in error rates but still has not yet disclosed this investigation on the status page.
02:17 UTC: The issue becomes more widespread and early reports of authentication errors start surfacing across the internet.
02:24 UTC: StatusGator customers receive our Early Warning Signals alert about problems with Amazon Cognito.
02:52 UTC: AWS updates its status page to acknowledge they are investigating an issue.
02:55 UTC: StatusGator detects the change on AWS’s status page and updates the official status.
03:17 UTC: AWS confirms the increase in error rates and isolates the issue to one of two root causes, pledging to continue investigating and hoping to resolve the issue within 60 minutes.
03:37 UTC: AWS updates its status page to state that they have implemented a fix and are seeing signs of recovery.
04:01 UTC: Time of full recovery as retroactively confirmed by AWS.
04:38 UTC: AWS posts its final incident summary.

There are two critical moments in this timeline. At 02:17 UTC (9:17 PM ET / 6:17 PM PT) the issue became more widespread, and StatusGator notified its customers 7 minutes later. Amazon did not notify its customers for a further 28 minutes. This timeline highlights the critical gap between when problems first emerge and when providers acknowledge them. StatusGator bridges that gap, giving its users an edge.
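The gaps quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the UTC timestamps from the timeline; the helper names `utc` and `minutes_between` are illustrative and not part of any StatusGator tooling.

```python
from datetime import datetime

def utc(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp; all timeline entries fall on the same UTC day."""
    return datetime.strptime(hhmm, "%H:%M")

def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)

# Timestamps taken directly from the timeline above.
widespread = utc("02:17")         # issue becomes more widespread
statusgator_alert = utc("02:24")  # StatusGator Early Warning Signals alert
aws_acknowledged = utc("02:52")   # AWS status page acknowledges the issue

alert_delay = minutes_between(widespread, statusgator_alert)      # widespread -> StatusGator alert
lead_time = minutes_between(statusgator_alert, aws_acknowledged)  # StatusGator alert -> AWS update
aws_delay = minutes_between(widespread, aws_acknowledged)         # widespread -> AWS update

print(alert_delay, lead_time, aws_delay)
```

The three values are the 7-minute alert delay, the 28-minute lead over the AWS status page, and their sum: the 35 minutes between the issue becoming widespread and the official acknowledgement.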

How StatusGator Beats Status Pages

Our platform continuously monitors hundreds of status pages and collects early warning signals from a variety of sources. This unique capability allows StatusGator to detect and report issues before they become widely known. In this incident, we were able to capture signals such as:

User reports of “TooManyRequestsException” errors submitted to our public website.
Reports of issues with Amazon Web Services from StatusGator customers’ internal status pages.
A sudden spike in interest and activity surrounding the status of Amazon Cognito.
Reports of authentication-related issues on other official status pages that depend on Cognito.

By analyzing these signals in real time, StatusGator provides faster alerts and actionable insights that help organizations respond quickly. We answer the critical question, “Is it everyone or just us?”, and help teams react to outages as they happen.
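To make the idea of combining signals concrete, here is a purely illustrative sketch. It is not StatusGator's actual detection logic, which this post does not describe. It assumes a simple model in which each signal type carries a hypothetical weight and an alert fires once the weighted total within a recent time window crosses a threshold. The weights, window, threshold, and sample signals are all made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    source: str   # kind of signal, e.g. "user_report"
    service: str  # affected service name
    minute: int   # minutes since an arbitrary reference time

# Hypothetical weights per signal type (assumptions, not real values).
WEIGHTS = {
    "user_report": 1.0,            # error reports submitted by users
    "customer_status_page": 2.0,   # reports on customers' internal status pages
    "traffic_spike": 1.5,          # sudden interest in the service's status
    "dependent_status_page": 2.5,  # issues on status pages of dependent services
}

def should_alert(signals, service, now, window=10, threshold=5.0) -> bool:
    """Return True if weighted signals for `service` in the last `window` minutes reach `threshold`."""
    score = sum(
        WEIGHTS.get(s.source, 0.0)
        for s in signals
        if s.service == service and now - window <= s.minute <= now
    )
    return score >= threshold

# Made-up sample signals, for illustration only.
signals = [
    Signal("user_report", "Amazon Cognito", 1),
    Signal("user_report", "Amazon Cognito", 3),
    Signal("traffic_spike", "Amazon Cognito", 4),
    Signal("dependent_status_page", "Amazon Cognito", 6),
]

print(should_alert(signals, "Amazon Cognito", now=3))  # only two user reports so far
print(should_alert(signals, "Amazon Cognito", now=7))  # all four signals in the window
```

In this toy model, no single signal type is enough to trigger an alert on its own. The threshold is crossed only once several independent sources agree, which is one simple way to separate a local problem from a widespread one.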

Learnings and Takeaways

This incident underscores the importance of independent monitoring for critical services. While provider status pages are essential, they are often reactive, leaving customers to grapple with service disruptions until official updates are posted. StatusGator’s early detection capabilities empower teams to stay ahead, respond swiftly, and maintain trust with their customers.

Stay Ahead with StatusGator

Outages happen, but you don’t have to be caught off guard. With StatusGator, you gain the power of early detection and actionable insights. Whether you’re managing a critical application or global IT infrastructure, StatusGator keeps you informed and prepared.

Read more about Early Warning Signals to see how we make this possible, and join the growing number of organizations that rely on StatusGator for critical monitoring and communication by booking a demo.
