Towards a more resilient StatusGator

Published: December 5, 2025, by Colin Bartlett

Updated: December 5, 2025, by Colin Bartlett

Background

Between October 20 and December 5, 2025, a rapid succession of major outages across multiple cloud providers disrupted large portions of the internet. Each of these events affected StatusGator in different ways.

After each incident, we implemented improvements to strengthen our reliability. This post summarizes the impact of each outage, the changes made, and the architectural work now underway to ensure StatusGator remains available during the moments when it is needed most.

But first, let me personally apologize, on behalf of the entire StatusGator team, for the disruption these outages caused. StatusGator has become an important part of the monitoring and incident-response workflows of thousands of IT teams.

As reliance on our service has grown, we’ve seen firsthand how concentrated the world’s infrastructure has become around a small number of providers. Reducing this dependency is now a core focus of ours.

Summary

From October through early December, four major outages at AWS, Azure, and Cloudflare occurred within days or weeks of each other. StatusGator maintained notification delivery and monitoring throughout, but access to our web UI and API was impacted during some events. Our uptime over the last five years remains above 99.98%, but we are committed to doing better.
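For a sense of scale, 99.98% uptime over five years leaves room for just under nine hours of total downtime. The following is a minimal sketch of that arithmetic, not an official calculation: it assumes an average year of 365.25 days, and the function name is illustrative.

```python
def downtime_budget_hours(uptime_percent: float, years: float) -> float:
    """Hours of downtime implied by an uptime percentage over a period."""
    # Assumption: an average year of 365.25 days (accounts for leap years).
    hours_in_period = years * 365.25 * 24
    return hours_in_period * (1 - uptime_percent / 100)


budget = downtime_budget_hours(99.98, 5)
print(f"{budget:.1f} hours")  # 8.8 hours
```

Staying above 99.98% therefore means keeping cumulative downtime under that figure across the whole five-year window.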

Below is a concise overview of each outage and the steps taken afterward.

October 20, 2025: Amazon Web Services outage

An AWS failure took down over 2,000 services – more than 30 percent of those we monitor. StatusGator surfaced the outage 10 minutes before AWS acknowledged it. Our web application experienced two downtime periods totaling roughly 3.5 hours due to reliance on AWS infrastructure.

We chronicled the changes we made in an outage postmortem: we deployed a second region for our core systems and introduced traffic-mitigation strategies to absorb extreme spikes in visits to our public website.

October 29, 2025: Azure outage

A failure in Azure’s CDN (Azure Front Door) disrupted over 700 services, including most Microsoft products. StatusGator detected the issue 42 minutes before Azure acknowledged it. We experienced approximately 10 minutes of downtime early in the incident.

The mitigation systems added after the AWS outage performed as intended, though they still required manual activation. We subsequently scripted and automated these processes further, and significantly strengthened our origin infrastructure with additional caching and autoscaling.

November 18, 2025: Cloudflare outage

A Cloudflare Web Application Firewall issue made more than 1,400 services unreachable and affected StatusGator’s web UI and API. Because Cloudflare’s control panel was also down, we were unable to disable the WAF to fall back to our origin.

Afterward, we created scripts and playbooks enabling WAF bypass without reliance on Cloudflare’s UI and reinforced our backend to prepare for operation without CDN-level caching. We also automated updates to our independently hosted status page to ensure it remains timely during incidents. Finally, we began evaluating a multi-CDN approach to avoid future single-provider failures.

December 5, 2025: Cloudflare outage

A second Cloudflare WAF-related outage – different in nature from the November event – again affected StatusGator and more than 500 monitored services. This time, Cloudflare’s website and APIs were also unreachable, preventing our automated WAF bypass from executing. StatusGator’s web UI, API, and status pages were unavailable for 14 minutes. Monitoring and outbound notifications continued normally.
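The failure mode described here, an automated mitigation that depends on the very provider that is down, is commonly addressed with an ordered chain of fallback steps. The sketch below shows that general pattern only. The step names and their behavior are hypothetical and do not describe StatusGator's actual tooling or any real provider API.

```python
from typing import Callable, Sequence


class AllStepsFailed(Exception):
    """Raised when no fallback step could be completed."""


def run_first_successful(steps: Sequence[tuple[str, Callable[[], None]]]) -> str:
    """Try each (name, action) in order; return the name of the first that succeeds."""
    errors = []
    for name, action in steps:
        try:
            action()
            return name
        except Exception as exc:  # a real script would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllStepsFailed("; ".join(errors))


# Hypothetical steps, for illustration only.
def bypass_via_provider_api() -> None:
    # Simulates the situation above: the provider's API cannot be reached.
    raise ConnectionError("provider API unreachable")


def switch_dns_to_origin() -> None:
    # Placeholder for a step that does not depend on the affected provider.
    return None


chosen = run_first_successful([
    ("bypass_via_provider_api", bypass_via_provider_api),
    ("switch_dns_to_origin", switch_dns_to_origin),
])
print(chosen)  # switch_dns_to_origin
```

The point of the pattern is that at least one step in the chain must be executable without the failing provider. Otherwise the automation fails together with the service it is meant to route around.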

Next Evolution

This series of outages underscored the critical role StatusGator plays during industry-wide failures. We are now investing deeply in architectural changes to ensure high availability even when major providers experience disruptions.

Reduce reliance on a single CDN and WAF. This is our highest priority. We are fast-tracking plans to diversify our CDN and WAF footprint to remove Cloudflare as a single point of failure. This work will take several months, and we will share more specifics as a timeline emerges.
Reduce reliance on cloud hyperscalers. When major cloud providers go down, the ripple effects are global. Our goal is to reduce – and where possible eliminate – single-provider dependencies, including AWS. This will take longer, likely a year or more. But we are committed to this investment.
Restructure integrations for higher availability. Customers using notifications or webhook-based workflows saw no downtime during these events. We intend to redesign certain helpdesk and embedded integrations so they operate on similar principles and maintain functionality independent of our API’s availability.
Publish a Service Level Agreement. Many customers have requested an SLA, and we now recognize its importance. After a decade of minimal downtime, these consecutive outages make it the right time to formalize one. More details will follow soon.
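
As a rough illustration of the first item, one simple way to avoid a single CDN point of failure is to probe each provider and route traffic to the first healthy one in a preferred order. This is a minimal sketch under stated assumptions, not a description of the planned design: the provider names, the `ProbeResult` type, and the `pick_provider` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one health probe against a CDN provider (hypothetical type)."""
    provider: str
    healthy: bool


def pick_provider(probes: Sequence[ProbeResult],
                  preferred_order: Sequence[str]) -> Optional[str]:
    """Return the first preferred provider whose probe succeeded, or None."""
    healthy = {p.provider for p in probes if p.healthy}
    for name in preferred_order:
        if name in healthy:
            return name
    return None  # no CDN is healthy; the caller might serve from origin instead


probes = [ProbeResult("cdn-a", healthy=False), ProbeResult("cdn-b", healthy=True)]
print(pick_provider(probes, ["cdn-a", "cdn-b"]))  # cdn-b
```

In practice the probe itself and the mechanism that shifts traffic (for example, DNS) would also need to run independently of any single provider.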

Conclusion

Thank you for relying on StatusGator to keep your team informed during critical moments. We’ve taken concrete steps after each outage and are now making more substantial architectural investments to ensure our reliability for years to come. We are committed to delivering the resilient, dependable service you need during the internet’s largest outages.
