Hardening against future S3 outages
March 7, 2017
by Colin Bartlett
On February 28, 2017, Amazon S3 in the us-east-1 region suffered an outage for several hours, impacting huge swaths of the internet. StatusGator was impacted, though I was able to mitigate some of the more serious effects pretty quickly and StatusGator remained up and running, reporting status page changes through the event. Since StatusGator is a destination for people when the internet goes dark, I aim to keep it stable during these events.
Special thanks to the StatusGator customers for putting up with some web UI quirks and some delayed notifications in the early hours of the outage. I aim to do better, and I will. Starting with this analysis:
The most immediate issue when S3 went down was the display of some assets in the StatusGator web UI. Although most of the website is served by Heroku, the tiny favicons next to each service are served up by S3, behind CloudFront. When browsing the UI, one would see broken or missing icons. This didn’t affect the layout or functionality, so I didn’t bother to try and fix this. And anyway, all of S3 was down, surely it would be back up in a few minutes!
To harden against future downtime, I’m going to replicate these to another region, or even replicate to Google Cloud, which supports the S3 protocol, as discussed on Hacker News. In a pinch, perhaps I could change the origin of the CloudFront distribution that fronts these images. But in the end, they are not mission critical.
StatusGator uses a combination of web scraping and APIs (both public/documented and hidden/undocumented) to aggregate the published status of cloud services. No matter what method it uses, StatusGator saves a PNG screenshot of the status page and a scrape of the API results or HTML at the time of page change. The images are presented to users in the UI as a way to see more details about what was posted during a given event. The scrapes are kept internally for debugging, continuous scrape improvement, and future data analysis. All of this data is pushed to S3 at the time of capture. Immediately, these capture jobs started failing because the POST operations to the bucket were failing. I ended up needing to disable the capability during the outage, and therefore skipped scrapes and screen captures entirely during the S3 downtime.
To avoid losing archive data in the future, I have already spun up a backup bucket in another region. The jobs that capture will first attempt to upload to the primary bucket and, if unable, will upload to the backup bucket.
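A minimal sketch of that primary-then-backup upload, not the actual StatusGator code. The method name, the error class, and the idea of passing each bucket as a callable are all assumptions made for illustration; in a real app each callable would wrap an S3 client for one bucket and region.

```ruby
# Hypothetical error raised when no storage target accepts the upload.
class UploadFailed < StandardError; end

# targets: ordered list of [name, callable] pairs. Each callable takes
# (key, body) and raises on failure. Returns the name of the target
# that accepted the upload.
def upload_with_fallback(targets, key, body)
  errors = []
  targets.each do |name, uploader|
    begin
      uploader.call(key, body)
      return name
    rescue StandardError => e
      # Remember why this target failed, then try the next one.
      errors << "#{name}: #{e.message}"
    end
  end
  raise UploadFailed, "all targets failed (#{errors.join('; ')})"
end

# Usage with stand-in uploaders: the primary is "down", the backup works.
backup_store = {}
primary = ->(_key, _body) { raise IOError, "primary bucket unavailable" }
backup  = ->(key, body) { backup_store[key] = body }

used = upload_with_fallback([[:primary, primary], [:backup, backup]],
                            "capture.png", "png-bytes")
puts "stored in #{used}"
```

Trying the targets in order keeps the normal path unchanged and only touches the backup bucket when the primary raises.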
The biggest impact that StatusGator users might have noticed during the outage was the delay of some status page change notifications (via email, Slack, or any of the other supported methods). The reason for this was simple: The Sidekiq background queue processing StatusGator jobs could not keep up with the jobs it had to process, because many of the screenshot and scrape capture jobs were taking quite a long time to fail. A brute force approach would be to just spin up additional Heroku dynos to churn through the jobs. However, Heroku had by this time shut off their API, which meant that their customers could not deploy, launch new dynos, or scale at all. This is probably the most frustrating part of the Heroku platform because it appears that deployment and scaling in all Heroku regions is affected by S3 availability in the us-east-1 region.
Scratching my head a bit, I was able to hack together a solution by running StatusGator background workers from my own local machine. Luckily, Heroku’s API still allowed me to get the remote environment variables and pull down the production database. I spun up a copy of the Rails app running Sidekiq, connected to the cloud-hosted Postgres and Redis, and ran extra workers from my local machine. To my surprise, this worked remarkably well and I was even able to temporarily turn off the capture jobs which were clogging the queues in the first place. The web front end was still served by Heroku, while the workers ran from my late 2012 Core i5 iMac, and with remarkable success.
In addition to the backup bucket, I added more appropriate timeouts to the capture job and also set the queue priority of those capture jobs to below the page checking and notification jobs. I’m also going to explore running StatusGator in another Heroku region, although it appears that might not have helped in this situation.
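Here is a rough sketch of both ideas in plain Ruby, under stated assumptions: the 15-second limit, the queue names, and both method names are hypothetical. Sidekiq has its own queue configuration for ordering or weighting queues; the second method only illustrates the strict-ordering idea, where capture jobs run when nothing more urgent is waiting.

```ruby
require "timeout"

CAPTURE_TIMEOUT_SECONDS = 15 # assumed value, tune to taste

# Run a capture block, giving up after `seconds` so a slow failure
# cannot tie up a worker. Returns the block's value, or nil on timeout.
# (Client-level timeouts on the HTTP/S3 client are generally safer than
# Timeout.timeout; this is only the simplest way to show the idea.)
def capture_with_timeout(seconds = CAPTURE_TIMEOUT_SECONDS)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  nil
end

# Hypothetical queue names, highest priority first.
QUEUE_ORDER = %i[notifications checks captures].freeze

# Take the next job from the first non-empty queue in priority order.
# `queues` is a hash of queue name => array of pending jobs.
def next_job(queues, order = QUEUE_ORDER)
  order.each do |name|
    job = queues.fetch(name, []).shift
    return job if job
  end
  nil
end

# Usage: a capture that hangs is abandoned after a short wait.
result = capture_with_timeout(0.1) { sleep 1; :screenshot }
puts result.inspect
```

With this ordering, a backlog of slow capture jobs can no longer starve page checks or notifications.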
AWS status page parsing
Lastly, a large source of frustration for myself in the early minutes of the outage was Amazon’s own lack of status page updating. It appears their status page was dependent on S3 working, to me the most egregious error of theirs that day. Fortunately, they hacked a text update into the top of their page after the complaints mounted on Twitter. I was able to update StatusGator’s scraper during the outage to pull that text from the top of the page, update StatusGator’s status cache, and notify all the users who subscribe to status updates. So although the initial down notification might have been delayed, the up notifications were timely.
I had been meaning to switch the AWS status checker over from web scraping to the AWS API, but now I believe it might be better to leave it as-is or to build it so that when the API is unavailable or not updating, it can still grab Amazon’s hard-coded status updates from their status page. Hopefully Amazon has also remedied their status page’s dependency on S3.
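One way that API-first, page-fallback check could look, as a sketch only. The Status struct, the method name, and the 10-minute staleness threshold are assumptions, and treating an old timestamp as "not updating" is just one possible interpretation. The two fetchers are stand-ins for a real API call and a real scrape.

```ruby
# Hypothetical value object: the status text and when it was last updated.
Status = Struct.new(:text, :updated_at)

# Prefer the API result when it is available and fresh; otherwise fall
# back to whatever text can be scraped from the status page itself.
# Returns [source, text].
def resolve_status(api_fetcher, page_fetcher, now: Time.now, max_age: 600)
  api = begin
    api_fetcher.call
  rescue StandardError
    nil # API unavailable: treat the same as no result
  end
  return [:api, api.text] if api && (now - api.updated_at) <= max_age

  [:page, page_fetcher.call.text]
end

# Usage with stand-in fetchers: the API raises, so the page text wins.
now = Time.now
api  = -> { raise IOError, "API unavailable" }
page = -> { Status.new("Increased error rates for Amazon S3", now) }

source, text = resolve_status(api, page, now: now)
puts "#{source}: #{text}"
```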
All of these changes took a considerable investment of time, but it was a worthwhile one to ensure that StatusGator stays up and running at the most critical moments. There is obviously a lot more that could be done to ensure a platform that’s fully available during the broadest of outages. I look forward to growing StatusGator to the point where such infrastructure is warranted. If you want to help, sign up for a free StatusGator account and monitor up to three status pages, forever, free of charge. If you, like many people, depend on more than three cloud services, then consider a paid plan, which starts at just $10/month: a small price to pay for the peace of mind and transparency that comes with reliable, centralized status page change notifications.