At the beginning of April 2022, a massive disruption in CircleCI caused large portions of their cloud offering to be unavailable for users worldwide. It occurred after CircleCI deployed a change to its front end and an auto-vacuum job on one of its core databases.
Due to this outage, CircleCI users were unable to run tests and deploy code.
After the incident, CircleCI promised to prevent these kinds of disruptions in the future.
We decided to check CircleCI’s reliability and see if they kept their promises in 2022.
Did they? Read on to find out.
The Data Behind this Article
To evaluate if CircleCI kept its promises of reliability, we analyzed the outage data between January 1 and December 31, 2022. Covering the whole year, including before and after the April disruption, provides a reliable comparison.
We used historical data from our product, StatusGator, which aggregates data from official status pages to create team status pages. Our unique dataset allowed us to understand the reliability of all CircleCI features.
CircleCI features include Docker Jobs, macOS Jobs, Windows Jobs, CircleCI UI, Artifacts, and others.
On its status page, CircleCI also shows the history of CircleCI Dependencies and Upstream Services. They define Dependencies as third-party services that their infrastructure depends on and Upstream Services as third-party services that impact CircleCI’s jobs.
CircleCI reports on scheduled maintenance, minor incidents, and major outages. However, for reliability analysis, we only considered major outages.
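The filtering step described above can be sketched in a few lines of Python. The record layout (an `Incident` with `kind`, `start`, and `end` fields) and the sample values are hypothetical placeholders, not StatusGator's actual schema or CircleCI's real incident data. The sketch only illustrates the idea of keeping major outages and totalling them per month.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record layout, assumed for illustration only.
    kind: str          # "maintenance", "minor", or "major"
    start: datetime
    end: datetime


def monthly_major_outages(incidents):
    """Return {(year, month): (count, total_duration)} for major outages only."""
    counts = defaultdict(int)
    durations = defaultdict(timedelta)
    for incident in incidents:
        if incident.kind != "major":
            continue  # skip scheduled maintenance and minor incidents
        key = (incident.start.year, incident.start.month)
        counts[key] += 1
        durations[key] += incident.end - incident.start
    return {key: (counts[key], durations[key]) for key in counts}


# Made-up sample records, not real CircleCI incidents.
sample = [
    Incident("major", datetime(2022, 1, 5, 10, 0), datetime(2022, 1, 5, 11, 30)),
    Incident("minor", datetime(2022, 1, 9, 8, 0), datetime(2022, 1, 9, 8, 20)),
    Incident("major", datetime(2022, 1, 20, 14, 0), datetime(2022, 1, 20, 14, 45)),
    Incident("maintenance", datetime(2022, 2, 1, 0, 0), datetime(2022, 2, 1, 2, 0)),
]
summary = monthly_major_outages(sample)
```

Grouping by the month in which an outage starts is one possible convention; an outage that spans a month boundary would need a rule of its own.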
CircleCI Outages Overview in 2022
Let’s take a look at the history of CircleCI outages across both its own features and third-party services (CircleCI Dependencies and Upstream Services) to analyze reliability in 2022.
After an initial update on reliability on April 13, 2022, we can see that the number of outages increased in May and June. Yet, no spikes in duration were observed.
July and August 2022 saw a drop in both the number and the duration of outages. In September and October, however, both rose again to levels similar to April 2022.
October saw a further increase in the number of outages, but by the end of the year both the number and the duration of outages had dropped significantly.
To understand the difference properly, we should look at outage data before and after the reliability “promise”. Because the promise came in April, we compared January–April with May–December.
Average outage stats on CircleCI's features and 3rd-party services, 2022

| Metric | Before promise | After promise | Change |
|---|---|---|---|
| Outage count | 4.5 | 5.63 | ⬆ +0.8 (~18%) |
| Outage duration | 4:37:30 | 2:47:30 | ⬇ -1:50:00 (~41.4%) |
While the average number of outages went up by ⬆18%, the average duration for the period after the promise dropped by ⬇41.4%. So overall, the number of outages increased, but they’re not lasting as long. This interesting finding may point to increased transparency, with more incidents being published, and a renewed focus on faster resolution.
It’s clear from this data that CircleCI had a few tough months of outages. Still, the fact that they brought the average duration down indicates improvement.
With that in mind, it’s important to remember that CircleCI depends on third-party services. Therefore, we decided to go deeper to understand how much influence these third parties had on reliability.
HELPFUL TIP – This data is a clear reminder that you should monitor your cloud dependencies – you never really know who causes downtime without doing so.
Don’t fall behind! StatusGator shows aggregated data from all your cloud dependencies on your status page (2,660+ popular cloud providers).
CircleCI Features Outages (without the third-party services)
The graph below demonstrates CircleCI outages across its own features without third-party services.
A few interesting points to note:
- The total outage duration in 2022 adds up to nearly 31 hours.
- The longest outage was in January, lasting 9 hours and 25 minutes.
- The highest number of outages in one month occurred in October.
As we can see from the chart, October stood out for the number of features affected during an outage. Two of its three outages affected 6 features and 5 features, respectively, which is a significant statistic.
We decided to look further and calculate their downtime separately.
Outages across CircleCI's own features, 2022

| CircleCI feature | Count of outages | Duration |
|---|---|---|
| Pipelines & Workflows | 6 | 12:45:00 |
The two main conclusions were:
- macOS Jobs, Machine Jobs, and Docker Jobs were affected by outages more frequently than other features.
- Machine Jobs had the longest total outage duration, at 18 hours and 50 minutes. Docker Jobs followed closely, with a total duration of 16 hours and 40 minutes.
Taking this into account, we looked closer at the average number of outages (and duration) before and after the promise.
We saw the following:
Average outage stats on CircleCI’s own features (without 3rd-party services), 2022

| Metric | Before promise | After promise | Change |
|---|---|---|---|
| Outage count | 3.5 | 3.63 | ⬆ +0.13 (~3.7%) |
| Outage duration | 3:51:15 | 1:43:08 | ⬇ -2:08:07 (~56.7%) |
When comparing the total outages and duration, we can see that CircleCI features are not solely to blame for the total number of outages and their duration.
Even after the April reliability promise, the average number of outages across CircleCI’s own features increased slightly, by ⬆3.7%. However, the average duration dropped significantly, by ⬇56.7%.
So far, the data suggests that the reliability promise did not reduce the number of outages. Even so, outages are being resolved more quickly, as the lower duration shows.
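The “Change” column in the tables above can be recomputed from the before and after averages. What follows is a minimal sketch, not the exact method used for this article: the helper names `parse_hms` and `percent_change` are hypothetical, and because the table values are rounded, the recomputed percentages may differ slightly from the published ones.

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an 'H:MM:SS' string such as '3:51:15' into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def percent_change(before, after):
    """Relative change from before to after, in percent (negative means a drop)."""
    return (after - before) / before * 100


# Monthly averages for CircleCI's own features, taken from the table above.
count_change = percent_change(3.5, 3.63)

before_seconds = parse_hms("3:51:15").total_seconds()
after_seconds = parse_hms("1:43:08").total_seconds()
duration_change = percent_change(before_seconds, after_seconds)

print(f"Outage count: {count_change:+.1f}%")
print(f"Outage duration: {duration_change:+.1f}%")
```

Converting durations to seconds before dividing keeps the calculation in plain numbers, so the same `percent_change` helper works for both counts and durations.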
Let’s take a look at CircleCI Dependencies and Upstream Services outage statistics to investigate further.
CircleCI 3rd-party Dependencies and Upstream Services Outages
CircleCI’s Dependencies and Upstream Services also affect uptime, so they must be considered in the analysis. As mentioned above, these are the third parties that CircleCI depends on (such as GitHub and AWS).
The key conclusions we made from this dataset are:
- Dependencies and Upstream Services caused at least 1 outage per month (except for August and December).
- Dependencies and Upstream Services contributed the most downtime in June and July: an additional 2 hours and 40 minutes in June and 2 hours and 35 minutes in July.
- In May 2022, CircleCI went down 7 times due to disruptions at third-party service providers.
Notably, partial system outages of Atlassian Bitbucket SSH, Google Cloud Storage, and GitHub Packages caused the CircleCI outages in February, July, and November, respectively. Without them, CircleCI would have avoided major outages in those months.
Let’s take a look at the list of third-party services that CircleCI depends on.
Outages across CircleCI's 3rd-party dependencies, 2022

| CircleCI connection | 3rd-party service | Count of outages | Duration |
|---|---|---|---|
| Upstream Services | Atlassian Bitbucket API | 1 | 1:05:00 |
| Upstream Services | Atlassian Bitbucket SSH | 1 | 0:04:00 |
| Upstream Services | GitHub API Requests | 2 | 0:45:00 |
| Upstream Services | GitHub Packages | 4 | 2:25:00 |
| Upstream Services | GitHub Webhooks | 1 | 0:05:00 |
| CircleCI Dependencies | Google Cloud Networking | 4 | 1:21:00 |
| CircleCI Dependencies | Google Cloud Storage | 2 | 2:35:00 |
| CircleCI Dependencies | Mailgun API | 1 | 1:30:00 |
| CircleCI Dependencies | Mailgun Outbound Delivery | 2 | 2:40:00 |
| CircleCI Dependencies | Mailgun SMTP | 1 | 1:30:00 |
From this data, it’s clear that:
- GitHub Packages and Google Cloud Networking affected the reliability of CircleCI in 2022 most often.
- The longest outages that affected CircleCI were in connection with the downtime of Mailgun Outbound Delivery, Google Cloud Storage, and GitHub Packages.
- In total, CircleCI saw 14 hours and 20 minutes of outages in 2022 (due to third-party services).
Average outage stats on CircleCI's 3rd-party Dependencies and Upstream Services, 2022

| Metric | Before promise | After promise | Change |
|---|---|---|---|
| Outage count | 1 | 2 | ⬆ +1 (+100%) |
| Outage duration | 0:28:45 | 1:05:00 | ⬆ +0:36:15 (+127%) |
We can see that CircleCI Dependencies and Upstream Services caused an increase in average outages and duration after CircleCI went public with the promise of reliability.
If you wish to monitor your third-party services and provide the same level of transparency as CircleCI does, try StatusGator status pages to monitor your dependencies in a single place.
CircleCI came very close to keeping its promises.
CircleCI improved the average monthly duration of outages across its own features, but the total number of major outages increased by 18% (3.7% if counting CircleCI’s own features only).
This should not turn users against CircleCI, since their transparency is admirable within the market. Making a promise with such a level of transparency and providing comments on outages is a brave move for CircleCI.
Overall, we believe that Rob Zuber, CircleCI’s CTO, is working towards improving reliability despite external dependencies. Shout out to Rob for doing a great job with transparency. We hope CircleCI will show ever-improving reliability in 2023. Keep up the great work.
If you want to follow CircleCI, sign up to StatusGator to receive notifications and updates on CircleCI outages.