On February 28, 2017, Amazon Web Services (AWS) experienced a roughly four-hour outage of its Amazon S3 cloud storage service in the US-EAST-1 region. A large part of the Internet went down as a result. The outage was unprecedented for the world’s largest and busiest cloud infrastructure provider. What can we learn from it in the aftermath? Here are four lessons.
1. Human errors are unpredictable yet devastating
As unbelievable as it sounds, a typo turned out to be the root cause of the AWS S3 outage. Here is a quote from AWS’s official explanation:
“The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process.
Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region.”
Human error can cause an outage anywhere: on-premises, in the cloud, managed, or self-hosted. Consider the recent Delta computer outage as an example of an entire self-hosted system going down. Delegating the responsibility for managing a platform service to a cloud provider doesn’t change the fact that human error can bring it down, but it does amplify the impact. Whereas the Delta outage affected only Delta, the AWS S3 outage affected a good chunk of the Internet.
Bear in mind that this outage was caused by a simple typo, with no malicious intent. If such failures were triggered deliberately by terror organizations or rogue states, the ramifications would be unthinkable.
2. Don’t confuse cloud data durability with availability
As it turns out, there is a common misconception about the difference between durability and availability. Durability measures how reliable the storage is and answers the question “Am I going to lose my data?” Availability, on the other hand, measures how accessible the data is, i.e. “Am I going to be able to retrieve my data?”
AWS S3 offers 99.999999999% durability within a single region. To use Amazon’s own example, if you store 10,000 objects in S3, you can expect to lose a single object on average once every 10 million years. Amazon S3 accomplishes this by replicating the data across multiple facilities within a region.
Standard S3 availability of objects, on the other hand, is 99.99% per year within a region. That means that in any given 12-month period you should expect to be unable to access your data for a total of up to about 52 minutes and 33 seconds.
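The two figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming a 365-day year and treating the durability percentage as an annual per-object survival rate; the function names are made up for illustration and are not part of any AWS API.

```python
from fractions import Fraction

# Assumption: a 365-day year, which is what the 52 min 33 s figure implies.
MINUTES_PER_YEAR = 365 * 24 * 60


def failure_fraction(percent: str) -> Fraction:
    """Turn a percentage such as "99.99" into the exact fraction that is NOT covered."""
    return 1 - Fraction(percent) / 100


def downtime_budget_seconds(availability_percent: str) -> Fraction:
    """Seconds per year a service can be unreachable and still meet its availability."""
    return failure_fraction(availability_percent) * MINUTES_PER_YEAR * 60


def years_between_losses(durability_percent: str, objects_stored: int) -> Fraction:
    """Average number of years between losing a single object."""
    expected_losses_per_year = failure_fraction(durability_percent) * objects_stored
    return 1 / expected_losses_per_year


if __name__ == "__main__":
    minutes, seconds = divmod(downtime_budget_seconds("99.99"), 60)
    print(f"Availability 99.99%: {minutes} min {float(seconds):.1f} s of downtime per year")

    years = years_between_losses("99.999999999", 10_000)
    print(f"Durability 99.999999999%: one of 10,000 objects lost every {int(years):,} years")
```

Percentages are passed as strings so that `Fraction` keeps them exact; with ordinary floats, eleven nines of durability would pick up small rounding errors. The results are 52 minutes 33.6 seconds and 10,000,000 years, which match the numbers quoted above.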
3. Cloud downtime has costly ramifications
Millions of large and small companies in the United States using AWS S3 were impacted, including Yahoo, Apple, Adobe, Pinterest, Airbnb, Netflix, and Slack, to name a few. It also affected Huffington Post, Imgur, Business Insider, and many other mainstream media sites. In short, it affected a lot of people and disrupted many business operations.
The Wall Street Journal reported that the outage “cost $150 million losses for companies in the S&P 500 index and $160 million for U.S. financial services companies using the infrastructure, according to Cyence Inc., a startup that specializes in estimating cyber-risks. Apica Inc., a website-monitoring company, said 54 of the internet’s top 100 retailers saw website performance slow by 20% or more.”
4. All businesses need a cloud-to-cloud / cloud-to-local backup solution
Putting all of your eggs in one basket is a highly dangerous idea. This incident highlights the real risks a company takes on when it relies on only one cloud-based file system. As the AWS S3 crash inevitably cost businesses millions of dollars, it’s time companies started taking some eggs out of the AWS basket and putting them in different baskets. Some call this a “multi-cloud strategy”: using more than one cloud to store the data or run the services, or using backup software to replicate data automatically from one cloud to another. By adopting this strategy, companies can retrieve their mission-critical cloud data quickly from other clouds when one of their cloud vendors experiences a failure. It increases redundancy, reduces vendor lock-in, and allows companies to use best-of-breed solutions instead of taking everything from the same vendor. If speed of recovery is your highest priority, consider adding your local server as a backup destination. When your data is also hosted locally, you can reduce the time it takes to recover it.
All in all, the Amazon S3 outage shows that relying on a single cloud provider to store critical data is far too risky for businesses. This is not the first time a public cloud storage service has gone down for hours, bringing everyone down with it. And it won’t be the last.
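To make the idea concrete, here is a minimal sketch of replicating an object to several destinations and reading it back from the first one that responds. Everything in it is hypothetical: `InMemoryStore` is an in-memory stand-in for a real cloud bucket or local server, and `replicate` and `read_with_failover` are illustrative names, not functions from any vendor’s SDK or from a backup product.

```python
from typing import Dict, List


class StoreUnavailable(Exception):
    """Raised when a storage destination cannot be reached."""


class InMemoryStore:
    """Hypothetical stand-in for one cloud bucket or one local backup server."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True  # flip to False to simulate an outage
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise StoreUnavailable(self.name)
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise StoreUnavailable(self.name)
        return self._objects[key]  # KeyError if this store never got the object


def replicate(key: str, data: bytes, stores: List[InMemoryStore]) -> List[str]:
    """Write the object to every reachable store; return the names that accepted it."""
    written = []
    for store in stores:
        try:
            store.put(key, data)
            written.append(store.name)
        except StoreUnavailable:
            continue  # a real tool would queue a retry here
    return written


def read_with_failover(key: str, stores: List[InMemoryStore]) -> bytes:
    """Try each store in priority order and return the first copy found."""
    for store in stores:
        try:
            return store.get(key)
        except (StoreUnavailable, KeyError):
            continue
    raise StoreUnavailable(f"no store could serve {key!r}")


if __name__ == "__main__":
    primary = InMemoryStore("primary-cloud")
    secondary = InMemoryStore("secondary-cloud")
    local = InMemoryStore("local-server")
    stores = [primary, secondary, local]

    replicate("invoices/2017-02.csv", b"example data", stores)
    primary.available = False  # simulate the primary provider going down
    print(read_with_failover("invoices/2017-02.csv", stores))
```

The order of the list sets the recovery priority. Putting the local server first would favor recovery speed, while putting it last treats it as the copy of last resort.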
If you are concerned about how an incident like this could impact your business, it’s time to plan for such a disruption now. Click here to see how CloudBacko’s Cloud to Cloud / Cloud to Local Backup Solution can help you.