

On February 28, 2017, as many Brazilians enjoyed the final day of carnival festivities, Amazon servers, including the S3 cloud storage service provided by Amazon Web Services (AWS), experienced an outage that lasted a few hours.
The outage had a significant impact on a wide range of online platforms, making sites like Trello, Slack, Imgur, and a section of the Apple App Store inaccessible. Even so, major companies like Netflix and Wix, which also use AWS infrastructure, did not experience any disruptions.
Amazon has been a dominant player in the computing and cloud hosting market since 2006, with a 31% market share by 2016, surpassing competitors like Google and Microsoft, as reported by Forbes.
According to the company’s statement, the problem was caused by a typing error made by an engineer during a routine procedure to debug the S3 billing system. A command meant to take a small number of servers offline was entered incorrectly and removed a much larger set, including servers that other parts of S3 depended on. Those systems had to be restarted, which left the service temporarily unavailable for the companies affected. No customer data was lost, and services were restored after a few hours.
The brief period of instability had global consequences, prompting Amazon to announce a review of its procedures and systems to prevent similar issues from occurring again.
What can regular people learn from this event? If a big company like Amazon can experience this issue, what might happen with the company that hosts my website? Let’s explore the valuable lessons we can glean from this situation.
Human mistakes are possible in any organization.
This was one of the biggest outages in the history of AWS. The issue was attributed to an authorized member of the maintenance team. In other words, it was not caused by a curious or adventurous outsider but by a skilled system administrator, presumably competent given their position in the company.
Despite the advanced technology and security measures in place, no company is immune to human error.
A few months back, a similar incident happened at a small American web hosting company. The person in charge of the system accidentally executed a command that deleted all files and directories on the server, including around 1500 client accounts, with no possibility of recovery because even the backups were lost. When they asked for technical help in a forum, they were told to consult a lawyer rather than a specialist. It’s an unfortunate situation, isn’t it?
Human errors can occur in any company, regardless of its size, but they are the exception rather than the norm.
There is continuous availability.
Incidents like the one that hit the websites on AWS are unusual. They relate to uptime, which is the amount of time a server is operational and accessible.
The uptime guarantee is specified in the hosting company’s service contract. Amazon states that S3 is designed for 99.99% availability (the often-quoted 99.999999999% figure refers to data durability, not uptime). Its SLA provides service credits based on the level of unavailability, with varying percentages returned to the customer.
Rarely do companies offer 100% uptime guarantees, and even those that do often include clauses in their contracts for compensation in case of downtime. Service Level Agreements (SLAs) differ from company to company, highlighting the variability in availability promises.
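To see what these percentages mean in practice, it helps to convert them into a downtime budget. The snippet below is a minimal sketch of that arithmetic. The function name and the 30-day month are assumptions made for illustration, and real SLAs define their own measurement periods, so check the contract for the exact rules.

```python
def allowed_downtime_minutes(availability_percent: float, period_hours: float) -> float:
    """Return the downtime, in minutes, that an availability target permits over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_hours * 60


HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day billing month
HOURS_PER_YEAR = 365 * 24

for target in (99.0, 99.9, 99.99):
    per_month = allowed_downtime_minutes(target, HOURS_PER_MONTH)
    per_year_hours = allowed_downtime_minutes(target, HOURS_PER_YEAR) / 60
    print(f"{target}% uptime: about {per_month:.1f} min/month, {per_year_hours:.1f} h/year")
```

Under these assumptions, a 99.9% guarantee still leaves room for roughly 43 minutes of downtime in a month, so each extra "nine" in the contract shrinks that budget by a factor of ten.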
Internet servers require periodic restarts, not only for system updates but also for other maintenance tasks like fixing issues. Companies typically notify users in advance about service unavailability, often scheduling maintenance during low-traffic hours.
Every company faces challenges.
The popular saying suggests that problems can occur in any family, and this also applies to the corporate world. It is important to bear in mind that even well-established companies, despite appearing immune to mistakes, may experience errors at some point.
Issues occur daily in any company, and their consequences vary with the scale of the problem. Examples include Samsung’s Galaxy Note 7 phones catching fire, Apple’s faulty iPhone batteries in 2016, and customer data leaks that compromise security.
It is crucial to remember that no company is without difficulties. Being aware of this can help prevent unexpected issues. When it comes to website hosting, it is wise to have a backup plan in place and ensure your site’s backups are current.
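One way to check that your backups really are current is to look at the age of the newest backup file. The script below is only a sketch: the directory layout, the `*.tar.gz` file pattern, and the 24-hour threshold are all assumptions for illustration, so adapt them to however your host or backup tool stores its archives.

```python
import time
from pathlib import Path


def newest_backup_age_hours(backup_dir, pattern="*.tar.gz", now=None):
    """Return the age in hours of the most recent backup file, or None if none exist."""
    files = [p for p in Path(backup_dir).glob(pattern) if p.is_file()]
    if not files:
        return None
    newest_mtime = max(p.stat().st_mtime for p in files)
    current = time.time() if now is None else now
    return (current - newest_mtime) / 3600


def backups_are_current(backup_dir, max_age_hours=24, pattern="*.tar.gz", now=None):
    """Return True only if at least one backup exists and the newest is recent enough."""
    age = newest_backup_age_hours(backup_dir, pattern, now)
    return age is not None and age <= max_age_hours
```

Calling `backups_are_current("/path/to/backups")` from a scheduled job and alerting when it returns `False` would catch a backup routine that has silently stopped. A recent file is not proof that the backup can be restored, so it is still worth testing a restore from time to time.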
Read the agreement before selecting a web hosting service.
The uptime guarantee, usually set out in the SLA, differs among companies. It is crucial to assess this aspect along with other important factors in a web hosting service agreement.
Not all companies provide website backups, which can be a crucial consideration if your site contains valuable information. It’s important to check the contract to understand the backup policy and any exclusions from liability. Consider hosting services that offer backup options and review the coverage details in the service agreement before making a decision.
You should also know the restrictions the hosting company imposes before opting for its services. These are typically outlined in the service contract, so review the terms of service carefully before selecting a provider.
Have a strategy in place for when your website goes down.
Having a backup plan is crucial, especially in the event of your website going offline. Even if you rely on a service like Amazon, it’s important to prepare for any potential downtime.
There are various methods to prepare for a situation similar to the AWS outage, such as having multiple instances of your project in different data centers. This can be achieved by using the same service provider with servers geographically distributed or by distributing the service among multiple providers.
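As a rough illustration of the multi-location idea, the sketch below walks through a list of deployments in priority order and returns the first one that answers a health check. The endpoint URLs and the `/health` path are hypothetical, and in a real setup this decision is usually made by DNS failover or a load balancer rather than by a script like this one.

```python
import http.client
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if the URL answers with an HTTP 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection failures, timeouts, HTTP errors and malformed URLs all count as "down".
        return False


def first_healthy(endpoints, probe=is_healthy):
    """Return the first endpoint that passes the probe, in priority order, or None."""
    for url in endpoints:
        if probe(url):
            return url
    return None


# Hypothetical endpoints: the same site in two regions, plus a copy at another provider.
ENDPOINTS = [
    "https://eu.example.com/health",
    "https://us.example.com/health",
    "https://backup-provider.example.net/health",
]

# Simulate the primary region being down, without touching the network.
simulated_down = {"https://eu.example.com/health"}
print(first_healthy(ENDPOINTS, probe=lambda url: url not in simulated_down))
```

The `probe` parameter is there so the selection logic can be tried out offline, as in the simulated run above. Dropping the argument makes `first_healthy` use real HTTP requests through `is_healthy`.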
An alternative option is to use a CDN to distribute your content across multiple geographic locations. CDN stands for Content Delivery Network: services like Cloudflare store copies of your website’s content on servers worldwide. This approach offers several benefits, such as faster page loading and less strain on your hosting server, because many requests are answered from the copies on the CDN instead.
If your server is offline, you can activate “always online” mode on specific CDNs to ensure users can access a cached version of your site.
Conclusion
We learned important lessons from the incident involving Amazon’s cloud hosting service, emphasizing the need for companies to have contingency plans to prepare for unexpected issues and avoid website disruptions.
Feel free to ask any questions or leave a comment if you have something to add. I’m happy to chat with you! 🙂
Published on 08/03/2017 with updates made on 07/03/2019.