Amazon and the $150 million typo: cloud risks for early stage companies, and how to mitigate them

Although the impact was not quite as big as some headlines had suggested (“Amazon Just Broke the Internet”), the outage of Amazon’s Simple Storage Service (S3) in the US-East-1 region on Tuesday 28 February 2017 caused significant disruption. The Wall Street Journal quoted Cyence Inc., a start-up specialising in cyber-risks, as estimating that the Amazon outage cost companies in the S&P 500 index $150 million. Apica Inc., a website-monitoring company, said 54 of the internet’s top 100 retailers saw website performance slow by 20% or more. Connected lightbulbs, thermostats and other IoT hardware were also affected, with many users unable to control their devices as a result of the outage. Nest warned customers that its internet-connected security cameras and smartphone apps were not functioning properly as a result of the Amazon issue. Amazon was unable to update its own Amazon Web Services (AWS) status dashboard for the first two hours of the outage because the dashboard itself depended on the unavailable systems.

Amazon’s explanation was that “an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” Removing a significant portion of the server capacity required full restarts of the affected subsystems. The problem was compounded by the fact that parts of the system had not been completely restarted for several years, so the restart process took longer than expected.

As a result of the outage, Amazon said it is making several changes to the way its systems are managed and promised to improve the recovery time of key subsystems.

When signing up to cloud hosting contracts, a lot of companies assume everything will be fine and their websites, applications and data will always be available when needed, particularly if they are choosing one of the leading providers of hosted services such as AWS. In August 2016 Gartner identified AWS and Microsoft as the only two companies in its “Leader” category for cloud infrastructure as a service (IaaS) worldwide (ranking AWS ahead of Microsoft) and said that “The market for cloud IaaS has consolidated significantly around two leading service providers.” This consolidation increases the impact of outages such as the one affecting Amazon’s S3 service.

Given the potential impact of an outage on critical services, customers may need to reconsider how they mitigate the risk of downtime, and we discuss the possible options below.

Increasing the target for availability

Taking Amazon’s S3 service as an example, when used in a single region it is said to be designed for 99.99% availability, with a service level agreement for availability of 99.9%. However, relying on a service in a single region creates a potential single point of failure. The Amazon outage on 28 February involved just one region, US-East-1 in northern Virginia, USA, but its impact was so significant because this is one of the most heavily used regions in the AWS global infrastructure.
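To put these percentages in context, the short Python sketch below converts an availability target into the downtime it permits over a billing period. It is illustrative only: the function name and the 30-day month are assumptions made for this example, not terms taken from any AWS agreement.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Return the minutes of downtime permitted by an availability target.

    availability_pct is a percentage such as 99.9. The 30-day default
    period is an assumption chosen for illustration.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = days * 24 * 60
    # The unavailable fraction is whatever the target leaves uncovered.
    return total_minutes * (1.0 - availability_pct / 100.0)
```

On these assumptions, a 99.9% target permits roughly 43 minutes of downtime in a 30-day month, while a 99.99% target permits a little over 4 minutes.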

The impact would not have been so significant if AWS customers had chosen a multi-region architecture as sites and applications using S3 in a different region would not have been affected. AWS currently operates 42 availability zones (AZs) within 16 geographic regions around the world. AZs consist of one or more discrete data centers, each with redundant power, networking and connectivity, housed in separate facilities, miles apart from each other on separate flood plains. By contrast, another of AWS’s services, EC2, provides an SLA of 99.95% but this greater availability threshold is based on deployment to at least two AZs (although S3 can only be selected by region, not by AZ).

The disadvantage of this approach is that multi-region implementations will increase cost and complexity. Customers are understandably reluctant to achieve an extra ‘9’ of availability by selecting another region and potentially doubling their hosting costs. However, the additional costs and complexity will need to be measured against the risks of operational disruption, financial loss and reputational damage arising from significant unavailability of critical data and/or applications in a worst case scenario.
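The availability gain from a second region can be estimated with a simple calculation, sketched below in Python. It rests on a strong assumption that is not drawn from any provider's documentation: that regions fail independently of one another. In practice failures can be correlated (for example through shared control systems or a flawed failover design), so the result is best read as an optimistic upper bound. The function name is hypothetical.

```python
def combined_availability(single_region: float, regions: int = 2) -> float:
    """Probability that at least one region is available.

    single_region is a fraction such as 0.999. ASSUMPTION: regions fail
    independently, which real deployments may not satisfy.
    """
    if not 0.0 <= single_region <= 1.0:
        raise ValueError("single_region must be between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    # The service is down only if every region is down at the same time.
    return 1.0 - (1.0 - single_region) ** regions
```

Under the independence assumption, two regions each available 99.9% of the time would give a combined figure of about 99.9999%. That theoretical gain is what has to be weighed against the additional hosting cost and complexity.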

Negotiating a stronger contractual position

Contracts with major hosting providers usually restrict the customer’s remedy to service credits if the provider fails to meet its availability target. For Amazon’s S3 service, for example, if availability falls below the service level of 99.9% in a month, customers would typically be awarded a service credit of 10% of the monthly fee. This may well be wholly insufficient recompense to customers who need to ensure that they can access their data or keep their sites and applications up and running at critical times, particularly if the service credits do not cover customers’ liability to their own customers as a result of unavailability.
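The gap between a service credit and a customer's actual loss can be illustrated with a small Python sketch. It models only the single tier described above (a 10% credit when monthly availability falls below 99.9%). Real service level agreements may contain further tiers and conditions, and the function name and monetary figures are hypothetical.

```python
def service_credit(monthly_fee: float, availability_pct: float,
                   threshold_pct: float = 99.9,
                   credit_rate: float = 0.10) -> float:
    """Return the service credit due for a month.

    Models one tier only: credit_rate of the monthly fee when measured
    availability falls below threshold_pct, otherwise nothing.
    """
    if availability_pct < threshold_pct:
        return monthly_fee * credit_rate
    return 0.0


def uncovered_loss(estimated_loss: float, credit: float) -> float:
    """Return the portion of an estimated loss not offset by the credit."""
    return max(estimated_loss - credit, 0.0)
```

For instance, a customer paying a hypothetical 1,000 a month that suffers an estimated loss of 50,000 from an outage would receive a credit of 100, leaving 49,900 uncovered.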

The major hosting providers have shown some willingness to offer more contractual protection for their customers by offering increased limits on their liability for damages caused by service level failures, but this has either come at a significant cost in terms of fees or been available only to customers spending very significant sums with the hosting provider. Such additional legal protection has typically not been afforded to customers spending less, and this is understandable: from the hosting providers’ perspective, they are offering a low-cost and largely commoditised solution and it is simply not realistic to expect them to carry significant legal risks at the price point at which the lower end services are offered. In other words, you don’t get what you don’t pay for, and so at the cheaper end of the market where commoditised services are being provided, customers are very unlikely to be able to negotiate better legal protections.

However, where high levels of availability are essential to their business model, customers should insist on having visibility over who is hosting their data and applications. During contract negotiations they should ensure that suppliers are required to identify all key subcontractors (and their subcontractors), so that the customer can identify potential vulnerabilities in the supply chain and consider steps to mitigate the risk of downtime before becoming committed to the contract.

Taking more control over hosting arrangements

Moving away from a massive scale, multi-tenant model towards a single-tenant, private cloud or even on-premises deployment provides an opportunity for more control, but at a cost both financially and in terms of operational flexibility. The cost benefits of deploying to the cloud are a significant source of advantage for start-ups and smaller organisations which do not have a major investment in existing on-premises hardware, combined with the agility and flexibility of cloud computing and instant access to global infrastructure. In contrast, large enterprises deploying to the cloud face a considerable incremental cost in addition to maintaining legacy on-premises resources until these can be retired, a process which may take several years.

Even for start-ups, though, the need to take control over how critical services are delivered may outweigh the costs. Digital challenger bank Monzo, which offers a contactless prepaid Mastercard and plans to offer a free current account this year, said that a severe outage resulting in its cards and app not working for most of Sunday 5 March was caused by a third party processor used by Monzo to connect to payment networks. When it first started, it made sense for Monzo to use a third party processor because the process for connecting directly to the payment networks was long, costly and complex and at the time there seemed to be no benefit to its customers. However, Monzo has just finished a 12-month project to connect directly to Mastercard so that it can process transactions entirely using its own technology. Announcing this change in a blog post published on 6 March, Monzo’s head of engineering Oliver Beattie said that “We see ourselves as a technology company as much as a bank, and going forward our strategy is to bring all critical systems in-house and continue to develop our own platform atop modern technology which we control.”

Local back-ups as a safety net

Despite the attractiveness of short-term savings in moving data and applications to a single-region, cloud-based solution, this approach could end up being very costly if businesses are dependent on a single point of failure without having an alternative solution which they can access quickly. From a practical perspective, whatever model they adopt for hosted services, customers need to ensure that they make regular back-up copies of their data stored by a hosting provider, downloading copies of the data to their own systems or to an alternative hosting provider so that if absolutely necessary they can quickly implement an alternative solution.

The same applies to software: keeping full back-up copies of key applications on-site means that, should a hosting provider have an extended outage, there is at least an option to redeploy elsewhere rather than risk an indefinite interruption in service.

Worth paying the extra hosting fees?

While cloud storage and processing offer significant price and operational advantages for start-ups, even early stage start-ups may well find it worthwhile to weigh the cost of paying for hosting in an extra region and/or with an alternative provider against the impact that a prolonged full outage might have on the operational stability, reputation and customer retention of a growing business. Even the most heavily negotiated hosting contracts are highly unlikely to afford adequate recompense for the effects of a full outage after it has happened. As such, while it is still strongly advisable to review the contracts (not least to ensure compliance with, for instance, data protection legislation), the strongest way to deal with the risks emanating from an outage is probably still to use an architecture for the hosting of data and software that will minimise the risk of there being a full outage in the first place.