An outage of Amazon Web Services on Tuesday that led to disruptions for huge numbers of cloud-based businesses and websites was caused by a technician inputting an incorrect command, Amazon says.
The Amazon Web Services (AWS) facility in Northern Virginia suffered outages on February 28, causing a number of websites that rely on its services to slow down or experience intermittent outages.
AWS is one of the biggest providers of cloud-computing services for websites and businesses all over the world.
It offers database storage, on-demand content delivery, and other essential infrastructure services.
Prominent businesses like Xero, Slack and Square were affected by the issue, as were websites like Atlassian’s Bitbucket, GitHub, and Kickstarter, according to VentureBeat. Hundreds of other websites were also affected.
The services are now back online. AWS released a statement today addressing the cause of the outage, blaming it on an incorrect command that was entered by a technician whilst debugging the company’s Simple Storage Service (S3).
“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” AWS said.
“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
This caused a chain reaction: the removed servers supported two other S3 subsystems, both of which then had to be fully restarted.
“While these subsystems were being restarted, S3 was unable to service requests,” AWS wrote.
Companies should implement “control” systems
The company has said it will be making “several changes” such as system safeguards as a result of the incident.
This course of action is essential, says Michael McKinnon, cyber security expert at Sense of Security.
“What AWS is doing is implementing a control, which [is a mechanism] companies can use to prevent a certain action from happening,” McKinnon told SmartCompany.
“There was really no protection in place beforehand to stop someone at AWS from taking that certain action [leading to the outage], so now they’re implementing a technical control to prevent it.”
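The kind of technical control McKinnon describes can be as simple as a capacity floor on removal commands. A minimal sketch in Python illustrates the idea; the function name, thresholds and parameters here are illustrative assumptions, not AWS’s actual tooling:

```python
def safe_removal_count(fleet_size, requested, min_fleet=100, max_fraction=0.1):
    """Return how many servers a single command may actually remove.

    Two illustrative guardrails (assumed, not AWS's real limits):
    - cap any one removal at a fraction of the current fleet, and
    - refuse removals that would take the fleet below a minimum floor.
    """
    # Cap the removal at max_fraction of the fleet, whatever was requested.
    allowed = min(requested, int(fleet_size * max_fraction))
    # Refuse outright if even the capped removal breaches the floor.
    if fleet_size - allowed < min_fleet:
        raise ValueError("removal would breach minimum capacity floor")
    return allowed


# A mistyped input asking for 500 servers is capped at 10% of the fleet.
print(safe_removal_count(1000, 500))  # 100
```

With a control like this in place, a typo in the command’s input degrades a small slice of capacity rather than taking out the subsystem entirely, which is the "prevent a certain action" behaviour McKinnon refers to.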
McKinnon believes the nature of AWS’s infrastructure resulted in the significant failure, with the system being built and maintained by the company itself.
“Naturally there will be an occasion like this, an unintended consequence of the system discovered by human error,” he says.
Unfortunately for SMEs, McKinnon says there’s not much to be done about human error, even with “all the systems in place”.
“Human error is human error, and businesses will suffer. You can have all the systems in place, but at some point something will rely on human decisions to be made,” he says.
“You can’t know all the exposure points, even if you do the best brainstorming possible you still won’t identify all of them.”
“Good businesses build systems around people”
In situations like system outages, McKinnon says it’s best to deal with the fallout “swiftly and efficiently” and implement any measures possible to stop it from happening again.
As for what those measures are, “good businesses build systems around people”, McKinnon says.
“Look at these things from the perspective of what the business could have done. Did you provide enough training?” he says.
“It’s best to always build your system to cater to human error.”
Finally, when communicating outages or errors to customers, McKinnon believes it’s important for businesses to differentiate the scenario from a data breach or hack.
“It’s important to convey it as an outage, not a breach, as it’s a common misconception. ‘Oh no, they’ve been hacked’ is a default view, so it’s good to reassure people,” he says.
“Paint human error in a way where we can deal with it. It is and always will be a fundamental tech issue, and it’s part of who we are.”
SmartCompany contacted Amazon Web Services Australia but it had no further comment on the outage.