Amazon is back with an apology and an explanation for a high-profile malfunction that caused websites across the Internet to grind to a halt for hours on Tuesday.
The online retail giant, which runs a popular cloud computing platform for sites such as Airbnb, Netflix, Reddit and Quora, is blaming the outage on a simple – and perhaps somewhat amusing – employee mistake.
A team member was doing a bit of maintenance on Amazon Web Services Tuesday, trying to speed up the billing system, when he or she tapped in the wrong codes – and inadvertently took more servers offline than the procedure called for, Amazon said in a statement Thursday. With a few mistaken keystrokes, the employee wound up knocking out systems that other AWS systems depend on to work properly.
The cascading failure meant that many websites could no longer make changes to the information stored on Amazon’s cloud platform. For everyday users, that meant being unable to load pages, transfer files or take other actions on some of the sites they regularly use.
“In this instance, the tool used allowed too much capacity to be removed too quickly,” Amazon said. “We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.”
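For readers curious what such a safeguard might look like in practice, here is a minimal Python sketch. It is not Amazon's tool: the names `Subsystem`, `remove_capacity`, `min_required` and `max_per_step` are hypothetical, chosen only to illustrate the two ideas in the statement, namely refusing any removal that would drop a subsystem below its minimum capacity and taking servers offline in small batches rather than all at once.

```python
from dataclasses import dataclass


class CapacityRemovalError(Exception):
    """Raised when a removal request would violate the capacity floor."""


@dataclass
class Subsystem:
    # All fields are illustrative assumptions, not AWS concepts.
    name: str
    active_servers: int
    min_required: int  # floor below which the subsystem cannot operate


def remove_capacity(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> list[int]:
    """Remove `requested` servers in small batches and return the batch sizes.

    The whole request is rejected up front if it would take the subsystem
    below its minimum required capacity, so a mistyped number fails
    before any server goes offline.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    if subsystem.active_servers - requested < subsystem.min_required:
        raise CapacityRemovalError(
            f"removing {requested} servers would leave {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )

    batches = []
    remaining = requested
    while remaining > 0:
        # Never remove more than max_per_step servers in a single step.
        batch = min(max_per_step, remaining)
        subsystem.active_servers -= batch
        remaining -= batch
        batches.append(batch)
    return batches
```

In this sketch, a subsystem with 10 active servers and a floor of 6 would accept a request to remove 3 servers, carried out as a batch of 2 followed by a batch of 1, and would reject a follow-up request to remove 2 more. A real tool would presumably also pause and check system health between batches, which is omitted here.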
Amazon said it was sorry for the outage’s effect on its customers and vowed to learn from the incident. One immediate next step? The company said it will subdivide its servers even more than before “to reduce blast radius and improve recovery,” should something like this happen again.