According to the company, the outage was not driven by any memory problem in the network. Rather, it was triggered by the addition of new servers to the Amazon Kinesis real-time data processing service.
Adding new capacity caused all servers in the Kinesis system to exceed the maximum number of 'threads' allowed by an operating system (OS) configuration.
Each server in the Kinesis front-end fleet creates threads to communicate with every other server in the fleet, so adding servers pushed every machine over the cap at once — and when they could no longer create those threads, the whole lot went tits up.
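The failure mode above can be illustrated with a little arithmetic: in a thread-per-peer design, each server's thread count grows linearly with fleet size, so a fixed per-process OS cap is breached by every server simultaneously once the fleet grows past it. The fleet sizes and the limit below are made-up figures for illustration, not AWS's actual numbers:

```python
# Hypothetical illustration of a thread-per-peer front-end fleet.
# Each server holds one thread per other peer, so the per-server
# thread count is (fleet size - 1) and grows with every server added.

def peer_threads(fleet_size: int) -> int:
    """Threads each front-end server needs: one per other peer."""
    return fleet_size - 1

OS_THREAD_LIMIT = 4096  # assumed per-process cap; not AWS's real figure

for fleet in (4000, 4096, 4100):
    threads = peer_threads(fleet)
    status = "OK" if threads <= OS_THREAD_LIMIT else "EXCEEDS LIMIT"
    print(f"fleet={fleet}: {threads} threads per server -> {status}")
```

Note that the breach is fleet-wide: because every server tracks every peer, crossing the cap takes down all of them together rather than one unlucky machine.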
This resulted in a series of other problems that eventually took down thousands of websites and services, including those from some big companies such as Adobe, Flickr, Roku, Twilio and Autodesk.
AWS's own services were also affected, including ACM, Amplify Console, AppStream2, AppSync, Athena, Batch, CodeArtifact, CodeGuru Profiler, CodeGuru Reviewer, CloudFormation, CloudMap, CloudTrail, Connect, Comprehend, DynamoDB, Elastic Beanstalk, EventBridge, GuardDuty, IoT Services, Lambda, LEX, Macie, Managed Blockchain, Marketplace, MediaLive, MediaConvert, Personalize, RDS Performance Insights, Rekognition, SageMaker and Workspaces.
The multi-hour outage affected the US-East-1 region, according to the company.
Apparently it was all fixed by turning it off and turning it on again. Unfortunately, since that meant restarting the entire Kinesis front-end fleet, it took a while.
Amazon has said sorry for the outage and promised to apply lessons learned to further improve the reliability of its services.
In the short term, the company plans to move to servers with more powerful CPUs and more memory, allowing it to reduce both the number of servers and the thread count across the fleet.
It is also carrying out tests to increase the thread count limits in its OS configuration, a measure AWS believes will give an additional safety margin by allowing more threads per server.
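On Linux the relevant ceilings can be inspected directly. The sketch below is a minimal, Linux-oriented illustration of the usual suspects — the per-user process/thread limit (what `ulimit -u` reports) and the system-wide `kernel.threads-max` cap; AWS has not disclosed which specific limit Kinesis actually hit:

```python
# Minimal sketch: inspect the OS limits that cap thread creation on Linux.
# Which specific limit the Kinesis fleet hit has not been disclosed.
import resource

# Per-user cap on processes/threads (the same figure `ulimit -u` shows).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"per-user process/thread limit: soft={soft}, hard={hard}")

# System-wide thread cap; raised via `sysctl -w kernel.threads-max=<value>`.
try:
    with open("/proc/sys/kernel/threads-max") as f:
        print(f"system-wide thread cap: {f.read().strip()}")
except FileNotFoundError:
    print("/proc/sys/kernel/threads-max not available (non-Linux system)")
```

Raising such limits buys headroom, but as the incident shows, it only delays the point at which a linearly growing thread count meets a fixed cap.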
The company also plans to introduce lots of other changes to "radically improve the cold-start time for the front-end fleet".