Amazon's Oracle purge caused Prime Day outage

24 October 2018


Banishing an Oracle is tricky

Amazon’s move away from Oracle software may have caused delivery problems on its biggest sales day, Prime Day.

According to an internal report, the problem was in large part due to Amazon's migration from Oracle's database to its own technology.

Oracle is expected to cite the incident as proof that Amazon cannot move entirely off Oracle's database by 2020, and that Oracle's database is more efficient in some respects than Amazon's rival software.

The 25-page report shows that Amazon struggled to find the cause of the Prime Day issue because of a feature it lost after the database was moved over. Amazon also had no contingency plan in case of an error in its newly installed database, called Aurora PostgreSQL.

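According to the report, "Oracle and Aurora PostgreSQL are two different [database] technologies" that handle "savepoints" differently. A savepoint is a marker inside a database transaction that lets part of the transaction be rolled back without abandoning the whole thing.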

On Prime Day, an excessive number of savepoints was created, and Amazon's Aurora software wasn't able to handle the pressure, slowing down the overall database performance, the report said.
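
For readers unfamiliar with savepoints, here is a minimal sketch of the general pattern. It is only an illustration: it uses Python's built-in SQLite driver as a stand-in, not Oracle or Aurora PostgreSQL, and the table, the function name `insert_with_savepoints` and the one-savepoint-per-row loop are assumptions made for the example. The report does not describe Amazon's actual code.

```python
import sqlite3


def insert_with_savepoints(labels):
    """Insert labels in ONE transaction, creating one savepoint per row.

    Hypothetical example: returns (stored_labels, savepoints_created).
    """
    # isolation_level=None lets us issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE packages (id INTEGER PRIMARY KEY, label TEXT NOT NULL)"
    )
    savepoints_created = 0
    conn.execute("BEGIN")
    for i, label in enumerate(labels):
        name = f"sp_{i}"
        # Mark a point we can roll back to if this row fails.
        conn.execute(f"SAVEPOINT {name}")
        savepoints_created += 1
        try:
            conn.execute("INSERT INTO packages (label) VALUES (?)", (label,))
        except sqlite3.IntegrityError:
            # Undo only this row; earlier rows in the transaction survive.
            conn.execute(f"ROLLBACK TO SAVEPOINT {name}")
        # Discard the savepoint marker either way.
        conn.execute(f"RELEASE SAVEPOINT {name}")
    conn.execute("COMMIT")
    stored = [row[0] for row in conn.execute(
        "SELECT label FROM packages ORDER BY id")]
    conn.close()
    return stored, savepoints_created


if __name__ == "__main__":
    # The None label violates NOT NULL and is rolled back on its own.
    print(insert_with_savepoints(["parcel-a", None, "parcel-b"]))
```

The point of the sketch is that the number of savepoints grows with the number of operations in a transaction, so a busy day can create a great many of them. How a given database copes with that volume is an implementation detail, which is where the report says Oracle and Aurora PostgreSQL differ.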

An Amazon spokesperson played down the issue in an emailed statement and said there was no outage, even though the internal document states that the database "degradation resulted in lags and complete outages".

"It is important to point out that there was never an outage at the facility, and the issue only resulted in delaying shipping of about one per cent of packages for a short period of time", the spokesperson said. "This issue was quickly diagnosed and resolved."

In a section titled "Lessons Learned", Amazon engineers wrote that Oracle's software would have handled the issue more efficiently. The report also says SQL statement data did not exist for analysis in PostgreSQL and that having access to that data "would have helped pinpoint" the root cause of the problem.

However, the outage may have been less severe had Amazon been more prepared. In one part of the document, the company said it "took a long time to mitigate" the problem because of a "lack of a reaction plan when the underlying PostgreSQL DB experiences performance issues".

The document also said a "well-established reaction plan or runbook" could have helped "mitigate the impact sooner".

That seems to suggest Amazon moved off its legacy Oracle system without testing the exact load pattern that occurred on Prime Day, and was caught by surprise.

Patrick Moorhead, principal analyst at Moor Insights & Strategy, said the incident shows how hard it is for older applications, like those used in Amazon's warehouses, to move off Oracle, which has spent decades working with the world's largest enterprises.

"AWS Aurora is designed for forward-looking applications and Oracle for more legacy applications", he said.

Last modified on 24 October 2018