How Google <really> runs production systems

#articles #cloud

If there is one book that engineers working in operations have read (or at least started to read) in the past 8 years, chances are that it is Site Reliability Engineering, with the subtitle HOW GOOGLE RUNS PRODUCTION SYSTEMS… except that it doesn’t, as we will see below.

« During the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer using an internal tool, there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer’s GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period. »

The customer in this case was UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members.

On 2 May 2024, UniSuper had the nasty surprise of seeing its multi-geography cloud infrastructure… disappear. It took almost 2 weeks (until 15 May) to be fully back online, and that mostly thanks to offline backups (see this article from Ars Technica).

Read the official account from Google here.
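
The failure mode in Google's statement, a blank parameter quietly falling back to a destructive default, is easy to reproduce in any provisioning tool. The sketch below is purely illustrative: `PrivateCloudRequest`, `term_days`, `DEFAULT_TERM_DAYS` and both resolver functions are hypothetical names, and nothing here reflects how Google's internal tool is actually written.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical default; the value is made up for illustration.
DEFAULT_TERM_DAYS = 365


@dataclass
class PrivateCloudRequest:
    """Hypothetical provisioning request, not a real GCVE API object."""
    name: str
    term_days: Optional[int] = None  # None models a field the operator left blank


def resolve_term_lenient(req: PrivateCloudRequest) -> int:
    """Risky pattern: a blank field silently becomes a fixed term."""
    if req.term_days is None:
        return DEFAULT_TERM_DAYS
    return req.term_days


def resolve_term_strict(req: PrivateCloudRequest) -> int:
    """Safer pattern: a blank field is rejected instead of defaulted."""
    if req.term_days is None:
        raise ValueError(f"term_days must be set explicitly for {req.name!r}")
    return req.term_days
```

With the lenient resolver, `PrivateCloudRequest(name="demo")` gets a fixed term without anyone having chosen one; with the strict resolver, the same request fails immediately, at deployment time rather than at the end of the term.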