Google Cloud suffers 14-hour outage caused by ‘large backlog of queued mutations’
Google Cloud experienced a 14-hour outage last week caused by a failure to scale up, according to an internal investigation made by the cloud firm.
The outage occurred on Thursday 26th March 2020, and resulted in the shutdown of Google’s cloud services in several regions, including Dataflow, Big Query, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine and Cloud Console.
At 16:14 US/Pacific time, Cloud IAM experienced elevated error rates, which caused disruption across many services for a duration of 3.5 hours, and stale data (resulting in continued disruption in administrative operations for a subset of services) for a duration of 14 hours.
The company revealed that what triggered the incident was a bulk update of group memberships that expanded to an unexpectedly high number of modified permissions, which generated a large backlog of queued mutations to be applied in real-time.
The processing of the backlog was degraded by a latent issue with the cache servers, which led to them running out of memory; this, in turn, resulted in requests to IAM timing out.
Time is precious, but news has no time. Sign up today to receive daily free updates in your email box from the Data Economy Newsroom.
The problem was temporarily exacerbated in various regions by emergency rollouts performed to mitigate the high memory usage, according to the company.
“Google’s commitment to user privacy and data security means that IAM is a common dependency across many GCP services,” said the company on its website.
“To our Cloud customers, whose business was impacted during this disruption, we sincerely apologise – this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.
“We have conducted an internal investigation and are taking steps to improve the resiliency of our service.”
Read the latest from the Data Economy Newsroom: