March 11, 2022

Published by Spotify Engineering

On March 8, we experienced a global outage caused by issues in a cloud-hosted service discovery system used at Spotify. We were made aware of issues with login at 18:12 UTC / 13:12 ET and started implementing fixes to critical systems at 18:39 UTC / 13:39 ET. This outage affected our users and we are sorry for the trouble it may have caused. Our service has now fully recovered.

What happened?

The Spotify backend consists of multiple microservices that communicate with each other. For microservices to be able to find each other, we use multiple service discovery technologies. Most of our services use a DNS-based service discovery system; however, some of our services use an xDS-based traffic control plane and discovery system called Traffic Director.

On March 8, Google Cloud Traffic Director experienced an outage. This, in combination with a bug in a client (gRPC) library, caused the Spotify outage that affected many of our users: if you were logged out of a Spotify app, you were unable to log back in.

As soon as the issue was understood, we rolled out configuration changes to revert our affected systems back to our DNS-based service discovery and saw them recover gradually. See the timeline below for more details.
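
To illustrate what a configuration-level switch between the two discovery mechanisms can look like, here is a minimal sketch in Python. It is not Spotify's actual configuration: the `DiscoveryConfig` class, its fields, and the example hostname are all hypothetical. The only detail taken from gRPC itself is that the scheme of the target URI (`xds` or `dns`) selects which name resolver a client uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscoveryConfig:
    """Hypothetical per-service discovery settings (not Spotify's real config)."""
    service_name: str  # logical name known to the xDS control plane
    dns_host: str      # hostname resolvable through DNS-based discovery
    port: int
    use_xds: bool      # set to False to fall back to DNS-based discovery


def grpc_target(cfg: DiscoveryConfig) -> str:
    """Build a gRPC target URI for the configured discovery mechanism."""
    if cfg.use_xds:
        # The "xds" scheme asks gRPC to resolve through an xDS control plane.
        return f"xds:///{cfg.service_name}"
    # The "dns" scheme asks gRPC to resolve the host through DNS.
    return f"dns:///{cfg.dns_host}:{cfg.port}"
```

With a layout like this, reverting a service to DNS amounts to flipping `use_xds` to `False` in its configuration and rolling that change out, without touching the calling code.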

Timeline

18:12 UTC / 13:12 ET – Reports of users being logged out of the various client apps start to surface.

18:39 UTC / 13:39 ET – Remediations are put in place to restore the affected systems.

20:35 UTC / 15:35 ET – Incident fully mitigated at Spotify.

Where do we go from here?

In the short term:

  • We’re working with Google Cloud to better understand how issues with Traffic Director resulted in a large outage affecting Spotify’s users.
  • We will add additional monitoring and alerting to ensure that we can catch similar service discovery related problems earlier.
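
As a rough illustration of the kind of alerting mentioned above, the sketch below tracks the failure rate of recent requests (for example, login attempts or discovery lookups) over a sliding window. It is an assumption-laden example, not a description of Spotify's monitoring: the `FailureRateAlert` class, the window size, and the thresholds are all hypothetical.

```python
from collections import deque


class FailureRateAlert:
    """Tracks the most recent outcomes and flags a high failure rate."""

    def __init__(self, window: int = 100, threshold: float = 0.2,
                 min_samples: int = 20):
        # True = failure, False = success; old entries drop off automatically.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Require a minimum sample count so a few early failures do not alert.
        return (len(self.outcomes) >= self.min_samples
                and self.failure_rate() >= self.threshold)
```

A caller would invoke `record()` after each request and check `should_alert()` periodically. A production setup would more likely export these counts to an existing metrics system and define the alert there.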

We will continue to invest in resiliency by identifying and implementing additional safety nets in terms of monitoring, automatic error detection, and self-recovery.
