
Inside the longest Atlassian outage of all time
👋 Hi, this is Gergely with a bonus, free issue of The Pragmatic Engineer Newsletter. If you're not a full subscriber yet, you missed the deep-dive on dealing with a low-quality engineering culture at Big Tech, the one on remote compensation strategies, and a few others. Subscribe to get this newsletter every week 👇
We are in the middle of the longest outage Atlassian has ever had. Close to 400 companies and anywhere from 50,000 to 400,000 users have had no access to JIRA, Confluence, OpsGenie, Statuspage, and other Atlassian Cloud services. The outage is in its 9th day, having started on Monday, 4th of April. Atlassian estimates many impacted customers will not be able to access their services for another two weeks. At the time of writing, 45% of affected companies have seen their access restored.
For most of this outage, Atlassian went silent across its main communication channels such as Twitter and the community forums. It took until Day 9 for executives at the company to acknowledge the outage.
While the company stayed silent, news of the outage started trending in niche communities like Hacker News and Reddit. In these forums, people tried to guess the cause of the outage, wondered why there was complete radio silence, and many took to mocking the company for the way it was handling the situation.
Atlassian did no better communicating with customers during this time. Impacted companies received templated emails and no answers to their questions. After I tweeted about the outage, several Atlassian customers turned to me to vent about the situation, hoping I could offer more details. Customers described how the company's statements made it seem they were receiving support, which they, in fact, were not. Several customers hoped I might help get the attention of the company, which had not given them any details beyond telling them to wait weeks until their data is restored.
Eventually, I managed to get the attention many impacted Atlassian customers hoped for. Eight days into the outage, Atlassian issued its first statement from an executive. The statement came from Atlassian CTO Sri Viswanath, and was also sent as a reply to one of my tweets sharing a customer complaint.
In this issue, we cover:
- What happened? A timeline of events.
- The cause of the outage. What we know so far.
- What Atlassian customers are saying. How did they experience the outage? What business impact did it have on them? Will they remain Atlassian customers?
- The impact of the outage on Atlassian's business. The outage comes at a critical time, as Atlassian begins to retire its Server product – which was immune to this outage – in favor of onboarding customers to its Cloud offering, which was advertised as the more reliable option. Will customers trust Atlassian Cloud after this long incident? Which competitors benefitted from Atlassian's fumbling, and why?
- Learnings from this outage. What can engineering teams take from this incident, both as Atlassian customers and as teams offering cloud products to customers?
- My take. I have been following this outage for a while and offer my summary.
What happened
Day 1 – Monday, 4th of April
JIRA, Confluence, OpsGenie, and other Atlassian sites stop working at some companies.
Day 2 – Tuesday, 5th of April
Atlassian notices the incident and begins tracking it on their status page. They post several updates during the day, confirming they are working on a fix. They end the day by saying "We will provide more detail as we progress through resolution."
Some customers take their frustration to Twitter. This thread from an impacted customer quickly drew responses from other affected customers:
Day 3 – Wednesday, 6th April
Atlassian posts the same update every few hours, without sharing any relevant information. The update reads:
"We are continuing work in the verification stage on a subset of instances. Once re-enabled, support will update accounts via opened incident tickets. Restoration of customer sites remains our first priority and we are coordinating with teams globally to ensure that work continues 24/7 until all instances are restored."
Customers get no direct communication. Some take to social media to complain.
The post "The vast majority of Atlassian cloud services have been down for a subset of users for over 24 hours" is trending on the sysadmin subreddit. A Reddit user comments:
"Big Seattle tech company here. Can't say who I work for, but I guarantee you have heard of us.
Our Atlassian products have been down since 0200 PST on the 5th – in other words, for about 29 hours now.
I've never seen a product outage last this long. The latest update says it may take several days to restore our stuff."
Day 4 – Thursday, 7th April
The Atlassian Twitter account acknowledges the issue and offers some light details. These tweets would be the last communication from this official account before it goes silent for five days straight.
The Atlassian status page posts the exact same update every few hours:
"We continue to work on partial restoration for a cohort of customers. Our plan to take a controlled and hands-on approach as we gather feedback from customers to ensure the integrity of this first round of restorations remains the same as our last update."
Days 5-7 – Friday, 8th April – Sunday, 10th April
No real updates. Atlassian posts the same message to their status page over and over again…
"The team is continuing the restoration process through the weekend and working towards recovery. We are continually improving the process based on customer feedback and applying these learnings as we bring more customers online."
On Sunday, 10th April, I post about the outage on Twitter. Unhappy, impacted Atlassian customers start messaging me almost immediately with complaints.
News of Atlassian's outage also trends on Hacker News and on Reddit over the weekend. On Reddit, the top-voted comment questions whether people will keep using Atlassian if forced to move to the cloud:
"Well. That is a huge red flag in the decision-making process of whether to use Atlassian's products, given they basically force you to move to their cloud within the next two years."
Day 8 – Monday, 11th April
No real updates from Atlassian beyond copy-pasting the same message.
News of the outage is trending on Hacker News. The top-voted comment is from someone claiming to be an ex-Atlassian employee, who says engineering practices inside the company used to be subpar:
"This does not surprise me at all. (…) at Atlassian, their incident process and monitoring is a joke. More than half of the incidents are customer-detected.
Most engineering practices at Atlassian focus only on the happy path, almost no one considers what can go wrong. Every system is so interconnected, and there are more SPOFs than employees."
Day 9 – Tuesday, 12th April
Atlassian sends mass communication to customers. Several impacted customers receive the same message:
"We were unable to confirm a more firm ETA until now due to the complexity of the rebuild process for your site. While we are starting to bring some customers back online, we estimate the rebuilding effort to last for up to 2 more weeks."
Atlassian also updates its status page, stating that 35% of impacted customers have been restored.
For the first time since the incident started, Atlassian issues a press release. They say hundreds of engineers are working on the issue. In their statement they also say:
"We are communicating directly with every customer."
A customer messages me saying this last statement is not true, as their company is receiving only canned responses and no specifics, despite their questions. I reply to the company, highlighting customers' reports that they have not received any direct communication, despite being paying customers:
In response, an Atlassian executive acknowledges the problems for the first time. Replying to me, Atlassian CTO Sri Viswanath shares a statement the company publishes at that point, one which begins with the apology impacted customers had been waiting for:
"Let me start by saying that this incident and our response time are not up to our standard, and I apologize on behalf of Atlassian."
Head of Engineering Stephen Deasy publishes a Q&A on the active incident in the Atlassian Community.
The cause of the outage
For the past week, everyone had been guessing about the cause of the outage. The most popular theory, coming from several sources like The Stack, was that the legacy Insight plugin was being retired: a script was supposed to delete all customer data for this plugin, but accidentally deleted all customer data for anybody using the plugin. Until Day 9, Atlassian would neither confirm nor deny these speculations.
On Day 9, Atlassian confirmed in their official update that this was, indeed, the main reason. From this report:
"The script we used provided both the "mark for deletion" capability used in normal day-to-day operations (where recoverability is desirable), and the "permanently delete" capability that is required to permanently remove data when required for compliance reasons. The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted."
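The details above point at two missing guardrails: the script did not force the operator to state which mode they intended, and it did not sanity-check the ID list before acting. Below is a minimal sketch of what such guardrails could look like. This is not Atlassian's tooling – every function, flag, and name here is hypothetical – but it illustrates separating "mark for deletion" from "permanently delete", defaulting to a dry run, and refusing to run when the ID count does not match what the operator expected.

```python
# Minimal sketch of a safer deletion script, inspired by the failure mode above.
# NOT Atlassian's tooling: all names, flags, and functions are hypothetical.
import argparse
import sys

def load_ids(path: str) -> list[str]:
    """Read one tenant/site ID per line, ignoring blanks."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def mark_for_deletion(site_id: str) -> None:
    # Soft delete: recoverable for a retention window (assumed behavior).
    print(f"[mark] {site_id} flagged for deletion (recoverable)")

def permanently_delete(site_id: str) -> None:
    # Hard delete: irreversible; only for compliance erasure requests.
    print(f"[purge] {site_id} permanently deleted")

def main() -> None:
    parser = argparse.ArgumentParser(description="Tenant deletion tool (sketch)")
    parser.add_argument("--mode", choices=["mark", "purge"], required=True,
                        help="'mark' is recoverable; 'purge' is irreversible")
    parser.add_argument("--ids-file", required=True,
                        help="File with one site ID per line")
    parser.add_argument("--expected-count", type=int, required=True,
                        help="Operator must state how many IDs they expect to touch")
    parser.add_argument("--execute", action="store_true",
                        help="Without this flag the script only dry-runs")
    args = parser.parse_args()

    ids = load_ids(args.ids_file)
    if len(ids) != args.expected_count:
        sys.exit(f"Refusing to run: file has {len(ids)} IDs, expected {args.expected_count}")

    action = mark_for_deletion if args.mode == "mark" else permanently_delete
    for site_id in ids:
        if args.execute:
            action(site_id)
        else:
            print(f"[dry-run] would {args.mode} {site_id}")

if __name__ == "__main__":
    main()
```

With a design like this, running the tool in the wrong mode or against the wrong ID list requires getting several explicit inputs wrong at once, rather than one.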
So why is the restore taking weeks? On their "How Atlassian Does Resilience" page, Atlassian states they can restore deleted data in a matter of hours:
"Atlassian tests backups for restoration on a quarterly basis, with any issues identified from these tests raised as Jira tickets to ensure that any issues are tracked until remedied."
There is a problem, though:
- Atlassian can, indeed, restore all data to a checkpoint in a matter of hours.
- However, if they did this, the impacted ~400 companies would get back all their data, but everybody else would lose all data committed since that checkpoint.
- So every impacted customer's data has to be selectively restored. Atlassian has no tooling to do that in bulk.
They also confirm this is the root of the problem in their update:
"What we have not (yet) automated is restoring a large subset of customers into our existing (and currently in use) environment without affecting any of our other customers."
For the first several days of the outage, they restored customer data with manual steps. They are now automating this process. However, even with the automation, restoration is limited and can only be done in small batches:
"At the moment, we are restoring customers in batches of up to 60 tenants at a time. End-to-end, it takes between 4 and 5 elapsed days to hand a site back to a customer. Our teams have now developed the capability to run multiple batches in parallel, which has helped to reduce our total restore time."
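To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It only uses the figures from the update above, plus one assumption of mine – that each parallel "lane" works through its share of batches back-to-back – so treat it as a rough sanity check, not Atlassian's actual schedule.

```python
# Back-of-the-envelope restore timeline using the figures Atlassian shared.
# Assumption (not confirmed by Atlassian): batches start back-to-back, and
# each parallel "lane" works through its share of batches independently.
import math

affected_sites = 400        # roughly 400 customer sites were deleted
batch_size = 60             # "batches of up to 60 tenants at a time"
days_per_batch = 5          # "between 4 and 5 elapsed days" - take the upper bound

total_batches = math.ceil(affected_sites / batch_size)   # 7 batches

for parallel_lanes in (1, 2, 3):
    batches_per_lane = math.ceil(total_batches / parallel_lanes)
    print(f"{parallel_lanes} parallel lane(s): ~{batches_per_lane * days_per_batch} days")

# Output:
# 1 parallel lane(s): ~35 days
# 2 parallel lane(s): ~20 days
# 3 parallel lane(s): ~15 days
```

Which is roughly consistent with the "up to two more weeks" estimate once several batches run in parallel – and shows why the first, manual, sequential attempts looked like a month-long effort.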
What Atlassian customers are saying
The biggest complaint from all customers has been the poor communication from Atlassian. These companies lost all access to key systems, were paying customers, and yet they could not talk to a human. Up to Day 7, many of them received no communication at all, and even on Day 9, many had only received the bulk email about the two weeks to restore that every impacted customer was sent.
Customers shared:
"We weren't impressed with their comms either, they definitely botched it." – engineering manager at a 2,000-person company which was impacted.
"Atlassian communication was terrible. Atlassian was giving the same lame excuses to our internal support team as what was circulated online." – software engineer at a 1,000-person company which was impacted.
The impact of the outage has been significant for those relying on OpsGenie. OpsGenie is the "PagerDuty for Atlassian" incident management system. Every company impacted by this outage got locked out of this tool.
While JIRA and Confluence not working were problems many companies were able to work around, OpsGenie is a critical piece of infrastructure for all its customers. Three out of three customers I talked to have onboarded to competitor PagerDuty, so they can keep their systems running safely.
The impact across customers has been significant. Many companies did not have backups of documents on Confluence. None of those I talked with had JIRA backups. Company planning was delayed and projects had to be re-planned. The impact goes well beyond just engineering, as many companies used JIRA and Confluence to collaborate with other business functions.
I asked customers whether they would move off Atlassian because of the outage. Most of them said they won't leave the Atlassian stack, as long as they don't lose data. This is because moving is complicated, and they don't see how switching would mitigate the risk of another cloud provider going down. However, all customers said they will invest in having a backup plan in case a SaaS they rely on goes down.
The customers who did confirm they are moving are those onboarding to PagerDuty. They see PagerDuty as a more reliable offering, and were all surprised that Atlassian did not prioritize restoring OpsGenie ahead of other services.
What compensation can customers expect? Customers have not received details on compensation for the outage. Atlassian compensates using credits, which are discounts on pricing, issued based on the uptime their service had over the past month. Most customers impacted by the outage have 73% uptime for the past 30 days as of today, and this number keeps dropping with every passing day. Atlassian's credit compensation works like this:
- 99 – 99.9% uptime: 10% discount
- 95 – 99% uptime: 25% discount
- Below 95% uptime: 50% discount
As it stands, customers are eligible for a 50% discount. Call me surprised if Atlassian does not offer something far more substantial, given they are at zero nines of availability for these customers.
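As a quick sanity check of the 73% figure and the tier it lands in, here is a small calculation. The tier thresholds are the ones listed above; the eight days of downtime is my assumption for illustration.

```python
# Quick check of the numbers above: uptime over a rolling 30-day window and
# the credit tier it falls into (tiers as listed above).
def credit_tier(uptime_pct: float) -> str:
    """Map a 30-day uptime percentage to the credit tiers listed above."""
    if uptime_pct >= 99.9:
        return "no credit"
    if uptime_pct >= 99.0:
        return "10% discount"
    if uptime_pct >= 95.0:
        return "25% discount"
    return "50% discount"

window_days = 30
downtime_days = 8          # assumed: roughly Day 2 through Day 9 of the outage
uptime = 100 * (window_days - downtime_days) / window_days

print(f"Uptime: {uptime:.1f}% -> {credit_tier(uptime)}")
# Uptime: 73.3% -> 50% discount, matching the ~73% figure quoted above.
```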
The impact of the outage on Atlassian's business
Atlassian claims the impacted customers were "only" 0.18% of its customer base, at 400 companies. They did not share the number of seats impacted. I estimate seats to be between 50,000 and 400,000, based on the fact that I have not talked with any impacted customer smaller than 250 employees, most of them with more seats than that.
The biggest impact of this outage is not lost revenue: it is reputational damage, and the harm it can do to longer-term Cloud sales efforts. The scary thing about how the outage played out is that it could have been any company losing all Atlassian Cloud access for weeks. I'm sure the company will take steps to mitigate this happening again. However, trust is easy to lose and hard to earn back.
Unfortunately for the company, Atlassian has a history of repeat incidents of the worst kind. In 2015, their HipChat product suffered a security breach, which drove away customers like Coinbase at the time. Just two years later, in 2017, HipChat suffered yet another security breach. This second, repeat offense was the reason Uber suspended their HipChat usage almost immediately.
The irony of the outage is that Atlassian has been pushing customers to its Cloud offering, highlighting reliability as a selling point. They have discontinued selling Server licenses and will end support for the product in February 2024.
This outage, combined with the history of repeat offenses with a specific type of incident, will raise questions for enterprises currently on the Server product about how much they can trust Atlassian in the Cloud. If forced to move, will they choose another vendor instead?
Another scenario is that Atlassian could be forced to reverse course on ending Server license sales and extend support for the product by another few years. This approach would buy time to reassure customers that Atlassian can operate its Cloud product without massive downtimes like this one. I see this option as one that may need to be on the table if Atlassian does not want to lose the customers who hesitate to onboard following this outage.
Atlassian's competitors are sure to benefit from this fumble, even if they are not immune to similar problems. However, unlike Atlassian, none of them have yet had an incident where all communications were shut down for close to a week while customers scrambled to get hold of their vendor. One of these competitors, Linear, has already offered to help customers waiting on Atlassian to restore, and not charge them for the first year:
Learnings from this outage
There are many learnings from this outage that any engineering team can take. Don't wait for an outage like this to hit your teams: prepare ahead of time instead.
Incident Handling:
- Have a runbook for disaster recovery and black swan events. Expect the unexpected, and plan for how you will respond, assess, and communicate.
- Follow your own disaster recovery runbook. Atlassian published their disaster recovery runbook for Confluence, and yet did not follow it. Their runbook states that every runbook includes communication and escalation guidelines. Either the company did not have communication guidelines, or they did not follow them. A bad look, either way.
- Communicate immediately and transparently. Atlassian did none of this for 9 days. This lack of communication eroded a huge amount of trust, not just with impacted customers, but with anyone aware of the outage. While Atlassian may have assumed it is safest to say nothing, this is the worst approach to take. Consider how transparently GitLab or Cloudflare communicate during outages – both of them publicly traded companies, just like Atlassian.
- Talk your customer's language. Atlassian's status updates were vague and lacked all technical details. However, their customers are not business people. They are Heads of IT and CTOs who made the decision to buy Atlassian products… and could not answer what the problem with the system was. By dumbing down the messaging, Atlassian put its biggest sponsors – the technical people! – in an impossible position when defending the company. If the company sees customer churn, I very much attribute it to this error.
- Have an executive take public ownership of the outage. It took until Day 9 for a C-level executive to acknowledge this outage. At companies which developers trust, this happens almost immediately. Executives not issuing a statement signals the issue is too small for them to care about. I wrote before about how, at Amazon, executives joining outage calls is common.
- Reach out directly to customers, and talk with them. Customers did not feel heard during this outage and had no human talk to them. They were left with automated messages. During a black swan event, mobilize people to talk directly with customers – you can do this without impacting the mitigation effort.
- Avoid status updates that say nothing. The vast majority of status updates on the incident page were copy-pastes of the same text. Atlassian clearly did this to provide updates every few hours… but these weren't updates. They added to the feeling that the company did not have the outage under control.
- Avoid radio silence. Up to Day 9, Atlassian maintained radio silence. Avoid this approach at all costs.
Avoiding the incident:
- Have a rollback plan for all migrations and deprecations. In the Migrations Done Well issue, we covered practices for migrations. Apply the same approaches to deprecations.
- Do dry runs of migrations and deprecations. As per the Migrations Done Well issue.
- Do not delete data from production. Instead, mark data to be deleted, or use separate tenancies to keep away from data loss. See the sketch below this list for a minimal version of the mark-for-deletion approach.
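To make that last point concrete, here is a minimal sketch of the mark-for-deletion (soft delete) pattern. The schema, retention window, and helper names are hypothetical – this illustrates the pattern, not how Atlassian stores tenant data.

```python
# A minimal sketch of the "mark, don't delete" idea from the list above.
# Hypothetical schema and helpers - not Atlassian's implementation.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION = timedelta(days=30)  # assumed recovery window before a hard purge

@dataclass
class TenantRecord:
    tenant_id: str
    data: dict
    deleted_at: Optional[datetime] = None  # None means the tenant is live

    def mark_deleted(self) -> None:
        """Soft delete: hide the tenant, keep the data recoverable."""
        self.deleted_at = datetime.now(timezone.utc)

    def restore(self) -> None:
        """Undo a soft delete - the bulk operation Atlassian lacked tooling for."""
        self.deleted_at = None

    def purge_if_expired(self) -> bool:
        """Hard delete only after the retention window has passed."""
        if self.deleted_at and datetime.now(timezone.utc) - self.deleted_at > RETENTION:
            self.data = {}
            return True
        return False

# Usage: a mistaken deletion stays recoverable for 30 days instead of being final.
tenant = TenantRecord("site-123", {"issues": ["JIRA-1", "JIRA-2"]})
tenant.mark_deleted()   # oops - wrong tenant
tenant.restore()        # one call, no multi-week rebuild
```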
My take
Atlassian is a tech company, built by engineers, building products for tech professionals. It wrote one of the most referenced handbooks on incident handling. And yet, the company did not follow the principles it wrote about.
Now, while some people may feel outraged by this fact, just this week I wrote about how Big Tech can be messy on the inside, and how we should not hold a company to impossibly high expectations:
"You join a famous tech company. You've heard only great things about the engineering culture, and after spending plenty of time reading through their engineering blog, you're certain this is a place where the engineering bar – its standards – is high, and everyone seems to work on high-impact projects. Yet when you join, reality completely fails to live up to your expectations."
What I found disappointing in this handling was the radio silence for days, coupled with the fact that zero Atlassian executives took public ownership of the incident. The company has two CEOs and a CTO, and none of them communicated anything externally until Day 9 of the outage.
Why?
One of Atlassian's company values is Don't #@!% the customer. Why was this missed? How did leadership ignore this value for 9 days, and why did they do so?
What does this kind of passive behavior from executives say about the culture at the company? Why should a customer place its trust in Atlassian when its leadership doesn't acknowledge when something goes very wrong for hundreds of its customers and tens or hundreds of thousands of users at these companies?
Outages have happened, happen, and will keep happening. The root cause is less important in this case.
What matters is how companies respond when things go wrong, and how quickly they do so. And speed is where the company failed first and foremost.
Atlassian did not respond to this incident with the nimbleness that a well-run tech company would. The company will have plenty of time to figure out the reasons for this poor response – once all customers get access back in the next few weeks.
In the meantime, every engineering team and executive should ask themselves these questions:
- What if we lost all JIRA, Confluence and Atlassian Cloud services for weeks – are we prepared? What about other SaaS vendors we use? What happens if those services go down for weeks? What is our Plan B? (For one building block of a Plan B, see the sketch after this list.)
- What are the learnings we should take away from this incident? What if we did a partial delete? Do we have partial restore runbooks? Do they work? Do we exercise them?
- What is one improvement that I can implement in my team, going forward?
- What criteria do I use to choose my vendors, across functionality, and promised versus actual SLAs? How do major outages shape my vendor selection process?
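On the Plan B question: one low-effort building block is exporting your own data on a schedule, so a vendor outage does not mean losing all access at once. The sketch below pages through the standard Jira Cloud REST search endpoint and dumps recent issues to a local JSON file. The site name, credentials, JQL, and file path are placeholders; a real setup would also cover Confluence and store the exports somewhere durable.

```python
# A minimal "Plan B" sketch: periodically export your own Jira issues to local
# JSON so a vendor outage does not mean losing access to everything at once.
# Assumes the standard Jira Cloud REST API v2 search endpoint with an API token;
# site name, credentials, and JQL are placeholders - adapt to your instance.
import json
import requests

JIRA_BASE = "https://your-company.atlassian.net"   # placeholder site
AUTH = ("you@your-company.com", "API_TOKEN")        # basic auth: email + API token
JQL = "updated >= -7d"                              # everything touched in the last week
PAGE_SIZE = 100

def export_issues(path: str = "jira-backup.json") -> int:
    """Page through the search API and dump raw issue JSON to disk."""
    issues, start_at = [], 0
    while True:
        resp = requests.get(
            f"{JIRA_BASE}/rest/api/2/search",
            params={"jql": JQL, "startAt": start_at, "maxResults": PAGE_SIZE},
            auth=AUTH,
            timeout=30,
        )
        resp.raise_for_status()
        page = resp.json()
        issues.extend(page["issues"])
        start_at += len(page["issues"])
        if start_at >= page["total"] or not page["issues"]:
            break
    with open(path, "w") as f:
        json.dump(issues, f)
    return len(issues)

if __name__ == "__main__":
    print(f"Exported {export_issues()} issues")
```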
If you enjoyed this article, you might also enjoy my weekly newsletter. It's the #1 technology newsletter on Substack. Subscribe here 👇
Pragmatic Engineer Jobs
Check out jobs with great engineering cultures for senior software engineers and engineering managers. These jobs score at least 10/12 on The Pragmatic Engineer Test. See all positions here or post your own.
- Senior/Lead Engineer at TriumphPay. $150-250K + equity. Remote (US).
- Senior Full Stack Engineer – Javascript at Clevertech. $60-160K. Remote (Global).
- Senior Full-Stack Developer at Commit. $115-140K. Remote (Canada).
- Staff Software Engineer at Gradually. $250-300K + equity. Remote (US, Canada) / Austin (TX).
- Software Engineer at Anrok. Remote (US).
- Senior Ruby on Rails Engineer at Aha!. Remote (US).
- Senior Software Engineer at Clarisights. €80-140K + equity. Remote (EU).
- Full Stack Engineer at Assemble. $145-205K + equity. Remote (US).
- Senior Developer at OpsLevel. $122-166K + equity. Remote (US, Canada).
- Senior Mobile Engineer at Bitrise. $100-240K + equity. Remote (US).
- Senior Frontend Engineer at Hurtigruten. £70-95K. London, Remote (EU).
- Mobile Engineer at Treecard. $120-180K + equity. Remote (US).
- Engineering Manager at Clipboard Health. Remote (Global).
- Senior Backend Engineer (PHP) at Insider. Remote (Global).
- Senior Software Engineer at Intro. $150-225K + equity. Los Angeles, California.
- Senior Software Engineer at OpenTable. Berlin.
- Software Engineer at Gem. San Francisco.
Other openings:
- Engineering Manager at Basecamp. $207K. Remote (Global).
- Senior Full Stack Engineer at Good Dog. $150-170K. Remote (US), NYC.
- Senior Full-Stack Developer at Commit. $80-175K. Remote (Canada).
- Software Engineer at TrueWealth. €100-130K. Zürich.
- Senior iOS Engineer at Castor. €60-100K + equity. Remote (EU) / Amsterdam.
- Senior Backend Engineer at Akeero. €75-85K + equity. Remote (EU).