# Details on the April 9th, 2018 Asana outage

> Asana was down on April 9th, 2018 for multiple hours. Here's what caused this outage and how we resolved it.

Source: https://asana.com/inside-asana/april-9-2018-asana-outage


Asana was unavailable for multiple hours on Monday, April 9, 2018. The web app was partially unavailable between 6:08am and 6:53am PDT, and fully unavailable between 6:53am and 7:53am. The API was partially unavailable between 6:08am and 7:42am, and fully unavailable between 7:42am and 9:15am. Both of these outages were reported on [trust.asana.com](https://trust.asana.com/), although our reporting of the API outage was incomplete.

Below is a summary of what happened and some of the measures we’re taking to prevent and mitigate incidents like this in the future.

## A timeline of what happened

Requests on our webservers are handled by a large number of processes, each of which can serve multiple requests sequentially. Spinning up these processes from scratch is expensive, so by default we fork them from an existing process instead. In particular, we aim to create a single “zygote” process, and to create other processes by forking this zygote.
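The zygote pattern can be sketched in a few lines of Python on a Unix-like system. This is only an illustration under stated assumptions, not Asana's implementation: `expensive_startup`, `serve_in_forked_worker`, and the toy routing table are hypothetical names, and a real worker would serve many requests rather than one.

```python
import os


def expensive_startup():
    """Hypothetical stand-in for the slow part of process startup
    (loading code, warming caches). Runs once, in the zygote."""
    return {"routes": {"/ping": "pong"}}


def serve_in_forked_worker(state, path):
    """Fork a worker from the already-initialized process (the "zygote").

    The child inherits `state` through fork's copy-on-write memory, so it
    skips the expensive startup. Here it handles a single request, sends
    the response back over a pipe, and exits.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child (worker) process.
        exit_code = 1
        try:
            os.close(read_fd)
            response = state["routes"].get(path, "not found")
            os.write(write_fd, response.encode())
            exit_code = 0
        finally:
            # Exit immediately without running the parent's cleanup handlers.
            os._exit(exit_code)
    # Parent (zygote) process.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        response = reader.read()
    os.waitpid(pid, 0)  # Reap the child so it does not linger as a zombie.
    return response


if __name__ == "__main__":
    zygote_state = expensive_startup()  # Paid once, before any fork.
    print(serve_in_forked_worker(zygote_state, "/ping"))
```

The point of the design is that `expensive_startup` runs once, while each fork is comparatively cheap because the child shares the parent's memory pages until it writes to them.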

On the afternoon of Friday, April 6th, we deployed code to production that broke our ability to bring up new zygotes on webservers. This didn’t trigger alarms. We also have fallback behavior in which webservers simply bring up new processes from scratch whenever needed, so the effect was that our webservers had higher CPU load than normal. The increased load went unnoticed over the weekend.
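One way to keep a silent fallback like this from hiding a regression is to count how often it is taken and alert on the ratio. The sketch below is a hypothetical illustration, not a description of Asana's tooling: `SpawnStats`, `spawn_worker`, `should_alert`, and the threshold values are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SpawnStats:
    """Counts how workers were created since the last reset."""
    forked: int = 0
    cold_started: int = 0

    @property
    def total(self) -> int:
        return self.forked + self.cold_started

    @property
    def cold_start_ratio(self) -> float:
        return self.cold_started / self.total if self.total else 0.0


def spawn_worker(fork_from_zygote, cold_start, stats):
    """Prefer the cheap fork path; fall back to a cold start if forking
    fails, and record which path was taken."""
    try:
        worker = fork_from_zygote()
        stats.forked += 1
    except OSError:
        worker = cold_start()
        stats.cold_started += 1
    return worker


def should_alert(stats, max_ratio=0.05, min_samples=20):
    """Alert once enough spawns have been seen and too many of them
    used the expensive fallback path."""
    return stats.total >= min_samples and stats.cold_start_ratio > max_ratio
```

With a counter like this, a deploy that breaks the fork path shows up as a fallback ratio near 1.0 right away, rather than as a CPU increase that only matters once traffic rises.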

On Monday morning, around 6am, our traffic increased to the point that the extra CPU load on webservers became a problem. Webservers became overloaded, and over the course of 45 minutes, Asana went from being impaired to being almost entirely unavailable.

An oncall engineer was first paged at 6:08am. They quickly escalated the issue to get a number of engineers involved. At 6:32am an engineer identified the increased forking failure rate as a likely cause of the problem, and at 6:54am the team rolled back the code to a previous release.

Once the code was reverted, webservers started recovering, and [load on the main database started rapidly increasing](https://en.wikipedia.org/wiki/Thundering_herd_problem). At 7:09am, the database ran too low on memory due to a large number of mostly-inactive connections and started swapping. Its throughput dropped, and it was unable to keep up with requests. To un-stick the database, engineers triggered a manual failover to a backup at 7:30am.

Once the failover completed, the database had plenty of memory and became CPU bound instead. Clients of the database responded inconsistently to the overload. In particular, webservers backed off more aggressively than API servers, which resulted in webservers being frozen out entirely. In response, we shut off the API servers. This took enough load off the database for it to recover, and by 7:53am the web app had recovered.

We then began restoring the API. This took substantially longer than expected, and we have not completed our investigation into why this was the case.

## Measures we’re taking

We’ve already improved our alerting for problems like this. With this alerting in place, we would have been able to fix the problem long before it affected Asana users. We’ve also increased the size of our main database, which gives us more headroom.

In the short term, we’re also planning to update our internal documentation with clearer instructions for responding to overload, since certain actions took us longer than they should have, and to fix some tools that behaved poorly when the database was overloaded. Finally, we’re actively looking into why the reporting for API downtime was incomplete on [trust.asana.com](https://trust.asana.com/).

In the medium term, we plan to improve the back-off behavior for all of our database clients. This should make it easier for the system to recover on its own from an overloaded database. It should also give us better and more appropriate tools for shutting down problematic clients. For example, we should be able to stop the API from overloading the database without shutting it off entirely.
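As an illustration of what consistent back-off behavior could look like, here is a minimal sketch of exponential backoff with full jitter, a common technique for spreading out retries against an overloaded service. It is not Asana's client code: the `Overloaded` exception, the function names, and the default parameters are assumptions made for the example.

```python
import random
import time


class Overloaded(Exception):
    """Hypothetical error a client sees when the database is overloaded."""


def backoff_delay(attempt, base=0.1, cap=30.0, rng=random):
    """Delay before retrying after failed attempt number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`, and the actual delay
    is drawn uniformly from [0, ceiling] so that clients do not retry in
    lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_backoff(operation, max_attempts=5, sleep=time.sleep, rng=random):
    """Run `operation`, retrying with jittered backoff while it reports
    overload. Re-raises the error after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Overloaded:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, rng=rng))
```

If every class of client shares one policy like this, no single class should be able to crowd out the others simply by retrying more eagerly, which is the failure mode described in the timeline.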

Finally, in the long term, we plan to replace most of the functionality of the main database. The main database currently serves as a monolith that performs multiple unrelated functions. We plan to split this up into multiple services to reduce the blast radius when one of these services is overloaded. This will also allow strategies such as read replicas when appropriate.
