At Asana, our invalidation pipeline is a key part of implementing near-realtime reactivity in the webapp.
Reactivity is how each and every Asana tab keeps up to date with the latest changes without refreshing. The invalidation pipeline is the system responsible for watching every change, figuring out what data is affected, and notifying all relevant parties.
The problem is clearly tractable for a few users in an office—but how do you scale that for millions of users globally?
Perhaps more pertinently, how do we get there? Practically every change-data-capture (CDC) system is vulnerable to factors like write load and read load—factors that are bound to be stretched by vast increases in user traffic. I think the hard part is strategy: predicting what will break, when—and then investing the correct amount into scaling the required systems.
In this two-part series, we'll break down 3 key scaling challenges and our strategy for solving them. Part 1 provides a background on our architecture and covers the first major challenge, handling expensive transactions. Part 2 tackles the remaining challenges and lessons learned.
simplified view of data loading infrastructure
Each web app client creates a stateful session. These sessions are handled by components within LunaDb, our GraphQL-like data loading system (despite the name, it’s not a database). Web app clients send subscription requests for data over the websocket and receive diffs of updated data from LunaDb. Notably, all LunaDb data loads are subscriptions—in other words the system is reactive by default.
Reading data is straightforward: LunaDb can process subscriptions and make any necessary database queries via the underlying WorldStore service. But reading data is one thing, and reacting to changes is another. This is commonly handled by a publisher-subscriber (aka pubsub) system where data change notifications (aka invalidations) are distributed to downstream processes.
Pubsub is often handled by explicitly issuing invalidations as data is mutated. For example, a mutation handler is typically responsible for both mutating the data and issuing related invalidations. This is how our original pubsub system worked. But writing data to a database, invalidating caches, and issuing invalidations are multiple operations across multiple systems. If any operation fails, we might leave our entire system in an inconsistent state (think stale tasks, outdated fields). Our old stack required adding a bunch of fail-safes to our mutation handler to prevent this.
mutation handler that performs invalidations
When we started building our new LunaDb stack, we were curious if we could simplify the mutation handler. Our idea was to decouple mutations from invalidations, i.e. make the mutation handler only responsible for making a single atomic write and make the invalidation pipeline responsible for consistency based on the write. To implement this invalidation pipeline, we created a change-data-capture (aka CDC) process that watched all database changes and handled the non-atomic operations (invalidating caches and issuing invalidations). Since invalidations are idempotent (at worst, duplicates only trigger extra reads), we could safely retry these operations without holding up the mutation handler.
decoupled mutation handler and invalidation pipeline
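For concreteness, here's a minimal sketch of the contrast. The names (db, cache, publish_invalidations, affected_keys) are hypothetical stand-ins, not Asana's actual APIs.

```python
# Hypothetical sketch only -- db, cache, and publish_invalidations are
# illustrative stand-ins, not real Asana components.

def handle_mutation_old(db, cache, publish_invalidations, mutation):
    # Original approach: the handler writes data AND issues invalidations,
    # so a failure partway through needs explicit fail-safes to avoid
    # stale caches or missed updates.
    with db.transaction() as txn:
        txn.apply(mutation)
    cache.invalidate(mutation.affected_keys)       # may fail independently
    publish_invalidations(mutation.affected_keys)  # may fail independently

def handle_mutation_new(db, mutation):
    # Decoupled approach: the handler's only job is one atomic write.
    # The CDC-based invalidation pipeline observes the committed transaction
    # in the binlog and handles caches and pubsub, retrying as needed.
    with db.transaction() as txn:
        txn.apply(mutation)
```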
This is exactly how our current invalidation pipeline works. It relies on the MySQL binary log/binlog (a linearized stream of all transactions committed to the database) for watching changes. The main component of the pipeline is a process called the Invalidator that tails the binary log and processes each transaction to maintain consistency.
simplified view of the invalidation pipeline’s role in our infrastructure
For each transaction, the Invalidator (sketched roughly in code after this list):
Determines the set of affected data (invalidation generation)
Invalidates WorldStore caches accordingly (cache invalidation)
Streams the set of invalidations to all downstream consumers in LunaDb (invalidation distribution)
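In code, the Invalidator's main loop might look roughly like this. It's a sketch: tail_binlog, generate_invalidations, worldstore_cache, and stream are hypothetical stand-ins for the real components.

```python
# Rough sketch of the Invalidator's per-transaction loop; all names here are
# illustrative placeholders rather than Asana's actual interfaces.

def run_invalidator(tail_binlog, generate_invalidations, worldstore_cache, stream):
    for txn in tail_binlog():                      # linearized stream of commits
        # 1. Invalidation generation: which objects/queries does txn affect?
        invalidations = generate_invalidations(txn)

        # 2. Cache invalidation: drop the affected WorldStore cache entries.
        worldstore_cache.invalidate(invalidations)

        # 3. Invalidation distribution: fan the set out to every LunaDb
        #    consumer, tagged with the transaction's binlog checkpoint.
        stream.publish(checkpoint=txn.checkpoint, invalidations=invalidations)
```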
The most complex part of the invalidation pipeline is invalidation generation. We determine the set of affected data by using the datamodel to evaluate which objects and/or queries are affected by the transaction.
object invalidation
Object invalidations are simple and directly derived from the row-level changes. For example: if the task with objectId=1 has its assignee updated to Bob, then we emit an invalidation for objectId=1.
Query invalidations are more complex and may require loading additional data from the database to find all affected joins.
query invalidation
For example: If we also have a query that filters by project and Task.assignee, we would need to query the database to find all of the projects that this task belongs to and emit a query invalidation for each project.
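To make the example concrete, here's a rough sketch of generation for just this case. Everything here is hypothetical (the row_change shape, the TaskProjectMembership table, the query name); the real datamodel-driven logic is far more general.

```python
# Sketch of invalidation generation for the contrived task/assignee example.
# Table names, fields, and query shapes are made up for illustration.

def generate_invalidations(row_change, db):
    invalidations = []

    # Object invalidation: derived directly from the row-level change,
    # e.g. task 1's assignee changed -> invalidate object 1.
    invalidations.append(("object", row_change.object_id))

    # Query invalidation: a query filtering by (project, Task.assignee) is
    # affected for every project this task belongs to, which requires a lookup.
    if row_change.table == "Task" and "assignee" in row_change.changed_fields:
        project_ids = db.query(
            "SELECT project_id FROM TaskProjectMembership WHERE task_id = %s",
            (row_change.object_id,),
        )
        for (project_id,) in project_ids:
            invalidations.append(
                ("query", "tasks_by_project_and_assignee", project_id)
            )

    return invalidations
```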
This is a fairly simple and contrived example—but our actual datamodel has fairly complex relationships. When you combine these complex relationships with large amounts of data to sift through and a high rate of changes, it quickly becomes a challenging scalability problem.
Of course the invalidation pipeline alone isn’t sufficient for end-to-end reactivity. The other portion of our pubsub system is implemented in LunaDb and is responsible for tracking subscriptions/cached data, consuming the invalidation stream, and updating all affected data. Each of these LunaDb processes are called invalidation consumers since they rely on the pipeline for reactivity and consistency.
There are many interesting problems (and solutions) here, but that's a topic (no pun intended) for another day.
This decoupled system for invalidations is an interesting and perhaps non-traditional solution—but it has distinct advantages.
Developer simplicity
From a developer perspective, most invalidation problems are automatically handled.
For reads, the “topics” that each bit of data is subscribed to are automatically generated.
For writes, the mutation handler only needs to worry about the atomic write to the database. Invalidation generation is derived from the datamodel and database write.
If your client can read the data on your local development machine, you're good to go! Check it into prod! We'll make sure that anything you’ve read will be reacted to and new data pushed your way.
System correctness
From a system perspective, if you were able to commit your single atomic transaction, we'll guarantee that all the right pubsub topics will be dispatched, eventually.
These are very valuable properties, and they certainly weren't standard in the GraphQL environment of the day. But as we'll see, they do bring challenging scaling problems!
With this picture of how the invalidation pipeline works, we can start to reason about the scaling challenges and form a rough risk model. The main risk (outcome to avoid) is stalling the invalidation pipeline. If the pipeline is stalled for long periods of time, this is pretty bad for product experience—users stop seeing reactive updates, caches remain stale, and use cases requiring strong consistency see complete downtime.
Our risk model is based on
breaking down our system into rough independent functions and identifying how their failures affect the system
going over the different shapes of traffic and identifying how they impact these components
We can mostly reduce the invalidation pipeline into two functions, invalidation generation and invalidation distribution. If invalidation generation fails, then the pipeline breaks because we can no longer figure out the set of data affected by each change. Similarly, if invalidation distribution fails, then the pipeline breaks because we can no longer distribute these notifications to affected processes.
invalidation generation vs invalidation distribution
Read and write load
First off, as a CDC system, the workload of the entire pipeline scales linearly with the amount of write traffic. Too many transactions (e.g. large migrations, bulk edits, email/integration loops, etc.) could overload invalidation generation. Likewise, a flood of transactions could overload invalidation distribution.
Specifically for invalidation distribution, the work required also scales linearly with the number of subscribed processes in LunaDb, as each LunaDb process consumes from the stream independently. As a result, too many subscribed processes could overload invalidation distribution. When could this happen? Whenever we autoscale these processes to handle more read traffic (i.e. more user sessions and data loads)!
Finally, the total load on each invalidator will roughly correspond to the rate of transactions × invalidation fanout, so simultaneous increases to the amount of work in generation and distribution could overwhelm the invalidator without hitting the typical limits in isolation. (Think an increase in upstream write traffic at the same time as we scale up our downstream processes.)
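Roughly, and purely as an illustration of the scaling shape (the symbols below are shorthand I'm introducing, not a measured model):

```latex
\text{invalidator load} \approx R_{\text{txn}} \times \left( C_{\text{generate}} + N_{\text{consumers}} \times C_{\text{distribute}} \right)
```

where R_txn is the rate of transactions on the database, C_generate the per-transaction cost of invalidation generation, N_consumers the number of subscribed LunaDb processes, and C_distribute the per-consumer delivery cost.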
Outside of read and write load, there are a few other relevant scaling factors.
Database load
Computing invalidations for transactions often requires querying the database. This means that we are directly transforming our write load into read load as part of generating invalidations. While typically benign, this can be a vector for issues if certain writes trigger disproportionately more expensive invalidation generation queries.
Expensive transactions
Continuing on this theme, it is very possible for specific transactions to be much more expensive to compute invalidations for. If a single transaction took much longer to process, it would block all subsequent invalidation generation. This means that the invalidation generation pipeline is pretty vulnerable to head-of-line blocking caused by expensive transactions.
What could be that expensive?
changes to highly-connected data like domain objects
large parts of the product may depend on domain information, so there may be a high branching factor
large updates like bulk imports from CSVs
we can typically break up patterns like these on a case-by-case basis, but there can always be new patterns 🙃
Operational complexity
As demonstrated above, there are a variety of mechanisms that can stall the invalidation pipeline. In these cases we often need to reset the pipeline.
We call resetting the pipeline “fast forwarding”, to indicate that we’re skipping past a period of updates to a new latest entry. Unfortunately, any automatic fast forwarding effort is fraught due to the overwhelming cost of updating all relevant caches. As such, the system was initially designed around manual fast forwarding. As with any system that cannot self-heal, this means we are directly vulnerable to operational overload.
Phew, there are so many ways to break our system (on paper)! Discussing these theoretical scaling risks is fun, but the more interesting question is priority. Which one of these scaling issues will actually cause problems first? And how will we go about solving them?
What’s the first big issue?
You might assume that the first major scaling issue would be total load scalability; however, a few key design decisions forestalled this.
The previous architecture I drew was a simplified view of the world. Rather than only operating a single database, we actually have many independent domain databases. Each customer has their data isolated to a single domain. Many domains can exist on a single domain database (shared tenancy). Each invalidator is responsible for a single domain database.
Helpfully, in LunaDb most data loads are isolated to their respective domain and don’t have cross-domain dependencies.
Additionally, early on we built out a fairly robust live shard migration process. Whenever a domain database started getting too large, we would split some of its domains out into a new database. Capping the size of each database constrains the rate of transactions per database.
That’s not to say that we didn’t encounter scenarios where our invalidation pipeline was overloaded. In particular, large datamodel migrations (like migrating every Asana task to a new datamodel) and high-throughput shard migrations would overly stress our systems. However, the overall load was still sufficiently small and the spikes sufficiently infrequent that manual vertical scaling of invalidators was adequate.
Suffice to say, the first problem we encountered that we couldn’t easily solve with our existing toolkit (i.e. domain sharding + shard migrations, vertical scaling) was expensive transactions.
We operate a single Invalidator per domain database. The Invalidator tails the binlog for changes, generates all of the corresponding invalidations per transaction and streams this to consumers in-order.
Why does ordering matter? In lieu of a system-wide versioning scheme, we use our progress through the binlog as a checkpointing/versioning construct. Specifically, we define a checkpoint as
(binlog_file_name, binlog_file_offset)
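As a sketch, this tuple orders naturally if compared field by field (the file names and zero-padding assumption below are illustrative, not a statement about our exact format):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Checkpoint:
    """A binlog position, compared field by field: file name first, then offset.
    Assumes binlog file names are zero-padded so they sort lexicographically
    (e.g. "mysql-bin.000123" < "mysql-bin.000124")."""
    binlog_file_name: str
    binlog_file_offset: int

# Later positions in the binlog compare as strictly greater, which is what
# lets a checkpoint double as a version for "how far has this consumer read?".
assert Checkpoint("mysql-bin.000123", 400) < Checkpoint("mysql-bin.000124", 10)
assert Checkpoint("mysql-bin.000123", 400) < Checkpoint("mysql-bin.000123", 500)
```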
As such, we have a correctness requirement that invalidations are delivered in-order. Normally this is workable when transaction processing is quick (i.e. sub-second). However, it becomes a serious problem when transaction processing starts taking much longer (i.e. tens of seconds or even minutes). If the latest transaction is very slow, we cannot deliver the invalidations corresponding to any subsequent transactions—even if we already generated them. We end up with a classic head-of-line blocking problem where the entire pipeline stalls on these specific slow entries.
And it gets worse. We operate a single invalidator per database, and each database has thousands of domains. Therefore, a single slow transaction on one domain can block invalidations (and therefore break reactivity) for every domain on the database—even if everything else was fast!
When this first popped up, we implemented a bevy of small fixes: capping the size of transactions, setting thresholds for invalidations per transaction, and optimizing per-transaction invalidation generation to deduplicate work. While the fixes helped, the increasing frequency of expensive transactions ultimately forced a no-winner tradeoff between system instability and too much manual work.
Our initial approach was to short-circuit work on these expensive transactions (i.e. stop if it exceeds a threshold) and instead emit a broader “dangerous invalidation”. These “dangerous invalidations” would be technically correct but would invalidate way more data than necessary, i.e. an entire domain, or an entire index.
Unfortunately, to be safe we’d have to process them manually. Tuning this threshold gradually became impossible. Set it too low, and we’ll cause system instability. Set it too high and we’ll create way too much manual work. Clearly, we needed more than just a bandaid here—we needed to rethink the semantics of our invalidation pipeline.
Our first idea was changing our pipeline from post-commit invalidation to pre-commit invalidation. In short, we'd migrate away from binlog-based CDC and instead build a separate pipeline for emitting invalidations before mutations are committed to the database (similar to our previous design). The main advantage would be the ability to determine, pre-commit, whether a mutation would be expensive to process, and to automatically split it up into smaller, more tolerable writes.
This approach seemed really promising and would have opened up some powerful possibilities for our invalidation pipeline. Unfortunately, it also had serious issues with eventual consistency. A great attribute of our post-commit invalidation approach is that it provably guarantees eventual consistency—and our entire system relies on it! Our entire invalidation-pipeline based caching architecture requires that caches are always eventually consistent (we don't use TTLs). Any inconsistencies could lead to perpetually stale data and missed updates to client sessions.
When we tried to prove the eventual consistency of pre-commit invalidation, we found it was possible for issues to occur when the results of queries used for invalidation generation changed concurrently. As such, it was only possible to guarantee correctness if we used locking reads by row (i.e. SELECT … FOR SHARE) to prevent that from happening. Unfortunately, this locking isn't free. As our system is fairly write-heavy, this change had the potential to seriously obstruct our mutation system, so we ultimately rejected this approach.
Hmm, so maybe we can’t split up these pesky expensive invalidations. Remember this whole issue exists because we have strict constraints to process invalidations in-order AND we deal with expensive invalidations that we can’t easily break up post-commit. If we must live with some expensive invalidations, what if we could relax the constraint around invalidation ordering?
That’s exactly what we ended up doing. Our second approach was to change our system such that we could relax the guarantee for in-order invalidations. To do this, let’s re-examine why we need in-order delivery.
Requirement for in-order delivery
We've covered this more elsewhere, but we primarily need it for our distributed caching system. When processes in LunaDb request data from WorldStore, they include a checkpoint. WorldStore guarantees that the returned result will be at least as new as the checkpoint. Without this guarantee, it'd be possible for LunaDb processes to inadvertently load perpetually stale data.
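Here's what that guarantee could look like on the read path, as a sketch. The names (processed_checkpoint, cache, db, and the read function itself) are hypothetical, not WorldStore's real API.

```python
# Hypothetical read path: serve from cache only if the cache has applied
# invalidations at least up to the requester's checkpoint; otherwise fall
# back to the database, which by definition has seen every committed txn.

def read(key, requester_checkpoint, cache, db):
    if cache.processed_checkpoint >= requester_checkpoint:
        hit = cache.get(key)
        if hit is not None:
            return hit          # guaranteed at least as new as the requester
    value = db.load(key)        # authoritative fallback
    cache.put(key, value)
    return value
```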
Clearly then, we need to keep delivering invalidations in-order. Or do we? Let’s consider if we relaxed this requirement and started delivering invalidations out of order. A simple way to allow this is by instead delivering an “outstanding invalidation” message.
outstanding invalidations in the stream
The outstanding message simply means that at some arbitrary point in the future we will deliver the corresponding invalidation (i.e. for checkpoint=1 in this case). In other words, it will be delivered out-of-order. Correspondingly, each process that consumes from the invalidation pipeline can track the latest seen invalidation (checkpoint=3) and any outstanding/missing invalidations (checkpoint=1).
outstanding invalidations in our invalidation pipeline architecture
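One way a consumer could track this state is sketched below. The message shapes and field names are made up for illustration; the real consumer logic lives inside LunaDb.

```python
class ConsumerProgress:
    """Tracks what a LunaDb consumer has and hasn't seen from the stream
    once "outstanding invalidation" messages are allowed."""

    def __init__(self, apply_invalidations):
        self.apply_invalidations = apply_invalidations  # callback into LunaDb
        self.latest_seen = None    # highest checkpoint delivered so far
        self.outstanding = set()   # checkpoints promised but not yet delivered

    def on_message(self, msg):
        if msg.kind == "outstanding":
            # The invalidations for this checkpoint will arrive later,
            # out of order; remember that we are still missing them.
            self.outstanding.add(msg.checkpoint)
        else:
            self.apply_invalidations(msg.invalidations)
            self.outstanding.discard(msg.checkpoint)   # a late delivery arriving
            if self.latest_seen is None or msg.checkpoint > self.latest_seen:
                self.latest_seen = msg.checkpoint
```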
How can we tell if this is correct? In the above scenario…
We have checkpoint1 and checkpoint3, where checkpoint1 < checkpoint3
A LunaDb process requests data from WorldStore at checkpoint3 without having yet seen invalidations corresponding to checkpoint1
Likewise, the WorldStore cache is up-to-date with checkpoint3 and hasn’t seen checkpoint1
Is it safe to use this data? Intuitively yes—they’ve seen the same messages from the invalidation pipeline. What about future updates?
We can make an important observation here. WorldStore doesn’t need to serve any data that reflects transaction T if the requesting LunaDb process has never received the invalidations for T. Why? Intuitively, this is because the LunaDb process must eventually receive the invalidations for T. When it does, it will trigger the appropriate reloads and we will have the opportunity to refresh the previously stale data.
Data freshness
The above example was pretty straightforward since the WorldStore cache had seen the exact same entries as the LunaDb process. What if the WorldStore cache and LunaDb are missing different invalidations and/or have seen different invalidations?
Based on the information we track above, we can formalize a version construct called DataFreshness that tracks both the highest checkpoint (checkpoint3) and any missing previous checkpoints (i.e. checkpoint1). This is a similar construct to vector clocks, and as such we can define a partial ordering over DataFreshness. Without getting into the weeds, these properties let us prove that out-of-order invalidation is eventually consistent.
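As a toy model of that construct (checkpoints simplified to small integers; this is illustrative, not the production representation):

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class DataFreshness:
    latest: int                   # highest checkpoint observed
    outstanding: FrozenSet[int]   # earlier checkpoints not yet delivered

    def seen(self) -> FrozenSet[int]:
        """Every checkpoint this freshness value has actually incorporated."""
        return frozenset(c for c in range(1, self.latest + 1)
                         if c not in self.outstanding)

def at_least_as_fresh(f1: DataFreshness, f2: DataFreshness) -> bool:
    """f1 >= f2 iff f1 has incorporated everything f2 has. This is only a
    partial order: two values can each be missing something the other has."""
    return f2.seen() <= f1.seen()

# Cached data can serve a request only if the cache is at least as fresh
# as the requester; otherwise we fall back to the database.
cache_f = DataFreshness(latest=3, outstanding=frozenset({1}))
req_f = DataFreshness(latest=3, outstanding=frozenset({1}))
assert at_least_as_fresh(cache_f, req_f)

f_a = DataFreshness(latest=3, outstanding=frozenset({1}))  # has seen {2, 3}
f_b = DataFreshness(latest=2, outstanding=frozenset())     # has seen {1, 2}
assert not at_least_as_fresh(f_a, f_b) and not at_least_as_fresh(f_b, f_a)  # incomparable
```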
If you're lost amongst all the technical detail here, don't worry—there's a simple intuition that should help convince you about correctness. No matter which order we deliver invalidations in, and no matter which prior invalidations a process might not have seen, the database is guaranteed to have seen all of them. As long as our caching layer is implemented correctly, any time we fall back to the database we will get an up-to-date result.
The resulting system is then straightforward (sketched in code after this list). For each transaction:
we try to generate invalidations
if it succeeds within a time limit, say 10s, we emit the invalidations in-order
if it times out, then we emit a different message that lets any consumers know to fetch out-of-order invalidations
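Sketched out, that per-transaction decision might look like this. The 10-second budget mirrors the example above, but the helper names (generate_invalidations_with_timeout, enqueue_async_generation, publish_outstanding) are hypothetical.

```python
def process_transaction(txn, stream, generation_timeout_s=10):
    try:
        invalidations = generate_invalidations_with_timeout(
            txn, timeout_s=generation_timeout_s)
    except TimeoutError:
        # Too expensive to compute in-line: tell consumers this checkpoint's
        # invalidations are outstanding and will arrive out of order later,
        # then hand the work off to a background path.
        stream.publish_outstanding(checkpoint=txn.checkpoint)
        enqueue_async_generation(txn)
        return
    # Fast path: emit the invalidations in order, as before.
    stream.publish(checkpoint=txn.checkpoint, invalidations=invalidations)
```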
Designing, building, and rolling out this system took around six months. It required rewriting much of our core invalidation distribution logic, our library code around database checkpoints and versions, and our WorldStore caching code. Despite the scale of the rewrite, we were able to roll it out without any major cache inconsistency issues and it's worked successfully ever since. It's a testament to the success of the approach that we haven't invested any more time into this problem or into optimizing invalidation generation since (going on six years now).
Fun fact: when we first built the invalidation pipeline, we initially tried using Kinesis and found that it didn't scale. Our conclusion at the time was that Kinesis was architected for many producers and few consumers, whereas our use case was few producers and many consumers. Here's what our original architecture diagram looked like:
diagram from the original design doc to remove kinesis
Kafka wasn’t as mature at the time, and it had the reputation of being difficult to operate. With our small team at the time (<5 people), we didn’t want to incorporate another large source of complexity and operational load.
If we re-did this today, Kafka would be an ideal solution (and is in fact something we're considering for the future). We still probably wouldn't be able to easily use a partition per domain, since each LunaDb process tracks a dynamic set of domains, but we would at least be able to set up in-order and out-of-order partitions.
With out-of-order invalidations, we have a robust way of handling expensive invalidations that should work for the foreseeable future. But as we mentioned earlier, there are other ways in which our system must scale. So what breaks next, and how should we fix it?
In part 2 of this series, we’ll break down the two remaining large challenges, discuss the future possibilities and reflect on some lessons learned from building and scaling this system.
Arvind Vijayakumar is an engineer on the LunaDb team, where he works on helping build and scale Asana’s core Data Loading platform—the critical infrastructure that ensures users always see accurate, reactive, and lightning-fast updates across our web apps and APIs.
Scaling the invalidation pipeline has been a long-running team effort spanning the last 8 years (at least), involving many members of LunaDb, both present and past: Brandon Zhang, Andrew Budiman, Bjorn Swift, Theo Spears, Spencer Yu, George Ong, Eric Walton, Alex Matevish, Natan Dubitski, Vinodh Chandra Murthy, Sophia Yao, Koushik Ghosh, Hannah Christenson, Jerald Liu, Tyler Prete and more (folks who started before me)
WorldStore cache invalidation is trivially cheap by comparison.
A flood of invalidations also technically leaves us vulnerable to spikes in data reloads in LunaDb, but that’s a topic for another day.
This has actually caused issues in the past for some particularly complex db queries that expose weaknesses in our datamodel and end up being extraordinarily expensive.
There is some metadata that doesn't fit cleanly into this model and lives on user-sharded/non-domain databases or in DynamoDB.
We built this system for more than just invalidation pipeline scaling concerns—we were already hitting other issues with total db load and size.
We did actually write a proof by induction! Unless folks are interested, I’ll save y’all the exercise in discrete math.
In short, given data freshness values f1 and f2, we roughly know that f1 and f2 are incomparable if f1.outstanding ⊈ f2.outstanding and vice versa.
Proof is excluded for brevity
An interesting detail to note here is bounding how “out of order” our system gets. While it is eventually consistent, it might lead to pretty weird user interactions if many changes take a lot longer to reflect. To minimize this we bound the number of concurrent outstanding invalidations on a process. We haven’t had to change this threshold since we first rolled it out.