This post is part of a two-part series breaking down how we scaled our invalidation pipeline over the last few years. Read part 1 before reading this article.
Last time we talked, we outlined how our data loading infrastructure worked, noted key scaling challenges, and looked at how we scaled past one of the first major ones: expensive transactions causing head-of-line blocking. As discussed, implementing support for out-of-order invalidations solved this issue at a foundational level for the foreseeable future.
However, even with this mitigation, we still eventually ran into the total viable load limitations of running a single invalidator per database. At the same time, the strains of managing increasing total system load were also making the system more difficult to operate.
In this post, we’ll pick up from there and look at how we dealt with these system-wide challenges of total load scalability and operational complexity.
Total load is a term we previously defined to describe the total amount of work that an invalidator has to do. The invalidator’s main functions are generating invalidations and distributing invalidations—so unsurprisingly, total load is proportional to write traffic × invalidation fanout (i.e., the number of subscribed processes). Each invalidator has a natural limit on total load that, when exceeded, will cause it to fall behind in the stream or fail entirely. Ensuring that this doesn’t happen is the problem of total load scalability.
Total load scalability is a problem that we have repeatedly managed (or avoided) with a mixture of database sharding and vertical scaling. Eventually, these became insufficient to continue scaling sustainably. Surprisingly, the limiting factor wasn’t primarily growth in write traffic per database but rather increased read traffic.
How so? Continual database re-sharding helped limit the write traffic factor (although there still was steady per-domain growth). On the other hand, the size of our production LunaDb cluster grew from hundreds of pods to thousands, which put considerable strain on invalidators.
To exacerbate the issue, we relied on blue-green-style updates for LunaDb deployments—so every deployment meant spinning up an entire new cluster, warming it up and then hotswapping it over. During this window, each invalidator had to serve double the number of invalidation streams. Taking all of this into account, we still had some vertical scaling headroom, but not enough to be comfortable—especially if the growth rate continued.
Invalidation fanout
The main problematic factor here is the fanout involved in invalidation distribution. Given m databases and n invalidation consumers, we end up with m × n concurrent invalidation streams. In other words, every invalidator had to send every invalidation to every consumer, which compounds quickly as the cluster grows.
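To make the compounding concrete, here’s a quick back-of-the-envelope illustration (the numbers are made up for illustration and aren’t our production figures):

```python
# Hypothetical figures, for illustration only.
databases = 30      # m: domain databases, one invalidator each
consumers = 3000    # n: LunaDb processes, each subscribed to every invalidator

# m x n concurrent invalidation streams at steady state.
print(f"steady state:  {databases * consumers:,} streams")       # 90,000

# During a blue-green deploy, the old and new clusters are both subscribed,
# so every invalidator temporarily serves twice as many streams.
print(f"during deploy: {databases * consumers * 2:,} streams")   # 180,000
```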
Our theory at the time was that either lowering the number of connected consumers per invalidator or reducing the cost per connection could yield sizable scalability benefits. As a result, we ended up iterating through a few different approaches that might improve these factors.
invalidation fanout problem (from old architecture)
Our first approach was to try building domain-isolated invalidations. The goal here was to only distribute invalidations to processes that care about the associated domain. We projected this would reduce the volume of distributed invalidations by 75%.
We built a simple system here on top of out-of-order invalidations. We would deliver a stream of entries like “OutstandingInvalidation for domain=xyz at checkpoint=A”. If a process was serving data for the mentioned domain, it could fetch the associated invalidation separately.
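Roughly sketched, the consumer side looked something like the following (OutstandingInvalidation and fetch_invalidation here are illustrative stand-ins, not our actual API):

```python
from dataclasses import dataclass

@dataclass
class OutstandingInvalidation:
    """Lightweight stream entry that points at an invalidation without carrying it."""
    domain: str       # which domain database the transaction belongs to
    checkpoint: str   # binlog position identifying the transaction

def fetch_invalidation(domain: str, checkpoint: str) -> dict:
    """Stand-in for the ad hoc unary call that fetches the full invalidation payload."""
    return {"domain": domain, "checkpoint": checkpoint, "changed_objects": []}

def handle_entry(entry: OutstandingInvalidation, served_domains: set) -> None:
    # Entries for domains this process doesn't serve are dropped immediately --
    # this is where the projected 75% reduction in distributed invalidations came from.
    if entry.domain not in served_domains:
        return
    # Relevant entries require an extra round trip to fetch the payload,
    # which is the latency cost that ultimately made this approach a net loss.
    invalidation = fetch_invalidation(entry.domain, entry.checkpoint)
    ...  # apply the invalidation to local caches and subscriptions
```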
The results were surprising! This “work-saving” measure not only increased our end-to-end invalidation pipeline lag, it also increased invalidator CPU and memory usage. This was counterintuitive and one of our bigger surprises in scaling the system. What was going on?
Our existing approach heavily batched invalidations in the stream, whereas our new approach did not. Benchmarks showed that our batching had a significant impact on throughput.
Replacing a long-lived gRPC streaming call with many ad hoc unary calls adds not-insignificant overhead. We ran an experiment where all out-of-order fetch calls for invalidations were replaced with empty responses. Surprisingly, CPU in the distribution components remained fairly elevated even then.
Most critically, the throughput overhead of including invalidations in the streams (old approach) was fairly small relative to the latency of making an additional roundtrip (new approach). We ran a benchmark that replayed the invalidation stream twice—once without invalidations and once with invalidations.
In short, we weren’t transmitting the unpopulated entries fast enough for this change to be worth it. Moreover, when we considered the end-to-end latencies of both approaches, it became clear that the approach was untenable due to the extra round trip for fetching invalidations. So even though the idea seemed promising on paper, the real-world system dynamics and latency penalties made it a net loss.
With the knowledge that deviating from streaming was likely not viable, we elected to go in a more predictable direction: just reduce the number of consumers for each invalidator.
Our first idea was based on building silos in LunaDb that each served a subset of domain databases. We played around with this idea and soon realized it would be a considerable undertaking. The rest of infra caught on, and we instead decided to do this at a broader level for all of our critical infrastructure. Ultimately, the entire infrastructure org launched a much larger initiative to modernize and migrate our Kubernetes infrastructure to a cell-based architecture. If you’d like to hear more about how we accomplished that, stay tuned for a follow-up blog post!
This migration would eventually result in 10-20 independent infrastructure cells that each only loaded data from a subset of databases. The upshot for the invalidation pipeline was a 10x-20x reduction in invalidation fanout, which would give us considerable scaling headroom for the future.
But at the time we were facing this issue, cells were just an idea—and they were potentially years away. To bridge this gap, we ultimately decided to build invalidator replication per database. In short, we would make it possible to configure a custom number of invalidator replicas for each database. Running more invalidator replicas for a database splits its consumers among them, proportionally reducing the total load each individual invalidator has to handle.
invalidator database assignment
The work involved was straightforward. We simply had to update our invalidator database assignment to support multiple leases per database, update our controller to scale to these new configurations and update our routing to support multiple backends per database. All in all, we were able to go from design to rollout within a couple of months.
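As a minimal sketch of the idea (the configuration values, lease naming and hashing scheme below are assumptions for illustration, not our actual implementation):

```python
import hashlib

# Hypothetical per-database replica counts; in practice these are tuned per database.
REPLICAS_PER_DB = {"db_projects": 3, "db_tasks": 2, "db_users": 1}

def lease_names(db: str) -> list:
    """Each database now holds multiple invalidator leases instead of exactly one."""
    return [f"{db}/invalidator-{i}" for i in range(REPLICAS_PER_DB.get(db, 1))]

def route_consumer(db: str, consumer_id: str) -> str:
    """Deterministically spread consumers across a database's invalidator replicas.

    Every replica still generates all invalidations for its database (the duplicated
    work we accepted), but each replica only streams to a fraction of the consumers,
    dividing the distribution fanout by the replica count.
    """
    replicas = lease_names(db)
    bucket = int(hashlib.sha256(consumer_id.encode()).hexdigest(), 16) % len(replicas)
    return replicas[bucket]
```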
invalidator replication
Despite the simplicity, it was wildly successful in solving our pressing fanout problems and ended up being a critical stopgap. Admittedly, it was an inherently inefficient solution to the problem (we were duplicating the work of generating invalidations across invalidators). But while it wasn’t elegant, it was a pragmatic solution that bought us important time to complete the migration to our new cellularized infrastructure.
Finally, let’s talk about everyone’s least favorite part of running infrastructure: operational or keep-the-lights-on (KTLO) work. When you consider scaling a system for higher loads, the problem is usually straightforward and the solution (while involved) is proactive and strategic. In other words, you can plan for it. By contrast, operational work is reactive, the problems can be myriad, and the solution work is tactical, high-pressure and often bespoke to the situation. Excess operational work will usually kill a team's capacity to make meaningful progress on its long-term goals.
We’ve intermittently dealt with operational issues with the invalidation pipeline for most of the lifetime of the service. The unique ways to impact the system are too many to describe briefly—and of course, almost every issue here will stall the invalidation pipeline and break reactivity in the product. Moreover, the dynamics of the system are complex in ways that can make discovering the correct solution time-consuming and difficult. In a nutshell, that’s the problem of operational complexity.
Our system’s vulnerability to these varied breakages has led us to believe that rather than stopping all the leaks (so to speak), it’s more important to ensure that the system can either self-heal or be easily repaired. In other words, it’s going to break, so let’s make it easier (or automatic) to fix.
For brevity, let’s focus on one particularly thorny issue that heavily influenced subsequent development. To do so, we’ll have to dip a bit into the implementation details of the invalidator.
Invalidation caching
Invalidators track each new database transaction in a cache. This cache maps database checkpoints (i.e., binlog file and offset) to the next transaction. In other words, it semantically behaves like a linked list—just with great concurrent access and built-in eviction. This cache is populated by a binlog tailer, a process that walks the binlog from a provided database checkpoint onwards and adds each observed entry to the transaction cache. At minimum, we always operate at least one binlog tailer—the one responsible for invalidating WorldStore caches.
Remember, every invalidation consumer in LunaDb independently consumes from the invalidation stream. The corresponding distribution process on the invalidator walks the transaction cache and serves the corresponding invalidations. If this process ever comes across a missing entry in the cache, it means there’s missing data. To fill that data in, it will dynamically spin up another binlog tailer to backfill the missing entries (we use the binlog itself as the persistence layer for data changes).
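To make these mechanics concrete, here’s a simplified, single-threaded model of the transaction cache and the distribution walk (the real cache is a concurrent structure with proper eviction; the names and shapes here are illustrative):

```python
from collections import OrderedDict

class TransactionCache:
    """Toy model of the transaction cache: checkpoint -> the next transaction."""

    def __init__(self, max_entries=10_000):
        self._entries = OrderedDict()
        self._max = max_entries

    def put(self, checkpoint, transaction):
        self._entries[checkpoint] = transaction
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the oldest entry

    def get(self, checkpoint):
        return self._entries.get(checkpoint)

def serve_consumer(cache, cursor, send, spawn_backfill_tailer):
    """Walk the cache from a consumer's checkpoint, streaming its invalidations.

    A missing entry means there's a hole in the cache (typically because eviction
    outran a lagging consumer), so another binlog tailer is spun up to backfill it;
    the binlog itself is the persistence layer for data changes. In the real system
    the walk also blocks at the head of the stream; this sketch only shows the
    hole-handling path.
    """
    while True:
        txn = cache.get(cursor)        # the transaction that follows this checkpoint
        if txn is None:
            spawn_backfill_tailer(from_checkpoint=cursor)
            return
        send(txn["invalidation"])
        cursor = txn["checkpoint"]     # advance to that transaction's own checkpoint
```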
Why do we ever need more than one binlog tailer? If the transaction cache starts evicting entries newer than the oldest invalidation consumer’s position, we will be evicting entries that consumer eventually requires. The multiple-tailer approach handles these situations pretty well—unless you start making too many tailers.
two consumers spread far apart in the stream
There is a performance cost to running more binlog tailers. Each one is independently trying to do unfiltered change-data-capture (CDC) on the underlying database. This is a non-trivial amount of load on the database—and there are reasonable limits to how many concurrent tailers the database can handle. Of course, this is all exacerbated by our previous effort to replicate invalidators per database. As a result, we limit how many tailers a process can concurrently run. So what causes us to spawn many tailers?
Holes in the cache
Fundamentally, holes in our transaction cache are what cause us to spawn more tailers. When every LunaDb invalidation consumer operates in perfect lockstep, there are never any holes. There is simply the latest tailer, driven by the invalidation generation process. However, invalidation consumers don’t operate in lockstep. Some of them are doing more work than others and end up lagging behind.
many consumers spread across the stream
When the downstream LunaDb cluster becomes really large, there can be a large difference between the newest and oldest consumers, with a distribution of consumers in-between. When this happens, things break from the resulting cache thrash. The cache is constantly full, constantly evicting and constantly being populated from multiple different locations. If this happens enough, one or more invalidation consumers will find themselves “marooned” on an island of transaction entries with no spare tailers to spin up and “pave the way out”. Once the system enters this state, its performance degrades enough that it cannot recover without intervention.
As mentioned before, our main mechanism for solving any issue with the pipeline being stuck is manually “fast forwarding” the pipeline. This resets the checkpointing on the invalidator so it skips ahead to the very latest entry from the database. Unfortunately, it also means that it skips past an entire block of transactions. This is data loss, and it requires repairing every downstream cache that depends on the pipeline for consistency. The simplest way to do that is turning it off and turning it on again. This is more or less what we do now: we run a deploy that cycles all of the in-memory caches. Unfortunately, this is manual, slightly disruptive to uptime, and requires a full deploy cycle to clean up cache inconsistency. Surely we could do better…
Instead of cycling processes to wipe their caches, couldn’t we do something a bit more precise? What if we could invalidate all downstream consumers automatically when we fast forward? If we had such a mechanism it’d solve our cache thrash issues as well—we could automatically fast forward the oldest consumers.
We actually tried automating fast forwarding, but in short, the cost of refreshing data on all downstream systems made it impractical. Let’s break down why.
There are a few things happening when we “invalidate all downstream caches”. First, this would passively invalidate all of the in-memory data caches in LunaDb. Second, this would invalidate all of the actively held subscriptions.
Passively invalidating our data caches does increase server-side work, but it’s still mostly better than warming up the JVM from scratch. Reloading all of our subscriptions, on the other hand, is a monumental task. We knew this, so we intentionally spaced out these reloads. What we didn’t realize was the sheer scale of reloading all these subscriptions: we had to space the reloads so far apart that the mechanism would be minimally beneficial for the intended use case. Even then, we saw elevated reloads, CPU and GC pressure due to more complex interactions in the underlying LunaDb system. We realized that it was simpler, safer and almost as fast to just re-deploy LunaDb. So rather than invest more complexity into making this system work, we could mostly solve the problem in a couple of different ways.
First, we could make cycling LunaDb processes safer, easier to trigger, and faster. We can now do this in a single command without spinning up any new k8s cluster resources. This of course has broad benefits for operating our system.
Second, we could simplify the invalidation pipeline paradigm such that an invalidator would only be responsible for generating invalidations. Our troubles with in-memory caching clearly show that our reliance on doing everything in one process was starting to cost us—but how could we evolve our approach?
If we reflect back on our history of scaling challenges, we can see that most of our issues with operating and scaling invalidators stem from the invalidator process doing too much. The work-multiplying combination of generation and distribution led to total load scaling issues, and the complexity of implementing our own concurrent in-memory queue to link generation to distribution made the system difficult to operate. What if we could split them up?
Simply put, that’s our current plan! We should split up the two main responsibilities of the invalidator (generation and distribution) and connect them with a sufficiently persistent datastore.
simplified view of the invalidator separation plan
What are the benefits of this approach?
The invalidation generator is now just a much simpler invalidator:
it has a single binlog tailer (less complexity, less load on the database)
it consumes entries from the binlog, generates invalidations and stores them in an external invalidation store
it has basically no strong reliance on in-memory state
The invalidation generator only needs to publish to a single consumer, the invalidation store. Freed from invalidation fanout, it should scale exceptionally well (at least >10x better than now)—especially if we increase pipeline parallelism. At that point, we expect the database costs of generating invalidations to become the critical scaling bottleneck rather than the pipeline architecture.
The distributor streams entries from the datastore and publishes them to LunaDb consumers. It’s a simple process that
acts like a proxy to the invalidation store
is entirely stateless and can simply autoscale horizontally on CPU
In the long term, once we’ve simplified our API here, we can likely remove the distributor entirely and connect consumers directly to the invalidation store.
The invalidation store can be a simple off-the-shelf component like Redis Streams, NATS JetStream or Apache Kafka. As we mentioned in part 1, there wasn’t a rich variety of options here when we first built invalidators—but now there is! We should take full advantage!
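As a rough illustration of the shape of this split (using Redis Streams purely as an example store, since we haven’t committed to a specific component, and with illustrative names and stream keys), the generator appends invalidations to a per-database stream and the distributor simply relays them to consumers:

```python
import redis

r = redis.Redis()  # the invalidation store; Redis Streams chosen here only for illustration

def publish_invalidation(db: str, checkpoint: str, payload: bytes) -> None:
    """Generator side: a single downstream (the store), no fanout, no in-memory state."""
    r.xadd(
        f"invalidations:{db}",
        {"checkpoint": checkpoint, "payload": payload},
        maxlen=1_000_000,    # bounded retention; the binlog remains the source of truth
        approximate=True,
    )

def relay(db: str, send_to_consumer) -> None:
    """Distributor side: a stateless proxy that streams store entries to a LunaDb consumer."""
    last_id = "$"  # start at new entries; a real distributor would resume from a checkpoint
    while True:
        results = r.xread({f"invalidations:{db}": last_id}, block=5000) or []
        for _stream, entries in results:
            for entry_id, fields in entries:
                send_to_consumer(fields)
                last_id = entry_id
```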
Scaling the invalidation pipeline over the last few years has led us through a variety of system scaling problems and taught us some valuable lessons.
Bias for simplicity wherever possible. Complexity can be alright if absolutely necessary, but be cognizant of the tradeoffs. Out-of-order invalidations are an example of a good tradeoff because there was no way of solving our problems without breaking ordering guarantees in some way. Domain-isolated invalidations, on the other hand, were probably the wrong one. Despite having other reliable paths for improving scaling (i.e., reducing invalidation fanout), we opted for a speculative approach and didn’t correctly vet things ahead of time.
It is really valuable to validate the core principles of an approach as early and cheaply as possible. At the same time, synthetic benchmarks can often be misleading, and you should endeavor to replicate the critical aspects of the problem as closely as possible. For example, synthetic benchmarks for unary vs streaming requests didn’t indicate any overhead, yet the change clearly caused issues in production. Those issues were reproducible with more realistic benchmarks.
There is a very real silent cost to maintaining complex systems long-term. Those complexities may not always show up in uptime metrics or maintenance costs—but they will bog down velocity and make it much harder to scale expertise across the team.
Let’s substantiate this by considering how generic the invalidation pipeline is. It runs generic invalidation generation code that is entirely dependent on the product-defined datamodel specifications. The complex relationship between invalidation generation and the datamodel can require a relatively rare intersection of database, infrastructure, framework and datamodel knowledge. You could imagine that issues like expensive transactions would be much simpler to solve in a smaller, more constrained state space, where the owners of the rules underpinning invalidation generation also own the surrounding system. Interactions like these lead to the development of complex systems to manage them—like out-of-order invalidations.
In short, this design leads to increased costs in maintenance, onboarding, collaboration with neighboring teams and unplanned work when features put load on the system. It’s a common maxim that the more generic a system is, the more complex it will be to design, maintain and scale—and the invalidation pipeline is clearly no stranger to this.
Don’t get us wrong, it is really valuable that we can operate a completely generic invalidation pipeline. It would be far more expensive to operate individual pipelines for every use-case or surface area across the product. But we’d be remiss to ignore the very real tradeoffs here.
Continuing on this theme, it is even more important to establish clear system guarantees in such a system. Our invalidation pipeline is provably consistent (barring stalls). While the increased complexity does raise the bar for changes (writing proofs is harder than code), guaranteeing correctness in this part of the system is immensely valuable for reducing the scope of possible breakages when we discover correctness issues in downstream systems.
We’ve done a lot of disparate work on scaling our invalidation pipeline over the years, but the underlying architecture and principles have been surprisingly durable. I think it can be easy to propose additional complexity but often a simple solution will scale much farther than you’d expect, especially on modern hardware.
Is our invalidation pipeline close to perfected? Probably not, but I’d wager that the bulk of the available opportunities for improving the scale and end-to-end latency of our reactivity lie in the surrounding systems rather than in the core design. In that spirit, our focus has primarily shifted to improving our LunaDb read path. We’re applying some of the same principles of reducing operational complexity as part of our ongoing work to break up our data loading monolith. If you’d like to hear more, stay tuned for a follow-up blog post!
Arvind Vijayakumar is an engineer on the LunaDb team, where he works on helping build and scale Asana’s core Data Loading platform—the critical infrastructure that ensures users always see accurate, reactive, and lightning-fast updates across our web apps and APIs.
Scaling the invalidation pipeline has been a long-running team effort spanning the last 8 years (at least), involving many members of LunaDb, both present and past: Brandon Zhang, Andrew Budiman, Bjorn Swift, Theo Spears, Spencer Yu, George Ong, Eric Walton, Alex Matevish, Natan Dubitski, Vinodh Chandra Murthy, Sophia Yao, Koushik Ghosh, Hannah Christenson, Jerald Liu, Tyler Prete and more (folks who started before me)
Note: Ultimately, we just skip processing invalidations that are not relevant on each invalidation consumer, which yields a modest CPU improvement.
This may be surprisingly difficult at scale, especially with pre-commit invalidation or non-database sourced invalidation pipeline architectures.