Here at Asana, we’ve built a data loading system called LunaDb that serves as the backbone of our webapp. Despite the name, it’s not a database. Rather, it’s a GraphQL-like system for declaratively fetching data—basically a way to load the latest version of data and all future updates.
We initially launched LunaDb in 2015 as a radical rewrite of our backend infrastructure¹. The central component of this new system was the sync server, a monolith that performs everything from client synchronization to data loading and access control. Without significant changes, this initial architecture scaled far beyond our early expectations, all the way to millions of weekly active users and billions of daily queries.
While performance remained strong, as traffic and feature complexity grew, it became increasingly difficult to operate and improve LunaDb due to the constraints of the sync server.
overview of our data loading infrastructure
Why was it difficult to operate?
Traffic shifting is expensive
The sync server directly managed persistent websocket connections to clients. Each websocket was backed by a stateful client session. When a connection was broken, all of this state got thrown away and the client would re-subscribe to all of the data it cared about. When this happens with a lot of sessions, it gets expensive fast. So we had to be careful when shifting these connections, given the large spike in work from reconnects.
traffic shifting
Deploying sync servers means shifting traffic
Of course, you cannot perpetually forestall shifting traffic. Whenever you want to push new code or scale up/down you will need to decommission sync servers—and that requires shifting all of the traffic off the terminating instances.
rolling updates
Sync servers aren’t performant at startup
At the same time, sync servers only became performant after a non-trivial amount of process warming. Managing both of these issues (expensive traffic shifting and slow warmup) has been a fairly fragile balance and has required extensive engineering work in the past.
Sync servers are complex and difficult to monitor
Finally, as sync servers ran many arbitrary bits of product code (via custom server-side functions), they were very vulnerable to noisy neighbor-based performance regressions that were difficult to attribute².
the noisy neighbor problem
A reasonable question to ask is, “Why are sync servers complex and non-performant at startup?”
A common cause of these issues is our server-side product code. The main bits of sync server code are written in Scala. Despite some complexities related to session state management and the various aspects of the Luna framework, this framework/platform code mostly behaves as we expect (there are relatively few operational and performance issues).
On the other hand, these product server-computed values (we call them SCVs, but think custom resolvers) are written in TypeScript. Both sets of code run together in GraalVM, a polyglot VM that allows the use of multiple languages via its Truffle framework. Because they’re written in TypeScript, SCVs are essentially interpreted at startup—which predictably results in unacceptable performance and CPU usage. GraalVM will try to just-in-time compile invoked SCVs. This is good! GraalVM/Truffle can greatly optimize their performance—but doing so isn’t free. SCV compilation can be quite expensive (in CPU, code cache, etc.).
our polyglot VM setup
Why the two languages?
Our first design for SCVs was entirely in Scala. However, our mutation and async job systems are written in JavaScript/TypeScript. While Scala-based SCVs worked, the duplication of business logic between those systems and LunaDb, along with product engineers’ unfamiliarity with Scala, became quite a drag on product velocity.
Why GraalVM?
We cache a lot of data in-process to speed up computing (and re-computing) subscription results. Using GraalVM gives us a straightforward way to share these caches across languages without the correctness or performance concerns that might arise from splitting the Scala and TypeScript portions into separate containers.
Why was it difficult to improve?
Since the server did so many things and was relatively fragile to operate, we tended to avoid larger changes, not just because of code complexity but also because of the high overhead of safely rolling out new changes.
Yes I like asking questions
Given the difficulties with operating and improving the sync server, we made the difficult decision to change our architecture. Mainly, we decided to replace the monolithic sync server with two smaller kinds of components:
a session broker that manages client connections and state resolution
a syncable loader responsible for data loading, i.e. fulfilling queries from session brokers
session-brokers and syncable-loaders
Why does this help?
Immediately, this new architecture separates shifting websocket traffic (i.e., deploying session brokers) from warming new processes (i.e., deploying syncable loaders). As a result, we can minimize disruption by deploying syncable loaders separately from session brokers.
Each of these new components is simpler.
Session brokers are much lighter-weight, require no process warming and run no product code. As a result, we don’t need to deploy them as often—and when we do it’s fairly straightforward.
Syncable loaders have a simpler interface (stateless subscription requests) that is easier to adapt to standard Kubernetes horizontal pod autoscaling. This also makes it faster and more straightforward to warm them: we can just mirror the ambient request traffic flowing between session brokers and syncable loaders.
The new architecture allows us to greatly simplify the product development process. From inception, the independent deployment schedules of server-side product code and the client-side product code that calls it have been a drag on velocity and a source of operational toil (due to version incompatibility). Since syncable loaders are now the only remaining process that runs product code and deploying them is no longer disruptive, we can redeploy them whenever we push new product code.
This new architecture allows us to better scale to new features by deploying different pools of syncable-loaders for different workload types (such as different distinct features like Inbox, Task, Goals, etc). The session broker functions as a service gateway that can directly control how data queries are routed to different upstream syncable loaders.
Great! This new architecture sounds way better, but how did we make it a reality? The sync server is basically a monolith in that it encompasses multiple functions—and breaking apart monoliths is almost always tricky. In our case, we had to overcome some key design obstacles.
Breaking apart PubSub
PubSub, our system for implementing reactivity, was designed around a single process (the sync server) being responsible for loading new data and sending it to the client. We had to redesign PubSub in a way that guarantees correctness across these two now independent process types (syncable loaders and session brokers).
Let’s briefly dig into how it's implemented in sync servers. Note: It may be helpful to read our prior post on the invalidation pipeline, but we’ll provide a no-prerequisites view of the system.
On the sync server, we track subscriptions per session. We use the invalidation pipeline to continually monitor subscriptions for updates. On each new invalidation message, the sync server will reload all affected subscriptions.
Sync servers heavily cache db object/query results, custom resolver results, and previous subscription results to optimize subscription reloads (i.e. using a read-through pattern). Cached artifacts are passively invalidated by the same invalidation pipeline used for subscriptions. Whenever we try to use cached data we will check its validity and fallback to re-computation if need be.
We can observe a clear dependency between reloading subscriptions and invalidating cached data. Upon receiving an invalidation, if we reload a subscription before the cached data has been invalidated, we will potentially compute a stale result. When both data loading and subscription management happen on the same process, it is trivial to guarantee this dependency—simply invalidate caches before reloading.
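As a rough illustration, here’s what that read-through-with-passive-invalidation pattern might look like. This is a minimal sketch with hypothetical names, not our actual framework code:

```typescript
// Minimal sketch of a read-through cache with passive invalidation.
// All names here are illustrative, not Asana's actual framework types.

type CacheEntry<T> = { value: T; valid: boolean };

class ReadThroughCache<T> {
  private entries = new Map<string, CacheEntry<T>>();

  // Called from the invalidation pipeline: mark the cached artifact stale.
  invalidate(key: string): void {
    const entry = this.entries.get(key);
    if (entry) entry.valid = false;
  }

  // On read, check validity and fall back to recomputation if needed.
  get(key: string, recompute: () => T): T {
    const entry = this.entries.get(key);
    if (entry && entry.valid) return entry.value;
    const value = recompute();
    this.entries.set(key, { value, valid: true });
    return value;
  }
}
```

The ordering dependency is visible here: if a subscription reload calls `get` before the pipeline has called `invalidate`, the cache happily serves the stale value. On a single process that ordering is trivial to enforce; across processes it is not.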
Race condition can cause stale data on update
In our proposed new architecture, session brokers and syncable loaders are both independent consumers of the invalidation pipeline. How then can we enforce that caches are invalidated before subscriptions are reloaded?
Request and response versioning
We could’ve solved this by making the invalidation pipeline deliver messages to both consumers in lockstep. Or we could’ve built a mechanism to guarantee that no invalidation message arrives at the session broker before it has arrived at the syncable loader. Both of these solutions had unappealing tradeoffs though; critically, they increased the coupling between session brokers and syncable loaders.
Instead, we solved this problem by augmenting our data loading protocol with request and response versions based on each process’s progress in the invalidation stream. Since the stream represents a total ordering of updates to the databases, stream progress can be used like a global version counter.
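In sketch form (names and shapes are illustrative, not our actual protocol): the session broker stamps each request with the stream offset it has consumed, and a syncable loader only serves the request once its own caches reflect at least that offset.

```typescript
// Sketch of request/response versioning, assuming invalidation-stream
// offsets act as a monotonically increasing global version counter.
// All names are illustrative, not Asana's actual protocol.

interface LoadRequest { subscriptionId: string; requestVersion: number }
interface LoadResponse { subscriptionId: string; responseVersion: number; data: unknown }

class SyncableLoader {
  // Highest stream offset this loader has consumed (and applied to its caches).
  private streamPosition = 0;

  advanceTo(offset: number): void {
    this.streamPosition = Math.max(this.streamPosition, offset);
  }

  // A request may only be served once the loader's caches reflect at least
  // the requester's version; otherwise it could compute a stale result.
  canServe(req: LoadRequest): boolean {
    return this.streamPosition >= req.requestVersion;
  }

  serve(req: LoadRequest, compute: () => unknown): LoadResponse {
    if (!this.canServe(req)) {
      throw new Error("loader is behind the request version; defer serving");
    }
    return {
      subscriptionId: req.subscriptionId,
      responseVersion: this.streamPosition, // tells the broker how fresh this is
      data: compute(),
    };
  }
}
```

In a real system the loader would wait for its stream consumer to catch up rather than throw, but the invariant is the same: never compute a result against caches older than the request version.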
request and response versions
Invalidation reloads
Sync servers load a ton of data—the majority of all reads for the site. Our new architecture requires session-brokers and syncable-loaders to exchange a lot of data over the network. For new subscriptions, this network overhead is relatively negligible. However, it is particularly inefficient for invalidation reloads, since we often don’t need to return the full response—only the updated data³. Certain cases are particularly bad here. Imagine one user who has paged 10,000 tasks down in a project while another user is constantly changing task descriptions in that project: all tasks would need to be sent over on every invalidation! Clearly, it’s ideal to only send back the updated data, but how do we efficiently implement that?
invalidation reloads
Fingerprinting
For the syncable loader to compute the delta of updated data, it has to know which data the requester already has. But passing the latest data with the request would be as expensive as returning the full result. We need to represent the data in a more space-efficient way.
Well, hashing is a great way to save space. Each bit of granular data we care about is called a syncable. We can compute a 128-bit murmur hash of each serialized syncable to use as a fingerprint⁴. Specifically, this fingerprint is an identifier for that version of the syncable.
Wherever we track syncable data, we can instead use their fingerprints. Now, when we want to track a full subscription response, we can just use a set of fingerprints without having to pass the full data around!
Sidenote: What’s a syncable and how does it relate to subscriptions?
Syncables are the contents of a subscription’s result. When we load a subscription, the results are returned as a set of syncables. More specifically, a syncable can be an object, a query, or an SCV result.
syncable to subscription mapping
Clearly, each subscription maps to multiple syncables. However, syncables can be shared by multiple subscriptions (when they load overlapping data). Therefore, there’s actually a many-to-many mapping between subscriptions and syncables.
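This many-to-many mapping can be pictured as a small two-sided index (a sketch with hypothetical names, not our actual framework code):

```typescript
// Sketch of the many-to-many mapping between subscriptions and syncables.
// All names here are illustrative, not Asana's actual framework types.

class SyncableIndex {
  private syncablesBySubscription = new Map<string, Set<string>>();
  private subscriptionsBySyncable = new Map<string, Set<string>>();

  // Record that a subscription's result contains a given syncable.
  record(subscriptionId: string, syncableId: string): void {
    this.getOrInit(this.syncablesBySubscription, subscriptionId).add(syncableId);
    this.getOrInit(this.subscriptionsBySyncable, syncableId).add(subscriptionId);
  }

  // When a syncable is invalidated, every subscription sharing it is affected.
  subscriptionsFor(syncableId: string): Set<string> {
    return this.subscriptionsBySyncable.get(syncableId) ?? new Set();
  }

  private getOrInit(map: Map<string, Set<string>>, key: string): Set<string> {
    let set = map.get(key);
    if (!set) {
      set = new Set();
      map.set(key, set);
    }
    return set;
  }
}
```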
We pass the set of these fingerprints with each request from the session broker. On the syncable loader, we compute the full response, compute its set of fingerprints, exclude any data that overlaps with the request, and return the delta.
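The delta computation itself is simple set exclusion. In sketch form, using a toy hash function in place of our 128-bit murmur fingerprints (names are illustrative):

```typescript
// Sketch of fingerprint-based delta responses. A toy FNV-1a hash stands in
// for the 128-bit murmur hash; names are illustrative, not Asana's API.

type Fingerprint = string;

function fingerprint(serializedSyncable: string): Fingerprint {
  let h = 0x811c9dc5; // FNV-1a over the serialized syncable
  for (let i = 0; i < serializedSyncable.length; i++) {
    h ^= serializedSyncable.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// On the syncable loader: compute the full response, fingerprint each
// syncable, and return only those the session broker doesn't already hold.
function deltaResponse(
  fullResponse: string[],              // serialized syncables for the subscription
  knownFingerprints: Set<Fingerprint>, // sent along with the request
): string[] {
  return fullResponse.filter((s) => !knownFingerprints.has(fingerprint(s)));
}
```

The request only carries fingerprints (16 bytes per syncable in our real system), so even a 10,000-syncable subscription describes its known state in a few hundred kilobytes rather than re-shipping the data itself.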
Given the size and criticality of the change, we split the rollout into roughly 4 stages.
Stage 1 - Refactoring the monolith
Break up our highly coupled session management and data loading code into independent components
Stage 2 - Local syncable-loader
Use the new data loading component to create a local gRPC server and migrate over data loading
Stage 3 - Remote syncable-loader
Create a new syncable-loader binary and deployment
Migrate all sync-server data loading to our new syncable-loader deployment
Stage 4 - Separate session-broker binary
Create a new session-broker binary and deployment
Migrate all traffic from sync-servers to session-brokers
So many. Where to begin?
Large responses
One problem we quickly ran into was large response sizes. Since data loading on sync servers was all within the same process, this hadn’t proved to be a huge issue up till now⁵. However, once we started loading data across a local gRPC boundary, we started to encounter a lot of issues.
We’d suspected all along that some responses might be large, but when we investigated, we found truly startling results: thousands of loads per day regularly exceeding 100 MiB! We physically could not return responses that large over a unary gRPC method (you start hitting message size limits). What to do?
We considered a few ways to fix this systematically, but ultimately concluded that we needed to address the root causes. We could have implemented gRPC server-side streaming, but the steep serialization costs and elevated socket contention incurred would be fairly regressive to latency and throughput. We could have simply rejected these responses, but the occurrence rate was too high for that to be acceptable.
We settled on a three-phase approach where we mark all large responses, analyze and eliminate each problematic case, and then enforce a strict upper bound on response size.
We marked all loads larger than 1MB and logged detailed events about source, usage, and data breakdown. There were a few different expensive use cases, but the most notorious was probably attachment thumbnail blobs. It turns out they were being encoded as base64 strings and included in the serialized responses. They were tolerable in small numbers, but quickly became huge when loaded en masse—like when loading a grid-based view that rendered attachment thumbnails for each task.
We were able to progressively patch issues like these by restricting thumbnails for huge responses, using smaller thumbnails, and eventually moving binary data out of the response. After simple mitigations like these, huge responses disappeared, and we were subsequently able to enforce framework-level response size limits⁶.
Topic collisions
Another strange problem we encountered was pubsub topic collisions. It turns out that we had noncompliant framework usages that generated the same subscription topic regardless of domain. When pubsub only happened on a single process type, the effects were relatively benign. Typically, a single pubsub topic corresponds to data from a single domain. However, with pubsub now split across session-brokers and syncable-loaders it was possible for the two process types to disagree on a particular topic’s domain. When this happened, we’d see a stable elevated rate of invalidation reloads due to this “domain mismatch”. Thankfully, the fix was fairly straightforward—but it’s interesting how this bug survived in our framework for so long without detection.
Re-tuning workloads
Session brokers and syncable loaders have workloads that are considerably different from those of the original sync servers. Session brokers are only responsible for session management and syncable loaders are only responsible for data loading.
We weren’t exactly sure how this would affect their resource requirements, so we started both out with similar resource (cpu/mem) requests and horizontal pod autoscaler (HPA) settings.
session-brokers
As we observed them in production, it became clear that session brokers were considerably more lightweight. They reliably operated at very low CPU (if anything, they were much more memory-bound⁷). A few replicas seemed sufficient to serve an entire infrastructure cell’s traffic. However, when we actually reduced the minReplicas on the HPA, we observed that data reloads and reload latency shot up. What was going on?
In short, we’d neglected to consider all of our shared settings around cache sizing and throttlers. With only a few replicas, each session broker was handling far more sessions per pod (around 3.5x) than a typical sync server. Since each was seeing much more data, they were completely filling their pubsub topic ⇔ subscription caches, and each eviction was triggering a reload (for safety). Increasing this threshold by ~6x fixed the elevated cache eviction and reload rates. Likewise, we found that our hierarchical reload throttlers were badly misconfigured for the new rates of traffic coming through. Rightsizing these throttler settings led to a dramatic reduction, around 5x - 10x, in reload latency and end-to-end reactivity lag (i.e., how long it takes for the web app to see its own write).
syncable-loaders
On the other hand, syncable loaders were considerably heavier than expected. Each server would load more subscriptions per second (around 1.5x) than an equivalently resourced sync server. Unlike session brokers, they were much more CPU-bound⁸.
Interestingly, a not-insignificant portion of CPU was attributed to an increase in Truffle deoptimizations related to our TS SCV code. Most likely, this was caused by each syncable loader accessing more of our SCV code. Regardless, it necessitated a modest increase in our code cache size⁹.
Process warming
Process warming for data loading has historically been a challenge. Thankfully, in our new architecture, it’s a bit more straightforward. Our syncable-loaders’ main interface is stateless data queries, so we can warm them just by replaying or mirroring existing traffic between session-brokers and syncable-loaders.
On the flip side, we still face many of the same challenges as with warming sync-servers. Mainly, it costs a lot of CPU to warm processes, and this causes all sorts of noisy neighbor problems (for us, the relevant metric here is partial stall since we’re not hitting k8s throttling limits) at startup. A nice improvement we made here was using in-place pod resizing to cap the resources of a syncable-loader at startup, but allow it to burst after startup.
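The in-place resize trick leans on Kubernetes support for changing container resources without a restart. A rough sketch (not our actual manifest) of what a resizable syncable-loader container might declare:

```yaml
# Illustrative sketch, not Asana's actual manifest. Requires a Kubernetes
# version with in-place pod resize support. A controller can later raise the
# CPU values once warming completes, without restarting the container.
apiVersion: v1
kind: Pod
metadata:
  name: syncable-loader
spec:
  containers:
    - name: loader
      image: example/syncable-loader:latest   # hypothetical image
      resizePolicy:
        - resourceName: cpu
          restartPolicy: NotRequired   # resize CPU without a container restart
      resources:
        requests:
          cpu: "2"    # capped while warming
        limits:
          cpu: "2"
```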
Despite this, warming each pod still takes on the order of a few minutes. From looking at JFR profiles, we believe the main bottleneck is the insufficient compilation of TS SCV code during warming, and we believe there’s more headroom there. We’re actively looking into more precisely warming the relevant paths with more targeted methods and re-shaping our TS interfaces¹⁰ for better compilation.
Our session-brokers aren’t responsible for any data loading and in practice never required any sort of warming.
Our new architecture significantly reduced our operational complexity, sped up deployments and code velocity, and unlocked future velocity and scaling opportunities.
Our new architecture simply autoscales to changes in total read traffic without requiring complex traffic management. Each kind of traffic shifting operation is neatly handled by simply restarting pods. Need to boot all sessions? Cycle all session-brokers. Need to clear our caches? Cycle all syncable-loaders. Both operations are safe in regards to uptime.
In the past, product development velocity was mainly bottlenecked by deploying sync-servers. In our new world, we only need to deploy syncable-loaders to ship product code. We’ve done this by moving syncable-loaders into their own deployable cell that we’re working to deploy more frequently (and eventually alongside product code). Deploying syncable-loaders is already ~40% faster (~20 minutes faster) than deploying sync-servers (we can safely surge much higher), and we’re targeting further gains moving forward.
Notably, performance improvement wasn’t a goal of this work, but it’s worth discussing given how much of the design and implementation touched on it. Latency of computing query results actually improved in our new system (likely due to better load balancing, tuning, and autoscaling). Correspondingly, initial subscription latency is noticeably better. On the other hand, end-to-end mutation reflection latency (i.e., how long it takes for a change to be distributed to other sessions) is about the same. Traces show this is likely due to skew between session-brokers and syncable-loaders in consuming the PubSub stream (syncable-loaders cannot serve a request until they’re up-to-date with the request version).
Our current system is much improved, but it still caps product velocity by requiring considerations around backwards/forward compatibility with every release. However, with all of our improvements in deployment speed, it’s now feasible to deploy all of our datamodel changes together, thereby removing any need to reason about backwards/forwards compatibility. We’re actively looking into building this in the near future.
One of our main motivations for this work was performance regressions that are difficult to attribute. Notably, this is one area where we haven’t made significant improvements as a result of this work. On the plus side, this is now one of the main problems we’re tackling moving forward. Unlike the bulk of the work discussed here, it’s a considerably more cross-functional problem that involves product domain, data model, framework, and infra considerations.
syncable loader worker pools by feature
We’re excited about exploring solutions across platform infrastructure (e.g., worker pools by feature), platform frameworks, and platform tooling (e.g., black box/white box testing).
Arvind Vijayakumar is an engineer on the LunaDb team, where he works on helping build and scale Asana’s core Data Loading platform—the critical infrastructure that ensures users always see accurate, reactive, and lightning-fast updates across our web apps and APIs.
This work to scale LunaDb has been a long-running team effort spanning the last few years, involving many members of LunaDb, both present and past: Brandon Zhang, Alex Matevish, Sean Wentzel, Eric Walton, Spencer Yu, Sophia Yao, George Ong, Tyler Prete, Koushik Ghosh, Natan Dubitski, Vinodh Chandra Murthy and more
Notably we built it before Facebook open sourced GraphQL
There are more details in the prior post on process warming, but tl;dr: we rely on allowing sync pods to arbitrarily burst as a way of quickly scaling to local traffic spikes.
We also do updates at per-object granularity rather than per-field (for historical reasons). This compounds the amount of data involved in invalidation reloads.
We use a 128-bit murmur hash to avoid collisions.
Notably, we’ve been using websocket compression for a very long time, which probably mitigates client-perceived effects.
Notably, we also originally had some issues with the overhead of the network hop. After some investigation, we found most of the overhead was actually due to our service mesh (Istio/Envoy) and we made some focused tweaks to improve performance here.
A significant factor in retained memory was rewriting our server-side syncable store to only retain hashes of data. It wasn’t a straightforward problem and we may release a follow-up post on this! Without this, the stored client data made session-brokers use considerably more memory.
Prior to this all sync servers ran on memory-optimized instances. Syncable loaders on the other hand benefit from more CPU-rich node types like mixed instances.
A notable attribute of running TS on GraalVM in this manner is that it uses quite a bit more code cache than is typical in a more standard Scala/Java JVM application.
As far as we can tell, high polymorphism makes TS compilation more expensive. We’re looking into changing to monomorphic interfaces and leveraging Truffle’s reporting of polymorphic specializations.