How Asana Built A Resilient ID Allocation System

Asana Engineering Team
May 12th, 2026
7 min read
Asana Engineering Spotlight

Every object in Asana—every task, project, comment, and attachment—needs a unique identifier. At Asana, these IDs are sequentially incrementing integers, allocated in blocks from a central database. Because every write operation ultimately depends on obtaining one of these IDs, the allocation system sits on the critical path for virtually all user activity. If it slows down or becomes unavailable, users can't create tasks, post comments, or upload attachments. That makes ID allocation one of the most important low-level services in our infrastructure.

For years, Asana relied on a centralized, regional ID Allocation system. While it ensured unique IDs, that architecture had become structurally incompatible with our current focus on isolated failure domains. The regional ID cache at the core of the system was particularly brittle: our applications required it to be available at all times, meaning a cache failure could negatively impact every customer in the region. This fragility made standard maintenance high-risk, often leading engineers to avoid routine updates altogether to prevent a major outage.

To improve this high-risk architecture, we chose a pragmatic goal: keep the existing sequential ID model, but redesign the allocation path for maintainability and resilience. To achieve this, we built a new multi-tiered allocation system that uses Redis caching, a dedicated GRPC fan-in service, and a direct database fallback—all deployed locally within each Kubernetes cluster. This post walks through the design goals behind that system, how each tier works, and the trade-offs we navigated along the way.

Background

The key design shortcomings we chose to address in the redesign were:

  1. Regional Cache Dependency: The cache was regional. This design choice resulted in a broad blast radius, where a failure could affect all customers in that region.

  2. Implicit Cache Requirement: The system implicitly depended on the cache always being available. This meant routine maintenance and upgrades were unnecessarily risky.

  3. Production-Only Validation: Key infrastructure components, like the cache infrastructure and the process that keeps it filled up, were only run in production. This prevented full validation of changes in Beta or Canary stages.

Together, these shortcomings made standard maintenance risky and allowed unvalidated changes to reach production. The cache infrastructure in particular became so fragile that engineers often avoided routine updates rather than risk a major outage.

At this point it might be fair to ask: why not just migrate to a decentralized ID model like Snowflake IDs? It would solve all of the problems mentioned above.

We did consider migrating to a decentralized model at one point. However, the engineering effort required to switch away from the centralized sequential ID system we had used for over a decade was more than we were willing to invest in this project. The additional work would have transformed the project from an isolated drop-in replacement into a massive undertaking: complex coordination across multiple teams, costly migrations, and new guardrails to prevent collisions with IDs already allocated by the sequential ID Allocation system. All of this would have extended the project timeline by months and increased the risk of outages during the migration away from the previous system.

Those constraints and the problems listed above pushed us toward a more pragmatic goal: keep the existing sequential ID model, but redesign the allocation path so cache failures are contained, upgrades are less risky, and all changes are tested in Beta and Canary before they hit production.

Anatomy of a Resilient ID Allocation System

Driven by these pragmatic goals, we designed a multi-layered architecture that prioritizes isolation and availability. Our approach centers on a dedicated GRPC fan-in service, a robust three-tier resolution path, and native Kubernetes stability mechanisms that safeguard the system during routine maintenance.

While the rest of this post focuses on the ID allocation critical path, it's worth briefly noting the role of the Redis Monitor. This dedicated process continuously hydrates the cache that the GRPC service reads from, maintaining a 24-hour buffer of ID blocks so that the underlying database can undergo extended maintenance without disrupting user writes.
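
To make the Redis Monitor's role concrete, here is a minimal sketch of what such a hydration loop could look like. The Redis key name, the blocks-per-day target, and fetchBlockFromDB are hypothetical placeholders rather than Asana's actual implementation; the sketch simply keeps a Redis list topped up with enough pre-allocated blocks to cover roughly a day of writes.

```go
// Hypothetical sketch of a cache-hydration loop for the Redis Monitor.
package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

const (
	idBlockKey   = "id-allocation:blocks" // Redis list of pre-allocated ID blocks (illustrative name)
	targetBlocks = 1440                   // assumed target: ~24 hours of blocks at one block per minute
)

// fetchBlockFromDB stands in for reserving the next contiguous block of IDs
// from the central database, e.g. by atomically advancing a counter row.
func fetchBlockFromDB(ctx context.Context) (string, error) {
	return "", nil // placeholder
}

// hydrate tops the buffer up to the 24-hour target, one block per iteration.
func hydrate(ctx context.Context, rdb *redis.Client) error {
	for {
		buffered, err := rdb.LLen(ctx, idBlockKey).Result()
		if err != nil || buffered >= targetBlocks {
			return err
		}
		block, err := fetchBlockFromDB(ctx)
		if err != nil {
			return err
		}
		if err := rdb.RPush(ctx, idBlockKey, block).Err(); err != nil {
			return err
		}
	}
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	for range time.Tick(time.Minute) {
		_ = hydrate(context.Background(), rdb)
	}
}
```

With a day's worth of blocks buffered ahead of demand, the database can be taken offline for extended maintenance while the GRPC service continues serving IDs from Redis.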

Fan-in through a dedicated GRPC service

At the heart of the architecture is a dedicated ID Allocation GRPC service that mediates all ID allocation requests within each cluster. Rather than allowing every backend process to reach out to Redis or the database independently, a small set of ID Allocation server processes handles all ID block requests on behalf of the much larger fleet of backend workers. 

This fan-in design is critical: because far fewer ID Allocation servers exist than backend processes, the number of connections that can reach the database at any given moment is tightly bounded. Even under a full cache failure, the database sees only a controlled trickle of requests instead of a thundering herd from thousands of backend processes.
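
As a rough illustration of the worker side of this fan-in, the sketch below shows how a backend process might consume IDs through the cluster-local service. The AllocateBlock RPC, its signature, and the block size are assumptions for illustration; the point is that workers only ever hold a locally cached block and never talk to Redis or the database themselves.

```go
// Hypothetical worker-side view of the fan-in: many backend processes consume
// IDs from locally cached blocks, and only the small fleet of ID Allocation
// servers ever touches Redis or the database.
package idalloc

import (
	"context"
	"sync"
)

// IDAllocationClient stands in for the generated GRPC client stub.
type IDAllocationClient interface {
	// AllocateBlock returns the start of a contiguous range [start, start+count).
	AllocateBlock(ctx context.Context, count uint64) (start uint64, err error)
}

// BlockAllocator hands out IDs from a cached block and only calls the
// ID Allocation service when the block is exhausted.
type BlockAllocator struct {
	mu        sync.Mutex
	client    IDAllocationClient
	next, end uint64
	blockSize uint64
}

func NewBlockAllocator(c IDAllocationClient, blockSize uint64) *BlockAllocator {
	return &BlockAllocator{client: c, blockSize: blockSize}
}

func (a *BlockAllocator) NextID(ctx context.Context) (uint64, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.next == a.end { // current block exhausted: fetch a new one over GRPC
		start, err := a.client.AllocateBlock(ctx, a.blockSize)
		if err != nil {
			return 0, err
		}
		a.next, a.end = start, start+a.blockSize
	}
	id := a.next
	a.next++
	return id, nil
}
```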

Three-tier resolution path

When a GRPC server receives a request for an ID block, it follows a three-tier resolution path (a code sketch follows the list):

  • Tier 1 — In-memory cache. The server first checks its own local store of pre-fetched IDs, populated whenever it falls back to the database. Since these ID Allocation servers are ephemeral and may be replaced at any time, draining this in-memory store takes priority to minimize wasted IDs.

  • Tier 2 — Cluster-local Redis. If the in-memory cache is empty (the normal steady-state path), the server queries the cluster's own Redis instances dedicated to ID allocation. Because every cluster runs its own Redis deployment, a cache failure in one cluster affects only a fraction of the region's traffic, and therefore only a fraction of our customers.

  • Tier 3 — Central database fallback. If every Redis pod in the cluster is unreachable, the server fetches a large block of IDs directly from the central database. This fallback path means Redis never becomes a critical point of failure—it remains a performance optimization that the system can gracefully do without.
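
Here is a minimal sketch of that resolution order, assuming hypothetical store interfaces; the real service's types and batching behavior will differ.

```go
// Illustrative three-tier block resolution on an ID Allocation server.
package idallocserver

import (
	"context"
	"errors"
)

// Block is a contiguous, pre-allocated range of IDs.
type Block struct{ Start, End uint64 }

type Server struct {
	local     []Block                                // Tier 1: in-memory blocks left over from a DB fetch
	fromRedis func(context.Context) (Block, error)   // Tier 2: cluster-local Redis pods
	fromDB    func(context.Context) ([]Block, error) // Tier 3: central database fallback (returns a large batch)
}

func (s *Server) ResolveBlock(ctx context.Context) (Block, error) {
	// Tier 1: drain the in-memory store first; these servers are ephemeral,
	// so blocks held in memory should be used before the pod is replaced.
	if len(s.local) > 0 {
		b := s.local[0]
		s.local = s.local[1:]
		return b, nil
	}
	// Tier 2: the normal steady-state path, served by cluster-local Redis.
	if b, err := s.fromRedis(ctx); err == nil {
		return b, nil
	}
	// Tier 3: fetch a large batch directly from the central database and keep
	// the surplus in memory for subsequent requests.
	blocks, err := s.fromDB(ctx)
	if err != nil {
		return Block{}, err
	}
	if len(blocks) == 0 {
		return Block{}, errors.New("id allocation: database returned no blocks")
	}
	s.local = blocks[1:]
	return blocks[0], nil
}
```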

To ensure that all three tiers work as expected in production, we follow the philosophy of “The Code Path That You Never Use Doesn’t Work”. Instead of relying on the database fallback path only when there are issues with the Redis cache, we intentionally route a tiny percentage of incoming requests to the database fallback path during normal operations, bypassing the other two layers. This keeps the fallback path warm and well-exercised at all times. If we did not exercise it regularly in production, we would risk the path failing when we finally need it during an emergency.
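
Continuing the sketch above (same package; add "math/rand" to its imports), forced fallback sampling could look roughly like this. The sampling rate shown is an assumption for illustration, not Asana's actual figure.

```go
// Route a small, fixed fraction of requests straight to the database fallback
// so that path stays warm; everything else takes the normal three-tier path.
const forcedFallbackRate = 0.001 // assumed rate for illustration

func (s *Server) ResolveBlockWithSampling(ctx context.Context) (Block, error) {
	if rand.Float64() < forcedFallbackRate {
		blocks, err := s.fromDB(ctx)
		if err == nil && len(blocks) > 0 {
			s.local = append(s.local, blocks[1:]...)
			return blocks[0], nil
		}
		// If the forced fallback fails, fall through to the normal path rather
		// than failing the user-facing request.
	}
	return s.ResolveBlock(ctx)
}
```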

Protecting the database from overload

The fan-in layer bounds how many processes can contact the database, but that alone isn't enough. If every ID Allocation server fell back to the database simultaneously, the combined load could still be significant. To prevent that, we added two additional mechanisms, sketched in code after the list:

  • Per-pod concurrency limits. Each ID Allocation server pod is limited to one concurrent database request at a time. Because each request fetches enough IDs to cover several minutes of user traffic, even sustained cache downtime produces less than one database request per pod per minute. Any request that cannot acquire the concurrency lock simply returns a failure to its caller, which handles the transient error with exponential backoff, resulting in little to no user-facing impact.

  • Exhaustive Redis routing. Rather than relying on a Kubernetes Service that randomly selects a single Redis pod, each ID Allocation server tries every Redis pod in the cluster before falling back to the database. If the first pod is down or empty, it moves to the next, cycling through all available instances. Only after exhausting every pod does it contact the database. This eliminates unnecessary fallbacks caused by one unlucky routing decision to an unhealthy Redis pod and reduces database calls from the servers by around 40%.
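
A sketch of those two mechanisms, again with hypothetical names: a one-slot semaphore gates database access, and the Redis fetch walks every pod before giving up.

```go
// Illustrative database-protection mechanisms on an ID Allocation server pod.
package dbprotection

import (
	"context"
	"errors"
)

type Block struct{ Start, End uint64 }

type Guard struct {
	dbSlot    chan struct{}                          // capacity 1: one concurrent DB request per pod
	redisPods []func(context.Context) (Block, error) // one fetch function per Redis pod in the cluster
	db        func(context.Context) (Block, error)
}

func NewGuard(pods []func(context.Context) (Block, error), db func(context.Context) (Block, error)) *Guard {
	return &Guard{dbSlot: make(chan struct{}, 1), redisPods: pods, db: db}
}

// FetchFromRedis tries every Redis pod in turn rather than trusting a single
// randomly routed connection; only if all pods fail does the caller fall back.
func (g *Guard) FetchFromRedis(ctx context.Context) (Block, error) {
	err := errors.New("no redis pods configured")
	for _, fetch := range g.redisPods {
		var b Block
		if b, err = fetch(ctx); err == nil {
			return b, nil
		}
	}
	return Block{}, err
}

// FetchFromDB enforces the per-pod concurrency limit: if another request
// already holds the slot, this one fails immediately and the caller retries
// with exponential backoff.
func (g *Guard) FetchFromDB(ctx context.Context) (Block, error) {
	select {
	case g.dbSlot <- struct{}{}:
		defer func() { <-g.dbSlot }()
		return g.db(ctx)
	default:
		return Block{}, errors.New("id allocation: database slot busy, retry later")
	}
}
```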

Together, these layered controls safeguard the central database, ensuring that the system can handle both cache failures and the resulting database fallback without causing database overload.

Stability during maintenance and upgrades

Preventing problems during normal operations is critical, but it means little if the system can't withstand routine maintenance and upgrades. To keep those operations safe, the system uses several Kubernetes-native mechanisms:

  • Pod Disruption Budgets prevent Kubernetes from evicting too many Redis or ID Allocation server pods simultaneously during node drains or cluster upgrades.

  • PreStop hooks give terminating pods time to drain connections gracefully and allow replacement pods to warm up before old ones are removed.

  • Node scheduling constraints spread Redis and ID Allocation server pods across different nodes so that a single node failure cannot take out multiple pods at once.

Collectively, these native stability patterns ensure that all Redis pods and all ID Allocation servers are never unavailable at the same time. This lets us safely perform common infrastructure operations, such as node drains or Redis upgrades, and absorb common failures, such as the loss of a node, without downtime for our users.
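
For illustration only, the settings described above might look roughly like the following, expressed here with the Kubernetes client-go API types; the names, replica counts, and sleep duration are assumptions rather than Asana's actual configuration.

```go
// Hypothetical Kubernetes stability settings for the ID allocation components.
package deploy

import (
	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// redisDisruptionBudget keeps at least two Redis pods available through node
// drains and cluster upgrades.
func redisDisruptionBudget() *policyv1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(2)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "id-allocation-redis"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "id-allocation-redis"},
			},
		},
	}
}

// allocationServerPodSpec drains connections before termination and spreads
// pods across nodes so a single node failure cannot take out several at once.
func allocationServerPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		Containers: []corev1.Container{{
			Name: "id-allocation-server",
			Lifecycle: &corev1.Lifecycle{
				PreStop: &corev1.LifecycleHandler{
					// Give in-flight GRPC requests time to finish before removal.
					Exec: &corev1.ExecAction{Command: []string{"sleep", "15"}},
				},
			},
		}},
		TopologySpreadConstraints: []corev1.TopologySpreadConstraint{{
			MaxSkew:           1,
			TopologyKey:       "kubernetes.io/hostname",
			WhenUnsatisfiable: corev1.DoNotSchedule,
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "id-allocation-server"},
			},
		}},
	}
}
```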

Incremental rollouts through CI/CD

While these Kubernetes primitives provide a robust foundation for infrastructure stability, they can’t protect the system from logic errors or regressions in the application code itself. In a service as foundational as ID allocation, operational resilience must extend beyond the infrastructure layer and into the software delivery lifecycle.

Because our ID allocation system is built as a standard service running within each cluster, it integrates directly with Asana’s global CI/CD pipeline. Every change—from GRPC service logic to Redis configurations—must be promoted through our Beta and Canary environments before reaching production. This staged rollout ensures that performance regressions or configuration mishaps are caught in low-traffic environments, long before they can impact our broader customer base. By treating infrastructure-level updates like Redis version upgrades with the same promotional rigor, we ensure that every layer of the system remains both stable and predictable.

Conclusion

With this layered architecture, we delivered on the pragmatic goals and navigated the constraints set out at the beginning, achieving operational resilience without a costly and time-consuming migration away from our existing sequential ID system.

The success of this approach is most clearly demonstrated by how the new system addresses each of the legacy problems we identified earlier:

  • Blast Radius Contained: Moving from a regional cache to cluster-local deployments solved the problem of regional outages, containing cache and GRPC failures to a single cluster.

  • Cache is an Optimization, Not a Dependency: The three-tier fallback path eliminates the old system's implicit dependency on the cache, ensuring IDs can always be allocated, which makes routine maintenance safer and eliminates customer-facing impact from cache outages.

  • Changes Roll Out Incrementally: Integrating the service into our global CI/CD pipeline ensures all logic and configuration changes are tested in Beta and Canary before reaching production, solving the problem of production-only validation.

  • Upgrades are Automated and Low-Risk: Kubernetes-native stability features (PDBs, PreStop hooks) keep the system stable through routine maintenance and upgrades, reducing operational risk.

By adopting a multi-layered approach centered on a dedicated fan-in service, tiered resolution logic, and robust database protection, we replaced a brittle system with an architecture built for operational resilience. The redesign affirms a core engineering philosophy: rather than relying on the availability of a single regional component, we prioritize graceful degradation and failure isolation at every layer, so that Asana's critical ID allocation path remains reliable even when individual components struggle. The result is a significantly more reliable core write path for every Asana user.


Author Biography

Sigurður Skúli Sigurgeirsson is an Infrastructure Engineer in the Platform group at Asana. His work focuses on designing and scaling core infrastructure for stability, performance, and cost-efficiency, alongside developing Asana’s observability platform and pipelines.

Team Shout Outs

The design and implementation of the new ID Allocation System has been a huge team effort, involving everyone on the project: Kalman Oddsson, Kriti Singh, Gabriel Mikaelsson, Olaf Magnusson, Thordur Fridriksson, Osk Olafsdottir, Stefania Stefansdottir, Walter Li, Ed Korthof, James Sigurdarson, and Vignir Hafsteinsson.
