The Accounter: Scaling Operational Throughput on Uber’s Stateful Platform
November 21, 2024 / Global
Introduction
In a previous post, we introduced Uber’s stateful platform, Odin. We discussed how the platform’s scale and the growing need for fleet-wide operations required better coordination among its many remediation loops. Without centralized coordination, multiple conflicting operations could compromise storage clusters, leading to availability or durability issues. Figure 1 shows the problems that arise when uncoordinated remediation loops operate on a quorum-based storage cluster. This post explores how we overcame this problem and scaled Odin’s throughput by introducing global coordination of all operations.
Operations on Odin are implemented using Cadence workflows. When an actor, whether human or automated, wants to operate one of the managed storage clusters, it does so through workflows. A workflow consists of actions, like changes to the system state, and waiting periods, like waiting for the system to converge, that collectively orchestrate transitioning the system from one state to another. Workflow executions can range from seconds, such as upgrading container images, to hours, like migrating workloads between hosts (Uber’s fleet uses locally attached disks). We’ll refer to these workflows as operations from this point forward.
We needed a mechanism to gate the initiation of new operations or, to put it another way, answer the question: Given the current circumstances, is it safe to proceed with this operation on this cluster?
Our design requirements were as follows:
- Independent remediation loops: These loops should remain unaware of each other. This is crucial for scaling the development of high-level functionality. In other words, remediation loops shouldn’t hard-code rules for determining when it’s safe to perform operations on clusters.
- Technology-specific policies: Odin manages all stateful technologies at Uber, and each technology has unique safety tolerances for cluster operations. So, different technologies may require different policies.
- Platform-wide limits: The system should support enforcing global limits across all technologies/operations on the platform.
The solution we chose is a global software component called The Accounter, which provides operation coordination as-a-service. Its name reflects its core purpose: to serve as a central registry that tracks all ongoing operations, to understand the relationship between operations, and to act as a gatekeeper for initiating new operations. A good mental model is to think of The Accounter as an advanced disruption budget or a fuzzy semaphore.
When an operation is initiated, permission to operate on the target storage cluster must first be requested from The Accounter, a process we call taking a claim. The claim covers the entire operation, which might involve multiple changes to the system state.
The Accounter uses a technology-specific policy to determine whether a claim can be granted. The policy takes two inputs: cluster health, a collection of the current health state stored in Grail, and the currently ongoing operations, which are tracked and stored in etcd®.
Cluster Health
A storage cluster on Odin consists of one or more workloads, each of which is a database cluster node (for example, a Raft or Apache Cassandra® node). Each workload is a collection of containers: a worker container, a primary database container, and potentially several sidecar containers. The worker is responsible for managing the host-level life cycle of the database and sidecar containers. It monitors the workload’s status continuously and communicates with the control plane. The most recent workload state is stored in Grail.
When determining cluster health, several other health signals must be considered beyond what can be observed within individual workloads. For example, does the cluster have under-replicated data? Is the cluster experiencing stress from excessive data shuffling or an increased client load? Storage teams typically manage/collect cluster-level health information like this, and Odin provides ways to ingest the cluster status into Grail.
Tolerance for unhealthy workloads varies significantly between technologies. This is captured in technology-specific health policies, which are covered later.
Ongoing Operations
The system models ongoing operations using two key concepts: operations and groups.
Each operation is represented by an operation object, which contains critical details such as the targeted storage workload, the type of operation (like drain or downtime), and its potential disruption to the storage cluster. Every operation is associated with one or more groups.
A group tracks the number of operations linked to it and stores additional metadata beyond the operations count. For example, it records the most recently started and completed operations. This data allows for enforcing time-based rate limits on the operations permitted within each group.
Although there are many groups, they can be broadly categorized into two types: platform-wide groups (for example, failure domains like regions, zones, and racks) and technology-specific groups (for example, individual storage clusters and workloads). A global group tracks all ongoing operations.
Platform-wide groups enforce global concurrency limits, helping to prevent overloading Odin and the underlying infrastructure. At the same time, technology-specific policies leverage cluster and workload groups to protect cluster availability and durability.
The number of operations permitted on a cluster varies depending on the technology. Some technologies restrict operations to a single workload at a time, while others allow a percentage of the cluster to be operated on concurrently. More specialized groups can be created dynamically to track operations on specific subsets of workloads within a cluster, such as roles. This flexibility enables more nuanced safety policies tailored to different storage technologies’ requirements.
The figure below illustrates how an operation is modeled, with the operation object linked to several group objects. These explicit associations facilitate the cleanup process when an operation completes or fails.
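As a rough illustration of this data model, the sketch below represents operations and groups in memory. All class, field, and group names (`Operation`, `Group`, `Registry`, `last_claim_at`, `zone/a`, and so on) are hypothetical; the actual schema stored in etcd is not described in this post.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Group:
    """Hypothetical group record: an operation count plus rate-limit metadata."""
    name: str
    operations: int = 0
    last_claim_at: Optional[float] = None    # when an operation last joined
    last_unclaim_at: Optional[float] = None  # when an operation last left


@dataclass
class Operation:
    """Hypothetical operation record with explicit links to its groups."""
    op_id: str
    workload: str
    kind: str  # for example "drain" or "downtime"
    groups: List[str] = field(default_factory=list)


class Registry:
    """In-memory stand-in for the persisted operation and group state."""

    def __init__(self) -> None:
        self.groups: Dict[str, Group] = {}
        self.operations: Dict[str, Operation] = {}

    def claim(self, op: Operation, now: float) -> None:
        self.operations[op.op_id] = op
        for name in op.groups:
            group = self.groups.setdefault(name, Group(name))
            group.operations += 1
            group.last_claim_at = now

    def unclaim(self, op_id: str, now: float) -> None:
        # The explicit operation-to-group links make cleanup a simple walk.
        op = self.operations.pop(op_id)
        for name in op.groups:
            group = self.groups[name]
            group.operations -= 1
            group.last_unclaim_at = now


registry = Registry()
op = Operation("op-1", "cass-7/node-3", "drain",
               ["global", "zone/a", "cluster/cass-7"])
registry.claim(op, now=100.0)
registry.unclaim("op-1", now=160.0)
```

Because each operation lists its groups explicitly, releasing it only needs the operation ID, which is the property the cleanup process relies on.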
Safety Policies
Now that we can efficiently gauge cluster health and get an overview of all ongoing operations, we can introduce safety policies. A safety policy is a codified disruption budget that allows the expression of technology-specific policies for how operations can be overlapped within the storage clusters.
Safety policies comprise two parts: the health policy and the limit policy.
The health policy uses the latest collected cluster health information to determine whether the requested operation can be performed. For example, the technology team might want to prevent operations from being started on a cluster that’s seeing an increase in client load or has unhealthy workloads.
Limit policies, on the other hand, can limit the number of concurrent operations affecting a group, implement grace periods between sequential operations, or provide operation exclusivity so that while one group is being operated on, all claims against other groups are rejected. This is particularly useful when you want to operate on a single rack at a time.
The Accounter provides a collection of functionalities for policy implementers, such as:
- Methods for gauging workload and cluster health
- CheckMaxOperations(group, max): Check that the specified group has at most max operations
- CheckElapsedFromLastClaim(group, duration): Check that the given time has elapsed since the last claim associated an operation with the specified group
- CheckElapsedFromLastUnclaim(group, duration): Check that the given time has elapsed since a claim associated with an operation was last released from the specified group
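To make the helpers above concrete, here is a minimal sketch of how a limit policy might combine them. The Python function names mirror the helpers listed above, but their signatures, the `Group` fields, the explicit `now` parameter, and the rule that the requested operation counts toward the maximum are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    operations: int = 0
    last_claim_at: Optional[float] = None
    last_unclaim_at: Optional[float] = None


def check_max_operations(group: Group, max_ops: int) -> bool:
    # Assumption: the requested operation counts toward the limit.
    return group.operations + 1 <= max_ops


def check_elapsed_from_last_claim(group: Group, duration: float, now: float) -> bool:
    return group.last_claim_at is None or now - group.last_claim_at >= duration


def check_elapsed_from_last_unclaim(group: Group, duration: float, now: float) -> bool:
    return group.last_unclaim_at is None or now - group.last_unclaim_at >= duration


def limit_policy(cluster: Group, now: float) -> bool:
    """Hypothetical policy: one operation at a time, 300 seconds of grace after each."""
    return (check_max_operations(cluster, 1)
            and check_elapsed_from_last_unclaim(cluster, 300.0, now))


idle = Group()
busy = Group(operations=1, last_claim_at=0.0)
cooling = Group(operations=0, last_claim_at=0.0, last_unclaim_at=900.0)
```

With these inputs, `limit_policy(idle, 1000.0)` passes, `limit_policy(busy, 1000.0)` fails on the concurrency check, and `limit_policy(cooling, 1000.0)` fails until the grace period has elapsed.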
Health evaluations are a point-in-time check and can’t provide hard guarantees about the safety of operations. Health data collection inherently involves latency in a distributed system, meaning two simultaneous requests for disruptive operations might be approved based on an outdated view of cluster health. The checks performed by limit policies do provide hard guarantees as they are conditionally committed through an etcd transaction. Let’s explore how that works.
Architecture
To persist operations and groups, we use etcd as a key-value store. When a workflow wants to make a change to one of the storage clusters, it goes through the following process:
- The workflow that wants to take a claim calls The Accounter, with information about the target storage workload and the purpose of the operation (1).
- The Accounter retrieves the current cluster health state from Grail (2) and the current operations state from etcd (3).
- The Accounter evaluates platform-wide concurrency and rate limits before evaluating the target workload (4).
- Next, technology-specific health and limit policies are evaluated against the state from etcd (5). If either policy fails, the claim is rejected immediately without resulting in a transaction.
- Otherwise, if all the criteria are met, a single transaction for the required changes is built and committed to etcd (6).
- The claim is accepted or rejected depending on the success of the etcd transaction (7).
- The operation can now proceed (8).
- After the operation, the workflow is responsible for releasing the claim through The Accounter.
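The ordering of the steps above can be sketched as a single function. This is a simplified illustration under stated assumptions: `take_claim`, `Decision`, the `global` group key, and the callback parameters are hypothetical names, plain dictionaries stand in for Grail and etcd, and the `commit` callback stands in for the etcd transaction.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Decision:
    granted: bool
    reason: str


def take_claim(
    health: Dict[str, bool],   # workload -> healthy? (stand-in for Grail)
    counts: Dict[str, int],    # group -> ongoing operations (stand-in for etcd)
    cluster: str,
    platform_max: int,
    health_policy: Callable[[Dict[str, bool]], bool],
    limit_policy: Callable[[int], bool],
    commit: Callable[[], bool],
) -> Decision:
    # Step 4: platform-wide limits are evaluated first.
    if counts.get("global", 0) >= platform_max:
        return Decision(False, "platform limit")
    # Step 5: technology-specific policies; a failure never reaches the store.
    if not health_policy(health):
        return Decision(False, "health policy")
    if not limit_policy(counts.get(cluster, 0)):
        return Decision(False, "limit policy")
    # Steps 6 and 7: the outcome of the transaction decides the claim.
    if not commit():
        return Decision(False, "transaction conflict")
    return Decision(True, "granted")


def all_healthy(health: Dict[str, bool]) -> bool:
    return all(health.values())


def one_at_a_time(count: int) -> bool:
    return count == 0


healthy = {"node-1": True, "node-2": True}
degraded = {"node-1": True, "node-2": False}

ok = take_claim(healthy, {"global": 3}, "cluster/a", 100,
                all_healthy, one_at_a_time, lambda: True)
sick = take_claim(degraded, {"global": 3}, "cluster/a", 100,
                  all_healthy, one_at_a_time, lambda: True)
busy = take_claim(healthy, {"global": 3, "cluster/a": 1}, "cluster/a", 100,
                  all_healthy, one_at_a_time, lambda: True)
raced = take_claim(healthy, {"global": 3}, "cluster/a", 100,
                   all_healthy, one_at_a_time, lambda: False)
```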
Changes in etcd are executed transactionally, ensuring a consistent view of ongoing operations. Specifically, we use optimistic locking to verify assumptions about the number of operations within groups before committing changes. A transaction builder library abstracts this complexity for safety policy developers, giving them the impression of working directly in memory. This approach is similar to etcd’s STM (Software Transactional Memory) library but with optimizations tailored to improve throughput.
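The optimistic locking idea can be illustrated with a toy store that keeps a revision per key and commits a set of writes only if every key read is still at the revision that was observed. `VersionedStore` and `try_claim` are hypothetical names used for this sketch; it does not use the etcd client API or Uber's transaction builder library.

```python
from typing import Dict, Tuple


class VersionedStore:
    """Toy key-value store with per-key revisions (a stand-in for etcd)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, int]] = {}  # key -> (value, revision)

    def get(self, key: str) -> Tuple[int, int]:
        return self._data.get(key, (0, 0))

    def commit(self, expected: Dict[str, int], writes: Dict[str, int]) -> bool:
        """Apply all writes only if every read key is at its expected revision."""
        for key, revision in expected.items():
            if self.get(key)[1] != revision:
                return False  # a concurrent claim changed a group we relied on
        for key, value in writes.items():
            self._data[key] = (value, self.get(key)[1] + 1)
        return True


def try_claim(store: VersionedStore, limits: Dict[str, int]) -> bool:
    """limits maps group name -> maximum concurrent operations."""
    expected: Dict[str, int] = {}
    writes: Dict[str, int] = {}
    for name, max_ops in limits.items():
        count, revision = store.get(name)
        if count + 1 > max_ops:
            return False  # the limit check fails; nothing is committed
        expected[name] = revision
        writes[name] = count + 1
    return store.commit(expected, writes)


store = VersionedStore()
count, revision = store.get("cluster/a")    # a first claimer reads the state
won = try_claim(store, {"cluster/a": 1})    # a second claimer commits first
stale = store.commit({"cluster/a": revision}, {"cluster/a": count + 1})
```

The first claimer's commit is rejected because the group's revision moved after it was read, so the limit of one operation holds even though both claimers saw an empty group.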
If the transaction is rejected due to optimistic concurrency conflicts, it’s retried internally a few times. If the claim is rejected, we rely on the operation to retry as long as the operation remains relevant. If the rejection is due to a violation of the rate limit on a group, The Accounter provides a meaningful backoff time that the operation can use to decide how to proceed.
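From the operation's side, the retry behavior might look like the sketch below. The `ClaimResult` type, its `retry_after` field, and the 30-second default are assumptions for illustration; the post only states that a backoff time is provided on rate-limit rejections.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ClaimResult:
    granted: bool
    retry_after: Optional[float] = None  # hypothetical backoff hint, in seconds


def claim_with_retry(
    attempt: Callable[[], ClaimResult],
    still_relevant: Callable[[], bool],
    sleep: Callable[[float], None],
    default_backoff: float = 30.0,
) -> bool:
    """Retry a claim for as long as the operation is still worth doing."""
    while still_relevant():
        result = attempt()
        if result.granted:
            return True
        # Prefer the hint from the rejection; otherwise use a fixed default.
        if result.retry_after is not None:
            sleep(result.retry_after)
        else:
            sleep(default_backoff)
    return False


# Simulated responses: a rate-limit rejection, a plain rejection, then a grant.
responses = iter([ClaimResult(False, 120.0), ClaimResult(False), ClaimResult(True)])
sleeps: List[float] = []
granted = claim_with_retry(lambda: next(responses), lambda: True, sleeps.append)
```

Injecting `sleep` keeps the sketch testable; a real workflow would use its own timer mechanism instead.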
To avoid all claims having to fetch state from etcd directly (3), all claims are first evaluated against a continuously updated in-memory snapshot of the data. If the claim violates either of the policies using the cached state, it’s immediately rejected without attempting to commit the transaction to etcd. This is essential as the system has to scale to 3,000-4,000 claim attempts per second. Most of this traffic comes from dry-run claim attempts, which the platform uses to audit whether workloads can be moved.
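A minimal sketch of this pre-filtering follows. The function name, the string results, and the rule that a dry run never commits are assumptions for illustration; the snapshot is modeled as a plain dictionary of group counts.

```python
from typing import Callable, Dict, List


def evaluate_claim(snapshot: Dict[str, int], group: str, max_ops: int,
                   dry_run: bool, commit: Callable[[], bool]) -> str:
    # Cheap rejection against the cached snapshot; the store is never contacted.
    if snapshot.get(group, 0) >= max_ops:
        return "rejected"
    # Assumption: a dry run only reports claimability and never commits.
    if dry_run:
        return "claimable"
    # Only real claims that pass the cached check pay for a transaction.
    return "granted" if commit() else "rejected"


commits: List[int] = []


def commit() -> bool:
    commits.append(1)  # count how often the (simulated) store is hit
    return True


snapshot = {"cluster/a": 1, "cluster/b": 0}
results = [
    evaluate_claim(snapshot, "cluster/a", 1, False, commit),
    evaluate_claim(snapshot, "cluster/b", 1, True, commit),
    evaluate_claim(snapshot, "cluster/b", 1, False, commit),
]
```

Of the three attempts, only the last one reaches the simulated store, which is the effect the cache is meant to achieve at high request rates.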
The etcd transaction finalizes taking the claim and checks for operational limits transactionally, granting the claim when the transaction is committed.
Claim Life Cycle
Our operations are often hierarchical, so we designed The Accounter to support the passing of claims from parent to child operation. These passed-down claims are reentrant, meaning that when a child operation attempts to claim, it becomes a no-op. This design allows for more complex operations while keeping the operation logic straightforward. Programmers don’t need to understand the entire operation structure to determine whether a claim has already been taken—they simply take the claim as needed, knowing the system will handle it correctly.
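Reentrancy can be sketched as follows, assuming that a child operation carries the identifier of the root operation that took the original claim. The `Claims` class and the identifiers are hypothetical.

```python
from typing import Dict


class Claims:
    """Toy claim table keyed by workload."""

    def __init__(self) -> None:
        self._held: Dict[str, str] = {}  # workload -> root operation holding it

    def take(self, workload: str, root_operation: str) -> bool:
        owner = self._held.get(workload)
        if owner == root_operation:
            return True   # reentrant: already covered by the parent, a no-op
        if owner is not None:
            return False  # held by an unrelated operation
        self._held[workload] = root_operation
        return True


claims = Claims()
parent = claims.take("cluster/a/node-1", "op-42")  # parent takes the claim
child = claims.take("cluster/a/node-1", "op-42")   # child repeats it safely
other = claims.take("cluster/a/node-1", "op-99")   # unrelated operation
```

The child does not need to know whether its parent already holds the claim; it calls `take` in either case.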
Operations are responsible for releasing claims once their operations are completed. However, there are cases where an operation may be terminated, fail unexpectedly, or contain bugs. While these instances are rare, they do occur at our scale. The system must ensure that claims are eventually released, as stale claims can block operational throughput. The Accounter can always trace back to the operation since operations are linked to their claims in the data model. This is used to identify inactive operations and safely release stale claims.
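A cleanup pass over stale claims might look like the sketch below. The `is_active` callback is a hypothetical stand-in for however the system checks whether the owning operation is still running; the post does not describe that check.

```python
from typing import Callable, Dict, List


def release_stale(claims: Dict[str, str],
                  is_active: Callable[[str], bool]) -> List[str]:
    """claims maps claim ID -> owning operation ID; returns released claim IDs."""
    stale = [claim_id for claim_id, op_id in claims.items() if not is_active(op_id)]
    for claim_id in stale:
        del claims[claim_id]
    return stale


claims = {"claim-1": "op-1", "claim-2": "op-2", "claim-3": "op-3"}
active_operations = {"op-1", "op-3"}  # op-2 was terminated without releasing
released = release_stale(claims, lambda op_id: op_id in active_operations)
```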
Auditing
With great power comes great responsibility. Delegating policy development to technology teams carries the risk that overly conservative policies could hinder the platform’s ability to perform fleet-wide operations. Uber colocates workloads of different technologies on the same hosts, so a single host can run hundreds of workloads. When the platform has to drain a host, all workloads must first be drained (that is, moved to other hosts). Restrictive safety policies increase the risk of only being able to drain the host partially.
To address this issue, we’ve implemented an extensive auditing system. This system continuously evaluates the claimability of workloads, providing an accurate snapshot of which operations are possible across the platform. This information is published to Grail and used by remediation loops as a pre-filter to identify feasible operations.
Additionally, the Odin team leverages this data to gain insights into workloads whose operations have been blocked for extended periods, allowing the Odin team to alert the team responsible for the affected storage technology.
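The auditing loop described above can be sketched in a few lines. The function names, the `blocked_since` bookkeeping, and the one-hour threshold are assumptions for illustration; `dry_run_claim` stands in for a dry-run claim attempt.

```python
from typing import Callable, Dict, List


def audit(workloads: List[str], dry_run_claim: Callable[[str], bool],
          blocked_since: Dict[str, float], now: float) -> Dict[str, bool]:
    """Record claimability per workload and track when blocking started."""
    for workload in workloads:
        if dry_run_claim(workload):
            blocked_since.pop(workload, None)
        else:
            blocked_since.setdefault(workload, now)  # keep the earliest time
    return {workload: workload not in blocked_since for workload in workloads}


def long_blocked(blocked_since: Dict[str, float], now: float,
                 threshold: float) -> List[str]:
    """Workloads blocked for at least `threshold` seconds."""
    return sorted(w for w, since in blocked_since.items() if now - since >= threshold)


claimable = {"w1", "w3"}
blocked_since: Dict[str, float] = {}
workloads = ["w1", "w2", "w3"]

report = audit(workloads, lambda w: w in claimable, blocked_since, now=0.0)
audit(workloads, lambda w: w in claimable, blocked_since, now=7200.0)
needs_attention = long_blocked(blocked_since, now=7200.0, threshold=3600.0)
```

A remediation loop could use `report` as a pre-filter, while `needs_attention` corresponds to the workloads worth raising with the owning technology team.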
Alternative Approaches
We’ve encountered at least two other common approaches for coordinating operations: distributed lock managers and Kubernetes® disruption budgets. Here, we explain how they differ from The Accounter’s approach.
Distributed lock managers typically involve acquiring a lock on a cluster, ensuring that only one operation can be performed at a time. However, given the lengthy time required to operate on a single workload in Odin—primarily due to locally attached disks—locking an entire cluster for a single workload operation would be inefficient and impractical.
A more flexible alternative to locks could be to extend the lock to a semaphore, allowing a predefined number of tokens to be granted simultaneously. This is similar to the approach taken by Kubernetes, where the disruption budget sets a fixed number of operations upfront. The Accounter, however, diverges from these approaches by focusing solely on counting operations, leaving the responsibility of enforcing limits to a separate policy. This method offers much greater flexibility in policy design. For example, a policy can specify that only a certain number of optimization moves are permitted, while a request for an emergency move after a host failure is always granted. Keeping these emergency cases represented in the model is an advantage because the policy can then state that no efficiency moves are granted until the host failure is fully remediated. This flexibility is crucial in maintaining operational efficiency while adapting to real-time conditions.
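The emergency-move example can be written as a small policy over the list of ongoing operations, which a fixed-size semaphore could not express. The function name and the `"emergency"` and `"efficiency"` labels are hypothetical.

```python
from typing import List


def move_policy(ongoing: List[str], requested: str,
                max_efficiency_moves: int) -> bool:
    """Decide on a move given the kinds of moves already in progress."""
    if requested == "emergency":
        return True   # emergency moves are always granted
    if "emergency" in ongoing:
        return False  # no efficiency moves until the failure is remediated
    return ongoing.count("efficiency") < max_efficiency_moves


under_budget = move_policy(["efficiency"], "efficiency", 2)
over_budget = move_policy(["efficiency", "efficiency"], "efficiency", 2)
emergency = move_policy(["efficiency", "efficiency"], "emergency", 2)
during_failure = move_policy(["emergency"], "efficiency", 2)
```

Because the emergency move is counted like any other operation, the policy can react to it, rather than the emergency bypassing the accounting altogether.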
Scale
The Accounter has now been fully integrated into all operations, and the number of operations done by the platform translates to a lot of traffic. Let’s take a look at the current numbers:
Traffic
- >300,000 claim evaluations per hour
- >7 million dry-run claim evaluations per hour
Active Operations and Groups
- >2,000 active operations
- >700,000 distinct groups
Results
Over the years, The Accounter has significantly improved Odin’s operational efficiency, enabling small teams to manage thousands of clusters safely. It’s facilitated centralized efficiency programs and empowered leadership to treat Uber’s physical infrastructure as flexible and impermanent. Moreover, The Accounter has preserved the independence of teams through a clear separation of concerns: remediation loop owners focus solely on determining which operations are necessary without needing to worry about safety considerations.
One noteworthy example was our effort to adopt encryption-at-rest, a process that we’ve been able to drive in a fully centralized way. All we had to do was ask our automation to move workloads to hosts with encryption-at-rest, and The Accounter ensured that it happened safely. This process involved migrating 2.1 million vCores and 1.6 EiB of locally attached disk. In the past, operations like this would have required extensive planning and execution, involving all the technology stakeholders and costing years of engineering time. Now, they are a no-op.
Future Work
The Accounter paradigm is actively evolving, and we’re working to address some of its current limitations. One significant area for improvement is support for prioritizing operations. Currently, operations rely on continuous polling to obtain a claim, which generates unnecessary traffic and doesn’t allow for the prioritization of different operation types. This becomes particularly problematic when lower-priority efficiency optimizations block high-priority, human-initiated operations. Another area of interest is the ability to define circuit breakers directly within The Accounter. Currently, each loop in Odin implements this functionality to protect against misbehavior caused by bugs. We aim to offer this as a built-in feature of The Accounter, streamlining the process and enhancing overall system resilience.
Summary
In this post, we introduced The Accounter, a global coordination system designed to improve the throughput and safety of operations on Uber’s stateful platform, Odin. Providing operation coordination as-a-service, The Accounter allows for the efficient execution of large-scale operations while maintaining cluster safety and avoiding conflicting actions. It tracks ongoing operations, enforces technology-specific policies, and ensures that new operations are only initiated when safe. The Accounter has significantly enhanced Uber’s operational efficiency, allowing small teams to manage thousands of clusters safely and drive centralized programs like encryption-at-rest migrations.
Kubernetes®, etcd®, and its logo are registered trademarks of The Linux Foundation® in the United States and other countries. No endorsement by The Linux Foundation is implied by the use of these marks.
Apache Cassandra® are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. The use of these marks does not imply endorsement by the Apache Software Foundation.
The cover photo was generated using OpenAI’s ChatGPT Enterprise and edited using Pixlr.
Jesper Borlum
Jesper Borlum, Sr. Staff Engineer at Uber, is a seasoned software engineer, architect, and team player. He leads the Stateful Platform team, responsible for building the infrastructure to manage all of Uber’s stateful systems. The team’s mission is to deliver a fully self-healing platform without compromising availability, reliability, or cost. He’s currently leading the effort to adopt Arm at Uber.
Gianluca Mezzetti
Gianluca Mezzetti, Sr. Staff Engineer at Uber, was among the pioneers of the Stateful Platform team. His extensive contributions across multiple platform domains, such as workflows, concurrency control, host remediation, goal state storage, and auditing, have been instrumental in expanding the platform’s capacity. Currently, he leads the initiative to integrate Kubernetes into Odin.
Alexander Blazhenskikh
Alexander Blazhenskikh, Sr. Software Engineer at Uber, is a member of the Stateful Platform team. He contributes to the Accounter, a critical concurrency control service, addressing safety, consistency, and scalability challenges with expertise and precision.
Posted by Jesper Borlum, Gianluca Mezzetti, Alexander Blazhenskikh