Sometime back, I wrote a post on Dynamo which is Amazon’s key-value store. This paper is not about that. This paper was presented at a recent USENIX conference and describes Amazon DynamoDB which is a cloud database service. The paper doesn’t deal with intricacies of building a key-value store but rather how the original DynamoDB was used to build a highly available NoSQL cloud database. Amazon DynamoDB is an AWS service that is used by Alexa, Amazon.com, AWS Lambda and many other high-traffic Amazon services.
As part of this post, we will discuss what led to the development of this service and what makes Amazon DynamoDB a cloud database service that provides consistent performance at any scale.
Why Amazon DynamoDB should not be confused with Dynamo?
DynamoDB was initially built to provide a storage solution which has predictable performance and is highly reliable. Developers can rely upon DynamoDB to build applications for customers with delightful experience. Although every team that used DynamoDB was responsible for managing their own DynamoDB instances. This forced teams to invest heavily in understanding various attributes of DynamoDB and operational overload associated with it. This additional workload associated with using a storage solution deterred many teams from using DynamoDB even though the reliability and consistency attributes associated with it matched exactly with their storage requirements.
By the time, Amazon launched AWS SimpleDB which was a fully managed database as a service. SimpleDB took over all the operational complexity associated with DynamoDB and soon became the go-to solution for developers who wanted to focus on their business logic instead of investing in operating a database on their own. Although SimpleDB came with its own set of limitations. Tables in SimpleDB had limited capacity in terms of storage and throughput. The latency associated with queries was also unpredictable. This introduced a new form of operational cost for developers where they have to now divide their data into multiple tables to overcome limited capacity.
Looking at the challenges faced by developers while using DynamoDB and comparing it to the limitations of SimpleDB was a lightbulb moment. What the application developers required was best of both the world. The reliability and performance associated with DynamoDB coupled with ease of usage that SimpleDB provided. This led to introduction of Amazon DynamoDB which is an AWS service, that shares the name with DynamoDB but little of its architecture.
Amazon DynamoDB divides a table into multiple partitions hosted on various nodes. Each partition maps to a set of key-value entries of the table. The partitions are replicated across various availability zones for durability purposes. In order to conduct leader election and consensus for storage operations, Amazon DynamoDB uses multi-paxos. As part of the consensus protocol, any replica can trigger leader election and only a leader node can process a write and strongly consistent read operation. This processing looks similar to any consensus protocol, where the leader first writes the operation in it’s own write-ahead log(WAL) and then sends it to peer replica nodes. Once a quorum is achieved, the leader performs the operation in its storage and sends a confirmation to replica nodes(If you are new to consensus protocols, I will advise you to read my post on Raft protocol which is a much simpler version compared to Paxos).
In order to improve availability, replication groups also contain replicas that only consist of recent write-ahead logs instead of the complete storage system. These replicas are called as log-replicas and they play a key role in availability and reliability of the service.
Amazon DynamoDB consists of multiple microservices such as metadata, routing etc with each having their own set of responsibilities that work together to build a reliable and performant cloud database service.
Journey from provisioned to on-demand
Initial DynamoDB release introduced units which customers can use to specify the throughput of a database table. The units were as follows:
- Read capacity unit (RCU): For items up to 4KB in size, 1 RCU can perform a strongly consistent read
- Write capacity unit (WCU): For items up to 1KB in size, 1 WCU can perform a write request
The specification(Number of RCU & WCU) provided by customers was called provisioned throughput. DynamoDB splits a table into multiple partitions and store these partitions on different storage nodes. With changes in load or capacity, partitions are split further. The provisioned throughput is distributed among all the partitions. But soon it was realized that applications encounter a non-uniform workload over time. So if a partition sees increase in load and is split into multiple partitions, the throughput allocated to it will also be split. If the increase in load was due to a certain number of key accesses, now the partition is under more pressure as the load remained the same but the throughput allocated to it is decreased due to the split. This led to issues such as hot partitions & throughput dilutions.
To handle such scenarios, DynamoDB made certain improvements that led them to move from a point where customers were expected to provide an estimate for their storage versus when the throughput allocation turned to an on-demand system.
Initial improvements to admission control
One solution to tackle the issue of hot partitions was to allow the heavy-load partitions use the unused throughput at partition level. The unused capacity at partition level is called burst capacity. So a read request can be processed only when the partition has remaining burst capacity at partition level. For processing write request in case of a burst, additional checks are required to see if all the replicas of this partition can also process the write capacity. This is done in order to maintain consistency and ensure that a quorum is reached.
Global admission control
Bursting helped for short-lived spikes. As soon as the partition ends up consuming the burst capacity and is still subjected to increased load, it will end up impacting the customer experience by resulting in errors. Dynamo introduced a Global admission control(GAC) to allow the partitions to burst always by allowing workload isolation. GAC was built on idea of token buckets and GAC tracked total consumption of a table in terms of tokens. A request router initially starts with a set of tokens and reaches out to GAC for demanding more tokens whenever it runs out of it. GAC does estimation on allocating number of tokens to the request router in chunks based on information provided by the client.
Splitting for consumption
As discussed, DynamoDB splits out partitions when it sees increase in load. This split is based upon the kay range that is stored in the partition. When the high traffic is subjected to a single key or a sequential range of keys, splitting the partition will not help in increasing the throughput of the partition. DynamoDB tracks this pattern and avoid splitting the partition in such cases.
Many customers who started using DynamoDB were previously working with self hosted database. Units introduced by DynamoDB was a new concept for them to understand the allocation they need for their storage. Customers ended up either over-allocating or under-allocating. DynamoDB solved this by providing on-demand tables that studies client’s read and writes and calculates the consumed capacity. Looking at the peak traffic, it allocates twice the size in terms of capacity for the table. This removes the headache for customers of estimating allocation for their storage needs and allow DynamoDB to take care about the provisioning.
Durability & correctness
Amazon DynamoDB is built to ensure that no data is lost and to prevent any corruption of data in storage. It deploys various techniques that helps in achieving this goal.
DynamoDB relies on write-ahead logs for providing durability during a crash. All the replicas of a partition store the WAL. The logs are also stored on S3 at regular intervals as additional step to improve durability. In case of a node failure, spinning up a new node from logs takes several minutes and it can impact availability if the quorum is impacted by this node failure. In order to overcome this issue, leader of the group adds a log-replica so that the quorum is met while the underlying storage is being built up using the logs.
Silent Data Errors
Hardware failures can lead to data corruption which if not detected can corrupt the complete storage. DynamoDB uses checksums to detect any form of data corruption. These checksums are stored with every log entry, message and log file which helps in validating data correctness at any point of transfer between two nodes. Even the log files stored on S3 are also put through various checks before uploading to S3. Upon uploading, these log files are stored along with their checksum and background agent downloads this file to compare with existing log in order to detect any data discrepancy.
Even for data at rest, DynamoDB does continuous verification using process called scrub. The scrub process verifies that all the replicas in the replication group have the same data and data on existing replica matches the data of replica built from log file stored on S3. Such processes for continuous verification has proven to be most reliable method for preventing any data corruption due to hardware failures.
DynamoDB uses formal methods such as TLA+ to verify their software systems. One area which undergoes this verification is the underlying replication protocol. Failure and stress testing also helps DynamoDB in ensuring the correctness and reliability of the system.
Backups & Restores
Amazon DynamoDB supports backup mechanism for its storage layer. The backup process does not impact the performance of the system as it leverages write-ahead logs stored in S3. These backups can be used to restore a new DynamoDB instance and can also be used for debugging by restoring a past storage using point-in time backups. If the point-in time backup is enabled, the write-ahead logs are persisted at regular intervals to S3 based on the threshold size of logs.
Amazon DynamoDB hosts replicas of tables in multiple availability zones for high availability. In order to test for availability, multiple techniques are used and one such technique is power-off tests. As part of this, random nodes are powered off and then data present in nodes is verified for correctness. There are some major developments that DynamoDB does in order to ensure high availability for the cloud service.
Write & consistent read availability
For consistent read & write operations, the system should reach a quorum. In case of node failures, spinning up a new node will lead to loss in availability. In case of Dynamo, leader node spins up a log replica whenever a node failure is detected. This is the quickest way to reach a quorum in case of node failure as it doesn’t comes with bottleneck of spinning a complete node involving B-Tree from a write ahead log. It is a major change from typical Paxos implementation and is verified by formal proof of existing implementation.
Gray failures aka partial failures can lead the system to complete collapse if not handled carefully. These are failures where a replica is unable to reach the leader due to certain issues such as network failure, background garbage collection process etc. This results in replica invoking a leader election and in turn disrupting an already functioning system. To overcome this, DynamoDB has added a brilliant improvement in leader election process. Now whenever a node fails to reach the leader, it sends a message to other nodes in the system for confirmation. If any of the nodes send an acknowledgement about a healthy leader, it can cancel the leader election process and retry reaching the leader node. This in turn avoids a full-fledged leader election in case of transient issues in communication.
DynamoDB measures availability at regular intervals at both service and storage levels. The data collected as part of this measurement is used to track customer facing availability trends and raise an alarm if the errors reach a certain threshold. These alarms are called as customer-facing alarms(CFA). The results are also uploaded to S3 for further analysis of availability trends.
In a distributed system when the storage software is hosted on multiple nodes, deploying an update or a new feature is not an atomic process. Any deployment will result in new code being deployed to a set of nodes along with nodes running an older version of software. Also with software deployments, things can go wrong and the undoing the deployment can result in pushing the system to an unknown state. DynamoDB solves this challenge by running a set of upgrading and downgrading tests before actually pushing the update.
New software can also bring compatibility challenges with new data types or protocol changes. To accommodate for this, DynamoDB first deploys the software to read the updated state and once the read operations are verified, it moves forward with the deployment of software that actually starts producing updated state.
Dependencies on external services
Amazon DynamoDB relies on multiple other services such as AWS IAM, AWS KMS and its availability can be impacted by the availability of dependencies. DynamoDB is designed to operate even when the underlying dependencies are unavailability. It does this by caching the results by these dependencies as part of request router. So when these services go down, DynamoDB is able to function with the cached results. To achieve this without hampering the security aspect, it regularly refreshes the cache in asynchronous manner.
One key piece of information that a request router of Dynamo needs is the location of storage node associated with a table’s primary key. To achieve this in an efficient manner, DynamoDB has built a distributed datastore known as MemDS. MemDS uses a data structure called Perkle which is a hybrid of Patricia tree & Merkle tree to store the required mapping. It is horizontally scalable and provides efficient lookup with a wide range of query patterns. MemDS plays a key role in helping DynamoDB achieve high availability while incorporating various metadata changes along with quick lookup functionality.
Amazon DynamoDB is a leader in cloud-native NoSQL solutions and powers multiple applications at humongous scale without any degradation in terms of availability and performance. This paper provides behind the scenes of how much effort and technical advancement goes into building a developer friendly cloud solution using the original Dynamo system which in itself is a brilliant piece of software.