Dynamo is one of the most influential papers in the domain of distributed storage technologies. It has inspired multiple open source databases such as Cassandra and Voldemort. Reading this paper will expose you to multiple concepts of distributed computing, and that is what makes it an absolute beast. Dynamo is not a mere optimization on top of existing database technologies but rather a ground-up rebuild to solve tough scalability issues. It brings together various ideas and merges them in order to orchestrate a robust storage system.
As part of this post we will go through the paper and dive into the various concepts it introduces. One post might not be enough to cover all the concepts, so I have added references to blog posts where I discuss them in more detail. Therefore be on the lookout for links to the reference posts, and consider this post more as an index of multiple concepts than a single source of all the information covered in the paper. Buckle up and let’s dive into this groundbreaking research paper in the field of distributed computing.
The need for Dynamo
Amazon as an e-commerce platform serves millions of customers, and to do so it runs its software across thousands of servers. At such a large scale, a server failure is not a question of if but when. With such a large scale of operations, Amazon cannot tolerate even a fraction of downtime, as it results in huge financial loss and a bad customer experience. So reliability of the e-commerce application is of the utmost essence. The backbone of this reliable system is a fault-tolerant storage layer, so that the customer-facing application can save and retrieve customer data without worrying about server failures. Dynamo provides this support by not just providing a storage layer but also acting as an abstraction that overcomes any form of server failure, either within a datacenter or across multiple datacenters. It hides all the complexities that come with operating a system across multiple nodes and provides its clients a clean primary-key based interface to store and retrieve data.
Dynamo prioritizes reliability of the system by compromising on the consistency guarantees of the storage layer, which is to say that Dynamo offers an eventual consistency guarantee. Amazon’s architecture is built in such a way that its use case can be served by an eventually consistent system at scale. A primary-key based interface also works for Amazon’s architecture, which is highly decoupled and has very clear boundaries for the data each service in the architecture accesses.
Amazon’s e-commerce system comprises many services that communicate with each other to provide a rich set of functionalities. Each service runs its own instance of Dynamo and operates under a set of assumptions which are as follows:
- Data access & updates are done via a key-value based query model. This simplifies the interface for Dynamo, as it doesn’t need to provide a complex query model that performs joins over multiple tables.
- There is a compromise on consistency of data stored on Dynamo nodes. Instead of strict consistency, applications using Dynamo as a storage layer work with an eventual consistency model.
- Writes are given a higher priority in conflict resolution. With tight latency requirements, write requests are processed without resolving the conflict, and all the conflicting states related to a key are preserved. Conflict resolution happens at read time, either based on a policy such as last-write-wins or by moving the resolution logic to the client side.
- Every Dynamo node in a system is identical to the other nodes in the same system. This means there is no master node with extra responsibilities, which makes maintenance of nodes much easier compared to a typical master-replica model.
Related work in the field of distributed storage systems
Various existing peer-to-peer systems have each solved a certain set of problems individually. Pastry is one such P2P system that ensures a query is answered by one of the nodes in the system within a certain number of network hops. OceanStore, built on top of such a P2P routing layer, is another such system; it performs conflict resolution by providing an ordering among transactions.
Distributed file systems are another way of distributing data among multiple nodes in a scalable manner. Google File System is one example of such a scalable distributed file system. Distributed storage systems such as Bigtable provide a scalable way to store structured data. In contrast, Dynamo focuses on providing a simpler key-value based data store with a strong emphasis on availability of read-write operations even in the case of network partitions. Compared to P2P systems, Dynamo doesn’t require multiple node hops to answer a query, which adds to its scalability characteristics.
A typical storage system is in itself a complex piece of software. When that system is subjected to a distributed environment across multiple nodes, we need to start tackling another set of challenges that come along with distributed systems. As part of reviewing this paper, we will dive into this set of challenges and see how Dynamo solves these problems in a creative manner to build a robust distributed storage system.
Going through each of these challenges, remember and try to appreciate the simple interface that Dynamo provides, i.e. `GET(key)` & `PUT(key, context, object)`, as its simplicity plays a key role in solving this set of challenges. The `context` includes metadata such as the version of the object being stored.
- Partitioning: In order to keep scaling the storage system, Dynamo needs to partition data over multiple nodes once a node reaches a certain capacity threshold. To perform this partitioning, Dynamo relies upon consistent hashing. A basic implementation of consistent hashing poses problems with non-uniform data distribution, so Dynamo uses virtual nodes on top of the consistent hashing ring in order to distribute data more uniformly across the ring.
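The idea above can be sketched in a few lines of Python. This is a minimal illustration, not Dynamo's implementation: the node names, the hash function, and the number of virtual nodes per physical node are all assumptions made for the example.

```python
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    # Any uniform hash works for the sketch; md5 is used here for brevity.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node owns several positions (virtual nodes) on the
        # ring, which smooths out the uneven ranges a basic ring produces.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's position to the first virtual node.
        idx = bisect_right(self.ring, (_hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("customer:42")   # deterministic owner for this key
```

The same key always hashes to the same ring position, so lookups are deterministic, and adding or removing a node only moves the keys adjacent to its virtual nodes.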
- Replication: As high availability is a key requirement for Dynamo, any key persisted in the storage layer is stored across `N` nodes. This replication is performed by a coordinator that stores the key on the node assigned to it by the consistent hashing algorithm and additionally replicates it to the next `N - 1` nodes in the clockwise direction on the ring. The list of these `N` nodes responsible for storing the key is known as the preference list.
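A preference list can be sketched by walking the ring clockwise from the key's position until `N` distinct physical nodes are collected (skipping further virtual nodes of physical nodes already picked). As before, the node names and hash choice are illustrative assumptions.

```python
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def preference_list(key, nodes, n=3, vnodes_per_node=8):
    # Build the virtual-node ring, then walk clockwise from the key.
    ring = sorted(
        (_hash(f"{node}#{i}"), node)
        for node in nodes
        for i in range(vnodes_per_node)
    )
    idx = bisect_right(ring, (_hash(key),)) % len(ring)
    prefs = []
    # Skip virtual nodes belonging to physical nodes we already hold,
    # so the list contains N distinct physical nodes.
    while len(prefs) < n:
        node = ring[idx % len(ring)][1]
        if node not in prefs:
            prefs.append(node)
        idx += 1
    return prefs

plist = preference_list("customer:42", ["node-a", "node-b", "node-c", "node-d"], n=3)
```

The first entry plays the role of the coordinator; the remaining `N - 1` entries are the clockwise successors that hold the replicas.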
- Data Versioning: Dynamo focuses on providing eventual consistency, and therefore a scenario where a `put()` operation succeeds without the update being persisted on all nodes is possible. Dynamo also treats failure as a normal event rather than an exceptional state. These failures can be node failures or even partition failures due to a network outage. Dynamo persists every update associated with a key even in a state of failure and makes use of vector clocks to find the correct ordering of updates associated with a key. The vector clock is used to find the causality among updates happening on a key across multiple nodes or partitions.
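The causality check behind vector clocks is compact enough to show directly. This is a minimal sketch, with hypothetical node names: each coordinator increments its own counter on a write, one version descends from another only if its clock dominates, and otherwise the versions are concurrent and both are kept for read-time reconciliation.

```python
def happens_before(a: dict, b: dict) -> bool:
    # a -> b iff every counter in a is <= its counter in b, and b has moved on.
    return all(a[n] <= b.get(n, 0) for n in a) and a != b

def merge(a: dict, b: dict) -> dict:
    # A reconciled clock takes the element-wise maximum of both histories.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

v1 = {"node-a": 1}                 # write coordinated by node-a
v2 = {"node-a": 1, "node-b": 1}    # later write coordinated by node-b
v3 = {"node-a": 2}                 # concurrent write back on node-a

assert happens_before(v1, v2)      # v2 descends from v1, so v1 can be discarded
assert not happens_before(v2, v3) and not happens_before(v3, v2)  # conflict
assert merge(v2, v3) == {"node-a": 2, "node-b": 1}
```

When neither clock dominates, as with `v2` and `v3`, Dynamo hands both versions back to the client on the next read so the application can reconcile them.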
- Executing Database Operations: Any read/write operation in Dynamo is handled by a node known as the coordinator node. This node is the first accessible node among the top `N` nodes in the preference list (discussed in the replication section). To perform the read/write operation, the coordinator node communicates with all `N` nodes in the preference list. Any node that is experiencing an outage or is inaccessible due to network failure is skipped. To maintain consistency, Dynamo uses a quorum-based approach with two configurable parameters, `R` & `W`. To maintain the quorum, `R` & `W` are set so that `R + W > N`.
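The quorum rule can be illustrated with a toy coordinator loop, a hypothetical stand-in for the real request path: it walks the preference list, skips unreachable nodes, and succeeds once `W` replicas have acknowledged the write.

```python
def coordinate_write(preference_list, alive, w):
    # Toy model: "alive" nodes ack instantly; dead nodes are skipped.
    acks = 0
    for node in preference_list:
        if node in alive:          # skip nodes that are down or partitioned
            acks += 1              # stand-in for a successful replica write
            if acks >= w:
                return True
    return False

prefs = ["node-a", "node-b", "node-c"]   # N = 3, so pick R = W = 2
N, R, W = 3, 2, 2
assert R + W > N                         # read and write sets must overlap

assert coordinate_write(prefs, alive={"node-a", "node-c"}, w=W)   # quorum met
assert not coordinate_write(prefs, alive={"node-b"}, w=W)         # quorum fails
```

With `R + W > N`, any read quorum intersects any write quorum in at least one replica, so a read always sees at least one copy of the latest acknowledged write.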
- Handling Temporary Failures: Using a traditional quorum-based approach, Dynamo runs the risk of sacrificing the availability of its storage system. The requirements for quorum can be broken if nodes in the preference list go down or are unreachable due to a partition failure. To overcome this, Dynamo makes use of sloppy quorum & hinted handoff. With this, all reads & writes are performed on the first `N` healthy nodes, which may not necessarily be the first `N` nodes on the hash ring.
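A small sketch of the placement decision may help; it assumes at least `N` healthy nodes and uses made-up node names. A replica that lands outside the preference list carries a hint naming the node it stands in for, to be handed back when that node recovers.

```python
def sloppy_write(ring_order, alive, n):
    # ring_order: physical nodes in clockwise order starting at the key.
    intended = ring_order[:n]                          # the real preference list
    healthy = [node for node in ring_order if node in alive][:n]
    down = [node for node in intended if node not in alive]
    placements = []
    for receiver in healthy:
        # A substitute node records a hint naming the owner it replaces.
        hint = down.pop(0) if receiver not in intended else None
        placements.append((receiver, hint))
    return placements

ring = ["node-a", "node-b", "node-c", "node-d", "node-e"]
# node-b is down: node-d stands in for it and stores a hint for node-b.
result = sloppy_write(ring, alive={"node-a", "node-c", "node-d", "node-e"}, n=3)
assert result == [("node-a", None), ("node-c", None), ("node-d", "node-b")]
```

When `node-b` comes back, `node-d` uses its hint to deliver the replica and then deletes its local copy, restoring the intended placement.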
- Handling Permanent Failures: Hinted handoff has an edge case where the node holding the hinted replica might go down permanently before it has transmitted the replica to the original node. In this case we run the risk of compromising the durability of our storage system. This can have a cascading effect where we end up with inconsistent data and might not notice the out-of-sync state before it is too late. To overcome this, Dynamo runs an anti-entropy protocol based on Merkle trees, a data structure also well known from blockchain technology. Dynamo uses Merkle trees to detect inconsistencies among replica nodes and to stop the spread of outdated data among them.
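A minimal Merkle-tree sketch shows why this is cheap: each replica hashes its key ranges into a tree, comparing root hashes detects divergence in one exchange, and descending into mismatched children narrows the repair down to the out-of-sync ranges. The leaf contents here are made-up key/value strings.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    # leaves: byte strings covering each key range (power-of-two count here).
    level = [_h(leaf) for leaf in leaves]
    tree = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of its two children.
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree  # tree[-1][0] is the root hash

replica_a = build_tree([b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"])
replica_b = build_tree([b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"])

assert replica_a[-1][0] != replica_b[-1][0]   # roots differ -> out of sync
# Only the leaf covering k2 differs, so only that range needs repair.
diverged = [i for i, (x, y) in enumerate(zip(replica_a[0], replica_b[0])) if x != y]
assert diverged == [1]
```

If the roots match, the replicas agree on the whole key range and no data needs to be transferred at all.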
- Membership & Failure Detection: Dynamo uses a gossip-based protocol to provide each node a consistent view of the other nodes in the system. So whenever a new node is added to the system or a node goes down, the other nodes communicate using this protocol to figure out the state of nodes in the storage system. The gossip-based mechanism is also useful for failure detection: information about nodes that are marked as failed is propagated to other nodes, which can use this information to avoid unnecessary communication during database operations.
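A toy gossip round can make the propagation concrete. This is a simplification with invented node names: each node shares its membership view with one random peer, and both keep the highest-versioned entry per node, a stand-in for the logical timestamps a real implementation would carry.

```python
import random

def merge_views(a, b):
    # Keep the entry with the higher version number for every known node.
    merged = dict(a)
    for node, (version, status) in b.items():
        if node not in merged or merged[node][0] < version:
            merged[node] = (version, status)
    return merged

def gossip_round(views, rng):
    # Every node picks one random peer and the pair exchanges views.
    for node in list(views):
        peer = rng.choice([p for p in views if p != node])
        merged = merge_views(views[node], views[peer])
        views[node] = dict(merged)
        views[peer] = dict(merged)

# node-a has marked node-c as failed; after a few rounds everyone agrees.
views = {
    "node-a": {"node-c": (2, "failed")},
    "node-b": {"node-c": (1, "alive")},
    "node-d": {"node-c": (1, "alive")},
}
rng = random.Random(7)
for _ in range(5):
    gossip_round(views, rng)
assert all(v["node-c"] == (2, "failed") for v in views.values())
```

Because each exchange is pairwise and idempotent, the newest state spreads to all nodes in a number of rounds that grows only logarithmically with the cluster size.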
On a high level, Dynamo comprises three main components: request coordination, membership & failure detection, and a local persistence engine. All components are implemented in Java. The local persistence engine is built in a configurable manner so that other storage engines can be plugged in to provide additional functionality based on the use case of the application.
A certain set of learnings have come from building Dynamo, and these have led to further improvements in the design of this data store. Some of these learnings are as below:
- Though the primary focus of Dynamo is on availability, performance is also of the essence at Amazon’s scale. Certain applications might have higher performance requirements when dealing with a critical user-facing flow. Dynamo caters to this by providing an option to compromise on durability of data in exchange for elevated performance. It does so with an in-memory buffer which stores the client updates and is written to the storage layer at regular intervals. This also improves read performance, as the application first reads from this buffer and routes the request to storage only if the record is not found there. Performance improves because each update no longer needs to be persisted on disk before a successful response is sent, but at the same time this introduces a new durability challenge: the node holding the buffer can go down before the changes are persisted on the storage node.
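The trade-off described above can be sketched as a tiny write-back buffer; the class and its in-memory "disk" are illustrative stand-ins, not Dynamo's actual persistence engine.

```python
class BufferedStore:
    def __init__(self):
        self.buffer = {}   # recent writes, acknowledged but not yet durable
        self.disk = {}     # stand-in for the persistent storage engine

    def put(self, key, value):
        self.buffer[key] = value   # ack immediately, before the value is durable

    def get(self, key):
        # Serve from the buffer first so readers see their recent writes.
        if key in self.buffer:
            return self.buffer[key]
        return self.disk.get(key)

    def flush(self):
        # Periodic background write of the buffer to durable storage.
        # If the node crashes before this runs, buffered writes are lost.
        self.disk.update(self.buffer)
        self.buffer.clear()

store = BufferedStore()
store.put("cart:1", "book")
assert store.get("cart:1") == "book"   # visible before it is durable
store.flush()
assert store.disk["cart:1"] == "book"  # now durable
```

The window between `put` and `flush` is exactly the durability gap the paper calls out: fast acknowledgements in exchange for the risk of losing whatever is still in the buffer when the node dies.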
- Dynamo uses consistent hashing to store data across a set of nodes. Over time these nodes can encounter an imbalance in the traffic they are processing, and this needs to be kept in check to maintain the overall health of the system. Dynamo defines a threshold percentage for load on any node, and if a node crosses this threshold in terms of traffic served, it is marked as out of balance. Dynamo’s partitioning methodology has evolved over time with a focus on distributing data across the nodes as evenly as possible.
- Some applications might require a stronger hold on consistency guarantees, and therefore Dynamo hands control of data consistency to developers by making `N`, `R` & `W` of the storage system configurable. So if your application requires strong consistency and you are OK with compromising write throughput, then you can set `W = N` & `R = 1` as an extreme measure. Now an update needs to be persisted on all nodes before a successful response can be sent to the client. At the other extreme, if you are OK with stale results but want to make your write requests as fast as possible, you can set `W = 1` so that a write succeeds once it has been persisted on even a single node. All this control is with the developers so that they can tune it on an as-needed basis.
Though Dynamo was initially developed to solve scalability challenges at Amazon, it has inspired many other open source databases such as Cassandra and Voldemort. It is one of the groundbreaking innovations that has brought together multiple distributed systems concepts to solve a problem in an extensible manner. There are various open source projects built on top of the ideas discussed in the paper, each optimizing one of the areas to solve a very specific problem, though the key ideas remain the same. This paper is also one of the prime examples where we can see various concepts such as quorum, consistent hashing, and vector clocks working hand in hand to build a highly available storage system. That is what makes this paper a must read for anyone interested in distributed computing.