Paper Notes: Zookeeper – Wait-free coordination for Internet-scale systems

In a distributed computing environment, servers often need to work with certain shared set of attributes such as locks, system configurations etc. In order to access and update these attributes server need to have a mechanism which show them a consistent view of these attributes. Something as simple as a global config whose state is changed by servers to as complicated as leader election to decide which server is going to become the next leader and who is the current leader is considered a coordination challenge. Applications can decide to implement each of these primitives on their own but it comes with its own set of challenges. For example implementing a leader election algorithm such as Raft in production takes huge amount of efforts to test for correctness. Applications will need to take on the burden of testing and maintaining such coordination services. Or they can just make use of a centralized service which assists them in performing these coordination tasks in a scalable manner.

Zookeeper is one such service that provides an easy to implement interface for servers to co-ordinate among themselves. It was originally developed by Yahoo and is currently hosted as an open source project under Apache. As part of this post we will dive into the research paper about Zookeeper and see how it can be leveraged to solve multiple problems in a distributed systems. We will also look into the core implementation of Zookeeper and how clients interact with the service.

Introduction to basic Zookeeper functionality

Zookeeper in itself is not a solution to any particular coordination problem. Rather it is a toolkit which can be used to solve various coordination problems. So it provides the robust building blocks that can be leveraged to build innovative solutions and in turn put the developer in the driving seat instead of making them interact with a blackbox system.

Zookeeper provides API through which data objects can be updated in a wait-free manner(without any blocking locks). These objects are organized in a file-system structure providing client an interface which resembles close to interacting with a file-system. Zookeeper guarantees a FIFO ordering for client operations and performs writes in a linearizable order. To maintain durability of data, Zookeeper replicates the data on multiple nodes and follow a leader-replica model for replicating data from leader node to replica nodes. The replication for write operation happens through an atomic broadcast known as Zab. All the write operations go to the leader node from which they are broadcasted to replica nodes. Read request can be served by any of the replica nodes independently.

Zookeeper Service

Zookeeper exposes a library for clients to interact with the service. As part of discussing Zookeeper service, we will see an overview of service, core API, guarantees that Zookeeper provides to clients and some common distributed computing coordination primitives that can be built using Zookeeper API.

Overview: Zookeeper provides an abstraction over data storage to clients in form of znodes. Clients access and update the resources through znodes. Clients access these znodes same way we access file paths in a UNIX system. All the znodes contain data and child znodes(except ephemeral znodes that are created for small interval). Clients access these znodes through API to read and update the state of data.

Client nodes should cache the configuration data saved on Zookeeper on their end. This will improve client performance. For example if client has stored a configuration parameter on Zookeeper and they read it for the first time, it is advisable that they cache this value on their end rather than sending a request to Zookeeper for each read. In order to ensure that clients are aware of any changes to this configuration and to avoid them reading stale data from cache, Zookeeper provides a watch mechanism on each znode. So while reading any value from Zookeeper for the first time, client can specifically ask Zookeeper to notify it whenever there is any update in the value by enabling watch flag.

While communicating with Zookeeper, clients maintain a session that is associated with a timeout. All the communication from client to service and vice-versa happens through this session.

Core API: An important set of APIs exposed by Zookeeper service to client are as below. Synchronous and asynchronous version are available for each API method.

create(path, data, flags): Creates a znode at the given path and stores data in it. Flags describe various configurable properties for a znode that are exposed to the client as part of the service.
delete(path, version): Deletes a znode at given path if it exists and is at the expected version.
exists(path, version): Returns true if a znode at given path exists else returns a false. Sets a watch at the znode if it exists.
getData(path, watch): Returns data associated with znode if it exists on the path alongside setting a watch for the znode.
setData(path, data, watch): Sets the data of znode to the given data if the current version of znode is equal to given version.
getChildren(path, watch): Returns a set of children for znode available at given path alongside enabling a watch for the znode.
sync(path): Wait for all updates pending since start of the operation to reach the zookeeper server. The API input path is ignored at the time of this paper.

Zookeeper Guarantees: Zookeeper provides two ordering guarantees for client operations. All the update operations are serializable. This means that all Zookeeper nodes have a singular view of updates and don’t diverge at any point. Zookeeper also provides FIFO order for client operations which means that operations sent by the client are executed in the order they were sent. Zookeeper also guarantees that if majority of server nodes are available then the service will also be available. In other words Zookeeper is fault tolerant in case of partial node failure.

Coordination Primitives: Below are some example of coordination primitives which are well known in distributed computing domain. Please note that these are not directly a part of Zookeeper service but are built on top of APIs that Zookeeper exposes to its clients.

Configuration Management
Distributed Locks
Distributed Read-Write Locks
Double Barriers

Implementation

In this section we will discuss the core components that make the Zookeeper service. Most of the Zookeeper service is written in Java programming language. At the time the paper was published, the service exposed its client libraries in Java & C programming language.

Request Processor: Zookeeper classifies requests into two categories. Write requests requires coordination among servers. To process a write request, it needs to be first broadcasted to other servers so that the update can be persisted on multiple servers to account for failure. This communication is done by zab protocol. As Zookeeper database is replicated on all servers, read request don’t require any such server coordination and can be served independently by a Zookeeper server. All the transactions that a server processes is idempotent in nature. This is achieved by calculating the future state of system when an update will be applied and comparing this state with the version number provided by the client. The transaction is only processed if both these version numbers match. Else an error transaction is generated.
Atomic Broadcast: As Zookeeper follows a leader-replica model, all the requests that alter the state are forwarded to leader node. Leader broadcasts this change to other nodes through zab protocol. Zab uses a majority quorum check to decide whether an operation has been processed by majority of nodes or not. This ensures that even if leader goes down after processing a request, one of the updated replica nodes can be elected as a leader without any effect on durability of client data.
Replicated Database: Every replica in Zookeeper holds a copy of Zookeeper state. Zookeeper takes snapshots periodically so that these snapshots can be used to recover state in case of a failure. Zookeeper calls these snapshots as fuzzy snapshots as normal operations are not blocked in order to take the snapshot. So it can very well be the case that the state is undergoing a change at the same moment when a snapshot is taken. This is where idempotent operations play a major role as an operation which was persisted before crash can be applied twice as part of processing the snapshot without any effect.
Client-Server Interaction: While processing a write, server also sends notification related to watch associated with the znode it is updating. Due to FIFO nature of processing, server never processes read & write for the same znode concurrently. Notifications are only sent to clients that are connected to the particular server and not to all the clients that have enabled a watch on the znode. This responsibility is shifted to the individual servers. Consider that your manager is responsible for providing any updates in the organization to you. So it can be the case that another manager may have some information but it is not their responsibility to share this update with you and you will have to wait for your manager to process the information and communicate it with you. Read requests are handled by each server locally. This contributes to improved performance of read requests as no server coordination is performed. But at the same time this can result in stale data being read if the server is lagging in replicating an update to the znode being read. If strong consistency is a requirement for a read operation then client can make use of sync API before it sends a read request.

Use Cases

Zookeeper after being developed at Yahoo was put to various use cases. We will discuss two such use cases of Zookeeper as part of this section:

Fetching Service: One of the main functionality of Yahoo! search engine is crawling various web documents. Fetching service is responsible for fulfilling this functionality. It consists of multiple leader processes that control page fetching processes. The leader node provides fetching processes with configuration of page to fetch & fetching process respond by informing their fetch status & health. Zookeeper is used here to overcome failure of the leader nodes. The status of the fetching servers are stored in Zookeeper and leader node fetches this information from Zookeeper. So instead of sending request to a fetching node that might have already failed, leader node has the information of active fetching processes. Zookeeper is also used to select a new leader node when leader goes down by performing a leader election.
Yahoo! Message Broker(YMB): YMB is a distributed pub-sub system that is used for publishing and consuming messages in a scalable manner. Clients publish and read messages from topics. These topics are distributed among multiple servers for scalability and replicated on multiple machines for durability. Zookeeper is used for configuration management in order to figure out the distribution of topics. It is also used for failure detection and for group membership to decide on which servers are alive in the pub-sub system.

Conclusion

Zookeeper provides the lego pieces to solve various coordination challenges in distributed computing. It does not aims to solve a specific set of coordination problems but rather build an ecosystem which is extensible enough to solve any coordination challenge. It is a scalable service with simple interface that can be used to build powerful abstractions in a distributed system. Once Zookeeper was open sourced, various other organizations have leveraged this project to solve various interesting problems. For example Zookeeper is widely used at Facebook and they have built lot of infrastructure solutions on top of this project.

References

One Reply to “Paper Notes: Zookeeper – Wait-free coordination for Internet-scale systems”