This is the third part of a series of posts about distributed consensus. I plan to cover distributed consensus in detail, with a deep dive into Raft (a well-known consensus algorithm). I also plan to finish the series with a code walkthrough of a Raft implementation in Java, along with the resources I referenced while studying the topic. Stay tuned.
List of articles in the series:
- Part 1: Why do we need everyone to agree?
- Part 2: How to decide what everyone agrees on?
- Part 3: How to deal with disagreement?
Up until now we have discussed the scenario where the leader is able to process a client request by replicating it to a majority of follower nodes. The process is straightforward and looks like this:
Ask for consensus -> Get consensus -> Broadcast to everyone -> Update state machine -> Repeat
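The pipeline above can be sketched as a simple loop. This is a minimal sketch assuming synchronous, in-memory replication; all class and method names here (`LeaderLoop`, `replicateToMajority`, `broadcastCommit`) are illustrative stand-ins, not from any real implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the leader's request-processing loop.
// Real implementations interleave these steps asynchronously.
class LeaderLoop {
    final Queue<String> pending = new ArrayDeque<>();    // incoming client requests
    final List<String> stateMachine = new ArrayList<>(); // applied commands

    // Stand-in for majority replication; a real leader would wait
    // for acknowledgements from a majority of followers.
    boolean replicateToMajority(String request) { return true; }

    // Stand-in for the commit broadcast to followers.
    void broadcastCommit(String request) { }

    void run() {
        while (!pending.isEmpty()) {
            String request = pending.poll();        // ask for consensus
            if (replicateToMajority(request)) {     // get consensus
                broadcastCommit(request);           // broadcast to everyone
                stateMachine.add(request);          // update state machine
            }
        }                                           // repeat
    }
}
```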
You can repeat this process to keep serving requests, as long as the cluster consists of the same leader and follower nodes. But we ended the last post by raising questions about the following scenarios:
- What happens when a new server joins as a follower? How does the leader bring it up to speed with its log?
- What happens when a follower's log becomes outdated because it missed messages during a network failure? How does the leader correct the log stored on that node?
But why do we need to do any of this? Remember that the leader is, at the end of the day, just a server node like all the followers. It can and will crash at some point, and one of the followers will then have to step in as the new leader. If the followers' logs are inconsistent, clients will see inconsistent results. Also, don't forget that new follower nodes can join our cluster and may be elected leader after a failure.
Our end goal is to build a system that abstracts away the internal workings of consensus and handles any server failure gracefully. Clients should feel they are interacting with a system that consists of a single server which never fails.
The Raft protocol handles the above scenarios using a `term_number` and a `log_index`. Each log entry stored on the leader and follower nodes is associated with a unique pair of `term_number` and `log_index`. So when the leader sends an append-entries request to a follower, it includes the `term_number` and `log_index` of the entry immediately preceding the new entries, i.e. the last `log_index` it used to store a log entry on its end.
The follower verifies that it contains a log entry with this combination; if it does, it appends the new entries on its end, and if it fails to find such an entry, it returns a failure message. The leader responds to a failure by decrementing the `log_index` by 1 and retrying the request with the list of entries from that index forward. Decrementing the `log_index` by 1 on each retry eventually reaches a point where the leader's and follower's logs are in agreement. The leader then sends all entries from this index, which the follower writes on its end. The follower also removes its own entries after the matched `log_index`, as those entries are considered inconsistent. Going through this retry process ensures that the leader's and follower's logs become consistent, so the follower can safely be elected leader if required.
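The consistency check and the decrement-and-retry walk described above can be sketched in Java. This is a simplified single-node simulation, not a full Raft implementation; class and method names (`LogEntry`, `Follower`, `Leader`, `appendEntries`, `bringUpToDate`) are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// A log entry tagged with the term_number in which it was created.
class LogEntry {
    final int term;
    final String command;
    LogEntry(int term, String command) { this.term = term; this.command = command; }
}

class Follower {
    final List<LogEntry> log = new ArrayList<>();
    Follower() { log.add(new LogEntry(0, "BASE")); } // base log entry at log_index 0

    // Accepts entries only if this follower holds an entry at prevLogIndex
    // whose term matches prevLogTerm; otherwise the leader must retry
    // with a smaller prevLogIndex.
    boolean appendEntries(int prevLogIndex, int prevLogTerm, List<LogEntry> entries) {
        if (prevLogIndex >= log.size() || log.get(prevLogIndex).term != prevLogTerm) {
            return false; // consistency check failed
        }
        // Remove conflicting entries after the match point...
        while (log.size() > prevLogIndex + 1) log.remove(log.size() - 1);
        // ...then append the leader's entries.
        log.addAll(entries);
        return true;
    }
}

class Leader {
    final List<LogEntry> log = new ArrayList<>();
    Leader() { log.add(new LogEntry(0, "BASE")); }

    // Walk backwards, one index per failed attempt, until the follower
    // accepts; at that point its log matches the leader's.
    void bringUpToDate(Follower follower) {
        int nextIndex = log.size(); // optimistic: start just past the last entry
        while (true) {
            int prevLogIndex = nextIndex - 1;
            List<LogEntry> entries = new ArrayList<>(log.subList(nextIndex, log.size()));
            if (follower.appendEntries(prevLogIndex, log.get(prevLogIndex).term, entries)) {
                break;
            }
            nextIndex--; // decrement log_index by 1 and retry with more entries
        }
    }
}
```

Note that the walk always terminates: in the worst case it reaches the base entry at index 0, which every node shares.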
Each node is also instantiated with a base log entry at a base `log_index` (say 0). This comes in handy when a new follower joins: since the new follower's log is empty except for the base entry, the retry process above walks back until leader and follower agree on the base entry. The leader then appends all of its entries to the follower, making it consistent with the leader.
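A small sketch of why the base entry matters, assuming a hypothetical `RaftLog` class that stores only the term of each entry: because every node starts with the same entry at index 0, the decrement-and-retry walk is guaranteed to find a match even against a brand-new, otherwise-empty follower.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative log representation: terms.get(i) is the term_number of
// the entry at log_index i. Index 0 is the shared base entry.
class RaftLog {
    final List<Integer> terms = new ArrayList<>();

    RaftLog() {
        terms.add(0); // base entry: term 0 at log_index 0, present on every node
    }

    int lastIndex() { return terms.size() - 1; }
}
```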
There are many optimizations that can be made to this process, and many open-source projects have produced well-optimized Raft implementations. This covers the log replication section of Raft in this series. Next we are going to see what happens when the leader goes down and how a new leader gets elected.