Failure is the norm in distributed systems; it’s a question of when, not if. These failures can happen for various reasons:
- Node failure due to server meltdown
- Unable to reach a node due to network failure
- Installing software updates/bug fixes on a node without downtime (e.g. applying a Log4j security fix without taking the application down)
So if we have an application that relies on replicating data across nodes in a leader-follower pattern, we need to start thinking about how we are going to handle node failure. The failure is handled differently depending on whether it is a follower node or the leader node.
Follower node failure
A follower node failure can be a complete failure, such as a server meltdown, in which case we replace the follower with another node through the onboarding process. Or it can be a temporary failure to reach the follower, due to network issues or a reboot for software updates.
In the temporary case, the follower node misses messages from the leader after a certain point. So when it comes back up, it has to rebuild its storage up to the point where it received the last message from the leader and then catch up with the leader on the missed messages. After successfully completing these two steps, it becomes eligible to act as a follower again. To accomplish these steps, a follower node follows the process below (a rough code sketch follows the list):
- A follower node keeps a log of all changes on disk. It consists of all the messages it has received from the leader node as part of the replication process.
- This log lets the follower node rebuild its storage once it comes back up after a failure.
- After this, the follower node reaches out to the leader node to catch up on the missed transactions. It uses the last transaction in its log as a reference for the point at which the failure occurred and it stopped receiving messages from the leader.
- Once caught up, it can act as a valid follower to the leader node for replication purposes.
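To make this flow concrete, here is a minimal sketch in Go. It assumes a hypothetical disk-backed replication log and a leader RPC that returns entries after a given index; names such as `ReplicationLog`, `LeaderClient`, and `EntriesAfter` are made up for illustration and are not tied to any particular system.

```go
package replication

// Entry is one replicated change; Index is its position in the leader's log.
type Entry struct {
	Index int64
	Key   string
	Value string
}

// ReplicationLog stands in for the follower's durable, disk-backed log of
// every message received from the leader.
type ReplicationLog struct {
	entries []Entry
}

// LastIndex returns the index of the last entry persisted before the
// failure; 0 means the log is empty.
func (l *ReplicationLog) LastIndex() int64 {
	if len(l.entries) == 0 {
		return 0
	}
	return l.entries[len(l.entries)-1].Index
}

// Append durably records entries received from the leader.
func (l *ReplicationLog) Append(es ...Entry) {
	l.entries = append(l.entries, es...)
}

// LeaderClient abstracts whatever RPC mechanism the system uses to talk
// to the leader.
type LeaderClient interface {
	// EntriesAfter returns every entry the leader has with index > after.
	EntriesAfter(after int64) ([]Entry, error)
}

// CatchUp rebuilds local state from the on-disk log, then asks the leader
// for everything missed while the follower was down. Once it returns
// successfully, the node is eligible to act as a follower again.
func CatchUp(log *ReplicationLog, store map[string]string, leader LeaderClient) error {
	// Step 1: replay the local log to rebuild storage up to the last
	// message received before the failure.
	for _, e := range log.entries {
		store[e.Key] = e.Value
	}

	// Step 2: use the last persisted index as the reference point and
	// fetch the missed entries from the leader.
	missed, err := leader.EntriesAfter(log.LastIndex())
	if err != nil {
		return err
	}
	log.Append(missed...)
	for _, e := range missed {
		store[e.Key] = e.Value
	}
	return nil
}
```

The key design point is that the last persisted index is the only reference the follower needs in order to resume replication exactly where it left off.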
Leader node failure
Things become more complex when we are facing a leader node failure. In that case, we have to take care of the following:
- Determine correctly that the leader has failed
- Figure out a way to elect a new leader
- Build a mechanism so that the client knows which node is the new leader
- Prepare for the scenario where the leader failure was a false alarm and the failed leader node comes back up after a new leader has been elected
Each of the problems above opens up an entire area of distributed computing, along with a range of possible solutions:
- A heartbeat mechanism to detect leader failure (determine correctly that the leader has failed); a rough sketch of this follows the list
- Leader election algorithms such as Paxos and Raft (figure out a way to elect a new leader)
- Request routing to send client requests to the new leader node (build a mechanism so that the client knows which node is the new leader)
- A safety mechanism to ensure that if a failed leader node comes back up, it acts as a follower and not as a leader (prepare for the scenario where the failure was a false alarm and the old leader returns after a new one has been elected)
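To make the heartbeat idea concrete, here is a rough sketch that assumes the leader is pinged over some transport (abstracted as a plain function) and is suspected failed after a configurable number of consecutive misses. The names, the ticker-based loop, and the thresholds are illustrative assumptions, not a prescribed implementation.

```go
package failuredetect

import "time"

// PingLeader abstracts the actual heartbeat call; it should return an
// error when the leader does not answer in time.
type PingLeader func() error

// MonitorLeader pings the leader every interval and signals on the
// returned channel once maxMisses consecutive heartbeats have failed.
func MonitorLeader(ping PingLeader, interval time.Duration, maxMisses int) <-chan struct{} {
	suspected := make(chan struct{}, 1)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		misses := 0
		for range ticker.C {
			if err := ping(); err != nil {
				misses++
				if misses >= maxMisses {
					// The leader is only *suspected* failed; the election
					// logic decides what happens next.
					suspected <- struct{}{}
					return
				}
				continue
			}
			misses = 0 // any successful heartbeat resets the counter
		}
	}()
	return suspected
}
```

A consumer would wait on the returned channel and trigger leader election when it fires. Choosing the interval and miss count is exactly the timeout trade-off discussed further below.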
The above approach for handling leader node failure comes with its own set of challenges.
- If we are replicating asynchronously, there is a chance that the newly elected leader (a former follower) does not have the most up-to-date data. It may not have received certain entries that the failed leader had already accepted.
- One way to handle this is to simply pick the most up-to-date follower as the new leader and treat the entries this follower never received as discarded. This can result in inconsistent results for the client.
- We can also end up with a system that has two leader nodes, either because of a faulty leader election process or because a failed leader node gets restarted. This results in a split-brain problem where the client sends a write (`set a 1`) to leader one and then reads from leader two (`get a`). In this case it will get a `404_Key_Not_Found` exception.
- One solution to this is to have a mechanism that checks for multiple leaders and shuts down all but one of them (a related epoch-based fencing sketch appears after this list).
- This can become challenging, as you have to decide the criteria for choosing which node not to shut down.
- You also have to be careful not to shut down all the nodes.
- How do we decide the timeout for declaring the leader node failed?
- A high timeout means the application stays down longer whenever the leader actually fails.
- A very small timeout means we may declare the leader failed even when it is just responding slowly because of high traffic load or packet loss in network communication, and every unnecessary leader election comes at the cost of the challenges we are discussing above.
- The decision on the timeout requires some back-of-the-envelope calculation and should be configurable based on the kind of request load the application is seeing (a small sketch of deriving it from observed latencies follows this list).
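As a starting point for that back-of-the-envelope calculation, one option is to derive the timeout from observed heartbeat round-trip times rather than hard-coding it. The multiplier and floor below are illustrative assumptions, not recommendations.

```go
package failuredetect

import "time"

// SuggestTimeout proposes a leader-failure timeout as a multiple of the
// observed p99 heartbeat round-trip time, with a floor so a quiet network
// does not drive the timeout unrealistically low. Both knobs should be
// configurable and revisited as the request load changes.
func SuggestTimeout(p99RTT time.Duration, multiplier float64, floor time.Duration) time.Duration {
	timeout := time.Duration(float64(p99RTT) * multiplier)
	if timeout < floor {
		return floor
	}
	return timeout
}
```

For example, with a p99 round trip of 50ms, a multiplier of 10, and a 2s floor, this returns the 2s floor; under heavier load, where the p99 climbs, the suggested timeout grows with it.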
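A mechanism for shutting down extra leaders is often paired with fencing: each elected leader gets a monotonically increasing epoch (a fencing token), and nodes reject writes that carry an older epoch, so a restarted stale leader cannot silently overwrite data. The sketch below shows only that epoch check; the `Store` type and the epoch numbering are assumptions for illustration, not a description of any particular system.

```go
package fencing

import (
	"errors"
	"sync"
)

// ErrStaleLeader is returned when a write carries an epoch older than one
// the store has already accepted, i.e. it comes from a deposed leader.
var ErrStaleLeader = errors.New("write rejected: stale leader epoch")

// Store accepts writes only from the leader with the highest epoch seen
// so far, which fences off a restarted old leader.
type Store struct {
	mu           sync.Mutex
	currentEpoch int64
	data         map[string]string
}

func NewStore() *Store {
	return &Store{data: make(map[string]string)}
}

// Put applies the write only if its epoch is at least as new as anything
// already accepted; otherwise the caller learns it is no longer the leader.
func (s *Store) Put(epoch int64, key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if epoch < s.currentEpoch {
		return ErrStaleLeader
	}
	s.currentEpoch = epoch
	s.data[key] = value
	return nil
}
```

The appeal of this approach is that correctness no longer depends on the old leader ever being detected and shut down; its writes simply stop being accepted.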
There are no off-the-shelf solutions that solve all of the above challenges. There are various approaches, each with its own compromises, and you as an application owner need to decide which is the best fit for your requirements.