“Lost time is never found again.” – Benjamin Franklin.
The above quote is true to its core in context of distributed systems. If you miss out on keeping track of events in a cluster of 100+ machines then good luck building a system that acts as a source of truth.
Time is an essential component when dealing with events in an application. Time is used to answer questions such as:
- When did this event occurred?
- Out of these two events, which one occurred first?
- Out of these
n
events, which event occurred last? - What events occurred simultaneously?
In case of a single node application, the above questions can easily be answered by referencing the system clock. Whenever an event occurs you can record the current time and use this recorded time to compare with events that will occur in future to build a timeline. You can easily answer the above questions by comparing the recorded time and be sure that you have the right answer as there is only one source of time i.e. your system clock.
But this guarantee is no longer there once you move to a multi-node system. Due to clock skew, we can no longer be sure that system clocks on two different nodes yield the same time. Each clock can run at different rates and we can no longer reference system clock time for comparing the ordering of events occurring on two systems. So if Server_A
writes Z=1
at time T1
and Server_B
writes Z=2
at time T2
, then we cannot ensure that Server_A
wrote before Server_B
even though T1 < T2.
This turns out to be a major problem when dealing with events occurring in distributed systems with network partitions. You need a mechanism to find the correct ordering of events in order to ensure that system eventually reconciles to an agreeable state.
Vector clock is a data structure that achieves the goal of determining the causality of events occurring on different servers. The vector clock maintains versions of events on each server in form of an array. Considering we have a system of 3 nodes, we will have an integer array of size 3 where each entry is initialized to zero i.e. {0,0,0}
. This array is tied to every event(For example set key Z=1
is an event) occurring on each node.
Let’s dive into an example to under stand how vector clock is used to determine if two events are concurrent or an event follows another event.
In the above example, we have three servers generating events. Initially all the three servers start with a vector clock initialized to {0, 0, 0}
. Let’s start looking into how each event is handled on the individual servers:
- Event
E1
is generated onServer1
.Server1
records the event by incrementing the vector clock by one the index mapping toServer1
i.e. first index. This results inServer1
vector clock to become{1, 0, 0}
- Event
E1
is generated onServer2
.Server2
increments its vector clock similar toServer1
on its index making the vector clock ofServer2 {0, 1, 0}.
- Event
E2
is generated onServer1
.Server1
till now doesn’t have any information of events occurring on other servers. It moves forward by incrementing its vector clock to{2, 0, 0}
- Event
E1
is generated onServer3
and it updates the vector clock in a similar manner to{0, 0, 1}.
- Now
Server1
sends an event toServer2
. It sends an updated vector clock i.e.{3, 0, 0}
as part of the request.Server2
acknowledges that this event is not generated on its server but rather an event coming fromServer1
. It needs to now perform a clock synchronization which is done by picking the maximum value for each servers. SoServer2
first increments its own index to record the event and then synchronizes the vector clock making it{3, 2, 0}
for eventE2
. Server3
similar toServer1
sends an event toServer2
with updated clock i.e.{0, 0, 2}
.Server2
performs another clock synchronization making clock for eventE3
onServer2
as{3, 3, 2}.
Above process is followed to synchronize clocks on servers. Once we have this synchronization in place, we can reference the vector clock of events occurring on each of these servers to find the causality of events. Rules to figure out ordering of two events based on their vector clock(V1
& V2
) is as follows:
- If majority of elements in
V1
are less than or equal toV2
and at least one element ofV1
is less than element ofV2
then we can say thatV1
is causal toV2
i.e. event associated withV1
occurs before event associated withV2
.- More formally it can be expressed as
V1 <= V2
along withV1[i] < V2[j]
fori = 1 to n.
- Finding this happens before relationship makes reconciliation of events easier in the system. If user sets
A=1
onServer1
as eventE1
and setsA=2
onServer2
as eventE2
then we can compare the vector clocks of these two events to figure out the ordering of events.
- Consider a simple case of event
E1
occurring onServer1
withV1
as{1, 0, 0}
and eventE2
occurring onServer1
withV2
as{2, 0, 0}
. Here majority(2 out of 3) of elements inV1
is less than or equal to elements ofV2
. So we can say thatV1
is causal toV2
. - For a more complex case, consider event
E2
onServer1
withV1
as{2, 0, 0}
and eventE2
onServer2
withV2
as{3, 2, 0}
. ComparingV1
&V2
we can figure out thatV1
occurs beforeV2
.
- More formally it can be expressed as
- If we cannot reach a conclusion from above comparison then we can say that two events are concurrent.
- If neither
(V1 <= V2 & NOT(V2 <= V1))
nor(V2 <= V1 & NOT(V1 <= V2))
- In this case a common approach for reconciliation is to save both the events and shift the responsibility of reconciliation to client.
- Consider event
E
1 onServer2
with vector clockV1 {1, 0, 0}
& eventE1
onServer3
with vector clockV2 {0, 0, 1}
. Here neitherV1
has all elements less than or equal toV2
norV2
has all elements less than or equal toV1
.
- If neither
One of the famous real-life use case of vector clock is how DynamoDb uses it to maintain ordering of events. Dynamo is built to tolerate network partitions and it continues accepting writes even in case the nodes are unable to communicate across the partitions. This can lead to an object ending up in different states across two nodes. So for reconciliation, it uses vector clock of events associated with these objects to reach a final state.
References:
Server 2 E1 should be {0,1,0} instead of {1,0,0}?
Correct. My bad on the incorrect clock representation in diagram. For server 2 it should be {0, 1, 0} as the increment should happen on index corresponding to Server2 i.e. index 1 in the array.