What is MultiRaft?
In computer science, there is a well-known problem called “consensus”. In a nutshell, it is the task of getting all participants in a group to agree on some specific value based on the votes of the members. Several algorithms aim to solve this problem, namely Paxos, Raft and Zab (2PC, often mentioned alongside them, solves the related atomic commitment problem). What is MultiRaft then?
Consensus problem
Let’s first start with a problem statement and a definition. As usual, Wikipedia offers an introduction and some other relevant information. In short (and in my own words), the consensus problem arises in a distributed setting: multiple processes communicate over a faulty medium (while being faulty themselves), and a single decision has to be chosen. People usually say that multiple processes have to agree on some value; in practice that value can be whether or not to commit an update to a database, which node is the leader of a group, and so on.
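To make the statement a bit more concrete, here is a minimal sketch of what a consensus module’s surface could look like in Go. The names (Consensus, Propose, Committed) are purely illustrative, not taken from any real library:

```go
// Package consensus sketches the shape of the problem; the names here
// are illustrative, not from any particular library.
package consensus

// Consensus lets each process propose a value; every correct process
// then observes the same sequence of decided values, regardless of
// which proposals won the vote.
type Consensus interface {
	// Propose submits this process's candidate value; it may lose to
	// another proposal, and the call may fail on a faulty network.
	Propose(value []byte) error
	// Committed delivers decided values in the same order on every process.
	Committed() <-chan []byte
}
```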
One obstacle to solving the problem conveniently is the (pretty high) probability of failure. The communication medium may be unreliable: messages can get lost or corrupted, and group members can crash or simply misbehave. Another obstacle is the asynchronous nature of the system: one can’t assume message delivery takes a constant amount of time, so when a message takes too long to arrive, it is very hard to tell whether something bad happened or the message is just late.
Just a side note: the formal consensus literature covers much more and deals with many combinations of constraints, limitations and assumptions. I won’t cover them in this post.
Raft
Raft is one of the algorithms for solving the consensus problem, and it explicitly aims to be easy to understand and to implement. Before going into further details, I highly recommend walking through this nice visualization.
Just to reiterate: Raft solves the problem by electing a leader and doing everything through it. If the leader fails, a new leader is elected. If a client asks a non-leader for something, it is redirected to the leader. When an update arrives, the leader disseminates the change through the group and makes sure it persists. Simple, right?
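To give the leader-election part some shape, here is a hypothetical Go sketch of the follower-to-candidate transition on an election timeout. This is my own simplification; real implementations also track votes, logs and commit indexes:

```go
package main

import (
	"fmt"
	"time"
)

// Role is one of the three Raft roles a member can be in.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

type node struct {
	role            Role
	term            uint64
	electionTimeout time.Duration
	lastHeartbeat   time.Time
}

// tick is called periodically: a follower that has not heard a heartbeat
// within its election timeout becomes a candidate for the next term.
func (n *node) tick(now time.Time) {
	if n.role == Follower && now.Sub(n.lastHeartbeat) > n.electionTimeout {
		n.role = Candidate
		n.term++ // it would now request votes from the rest of the group
	}
}

func main() {
	n := &node{
		role:            Follower,
		electionTimeout: 150 * time.Millisecond,
		lastHeartbeat:   time.Now().Add(-time.Second), // silence from leader
	}
	n.tick(time.Now())
	fmt.Println(n.role == Candidate, n.term) // true 1
}
```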
Those interested in the technical details are welcome to visit the original website. The original paper also describes how to implement the algorithm. However, the devil is in the details.
etcd, CockroachDB and TiDB
Both CockroachDB and TiDB are distributed, transactional, consistent SQL databases, while etcd is a reliable, distributed key-value store. All three use Raft as their consensus algorithm, and both databases base their Raft implementations on the etcd raft library.
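For a feel of that shared library, here is roughly what bootstrapping a single-member node looks like with etcd raft (condensed from the library’s documentation; persistence and the network transport are stubbed out here):

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/raft/v3"
)

func main() {
	storage := raft.NewMemoryStorage()
	c := &raft.Config{
		ID:              0x01,
		ElectionTick:    10, // election timeout, measured in Tick() calls
		HeartbeatTick:   1,  // heartbeat interval, measured in Tick() calls
		Storage:         storage,
		MaxSizePerMsg:   4096,
		MaxInflightMsgs: 256,
	}
	// Single-member cluster, so this node elects itself leader.
	n := raft.StartNode(c, []raft.Peer{{ID: 0x01}})

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			n.Tick() // drives elections and heartbeats
		case rd := <-n.Ready():
			// Persist entries, hand rd.Messages to the transport,
			// apply rd.CommittedEntries; all simplified away here.
			if err := storage.Append(rd.Entries); err != nil {
				log.Fatal(err)
			}
			n.Advance()
		}
	}
}
```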
CockroachDB uses it to agree on the state of its Ranges: objects that hold data (key-value pairs for some range of keys). Each node holds multiple ranges and therefore participates in multiple Raft groups. Nothing wrong with that, basically.
In the blog post Scaling Raft, Cockroach Labs describes the problem and their solution: “modify” Raft so that a node handles not a single consensus group (one per Range), but many of them at once. The result is called MultiRaft.
Here is a visualization of the differences between original Raft and MultiRaft (images taken from the original Cockroach Labs blog post); on the left is original Raft, on the right MultiRaft:
The obvious difference is the number of connections and messages needed to make everything work. MultiRaft works by effectively “multiplexing” many communication channels into a single per-node one.
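Conceptually, the receiving side of such multiplexing could look like the following sketch. This is illustrative, not CockroachDB’s actual code: each message crossing the shared connection carries a group (range) ID, and the node demultiplexes it to the right Raft instance:

```go
// Package multiraft is a hypothetical sketch of the receive path,
// not CockroachDB's actual implementation.
package multiraft

import (
	"context"

	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

// Envelope is what actually crosses the single node-to-node connection.
type Envelope struct {
	GroupID uint64         // which Raft group (range) the message belongs to
	Msg     raftpb.Message // an ordinary Raft message inside
}

// Store owns every Raft group hosted on this node.
type Store struct {
	groups map[uint64]raft.Node
}

// Receive demultiplexes an incoming envelope to the right group, as if
// the message had arrived on a dedicated per-group connection.
func (s *Store) Receive(ctx context.Context, env Envelope) {
	if g, ok := s.groups[env.GroupID]; ok {
		_ = g.Step(ctx, env.Msg)
	}
}
```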
As the more detailed description explains, problems begin when the number of ranges per node (replica) grows: the node then has to participate in very many Raft groups, with all the overhead that implies. I haven’t mentioned it before, but the Raft protocol assumes periodic heartbeats are exchanged within each group. What happens when one node is a member of many groups? Service protocol traffic multiplies. MultiRaft solves this traffic increase by using only one Raft instance per node (a Store, in CockroachDB source code terminology) rather than one per Range. Under the hood, CockroachDB coalesces and uncoalesces heartbeats into a single request/response (according to pkg/storage/store.go).
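The coalescing idea itself is simple. Here is a hypothetical sketch (not the real pkg/storage/store.go logic): instead of one network message per range, all pending heartbeats addressed to the same node are bucketed and shipped together:

```go
// A hypothetical sketch of heartbeat coalescing; names and shapes are
// illustrative, not the real pkg/storage/store.go code.
package main

import "fmt"

type rangeID uint64
type nodeID uint64

// coalescedBeat stands in for one network message carrying heartbeats
// for many ranges at once.
type coalescedBeat struct {
	Ranges []rangeID
}

// coalesce buckets pending per-range heartbeats by destination node,
// so each tick sends one message per peer instead of one per range.
func coalesce(pending map[rangeID]nodeID) map[nodeID]*coalescedBeat {
	out := make(map[nodeID]*coalescedBeat)
	for r, n := range pending {
		cb, ok := out[n]
		if !ok {
			cb = &coalescedBeat{}
			out[n] = cb
		}
		cb.Ranges = append(cb.Ranges, r)
	}
	return out
}

func main() {
	// Three ranges whose peers live on two remote nodes.
	pending := map[rangeID]nodeID{1: 7, 2: 7, 3: 9}
	for n, cb := range coalesce(pending) {
		fmt.Printf("to node %d: heartbeats for ranges %v\n", n, cb.Ranges)
	}
}
```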
On the other hand, TiKV (the underlying building block of TiDB) also uses MultiRaft, for exactly the same purpose. Its atomic unit of data is called a Region (rather than a Range). Behind the scenes, it multiplexes Region-related messages onto per-node connections but does not coalesce heartbeats. The best explanation I’ve found is here.