Scaling Raft

Posted on 2024-01-25 | Category: Databases

https://www.cockroachlabs.com/blog/scaling-raft/

In CockroachDB, we use the Raft consensus algorithm to ensure that your data remains consistent even when machines fail. In most systems that use Raft, such as etcd and Consul, the entire system is one Raft consensus group. In CockroachDB, however, the data is divided into ranges, each with its own consensus group. This means that each node may be participating in hundreds of thousands of consensus groups. This presents some unique challenges, which we have addressed by introducing a layer on top of Raft that we call MultiRaft.

With a single range, one node (out of three or five) is elected leader, and it periodically sends heartbeat messages to the followers.

As the system grows to include more ranges, so does the amount of traffic required to handle heartbeats.

The number of ranges is much larger than the number of nodes (keeping the ranges small helps improve recovery time when a node fails), so many ranges will have overlapping membership. This is where MultiRaft comes in: instead of allowing each range to run Raft independently, we manage an entire node’s worth of ranges as a group. Each pair of nodes only needs to exchange heartbeats once per tick, no matter how many ranges they have in common.
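
To make the saving concrete, here is a minimal sketch in Go that counts heartbeat messages per tick both ways. It is not the MultiRaft implementation: the types `NodeID`, `RangeID`, `Range`, and `nodePair`, and the helper functions, are all hypothetical names invented for this illustration, and the sketch counts directed leader-to-follower pairs as a simplification.

```go
package main

import "fmt"

// NodeID and RangeID are hypothetical identifiers used only in this sketch.
type NodeID int
type RangeID int

// Range describes one consensus group: its leader and all of its replicas.
type Range struct {
	ID       RangeID
	Leader   NodeID
	Replicas []NodeID // includes the leader
}

// nodePair is a directed (sender, receiver) pair of nodes.
type nodePair struct {
	From, To NodeID
}

// perRangeHeartbeats counts messages per tick when every range's leader
// sends a separate heartbeat to each of its followers.
func perRangeHeartbeats(ranges []Range) int {
	n := 0
	for _, r := range ranges {
		for _, replica := range r.Replicas {
			if replica != r.Leader {
				n++
			}
		}
	}
	return n
}

// coalescedHeartbeats groups the same heartbeats by (leader node, follower
// node), so each directed pair needs one message per tick that covers every
// range the two nodes share.
func coalescedHeartbeats(ranges []Range) map[nodePair][]RangeID {
	out := make(map[nodePair][]RangeID)
	for _, r := range ranges {
		for _, replica := range r.Replicas {
			if replica != r.Leader {
				p := nodePair{From: r.Leader, To: replica}
				out[p] = append(out[p], r.ID)
			}
		}
	}
	return out
}

// buildRanges creates numRanges three-way replicated ranges spread
// round-robin over numNodes nodes (an assumed, simplified placement).
func buildRanges(numNodes, numRanges int) []Range {
	ranges := make([]Range, 0, numRanges)
	for i := 0; i < numRanges; i++ {
		a := NodeID(i % numNodes)
		b := NodeID((i + 1) % numNodes)
		c := NodeID((i + 2) % numNodes)
		ranges = append(ranges, Range{ID: RangeID(i), Leader: a, Replicas: []NodeID{a, b, c}})
	}
	return ranges
}

func main() {
	ranges := buildRanges(5, 1000)
	fmt.Println("per-range messages per tick:", perRangeHeartbeats(ranges))
	fmt.Println("coalesced messages per tick:", len(coalescedHeartbeats(ranges)))
}
```

With 1,000 ranges on 5 nodes, the per-range count is 2,000 messages per tick (two followers per range), while the coalesced count is 10, one per directed pair of nodes that share at least one range. The coalesced count depends only on the number of nodes, not on the number of ranges.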

In addition to reducing heartbeat network traffic, MultiRaft can improve efficiency in other areas. For example, MultiRaft only needs a small, constant number of goroutines (currently 3) instead of one goroutine per range.
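
The general pattern of driving many ranges from a fixed number of goroutines can be sketched as follows. This is only an illustration of the idea, not how MultiRaft divides its work: the source does not describe what its three goroutines do, and `rangeState` and `tickAll` are hypothetical names. Here each worker owns a disjoint shard of ranges.

```go
package main

import (
	"fmt"
	"sync"
)

// rangeState is a hypothetical stand-in for the per-range Raft state.
type rangeState struct {
	ticks int
}

// tickAll advances every range by one tick using a fixed number of worker
// goroutines instead of one goroutine per range. Each worker touches a
// disjoint set of slice elements, so no lock on range state is needed.
func tickAll(ranges []rangeState, workers int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			// Worker w handles ranges w, w+workers, w+2*workers, ...
			for i := w; i < len(ranges); i += workers {
				ranges[i].ticks++
			}
		}(w)
	}
	wg.Wait()
}

func main() {
	ranges := make([]rangeState, 100000)
	tickAll(ranges, 3)
	fmt.Println("ranges ticked:", len(ranges), "using 3 goroutines")
}
```

The number of goroutines stays at `workers` no matter how large `ranges` grows, which keeps scheduler and stack overhead constant as ranges are added.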

Implementing and testing a consensus algorithm is a daunting task, so we are pleased to be working closely with the etcd team from CoreOS instead of starting from scratch. The raft implementation in etcd is built around clean abstractions that we found easy to adapt to our rather unusual requirements, and we have been able to contribute improvements back to etcd and the community.
