https://www.cockroachlabs.com/blog/trust-but-verify-cockroachdb-checks-replication/
Trust, but verify: How CockroachDB checks replication
We built survivability into the DNA of CockroachDB. And while we had a lot of fun doing so, and are confident that we have built a solution on a firm foundation, we felt a nagging concern: Does CockroachDB really survive? When data is written to the database, will a failure really not result in data loss? So to assuage those concerns, we adopted a Russian maxim: “Doveryai, no proveryai – Trust, but verify.”
To understand CockroachDB’s survivability promise, you must first understand our key-value store and replication model, as together they form its foundation.
CockroachDB is a SQL database built on top of a distributed, consistent key-value store. The entire key-value space is split into multiple contiguous key-value ranges spread across many nodes. CockroachDB uses Raft for consensus-based replicated writes, guaranteeing that a write to any key is durable, or in other words, that it survives. On each write to a key-value range, the write is synchronously replicated from the Raft leader to multiple replicas residing on different nodes. Database writes are replayed on all replicas in the same order as on the leader, guaranteeing consistent replication. This model forms the basis of CockroachDB survivability: if a node dies, several other nodes hold an exact copy of its data.
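To make the model concrete, here is a minimal sketch of ordered replay in Go. The types and names below are illustrative inventions, not CockroachDB's actual code: each replica is reduced to a map plus an applied index, and "replication" is simulated by feeding every replica the same log.

```go
package main

import "fmt"

// LogEntry is a hypothetical committed Raft log entry carrying a write.
type LogEntry struct {
	Index int
	Key   string
	Value string
}

// Replica is a drastically simplified replica: a key-value state
// machine that applies committed entries strictly in log order.
type Replica struct {
	data    map[string]string
	applied int // index of the last applied log entry
}

// Apply replays one committed entry. Because every replica applies the
// same entries in the same order, all replicas converge on the same state.
func (r *Replica) Apply(e LogEntry) {
	if e.Index != r.applied+1 {
		panic("entries must be applied in log order")
	}
	r.data[e.Key] = e.Value
	r.applied = e.Index
}

func main() {
	log := []LogEntry{{1, "a", "v1"}, {2, "b", "v2"}, {3, "a", "v3"}}
	replicas := []*Replica{
		{data: map[string]string{}},
		{data: map[string]string{}},
		{data: map[string]string{}},
	}
	for _, e := range log {
		for _, r := range replicas {
			r.Apply(e) // ordered replay on every replica
		}
	}
	// All three replicas print the same state; losing one loses nothing.
	fmt.Println(replicas[0].data, replicas[1].data, replicas[2].data)
}
```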
While we’d like to trust our survivability model, we went one step further: we added verification. We built a subsystem that periodically verifies that all replicas of a range have identical data, so that when one of them has to step up to replace a lost node, the data served is what is expected.
How CockroachDB Checks Replication
Every node in the cluster runs an independent consistency checker. The checker scans through all the local replicas in a continuous loop over a 24-hour cycle and runs a consistency check on each range for which it is the Raft leader (a sketch of the exchange follows the steps below):
1. The leader and replicas agree on a snapshot of the data to verify and assign a unique ID to it.
2. A SHA-512-based checksum of the selected snapshot is computed in parallel on the leader and all replicas.
3. The leader shares its checksum and the unique snapshot ID with all replicas. Each replica compares the supplied checksum with the one it has computed on its own. On seeing a different checksum, it declares a failure in replication.
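Here is a minimal sketch of steps 2 and 3 in Go, under the simplifying assumption that a snapshot is just an in-memory map; the checksum helper and types are invented for illustration, not CockroachDB's implementation.

```go
package main

import (
	"crypto/sha512"
	"encoding/binary"
	"fmt"
	"sort"
)

// checksum computes a SHA-512 digest over a snapshot's key-value pairs
// in sorted key order. Length prefixes keep pair boundaries unambiguous
// so that distinct snapshots cannot produce the same byte stream.
func checksum(snapshot map[string]string) [sha512.Size]byte {
	keys := make([]string, 0, len(snapshot))
	for k := range snapshot {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha512.New()
	writeField := func(s string) {
		var lenBuf [8]byte
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(s)))
		h.Write(lenBuf[:])
		h.Write([]byte(s))
	}
	for _, k := range keys {
		writeField(k)
		writeField(snapshot[k])
	}
	var sum [sha512.Size]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

func main() {
	// Step 2: leader and replica hash the agreed-upon snapshot independently.
	leaderSnap := map[string]string{"a": "v1", "b": "v2"}
	replicaSnap := map[string]string{"a": "v1", "b": "v2"}

	// Step 3: the replica compares the leader's checksum with its own.
	if checksum(leaderSnap) != checksum(replicaSnap) {
		fmt.Println("replication failure detected")
	} else {
		fmt.Println("checksums match")
	}
}
```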
In the rare circumstance that it finds a replication problem, the checker logs an error or panics. Unfortunately, the system cannot tell whether the Raft leader or one of its replicas is responsible for the faulty replication, so a replication problem cannot be repaired.
Logging or panicking on detecting an inconsistency is great, but how does one debug such problems? To aid the user in debugging, on seeing a checksum mismatch, a second consistency check with an advanced diff setting is triggered on the range. This second check publishes the entire range snapshot from the leader to all replicas, so that each replica can diff the snapshot against its own version and log the keys that are inconsistent. We were able to use this second consistency check to debug and fix many problems in replication.
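Assuming again that a snapshot is a plain key-value map, such a diff could look like the following sketch; the diffSnapshots helper is invented for illustration.

```go
package main

import "fmt"

// diffSnapshots logs every key whose value differs between the leader's
// published snapshot and the replica's own version, including keys that
// exist on only one side.
func diffSnapshots(leader, replica map[string]string) {
	for k, lv := range leader {
		rv, ok := replica[k]
		switch {
		case !ok:
			fmt.Printf("key %q: on leader (%q), missing on replica\n", k, lv)
		case rv != lv:
			fmt.Printf("key %q: leader has %q, replica has %q\n", k, lv, rv)
		}
	}
	for k, rv := range replica {
		if _, ok := leader[k]; !ok {
			fmt.Printf("key %q: missing on leader, replica has %q\n", k, rv)
		}
	}
}

func main() {
	leader := map[string]string{"a": "v1", "b": "v2"}
	replica := map[string]string{"a": "v1", "b": "stale", "c": "extra"}
	diffSnapshots(leader, replica)
}
```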
While we had hoped we had built the perfect system, the verifier uncovered a few bugs!
Bugs and Fixes
We fixed a number of hairy bugs:
The same data can look different: The order of entries in protocol buffer maps is undefined by the standard. For some cockroach-internal data, we were using protocol buffer maps and replicating them via Raft. Each replica would separately convert the protocol buffers into wire format while writing them to disk. Since the map wire encoding is non-deterministic in both the standard and the implementation, we saw the same data with different wire encodings on different replicas (fixed via gogo/protobuf #156).
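This failure mode is easy to reproduce in Go, where map iteration order is deliberately randomized: any encoder that walks a map without sorting emits an unstable byte stream. The sketch below uses plain Go maps and an invented textual encoding as a stand-in for protocol buffers; sorting the keys first, as encodeStable does, is one way to make the encoding deterministic.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// encodeUnstable serializes a map by iterating it directly. Go
// randomizes map iteration order, so repeated calls can yield
// different bytes for identical data.
func encodeUnstable(m map[string]string) []byte {
	var buf bytes.Buffer
	for k, v := range m {
		fmt.Fprintf(&buf, "%s=%s;", k, v)
	}
	return buf.Bytes()
}

// encodeStable sorts the keys before encoding, so identical maps
// always produce identical bytes on every replica.
func encodeStable(m map[string]string) []byte {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var buf bytes.Buffer
	for _, k := range keys {
		fmt.Fprintf(&buf, "%s=%s;", k, m[k])
	}
	return buf.Bytes()
}

func main() {
	m := map[string]string{"a": "1", "b": "2", "c": "3", "d": "4"}
	for i := 0; i < 5; i++ {
		// Frequently false: the same data, two different encodings.
		fmt.Println("unstable equal:", bytes.Equal(encodeUnstable(m), encodeUnstable(m)))
	}
	// Always true: deterministic encoding.
	fmt.Println("stable equal:", bytes.Equal(encodeStable(m), encodeStable(m)))
}
```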
Backdoor writes on the replicas (#5090): We collect statistics on every range in the system and replicate them. Writes to the statistics are allowed only on the leader, with replicas simply replaying the operations. We were occasionally updating the statistics on the replicas outside of Raft consensus (fixed via #5133).
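The invariant that this bug violated can be sketched in a few lines (hypothetical types, not our actual statistics code): replicated state may change only through commands that every replica replays, and any direct write on a single replica silently diverges the copies.

```go
package main

import "fmt"

// Stats is a hypothetical piece of replicated per-range state.
type Stats struct{ KeyCount int }

// Replica holds the replicated statistics.
type Replica struct{ stats Stats }

// ApplyStatsDelta is the only legal way to change replicated state:
// the same command is replayed on every replica in the same order.
func (r *Replica) ApplyStatsDelta(deltaKeys int) {
	r.stats.KeyCount += deltaKeys
}

func main() {
	a, b := &Replica{}, &Replica{}

	// Correct: the change flows through the (simulated) Raft log and
	// is applied identically on every replica.
	for _, r := range []*Replica{a, b} {
		r.ApplyStatsDelta(1)
	}

	// The bug pattern: a "backdoor" write on one replica outside of
	// Raft consensus. The replicas now disagree, and only a
	// consistency check will notice.
	b.stats.KeyCount++

	fmt.Println(a.stats, b.stats) // {1} {2}: inconsistent replicas
}
```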
Internal time series data was being merged incorrectly: The CockroachDB UI keeps track of monitoring time series data, which is replicated. The replicas merge the time series data when read, but an occasional legitimate read would creep in on the leader, triggering a bug in the merge functionality (fixed via #5515).
Floating point addition might be non-deterministic: While fixing the above time series issue, we got super paranoid and, as a defensive measure, decided not to depend on replica replay of floating point additions. We were aware that floating point addition is non-associative, and although we knew our floating point additions were replayed in a definite order and did not depend on associativity, we adhered to the mantra, “only the paranoid survive,” and got rid of them (fixed via #5905).
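The non-associativity is easy to demonstrate: summing the same three float64 values under different groupings rounds differently and yields different bits.

```go
package main

import "fmt"

func main() {
	// Floating point addition is not associative: both expressions
	// sum 0.1, 0.2, and 0.3, but the intermediate roundings differ.
	left := (0.1 + 0.2) + 0.3
	right := 0.1 + (0.2 + 0.3)
	fmt.Println(left == right)                // false
	fmt.Printf("%.17g\n%.17g\n", left, right) // 0.60000000000000009 vs 0.59999999999999998
}
```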
Conclusion
We’ve built a new database that you can trust to survive. With trust comes verification, and we built that into CockroachDB. CockroachDB offers other features, like indexes, unique columns, and foreign keys, that you can trust to work properly. We plan on building automatic online verification mechanisms for them, too, and we look forward to discussing them in the future.