Isolation in CockroachDB

发表于 2023-10-27 分类于数据库阅读次数：

原文：https://www.cockroachlabs.com/blog/serializable-lockless-distributed-isolation-cockroachdb/

Editor’s Note*: This post was originally authored when CockroachDB was pre-1.0. CockroachDB’s architecture has undergone many changes since then. One of the most significant, as it relates to this post which focuses on our previous “lockless” design, is that we now use more locking and lock-like structures to provide SERIALIZABLE isolation. For more current details about CockroachDB’s transaction model, read our transaction layer architecture documentation.*

编者注：这篇文章最初是在 CockroachDB 1.0 版本之前撰写的。从那时起，CockroachDB 的架构经历了许多变化。最重要的一个，因为它与这篇文章相关，重点关注我们之前的“无锁”设计，我们现在使用更多的锁定和类似锁的结构来提供可串行化的隔离。有关 CockroachDB 事务模型的更多最新详细信息，请阅读我们的事务层架构文档。

Several months ago, I discussed how CockroachDB’s distributed transactions are executed atomically. However, that discussion was incomplete; it ignored the concept of concurrency, where multiple transactions are active on the same data set at the same time. CockroachDB, like all database systems, tries to allow as much concurrency as possible in order to maximize access to the data set.

几个月前，我讨论了 CockroachDB 的分布式事务是如何原子执行的。然而，这一讨论并不完整；它忽略了并发的概念，即多个事务同时在同一数据集上处于活动状态。与所有数据库系统一样，CockroachDB 尝试允许尽可能多的并发性，以便最大限度地访问数据集。

Unfortunately, our atomicity guarantee is not sufficient to keep the database consistent in a world of concurrent transactions. Recall that guarantee:

不幸的是，我们的原子性保证不足以在并发事务的世界中保持数据库的一致性。回想一下这个保证：

For a group of database operations, either all of the operations are applied or none of them are applied.

对于一组数据库操作，要么应用所有操作，要么不应用任何操作。

What this does not address is the way that concurrent transactions may interleave. The individual operations (reads and writes) in a transaction do not happen simultaneously; there is time in between the individual operations. In a concurrent system, one transaction may commit during the execution window of a second transaction; even if the first transaction (T1) commits atomically, this can still allow operations later in the second transaction (T2) to see the results of T1, even though earlier operations on T2 did not see the results of T1. This interleaving can create a number of undesired anomalies, ultimately breaking the consistency of the database.

这没有解决并发事务可能交错的方式。事务中的各个操作（读取和写入）不会同时发生；各个操作之间有时间间隔。在并发系统中，一个事务可能会在第二个事务的执行窗口期间提交；即使第一个事务 (T1) 以原子方式提交，这仍然可以允许第二个事务 (T2) 中稍后的操作看到 T1 的结果，即使 T2 上的早期操作没有看到 T1 的结果。这种交错可能会产生许多不需要的异常，最终破坏数据库的一致性。

To protect against these anomalies, we require an Isolation guarantee:

为了防止这些异常情况，我们需要隔离保证：

For a group of atomic, concurrent transactions, the commit of one transaction may not interleave with the operations of another transaction.

对于一组原子并发事务，一个事务的提交可能不会与另一个事务的操作交错。

Perfect isolation can be trivially achieved through serial execution: executing all transactions on the system one at a time, with no concurrency. This has terrible performance implications; fortunately, it is also unnecessary to achieve perfect isolation. Many concurrent databases, including CockroachDB, instead offer serializable execution, which is equivalent to serial execution while allowing a considerable level of concurrent transactions.

通过串行执行可以轻松实现完美的隔离：一次执行系统上的所有事务，无需并发。这会对性能产生严重影响；幸运的是，也没有必要实现完美的隔离。许多并发数据库（包括 CockroachDB）都提供可序列化执行，这相当于串行执行，同时允许相当水平的并发事务。

CockroachDB’s default isolation level is called Serializable Snapshot. It is an optimistic, multi-version, timestamp-ordered concurrency control system with the following properties:

CockroachDB 的默认隔离级别称为可序列化快照。它是一个乐观的、多版本的、时间戳有序的并发控制系统，具有以下特性：

Serializable: The resulting database state is equivalent to a serial execution of component transactions.

生成的数据库状态相当于组件事务的串行执行。
Recoverable: A set of database transactions is considered recoverable if aborted or abandoned transactions will have no effect on the database state. Our atomic commit system already guarantees that individual transactions are recoverable; our Isolation system uses strict scheduling to ensure that any combination of transactions is also recoverable.

如果中止或放弃的事务对数据库状态没有影响，则认为一组数据库事务是可恢复的。我们的原子提交系统已经保证单个事务是可恢复的；我们的隔离系统使用严格的调度来确保任何事务组合也是可恢复的。
Lockless: Operations execute without taking locks on resources. Correctness is enforced by aborting transactions which would violate either serializability or strict scheduling.

操作的执行无需锁定资源。通过中止会违反可串行性或严格调度的事务来强制执行正确性。
Distributed: There is no central oracle, coordinator or service involved in this system.

该系统不涉及中央预言机、协调器或服务。

Providing Serializable Execution

CockroachDB uses a multi-version timestamp ordering to guarantee that its complete transaction commit history is serializable. The basic technique has been textbook material for three decades, but we will briefly go over how it works:

CockroachDB使用多版本时间戳排序来保证其完整的事务提交历史是可序列化的。三十年来，这项基本技术一直是教科书材料，但我们将简要介绍一下它的工作原理：

Serializability Graphs

To demonstrate the correctness of timestamp ordering, we look to serializability theory, and specifically one of its core concepts, the serializability graph. This graph is used to analyze a history of database transactions in terms of operation conflicts.

为了证明时间戳排序的正确性，我们研究了可序列化理论，特别是其核心概念之一，可序列化图。该图用于根据操作冲突来分析数据库事务的历史记录。

In the theory, a conflict occurs when two different transactions perform an operation on the same piece of data (one after the other), where at least one of the operations is a write. The second operation is said to be in conflict with the first operation. There are three types of conflicts:

理论上，当两个不同的事务对同一条数据（一个接一个）执行操作时，就会发生冲突，其中至少有一个操作是写入。据说第二个操作与第一个操作冲突。冲突分为三种类型：

Read-Write (RW) – Second operation overwrites a value that was read by the first operation.
Write-Read (WR) – Second operation reads a value that was written by the first operation.
Write-Write (WW) – Second operation overwrites a value that was written by first operation.

For any given transaction history, these conflicts can be used to create a serializability graph, which is a directed graph linking all transactions.

对于任何给定的事务历史记录，这些冲突可用于创建可序列化图，这是链接所有事务的有向图。

Transactions are nodes in the graph.

交易是图中的节点。
Whenever an operation conflicts with an operation from a different transaction, draw a directed edge from the conflicting operation to the conflicted operation.

每当一个操作与来自不同事务的操作发生冲突时，就从冲突操作到冲突操作绘制一条有向边。

*Figure 1: Example of a serializability graph for a simple transaction history.*简单交易历史记录的可序列化图示例

And we now arrive at a key statement of this theory: a history is guaranteed to be serializable if (and only if) its serializability graph is acyclic. (Proof, for those interested).

现在我们得出了该理论的一个关键陈述：当（且仅当）其可序列化图是非循环的时，历史才保证可序列化。（证明，对于那些有兴趣的人）。

Figure 2: Example of a transaction history with a cyclic serializability graph. This history is not serializable.

CockroachDB’s timestamp ordering guarantees an acyclic serializability graph, and this is straightforward to demonstrate:

CockroachDB 的时间戳排序保证了非循环可序列化图，这很容易演示：

Every transaction is assigned a timestamp (from the node on which it starts) when it begins. All operations in that transaction take place at this same timestamp, for the duration of the transaction.

每个事务在开始时都会被分配一个时间戳（从它开始的节点开始）。在该事务的持续时间内，该事务中的所有操作都在同一时间戳发生。
Individual operations can locally determine when they conflict with another operation, and what the transaction timestamp of the conflicted operation is.

各个操作可以在本地确定它们何时与另一个操作发生冲突，以及冲突操作的事务时间戳是什么。
Operations are only allowed to conflict with earlier timestamps; a transaction is not allowed to commit if doing so would create a conflict with a later timestamp.

操作只允许与较早的时间戳发生冲突；如果提交事务会与较晚的时间戳产生冲突，则不允许提交该事务。

By disallowing any conflicts that flow against the timestamp-ordered direction, cyclic serializability graphs are impossible. However, let’s explore in detail how CockroachDB actually goes about detecting and disallowing these conflicts.

通过禁止任何与时间戳排序方向相反的冲突，循环可串行化图是不可能的。然而，让我们详细探讨一下 CockroachDB 实际上是如何检测和禁止这些冲突的。

Write-Read Conflicts – MVCC Database

This is where the “multi-version” aspect of our control mechanism comes into play. CockroachDB keys do not store a single value, but rather store multiple timestamped versions of that value. New writes do not overwrite old values, but rather create a new version with a later timestamp.

这就是我们控制机制的“多版本”发挥作用的地方。 CockroachDB 键不存储单个值，而是存储该值的多个带时间戳的版本。新写入不会覆盖旧值，而是创建具有较晚时间戳的新版本。

Figure 3: Comparison of multi-versioned value store with a single-value store. Note that the multi-version store is sorted by timestamp.

Read operations on a key return the most recent version with a lower timestamp than the operation:

对键的读取操作返回时间戳低于操作的最新版本：

Thus, it is not possible in CockroachDB to form WR conflicts with later transactions; read operations will never read a value with a later timestamp.

因此，CockroachDB中不可能与后面的事务形成WR冲突；读取操作永远不会读取具有较晚时间戳的值。

Note: This is the “Snapshot” in serializable snapshot; in-progress transactions essentially see a temporal snapshot of the database, ignoring anything committed later.

注意：这是可序列化快照中的“快照”；正在进行的事务本质上看到数据库的临时快照，忽略稍后提交的任何内容。

Read-Write Conflicts – Read Timestamp Cache

On any read operation, the timestamp of that read operation is recorded in a node-local timestamp cache. This cache will return the most recent timestamp at which the key was read.

在任何读取操作中，该读取操作的时间戳都会记录在节点本地时间戳缓存中。该缓存将返回读取密钥的最新时间戳。

All write operations consult the timestamp cache for the key they are writing; if the returned timestamp is greater than the operation timestamp, this indicates a RW conflict with a later timestamp. To disallow this, the operation (and its transaction) must be aborted and restarted with a later timestamp.

所有写入操作都会查询时间戳缓存以获取它们正在写入的密钥；如果返回的时间戳大于操作时间戳，则表明与较晚的时间戳发生 RW 冲突。为了禁止这种情况，必须中止操作（及其事务）并使用稍后的时间戳重新启动。

The timestamp cache is an interval cache, meaning that its keys are actually key ranges. If a read operation is actually a predicate operating over a range of keys (such as a scan), then the entire scanned key range is written to the timestamp cache. This prevents RW conflicts where the key being written was not present during the scan operation.

时间戳缓存是一个区间缓存，这意味着它的键实际上是键范围。如果读取操作实际上是对一系列键进行操作的谓词（例如扫描），则整个扫描的键范围将写入时间戳缓存。这可以防止在扫描操作期间不存在正在写入的密钥时发生 RW 冲突。

The timestamp cache is a size-limited, in-memory LRU (least recently used) data structure, with the oldest timestamps being evicted when the size limit is reached. To deal with keys not in the cache, we also maintain a “low water mark”, which is equivalent to the earliest read timestamp of any key that is present in the cache. If a write operation writes to a key not present in the cache, the “low water mark” is returned instead.

时间戳缓存是一种大小有限的内存中 LRU（最近最少使用）数据结构，当达到大小限制时，最旧的时间戳将被逐出。为了处理不在缓存中的键，我们还维护一个“低水位线”，这相当于缓存中存在的任何键的最早读取时间戳。如果写入操作写入缓存中不存在的键，则会返回“低水位线”。

Write-Write Conflicts – Can only write the most recent version of a key

If a write operation attempts to write to a key, but that key already has a version with a later timestamp than the operation itself, allowing the operation would create a WW conflict with the later transaction. To ensure serializability, the operation (and its transaction) must be aborted and restarted with a later timestamp.

如果写入操作尝试写入某个键，但该键已经具有比操作本身晚的时间戳的版本，则允许该操作将与后面的事务产生 WW 冲突。为了确保可串行性，必须中止操作（及其事务）并使用稍后的时间戳重新启动。

By choosing a timestamp-based ordering, and rejecting all conflicts which disagree with that ordering, CockroachDB’s Serializable Snapshot guarantees a serializable schedule.

通过选择基于时间戳的排序，并拒绝所有与该排序不一致的冲突，CockroachDB 的可序列化快照保证了可序列化的计划。

Recoverable with Strict Scheduling

While the previous conflict rules are sufficient to guarantee a serializable history, a different concern arises when two uncommitted transactions have a conflict: even if that conflict is allowed by our timestamp ordering rules, additional rules are required to ensure that the transaction schedule remains recoverable.

虽然前面的冲突规则足以保证可序列化的历史记录，但当两个未提交的事务发生冲突时，就会出现不同的问题：即使我们的时间戳排序规则允许该冲突，也需要额外的规则来确保事务计划保持可恢复。

The issue of can be explained with an example:

这个问题可以用一个例子来解释：

consider two transactions [T1, T2], where timestamp(T1) < timestamp(T2). T1 writes to a key ‘A’. Later, T2 reads from key ‘A’, before T1 has committed.

This conflict is allowed according to our timestamp ordering rules. However, what value should T2’s read retrieve from ‘A’?

根据我们的时间戳排序规则，这种冲突是允许的。然而，T2 的读取应该从“A”检索什么值？

Assume it ignores the uncommitted value written by T1, and retrieves the previous value instead. If T1 and T2 both commit, this will create a WR conflict with T2 (T1 will have overwritten a value read by T2). This violates our timestamp ordering guarantee, and thus serializability.

假设它忽略 T1 写入的未提交值，而是检索以前的值。如果 T1 和 T2 都提交，这将与 T2 产生 WR 冲突（T1 将覆盖 T2 读取的值）。这违反了我们的时间戳排序保证，从而违反了可序列化性。
Assume it retrieves the value written by T1. If T2 commits, but T1 later aborts, this will have violated the atomicity of T1: T1 still will have had an effect on the database state, even though it aborted.

假设它检索 T1 写入的值。如果 T2 提交，但 T1 后来中止，这将违反 T1 的原子性：T1 仍然会对数据库状态产生影响，即使它中止了。

Thus, neither possibility can allowed: in this situation, there is no way that T2 can be safely committed before T1 while maintaining a recoverable schedule.

因此，这两种可能性都不允许：在这种情况下，无法在保持可恢复调度的同时在 T1 之前安全地提交 T2。

CockroachDB uses strict scheduling to handle this situation: operations are only allowed to read or overwrite committed values; operations are never allowed to act on an uncommitted value.

CockroachDB使用严格的调度来处理这种情况：操作只允许读取或覆盖提交的值；永远不允许操作对未提交的值进行操作。

Enforcing Strict Scheduling 执行严格的调度

As established in our atomicity post, uncommitted data is staged in intents on each key, for the purpose of atomic commits. In an MVCC data store, the intent on a key (if present) is stored in a special value which sorts immediately before the most recent committed value:

正如我们在原子性帖子中所建立的，为了原子提交的目的，未提交的数据在每个键上的意图中暂存。在 MVCC 数据存储中，键（如果存在）的意图存储在一个特殊值中，该值紧邻最近提交的值之前排序：

In our previous post on atomicity, we assumed that any intent encountered by a transaction was the result of an abandoned transaction; however, in a concurrent environment, the intent might instead be from a concurrent transaction which is still running.

在我们之前关于原子性的文章中，我们假设事务遇到的任何意图都是被放弃的事务的结果；但是，在并发环境中，意图可能来自仍在运行的并发事务。

Strict scheduling actions are required in two situations: if a read operation encounters an intent with a lower timestamp, or if a write encounters any intent from another transaction (regardless of timestamp ordering). In these situations, there are two options available to CockroachDB:

在两种情况下需要严格的调度操作：如果读取操作遇到具有较低时间戳的意图，或者如果写入遇到来自另一个事务的任何意图（无论时间戳顺序如何）。在这些情况下，CockroachDB 有两种选择：

If the second transaction has a higher timestamp, it can wait for the first transaction to commit or abort before completing the operation.

如果第二个事务具有更高的时间戳，则它可以等待第一个事务提交或中止，然后再完成操作。
One of the two transactions can be aborted.

两个事务之一可以被中止。

As an optimistic system (no waiting), CockroachDB always chooses to abort one of the transactions. The process of determining which transaction is as follows:

作为一个乐观的系统（无需等待），CockroachDB 总是选择中止其中一个事务。确定哪笔交易的过程如下：

The second transaction (which is encountering an intent) looks up the first transaction’s transaction record, the location of which is present in the intent.

第二个事务（遇到意图）查找第一个事务的事务记录，该记录的位置存在于意图中。
The transaction performs a “push” on the discovered transaction record. The push operation is as follows:

交易对发现的交易记录执行“推送”。推送操作如下：
- If the first transaction is already committed (the intent was not yet cleaned up), then the second transaction can clean up the intent and proceed as if the intent were a normal value.
  
  如果第一个事务已经提交（意图尚未清除），则第二个事务可以清除意图并继续执行，就好像意图是正常值一样。
- Likewise, if the other transaction already aborted, the intent can be removed and the second transaction can proceed as if the intent were not present.
  
  同样，如果另一个事务已经中止，则可以删除意图，并且第二个事务可以继续进行，就好像意图不存在一样。
- Otherwise, the surviving transaction is deterministic according to priority.
  
  否则，幸存的事务根据优先级是确定的。
  - It is not optimal to always abort either the pusher or pushee; there are cases where both transactions will attempt to push the other, so “victory” must be deterministic between any transaction pair.
    
    始终中止推动者或被推动者并不是最佳选择；在某些情况下，两个交易都会尝试推动另一个交易，因此任何交易对之间的“胜利”必须是确定性的。
  - Each transaction record is thus assigned a priority; priority is an integer number. In a push operation, the transaction with the lowest priority is always aborted (if priority is equal, the transaction with the higher timestamp is aborted. In the extremely rare case where both are equal, the pushing transaction is aborted).
    
    每条交易记录因此被分配一个优先级；优先级是一个整数。在推送操作中，优先级最低的事务总是被中止（如果优先级相同，则时间戳较高的事务将被中止。在极少数情况下，两者相等，推送事务将被中止）。
  - New transactions have a random priority. If a transaction is aborted by a push operation and is restarted, its new priority is max(randomInt(), [priority of transaction that caused the restart] - 1]); this has the effect of probabilistically ratcheting up a transaction’s priority if it is restarted multiple times.
    
    新交易具有随机优先级。如果事务被推送操作中止并重新启动，则其新优先级为 max(randomInt(), [导致重新启动的事务的优先级] - 1])；如果多次重新启动事务，这会在概率上提高事务的优先级。

In this way, all conflicts between uncommitted transactions are immediately resolved by aborting one of the transactions, thus enforcing strict scheduling and guaranteeing that all transaction histories are recoverable.

这样，未提交事务之间的所有冲突都可以通过中止其中一个事务来立即解决，从而执行严格的调度并保证所有事务历史都是可恢复的。

Note on Abandoned Transactions

As mentioned earlier, in a concurrent environment we can no longer assume that unresolved write intents belong to abandoned transactions; we must deal with abandoned transactions in a different way. The priority system already aborts abandoned transactions probabilistically – transactions blocked by the abandoned transaction will eventually have a high enough priority to usurp it.

如前所述，在并发环境中，我们不能再假设未解决的写意图属于已放弃的事务；我们必须以不同的方式处理废弃的交易。优先级系统已经有概率地中止被放弃的交易——被被放弃的交易阻止的交易最终将拥有足够高的优先级来篡夺它。

However, we additionally add a heartbeat timestamp to every transaction. While in progress, an active transaction is responsible for periodically updating the heartbeat timestamp on its central transaction record; if a push operation encounters a transaction with an expired heartbeat timestamp, then it is considered abandoned and can be aborted regardless of priority.

但是，我们还为每笔交易添加了心跳时间戳。在进行过程中，活动事务负责定期更新其中央事务记录上的心跳时间戳；如果推送操作遇到心跳时间戳过期的事务，则该操作将被视为放弃，并且无论优先级如何都可以中止。

Wrap Up 包起来

We have now demonstrated how CockroachDB’s Isolation system is able to provide a serializable and recoverable transaction history in a completely distributed fashion. Combined with our atomic commit post, we have already described a fairly robust system for executing concurrent, distributed ACID transactions. That said, there are still many aspects to CockroachDB’s transaction system that we have not yet covered.

我们现在已经演示了 CockroachDB 的隔离系统如何能够以完全分布式的方式提供可序列化和可恢复的事务历史记录。结合我们的原子提交帖子，我们已经描述了一个用于执行并发、分布式 ACID 事务的相当健壮的系统。也就是说，CockroachDB 的事务系统还有很多方面我们还没有涉及到。

For example, CockroachDB offers another, more relaxed isolation level known as Snapshot (without the “serializable”). Like relaxed isolation levels in other database systems, this mode increases concurrency performance by allowing transactions to interleave in certain cases; for some applications, this is an acceptable tradeoff.

例如，CockroachDB 提供了另一种更宽松的隔离级别，称为快照（没有“可序列化”）。与其他数据库系统中宽松的隔离级别一样，此模式通过允许事务在某些情况下交错来提高并发性能；对于某些应用程序来说，这是一个可以接受的折衷方案。

Another aspect is how CockroachDB provides linearizable access to its data. Linearizability is a property that can be difficult to provide in a distributed system. Spencer Kimball has already written this blog post demonstrating how CockroachDB deals with this in some detail (contrasting it with the way a similar system, Google’s Spanner, does the same); however, we may eventually write an additional linearizability blog post focused more directly on our transaction system.

另一个方面是 CockroachDB 如何提供对其数据的线性化访问。线性化是分布式系统中很难提供的属性。 Spencer Kimball 已经撰写了这篇博客文章，详细演示了 CockroachDB 如何处理此问题（将其与类似系统 Google 的 Spanner 的处理方式进行对比）；然而，我们最终可能会写一篇额外的线性化博客文章，更直接地关注我们的交易系统。