How Pipelining consensus writes speeds up distributed SQL transactions

CockroachDB supports ACID transactions across arbitrary data in a distributed database. A discussion on how this works was first published on our blog three years ago. Since then, a lot has changed. Perhaps most notably, CockroachDB has transitioned from a key-value store to a full SQL database that can be plugged in as a scalable, highly-available replacement for PostgreSQL. It did so by introducing a SQL execution engine which maps SQL tables onto its distributed key-value architecture. However, over this period of time, the fundamentals of the distributed, atomic transaction protocol at the core of CockroachDB have remained untouched [1].

For the most part, this hasn’t been an issue. The transaction protocol in CockroachDB was built to scale out to tremendously large clusters with arbitrary data access patterns. It does so efficiently while permitting serializable multi-key reads and writes. These properties have been paramount in allowing CockroachDB to evolve from a key-value store to a SQL database. However, CockroachDB has had to pay a price for this consistency in terms of transaction latency. When compared to other consensus systems offering weaker transaction semantics, CockroachDB often needed to perform more synchronous consensus rounds to navigate a transaction. Fortunately, we realized that we could improve transaction latency by introducing concurrency between these rounds of consensus.

This post will focus on an extension to the CockroachDB transaction protocol called Transactional Pipelining, which was introduced in CockroachDB’s recent 2.1 release. The optimization promises to dramatically speed up distributed transactions, reducing their time complexity from O(n) to O(1), where n is the number of DML SQL statements executed in the transaction and the analysis is expressed with respect to the latency cost of distributed consensus.

The post will give a recap of core CockroachDB concepts before using them to derive a performance model for approximating transaction latency in CockroachDB. It will then dive into the extension itself, demonstrating its impact on the performance model and providing experimental results showing its effects on real workloads. The post will wrap up with a preview of how we intend to extend this optimization further in upcoming releases to continue speeding up transactions.

Distributed Transactions: A Recap

CockroachDB allows transactions to span an entire cluster, providing ACID guarantees across arbitrary numbers of machines, data centers, and geographical regions. This is all exposed through SQL — meaning that you can BEGIN a transaction, issue any number of read and write statements, and COMMIT the transaction, all without worrying about inconsistencies or loss of durability. In fact, CockroachDB provides the strongest level of isolation, SERIALIZABLE, so that the integrity of your data is always preserved.

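To make this concrete, here is a minimal Go sketch of an explicit transaction against CockroachDB using the standard database/sql package and the lib/pq driver (CockroachDB speaks the PostgreSQL wire protocol). The connection string and the accounts table are assumptions for the sake of the example:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // PostgreSQL-compatible driver
    )

    func main() {
        // Hypothetical connection string; adjust for your own cluster.
        db, err := sql.Open("postgres", "postgresql://root@localhost:26257/bank?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        tx, err := db.Begin() // BEGIN: SERIALIZABLE isolation by default
        if err != nil {
            log.Fatal(err)
        }
        // Any number of reads and writes may follow, possibly spanning Ranges.
        if _, err := tx.Exec("UPDATE accounts SET balance = balance - 100 WHERE id = 1"); err != nil {
            tx.Rollback()
            log.Fatal(err)
        }
        if _, err := tx.Exec("UPDATE accounts SET balance = balance + 100 WHERE id = 2"); err != nil {
            tx.Rollback()
            log.Fatal(err)
        }
        if err := tx.Commit(); err != nil { // COMMIT: durable once acknowledged
            log.Fatal(err)
        }
    }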

There are a few competing ideas which combine to make this all possible, each of which is important to understand. Below is a brief introduction to each. For those interested in exploring further, more detail can be found in our architecture documentation.

Storage

At its most fundamental level, the goal of a durable database is to persist committed data such that it will survive permanently. This is traditionally performed by a storage engine, which writes bytes to a non-volatile storage medium. CockroachDB uses RocksDB, an embedded key-value database maintained by Facebook, as its storage engine. RocksDB builds upon its pedigree (LevelDB and more generally the Log-structured merge-tree (LSM tree) data structure) to strike a balance between high write throughput, low space amplification, and acceptable read performance. This makes it a good choice for CockroachDB, which runs a separate instance of RocksDB on each individual node in a cluster.

Even with software advances like better indexing structures and hardware advances like the emergence of SSDs, persistence is still expensive, both in terms of the latency it imposes on each individual write and in terms of the bounds it places on write throughput. For the remainder of this post, we’ll refer to the first of these costs as “storage latency”.

Replication

Replicating data across nodes allows CockroachDB to provide high-availability in the face of the chaotic nature of distributed systems. By default, every piece of data in CockroachDB is replicated across three nodes in a cluster (though this is configurable)—we refer to these as “replicas”, and each node contains many replicas. Each individual node takes responsibility for persisting its own replica data. This ensures that even if nodes lose power or lose connectivity with one another, as long as a majority of the replicas are available, the data will stay available to read and write. Like other modern distributed systems, CockroachDB uses the Raft consensus protocol to manage coordination between replicas and to achieve fault-tolerant consensus, upon which this state replication is built. We’ve published about this topic before.

Of course, the benefits of replication come at the cost of coordination latency. Whenever a replica wants to make a change to a particular piece of its replicated data, it “proposes” that change to the other replicas and multiple nodes must come to an agreement about what to change and when to change it. To maintain strong consistency during this coordination, Raft (and other consensus protocols like it) require at least a majority of replicas (e.g. a quorum of 2 nodes for a replication group of 3 nodes) to agree on the details of the change.

In its steady-state, Raft allows the proposing replica to achieve this agreement with just a single network call to each other replica in its replication group. The proposing replica must then wait for a majority of replicas to respond positively to its proposal. This can be done in parallel for every member in the group, meaning that at a minimum, consensus incurs the cost of a single round-trip network call to the median slowest member of the replication group. For the remainder of this post, we’ll refer to this as “replication latency”.

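As a small illustration of that last point, the Go sketch below (our own simplification, not CockroachDB code) computes the replication latency of a single proposal from per-replica round-trip times: the proposer only needs acknowledgements from a quorum, so the slowest minority is never waited on:

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    // replicationLatency returns how long a proposal waits for consensus,
    // given the proposer's round-trip time to each replica (including its
    // own, which is ~0). Only a quorum must respond, so for 3 replicas the
    // answer is the 2nd-fastest response; for 5 replicas, the 3rd-fastest.
    func replicationLatency(rtts []time.Duration) time.Duration {
        sorted := append([]time.Duration(nil), rtts...)
        sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
        quorum := len(sorted)/2 + 1
        return sorted[quorum-1]
    }

    func main() {
        // One local replica, one nearby, one in a distant region.
        rtts := []time.Duration{0, 2 * time.Millisecond, 60 * time.Millisecond}
        fmt.Println(replicationLatency(rtts)) // 2ms: the distant replica is off the critical path
    }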

Distribution

Replicating data across nodes improves resilience, but it doesn’t allow data to scale indefinitely. For that, CockroachDB needs to distribute different data across the nodes in a cluster, storing only a subset of the total data on each individual node. To do this, CockroachDB breaks data into 64MB chunks, called Ranges. These Ranges operate independently, and each manages its own N-way replication. The Ranges automatically split, merge, and move around a cluster to hold a suitable amount of data and to stay healthy (i.e. fully-replicated) if nodes crash or become unreachable.

A Range is made up of Replicas, which are members of the Range that hold a copy of its state and live on different nodes. Each Range has a single “leaseholder” Replica, which both coordinates writes for the Range and serves reads from its local RocksDB store. The leaseholder Replica is defined as whichever Replica holds the time-based “range lease” at any given moment. The Replicas can move this lease between themselves as they see fit. For the purpose of this post, we’ll always assume that the leaseholder Replica is collocated with (in the same data center as) the node serving SQL traffic. This is not always the case, but automated processes like Follow-the-Workload do their best to enforce this collocation, and lease preferences make it possible to manually control leaseholder placement.

With a distribution policy built on top of consistent replication, a CockroachDB cluster is able to scale to an arbitrary number of Ranges and to move Replicas in these Ranges around to ensure resilience and localized access. However, as is becoming the trend in this post, this also comes at a cost. Because distribution forces data to be split across multiple replication groups (i.e. multiple Ranges), we lose the ability to trivially order operations if they happen in different replication groups. This loss of linearizable ordering across Ranges is what necessitates the distributed transaction protocol that the rest of this post will focus on.

Transactions

CockroachDB’s transactional protocol implements ACID transactions on top of the scalable, fault-tolerant foundation that its storage, replication, and distribution layers combine to provide. It does so while allowing transactions to span an arbitrary number of Ranges and as many participating nodes as necessary. The protocol was inspired in part by Google Percolator, and it follows a similar pattern of breaking distributed transactions into three distinct phases:

1. Preparation

A transaction begins when a SQL BEGIN statement is issued. At that time, the transaction determines the timestamp at which it will operate and prepares to execute SQL statements. From this point on, the transaction will perform all reads and writes at its pre-determined timestamp. Those with prior knowledge of CockroachDB may remember that its storage layer implements multi-version concurrency control, meaning that transactional reads are straightforward even if other transactions modify the same data at later timestamps.

When the transaction executes statements that mutate data (DML statements), it doesn’t write committed values immediately. Instead, it creates two things that help it manage its progress:

  • The first write of the transaction creates a transaction record which includes the transaction’s current status. This transaction record acts as the transaction’s “switch”. It begins in the “pending” state and is eventually switched to “committed” to signify that the transaction has committed.

  • The transaction creates write intents for each of the key-value data mutations it intends to make. The intents represent provisional, uncommitted state which lives on the same Ranges as their corresponding data records. As such, a transaction can end up spreading intents across Ranges and across an entire cluster as it performs writes during the preparation phase. Write intents point at their transaction’s record and indicate to readers that they must check the status of the transaction record before treating the intent’s value as the source of truth or before ignoring it entirely. (Both structures are sketched in code after this list.)

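A simplified Go sketch of these two structures and the pointer between them (illustrative types of our own, not CockroachDB’s actual encoding):

    package txn

    // TxnStatus is the state of the transaction record's "switch".
    type TxnStatus int

    const (
        Pending TxnStatus = iota
        Committed
        Aborted
    )

    // TransactionRecord is created alongside the transaction's first write
    // and holds the authoritative status of the transaction.
    type TransactionRecord struct {
        ID     string
        Status TxnStatus
    }

    // WriteIntent is a provisional value living on the same Range as the
    // data record it shadows. A reader that encounters one must look up the
    // referenced transaction record before trusting or ignoring the value.
    type WriteIntent struct {
        Key   string
        Value []byte
        TxnID string // points back at the owning TransactionRecord
    }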

2. Commit

When a SQL transaction has finished issuing read and write statements, it executes a COMMIT statement. What happens next is simple - the transaction visits its transaction record, checks if it has been aborted, and if not, it flips its switch from “pending” to “committed”. The transaction is now committed and the client can be informed of the success.

3. Cleanup

After the transaction has been resolved and the client has been acknowledged, an asynchronous process is launched to replace all provisional write intents with committed values. This reduces the chance that future readers will observe the intents and need to check in with the intents’ transaction record to determine its disposition. This can be important for performance because checking the status of another transaction by visiting its transaction record can be expensive. However, this cleanup process is strictly an optimization and not a matter of correctness.

That high-level overview of the transaction protocol in CockroachDB should be sufficient for the rest of this post, but those who are interested can learn more in our docs.

The Cost of Distributed Transactions in CockroachDB

With an understanding of the three phases of a distributed transaction in CockroachDB and an understanding of the abstractions upon which they are built, we can begin to construct a performance model that captures the cost of distributed transactions. Specifically, our model will approximate the latency that a given transaction will incur when run through CockroachDB 2.0 and earlier.

Model Assumptions

To begin, we’ll establish a few simplifying assumptions that will make our latency model easier to work with and visualize.

  1. The first assumption we’ll make is that the two dominant latency costs in distributed transactions are storage latency and replication latency. That is, the cost to replicate data between replicas in a Range and the cost to persist it to disk on each replica will dominate all other latencies in the transaction such that everything else can safely be ignored in our model. To safely make this approximation, we must assume that Range leaseholders are collocated with the CockroachDB nodes serving SQL traffic. This allows us to ignore any network latency between SQL gateways and Range leaseholders when performing KV reads and writes. As we discussed earlier, this is a safe and realistic assumption to make. Likewise, we must also assume that the network latency between the client application issuing SQL statements and the SQL gateway node executing them is sufficiently negligible. If all client applications talk to CockroachDB nodes within their local data centers/zones, this is also a safe assumption.

  2. The second assumption we’ll make is that the transactional workload being run is sufficiently uncontended such that any additional latency due to queuing for lock and latch acquisition is negligible. This holds true for most workloads, but will not always be the case in workloads that create large write hotspots, like YCSB in its zipfian distribution mode. It’s our belief that a crucial property of successful schema design is the avoidance of write hotspots, so we think this is a safe assumption to make.

  3. Finally, the third assumption we’ll make is that the CockroachDB cluster is operating under a steady-state that does not include chaos events. CockroachDB was built to survive catastrophic failures across a cluster, but failure events can still induce latencies on the order of a few seconds to live traffic as Ranges recover, Range leases change hands, and data is migrated in response to the unreachable nodes. These events are a statistical given in a large-scale distributed system, but they shouldn’t represent the cluster’s typical behavior –– so, for the sake of this performance model, it’s safe to assume they are absent.

The model will not be broken if any of these assumptions are incorrect, but it will need to be adapted to account for changes in latency characteristics.

Latency Model

First, let’s define exactly what we mean by “latency”. Because we’re most interested in the latency observed by applications, we define transactional latency as “the delay between when a client application first issues its BEGIN statement and when it gets an acknowledgement that its COMMIT statement succeeded.” Remember that SQL is “conversational”, meaning that clients typically issue a statement and wait for its response before issuing the next one.

We then take this definition and apply it to CockroachDB’s transaction protocol. The first thing we see is that because we defined latency from the client’s perspective, the asynchronous third phase of cleaning up write intents can be ignored. This reduces the protocol down to just two client-visible phases: everything before the COMMIT statement is issued by the client and everything after. Let’s call these two component latencies L_prep and L_commit, respectively. Together, they combine into a total transaction latency L_txn = L_prep + L_commit.

The goal of our model is then to characterize L_prep and L_commit in terms of the two dominant latency costs of the transaction so that we can define L_txn as a function of this cost. It just so happens that these two dominant latency costs are always paid as a pair, so we can define this unit latency as L_c, which can be read as “the latency of distributed consensus”. This latency is a function of both the replication layer and the storage layer. It can be expressed to a first-order approximation as the latency of a single round-trip network call to, plus a synchronous disk write on, the median slowest member of a replication group (i.e. Range). This value is highly dependent on a cluster’s network topology and on its storage hardware, but is typically on the order of single or double digit milliseconds.

To define L_prep in terms of the unit latency L_c, we first need to enumerate everything a transaction can do before a COMMIT is issued. For the sake of this model, we’ll say that a transaction can issue R read statements (e.g. SELECT * FROM t LIMIT 1) and W write statements (e.g. INSERT INTO t VALUES (1)). If we define the latency of a read statement as L_r and the latency of a write statement as L_w, then the total latency is L_prep = R * L_r + W * L_w. So far, so good. It turns out that because leaseholders in CockroachDB can serve KV reads locally without coordination (the committed value already achieved consensus), and because we assumed that the leaseholders were all collocated with the SQL gateways, L_r approaches 0 and the model simplifies to L_prep = W * L_w. Of course, this isn’t actually true; reads aren’t free. In some sense, this shows a limitation of our model, but given the constraints we’ve placed on it and the assumptions we’ve made, it’s reasonable to assume that sufficiently small, OLTP-like reads have a negligible effect on the latency of a transaction.

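Spelling out that simplification under the collocation assumption:

    L_prep = R * L_r + W * L_w
           ≈ R * 0   + W * L_w    (leaseholder reads pay no consensus round)
           = W * L_w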

With L_prep reduced to L_prep = W * L_w, we now just need to characterize the cost of L_w in terms of L_c. This is where details about CockroachDB’s transaction protocol implementation come into play.

To begin, we know that the transaction protocol creates a transaction record during the first phase. We also know that the transaction protocol creates a write intent for every modified key-value pair during this phase. Both the transaction record and the write intents are replicated and persisted in order to maintain consistency. This means that, naively, L_prep would incur a single L_c cost when creating the transaction record plus an L_c cost for every key-value pair modified across all writing statements. However, this isn’t actually the cost of L_prep, for two reasons:

  1. The transaction record is not created immediately after the transaction begins. Instead, it is collocated with and written in the same batch as the first write intent, meaning that the latency cost to create the transaction record is completely hidden and therefore can be ignored.

  2. Every provisional write intent for a SQL statement is created in parallel, meaning that regardless of how many key-value pairs a SQL statement modifies, it only incurs a single L_c cost. A SQL statement may touch multiple key-value pairs if it touches a single row with a secondary index or if it touches multiple distinct rows. This explains why using multi-row DML statements can lead to such dramatic performance improvements (see the sketch after this list).

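To illustrate the second point, this Go fragment (assuming a hypothetical table t) contrasts the two ways of inserting three rows inside an explicit transaction under the pre-2.1 protocol:

    package example

    import "database/sql"

    // insertRows contrasts per-statement consensus costs under the pre-2.1
    // protocol for a hypothetical table t.
    func insertRows(tx *sql.Tx) error {
        // Three single-row statements: three sequential consensus rounds (3 * L_c).
        for _, v := range []int{1, 2, 3} {
            if _, err := tx.Exec("INSERT INTO t VALUES ($1)", v); err != nil {
                return err
            }
        }
        // One multi-row statement: all intents are written in parallel (1 * L_c).
        _, err := tx.Exec("INSERT INTO t VALUES (4), (5), (6)")
        return err
    }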

Together, this means that L_w, the latency cost of a single DML SQL statement, is equivalent to L_c. This is true even for the first writing statement, which has the important role of creating the transaction record. With this substitution, we can then define L_prep = W * L_c.

Defining L_commit in terms of the unit latency L_c is a lot more straightforward. When the COMMIT statement is issued, the switch on the transaction’s record is flipped with a single round of distributed consensus. This means that L_commit = L_c.

We can then combine these two components to complete the pre-2.1 latency model:

    L_txn = (W + 1) * L_c

We can read this as saying that a transaction pays the cost of distributed consensus once for every DML statement it executes plus once to commit. For instance, if our cluster can perform consensus in 7ms and our transaction performs 3 UPDATE statements, a back-of-the-envelope calculation for how long it should take gives us 28ms.

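Plugging the example numbers into the model:

    L_txn = (W + 1) * L_c = (3 + 1) * 7ms = 28ms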

The “1-Phase Transaction” Fast-Path

CockroachDB’s transaction protocol contains an important optimization, present since its inception, called the “one-phase commit” fast-path. This optimization allows a transaction that performs all of its writes on the same Range and commits immediately to skip creating a transaction record entirely, completing with just a single round of consensus (L_txn = 1 * L_c).

An important property of this optimization is that the transaction needs to commit immediately. This means that typically the fast-path is only accessible by implicit SQL transactions (i.e. single statements outside of a BEGIN; ... COMMIT; block). Because of this limitation, we’ll ignore this optimization for the remainder of this post and focus on explicit transactions.

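For completeness, a small illustrative Go fragment showing the shape of a statement that can qualify (assuming all of the rows it writes land on a single Range):

    package example

    import "database/sql"

    // onePhaseInsert issues an implicit transaction. If every intent it
    // writes lives on one Range, CockroachDB can use the one-phase commit
    // fast-path (L_txn = 1 * L_c). The same statement wrapped in an explicit
    // BEGIN ... COMMIT block cannot.
    func onePhaseInsert(db *sql.DB) error {
        _, err := db.Exec("INSERT INTO t VALUES (1)")
        return err
    }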

Transactional Pipelining

The latency model we built reveals an interesting property of transactions in CockroachDB — their latency scales linearly with respect to the number of DML statements they contain. This behavior isn’t unreasonable, but its effects are clearly noticeable when measuring the performance of large transactions. Further, its effects are especially noticeable in geo-distributed clusters with very high replication latencies. This isn’t great for a database specializing in distributed operation.

What is Transactional Pipelining?

Transactional Pipelining is an extension to the CockroachDB transaction protocol which was introduced in v2.1 and aims to improve performance for distributed transactions. Its stated goal is to avoid the linear scaling of transaction latency with respect to DML statement count. At a high level, it achieves this by performing distributed consensus for intent writes across SQL statements concurrently. In doing so, it achieves its goal of reducing transaction latency to a constant multiple of consensus latency.

Prior Art (a.k.a. the curse of SQL)

Before taking a look at how transactional pipelining works, let’s quickly take a step back and explore how CockroachDB has attempted to address this problem in the past. CockroachDB first attempted to solve this issue in its v1.0 release through parallel statement execution.

Parallel statement execution worked as advertised - it allowed clients to specify that they wanted statements in their SQL transaction to run in parallel. A client would do so by suffixing DML statements with the RETURNING NOTHING specifier. Upon the receipt of a statement with this specifier, CockroachDB would begin executing the statement in the background and would immediately return a fake return value to the client. Returning to the client immediately allowed parallel statement execution to get around the constraints of SQL’s conversational API within session transactions and enabled multiple statements to run in parallel.

There were two major problems with this. First, clients had to change their SQL statements in order to take advantage of parallel statement execution. This seems minor, but it was a big issue for ORMs or other tools which abstract the SQL away from developers. Second, the fake return value was a lie. In the happy case where a parallel statement succeeded, the correct number of rows affected would be lost. In the unhappy case where a parallel statement failed, the error would be returned, but only later in the transaction. This was true whether the error was in the SQL domain, like a foreign key violation, or in the operational domain, like a failure to write to disk. Ultimately, parallel statement execution broke SQL semantics to allow statements to run in parallel.

We thought we could do better, which is why we started looking at the problem again from a new angle. We wanted to retain the benefits of parallel statement execution without breaking SQL semantics. This in turn would allow us to speed up all transactions, not just those that were written with parallel statement execution in mind.

Buffering Writes Until Commit

We understood from working with a number of other transaction systems that a valid alternative would be to buffer all write operations at the transaction coordinator node until the transaction was ready to commit. This would allow us to flush all write intents at once and pay the “preparation” cost of all writes, even across SQL statements, in parallel. This would also bring our distributed transaction protocol more closely in line with a traditional presumed abort 2-phase commit protocol.

The idea was sound and we ended up creating a prototype that did just this. However, in the end we decided against the approach. In addition to the complication of buffering large amounts of data on transaction coordinator nodes and having to impose conservative transaction size limits to accommodate doing so, we realized that the approach would have a negative effect on transaction contention in CockroachDB.

If you squint, write intents serve a similar role to row locks in a traditional SQL database. By “acquiring” these locks later in the lifecycle of a transaction and allowing reads from other transactions to create read-write conflicts in the interim, we observed a large uptick in transaction aborts when running workloads like TPC-C. It turns out that performing all writes (i.e. acquiring all locks) at the end of a transaction is workable under weaker isolation levels like snapshot isolation, because such isolation levels allow a transaction’s read timestamp and its write timestamp to drift apart.

However, at a serializable isolation level, a transaction must read and write at the same timestamp to prevent anomalies like write skew from corrupting data. With this restriction, writing intents as early as possible serves an important role in CockroachDB of sequencing conflicting operations across transactions and avoiding the kinds of conflicts that result in transaction aborts. As such, doing so ends up being a large performance win even for workloads with just a small amount of contention.

Creating significantly more transaction aborts would have been a serious issue, so we began looking for other ways that we could speed up transactions without acquiring all locks at commit time. We’ll soon see that transactional pipelining allows us to achieve these same latency properties while still eagerly acquiring locks and discovering contention points within a transaction long before they would cause a transaction to abort.

A Key Insight

The breakthrough came when we realized that we could separate SQL errors from operational errors. We recognized that in order to satisfy the contract for SQL writes, we only need to synchronously perform SQL-domain constraint validation to determine whether a write should return an error, and if not, determine what the effect of the write should be (i.e. rows affected). Notably, we realized that we could begin writing intents immediately but don’t actually need to wait for them to finish before returning a result to the client. Instead, we just need to make sure that the write succeeds sometime before the transaction is allowed to commit.

The interesting part about this is that a Range’s leaseholder has all the information necessary to perform constraint validation and determine the effect of a SQL write, and it can do this all without any coordination with other Replicas. The only time that it needs to coordinate with its peers is when replicating changes, and this doesn’t need to happen before we return to a client who issued a DML statement. Effectively, this means that we can push the entire consensus step out of the synchronous stage of statement execution. We can turn a write into a read and do all the hard work later. In doing so, we can perform the time-consuming operation of distributed consensus concurrently across all statements in a transaction!

Asynchronous Consensus

In order to make this all fit together, we had to make a few changes to CockroachDB’s key-value API and client. The KV API was extended with the concept of “asynchronous consensus”. Traditionally, a KV operation like a Put would acquire latches on the corresponding Range’s leaseholder, determine the Put’s effect by evaluating against the local state of the leaseholder (i.e. creating a new write intent), replicate this effect by proposing it through consensus, and wait until consensus succeeds before finally returning to the client.

Asynchronous consensus instructs KV operations to skip this last step and return immediately after proposing the change to Raft. Using this option, CockroachDB’s SQL layer can avoid waiting for consensus during each DML statement within a transaction. This means we no longer need to wait W * L_c during a transaction’s preparation phase.

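The sketch below captures the shape of this change in simplified Go (our own names and types; the real KV client and Raft machinery are far more involved). The write is evaluated synchronously on the leaseholder, but the consensus round happens in the background behind a handle:

    package pipeline

    // writeIntent is a provisional value produced by evaluating a write
    // against the leaseholder's local state.
    type writeIntent struct {
        key   string
        value []byte
    }

    // proposalResult is resolved once the consensus round behind an intent
    // write has succeeded or failed.
    type proposalResult struct {
        done chan error
    }

    // evaluate determines a write's effect locally on the leaseholder; no
    // coordination with other replicas is needed for this step.
    func evaluate(key string, value []byte) writeIntent {
        return writeIntent{key: key, value: value}
    }

    // propose stands in for replicating the intent through Raft and
    // persisting it on a quorum of replicas.
    func propose(w writeIntent) error {
        _ = w // a real implementation would submit w to the Raft log
        return nil
    }

    // asyncPut evaluates a write and proposes it to consensus, returning a
    // handle immediately instead of waiting for replication to finish.
    func asyncPut(key string, value []byte) *proposalResult {
        res := &proposalResult{done: make(chan error, 1)}
        w := evaluate(key, value)
        go func() {
            res.done <- propose(w) // the L_c cost is paid in the background
        }()
        return res
    }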

Proving Intent Writes

The other half of the puzzle is that transactions now need to wait for all in-flight consensus writes to complete before committing. We call this job of waiting for an in-flight consensus write “proving” the intent. To prove an intent, the transaction client, which lives on the SQL gateway node performing a SQL transaction, talks to the leaseholder of the Range which the intent lives on and checks whether it has been successfully replicated and persisted. If the in-flight consensus operation succeeded, the intent is successfully proven. If it failed, the intent is not proven and the transaction returns an error. If the consensus operation is still in-flight, the client waits until it finishes.

To use this new mechanism, the transaction client was modified to track all unproven intents. It was then given the critical job of proving all intent writes before allowing a transaction to commit. The effect of this is that provisional writes in a transaction never wait for distributed consensus anymore. Instead, a transaction waits for all of its intents to be replicated through consensus in parallel, immediately before it commits. Once all intent writes succeed, the transaction can flip the switch on its transaction record from PENDING to COMMITTED.

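Continuing the sketch above (reusing its proposalResult handle), the client’s pre-commit step might look like the following; note that in the real system, proving an intent is a request to the intent’s Range leaseholder rather than a wait on a local handle:

    package pipeline

    import "sync"

    // proveAll waits, in parallel, for every in-flight intent write to
    // finish replicating. Only if all of them succeeded may the transaction
    // flip its record from PENDING to COMMITTED; any failure aborts it.
    func proveAll(inFlight []*proposalResult) error {
        var wg sync.WaitGroup
        errs := make([]error, len(inFlight))
        for i, res := range inFlight {
            wg.Add(1)
            go func(i int, res *proposalResult) {
                defer wg.Done()
                errs[i] = <-res.done // blocks only if consensus is still in-flight
            }(i, res)
        }
        wg.Wait()
        for _, err := range errs {
            if err != nil {
                return err
            }
        }
        return nil
    }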

Read-Your-Writes

There is an interesting edge case here. When a transaction writes a value, it should be able to read that same value later on as if it had already been committed. This property is sometimes called “read-your-writes”. CockroachDB’s transaction protocol has traditionally made this property trivial to enforce. Before asynchronous consensus, each DML statement in a transaction would synchronously result in intents that would necessarily be visible to all later statements in the transaction. Later statements would notice these intents when they went to perform operations on the same rows and would simply treat them as the source of truth since they were part of the same transaction.

With asynchronous consensus, this guarantee isn’t quite as strong. Now that we’re responding to SQL statements before they have been replicated or persisted, it is possible for a later statement in a transaction to try to access the same data that an earlier statement modified, before the earlier statement’s consensus has resulted in an intent.

To prevent this from causing the client to miss its writes, we create a pipeline dependency between statements in a transaction that touch the same rows. Effectively, this means that the second statement will wait for the first to complete before running itself. In doing so, the second intent write first proves the success of the first intent write before starting asynchronous consensus itself. This results in what is known as a “pipeline stall”, because the pipeline within the transaction must slow down to prevent reordering and ensure that dependent statements see their predecessor’s results.

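Building once more on the same sketch, the client can enforce this ordering with a per-key map of in-flight writes (again, an illustrative structure of our own):

    package pipeline

    // pipelinedClient tracks the in-flight intent write, if any, for each
    // key the transaction has written.
    type pipelinedClient struct {
        inFlight map[string]*proposalResult
    }

    func newPipelinedClient() *pipelinedClient {
        return &pipelinedClient{inFlight: make(map[string]*proposalResult)}
    }

    // put pipelines a write. It stalls only when an earlier statement in
    // the same transaction wrote the same key and that write is still being
    // replicated, which preserves read-your-writes ordering.
    func (c *pipelinedClient) put(key string, value []byte) error {
        if prev, ok := c.inFlight[key]; ok {
            // Pipeline stall: prove the earlier write before proceeding.
            if err := <-prev.done; err != nil {
                return err
            }
        }
        c.inFlight[key] = asyncPut(key, value)
        return nil
    }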

It is worth noting that the degenerate case where all statements depend on one another and each results in a pipeline stall is exactly the case we had before: all statements are serialized with no intermediate concurrency.

This mix of asynchronous consensus, proving intent writes, and the strong ordering enforced between dependent statements that touch the same rows combine to create transactional pipelining.

Latency Model: Revisited

Transactional pipelining dramatically changes our latency model. It affects both the preparation phase and the commit phase of a transaction and forces us to rederive L_prep and L_commit. To do so, we need to remember two things. First, with transactional pipelining, intent writes no longer synchronously pay the cost of distributed consensus. Second, before committing, a transaction must prove all of its intents before changing the status on its transaction record.

We hinted at the effect of this change on L_prep earlier - writing statements are now just as cheap as reading statements. This means that L_prep approaches 0 and the model simplifies to L_txn = L_commit.

However, L_commit is now more expensive because it has to do two things: prove all intents and then write to its transaction record, and it must do these operations in order. The cost of the first step is of particular interest. The transaction client is able to prove all intents in parallel. The effect of this is that the latency cost of proving intent writes at the end of a transaction is simply the latency cost of the slowest intent write, or L_c. The latency cost of the second step, writing to the transaction’s record to flip its switch, does not change.

By adding these together we arrive at our new transaction latency model:

    L_txn = L_commit = 2 * L_c

We can read this as saying that a transaction whose writes cross multiple Ranges pays the cost of distributed consensus exactly twice, regardless of how many reads and writes it performs. For instance, if our cluster can perform consensus in 7ms and our transaction performs 3 UPDATE statements, a back-of-the-envelope calculation for how long it should take gives us 14ms. If we add a fourth UPDATE statement to the transaction, we don’t expect it to pay an additional consensus round trip; the estimated cost is constant regardless of what else the transaction does.

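Plugging the same numbers into the revised model:

    L_txn = 2 * L_c = 2 * 7ms = 14ms    (independent of W)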

Benchmark Results