CockroachDB's consistency model
Original post: https://www.cockroachlabs.com/blog/consistency-model/
A few days ago, prompted by a Hacker News post, my friend Ivo texted me saying “Does your head ever explode when you’re thinking about databases and consistency semantics and whatever models? It just sounds like pointless taxonomy stuff. We are <N, K>-serializable whereas QuinoaDB is only ü-serializable”. The answer is yes — my head does explode. I don’t think it’s pointless, though I agree that the discussions are generally unproductive.
Separately, the other day a colleague told a user that “CockroachDB implements serializability, not linearizability”. While we say this statement often, and it is the best kind of correct, I don’t like it much because I think it doesn’t do us justice and it’s also not particularly helpful for the users — it doesn’t teach them very much about CockroachDB.
In this post, I’m attempting to present the guarantees that CockroachDB gives and the ones it doesn’t, and offer my preferred marketing slogan summarizing it all.
The first section provides background and some terminology for consistency models to support the following, CockroachDB-specific section. It’s not formal, rigorous or exhaustive (I link to better sources, though) so readers who are familiar with these things might want to skip it and head straight to the section on CockroachDB’s consistency model.
A summary of database consistency models
First of all, a brief introduction to what we’re talking about. Databases let many “clients” access data concurrently, and so they need to define the semantics of these concurrent accesses: for example, what happens if two clients read and write the same data “at the same time”. Moreover, distributed and replicated databases generally store multiple copies of the data, usually over a network of machines, and so they need to define what complications can arise from the fact that different machines are involved in serving reads and writes to the same data: e.g. if I tell machine A to write a key, and then immediately after I ask machine B to read it, will machine B return the data that had been just written? Informally speaking, what we’d ideally want from our database is to hide the data distribution and replication from us and to behave as if all transactions were being run one at a time by a single machine. A database that provides this kind of execution is said to implement the “strict serializability” consistency model - that’s the holy grail.
But, of course, we also want our database to be resilient to machine failure, and we want the transactions to execute fast, and we want many transactions to execute at the same time, and we want data for European customers to be served from European servers and not cross an ocean network link. All these requirements generally come in conflict with strict serializability. So then databases start relaxing the strict serializability guarantees, basically compromising on that front to get execution speed and other benefits. These compromises need precise language for explaining them. For example, consider a replicated database and a write operation executed by one of the replicas followed quickly by a read operation served by another one. What are admissible results for this read? Under strict serializability, the answer is clear — only the value of the preceding write is acceptable. Under more relaxed models, more values are allowed in addition to this one. But which values exactly? Is a random value coming out of thin air acceptable? Generally, no. Is the value of some other relatively recent write acceptable? Perhaps. To define things precisely, we need specialized vocabulary that’s used by well studied sets of rules (called “consistency models”).
Historically, both the distributed systems community and the databases community have evolved their own terminology and models for consistency. In more recent years, the communities have joined, driven by the advent of “distributed databases”, and the vocabularies have combined. Things are tricky though, plus different databases try to market themselves the best way they can, and so I think it’s fair to say that there’s a lot of confusion on the topic. I’ve been thinking about these things for a couple of years now in the context of CockroachDB, and I still always struggle to make unequivocal and clear statements on the subject. Additionally, I’ll argue that none of the standard lexicon describes CockroachDB very well. For a more systematic treatise on the different meanings of consistency, see The many faces of consistency and Jepsen’s treatment of the topic.
Transaction isolation levels and serializability
The databases community has been describing behavior in terms of transactions, which are composite operations (made up of SQL queries). Transactions are subject to the ACID properties (Atomicity, Consistency, Isolation, Durability). This community was primarily interested in the behavior of concurrent transactions on a single server, not so much in the interactions with data replication — it was thus initially not concerned by the historical issues around distributed consistency. For our discussion, the Isolation property is the relevant one: we have multiple transactions accessing the same data concurrently and we need them to be isolated from each other. Each one needs to behave, to the greatest extent possible, as if no other transaction was interfering with it. Ironically, the Consistency in ACID refers to a concept that’s only tangentially related to what we’re talking about here — the fact that the database will keep indexes up to date automatically and will enforce foreign key constraints and such.
To describe the possible degrees of transaction isolation, the literature and the ANSI standard enumerate a list of possible “anomalies” (examples of imperfect isolation) and, based on those, define a couple of standard “isolation levels”: Read Uncommitted, Read Committed, Repeatable Read, Serializable. To give a flavor of what these are about, for example the Repeatable Read isolation level says that once a transaction has read some data, reading it again within the same transaction yields the same results. So, concurrent transactions modifying that data have to somehow not affect the execution of our reading transaction. However, this isolation level allows the Phantom Read anomaly. Basically, if a transaction performs a query asking for rows matching a condition twice, the second execution might return more rows than the first. For example, something like select * from orders where value > 1000 might return orders (a, b, c) the first time and (a, b, c, d) the second time (which is ironic given Repeatable Read’s name, since one might call what just happened a non-repeatable read).
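To make the anomaly concrete, here is a minimal sketch of two interleaved sessions, assuming a hypothetical orders(id, value) table and a database whose Repeatable Read matches the ANSI definition (some implementations, such as PostgreSQL’s snapshot-based Repeatable Read, are stricter and won’t show a phantom here):

```sql
-- Session 1: a Repeatable Read transaction runs the same query twice.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT id FROM orders WHERE value > 1000;   -- returns a, b, c

-- Session 2, between the two reads, commits a new matching row.
INSERT INTO orders (id, value) VALUES ('d', 2000);

-- Session 1 again, same transaction: the new row may now show up (a phantom).
SELECT id FROM orders WHERE value > 1000;   -- may return a, b, c, d
COMMIT;
```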
Frankly, the definitions of the ANSI isolation levels are terrible (also see A Critique of ANSI SQL Isolation Levels), arguably with the exception of the Serializable one. They have been defined narrow-mindedly with a couple of database implementations in mind and have not stood the test of time.
The Serializable isolation level, which, as far as the SQL standard is concerned, is the gold standard, doesn’t allow any of the defined anomalies. In plain terms, it states that the database needs to ensure that transactions behave as if they had executed sequentially, one by one. The definition allows the database to choose the order of transactions in an equivalent sequential execution. This is less than ideal because it allows for the following scenario:
HN1:
We consider three transactions. The first one is insert into hacker_news_comments (id, parent_id, text) values (1, NULL, 'a root comment'). The second one is insert into hacker_news_comments (id, parent_id, text) values (2, 1, 'OP is wrong'). The third one is select id, text from hacker_news_comments.
- I run transaction one.
- I yell across the room to my friend Tobi, who’s just waiting to reply to my threads.
- Tobi runs transaction 2.
- We then tell our friend Nathan to stop what he’s doing and read our thread.
- He runs transaction 3 and gets a single result: (2, 'OP is wrong').
So, Nathan is seeing the response, but not the original post. That’s not good. And yet, it is allowed by the Serializable isolation level and, in fact, likely to occur in many distributed databases (spoiler alert: not in CRDB), assuming the actors were quick to yell at each other and run their transactions. The serial order in which the transactions appear to have executed is 2, 3, 1.
What has happened here is that the actors synchronized with each other outside of the database and expected the database’s ordering of transactions to respect “real time”, but the isolation levels don’t talk about “real time” at all. This seems to not have been a concern for the SQL standardization committee at the time, probably since this kind of thing simply wouldn’t happen if the database software runs entirely on one machine (however, many database researchers were thinking about the issues of distributed databases as early as the ’70s – for example, see Papadimitriou’s paper on serializability).
Distributed systems and linearizability
While database people were concerned with transaction isolation, researchers in distributed and parallel systems were concerned with the effects of having multiple copies of data on the system’s operations. In particular, they were concerned with the semantics of “read” and “write” operations on this replicated data. So, the literature evolved a set of operation “consistency levels”, with names like “read your own writes”, “monotonic reads”, “bounded staleness”, “causal consistency”, and “linearizable” which all give guidance about what values a read operation can return under different circumstances. The original two problems in need of solutions were how to resolve concurrent writes to the same logical address from two writers at separate physical locations using local replicas (CPUs on their local cache, NFS clients on their local copy), and when/how a stale copy should be updated (cache invalidation). The spectrum of possible solutions has been explored in different ways by the original communities: designers of memory caches were constrained by much tighter demands of programmers on consistency, whereas networked filesystems were constrained by unreliable networks to err on the side of more availability.
Generally speaking, this evolutionary branch of consistency models doesn’t talk about transactions. Instead, systems are modeled as collections of objects, with each object defining a set of operations it supports. For example, assuming we have a key-value store that provides the operations read(k) and write(k,v), the system obeys the “monotonic reads” model if, once a process reads the value of a key k, any successive read operation on k by that process will always return that same value or a more recent value. In other words, reads by any one process don’t “go backwards”.
There are two things to note about this model’s definition: first of all, it talks about a “process”, so the system has a notion of different threads of control. Understanding this is a burden; the serializable isolation level we discussed in the databases context did not need such a concept[^1] — the user of a system did not need to think about what process is performing what operations. Second, this model is quite relaxed in comparison to others. If one process performs a write(“a”, 1) and later another process performs read(“a”) (and there are no intervening writes to “a”), then the read might not return 1. The monotonic reads model describes various distributed systems where data is replicated asynchronously and multiple replicas can all serve reads.
The gold standard among these models is linearizability. It was formalized by Herlihy and Wing in a delightful paper.
This model aims to describe systems with properties pretty similar to the ones guaranteed for database transactions by the Serializable isolation level. Informally, it says that operations will behave as if they were executed one at a time, and an operation that finished before another one began (according to “real time”) has to execute before the second one. This model, assuming systems can actually implement it efficiently, sounds really good. Let’s define it more formally.
Usually, linearizability is defined at the level of a single, relatively simple “object” and then expanded to the level of a system comprised of many such objects. So, we have an object that affords a couple of operations, and we want to devise a set of rules for how these operations behave. An operation is modeled as an “invocation” (from a client to the object) followed by a “response” (from the object to the client). We’re talking in a concurrent setting, where many clients are interacting with a single object concurrently. We define a “history” to be a set of invocations and responses.
For example, say our object is a FIFO queue (providing the enqueue/dequeue operations). Then a history might be something like:
H1:
client 1: enqueue “foo”
client 1: ok
client 1: dequeue
client 1: ok (“foo”)
client 1: enqueue “bar”
client 2: enqueue “baz”
client 1: ok
client 2: ok
client 1: dequeue
client 1: ok (“baz”)
The first event in this history is an invocation by client 1, the second one is the corresponding response from the queue object. Responses for dequeue operations are annotated with the element they return.
We say that a given history is “sequential” if every invocation is immediately followed by a response. H1 is not sequential since it contains, for example, this interleaving of operations:
client 1: enqueue “bar”
client 2: enqueue “baz”
Sequential histories are easy to reason about and check for validity (e.g. whether or not our FIFO queue is indeed FIFO). Since H1 is not sequential, it’s a bit hard to say whether the last response client 1 got is copacetic. Here’s where we use linearizability: we say that a history H is linearizable if it is equivalent to some valid sequential history H’, where H’ contains the same events, possibly reordered under the constraint that, if a response op1 appears before an invocation op2 in H, then this order is preserved in H’. In other words, a history is linearizable if all the responses are valid according to a sequential reordering that preserves the order of non-overlapping responses.
For example, H1 is in fact linearizable because it’s equivalent to the following sequential history:
client 1: enqueue “foo”
client 1: ok
client 1: dequeue
client 1: ok (“foo”)
client 2: enqueue “baz”
client 2: ok
client 1: enqueue “bar”
client 1: ok
client 1: dequeue
client 1: ok (“baz”)
Now, an object is said to be linearizable if all the histories it produces are linearizable. In other words, no matter how the clients bombard our queue with requests concurrently, the results need to look as if the requests came one by one. If the queue is to claim linearizability, the implementation should use internal locking, or whatever it needs to do, to make this guarantee. Note that this model does not explicitly talk about replication, but the cases where it is of value are primarily systems with replicated state. If our queue is replicated across many machines, and clients talk to all of them for performing operations, “using internal locking” is not trivial but has to somehow be done if we want linearizability.
To raise the level of abstraction, a whole system is said to be linearizable if it can be modeled as a set of linearizable objects. Linearizability has this nice “local” property: it can be composed like that. So, for example, a key-value store that offers point reads and point writes can be modeled as a collection of registers, with each register offering a read and write operation. If the registers individually provide linearizability, then the store as a whole also does.
Two things are of note about the linearizable consistency model:
First, there is a notion of “real time” used implicitly. Everybody is able to look at one clock on the wall so that it can be judged which operation finishes before another operation begins. The order of operations in our linearizable histories has a relation with the time indicated by this mythical clock.
Second, concurrent operations are allowed to execute in any order. For example, in our history H1, the last event might have been client 1: ok (“bar”), because a serial history where the enqueuing of “bar” is ordered before the enqueuing of “baz” would also have been acceptable.
It’s worth reminding ourselves that linearizability does not talk about transactions, so this model by itself is not well suited to be used by SQL databases. I guess one could shoehorn it by saying that the whole database is one object which provides one transaction operation, but then a definition needs to be provided for the functional specifications of this operation. We’re getting back to the ACID properties and the transaction isolation levels, and I’m not sure how the formalism would work exactly.
What the literature does to bring this relationship that linearizability has with time into the database model is to incorporate its ideas into the serializable transaction isolation level.
A note on clocks
The mentioning of “real time” and the use of a global clock governing a distributed system are fighting words for some of my colleagues. It’s understandable since, on the one hand, Einstein realized that time itself is relative (different observers can perceive events to take place in different orders relative to each other) and, on the other hand, even if we are to ignore relativistic effects for practical purposes, this one true, shared clock doesn’t quite exist in the context of a distributed system. I’m not qualified to discuss relativistic effects beyond acknowledging that there is such a thing as relativistic linearizability. I believe the casual database user can ignore them, but I’ll start blabbering if you ask me exactly why.
The fact that there is no shared clock according to which we can decide ordering is a problem all too real for implementers of distributed systems like CockroachDB. The closest we’ve come is a system called TrueTime built by Google, which provides tightly synchronized clocks and bounded errors brought front and center.
As far as the linearizability model is concerned (which assumes that a shared clock exists), the way I think about it is that the model tells us what to expect if such a clock were to exist. Given that it doesn’t quite exist, then clients of the system can’t actually use it to record their histories perfectly: one can’t simply ask all the clients, or all the CockroachDB replicas, to log their operation invocations and responses and timestamp them using the local clocks, and then centralize all the logs and construct a history out of that. This means that verifying a system that claims to be linearizable isn’t trivial. In other words, Herlihy talks about histories but doesn’t describe how one might actually produce these histories in practice. But that doesn’t mean the model is not useful.
What a verifier can do is record certain facts like “I know that this invocation happened after this other invocation, because there was a causal relationship between them”. For certain operations for which there was not a causal relationship, the client might not have accurate enough timestamps to put in the history and so such pairs of events can’t be used to verify whether a history is linearizable or not. Alternatively, another thing a verifier might do is relay all its operations through a singular “timestamp oracle”, whose recording would then be used to produce and validate a history. Whether such a construct is practical is debatable, though, since the mere act of sequencing all operations would probably introduce enough latency in them as to hide imperfections of the system under test.
Bringing the worlds together: strict serializability
As I was saying, the ANSI SQL standard defines the serializable transaction isolation as the highest level, but its definition doesn’t consider phenomena present in distributed databases. It admits transaction behavior that is surprising and undesirable because it doesn’t say anything about how some transactions need to be ordered with respect to the time at which the client executed them.
To cover these gaps, the term “strict serializability” has been introduced for describing (distributed) databases that don’t suffer from these undesirable behaviors.
Strict serializability says that transaction behavior is equivalent to some serial execution, and the serial order of transactions corresponds to real time (i.e. a transaction started after another one finished will be ordered after it). Note that strict serializability (like linearizability) still doesn’t say anything about the relative ordering of concurrent transactions (but, of course, those transaction still need to appear to be “isolated” from each other). We’ll come back to this point in the next sections.
Under strict serializability, the system behavior outlined in the Hacker News posts example from the Serializability section is not permitted. Databases described by the strict serializability model must ensure that the final read, Nathan’s, returns both the root comment and the response. Additionally, the system must ensure that a query like select * from hacker_news_comments never returns the child comment without the parent, regardless of the time when the query is executed (i.e. depending on the time when it’s executed, it can return an empty set, the root, or both the root and the child). We’ll come back to this point when discussing CRDB’s guarantees.
Google’s Spanner uses the term “external consistency” instead of “strict serializability”. I like that term because it emphasizes the difference between a system that provides “consistency” for transactions known to the database to be causally related and systems that don’t try to infer causality and offer stronger guarantees (or, at least, that’s how me and my buddies interpret the term). For example, remembering the Hacker News example, there are systems that allow Tobi to explicitly tell the database that his transaction has been “caused” by my transaction, and then the system guarantees that the ordering of the two transaction will respect this. Usually this is done through some sort of “causality tokens” that the actors pass around between them. In contrast, Spanner doesn’t require such cooperation from the client in order to prevent the bad outcome previously described: even if the clients coordinated “externally” to the database (e.g, by yelling across the room), they’ll still get the consistency level they expect.
Peter Bailis has more words on Linearizability, Serializability and Strict Serializability.
CockroachDB’s consistency model: more than serializable, less than strict serializability
Now that we’ve discussed some general concepts, let’s talk about how they apply to CockroachDB. CockroachDB is an open-source, transactional, SQL database and it’s also a distributed system. In my opinion, it comes pretty close to being the Holy Grail of databases: it offers a high degree of “consistency”, it’s very resilient to machine and network failures, it scales well and it performs well. This combination of features already makes it unique enough; the system goes beyond that and brings new concepts that are quite game-changing — good, principled control over data placement and read and write latency versus availability tradeoffs in geographically-distributed clusters. All without ever sacrificing things we informally refer to as “consistency” and “correctness” in common parlance. Also it’s improving every day at a remarkable pace. I’m telling you — you need to try this thing!
But back to the subject at hand — the consistency story. CockroachDB is a complex piece of software; understanding how it all works in detail is not tractable for most users, and indeed it will not even be a good proposition for all the engineers working on it. We therefore need to model it and present a simplified version of reality. The model needs to be as simple as possible and as useful as possible to users, without being misleading (e.g. suggesting that outcomes that one might think are undesirable are not possible when in fact they are). Luckily, because CockroachDB was always developed under a “correctness first” mantra, coming up with such a model is not too hard, as I’ll argue.
There’s a standard disclosure that comes with our software: the system assumes that the clocks on the Cockroach nodes are somewhat synchronized with each other. The clocks are allowed to drift away from each other up to a configured “maximum clock offset” (by default 500ms). Operators need to run NTP or another clock synchronization mechanism on their machines. The system detects when the drift approaches the maximum allowed limit and shuts down some nodes, alerting an operator[^2]. Theoretically, I think more arbitrary failure modes are possible if clocks get unsynchronized quickly. More on the topic in Spencer’s post “Living Without Atomic Clocks.”
Back to the consistency. For one, CockroachDB implements the serializable isolation level for transactions, as specified by the SQL standard. In contrast to most other databases which don’t offer this level of isolation as the default (or at all, for crying out loud!), this is the only isolation level we offer; users can’t opt for a lesser one. We, the CockroachDB authors, collectively think that any lower level is just asking for pain. It’s fair to say that it’s generally extremely hard to reason about the other levels and the consequences of using them in an application (see the ACIDRain paper for what can go wrong when using lower isolation levels). I’m not trying to be condescending; up until the 2.1 version we used to offer another relatively high level of isolation as an option (Snapshot Isolation), but it turned out that it (or, at least, our implementation of it) had complex, subtle consequences that even we hadn’t fully realized for the longest time. Thus, we ripped it out and instead improved the performance of our implementation of serializability as much as possible. Below serializability be dragons.
But simply saying that we’re serializable is selling our system short. We offer more than that. We do not allow the bad outcome in the Hacker News commenting scenario.
CockroachDB doesn’t quite offer strict serializability, but we’re fairly close to it. I’ll spend the rest of the section explaining how exactly we fail strict serializability, what our guarantees actually are, and some gotchas.
No stale reads
If there’s one canned response I wish we’d give users that pop into our chat channels asking about the consistency model, I think it should be “CockroachDB doesn’t allow stale reads”. This should be the start of all further conversations, and in fact I think it will probably preempt many conversations. Stating this addresses a large swath of anomalies that people wonder about (in relation to distributed systems). “No stale reads” means that, once a write transaction committed, every read transaction starting afterwards[^3] will see it.
Internalizing this is important and useful. It does not come by chance; the system works hard for it and so have we, the builders. In the Hacker News comments example, once I have committed my root comment, a new transaction by Nathan is guaranteed to see it. Yes, our system is distributed and data is replicated. Yes, Nathan might be talking to a different node than I was, maybe a node with a clock that’s trailing behind. In fact, the node I was talking to might have even crashed in the meantime. Doesn’t matter. If Nathan is able to read the respective table, he will be able to read my write.
Beyond serializability, saying “no stale reads” smells like linearizability (and, thus, strict serializability) since “staleness” is related to the passing of time. In fact, when people come around asking for linearizability, I conjecture that most will be satisfied by this answer. I think this is what I’d be asking for if I hadn’t educated myself specifically on the topic. Relatedly, this is also what the C(onsistency) in the famous CAP theorem is asking for. And we have it.
So why exactly don’t we claim strict serializability?
CockroachDB does not offer strict serializability
Even though CRDB guarantees (say it with me) “no stale reads”, it still can produce transaction histories that are not linearizable.
Consider the history HN2 (assume every statement is its own transaction, for simplicity):
- Nathan runs select * from hacker_news_comments. Doesn’t get a response yet.
- I run insert into hacker_news_comments (id, parent_id, text) values (1, NULL, 'a root comment') and commit.
- Tobi runs insert into hacker_news_comments (id, parent_id, text) values (2, 1, 'OP is wrong') and commits.
- Nathan’s query returns and he gets Tobi’s row but not mine.
This is the “anomaly” described in Section 2.5 of Jepsen’s analysis of CRDB from back in the day.
So what happened? From Nathan’s perspective, Tobi’s transaction appears to have executed before mine. That contradicts strict serializability since, according to “real time”, Tobi ran his transaction after me. This is how CRDB fails strict serializability; we call this anomaly “causal reverse”.
Before freaking out, let’s analyze the circumstances of the anomaly a bit. Then I’ll explain more technically, for the curious, how such a thing can happen in CRDB.
First of all, let’s restate our motto: if Nathan had started his transaction after Tobi committed (in particular, if Nathan had started his transaction because Tobi committed his), he would have seen both rows and things would have been good. An element that’s at play, and in fact is key here, is that Nathan’s transaction was concurrent with both mine and Tobi’s. According to the definition of strict serializability, Nathan’s transaction can be ordered in a bunch of ways with respect to the other two: it can be ordered before both of them, after both of them, or after mine but before Tobi’s. The only thing that’s required is that my transaction is ordered before Tobi’s. The violation of strict serializability that we detected here is not that Nathan’s transaction was mis-ordered, but that mine and Tobi’s (which are not concurrent) appear to have been reordered. Non-strict serializability allows this just fine.
My opinion is that this anomaly is not particularly bad because Nathan was not particularly expecting to see either of the two writes. But if this was my only argument, I’d probably stay silent.
There’s another important thing to explain: both my and Tobi’s transactions are, apart from their timing, unrelated: the sets of data they read and write do not overlap. If they were overlapping (e.g. if Tobi read my comment from the DB before inserting his), then serializability would not allow them to be reordered at all (and so CockroachDB wouldn’t do it and the anomaly goes away). In this particular example, if the schema of the hacker_news_comments table contained a self-referencing foreign key constraint (asking the database to ensure that child comments reference an existing parent), then the “reading” part would have been ensured by the system.
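For illustration, a hypothetical version of that schema might look like the following; with the self-referencing foreign key in place, inserting Tobi’s reply implicitly reads my row, the two transactions overlap, and the causal reverse can no longer happen:

```sql
-- Hypothetical schema for the hacker_news_comments example. The foreign key
-- forces an insert of a child comment to check (i.e. read) its parent row.
CREATE TABLE hacker_news_comments (
    id INT PRIMARY KEY,
    parent_id INT REFERENCES hacker_news_comments (id),
    text STRING   -- CockroachDB's STRING; use TEXT in vanilla PostgreSQL
);
```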
So, for this anomaly to occur, you need three transactions to play. Two of them need to appear to be independent of each other (but not really be, or otherwise we probably wouldn’t have noticed the anomaly) and the third needs to overlap both of them. I’ll let everybody judge for themselves how big of a deal this is. For what it’s worth, I don’t remember hearing a CRDB user complaining about it.
Beyond the theory, there are technical considerations that make producing this anomaly even more unlikely: given CockroachDB’s implementation, the anomaly is avoided not only if the read/write sets of my and Tobi’s transactions overlap, but also if the leadership of any of the ranges of data containing hacker_news_comments rows 1 and 2 happens to be on the same node when these transactions occur, or if Nathan’s database client is talking to the same CockroachDB node as Tobi’s, and also in various other situations. Also, the more synchronized the clocks on the three nodes are, the less likely it is. Overall, this anomaly is pretty hard to produce even if you try explicitly.
As you might have guessed, I personally am not particularly concerned about this anomaly. Besides everything I’ve said, I’ll add a whataboutist argument and take the discussion back to friendly territory: consider this anomaly in contrast to the “stale reads” family of anomalies present in many other competing products. All these things are commonly bucketed under strict serializability / linearizability violations, but don’t be fooled into thinking that they’re all just as bad. Our anomaly needs three transactions doing a specific dance resulting in an outcome that, frankly, is not even that bad. A stale read anomaly can be seen much easier in a product that allows it. Examples are many; a colleague gave a compelling one recently: if your bank was using a database that allows stale reads, someone might deposit a check for you, at which point your bank would text you about it, and you’d go online to see your balance. You might see the non-updated balance and freak out. Banks should be using CockroachDB.
Other CockroachDB gotchas
I’ve discussed the CockroachDB guarantees and violations of strict serializability. Our discussion loosely used SQL to illustrate things, but the language and concepts came from the more theoretical literature. We bridged the gap by implying that SQL statements are really the reads and writes used by those models. This section discusses some uses of CockroachDB/SQL that fall a bit outside the models we’ve used, but are surprising nevertheless. I think these examples will not fall nicely into the models used for the strict serializability definition, at least not without some effort put into expanding the model.
The SQL now() function
Consider the following two transactions:
1. insert into foo (id, time) values (1, now())
2. insert into foo (id, time) values (2, now())
Assuming these two transactions execute in this order, it is possible (and surprising) to read the rows back and see that the time value for row 2 is lower than the one for row 1.
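Reading the rows back with something like the following sketch (assuming the foo(id, time) table from the snippet above) can then show row 2 carrying an earlier time than row 1:

```sql
-- The time for row 2 may be earlier than row 1's if the two inserts went
-- through gateway nodes with skewed clocks.
SELECT id, time FROM foo ORDER BY id;
```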
Perhaps it’s realistic to think that this happens in other systems too, even single-node systems, if the system clock jumps backwards (as it sometimes does), so perhaps there’s nothing new here.
as of system time queries and backups
CockroachDB supports the (newer) standard SQL system-versioned tables; CockroachDB lets one “time travel” and query the old state of the database with a query like select * from foo as of system time now()-10s. This is a fantastic, really powerful feature. But it also provides another way to observe a “causal reverse” anomaly. Say one ran these two distinct transactions, in this order:
1. insert into hacker_news_comments (id, parent_id, text) values (1, NULL, 'a root comment')
2. insert into hacker_news_comments (id, parent_id, text) values (2, 1, 'OP is wrong')
It’s possible for an as of system time query to be executed later and, if it’s unlucky in its choice of a “system time”, to see the second row and not the first.
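For example, a time-travel read along these lines (the interval is arbitrary) can land on a historical timestamp that falls between the MVCC timestamps of the two inserts and return only the reply:

```sql
-- An unlucky historical read: it may see row 2 ('OP is wrong') without row 1.
SELECT id, parent_id, text
FROM hacker_news_comments
AS OF SYSTEM TIME '-10s';
```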
Again, if the second transaction were to read the data written by the first (e.g. implicitly through a foreign key check), the anomaly would not be possible.
Relatedly, a backup, taken through the backup database command, is using as of system time queries under the hood, and so a particular backup might contain row 2 but not row 1.
CockroachDB implementation details
The architecture of CockroachDB is based on a separation between multiple layers (a SQL layer on top down to a storage layer at the bottom). For the subject at hand, the interesting layer is the Transaction Layer, which is in charge of making sure that a transaction doesn’t miss writes that it’s supposed to be seeing. Each transaction has a timestamp, assigned by the “gateway node” — the node that a client happens to be talking to — when the transaction starts (through a SQL BEGIN statement). As the transaction talks to different other nodes that might be responsible for ranges of data it wants to read, this timestamp is used to decide what values are visible (because they’ve been written by transactions “in the past”) and which values aren’t visible because they’ve been written “in the future”.
CockroachDB uses multi-version concurrency control (MVCC), which means that the history of each row is available for transactions to look through. The difficulties, with respect to consistency guarantees, stem from the fact that the timestamps recorded into MVCC are taken from the clock of the gateway node that wrote them, which is generally not the same one as the gateway assigning the transaction timestamp for a reader, and we assume that the clocks can be desynchronized up to a limit (we call the phenomenon “clock skew”). So, given transaction timestamp t and value timestamp t’, how does one decide whether the value in question should be visible or not?
The rules are that, if t’ <= t, then the transaction will see the respective value (and so we’ll essentially order our transaction after that writer). The reasoning is that either our transaction really started after the other one committed, or, if not, the two were concurrent and so we can order things either way.
If t’ > t, then it gets tricky. Did the writer really start and commit before the reader began its transaction, or did it commit earlier than that but t’ was assigned by a clock that’s ahead of ours? What CRDB does is define an “uncertainty interval”: if the values are close enough so that t’ could be explained by a trailing clock, we say that we’re unsure about whether the value needs to be visible or not, and our transaction needs to change its timestamp (which, unless we can avoid it, means the transaction might have to restart. Which, unless we can further avoid it, means the client might get a retriable error). This is what allows CockroachDB to guarantee no stale reads. In the Hacker News example, if Nathan starts his transaction after me and Tobi committed ours, the worst that could happen is that he gets a timestamp that’s slightly in the past and has to consider some of our other writes uncertain, at which point he’ll restart at a higher timestamp.
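The decision is easy to state; here is a self-contained sketch of it as a plain SQL expression (not CockroachDB internals, just the rule described above, with integer stand-ins for the value timestamp t’, the reader’s timestamp t, and the maximum clock offset):

```sql
-- Visibility of an MVCC value written at t_value to a reader running at t_txn.
SELECT CASE
         WHEN t_value <= t_txn              THEN 'visible'
         WHEN t_value <= t_txn + max_offset THEN 'uncertain: bump the reader''s timestamp'
         ELSE 'not visible (written in the future)'
       END AS mvcc_visibility
FROM (VALUES (102, 100, 5)) AS params (t_value, t_txn, max_offset);
```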
We work quite hard to minimize the effects of this uncertainty interval. For one, transactions keep track of what timestamps they’ve observed at each node and uncertainty is tracked between nodes pair-wise. This, coupled with the fact that a node’s clock is bumped up when someone tries to write on it with a higher timestamp, allows a transaction to not have to restart more than once because of an uncertain value seen on a particular node. Also, overall, once the maximum admissible clock skew elapses since a transaction started, a transaction no longer has any uncertainty.
Separately, when a transaction’s timestamp does need to be bumped, we try to be smart about it. If either the transaction hasn’t read anything before encountering the uncertain value, or if we can verify that there have been no writes on the data it’s already read before encountering the uncertainty, then the transaction can be bumped with no fuss. If we can’t verify that, then the transaction needs to restart so it can perform its writes again. If it does have to restart, we don’t necessarily tell the client about it. If we haven’t yet returned any results for the transaction to the client (which is common if the client can send parts of a transaction’s statements as a batch), then we can re-execute all the transaction’s statements on the server-side and the client is none the wiser.
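When a retriable error does reach the client, one documented way to handle it is CockroachDB’s client-side retry protocol; here is a minimal sketch, reusing the comments table from earlier and assuming the application retries whenever the server returns SQLSTATE 40001:

```sql
BEGIN;
SAVEPOINT cockroach_restart;
  INSERT INTO hacker_news_comments (id, parent_id, text) VALUES (3, 1, 'me too');
  -- ... the rest of the transaction's statements ...
RELEASE SAVEPOINT cockroach_restart;
COMMIT;

-- On a retriable error (SQLSTATE 40001), the application issues
--   ROLLBACK TO SAVEPOINT cockroach_restart;
-- and re-runs the statements instead of starting a brand new transaction.
```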
Conclusion
CockroachDB provides a high level of “consistency”, second only to Spanner among distributed databases as far as I know (but then CockroachDB is a more flexible and easier-to-migrate-to database — think ORM support — so I’ll take it over Spanner any day). We offer a relatively easy to understand programming model, although the literature doesn’t give us a good name for it. It’s stronger than serializability, but somewhat weaker than strict serializability (and than linearizability, although using that term in the context of a transactional system is an abuse of the language). It’s probably easiest to qualify it by understanding the anomaly that it allows — “causal reverse” — and the limited set of circumstances under which it can occur. In the majority of cases where one might be wondering about the semantics of reads and writes in CockroachDB, the slogan “no stale reads” should settle most discussions.
[^1]: Although I think the definition of the Serializable isolation level would have benefitted from introducing some notion of different clients. As phrased by the SQL standard, I believe it technically allows empty results to be produced for any read-only transaction, with the justification that those transactions are simply ordered before any other transaction. Implementing that would be egregious, though.
[^2]: We’re thinking of ways to make CRDB resilient to more arbitrarily unsynchronized clocks.
[^3]: As discussed in the “A note on clocks” section, figuring out what “afterwards” means is not always trivial when the clients involved are not on the same machine. But still, sometimes (in the cases that matter most), a transaction is known to happen after another one, usually through a causal relationship between the two.