CockroachDB Stability Post-Mortem: From 1 Node to 100 Nodes

https://www.cockroachlabs.com/blog/cockroachdb-stability-from-1-node-to-100-nodes/

In August, we published a blog post entitled “Why Can’t I Run a 100-Node CockroachDB Cluster?”. The post outlined difficulties we encountered stabilizing CockroachDB. CockroachDB stability (or the lack thereof) had become significant enough that we designated it a “code yellow” issue, a concept borrowed from Google that means a problem is so pressing that it merits promotion to a primary concern of the company. For us, the code yellow was more than warranted; a database program isn’t worth the bytes to store its binary if it lacks stability.


In this post, I’ll set the stage with some background, then cover hypotheses for root causes of instability, our communication strategy, some interesting technical details, outcomes for stabilization efforts, and conclusions. It’s a long post, so bear with me!


TL;DR: We achieved most of our stability goal. While we’re still working on some of the chaos scenarios, the system is easily stable at clusters of many more than 10 nodes – we’ve tested it successfully at 100 nodes.


Background

To better set the stage: we announced the CockroachDB Beta release in April, after more than a year of development. Over the five months of progress on the beta, concerns over correctness, performance, and the general push for new features dominated our focus. We incorrectly assumed stability would be an emergent property of forward progress, as long as everyone paid some attention to it and fixed stability bugs whenever they were encountered. But by August, despite a team of 20 developers, we couldn’t stand up a 10-node cluster for two weeks without major performance and stability issues.


Nothing could more effectively convey the gulf that had opened up between our stability expectations and reality than the increasingly frequent mentions of instability as a punchline in the office. Despite feeling that we were just one or two pull requests away from stability, the inevitable chuckle nevertheless morphed into an insidious critique. This blog post chronicles our journey to establish a baseline of stability for CockroachDB.


Hypotheses on Root Causes


What caused instability? Obviously there were technical oversights and unexpectedly complex interactions between system components. A better question is: what was preventing us from achieving stability? Perhaps surprisingly, our hypotheses came down to a mix of mostly process and management failures, not engineering. We identified three root causes:


  1. The rapid pace of development was obscuring, or contributing to, instability faster than solutions could be developed. Imagine a delicate surgery with an excessive amount of blood welling up in the incision. You must stop the bleeding first in order to operate. This analogy suggested we’d need to work on stability fixes in isolation from normal development. We accomplished this by splitting our master branch into two branches: the master branch would be dedicated to stability, frozen except for pull requests targeting stability. All other development would continue in a develop branch.


  2. While many engineers were paying attention to stability, there was no focused team and no clear leader. Imagine many cooks in the kitchen working independently on the same dish, without anyone responsible for the result tasting right. To complicate matters, imagine many of the chefs making experimental contributions… Time to designate a head chef. We chose Peter Mattis, one of our co-founders. He leads engineering and is particularly good at diagnosing and fixing complex systems, and so was an obvious choice. Instead of his previously diffuse set of goals to develop myriad functionality and review significant amounts of code, we agreed that he would largely limit his focus to stability and become less available for other duties. The key objective here was to enable focus and establish accountability.


  3. Instability was localized in a core set of components, which were undergoing too many disparate changes for anyone to fully understand and review. Perhaps a smaller team could apply more scrutiny to fewer, careful changes and achieve what had eluded a larger team. We downsized the team working on core components (the transactional, distributed key-value store), composed of five engineers with the most familiarity with that part of the codebase. We even changed seating arrangements, which felt dangerous and counter-cultural, as normally we randomly distribute engineers so that project teams naturally resist balkanization.


Communication

The decision to do something about stability happened quickly. I’d come home from an August vacation blithely assuming instability was a solved problem. Unfortunately, new and seemingly worse problems had cropped up. This finally provided enough perspective to galvanize us into action. Our engineering management team discussed the problem in earnest, considered likely causes, and laid out a course of action over a weekend. To proceed, we had to communicate the decisions internally to the team at Cockroach Labs, and after some soul searching, externally to the community at large.


One of our values at Cockroach Labs is transparency. Internally, we are open about our stability goals and our successes or failures to meet them. But just being transparent about a problem isn’t enough; where we fell down was in being honest with ourselves about the magnitude of the problem and what it meant for the company.


Once decided, we drafted a detailed email announcing the code yellow to the team. Where we succeeded was in clearly defining the problem and risks, actions to be taken, and most importantly: code yellow exit criteria. Exit criteria must be measurable and achievable! We decided on “a 10-node cluster running for two weeks under chaos conditions without data loss or unexpected downtime.”


Where we didn’t succeed was in how precipitous the decision and communication seemed to some members of the team. We received feedback that the decision lacked sufficient deliberation and the implementation felt “railroaded”.


We didn’t decide immediately to communicate the code yellow externally, although consensus quickly formed around the necessity. For one thing, we’re building an open source project and we make an effort to use Gitter instead of Slack for engineering discussions, so the community at large can participate. It would be a step backwards to withhold this important change in focus. For another thing, the community surely was aware of our stability problems and this was an opportunity to clarify and set expectations.


Nevertheless, the task of actually writing a blog post to announce the stability code yellow wasn’t easy and wasn’t free of misgivings. Raise your hand if you like airing your problems in public… Unsurprisingly, there was criticism from Hacker News commentators, but there were also supportive voices. In the end, maintaining community transparency was the right decision, and we hope it established trust.


Technical Details

With changes to process and team structure decided, and the necessary communication undertaken, we embarked on an intense drive to address the factors contributing to instability in priority order. We of course had no idea how long this would take. Anywhere from one to three months was the general consensus. In the end, we achieved our code yellow exit criteria in five weeks.


What did we fix? Well, instability appeared in various guises, including clusters slowing precipitously or deadlocking, out-of-memory panics (OOMs), and data corruption (detected via periodic replica checksum comparisons).


Rebalancing via Snapshots

The generation and communication of replica snapshots, used to rebalance and repair data in a CockroachDB cluster, was our most persistent adversary in the battle for stability. Snapshots use significant disk and network IO, and mechanisms that limit their memory consumption and processing time while holding important locks were originally considered unnecessary for beta stability. Much of the work to tame snapshots occurred during the months leading up to the stability code yellow, which hints at their significance. Over the course of addressing snapshots, we reduced their memory usage with streaming RPCs, and made structural changes to avoid holding important locks during generation. However, the true cause of snapshot instability proved to be a trivial oversight; it simply wasn’t visible through the fog of cluster instability – at least, not until we’d mostly eliminated the obvious symptoms of snapshot badness.

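To make the streaming change concrete, here is a minimal sketch of the idea, not CockroachDB’s actual RPC code: instead of materializing an entire replica snapshot in memory before sending it, the sender ships bounded chunks over a streaming RPC, so memory usage tracks the chunk size rather than the replica size. The interface and chunk size below are illustrative assumptions.

```go
package snapshot

// snapshotStream stands in for a streaming RPC client (e.g. a gRPC stream);
// the interface is hypothetical.
type snapshotStream interface {
	Send(chunk []byte) error
}

// sendSnapshot streams key-value data in bounded chunks. nextKV returns the
// next encoded key-value pair and false once the snapshot is exhausted.
func sendSnapshot(stream snapshotStream, nextKV func() ([]byte, bool)) error {
	const chunkSize = 256 << 10 // 256 KiB per message (illustrative)
	buf := make([]byte, 0, chunkSize)
	for {
		kv, ok := nextKV()
		if !ok {
			break
		}
		buf = append(buf, kv...)
		if len(buf) >= chunkSize {
			if err := stream.Send(buf); err != nil {
				return err
			}
			buf = buf[:0] // reuse the buffer; memory stays bounded
		}
	}
	if len(buf) > 0 {
		return stream.Send(buf)
	}
	return nil
}
```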

Snapshots are used by nodes to replicate information to other nodes for repair (if a node is lost), or rebalancing (to spread load evenly between nodes in a cluster). Rebalancing is accomplished with a straightforward algorithm:


1. Nodes periodically advertise the number of range replicas they maintain.

2. Each node computes the mean replica count across all nodes, and decides:

  • If a node is underfull compared to the mean, it does nothing
  • If overfull, it rebalances via snapshot to an underfull node

Our error was in applying this judgement too literally, without enough of a threshold around the mean to avoid “thrashing”. See the animated diagram below, which shows two scenarios.


(See the animated simulation in the original post.)

Simulation 1. In the left “Exact Mean” simulation, we rebalance to within a single replica of the mean; the cluster never stops rebalancing. Notice that far more RPCs are sent and the simulation never reaches equilibrium.


In the right “Threshold of Mean” simulation, we rebalance only to within a threshold of the mean, and the cluster quickly reaches equilibrium. In practice, continuous rebalancing crowded out other, more salient work being done in the cluster.

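Here is a minimal sketch of the threshold idea, with illustrative names and an arbitrary 5% band; it is not the actual allocator code.

```go
package allocator

// shouldRebalance illustrates the fix: only treat a node as overfull when its
// replica count exceeds the cluster mean by a threshold, so nodes hovering
// near the mean don't endlessly trade replicas back and forth ("thrashing").
func shouldRebalance(replicaCount, mean float64) bool {
	const threshold = 0.05 // tolerate counts within 5% of the mean (illustrative)
	return replicaCount > mean*(1+threshold)
}
```

With an exact-mean rule (a threshold of zero), a node at 101 replicas against a mean of 100 immediately sheds a replica, pushes some other node above the mean, and keeps the cluster churning indefinitely.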

Lock Refactoring

Tracing tools were invaluable in diagnosing lock-contention as a cause of excessively slow or deadlocked clusters. Most of these symptoms were caused by holding common locks during processing steps which could sometimes take an order of magnitude longer than originally supposed. Pileups over common locks resulted in RPC traffic jams and excessive client latencies. The solution was lock refactoring.


Locks held during Raft processing, in particular, proved problematic as commands for ranges were executed serially, holding a single lock per range. This limited parallelization and caused egregious contention for long-running commands, notably replica snapshot generation. Garbage collection of replica data after rebalancing was previously protected by a common lock in order to avoid tricky consistency issues. Replica GC work is time consuming and impractical to do while holding a per-node lock covering actions on all stores. In both cases, the expedient solution of coarse-grained locking proved inadequate and required refactoring.

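As a rough sketch of the refactoring direction (the real code is more involved, and the names here are illustrative): long-running work such as replica GC moves out from under a single coarse lock and instead holds only a narrow per-replica lock, so one slow operation no longer stalls every replica on the node.

```go
package storelocks

import "sync"

type replica struct {
	mu   sync.Mutex // fine-grained: guards only this replica's state
	data map[string][]byte
}

type store struct {
	mu       sync.Mutex // coarse lock now guards only the replica map itself
	replicas map[int64]*replica
}

// gcReplica performs long-running cleanup while holding only the target
// replica's lock; the store-wide lock is held just long enough for the lookup.
func (s *store) gcReplica(rangeID int64) {
	s.mu.Lock()
	r, ok := s.replicas[rangeID]
	s.mu.Unlock()
	if !ok {
		return
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	for k := range r.data { // slow work, but other replicas stay unblocked
		delete(r.data, k)
	}
}
```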

Tracing Tools

Ironically, the same tracing tools used to diagnose degenerate locking behavior were themselves stability culprits. Our internal tracing tools were pedantically storing complete dumps of KV and Raft commands while those spans were held in a trace’s ring buffer. This was fine for small commands, but quickly caused Out-of-Memory (OOM) errors for larger commands, especially pre-streaming snapshots. A silver lining to our various OOM-related difficulties was development of fine-grained memory consumption metrics, tight integration with Go and C++ heap profiling tools, and integration with Lightstep, a distributed tracing system inspired by Google’s Dapper.

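A hedged sketch of the kind of fix this implies (the span interface and limit below are assumptions, not the actual tracing code): record only a bounded summary of each command in a trace span, so a large snapshot can no longer pin its full payload in the trace’s ring buffer.

```go
package tracing

import "fmt"

// span is a stand-in for a trace span; the interface is hypothetical.
type span interface {
	Logf(format string, args ...interface{})
}

const maxLoggedBytes = 1 << 10 // keep at most 1 KiB of payload per event (illustrative)

// logCommand records a bounded description of a command rather than a
// complete dump, keeping per-span memory independent of command size.
func logCommand(sp span, payload []byte) {
	prefix := payload
	if len(prefix) > maxLoggedBytes {
		prefix = prefix[:maxLoggedBytes]
	}
	sp.Logf("command: %d bytes total, first %d: %x", len(payload), len(prefix), prefix)
}
```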

Corruption!

OOMs and deadlocks are often diagnosed and fixed through honest labor that pays an honest wage. What keeps us up at night are seemingly impossible corruption errors. Some of these occur between replicas (i.e. replicas don’t agree on a common checksum of their contents). Others are visible when system invariants are broken. These kinds of problems have been rare, though we found one during our stability code yellow.


CockroachDB uses a bi-level index to access data in the system. The first level lives on a special bootstrap range, advertised via gossip to all nodes. It contains addressing information for the second level, which lives on an arbitrary number of subsequent ranges. The second level, finally, contains addressing information for the actual system data, which lives on the remaining ranges.

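A sketch of the lookup path that description implies; the functions and types below are illustrative placeholders rather than CockroachDB’s real API.

```go
package addressing

// rangeDescriptor identifies a range and where its replicas live.
type rangeDescriptor struct {
	StartKey, EndKey string
	ReplicaNodes     []int
}

func lookupFirstLevel(key string) rangeDescriptor {
	// Placeholder: consult the bootstrap range, whose location every node
	// learns via gossip, to find the second-level range indexing this key.
	return rangeDescriptor{}
}

func lookupSecondLevel(meta rangeDescriptor, key string) rangeDescriptor {
	// Placeholder: scan the second-level range for the addressing record of
	// the data range that actually holds the key.
	return rangeDescriptor{}
}

// resolveKey walks the bi-level index: bootstrap range -> second-level
// addressing range -> descriptor of the range holding the data.
func resolveKey(key string) rangeDescriptor {
	secondLevel := lookupFirstLevel(key)
	return lookupSecondLevel(secondLevel, key)
}
```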

Addressing records are updated when ranges split and are rebalanced or repaired. They are updated like any other data in the system, using distributed transactions, and should always be consistent. However, a second level index addressing record went unexpectedly missing. Luckily, Ben Darnell, our resident coding Sherlock Holmes, was able to theorize a gap in our model which could account for the problem, despite requiring an obscure and unlikely sequence of events, and perfect timing. It’s amazing what a brilliant engineer can intuit from code inspection alone. Also, there ought to be a maxim that in a sufficiently large distributed system, anything that can happen, will happen.


Raft

Last, and certainly not least, we waged an epic struggle to tame Raft, our distributed consensus algorithm. In a resonant theme of these technical explanations, we had originally concluded that improvements to Raft that were on the drawing board could wait until after our general availability release. They were seen as necessary for much larger clusters, while the Raft algorithm’s impedance mismatch with CockroachDB’s architecture could simply be ignored for the time being. This proved a faulty assumption.


Impedance mismatch? Yes, it turns out that Raft is a very busy protocol and typically suited to applications where only a small number of distinct instances, or “Raft groups”, are required. However, CockroachDB maintains a Raft group per range, and a large cluster will have hundreds of thousands or millions of ranges. Each Raft group elects a leader to coordinate updates, and the leader engages in periodic heartbeats to followers. If a heartbeat is missed, followers elect a new leader. For a large CockroachDB cluster, this meant a huge amount of heartbeat traffic, proportional to the total number of ranges in the system, not just ranges being actively read or written, and it was causing massive amounts of network traffic. This, in conjunction with lock contention and snapshots, would cause chain reactions. For example, too many heartbeats would fill network queues causing heartbeats to be missed, leading to reelection storms, thus bringing overall progress to a halt or causing node panics due to unconstrained memory usage. We had to fix this dynamic.

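To give a feel for the scale, here is a back-of-envelope illustration with hypothetical numbers (not measurements from our clusters):

```go
package main

import "fmt"

func main() {
	// All figures below are hypothetical, for illustration only.
	ranges := 1_000_000     // total ranges in a large cluster
	followersPerRange := 2  // leader heartbeats each follower at 3x replication
	heartbeatsPerSec := 2.0 // e.g. a 500ms heartbeat interval

	total := float64(ranges*followersPerRange) * heartbeatsPerSec
	fmt.Printf("~%.0f heartbeat messages/sec, independent of client traffic\n", total)
}
```

Even if the real constants differ, the point stands: background heartbeat load scales with the total range count, not with active load.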

We undertook two significant changes. The first was lazy initialization of Raft groups. Previously, we’d cycle through every replica contained on a node at startup time, causing each to participate in their respective Raft groups as followers. Being lazy dramatically eased communication load on node startup. However, being lazy isn’t free: Raft groups require more time to respond to the first read or write request if they’re still “cold”, leading to higher latency variance. Still, the benefits outweighed that cost.

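A minimal sketch of lazy initialization, under illustrative names: the Raft group for a range is created on first use rather than at node startup.

```go
package lazyraft

import "sync"

type raftGroup struct {
	// consensus state elided
}

type node struct {
	mu     sync.Mutex
	groups map[int64]*raftGroup // rangeID -> initialized Raft group
}

// groupFor returns the Raft group for a range, creating it on demand. At
// startup the map is empty, so the node no longer announces itself to every
// group it hosts; the trade-off is extra latency on the first request to a
// "cold" range.
func (n *node) groupFor(rangeID int64) *raftGroup {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.groups == nil {
		n.groups = make(map[int64]*raftGroup)
	}
	g, ok := n.groups[rangeID]
	if !ok {
		g = &raftGroup{}
		n.groups[rangeID] = g
	}
	return g
}
```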

The success of lazy initialization led to a further insight: if Raft groups didn’t need to be active immediately after startup, why couldn’t they simply be decommissioned after use? We called this process “quiescence”, and applied it to Raft groups where all participants were fully replicated with no pending traffic remaining. The final heartbeat to the Raft group contains a special flag, telling participants to quiesce instead of being ready to campaign for a new leader if the leader fails further heartbeats.


(See the animated simulation in the original post.)

Simulation 2. In the left “Naive Raft” simulation, notice the near-constant stream of heartbeats, denoted by the red RPCs between Raft groups. These continue despite only a slow trickle of writes from applications. In the right “Quiescing Raft” simulation, Raft heartbeats occur only to quiesce ranges after write traffic.

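A hedged sketch of the quiescence handshake; the message fields and follower state are illustrative, not the actual implementation:

```go
package quiesce

// heartbeat stands in for the Raft heartbeat message a leader sends.
type heartbeat struct {
	RangeID int64
	Commit  uint64 // leader's committed log index
	Quiesce bool   // set on the final heartbeat when the range is idle and fully replicated
}

type follower struct {
	commit    uint64
	quiescent bool // while true, the follower stops ticking its election timer
}

// handleHeartbeat applies a leader heartbeat. A follower only quiesces when it
// is fully caught up; otherwise it stays awake so it can still campaign if the
// leader really has failed. Any later Raft message un-quiesces the range.
func (f *follower) handleHeartbeat(hb heartbeat) {
	if hb.Commit > f.commit {
		f.commit = hb.Commit
	}
	if hb.Quiesce && f.commit >= hb.Commit {
		f.quiescent = true
		return
	}
	f.quiescent = false
}
```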

In addition to other changes, such as Raft batching, we managed to meaningfully reduce background traffic. By doing so, we also directly contributed to another key product goal, to constrain network, disk, and CPU usage to be directly proportional to the amount of data being read or written, and never proportional to the total size of data stored in the cluster.


Outcomes

How did our process and management initiatives fare in addressing the three hypothesized root causes?


Working With Two Branches

Splitting the master branch was not without costs. It added significant overhead in near-daily merges from the master branch to develop in order to avoid conflicts and maintain compatibility with stability fixes. We effectively excluded changes to “core” packages from the develop branch in order to avoid a massive merge down the road. This held up some developer efforts, refactorings in particular, making it unpopular. In particular, Tamir Duberstein was a martyr for the stability cause, suffering the daily merge from master to develop at first quietly, and then with mounting frustration.


Was the split branch necessary? A look at the data suggests not. There was significant churn in the develop branch, which counted 300 more commits during the split branch epoch. Despite that, there was no regression in stability when the branches were merged. We suspect that the successful merge is more the result of limits on changes to core components than to the split branches. While there is probably a psychological benefit to working in isolation on a stability branch, nobody is now arguing that was a crucial factor.


The CockroachDB Stability Team

Designating a team with stability as the specific focus, and putting a single person in charge, proved invaluable. In our case, we drafted very experienced engineers, which may have led to a productivity hit in other areas. Since this was temporary, it was easy to justify given the severity of the problem.


Relocating team members for closer proximity felt like it meaningfully increased focus and productivity when we started. However, we ended up conducting a natural experiment on the efficacy of proximity. First two, and then three, out of the five stability team members ended up working remotely. Despite the increasing ratio of remote engineers, we did not notice an adverse impact on execution.


What ended up being more important than proximity were daily “stability sync” stand ups. These served as the backbone for coordination, and required only 30 minutes each morning. The agenda is (and remains): 1) status of each test cluster; 2) who’s working on what; 3) group discussion on clearing any blocking issues.


We also held a twice-weekly “stability war room” and pressed any and all interested engineers into the role of “production monkey” each week. A production monkey is an engineer dedicated to overseeing production deployments and monitoring. Many contributions came from beyond the stability team, and the war rooms were a central point of coordination for the larger engineering org. Everyone pitching in with production duties raised awareness and familiarized engineers with deployment and debugging tools.


Fewer People, More Scrutiny

A smaller team with a mandate for greater scrutiny was a crucial success factor. In a testament to that, the structure has become more or less permanent. An analogy for achieving stability and then maintaining it is to imagine swimming in the ocean at night with little sense of what’s below or in which direction the shoreline is. We were pretty sure we weren’t far from a spot we could put our feet down and stop swimming, but every time we tried, we couldn’t touch bottom. Now that we’ve finally found a stable place, we can proceed with confidence; if we step off into nothingness, we can swim back a pace to reassess from a position of safety.


We now merge non-trivial changes to core components one-at-a-time by deploying the immediately-prior SHA, verifying it over the course of several hours of load, and then deploying the non-trivial change to verify expected behavior without regressions. This process works and has proven dramatically effective.


The smaller stability team instituted obsessive review and gatekeeping for changes to core components. In effect, we went from a state of significant concurrency and decentralized review to a smaller number of clearly delineated efforts and centralized review.


Somewhat counter-intuitively, the smaller team saw an increase per engineer in pull request activity (see stream chart below).


Conclusions on CockroachDB Stability

In hindsight (and in Hacker News commentary), it seems negligent to have allowed stability to become such a pressing concern. Shouldn’t we have realized earlier that the problem wasn’t going away without changing our approach? One explanation is the analogy of the frog in the slowly heating pot of water. Working so closely with the system, day in and day out, we failed to notice how stark the contrast had become between our stability expectations pre-beta and the reality in the months that followed. There were many distractions: rapid churn in the code base, new engineers starting to contribute, and no team with stability as its primary focus. In the end, we jumped out of the pot, but not before the water had gotten pretty damn hot.


Many of us at Cockroach Labs had worked previously on complex systems which took their own sweet time to stabilize – enough of us, in fact, that we hold a deep-seated belief that such problems are tractable. We posited that if we stopped all other work on the system, a small group of dedicated engineers could fix stability in a matter of weeks. I can’t stress enough how powerful belief in an achievable solution can be.


Could we have avoided instability?

Ah, the big question, and here I’m going to use “I” instead of “we”.


Hacker News commentary on my previous blog post reveals differing viewpoints. What I’m going to say next is simply conjecture as I can’t assert the counterfactual is possible, and nobody can assert that it’s impossible. However, since I’m unaware of any complex, distributed system having avoided a period of instability, I’ll weakly assert that it’s quite unlikely. So, I’ll present an argument from experience, with the clear knowledge that it’s a fallacy. Enough of a disclaimer?


I’ve worked on several systems in the same mold as CockroachDB and none required less than months to stabilize. Chalk one up for personal anecdote. While I didn’t work on Spanner at Google, my understanding is that it took a long time to stabilize. I’ve heard estimates as long as 18 months. Many popular non-distributed databases, both SQL and NoSQL, open source and commercial, took years to stabilize. Chalk several up for anecdotal hearsay.


While proving distributed systems correct is possible, it likely wouldn’t apply to the kinds of stability problems which have plagued CockroachDB. After all, the system worked as designed in most cases; there was emergent behavior as a result of complex interactions.


I’d like to conclude with several practical suggestions for mitigating instability in future efforts.


  • Define a less ambitious minimally viable product (MVP) and hope to suffer less emergent complexity and a smaller period of instability. Proceed from there in an incremental fashion, preventing further instability with a careful process to catch regressions.


  • When a system is functionally complete, proceed immediately to a laser focus on stability. Form a team with an experienced technical lead, and make stability its sole focus. Resist having everyone working on stability. Clearly define accountability and ownership.


  • Systems like CockroachDB must be tested in a real world setting. However, there is significant overhead to debugging a cluster on AWS. The cycle to develop, deploy, and debug using the cloud is very slow. An incredibly helpful intermediate step is to deploy clusters locally as part of every engineer’s normal development cycle (use multiple processes on different ports). See the allocsim and zerosum tools.


Does building a distributed SQL system and untangling all of its parts sound like your ideal Tuesday morning? If so, we’re hiring! Check out our open positions here.
