Automatic table statistics in CockroachDB
Last year, we rebuilt our cost-based optimizer from scratch for CockroachDB’s 2.1 release. We’ve been continuing to improve the optimizer since then, and we’ve added a number of new features for the CockroachDB 19.1 release. One of the new features is automatic collection of table statistics. Automatic statistics enables the optimizer to make better decisions when choosing query plans.
This post explains why statistics are important for the optimizer and describes some of the challenges we overcame when implementing automatic collection.
A Tale of Two Queries
Consider the following two SQL queries, used by an imaginary kitchen supplies company to count the number of toaster ovens purchased by customers in New York City, grouped by the date of purchase:
-- query A

-- query B
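For concreteness, here is one hypothetical way query A and query B might be written, over an illustrative schema customers(id, city) and orders(cust_id, product, purchased); the schema, names, and exact predicates are illustrative only:

```sql
-- query A: join customers and orders, then filter and aggregate
SELECT o.purchased, count(*)
FROM customers AS c
JOIN orders AS o ON o.cust_id = c.id
WHERE c.city = 'New York' AND o.product = 'toaster oven'
GROUP BY o.purchased;

-- query B: restrict orders first, then check the customer's city with a subquery
SELECT o.purchased, count(*)
FROM orders AS o
WHERE o.product = 'toaster oven'
  AND o.cust_id IN (SELECT id FROM customers WHERE city = 'New York')
GROUP BY o.purchased;
```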
With minimal optimization, these queries correspond to the following logical query plans:
query A: (logical query plan diagram)

query B: (logical query plan diagram)
These queries compute the same result, but one query is vastly more expensive than the other. Can you tell which one? (scroll down to see the answer…)
Spoiler alert: it’s not possible to determine which query is faster without more information. So suppose I give you the following statistics:

Table 1: row counts and distinct counts for the customers and orders tables
Now can you tell?
Before I give the answer, let’s discuss what makes a query expensive in the first place, and why you should care. The “cost” of a query roughly translates to the total amount of computing resources required to execute the query plan, including CPU, I/O and network. Lower-cost plans tend to execute faster (i.e., they have lower latency), and also enable the system to maintain a higher throughput, since more resources are available for executing other queries simultaneously. That means less money spent on extra hardware, and less time waiting for results!
How do you estimate the cost of a query?
We can estimate the cost of a particular query plan by estimating the cost of each operation in the plan and adding them together. For example, consider Query Plan A. You can think of data as flowing through the plan from bottom to top. First we scan table customers and apply a filter, then we scan table orders and join the two together, etc. Each stage in the plan costs a certain amount depending on the type of operation and the amount of data that must be processed. Operations involving I/O are generally more expensive than operations involving only CPU, so reading 10,000 rows from disk (e.g., as part of a scan) is much more expensive than applying a filter to those rows. The actual cost formula varies for each operator and is beyond the scope of this blog entry, but the important thing to know is that the cost of each operation is directly proportional to the number of rows processed.
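As a rough sketch (the form here is illustrative; CockroachDB's actual cost formulas are more detailed), you can think of the total cost as a sum over the plan's operators:

$$ \text{cost}(\text{plan}) = \sum_{op} c_{op} \cdot \text{rows}(op) $$

where $\text{rows}(op)$ is the estimated number of rows operator $op$ processes, and $c_{op}$ is a per-row constant that is larger for I/O-heavy operators such as scans than for CPU-only operators such as filters.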
So how do we estimate the number of rows processed by each operator? With statistics, of course!
Similar to how data propagates from the bottom to the top of a query plan, we can also propagate statistics from the bottom to the top to estimate the number of rows at each step.
Let’s look at the example of query plan A. Given the stats from Table 1, we know that customers has 100,000 rows. Since in this simple example we don’t have any indexes on customers, the scan must process all of the rows. Similarly, the filter operator must process all rows produced by the scan to test the filter predicate city='New York' on each row. But this is where it gets interesting. To estimate the number of rows that match the filter predicate and pass on to the join operator, we need to calculate the selectivity of the filter, or the percentage of rows that are likely to match. For this calculation, we make the simplifying assumption that all values are uniformly distributed in each column. For example, we assume that a column with 30 rows and 3 distinct values will have 10 of each value.
In our running example, we can see from Table 1 that column city has only two distinct values (it turns out our imaginary kitchen supply company only has two locations). After we apply the predicate city='New York', there is at most one distinct value, the value 'New York'. Utilizing the assumption of uniformity, we can estimate the selectivity of the predicate as 1/2. With 100,000 input rows, we estimate that after the predicate is applied there will be 100,000 * 1/2 = 50,000 rows.
Note that this result of 50,000 rows is just an estimate, since the assumption of uniformity is not always correct. It’s possible that the data is skewed and nearly all of the rows match city='New York'. It’s also possible that almost none of the rows match city='New York'. In the next release we will incorporate histograms into our estimates to handle such cases. But in practice, the uniformity assumption works reasonably well for query optimization.
The cost-based optimizer does not need to know the exact cost of a query plan; it just needs a good enough estimate that it can compare two plans and select the relatively cheaper plan.
Given that we estimate 50,000 rows from the filter, how do we propagate the statistics up the tree? We can’t directly apply the 0.5 filter selectivity to the distinct count, since that would be statistically incorrect. Suppose that there is another column in customers that tracks the customer’s gender. Assuming that gender is independent of city (which is true, last time I checked…), it is likely that gender will still have (at least) two distinct values after the half of the rows that don’t match city='New York' are removed. If we were to blindly apply the selectivity to distinct counts (similar to how we applied it to the row count), we would estimate that gender had only one distinct value, which is clearly incorrect given the assumption of independence. Instead, we use the following formula:
$$ d' = d - d \cdot (1 - \text{selectivity})^{n/d} $$

where $n$ is the number of input rows, $d$ is the distinct count before the selectivity is applied, and $d'$ is the distinct count after. It can be derived as follows: if each distinct value appears $n/d$ times, and the probability of a row being filtered out is $(1 - \text{selectivity})$, then the probability that all $n/d$ rows for a given value are filtered out is $(1 - \text{selectivity})^{n/d}$. So the expected number of distinct values that are filtered out is $d \cdot (1 - \text{selectivity})^{n/d}$. This formula has the nice property that it returns $d \cdot \text{selectivity}$ when $d = n$, but it’s closer to $d$ when $d \ll n$.
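Plugging in the running example as a sanity check: for the gender column, with $d = 2$ distinct values, $n = 100{,}000$ input rows, and a filter selectivity of $0.5$,

$$ d' = 2 - 2 \cdot (1 - 0.5)^{100{,}000/2} \approx 2, $$

so the estimate keeps (roughly) both distinct values of gender, whereas blindly multiplying by the selectivity would have left only one.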
Similar to the uniformity assumption, the independence assumption is not always correct, but it makes calculations significantly simpler. We’ll likely relax the independence assumption in future releases of the optimizer.
Moving further up the query plan, we must next calculate the number of rows produced by the join. This is a notoriously difficult problem, and remains an open area of research [1]. For equi-joins such as the ones in queries A and B, we make the simplifying assumption that we have a primary key-foreign key join, so the output cardinality will be equal to the larger of the two inputs multiplied by the selectivity of any filters that have been applied. Distinct counts are not affected significantly by joins.
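As a quick illustration with made-up numbers (the orders row count below is hypothetical): if orders has 1,000,000 rows and the only applied filter is the city predicate with selectivity 0.5, the estimated join output is

$$ \max(1{,}000{,}000,\ 100{,}000) \times 0.5 = 500{,}000 \text{ rows.} $$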
As the last step in both query plans, we apply a GROUP BY operation which calculates the number of matching orders grouped by purchased date. Luckily, calculating the number of output rows of a GROUP BY operator is easy: it’s equal to the number of distinct values of the grouping column(s).
Although this example only covers a subset of all SQL operators, it gives a good flavor for how we use statistics to estimate the cost of different query plans. Hopefully now you can see that query A is significantly more expensive than query B. The CockroachDB optimizer will always transform query A so that it uses the same query plan as B. And for good reason: when we execute the two plans, we see that query A takes 1.9 seconds while query B takes 16 ms: over a 100X difference!
How CockroachDB Collects Table Statistics
So how do we get statistics such as the ones shown in Table 1 that are used by the optimizer to calculate the cost of candidate query plans? In CockroachDB, we collect statistics using the command CREATE STATISTICS. CREATE STATISTICS performs a distributed, full table scan of the specified table, and calculates statistics in parallel on each node before a final merge (see Figure 1). After collection, statistics are stored in the system table system.table_statistics, and are cached on each node for fast access by the optimizer.
Figure 1: The distributed SQL plan of a CREATE STATISTICS statement on 5 nodes
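For example, a manual collection run on the customers table looks something like this (the statistics name is arbitrary, and an optional ON clause restricts collection to specific columns):

```sql
-- Collect statistics on the customers table, stored under the name customers_stats.
CREATE STATISTICS customers_stats FROM customers;

-- Collect statistics only for the city column.
CREATE STATISTICS city_stats ON city FROM customers;
```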
The statistics we currently support are:
Row Count
This is an exact row count as of the time of the table scan. Although this is the simplest statistic, it’s also the most important. As you learned in the previous section, row count is the primary input to our cost model for determining the cost of each candidate query plan.
Distinct Count
For each index column and a subset of other columns in each table, we estimate the number of distinct values. As described above, distinct counts are useful for estimating the selectivity of filter predicates and estimating the number of rows output by GROUP BY and other relational SQL operators. We don’t calculate exact distinct counts, because for columns with many distinct values, the calculation could require nearly as much memory and disk space as the entire table itself! Instead, we use a variant of the well-known HyperLogLog algorithm called HLL-TailCut+ [2] and the excellent golang implementation by Axiom, Inc. [3]. HyperLogLog estimates distinct counts with high accuracy and very low memory overhead. It can also be calculated in parallel across several nodes and efficiently merged to produce a distinct count estimate for the full table.
Null Count
We calculate null counts explicitly, as NULL is a value that can appear quite often in some workloads. We can get better cardinality estimates for certain queries by tracking this count separately.
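To inspect what has been collected for a table, there is also a SHOW command; its output includes the row count, distinct count, and null count for each set of columns with statistics:

```sql
SHOW STATISTICS FOR TABLE customers;
```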
In the future, we plan to support more statistics, including multi-column distinct counts and histograms.
Changes to Table Statistics in CockroachDB 19.1
The CREATE STATISTICS command existed in the 2.1 release, but we found that most customers were not taking advantage of it, and the few that knew about it were not using it effectively. There were a few reasons for this:
We did not advertise the feature, so many customers were not aware of it. We did this purposefully since, as described below, CREATE STATISTICS is difficult to use correctly.
Even if they discovered the command on their own, they might have run it once and forgotten to run it again. This is a problem because statistics can quickly become stale for tables that are modified frequently. Running CREATE STATISTICS once can be worse than never running it at all, since we have some reasonable defaults in the optimizer if there are no stats. But we’ll always use whatever stats are available, so suppose you create a table, add one row, and then run CREATE STATISTICS. And then you add 1 million rows but forget to refresh the stats. The optimizer will choose a plan optimized for the case of one row, even though that may be an awful plan for the case of one million rows.
A savvy user might have realized that they needed to refresh stats periodically, but then it was not clear how often to perform a refresh. The optimal frequency may be different for every table, since some tables are updated frequently and others are updated less frequently. Furthermore, some tables are large, and even a lot of updates don’t change the stats much, while some tables are small and a few updates can drastically change the stats.
Even if a user managed to overcome all of these hurdles, we were not satisfied with the status quo. Our mission at Cockroach Labs is to “make data easy”, and forcing users to perform their own periodic stats refreshes did not fit with this mission.
The Solution: Automatic Statistics Collection
The solution, of course, is for CockroachDB to perform statistics collection automatically. Our idea at the beginning of the release was to detect when some significant portion of the data in a table had changed, and then automatically trigger statistics collection by having the system call CREATE STATISTICS. This introduced two key challenges:
We want to detect when a significant amount of data has changed, but there should be negligible impact on the performance of INSERT, UPDATE and DELETE transactions.
Once we’ve decided to trigger a statistics refresh, the collection of statistics should have negligible impact on the performance of concurrent transactions.
Let’s examine each of these challenges:
Deciding to Trigger a Refresh
In order to decide when to trigger a refresh, we want to know when a “significant” portion of a table has changed. To make this concrete, let’s say that we want to update stats after approximately 20% of the rows in a table have been inserted, updated or deleted. The advantage of using a percentage instead of an absolute row count is that it allows the frequency of refreshes to scale inversely with the size of the table.
So the challenge is to determine when 20% of rows have changed. Perhaps you’re thinking, “this doesn’t seem very hard…”. At first glance, it does seem straightforward; we know how many rows are affected by each mutation operation on each node. But the problem is that we can have many mutation operations happening simultaneously to the same table on different nodes, and we’d like to avoid having a central counter somewhere that could be a source of contention. The only way a geo-distributed system like CockroachDB scales is if nodes are largely independent.
Luckily, the decision to refresh a table after some percentage of rows have changed does not need to be exact. There is a tradeoff between more frequent refreshes (which add overhead to the cluster), and less frequent refreshes (which could result in suboptimal query plans). We have found that refreshing with 20% of rows “stale” provides a good balance, but 10% or 30% would also be perfectly fine in most cases.
Given this flexibility, we decided to solve the problem by using the following statistical approach: after every mutation operation (e.g., INSERT, UPDATE or DELETE), there is a small chance of statistics getting refreshed. In particular, the probability of a refresh is:
$$ P(\text{refresh}) = \frac{\text{number of rows inserted, updated, or deleted}}{\text{number of rows in table} \times 0.20} $$
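For example, on a table with 100,000 rows, a statement that inserts, updates, or deletes 500 rows would trigger a refresh with probability 500 / (100,000 × 0.20) = 2.5%; in expectation, a refresh fires once roughly 20% of the rows have been touched.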
To implement this statistical approach, we essentially use the following simple algorithm (although there are some complexities discussed in the next section): after each mutation, generate a random number between 0 and 1, and if it falls below this probability value, kick off a CREATE STATISTICS run. What this means is that over time, stats for each table are refreshed after approximately 20% of rows have changed. We also have some guard rails in place in case there are statistical outliers. In particular, we always refresh stats for a table if it has no stats yet or if it has been a long time since the last refresh.
Running the Refresh
The second challenge was ensuring that running a statistics refresh did not significantly impact running workloads. It’s one thing if there is high overhead when a user knowingly runs CREATE STATISTICS on the command line, but it’s another thing if we trigger it without their knowledge and all of a sudden the command is using half the resources in the cluster. Since a statistics refresh can happen at any time, it must have minimal overhead.
This requirement forced us to rethink the solution described in the last section for triggering a refresh after 20% of rows changed. In particular, the simple algorithm of possibly triggering a refresh after each mutation was problematic. There can be many updates per second, but each CREATE STATISTICS run can take minutes. As a result, we could have multiple threads scanning the same table to collect statistics at the same time. This is a big problem, since a single full table scan can impact performance. Many table scans at once can actually bring down the cluster.
The solution we came up with was to have one background “refresher” thread running on each node. Mutation operations such as INSERT, UPDATE and DELETE send messages to that thread with the table and number of rows affected, and the refresher aggregates the counts of rows updated for each table on that node. Periodically, the refresher thread starts up a separate thread that uses the latest counts to possibly kick off a statistics refresh for each table.
The background refresher thread ensures that at most one statistics refresh is triggered at once per node, but it does not provide any coordination between nodes. To ensure that at most one automatic statistics refresh is running globally on the cluster, we took advantage of the existing “jobs” infrastructure used to run commands like IMPORT, BACKUP and RESTORE. In the last release, CREATE STATISTICS was a normal SQL statement, but we changed it during this release cycle to run as a job. Now, every time CREATE STATISTICS is called, it creates an entry in the system.jobs table. If a node wants to perform a refresh, it first checks the system.jobs table to be sure that there are no other statistics jobs currently running. In this way, we ensure that only one statistics refresh is active at once. The jobs infrastructure ensures that we always make progress, since if a node dies, another node will adopt the job.
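A side benefit is that statistics collections are now visible through the standard jobs interface. For example, a query along these lines lists them (the exact job type strings may vary by version):

```sql
SELECT job_id, job_type, description, status
FROM [SHOW JOBS]
WHERE job_type LIKE '%CREATE STATS%';
```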
Even with all of this infrastructure in place, we found that the overhead of a single CREATE STATISTICS job was too high for some workloads. Workloads with heavy utilization of resources saw throughput drops and latency spikes each time a statistics refresh started. This observation led us to the final requirement: we needed to limit the overhead of each individual CREATE STATISTICS job. The solution we used was to throttle the execution of statistics collection. Specifically, we insert some idle time (i.e., sleep) after every 10,000 rows are processed. The amount of idle time is adaptive, and depends on the current CPU utilization of the cluster. If utilization is high, we sleep more, and if utilization is low, we turn off throttling altogether.
The Practical Stuff
By this point you may be wondering, “How do I actually use this feature?” Consistent with our mission to “make data easy”, you don’t need to do anything other than upgrade to CockroachDB 19.1. Automatic statistics are enabled by default in 19.1, so unless you explicitly disable the feature, the system will trigger statistics refreshes as needed.
Although we’ve made every effort to minimize the impact of automatic statistics collection on the performance of the system, there will always be some overhead due to the cost of performing a full table scan for each refresh. For most workloads, the amortized impact on performance is less than 1%, but we’ve seen some cases with larger performance degradation. If your workload is negatively affected, our documentation on automatic statistics in CockroachDB describes how you can adjust the frequency of refreshes or disable automatic collection altogether. It’s still possible to run CREATE STATISTICS manually if you want full control over when stats are refreshed.
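The knobs for this live in cluster settings. For example (setting names as of 19.1; check the documentation for the exact names in your version):

```sql
-- Disable automatic statistics collection entirely.
SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false;

-- Require a larger fraction of a table's rows to be stale before a refresh is considered.
SET CLUSTER SETTING sql.stats.automatic_collection.fraction_stale_rows = 0.5;
```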
Conclusion
We hope this post has convinced you that statistics are essential for the optimizer to produce good query plans for a wide variety of queries, and we hope it has allowed you to peek inside the optimizer to learn a bit about how it works. CockroachDB 19.1 provides tools to collect statistics both manually and automatically, with the goal of keeping statistics fresh for the optimizer to use with minimal performance impact.