在学习mysql统计信息和执行计划的过程中,发现了一个挺有意思的问题。就是mysql优化器在没有创建直方图的情况下,在评估rows上能做到非常精准。
oracle CBO在评估rows时,如果没有非常准的频率直方图,是无法做的精准的rows评估的。因为oracle在对象存在统计信息的情况下,所有cost、rows评估都是完全依赖统计信息的。其中rows的评估依赖dba_tables.num_rows、dba_tab_columns.num_distinct、num_nulls、density、low_value、high_value等等。
看一个oracle简单的例子
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 |
SQL> explain plan for select * from test.t where c='C'; Explained. SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Plan hash value: 1386885587 --------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 350 | 2450 | 2 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| T | 350 | 2450 | 2 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN | IDX_T_C | 350 | | 1 (0)| 00:00:01 | --------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("C"='C') SQL> explain plan for select * from test.t where c='D'; Explained. SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Plan hash value: 1386885587 --------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 175 | 1225 | 2 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| T | 175 | 1225 | 2 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN | IDX_T_C | 175 | | 1 (0)| 00:00:01 | --------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("C"='D') SQL> select c,count(*) from test.t group by c; C COUNT(*) ---------------------------------------- ---------- A 1000 B 10 C 40 |
可以看到c=’C’或’D’都与实际值相差较多。实际上对于rows评估,oracle是完全依赖于统计信息的。
1 2 3 4 5 6 7 8 9 10 11 |
SQL> select NUM_DISTINCT,NUM_NULLS,DENSITY,LOW_VALUE,HIGH_VALUE from dba_tab_columns where owner='TEST' and table_name='T' and column_name='C'; NUM_DISTINCT NUM_NULLS DENSITY LOW_VALUE HIGH_VALUE ------------ ---------- ---------- ---------- ---------- 3 0 .333333333 41 43 SQL> select NUM_ROWS from dba_tables where owner='TEST' and table_name='T'; NUM_ROWS ---------- 1050 |
- c=’C’落在LOW_VALUE和HIGH_VALUE之间,rows评估公式如下:
1 |
rows=(NUM_ROWS-NUM_NULLS)/NUM_DISTINCT |
所以rows=1050/3=350
- c=’D’在LOW_VALUE和HIGH_VALUE之外,rows评估公式如下:
1 |
rows=NUM_ROWS/(2*DENSITY) |
所以rows=1050/2/3=175
可以看出oracle在字段数据倾斜很严重的时候,rows评估会造成比较大的偏差,所以才会有直方图的出现。即使有直方图,oracle也无法非常精准的评估出rows,因为CBO是完全依赖统计信息的。
下面来看看mysql的测试:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 |
mysql> select c,count(*) from t group by c; +------+----------+ | c | count(*) | +------+----------+ | A | 2000 | | B | 1000 | | C | 1000 | | D | 500 | | E | 800 | | F | 700 | +------+----------+ 6 rows in set (0.01 sec) mysql> explain select * from t where c='D'; +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ | 1 | SIMPLE | t | NULL | ref | idx_t_c | idx_t_c | 63 | const | 500 | 100.00 | NULL | +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ 1 row in set, 1 warning (0.00 sec) mysql> explain select * from t where c='F'; +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ | 1 | SIMPLE | t | NULL | ref | idx_t_c | idx_t_c | 63 | const | 700 | 100.00 | NULL | +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ 1 row in set, 1 warning (0.00 sec) mysql> explain select * from t where c='E'; +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ | 1 | SIMPLE | t | NULL | ref | idx_t_c | idx_t_c | 63 | const | 800 | 100.00 | NULL | +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ 1 row in set, 1 warning (0.00 sec) mysql> explain select * from t where c='B'; +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ | 1 | SIMPLE | t | NULL | ref | idx_t_c | idx_t_c | 63 | const | 1000 | 100.00 | NULL | +----+-------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------+ 1 row in set, 1 warning (0.00 sec) |
在没有直方图的情况下,可以看到rows的评估非常精准。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
mysql> select * from mysql.innodb_index_stats where table_name='t'; +---------------+------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+ | database_name | table_name | index_name | last_update | stat_name | stat_value | sample_size | stat_description | +---------------+------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+ | lixinyan | t | GEN_CLUST_INDEX | 2023-02-24 21:28:18 | n_diff_pfx01 | 6000 | 14 | DB_ROW_ID | | lixinyan | t | GEN_CLUST_INDEX | 2023-02-24 21:28:18 | n_leaf_pages | 14 | NULL | Number of leaf pages in the index | | lixinyan | t | GEN_CLUST_INDEX | 2023-02-24 21:28:18 | size | 15 | NULL | Number of pages in the index | | lixinyan | t | idx_t_a | 2023-02-24 21:28:18 | n_diff_pfx01 | 1 | 7 | a | | lixinyan | t | idx_t_a | 2023-02-24 21:28:18 | n_diff_pfx02 | 6000 | 7 | a,DB_ROW_ID | | lixinyan | t | idx_t_a | 2023-02-24 21:28:18 | n_leaf_pages | 7 | NULL | Number of leaf pages in the index | | lixinyan | t | idx_t_a | 2023-02-24 21:28:18 | size | 8 | NULL | Number of pages in the index | | lixinyan | t | idx_t_c | 2023-02-24 21:28:18 | n_diff_pfx01 | 6 | 7 | c | | lixinyan | t | idx_t_c | 2023-02-24 21:28:18 | n_diff_pfx02 | 6000 | 7 | c,DB_ROW_ID | | lixinyan | t | idx_t_c | 2023-02-24 21:28:18 | n_leaf_pages | 7 | NULL | Number of leaf pages in the index | | lixinyan | t | idx_t_c | 2023-02-24 21:28:18 | size | 8 | NULL | Number of pages in the index | +---------------+------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+ |
从统计信息中也无法做出如此精准的判断,这是什么原因呢?
原因就在于mysql在没有直方图之前,如果数据倾斜严重的情况下,容易走错执行计划。所以引入了一个index dive的特性。官方文档介绍如下:
Equality Range Optimization of Many-Valued Comparisons
Consider these expressions, where
col_name
is an indexed column:
- col_name IN(val1, …, valN)
- col_name = val1 OR … OR col_name = valN
Each expression is true if
col_name
is equal to any of several values. These comparisons are equality range comparisons (where the “range” is a single value). The optimizer estimates the cost of reading qualifying rows for equality range comparisons as follows:
- If there is a unique index on
col_name
, the row estimate for each range is 1 because at most one row can have the given value.- Otherwise, any index on
col_name
is nonunique and the optimizer can estimate the row count for each range using dives into the index or index statistics.With index dives, the optimizer makes a dive at each end of a range and uses the number of rows in the range as the estimate. For example, the expression
has three equality ranges and the optimizer makes two dives per range to generate a row estimate. Each pair of dives yields an estimate of the number of rows that have the given value.
col_name
IN (10, 20, 30)Index dives provide accurate row estimates, but as the number of comparison values in the expression increases, the optimizer takes longer to generate a row estimate. Use of index statistics is less accurate than index dives but permits faster row estimation for large value lists.
The
eq_range_index_dive_limit
system variable enables you to configure the number of values at which the optimizer switches from one row estimation strategy to the other. To permit use of index dives for comparisons of up toN
equality ranges, seteq_range_index_dive_limit
toN
+ 1. To disable use of statistics and always use index dives regardless ofN
, seteq_range_index_dive_limit
to 0.To update table index statistics for best estimates, use
ANALYZE TABLE
.Prior to MySQL 8.0, there is no way of skipping the use of index dives to estimate index usefulness, except by using the
eq_range_index_dive_limit
system variable. In MySQL 8.0, index dive skipping is possible for queries that satisfy all these conditions:
- The query is for a single table, not a join on multiple tables.
- A single-index
FORCE INDEX
index hint is present. The idea is that if index use is forced, there is nothing to be gained from the additional overhead of performing dives into the index.- The index is nonunique and not a
FULLTEXT
index.- No subquery is present.
- No
DISTINCT
,GROUP BY
, orORDER BY
clause is present.For
EXPLAIN FOR CONNECTION
, the output changes as follows if index dives are skipped:
- For traditional output, the
rows
andfiltered
values areNULL
.- For JSON output,
rows_examined_per_scan
androws_produced_per_join
do not appear,skip_index_dive_due_to_force
istrue
, and cost calculations are not accurate.Without
FOR CONNECTION
,EXPLAIN
output does not change when index dives are skipped.After execution of a query for which index dives are skipped, the corresponding row in the Information Schema
OPTIMIZER_TRACE
table contains anindex_dives_for_range_access
value ofskipped_due_to_force_index
.
该特性的意思是mysql在评估满足index dive生效条件的执行计划时,会实际去访问一些page,从而获取更加精准的rows。以便得到正确的执行计划。并不是完全依靠统计信息。
index dive的参数由eq_range_index_dive_limit控制。
1 2 3 4 5 6 |
mysql> show variables like '%eq_range%'; +---------------------------+-------+ | Variable_name | Value | +---------------------------+-------+ | eq_range_index_dive_limit | 200 | +---------------------------+-------+ |
- 0 :始终使用index dive的方式。
- 1 :不使用index dive的方式。
- N :1+N 条件个数。
index dive的适用条件:
- 非唯一索引字段的in或者or,col_name IN (val1, ..., valN)/col_name = val1 OR … OR col_name = valN
- 非唯一索引字段的(< > <= >=)的范围扫描
- 非唯一索引字段的等值查询col_name=val
index dive的限制:
在 MySQL 8.0 之前,只能使用eq_range_index_dive_limit跳过index dive。在 MySQL 8.0 中,满足这些条件的查询可能会跳过index dive:
- 多表关联
- 使用了FORCE INDEX HINT
- 存在子查询
- 存在DISTINCT, GROUP BY, or ORDER BY
- 索引为全文索引
通过optimizer_trace验证是否使用了index dive:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
mysql> set optimizer_trace='enabled=on'; Query OK, 0 rows affected (0.00 sec) mysql> select * from t where c='D'; ... mysql> select * from information_schema.optimizer_trace\G *************************** 1. row *************************** QUERY: select * from t where c='D' ... "analyzing_range_alternatives": { "range_scan_alternatives": [ { "index": "idx_t_c", "ranges": [ "c = 'D'" ], "index_dives_for_eq_ranges": true, "rowid_ordered": true, "using_mrr": false, "index_only": false, "in_memory": 1, "rows": 500, "cost": 175.26, "chosen": true } ] ... |
后续小菜鸟还初探了一下源码,可以看到每次index dive采样的page最多10个page,以及rows评估的公式。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 |
/* Do not read more than this number of pages in order not to hurt performance with this code which is just an estimation. If we read this many pages before reaching slot2->page_no then we estimate the average from the pages scanned so far. */ constexpr uint32_t N_PAGES_READ_LIMIT = 10; page_id_t page_id(dict_index_get_space(index), slot1->page_no); const fil_space_t *space = fil_space_get(index->space); ut_ad(space); const page_size_t page_size(space->flags); level = slot1->page_level; do { mtr_t mtr; page_t *page; buf_block_t *block; mtr_start(&mtr); /* Fetch the page. Because we are not holding the index->lock, the tree may have changed and we may be attempting to read a page that is no longer part of the B-tree. We pass Page_fetch::POSSIBLY_FREED in order to silence a debug assertion about this. */ block = buf_page_get_gen(page_id, page_size, RW_S_LATCH, nullptr, Page_fetch::POSSIBLY_FREED, UT_LOCATION_HERE, &mtr); page = buf_block_get_frame(block); /* It is possible that the tree has been reorganized in the meantime and this is a different page. If this happens the calculated estimate will be bogus, which is not fatal as this is only an estimate. We are sure that a page with page_no exists because InnoDB never frees pages, only reuses them. */ if (!fil_page_index_page_check(page) || btr_page_get_index_id(page) != index->id || btr_page_get_level(page) != level) { /* The page got reused for something else */ mtr_commit(&mtr); goto inexact; } /* It is possible but highly unlikely that the page was originally written by an old version of InnoDB that did not initialize FIL_PAGE_TYPE on other than B-tree pages. For example, this could be an almost-empty BLOB page that happens to contain the magic values in the fields that we checked above. */ n_pages_read++; if (page_id.page_no() != slot1->page_no) { /* Do not count the records on slot1->page_no, we already counted them before this loop. */ n_rows += page_get_n_recs(page); } page_id.set_page_no(btr_page_get_next(page, &mtr)); mtr_commit(&mtr); if (n_pages_read == N_PAGES_READ_LIMIT || page_id.page_no() == FIL_NULL) { /* Either we read too many pages or we reached the end of the level without passing through slot2->page_no, the tree must have changed in the meantime */ goto inexact; } } while (page_id.page_no() != slot2->page_no); return (n_rows); inexact: *is_n_rows_exact = false; /* We did interrupt before reaching slot2->page */ if (n_pages_read > 0) { /* The number of pages on this level is n_rows_on_prev_level, multiply it by the average number of recs per page so far */ n_rows = n_rows_on_prev_level * n_rows / n_pages_read; } else { /* The tree changed before we could even start with slot1->page_no */ n_rows = 10; } return (n_rows); } |
- N_PAGES_READ_LIMIT 10
- n_rows = n_rows_on_prev_level * n_rows / n_pages_read