This file is intended to explain some of the optimizer cost variables
in MariaDB 11.0

Background
==========

Most timings have come from running:

./check_costs.pl --rows=1000000 --socket=/tmp/mysql-dbug.sock --comment="--aria-pagecache-buffer-size=10G --innodb-buffer_pool_size=10G --key_buffer-size=1G --max-heap-table-size=10G"

The MariaDB server is started with the options:
--aria-pagecache-buffer-size=10G --innodb-buffer_pool_size=10G --key_buffer-size=1G --max-heap-table-size=10G

- All costs are changed to be milliseconds for engine operations and
  other calculations, like the WHERE clause. This is a big change from
  before the patch that added this file, where the basic cost unit was a
  disk seek, and one seek and one index read were assumed to have the
  same cost.
- I am using Aria as the 'base' cost. This is because it caches all data,
  which most other engines would also do.
- MyISAM cannot be used as 'base' as it does not cache row data (which gives
  a high overhead when doing row lookups).
- Heap is in memory and a bit too special (no caching).
- InnoDB is a clustered engine where secondary indexes have to use
  the clustered index to find a row (not a common case among storage engines).

The old assumption in the optimizer has 'always' been that
1 cost = 1 seek = 1 index read = 1 row lookup = 0.10ms.
However, 1 seek != 1 index or row lookup, and this has not been reflected
in most other costs.
This document is the basis for changing things so that 1 cost = 1ms.


Setup
=====

All timings are calculated based on results from this computer:
CPU:    Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz
Memory: 256G
Disk:   Samsung SSD 860 (not really relevant in this case)
Rows in tests: 1M. Each test is run 3 times
(one run to cache the data and 2 runs of which we take the average).

The assumption is that other computers will have somewhat proportional
timings. The timings are done with all data in memory (except MyISAM rows).
This is reflected in the costs for the test by setting
optimizer_disk_read_ratio=0.

Note that even on a single Linux computer without any notable tasks,
the run time varies a bit from run to run (up to 4%), so the numbers in
this document cannot be repeated exactly, but they should be good enough
for the optimizer.

Timings for disk accesses on other systems can be changed by setting
optimizer_disk_read_cost (usec / 4096 bytes) to match the read speed.

Default values for check_costs.pl:
optimizer_disk_read_ratio= 0       Everything is cached
SCAN_LOOKUP_COST=1                 Cost modifier for scan (for end user)
set @@optimizer_switch='index_condition_pushdown=off';


ROW_COPY_COST and KEY_COPY_COST
===============================

Regarding ROW_COPY_COST:
When calculating the cost of fetching a row, we have two alternative cost
parts (in addition to other costs):
scanning:  rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST)
rnd_pos:   rows * (ROW_LOOKUP_COST    + ROW_COPY_COST)

In theory we could remove ROW_COPY_COST and just move the cost
to the two other variables. However, in the future there may be a reason
to modify row_copy_cost per table depending on the number and type
of fields (a table of 1000 fields should have a higher row copy cost than
a table with 1 field). Because of this, I prefer to keep ROW_COPY_COST
around for now.

Regarding KEY_COPY_COST:
When calculating the cost of fetching a key, we have as part of the cost:
keyread_time:  rows * KEY_COPY_COST + ranges * KEY_LOOKUP_COST +
               (rows-ranges) * KEY_NEXT_FIND_COST
key_scan_time: rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)

We could remove KEY_COPY_COST by adding it to KEY_LOOKUP_COST and
KEY_NEXT_FIND_COST, but I prefer to keep it, with the same argument as
for ROW_COPY_COST.

The ratio KEY_COPY_COST / (KEY_NEXT_FIND_COST + KEY_COPY_COST)
is assumed to be 0.1577 (see the analysis in the appendix).

There is a relationship between the above costs in that for a clustered
index the cost is calculated as ha_keyread_time() + ROW_COPY_COST.


Preamble
========

I first tried to use the performance schema to get costs, but I was not
successful, as all timings I got for tables showed the total time of
executing the statement, not the time for doing the actual reads.
Also, the overhead of the performance schema affected the results.

With --performance-schema=on:

MariaDB [test]> select sum(1) from seq_1_to_100000000;
+-----------+
| sum(1)    |
+-----------+
| 100000000 |
+-----------+
1 row in set (4.950 sec)

Performance schema overhead: 30.1%

With:
UPDATE performance_schema.setup_consumers SET ENABLED = 'YES';
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES';

Flush with:
CALL sys.ps_truncate_all_tables(FALSE);

Performance schema overhead now: 32.9%

Timings from:
select * from events_statements_current where thread_id=80;

MariaDB [test]> select 885402302809000-884884140290000;
+---------------------------------+
| 885402302809000-884884140290000 |
+---------------------------------+
|                    518162519000 |
+---------------------------------+
-> Need to divide by 1000000000000.0 to get seconds

As seen above, this gives the total statement time, not the time
spent accessing the tables.

In the end, I decided to use ANALYZE to find out the cost of the table
actions.

For example, finding out the table scan timing (and thus costs):

analyze format=json select sum(1) from seq_1_to_100000000;
"r_table_time_ms": 1189.239022


Calculating 'optimizer_where_cost'
==================================

To make the WHERE cost reasonable (not too low), we assume there are
2 simple conditions in the default 'WHERE clause'.

MariaDB [test]> select benchmark(100000000,l_commitDate >= '2000-01-01' and l_tax >= 0.0) from test.check_costs limit 1;
+--------------------------------------------------------------------+
| benchmark(100000000,l_commitDate >= '2000-01-01' and l_tax >= 0.0) |
+--------------------------------------------------------------------+
|                                                                  0 |
+--------------------------------------------------------------------+
1 row in set (3.198 sec)

Time of WHERE in seconds: 3.198 / 100000000  (100,000,000)

Verification:

select sum(1) from seq_1_to_100000000 where seq>=0.0 and seq>=-1.0;
+-----------+
| sum(1)    |
+-----------+
| 100000000 |
+-----------+
1 row in set (8.564 sec)

MariaDB [test]> select sum(1) from seq_1_to_100000000;
+-----------+
| sum(1)    |
+-----------+
| 100000000 |
+-----------+
1 row in set (5.162 sec)

Time of WHERE= (8.564-5.162)/100000000 = 3.402/100000000 (100,000,000)
(Result good enough, as these are slightly different computations)

check_costs.pl provides these numbers when using heap tables and 1M rows:

simple where:  118.689 ms
complex where: 138.474 ms
no where:       83.699 ms

Which gives for the simple where:
(118.689-83.699)/1000 = 0.034990000000000007 ms
Which is in the same ballpark.

We use the result from the select benchmark run, as this has the least
overhead and is easiest to repeat and verify in a test.
Which gives:
optimizer_where_cost= 0.032 ms / WHERE.


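The arithmetic above can be re-done in a few lines (a sketch; all inputs are the measured values quoted in this section, and expressing the per-evaluation time in microseconds gives the 0.032 figure used for optimizer_where_cost):

```python
# Sanity check of the WHERE-cost numbers quoted above.
# All inputs are the measured values from this section.

benchmark_seconds = 3.198        # benchmark() of 100M evaluations of 2 conditions
evaluations = 100_000_000

# Per-evaluation time from the benchmark run, in microseconds.
where_cost_us = benchmark_seconds / evaluations * 1e6

# Cross-check from the seq_1_to_100000000 runs (with and without WHERE).
where_cost_us_seq = (8.564 - 5.162) / evaluations * 1e6

# Cross-check from check_costs.pl (times in ms for 1M rows).
where_cost_us_check = (118.689 - 83.699) / 1_000_000 * 1000

print(where_cost_us, where_cost_us_seq, where_cost_us_check)
```

All three estimates land in the same ballpark, which is why the benchmark value (rounded to 0.032) is good enough.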
HEAP TABLE SCAN & ROW_COPY_COST
===============================

We start with heap, as all rows are in memory and we don't have to take
disk reads into account.

select sum(l_partkey) from test.check_costs
table_scan  ms: 10.02078736
rows: 1000000

Cost should be 10.02078736 (scan cost) + 32 (where cost)

cost= scan_time() * optimizer_cache_cost * SCAN_LOOKUP_COST +
      TABLE_SCAN_SETUP_COST +
      records * (ROW_COPY_COST + ROW_LOOKUP_COST + WHERE_COMPARE_COST);

=>
We are ignoring TABLE_SCAN_SETUP_COST (which is just there to prefer index
lookups on small tables).
We can also ignore records * WHERE_COMPARE_COST, as we don't have that
in the above calculated 'ms'.
row_costs= (ROW_COPY_COST + ROW_LOOKUP_COST)

cost= scan_time() * 1 * 1 +
      1000000.0 * (row_costs)
=>
cost= time_per_row*1000000 + row_costs * 1000000;
=>
time_per_row+row_cost= cost/1000000

Let's assume that for heap, finding the next row is 80 % of the time and
copying the row (a memcpy) to the upper level is the remaining 20 %.
(This is not really important; we could put everything in heap_scan_time,
but it's good to have the data split, as it gives us more options to
experiment later.)

row_lookup_cost=  10.02078736/1000000*0.8 = 8.0166298880000005e-06
row_copy_cost= 10.02078736/1000000*0.2 = 2.0041574720000001e-06

Conclusion:
heap_scan_time= 8.0166e-06
row_copy_cost=  2.0042e-06

Heap doesn't support key-only reads, so key_copy_cost is not relevant for it.


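The 80/20 split above can be sketched as (the percentages are the assumption stated in the text, not measured values):

```python
# Splitting the measured heap table scan time into the two cost constants.
scan_ms = 10.02078736      # measured table scan time for 1M rows
rows = 1_000_000

row_lookup_cost = scan_ms / rows * 0.8   # assumed: 80% is finding the next row
row_copy_cost = scan_ms / rows * 0.2     # assumed: 20% is copying the row up
```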
HEAP INDEX SCAN
===============

select count(*) from test.check_costs_heap force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0
index_scan  time: 79.7286117 ms

Index scans on heap tables can only happen with binary trees.
l_suppkey is using a binary tree.

cost= (ranges + rows + 1) * BTREE_KEY_NEXT_FIND_COST + rows * row_copy_cost =
(for a large number of rows):
rows * (BTREE_KEY_NEXT_FIND_COST + row_copy_cost)

BTREE_KEY_NEXT_FIND_COST=  cost/rows - row_copy_cost =
79.7286117/1000000 - 2.334e-06= 0.0000773946117


HEAP EQ_REF
===========

select straight_join count(*) from seq_1_to_1000000,test.check_costs_heap where seq=l_linenumber
eq_ref_index_join  time: 175.874165 ms, of which 12.57 is from seq_1_to_1000000

Note: This is 34% of the cost of an Aria table with index lookup and
      20% of an Aria table with full key+row lookup.

cost= rows * (key_lookup_cost + row_copy_cost)
key_lookup_cost= cost/rows - row_copy_cost =
(175.874165-12.57)/1000000 - 2.334e-06 = 0.00016097016500000002


HEAP EQ_REF on binary tree index
================================

select straight_join count(*) from seq_1_to_1000000,test.check_costs_heap where seq=l_extra and l_partkey >= 0
eq_ref_join  time: 241.350539 ms, of which 12.57 is from seq_1_to_1000000

cost= rows * (tree_find_cost() + row_copy_cost)
->
tree_find_cost()= cost/rows - row_copy_cost =
(241.350539-12.57)/1000000 - 2.334e-06= 0.000226446539

tree_find_cost() is defined as key_compare_cost * log2(table_rows)
->
key_compare_cost= 0.000226446539/log2(1000000) = 0.000011361200108882259


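The derivation above can be re-done directly (a sketch using the measured values from this section):

```python
import math

# Deriving key_compare_cost from the eq_ref timing on the binary tree index.
eq_ref_ms = 241.350539     # measured join time
seq_scan_ms = 12.57        # time spent in seq_1_to_1000000
rows = 1_000_000
row_copy_cost = 2.334e-06  # row copy value used in this section

tree_find_cost = (eq_ref_ms - seq_scan_ms) / rows - row_copy_cost
# tree_find_cost is defined as key_compare_cost * log2(table_rows)
key_compare_cost = tree_find_cost / math.log2(rows)
```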
SEQUENCE SCAN
=============

analyze format=json select sum(seq+1) from seq_1_to_1000000;
"r_table_time_ms": 12.47830611

Note that for sequence, index and table scan are the same thing.
We need to have a row_copy/key_copy cost, as this is used when doing
a key lookup for sequence. Setting these to 50% of the full cost
should be sufficient for now.

Calculating sequence_scan_cost:

When ignoring disk reads, the cost of a table scan is:
rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST)

The cost of a key scan is:
ranges * KEY_LOOKUP_COST + (rows - ranges) * KEY_NEXT_FIND_COST +
rows * KEY_COPY_COST;

As there is no search after the first key for sequence, we can set
KEY_LOOKUP_COST = KEY_NEXT_FIND_COST.

This gives us:

r_table_time_ms = (ROW_NEXT_FIND_COST + ROW_COPY_COST) * 1000000 =
                  (KEY_NEXT_FIND_COST + KEY_COPY_COST) * 1000000
->
ROW_NEXT_FIND_COST= ROW_COPY_COST= KEY_NEXT_FIND_COST= KEY_COPY_COST=
12.47830611/1000000/2 =  0.0000062391530550


HEAP KEY LOOKUP
===============

We can use this code to find the timing of an index read in a table:

analyze format=json select straight_join count(*) from seq_1_to_1000000,check_costs where seq=l_orderkey

"query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 420.5083447,
    "table": {
      "table_name": "seq_1_to_1000000",
      "access_type": "index",
      "possible_keys": ["PRIMARY"],
      "key": "PRIMARY",
      "key_length": "8",
      "used_key_parts": ["seq"],
      "r_loops": 1,
      "rows": 1000000,
      "r_rows": 1000000,
      "r_table_time_ms": 12.47830611,
      "r_other_time_ms": 44.0671283,
      "filtered": 100,
      "r_filtered": 100,
      "using_index": true
    },
    "table": {
      "table_name": "check_costs",
      "access_type": "eq_ref",
      "possible_keys": ["PRIMARY"],
      "key": "PRIMARY",
      "key_length": "4",
      "used_key_parts": ["l_orderkey"],
      "ref": ["test.seq_1_to_1000000.seq"],
      "r_loops": 1000000,
      "rows": 1,
      "r_rows": 1,
      "r_table_time_ms": 160,
      "filtered": 100,
      "r_filtered": 100,
      "attached_condition": "seq_1_to_1000000.seq = check_costs.l_orderkey"
    }
  }

This gives the time for a key lookup on a hash key as:
160/1000000 - row_copy_cost =
160/1000000.0 - 2.0042e-06 = 0.00015799580000000002


ARIA TABLE SCAN
===============
(page format, all rows are cached)

table_scan ms: 107.315698

Cost is calculated as:

blocks= stats.data_file_length / stats.block_size = 122888192/4096= 30002
engine_blocks (8192 is the block size in Aria) = 15001

cost= blocks * avg_io_cost() *
      optimizer_cache_cost * SCAN_LOOKUP_COST +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      records * (ROW_NEXT_FIND_COST + ROW_COPY_COST);

When all is in memory (optimizer_cache_cost= 0) we get:

cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      records * (ROW_NEXT_FIND_COST + ROW_COPY_COST);

To calculate INDEX_BLOCK_COPY_COST, I added a temporary tracker in
ma_pagecache.cc::pagecache_read() and ran the same query.
I got the following data:
{counter = 17755, sum = 1890559}
Which gives the time for copying a block as:
1000.0*1890559/sys_timer_info.cycles.frequency/17755 = 3.558138826971332e-05 ms
And thus INDEX_BLOCK_COPY_COST= 3.56e-05 ms (0.035600 usec)

Replacing known constants (and ignoring TABLE_SCAN_SETUP_COST):
cost= 107.315698 = 15001 * 3.56e-5 + 1000000 * aria_row_copy_costs;

aria_row_copy_costs= (107.315698 - (15001 * 3.56e-5))/1000000 =
0.0001067816624

As ROW_COPY_COST/(ROW_COPY_COST + ROW_NEXT_FIND_COST)= 0.57 (see appendix):

ROW_COPY_COST=      0.0001067816624 * 0.57 = 0.000060865547560
ROW_NEXT_FIND_COST= 0.0001067816624 * 0.43 = 0.000045916114832


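The steps above can be sketched as (all inputs are the measured values from this section; the 0.57/0.43 split is the assumption stated in the text):

```python
# Solving the Aria table scan equation for the row costs.
scan_ms = 107.315698
engine_blocks = 15001              # 8K blocks in the data file
index_block_copy_cost = 3.56e-05   # ms, from the pagecache tracker
rows = 1_000_000

# ROW_NEXT_FIND_COST + ROW_COPY_COST per row, after removing the block copies.
row_total = (scan_ms - engine_blocks * index_block_copy_cost) / rows

# Split using the assumed ratio ROW_COPY_COST : ROW_NEXT_FIND_COST = 0.57 : 0.43
row_copy_cost = row_total * 0.57
row_next_find_cost = row_total * 0.43
```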
Aria, INDEX SCAN
================

Finding out the cost of reading X keys from an index (no row lookup) in Aria.

Query: select count(*) from test.check_costs_aria force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0
Table access time: ms: 98.1427158

blocks= index_size/IO_SIZE =
(rows * tot_key_length / INDEX_BLOCK_FILL_FACTOR) / IO_SIZE
->
1000000 * 19 / 0.75 / 4096 = 6184
engine_blocks (block_size 8192) = 6184/2 = 3092
(The range optimizer had calculated 3085)

keyread_time= blocks * avg_io_cost() * cache +
              engine_blocks * INDEX_BLOCK_COPY_COST +
              rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)
= engine_blocks * INDEX_BLOCK_COPY_COST + rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST) =
  3092 * 3.56e-05 + 1000000 * (KEY_NEXT_FIND_COST + KEY_COPY_COST)
->
KEY_NEXT_FIND_COST + KEY_COPY_COST= (98.1427158 - 3092 * 3.56e-05)/1000000 =
0.0000980326406;

KEY_COPY_COST=       0.0000980326406 * 0.16 = 0.000015685222496
KEY_NEXT_FIND_COST=  0.0000980326406 * 0.84 = 0.000082347418104


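The block estimate and key-cost split can be sketched as (inputs are the values from this section; the integer truncation mirrors how 6184 was obtained above):

```python
# Estimating index blocks and splitting the per-key cost for the Aria index scan.
rows = 1_000_000
tot_key_length = 19
fill_factor = 0.75       # INDEX_BLOCK_FILL_FACTOR
io_size = 4096

blocks = rows * tot_key_length / fill_factor / io_size   # IO_SIZE blocks (~6184)
engine_blocks = int(blocks) // 2                         # Aria uses 8K index blocks

# KEY_NEXT_FIND_COST + KEY_COPY_COST per key, after removing block copies.
key_total = (98.1427158 - engine_blocks * 3.56e-05) / rows
key_copy_cost = key_total * 0.16        # assumed split (roughly the 0.1577 ratio)
key_next_find_cost = key_total * 0.84
```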
Aria, RANGE SCAN (scan index, fetch a row for each index entry)
===============================================================

Query:
select sum(l_orderkey) from test.check_costs_aria force index(l_suppkey) where l_suppkey >= 0 and l_partkey >=0
range_scan  ms: 309.7620909

cost= keyread_time + rnd_pos_time
keyread_time is as above in the index scan, but without KEY_COPY_COST:
keyread_time= 98.1427158 - KEY_COPY_COST * 1000000=
98.1427158 - 0.000015685222496 * 1000000= 82.457493304000000
rnd_pos_time= 309.7620909 - 82.457493304000000 = 227.304597596000000

rnd_pos_time() = io_cost + engine_mem_cost +
                 rows * (ROW_LOOKUP_COST + ROW_COPY_COST) =
rows * avg_io_cost() * engine_block_size/IO_SIZE +
rows * INDEX_BLOCK_COPY_COST +
rows * (ROW_COPY_COST + ROW_LOOKUP_COST)
= (when rows are in memory)
rows * INDEX_BLOCK_COPY_COST +
rows * (ROW_COPY_COST + ROW_LOOKUP_COST)

This gives us:
227.304597596000000 = 1000000 * 3.56e-05 + 1000000*(0.000060865547560 + ROW_LOOKUP_COST)
->
ROW_LOOKUP_COST= (227.304597596000000 - 1000000 * 3.56e-05 - 1000000*0.000060865547560) / 1000000 = 0.0001308390500


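The decomposition above can be sketched as (inputs are the constants derived in the two preceding sections):

```python
# Decomposing the Aria range scan time into key read + row fetch.
range_scan_ms = 309.7620909
index_scan_ms = 98.1427158
key_copy_cost = 0.000015685222496
index_block_copy_cost = 3.56e-05
row_copy_cost = 0.000060865547560
rows = 1_000_000

keyread_ms = index_scan_ms - key_copy_cost * rows   # index scan minus key copies
rnd_pos_ms = range_scan_ms - keyread_ms             # remainder is the row fetches
row_lookup_cost = rnd_pos_ms / rows - index_block_copy_cost - row_copy_cost
```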
Aria, EQ_REF with index_read
============================

select straight_join count(*) from seq_1_to_1000000,test.check_costs_aria where seq=l_linenumber
eq_ref_index_join 499.631749 ms

According to the analyze statement:

- Cost for SELECT * from seq_1_to_1000000: 12.57
  (From Last_query_cost after the above costs have been applied)
- Time from check_costs for the eq_ref's: 499.631749 - 12.57 = 487.061749

cost= rows * (keyread_time(1,1) + KEY_COPY_COST)

keyread_time(1,1)= INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST;

cost= rows * (KEY_COPY_COST + INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST)
->
KEY_LOOKUP_COST= cost/rows - KEY_COPY_COST - INDEX_BLOCK_COPY_COST
KEY_LOOKUP_COST= 487.061749 / 1000000 - 0.000035600 - 0.000015685222496
KEY_LOOKUP_COST= 0.000435776526504


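The same equation, solved numerically (a sketch; inputs are the values from this section):

```python
# Solving the Aria eq_ref equation for KEY_LOOKUP_COST.
eq_ref_ms = 499.631749
seq_scan_ms = 12.57
rows = 1_000_000
index_block_copy_cost = 3.56e-05
key_copy_cost = 0.000015685222496

cost_ms = eq_ref_ms - seq_scan_ms   # time spent in check_costs_aria only
key_lookup_cost = cost_ms / rows - index_block_copy_cost - key_copy_cost
```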
MyISAM, TABLE SCAN
==================

select sum(l_partkey) from test.check_costs_myisam
table_scan  ms: 126.353364

check_costs.MYD: 109199788 bytes = 26660 IO_SIZE blocks
The row format for MyISAM is similar to Aria's, so we use the same
ROW_COPY_COST as for Aria.

cost= blocks * avg_io_cost() *
      optimizer_cache_cost * SCAN_LOOKUP_COST +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST);

MyISAM is using the file system as a row cache.
Let's put the cost of accessing the row in ROW_NEXT_FIND_COST.
Everything is cached (by the file system) and optimizer_cache_cost= 0:

cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST)

ROW_NEXT_FIND_COST=
(cost - engine_blocks * INDEX_BLOCK_COPY_COST - TABLE_SCAN_SETUP_COST)/rows -
ROW_COPY_COST
=
(126.353364 - 26660 * 3.56e-05 - 1)/1000000 - 0.000060865547560
ROW_NEXT_FIND_COST= 0.00006353872044


MyISAM INDEX SCAN
=================

select count(*) from test.check_costs_myisam force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0;
index_scan  ms:  106.490584

blocks= index_size/IO_SIZE =
(rows * tot_key_length / INDEX_BLOCK_FILL_FACTOR) / IO_SIZE
->
1000000 * 19 / 0.75 / 4096 = 6184
As MyISAM has a block size of 4096 for this table, engine_blocks= 6184

cost= keyread_time= blocks * avg_io_cost() * cache +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST);
->
cost= engine_blocks * INDEX_BLOCK_COPY_COST + rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)

Assuming INDEX_BLOCK_COPY_COST is the same as in Aria, and the code for
key copying is identical to Aria's:
cost=  6184 * 3.56e-05 + 1000000 * (KEY_NEXT_FIND_COST + KEY_COPY_COST)
->
KEY_NEXT_FIND_COST= (106.490584 - 6184 * 3.56e-05)/1000000 - 0.000015685222496=
0.000090585211104


MyISAM, RANGE SCAN (scan index, fetch a row for each index entry)
=================================================================

select sum(l_orderkey) from test.check_costs_myisam force index(l_suppkey) where l_suppkey >= 0 and l_partkey >=0 and l_discount>=0.0
time: 1202.0894 ms

cost= keyread_time + rnd_pos_time
keyread_time is as above in MyISAM INDEX SCAN, but without KEY_COPY_COST:
keyread_time= 106.490584 - KEY_COPY_COST * 1000000=
106.490584 - 0.000015685222496 * 1000000= 90.805361504000000
rnd_pos_time= 1202.0894 - 90.805361504000000 = 1111.284038496000000

rnd_pos_time() = io_cost + engine_mem_cost +
                 rows * (ROW_LOOKUP_COST + ROW_COPY_COST) =
rows * avg_io_cost() * engine_block_size/IO_SIZE +
rows * INDEX_BLOCK_COPY_COST +
rows * (ROW_COPY_COST + ROW_LOOKUP_COST)
= (when rows are in memory)
rows * INDEX_BLOCK_COPY_COST +
rows * (ROW_COPY_COST + ROW_LOOKUP_COST)

This gives us:
1111.284038496000000 = 1000000 * 3.56e-05 + 1000000*(0.000060865547560 + ROW_LOOKUP_COST)
->
ROW_LOOKUP_COST= (1111.284038496000000 - 1000000 * (3.56e-05 + 0.000060865547560)) / 1000000
->
ROW_LOOKUP_COST= 0.001014818490936

As the row is never cached, we have to ensure that rnd_pos_time()
doesn't include an io cost (which would be affected by
optimizer_cache_hit_ratio). This is done by having a special
ha_myisam::rnd_pos_time() that doesn't include io cost but instead an
extra cpu cost.


MyISAM, EQ_REF with index_read
==============================

select straight_join count(*) from seq_1_to_1000000,test.check_costs_myisam where seq=l_linenumber;
eq_ref_join  ms: 613.906777, of which 12.48 ms is for seq_1_to_1000000

According to the analyze statement:

- Cost for SELECT * from seq_1_to_1000000: 12.48 (see sequence_scan_cost)
- Time from check_costs for the eq_ref's: 613.906777 - 12.48 = 601.426777

cost= rows * (keyread_time(1) + KEY_COPY_COST)

keyread_time(1)= INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST;

cost= rows * (KEY_COPY_COST + INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST)
->
KEY_LOOKUP_COST= cost/rows - INDEX_BLOCK_COPY_COST - KEY_COPY_COST;
601.426777 / 1000000 - 3.56e-05 - 0.000015685222496 = 0.00055014155451
KEY_LOOKUP_COST= 0.00055014155451


InnoDB, TABLE SCAN
==================

select sum(l_quantity) from check_costs_innodb;
table_scan 131.302492
Note that InnoDB reported only 956356 rows instead of 1000000 in stats.records.
This will cause the optimizer to calculate the costs based on wrong
assumptions.

As InnoDB has a clustered index (whose cost is a combination of
KEY_LOOKUP_COST + ROW_COPY_COST), we have to ensure that the
relationship between KEY_COPY_COST and ROW_COPY_COST is close to the
real time of copying a key and a row.

I assume, for now, that the row format for InnoDB is not that
different from Aria's (in other words, the computation to unpack a row
is about the same), so let's use the same ROW_COPY_COST (0.000060865547560).

I am ignoring the fact that InnoDB can optimize row copying by only
copying the used fields, as the optimizer currently cannot take that
into account. (This would require a way to update ROW_COPY_COST per
table instance in the query.)

For now, let's also use the same value as Aria for
INDEX_BLOCK_COPY_COST (3.56e-05).

The number of IO_SIZE blocks in the InnoDB data file is 34728 (from gdb).
(For reference, MyISAM was using 26660 and Aria 30002 blocks.)
As InnoDB is using 16K blocks, the number of engine blocks= 34728/4= 8682

cost= blocks * avg_io_cost() *
      optimizer_cache_cost * SCAN_LOOKUP_COST +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST);

As optimizer_cache_cost = 0:

cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST)

ROW_NEXT_FIND_COST=
(cost - engine_blocks * INDEX_BLOCK_COPY_COST - TABLE_SCAN_SETUP_COST)/rows -
ROW_COPY_COST
= (ignoring TABLE_SCAN_SETUP_COST, which is just 10 usec)
(131.302492 - 8682 * 3.56e-05)/1000000 - 0.000060865547560 =
0.00007012786523999997


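The same calculation, numerically (a sketch; inputs are the values from this section, with ROW_COPY_COST reused from Aria as stated above):

```python
# Solving the InnoDB table scan equation for ROW_NEXT_FIND_COST.
scan_ms = 131.302492
io_blocks = 34728                 # IO_SIZE (4K) blocks in the data file
engine_blocks = io_blocks // 4    # InnoDB uses 16K blocks
index_block_copy_cost = 3.56e-05  # same value as Aria (assumption)
row_copy_cost = 0.000060865547560 # reused from Aria (assumption)
rows = 1_000_000

row_next_find_cost = (scan_ms - engine_blocks * index_block_copy_cost) / rows - row_copy_cost
```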
InnoDB INDEX SCAN
=================

select count(*) from check_costs_innodb force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0;
index_scan  114.733037 ms
Note that InnoDB is reporting 988768 rows instead of 1000000.
(The number varies a bit between runs. On another run I got 956356 rows.)
With the default costs (as of above), we get a query cost of 112.142. This can
still be improved a bit...

blocks= index_size/IO_SIZE =
(rows * tot_key_length / INDEX_BLOCK_FILL_FACTOR) / IO_SIZE
-> (tot_key_length is 17 in InnoDB, 19 in Aria)
1000000 * 17 / 0.75 / 4096 = 5533
engine_blocks= 5533/4 = 1383

(In reality we get 5293 blocks and 1323 engine blocks because of the
difference in the InnoDB row count.)

cost= keyread_time= blocks * avg_io_cost() * cache +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST);
->
cost= engine_blocks * INDEX_BLOCK_COPY_COST + rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)

Assuming INDEX_BLOCK_COPY_COST is the same as in Aria:
(It should probably be a bit higher, as the block size in InnoDB is 16384
compared to 8192 in Aria.)

cost=  1383 * 3.56e-05 + 1000000 * (KEY_NEXT_FIND_COST + KEY_COPY_COST)
->
KEY_NEXT_FIND_COST + KEY_COPY_COST= (114.733037 - 1383 * 3.56e-05)/1000000
->
KEY_NEXT_FIND_COST= (114.733037 - 1383 * 3.56e-05)/1000000 - 0.000015685222496
->
KEY_NEXT_FIND_COST= 0.000098998579704;

Setting this makes InnoDB calculate the cost to 113.077711 (with an estimate
of 988768 rows).
If we had had the right number of rows in ha_key_scan_time, we would
have got a cost of:

Last_query_cost: 145.077711  (including the WHERE cost for 988768 rows)
(145.077711)/988768*1000000.0-32 = 114.72573444933


InnoDB RANGE SCAN
=================

select sum(l_orderkey) from check_costs_innodb force index(l_suppkey) where l_suppkey >= 0 and l_partkey >=0 and l_discount>=0.0
range_scan 961.4857045 ms
Note that InnoDB was reporting 495340 rows instead of 1000000!
I added a patch to fix this, and now InnoDB reports 990144 rows.

cost= keyread_time + rnd_pos_time
keyread_time is as above in the index scan, but we want it without KEY_COPY_COST:
keyread_time= cost - KEY_COPY_COST * 1000000=
114.733037 - 0.000015685222496 * 1000000= 99.047814504000000
rnd_pos_time= 961.4857045 - 99.047814504000000 = 862.437889996000000

rnd_pos_time() = io_cost + engine_mem_cost +
                 rows * (ROW_LOOKUP_COST + ROW_COPY_COST) =
rows * avg_io_cost() * engine_block_size/IO_SIZE +
rows * INDEX_BLOCK_COPY_COST +
rows * (ROW_COPY_COST + ROW_LOOKUP_COST)
= (when rows are in memory)
rows * (INDEX_BLOCK_COPY_COST + ROW_COPY_COST + ROW_LOOKUP_COST)

This gives us:
862.437889996000000 = 1000000 * 3.56e-05 + 1000000*(0.000060865547560 + ROW_LOOKUP_COST)
->
ROW_LOOKUP_COST= (862.437889996000000 - 1000000*(3.56e-05+0.000060865547560)) / 1000000
->
ROW_LOOKUP_COST= 0.000765972342436

Setting this makes InnoDB calculate the cost to 961.081050 (good enough).


| InnodDB EQ_REF with index_read
 | |
| ==============================
 | |
| 
 | |
| select straight_join count(*) from seq_1_to_1000000,test.check_costs_innodb where seq=l_linenumber
 | |
| time: 854.980610 ms
 | |
| 
 | |
| Here the engine first has to do a key lookup and copy the key to the upper
 | |
| level (Index only read).
 | |
| 
 | |
| According to analyze statement:
 | |
| 
 | |
| - Cost for SELECT * from seq_1_to_1000000: 12.57 (See sequence_scan_cost)
 | |
| - Time from check_costs: eq_ref_join: 854.980610
 | |
|   This is time for accessing both seq_1_to_1000000 and check_costs
 | |
|   time for check_cost_innodb: 854.980610-12.57 = 842.410610 ms
 | |
| 
 | |
| cost= rows * (keyread_time(1,1) + KEY_COPY_COST)
 | |
| 
 | |
| keyread_time(1,1)= INDEX_BLOCK_COPY_COST + ranges * KEY_LOOKUP_COST +
 | |
|                    (rows-ranges) * KEY_NEXT_FIND_COST
 | |
| 
 | |
| As rows=1 and ranges=1:
 | |
| 
 | |
| keyread_time(1,1)= INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST
 | |
| 
 | |
| cost= rows * (KEY_COPY_COST + INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST)
 | |
| ->
 | |
| KEY_LOOKUP_COST= cost/rows - INDEX_BLOCK_COPY_COST - KEY_COPY_COST;
 | |
| 842.410610 / 1000000 - 3.56e-05 - 0.000015685222496
 | |
| ->
 | |
| KEY_LOOKUP_COST= 0.000791125387504;
 | |
| 
 | |
| After the above we have
 | |
| last_query_cost=918.986438;
 | |
| 
 | |
| The cost for check_costs_innodb =
 | |
| last_query_cost - sequence_scan_cost - where_cost*2 =
 | |
| 918.986438 - 12.57 - 32*2 = 842.416438 (ok)
 | |
| 
 | |
| 
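The same derivation plus the final cross-check, as a sketch with the values from the text (WHERE_COST is 0.032 usec per row, so 32 ms per million rows and table):

```python
# EQ_REF via secondary index: solve for KEY_LOOKUP_COST and cross-check
# the optimizer's reported total query cost (values from the text, in ms).
rows = 1_000_000
eq_ref_ms = 854.980610 - 12.57       # subtract the seq_1_to_1000000 scan cost
key_lookup_cost = (eq_ref_ms / rows
                   - 3.56e-05                 # INDEX_BLOCK_COPY_COST
                   - 0.000015685222496)       # KEY_COPY_COST

# Cross-check: last_query_cost minus sequence scan cost and WHERE cost
# (32 ms per table, two tables) should reproduce the per-table cost.
check = 918.986438 - 12.57 - 32 * 2
print(key_lookup_cost, check)        # ~0.000791125387504, ~842.416438
```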

InnoDB EQ_REF with clustered index read
=======================================

select straight_join count(*) from seq_1_to_1000000,check_costs_innodb where seq=l_orderkey
eq_ref_cluster_join  time: 972.290773 ms

According to analyze statement:
- Cost for SELECT * from seq_1_to_1000000: 12.57 (see sequence_scan_cost)
- Time from check_costs: eq_ref_cluster_join: 972.290773 ms
  This is the time for accessing both seq_1_to_1000000 and check_costs_innodb.
  Time for check_costs_innodb: 972.290773 - 12.57 = 959.790773

The estimated cost is 875.0160

cost= rows * (keyread_time(1,1) +
              ranges * ROW_LOOKUP_COST +
              (rows - ranges) * ROW_NEXT_FIND_COST +
              rows * ROW_COPY_COST)

As rows=1 and ranges=1:

cost= rows * (INDEX_BLOCK_COPY_COST + ROW_LOOKUP_COST + ROW_COPY_COST)
->
ROW_LOOKUP_COST= cost/rows - INDEX_BLOCK_COPY_COST - ROW_COPY_COST =
959.790773 / 1000000 - 3.56e-05 - 0.000060865547560
->
ROW_LOOKUP_COST= 0.000863325225440

From InnoDB RANGE SCAN we have ROW_LOOKUP_COST= 0.000765972342436.
From EQ_REF with index read we have KEY_LOOKUP_COST= 0.000791125387504,
which should in theory be identical to ROW_LOOKUP_COST.

For now we have to live with the difference (as I want to have the project
done for the next release).

The difference could come from the following things:

- InnoDB's estimation of rows in the range scan test is a bit off.
- The work to find a row from an internal key entry may differ a bit from
  finding it through an external key (less checking/conversions).
- Different keys are used for the range scan and for this test, and they
  could have different costs.
- Maybe we should increase ROW_COPY_COST or ROW_LOOKUP_COST for InnoDB
  and adjust other costs.

Some background. In range scan, the cost is:
- Scanning over all keys
  - For each key, fetch row using rowid

For the EQ_REF case:
- Scan seq_1_to_1000000
    for each value in seq
       do an index_read() call


Archive scan cost
=================

table_scan  time: 757.390280 ms
rows: 1000000
file size: 32260650 = 7878 IO_SIZE blocks

cost= scan_time() + TABLE_SCAN_SETUP_COST +
      records * (ROW_COPY_COST + ROW_LOOKUP_COST + WHERE_COMPARE_COST)

757.390280 = scan_time() + 10 + 1000000 * (0.060866+0.032000)/1000
->
scan_time()= 757.390280 - (10 + 1000000 * (0.060866+0.032000)/1000) = 654.52428

scan_time() is defined as:

cost.cpu= (blocks * DISK_READ_COST * DISK_READ_RATIO +
           blocks * ARCHIVE_DECOMPRESS_TIME);

Default values for above:
blocks= 7878
DISK_READ_COST:   10.240000 usec
DISK_READ_RATIO=  0.20
->
ARCHIVE_DECOMPRESS_TIME= (654.52428 - (7878 * 10.240000/1000*0.2)) / 7878 =
0.081034543792841

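The ARCHIVE_DECOMPRESS_TIME derivation can be replayed as follows (a sketch; all numbers are the ones quoted above):

```python
# Back out ARCHIVE_DECOMPRESS_TIME (ms per block) from the measured
# Archive table scan; all numbers come from the text above.
blocks = 7878                     # file size 32260650 / IO_SIZE (4096)
scan_ms = 757.390280              # measured table_scan time
per_row_ms = (0.060866 + 0.032000) / 1000   # ROW_COPY_COST + WHERE cost (usec -> ms)
scan_time = scan_ms - (10 + 1_000_000 * per_row_ms)  # minus setup + per-row work
disk_part = blocks * 10.240000 / 1000 * 0.2          # DISK_READ_COST * DISK_READ_RATIO
decompress_ms = (scan_time - disk_part) / blocks
print(decompress_ms)              # ~0.081034543792841 ms per block
```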

MyRocks TABLE SCAN
==================

select sum(l_quantity) from check_costs_rocksdb;
table_scan 213.038648 ms

cost= blocks * avg_io_cost() *
      optimizer_cache_cost * SCAN_LOOKUP_COST +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST)

Some defaults:
optimizer_cache_cost = 0
index_block_copy_cost= 0.000035600  (assume same as InnoDB)
table_scan_setup_cost= 0            (let's ignore it for now)
row_copy_cost= 0.000060865547560    (assume same as InnoDB for now)

show table status tells us that data_length=64699000 = 15796 4K-blocks.

cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      TABLE_SCAN_SETUP_COST +
      rows * (ROW_NEXT_FIND_COST + ROW_COPY_COST)
->
ROW_NEXT_FIND_COST=
(cost - engine_blocks * INDEX_BLOCK_COPY_COST)/rows - ROW_COPY_COST
= (213.038648 - 15796 * 0.000035600 - 0)/1000000 - 0.000060865547560 =
0.00015161079484


MyRocks INDEX SCAN
==================

select count(*) from test.check_costs_rocksdb force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0
index_scan 266.80435 ms

Note that MyRocks returns 2M rows for the table when it has only 1M rows!

block_size= 8192
key_length= 18
compression= 0.25 (75 %)
blocks= (key_length * rows) / 4 * block_size/4096 = 18 * 1000000/4 * 2 =
2198 IO_BLOCKS (= 1094 engine_blocks)

cost= keyread_time()=
      blocks * avg_io_cost * DISK_READ_RATIO +
      engine_blocks * INDEX_BLOCK_COPY_COST +
      rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)

As we assume that everything is in memory (DISK_READ_RATIO=0):

cost= engine_blocks * INDEX_BLOCK_COPY_COST +
      rows * (KEY_NEXT_FIND_COST + KEY_COPY_COST)

Assuming INDEX_BLOCK_COPY_COST and KEY_COPY_COST are the same as in Aria
and InnoDB:

cost= 1094 * 3.56e-05 + 1000000 * (KEY_NEXT_FIND_COST + KEY_COPY_COST)
->
KEY_NEXT_FIND_COST + KEY_COPY_COST= (266.80435 - 1094 * 3.56e-05)/1000000
->
KEY_NEXT_FIND_COST= (266.80435 - 1094 * 3.56e-05)/1000000 - 0.000015685222496
->
KEY_NEXT_FIND_COST= 0.000251080181104

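The same back-calculation as a sketch, with the assumption (stated above) that INDEX_BLOCK_COPY_COST and KEY_COPY_COST equal the InnoDB/Aria values:

```python
# MyRocks index scan: back out KEY_NEXT_FIND_COST, assuming the
# INDEX_BLOCK_COPY_COST and KEY_COPY_COST values measured for InnoDB/Aria.
rows = 1_000_000
engine_blocks = 1094
scan_ms = 266.80435                  # measured index_scan time
index_block_copy_cost = 3.56e-05
key_copy_cost = 0.000015685222496

key_next_find_cost = ((scan_ms - engine_blocks * index_block_copy_cost) / rows
                      - key_copy_cost)
print(key_next_find_cost)            # ~0.000251080181104
```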

MyRocks EQ_REF with index_read
==============================

select straight_join count(*) from seq_1_to_1000000,test.check_costs_rocksdb where seq=l_linenumber
time: 857.548991 ms

Here the engine first has to do a key lookup and copy the key to the upper
level (index only read).

According to analyze statement:

- Cost for SELECT * from seq_1_to_1000000: 12.57 (see sequence_scan_cost)
- Time from check_costs: eq_ref_join: 857.548991
  This is the time for accessing both seq_1_to_1000000 and check_costs_rocksdb.
  Time for check_costs_rocksdb: 857.548991 - 12.57 = 844.978991 ms

cost= rows * (keyread_time(1,1) + KEY_COPY_COST)

keyread_time(1,1)= INDEX_BLOCK_COPY_COST + ranges * KEY_LOOKUP_COST +
                   (rows-ranges) * KEY_NEXT_FIND_COST

As rows=1 and ranges=1:

keyread_time(1,1)= INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST

cost= rows * (KEY_COPY_COST + INDEX_BLOCK_COPY_COST + KEY_LOOKUP_COST)
->
KEY_LOOKUP_COST= cost/rows - INDEX_BLOCK_COPY_COST - KEY_COPY_COST =
844.978991 / 1000000 - 3.56e-05 - 0.000015685222496
->
KEY_LOOKUP_COST= 0.000793693768504


MyRocks EQ_REF with clustered index read
========================================

select straight_join count(*) from seq_1_to_1000000,check_costs_rocksdb where seq=l_orderkey
eq_ref_cluster_join 1613.5670 ms

According to analyze statement:
- Cost for SELECT * from seq_1_to_1000000: 12.57 (see sequence_scan_cost)
- Time from check_costs: eq_ref_cluster_join: 1613.5670 ms
  This is the time for accessing both seq_1_to_1000000 and check_costs_rocksdb.
  Time for check_costs_rocksdb: 1613.5670 - 12.57 = 1600.9970

cost= rows * (keyread_time(1,1) +
              ranges * ROW_LOOKUP_COST +
              (rows - ranges) * ROW_NEXT_FIND_COST +
              rows * ROW_COPY_COST)

As rows=1 and ranges=1:

cost= rows * (INDEX_BLOCK_COPY_COST + ROW_LOOKUP_COST + ROW_COPY_COST)
->
ROW_LOOKUP_COST= cost/rows - INDEX_BLOCK_COPY_COST - ROW_COPY_COST =
1600.9970 / 1000000 - 3.56e-05 - 0.000060865547560
->
ROW_LOOKUP_COST= 0.00150453145244


MyRocks Range scan
==================

select sum(l_orderkey) from test.check_costs_rocksdb force index(l_suppkey) where l_suppkey >= 0 and l_partkey >=0 and l_discount>=0.0

The MyRocks engine estimates the number of rows, both for the table and
for the range, to be about 2M, double the real amount.

The timing and costs from check_costs.pl are:

range_scan  time: 1845.06126 ms  cost-where: 3698.8919   cost: 3730.8919

As the costs are about double the time, this is as good as we can do
until the record count reported by MyRocks is corrected.

The wrongly estimated number of rows does not affect the other results
from check_costs.pl, as the table scan estimates use the number of rows
from the analyze, not from the engine.


Appendix
========

Future improvements
===================

The current costs are quite good for tables of 1M rows (usually about
10% from the true cost for the test table).

For smaller tables the costs will be a bit on the high side, and for
bigger tables a bit on the low side for eq_ref joins (both with index
and with row lookup).

The only engine that takes into account the number of rows for key lookups
is heap with binary-tree indexes.

Ideas of how to fix this:

- Change KEY_LOOKUP_COST, INDEX_BLOCK_COPY_COST and ROW_LOOKUP_COST
  (for clustered index) to take into account the height of the B-tree.


Observations
============

Ratio between table scan and range scan

Queries used:
select sum(l_quantity) from check_costs_aria;
select sum(l_orderkey) from test.check_costs_aria force index(l_suppkey) where l_suppkey >= 0 and l_partkey >=0 and l_discount>=0.0;

The test for Aria shows that the cost ratios of range_scan/table_scan are:
disk_read_ratio=0       341.745207/139.348286=  2.4524536097
disk_read_ratio=0.02    752.408528/145.748695=  5.1623688843
disk_read_ratio=0.20    4448.378423/203.352382= 21.8752216190

As we are using disk_read_ratio=0.02 by default, this means that in mtr,
to avoid a table scan being chosen instead of a range scan, we have to
ensure that the range does not cover more than about 1/5 of the total
rows.

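The ratio table can be reproduced directly from the quoted costs (a sketch; costs are the Aria numbers above):

```python
# Cost ratios of range scan vs table scan for Aria at different
# disk_read_ratio settings (costs copied from the table above).
costs = {                       # disk_read_ratio: (range_scan, table_scan)
    0.00: (341.745207, 139.348286),
    0.02: (752.408528, 145.748695),
    0.20: (4448.378423, 203.352382),
}
ratios = {r: rs / ts for r, (rs, ts) in costs.items()}
print(ratios)                   # ~{0.0: 2.45, 0.02: 5.16, 0.2: 21.88}
```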

Trying to understand KEY_COPY_COST
==================================

An index scan with 2 and 4 key parts on an Aria table.
The index has nullable key parts, so packed keys are used.

Query 1 "index_scan" (2 integer key parts, both key parts may have NULLs):
select count(*) from $table force index (l_suppkey) where l_suppkey >= 0 and l_partkey >=0

- Optimized build: Average 164 ms/query
- gprof build: Average 465 ms/query

[16]    51.2    0.00    0.21 3999987         handler::ha_index_next()
[15]    51.2    0.01    0.20 3999993         maria_rnext [15]
[22]    19.5    0.08    0.00 9658527         _ma_get_pack_key [22]

This means that for 3999987 read next calls, the time spent in
_ma_get_pack_key to retrieve the returned key is:
0.08 * (3999987/9658527)

The relation of KEY_COPY_COST to KEY_NEXT_FIND_COST is thus for Aria:

0.08 * (3999987/9658527)/0.21 = 0.15777 parts of KEY_NEXT_FIND_COST

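The fraction can be computed from the gprof numbers (a sketch; call counts and seconds are the profile values above):

```python
# Fraction of index-next time spent copying the key, from the gprof
# numbers above (Aria, 2-part packed key).
next_calls = 3_999_987          # handler::ha_index_next() calls
pack_key_calls = 9_658_527      # _ma_get_pack_key() calls (all callers)
pack_key_self_s = 0.08          # self time of _ma_get_pack_key (seconds)
next_total_s = 0.21             # total time under ha_index_next (seconds)

key_copy_fraction = pack_key_self_s * (next_calls / pack_key_calls) / next_total_s
print(key_copy_fraction)        # ~0.158 of KEY_NEXT_FIND_COST
```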

------

Query 2 "index_scan_4_parts" (4 integer key parts, 2 parts may have NULLs):
select count(*) from $table force index (long_suppkey) where l_linenumber >= 0 and l_extra >0

- Optimized build: 218 ms
- gprof build: Average 497 ms/query

Most costly functions:
   %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 13.44      0.61     0.61 48292742     0.00     0.00  _ma_get_pack_key
  8.59      1.00     0.39 28298101     0.00     0.00  ha_key_cmp
  7.27      1.33     0.33 19999951     0.00     0.00  _ma_put_key_in_record
  4.41      1.96     0.20 19999952     0.00     0.00  handler::ha_index_next(unsigned char*)

Call graph:
[13]     9.0    0.20    0.21 19999952         handler::ha_index_next(unsigned char*) [13]

[3]     21.6    0.16    0.82 19999960         _ma_search_next [3]
[18]     7.7    0.02    0.33 19999951         _ma_read_key_record [18]
        0.00    0.00 19887291/19999952        _ma_get_static_key [6565][19]
        18.4    0.10    0.64 19999936         Item_cond_and::val_int() [19]

-> KEY_COPY_COST = 1.33/1.96 = 0.6785 parts of the index_read_next

Total cost increase from 2 -> 4 key parts = 1.96 / 1.40 = 40%
This includes the additional work of having more key pages, more work in
finding the next key (if key parts are packed or possibly NULL), and
copying the key parts to the record.

I also did a quick analysis of using NOT NULL keys, in which case
Aria can use fixed key lengths. This gives a 39.4% speedup on index
scan, a small speedup on table scan (as 2 fields then cannot be NULL),
but no notable speedup for anything else.


Trying to understand ROW_COPY_COST
==================================

A simple table scan on an Aria table.

query: select sum(l_quantity) from check_costs_aria

From gprof, running the above query 10 times with 1M rows in the table:

[14]    83.7    0.03    0.76 9999989         handler::ha_rnd_next()
[17]    51.6    0.49    0.00 10000010         _ma_read_block_record2 [17]
[18]    21.1    0.01    0.19  156359         pagecache_read [18]

The function that unpacks the row is _ma_read_block_record2().

Taking into account that all pages are cached:
(Note that the main cost of pagecache_read in this test is calculating
the page checksum.)

ROW_COPY_COST/ROW_NEXT_FIND_COST= 0.49/(0.76+0.3-0.20) = 0.56977 = 0.57


Reason for SCAN_SETUP_COSTS
===========================

One problem with the new, more exact cost model is that the optimizer
starts to use table scans much more for small tables (which is correct
when one looks at cost). However, small tables are usually fully cached,
so it is still better to use an index scan in many cases.

This problem is especially notable in mtr, where most test cases use
tables with very few rows.

TABLE_SCAN_SETUP_COST is used to add a constant startup cost for
table and index scans. It is by default set to 10 usec, about 10 MyISAM
row reads.

The following cost calculation shows why this is needed:

explain select count(*) from t1, t2 where t1.p = t2.i
+------+-------------+-------+-------+---------------+---------+---------+-----------+------+-------------+
| id   | select_type | table | type  | possible_keys | key     | key_len | ref       | rows | Extra       |
+------+-------------+-------+-------+---------------+---------+---------+-----------+------+-------------+
|    1 | SIMPLE      | t1    | index | PRIMARY       | PRIMARY | 4       | NULL      | 2    | Using index |
|    1 | SIMPLE      | t2    | ref   | k1            | k1      | 5       | test.t1.p | 2    | Using index |
+------+-------------+-------+-------+---------------+---------+---------+-----------+------+-------------+

t1 has 2 rows
t2 has 4 rows

Optimizer trace shows when using TABLE_SCAN_SETUP_COST=0:

index scan costs:
"read_cost": 0.00308962,
"read_and_compare_cost": 0.00321762

key read costs:
"rows": 2,
"cost": 0.00567934

CHOSEN:
Scan with join cache: "cost": 0.0038774
"rows_after_scan": 2

Note that in the following we are using costs in microseconds, while
the above costs are in milliseconds.

select * from information_schema.optimizer_costs where engine="myisam"\G
                          ENGINE: MyISAM
        OPTIMIZER_DISK_READ_COST: 10.240000
 OPTIMIZER_INDEX_BLOCK_COPY_COST: 0.035600
      OPTIMIZER_KEY_COMPARE_COST: 0.008000
         OPTIMIZER_KEY_COPY_COST: 0.066660
       OPTIMIZER_KEY_LOOKUP_COST: 0.498540
    OPTIMIZER_KEY_NEXT_FIND_COST: 0.060210
       OPTIMIZER_DISK_READ_RATIO: 0.200000
OPTIMIZER_RND_POS_INTERFACE_COST: 0.000000
         OPTIMIZER_ROW_COPY_COST: 0.088630
       OPTIMIZER_ROW_LOOKUP_COST: 0.641150
    OPTIMIZER_ROW_NEXT_FIND_COST: 0.049510
    OPTIMIZER_ROWID_COMPARE_COST: 0.004000
@@OPTIMIZER_SCAN_SETUP_COST  10.000000
@@OPTIMIZER_WHERE_COST       0.032000

Checking the calculated costs:

index_scan_cost= 10.240000 * 0.2 + 0.035600 + 0.498540 + 4 * (0.060210+0.066660) = 3.08962
where_cost= 0.032000*4 = 0.128000
total: 3.21762

key_read_cost= 10.240000 * 0.2 + 0.035600 + 0.498540 + 0.060210 = 2.64235
key_copy_cost= 0.066660 * 2 = 0.13332
where_cost= 0.032000*2 = 0.06400
total: 2.64235 + 0.13332 + 0.06400 = 2.83967
Needs to be done 2 times (2 rows in t1): 5.67934

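The MyISAM cost check above can be reproduced directly from the optimizer_costs values (a sketch; all constants in microseconds, from the output above):

```python
# Recompute the MyISAM cost check above (all values in microseconds,
# taken from information_schema.optimizer_costs).
disk_read_cost = 10.240000
disk_read_ratio = 0.2
index_block_copy = 0.035600
key_lookup = 0.498540
key_next_find = 0.060210
key_copy = 0.066660
where_cost = 0.032000

# Index scan over t2 (4 rows), plus WHERE check per row.
index_scan = (disk_read_cost * disk_read_ratio + index_block_copy
              + key_lookup + 4 * (key_next_find + key_copy))
index_scan_total = index_scan + where_cost * 4

# One ref lookup into t2 (finding 2 rows), done once per row of t1 (2 rows).
key_read = (disk_read_cost * disk_read_ratio + index_block_copy
            + key_lookup + key_next_find)
eq_ref_total = 2 * (key_read + key_copy * 2 + where_cost * 2)
print(index_scan_total, eq_ref_total)   # ~3.21762, ~5.67934
```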

Join cache only needs 1 refill. The calculation is done in
sql_select.cc:best_access_path():

scan_with_join_cache=
scan_time + cached_combinations * ROW_COPY_COST * JOIN_CACHE_COST +
row_combinations * (ROW_COPY_COST * JOIN_CACHE_COST + WHERE_COST) =
3.2176 + 2 * 0.088630 + 2*2 * (0.088630 * 1 + 0.032000) =
3.87738

Other observations:
OPTIMIZER_KEY_NEXT_FIND_COST + OPTIMIZER_KEY_COPY_COST + OPTIMIZER_WHERE_COST=
0.060210 + 0.066660 + 0.032000 = 0.158870
OPTIMIZER_KEY_LOOKUP_COST / 0.158870 = 3.138

This means that when using index only reads (and DISK_READ_RATIO=0)
the optimizer will prefer to read 3 times more keys in a range or ref
scan than doing key lookups!
If DISK_READ_RATIO is higher, the above ratio increases. This is one of
the reasons why we set the default value for DISK_READ_RATIO quite low
(0.02 now).

(OPTIMIZER_ROW_COPY_COST + OPTIMIZER_ROW_NEXT_FIND_COST) /
(OPTIMIZER_KEY_COPY_COST + OPTIMIZER_KEY_NEXT_FIND_COST) =
(0.088630 + 0.049510) / (0.066660 + 0.060210) = 1.08831
Which means that table scans and index scans have almost the same cost.


HEAP_TEMPTABLE_CREATE_COST
==========================

I added trackers in create_tmp_table() and open_tmp_table() and ran a
simple query that creates two materialized temporary tables with a unique
index 31 times. I got the following tracking information:

(gdb) p open_tracker
$1 = {counter = 31, cycles = 302422}
(gdb) p create_tracker
$2 = {counter = 31, cycles = 1479836}

Cycles per create = (302422 + 1479836)/31 = 57492

1000.0*57492/sys_timer_info.cycles.frequency = 0.0249 ms
->
HEAP_TEMPTABLE_CREATE_COST= 0.025 ms

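The cycle-to-milliseconds conversion can be sketched as follows. The CPU frequency is an assumption on my part (the text only references sys_timer_info.cycles.frequency); ~2.3 GHz reproduces the 0.0249 ms figure:

```python
# Convert the measured cycle counts into milliseconds per create.
open_cycles, create_cycles, runs = 302_422, 1_479_836, 31
cycles_per_create = (open_cycles + create_cycles) // runs

cpu_hz = 2.3e9                     # assumed timer frequency, not from the text
ms_per_create = 1000.0 * cycles_per_create / cpu_hz
print(cycles_per_create, ms_per_create)   # 57492 and ~0.025 ms
```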

What to do with wrong row estimates
===================================

MyRocks can have a very bad row estimate, both for the number of rows in
the table and for big ranges. ANALYZE TABLE can fix this, but we have to
consider how to keep the row estimate correct when tables grow over time.

Suggested fix:
- If we can assume that the data file size reported by the engine is
  somewhat correct, we could estimate the number of rows as:
  analyze_number_of_rows * current_datafile_size / analyze_datafile_size

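The suggested scaling can be sketched as below; the file sizes in the example call are made-up illustration values, not measurements from the text:

```python
# Sketch of the suggested row-estimate scaling for growing tables.
def estimated_rows(analyze_rows, analyze_file_size, current_file_size):
    """Scale the ANALYZE-time row count by the data file growth."""
    return analyze_rows * current_file_size // analyze_file_size

# Example: table had 1M rows at ANALYZE time; the data file has since
# doubled, so we estimate 2M rows (sizes here are illustrative only).
print(estimated_rows(1_000_000, 64_699_000, 129_398_000))
```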

MySQL cost structures
=====================

MySQL 8.0 server costs are stored in the class Server_cost_constants,
defined in opt_costconstants.h.

It contains the following slots, with the following default values:

m_row_evaluate_cost             0.1  Cost of evaluating the query condition
                                     on a row
m_key_compare_cost              0.05 Cost of comparing two keys
m_memory_temptable_create_cost  1.0  Cost of creating an internal temporary
                                     table in memory
m_memory_temptable_row_cost     0.1  Cost of retrieving or storing a row in
                                     an internal temporary table stored in
                                     memory
m_disk_temptable_create_cost    20.0 Cost of creating an internal temporary
                                     table in a disk resident storage engine
m_disk_temptable_row_cost       0.5  Cost of retrieving or storing a row in
                                     an internal disk resident temporary
                                     table

Engine cost variables:
m_memory_block_read_cost        0.25 The cost of reading a block from a main
                                     memory buffer pool
m_io_block_read_cost            1.0  The cost of reading a block from an
                                     IO device (disk)

-------

Some cost functions:

scan_time()=            data_file_length / IO_SIZE + 2
read_time(index, ranges, rows)= rows2double(ranges + rows)
index_only_read_time()= records / keys_per_block

table_scan_cost()= scan_time() * page_read_cost(1.0)

index_scan_cost()= index_only_read_time(index, rows) *
                   page_read_cost_index(index, 1.0)
read_cost()=       read_time() * page_read_cost(1.0)

page_read_cost()= buffer_block_read_cost(pages_in_mem) +
                  io_block_read_cost(pages_on_disk)

io_block_read_cost()=     blocks * m_io_block_read_cost
buffer_block_read_cost()= blocks * m_memory_block_read_cost

There are also:
table_in_memory_estimate()
index_in_memory_estimate()

If the storage engine does not provide estimates for the above, the
estimates are made based on table size (not depending on how many
rows are going to be accessed in the table).
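
The MySQL 8.0 model above can be sketched as follows. This is a simplified illustration of the formulas quoted in the text, not MySQL's actual implementation; the function signatures here (taking an explicit in-memory fraction) are my own simplification:

```python
# Simplified sketch of MySQL 8.0's table scan cost model, using the
# Server_cost_constants defaults quoted above.
IO_SIZE = 4096
M_MEMORY_BLOCK_READ_COST = 0.25     # block read from buffer pool
M_IO_BLOCK_READ_COST = 1.0          # block read from disk

def scan_time(data_file_length):
    return data_file_length / IO_SIZE + 2

def page_read_cost(pages, in_memory_fraction):
    """Split page reads between buffer pool and disk."""
    in_mem = pages * in_memory_fraction
    on_disk = pages - in_mem
    return in_mem * M_MEMORY_BLOCK_READ_COST + on_disk * M_IO_BLOCK_READ_COST

def table_scan_cost(data_file_length, in_memory_fraction):
    return scan_time(data_file_length) * page_read_cost(1.0, in_memory_fraction)

# 1000-page fully cached table: (1000 + 2) * 0.25
print(table_scan_cost(4096 * 1000, 1.0))
```

Note how, unlike the MariaDB model discussed in this file, all costs here are per block with no separate row copy, key copy, or WHERE evaluation terms.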
