mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-27 09:14:17 +01:00

Author	SHA1	Message	Date
Marko Mäkelä	15700f54c2	Merge 11.4 into 11.7	2025-01-09 09:41:38 +02:00
Marko Mäkelä	17f01186f5	Merge 10.11 into 11.4	2025-01-09 07:58:08 +02:00
Marko Mäkelä	ddd7d5d8e3	MDEV-24035 Failing assertion: UT_LIST_GET_LEN(lock.trx_locks) == 0 causing disruption and replication failure Under unknown circumstances, the SQL layer may wrongly disregard an invocation of thd_mark_transaction_to_rollback() when an InnoDB transaction had been aborted (rolled back) due to one of the following errors: * HA_ERR_LOCK_DEADLOCK * HA_ERR_RECORD_CHANGED (if innodb_snapshot_isolation=ON) * HA_ERR_LOCK_WAIT_TIMEOUT (if innodb_rollback_on_timeout=ON) Such an error used to cause a crash of InnoDB during transaction commit. These changes aim to catch and report the error earlier, so that not only this crash can be avoided but also the original root cause be found and fixed more easily later. The idea of this fix is from Michael 'Monty' Widenius. HA_ERR_ROLLBACK: A new error code that will be translated into ER_ROLLBACK_ONLY, signalling that the current transaction has been aborted and the only allowed action is ROLLBACK. trx_t::state: Add TRX_STATE_ABORTED that is like TRX_STATE_NOT_STARTED, but noting that the transaction had been rolled back and aborted. trx_t::is_started(): Replaces trx_is_started(). ha_innobase: Check the transaction state in various places. Simplify the logic around SAVEPOINT. ha_innobase::is_valid_trx(): Replaces ha_innobase::is_read_only(). The InnoDB logic around transaction savepoints, commit, and rollback was unnecessarily complex and might have contributed to this inconsistency. So, we are simplifying that logic as well. trx_savept_t: Replace with const undo_no_t*. When we rollback to a savepoint, all we need to know is the number of undo log records that must survive. trx_named_savept_t, DB_NO_SAVEPOINT: Remove. We can store undo_no_t directly in the space allocated at innobase_hton->savepoint_offset. fts_trx_create(): Do not copy previous savepoints. fts_savepoint_rollback(): If a savepoint was not found, roll back everything after the default savepoint of fts_trx_create(). The test innodb_fts.savepoint is extended to cover this code. Reviewed by: Vladislav Lesin Tested by: Matthias Leich	2024-12-12 18:02:00 +02:00
Marko Mäkelä	33907f9ec6	Merge 11.4 into 11.7	2024-12-02 17:51:17 +02:00
Marko Mäkelä	2719cc4925	Merge 10.11 into 11.4	2024-12-02 11:35:34 +02:00
Thirunarayanan Balathandayuthapani	074831ec61	Merge branch 10.5 into 10.6	2024-11-08 18:17:15 +05:30
Thirunarayanan Balathandayuthapani	7afee25b08	MDEV-35115 Inconsistent Replace behaviour when multiple unique index exist - Replace statement fails with duplicate key error when multiple unique key conflict happens. Reason is that Server expects the InnoDB engine to store the confliciting keys in ascending order. But the InnoDB doesn't store the conflicting keys in ascending order. Fix: === - Enable HA_DUPLICATE_KEY_NOT_IN_ORDER for InnoDB storage engine only when unique index order is different in .frm and innodb dictionary.	2024-11-08 16:46:41 +05:30
Sergei Golubchik	aed5928207	cleanup: extract transaction-related part of handlerton into a separate transaction_participant structure handlerton inherits it, so handlerton itself doesn't change. but entities that only need to participate in a transaction, like binlog or online alter log, use a transaction_participant and no longer need to pretend to be a full-blown but invisible storage engine which doesn't support create table.	2024-11-05 14:00:50 -08:00
Sergei Golubchik	062f8eb37d	cleanup: key algorithm vs key flags the information about index algorithm was stored in two places inconsistently split between both. BTREE index could have key->algorithm == HA_KEY_ALG_BTREE, if the user explicitly specified USING BTREE or HA_KEY_ALG_UNDEF, if not. RTREE index had key->algorithm == HA_KEY_ALG_RTREE and always had key->flags & HA_SPATIAL FULLTEXT index had key->algorithm == HA_KEY_ALG_FULLTEXT and always had key->flags & HA_FULLTEXT HASH index had key->algorithm == HA_KEY_ALG_HASH or HA_KEY_ALG_UNDEF long unique index always had key->algorithm == HA_KEY_ALG_LONG_HASH In this commit: All indexes except BTREE and HASH always have key->algorithm set, HA_SPATIAL and HA_FULLTEXT flags are not used anymore (except for storage to keep frms backward compatible). As a side effect ALTER TABLE now detects FULLTEXT index renames correctly	2024-11-05 14:00:47 -08:00
Marko Mäkelä	f493e46494	Merge 11.6 into 11.7	2024-10-03 18:15:13 +03:00
Marko Mäkelä	43465352b9	Merge 11.4 into 11.6	2024-10-03 16:09:56 +03:00
Marko Mäkelä	12a91b57e2	Merge 10.11 into 11.2	2024-10-03 13:24:43 +03:00
Marko Mäkelä	7e0afb1c73	Merge 10.5 into 10.6	2024-10-03 09:31:39 +03:00
Max Kellermann	45298b730b	sql/handler: referenced_by_foreign_key() returns bool The method was declared to return an unsigned integer, but it is really a boolean (and used as such by all callers). A secondary change is the addition of "const" and "noexcept" to this method. In ha_mroonga.cpp, I also added "inline" to the two helper methods of referenced_by_foreign_key(). This allows the compiler to flatten the method.	2024-09-30 16:33:25 +03:00
Libing Song	5bbda97111	MDEV-33853 Async rollback prepared transactions during binlog crash recovery Summary ======= When doing server recovery, the active transactions will be rolled back by InnoDB background rollback thread automatically. The prepared transactions will be committed or rolled back accordingly by binlog recovery. Binlog recovery is done in main thread before the server can provide service to users. If there is a big transaction to rollback, the server will not available for a long time. This patch provides a way to rollback the prepared transactions asynchronously. Thus the rollback will not block server startup. Design ====== - Handler::recover_rollback_by_xid() This patch provides a new handler interface to rollback transactions in recover phase. InnoDB just set the transaction's state to active. Then the transaction will be rolled back by the background rollback thread. - Handler::signal_tc_log_recover_done() This function is called after tc log is opened(typically binlog opened) has done. When this function is called, all transactions will be rolled back have been reverted to ACTIVE state. Thus it starts rollback thread to rollback the transactions. - Background rollback thread With this patch, background rollback thread is defered to run until binlog recovery is finished. It is started by innobase_tc_log_recovery_done().	2024-09-05 21:19:25 +03:00
Alexander Barkov	4e805aed85	Merge remote-tracking branch 'origin/11.4' into 11.5	2024-07-10 12:17:09 +04:00
Alexander Barkov	8aad19ddfc	Merge remote-tracking branch 'origin/11.1' into 11.2	2024-07-09 14:04:11 +04:00
Oleksandr Byelkin	2447dda2c0	Merge branch '10.11' into 11.1	2024-07-08 22:40:16 +02:00
Marko Mäkelä	d1ecf5cc5f	MDEV-32176 Contention in ha_innobase::info_low() During a Sysbench oltp_point_select workload with 1 table and 400 concurrent connections, a bottleneck on dict_table_t::lock_mutex was observed in ha_innobase::info_low(). dict_table_t::lock_latch: Replaces lock_mutex. In ha_innobase::info_low() and several other places, we will acquire a shared dict_table_t::lock_latch or we may elide the latch if hardware memory transactions are available. innobase_build_v_templ(): Remove the parameter "bool locked", and require the caller to hold exclusive dict_table_t::lock_latch (instead of holding an exclusive dict_sys.latch). Tested by: Vladislav Vaintroub Reviewed by: Vladislav Vaintroub	2024-06-28 15:57:07 +03:00
Monty	c4cad8d50c	MDEV-33449 improving repair of tables This task is to ensure we have a clear definition and rules of how to repair or optimize a table. The rules are: - REPAIR should be used with tables that are crashed and are unreadable (hardware issues with not readable blocks, blocks with 'unexpected data' etc) - OPTIMIZE table should be used to optimize the storage layout for the table (recover space for delete rows and optimize the index structure. - ALTER TABLE table_name FORCE should be used to rebuild the .frm file (the table definition) and the table (with the original table row format). If the table is from and older MariaDB/MySQL release with a different storage format, it will convert the data to the new format. ALTER TABLE ... FORCE is used as part of mariadb-upgrade Here follows some more background: The 3 ways to repair a table are: 1) ALTER TABLE table_name FORCE" (not other options). As an alias we allow: "ALTER TABLE table_name ENGINE=original_engine" 2) "REPAIR TABLE" (without FORCE) 3) "OPTIMIZE TABLE" All of the above commands will optimize row space usage (which means that space will be needed to hold a temporary copy of the table) and re-generate all indexes. They will also try to replicate the original table definition as exact as possible. For ALTER TABLE and "REPAIR TABLE without FORCE", the following will hold: If the table is from an older MariaDB version and data conversion is needed (for example for old type HASH columns, MySQL JSON type or new TIMESTAMP format) "ALTER TABLE table_name FORCE, algorithm=COPY" will be used. The differences between the algorithms are 1) Will use the fastest algorithm the engine supports to do a full repair of the table (except if data conversions are is needed). 2) Will use the storage engine internal REPAIR facility (MyISAM, Aria). If the engine does not support REPAIR then "ALTER TABLE FORCE, ALGORITHM=COPY" will be used. If there was data incompatibilities (which means that FORCE was used) then there will be a warning after REPAIR that ALTER TABLE FORCE is still needed. The reason for this is that REPAIR may be able to go around data errors (wrong incompatible data, crashed or unreadable sectors) that ALTER TABLE cannot do. 3) Will use the storage engine internal OPTIMIZE. If engine does not support optimize, then "ALTER TABLE FORCE" is used. The above will ensure that ALTER TABLE FORCE is able to correct almost any errors in the row or index data. In case of corrupted blocks then REPAIR possible followed by ALTER TABLE is needed. This is important as mariadb-upgrade executes ALTER TABLE table_name FORCE for any table that must be re-created. Bugs fixed with InnoDB tables when using ALTER TABLE FORCE: - No error for INNODB_DEFAULT_ROW_FORMAT=COMPACT even if row length would be too wide. (Independent of innodb_strict_mode). - Tables using symlinks will be symlinked after any of the above commands (Independent of the setting of --symbolic-links) If one specifies an algorithm together with ALTER TABLE FORCE, things will work as before (except if data conversion is required as then the COPY algorithm is enforced). ALTER TABLE .. OPTIMIZE ALL PARTITIONS will work as before. Other things: - FORCE argument added to REPAIR to allow one to first run internal repair to fix damaged blocks and then follow it with ALTER TABLE. - REPAIR will not update frm_version if ha_check_for_upgrade() finds that table is still incompatible with current version. In this case the REPAIR will end with an error. - REPAIR for storage engines that does not have native repair, like InnoDB, is now using ALTER TABLE FORCE. - REPAIR csv-table USE_FRM now works. - It did not work before as CSV tables had extension list in wrong order. - Default error messages length for %M increased from 128 to 256 to not cut information from REPAIR. - Documented HA_ADMIN_XX variables related to repair. - Added HA_ADMIN_NEEDS_DATA_CONVERSION to signal that we have to do data conversions when converting the table (and thus ALTER TABLE copy algorithm is needed). - Fixed typo in error message (caused test changes).	2024-05-27 12:39:03 +02:00
Oleksandr Byelkin	dd7d9d7fb1	Merge branch '11.4' into 11.5	2024-05-23 17:01:43 +02:00
Sergei Golubchik	bf5da43e50	Merge branch '11.1' into 11.2	2024-05-13 10:00:26 +02:00
Sergei Petrunia	fe41171c96	MDEV-33533: Crash at execution of DELETE when trying to use rowid filter (Based on original patch by Oleksandr Byelkin) Multi-table DELETE can execute via "buffered" mode: at phase #1 it collects rowids of rows to be deleted, then at phase #2 in multi_delete::do_deletes() it calls handler->rnd_pos() to read rows to be deleted and deletes them. The problem occurred when phase #1 used Rowid Filter on the table that phase #2 would be deleting from. In InnoDB, h->rnd_init(scan=false) and h->rnd_pos() is an index scan over PK under the hood. So, at phase #2 ha_innobase::rnd_init() would try to use the Rowid Filter and hit an assertion inside ha_innobase::rnd_init(). Note that multi-table UPDATE works similarly but was not affected, because patch for MDEV-7487 added code to disable rowid filter for phase #2 in multi_update::do_updates(). This patch changes the approach: - It makes InnoDB not use Rowid Filter in rnd_pos() scans: it is disabled in ha_innobase::rnd_init() and enabled back in ha_innobase::rnd_end(). - multi_update::do_updates() no longer disables Rowid Filter for phase#2 as it is no longer necessary.	2024-05-13 09:52:39 +02:00
Alexander Barkov	fd247cc21f	MDEV-31340 Remove MY_COLLATION_HANDLER::strcasecmp() This patch also fixes: MDEV-33050 Build-in schemas like oracle_schema are accent insensitive MDEV-33084 LASTVAL(t1) and LASTVAL(T1) do not work well with lower-case-table-names=0 MDEV-33085 Tables T1 and t1 do not work well with ENGINE=CSV and lower-case-table-names=0 MDEV-33086 SHOW OPEN TABLES IN DB1 -- is case insensitive with lower-case-table-names=0 MDEV-33088 Cannot create triggers in the database `MYSQL` MDEV-33103 LOCK TABLE t1 AS t2 -- alias is not case sensitive with lower-case-table-names=0 MDEV-33109 DROP DATABASE MYSQL -- does not drop SP with lower-case-table-names=0 MDEV-33110 HANDLER commands are case insensitive with lower-case-table-names=0 MDEV-33119 User is case insensitive in INFORMATION_SCHEMA.VIEWS MDEV-33120 System log table names are case insensitive with lower-cast-table-names=0 - Removing the virtual function strnncoll() from MY_COLLATION_HANDLER - Adding a wrapper function CHARSET_INFO::streq(), to compare two strings for equality. For now it calls strnncoll() internally. In the future it will turn into a virtual function. - Adding new accent sensitive case insensitive collations: - utf8mb4_general1400_as_ci - utf8mb3_general1400_as_ci They implement accent sensitive case insensitive comparison. The weight of a character is equal to the code point of its upper case variant. These collations use Unicode-14.0.0 casefolding data. The result of my_charset_utf8mb3_general1400_as_ci.strcoll() is very close to the former my_charset_utf8mb3_general_ci.strcasecmp() There is only a difference in a couple dozen rare characters, because: - the switch from "tolower" to "toupper" comparison, to make utf8mb3_general1400_as_ci closer to utf8mb3_general_ci - the switch from Unicode-3.0.0 to Unicode-14.0.0 This difference should be tolarable. See the list of affected characters in the MDEV description. Note, utf8mb4_general1400_as_ci correctly handles non-BMP characters! Unlike utf8mb4_general_ci, it does not treat all BMP characters as equal. - Adding classes representing names of the file based database objects: Lex_ident_db Lex_ident_table Lex_ident_trigger Their comparison collation depends on the underlying file system case sensitivity and on --lower-case-table-names and can be either my_charset_bin or my_charset_utf8mb3_general1400_as_ci. - Adding classes representing names of other database objects, whose names have case insensitive comparison style, using my_charset_utf8mb3_general1400_as_ci: Lex_ident_column Lex_ident_sys_var Lex_ident_user_var Lex_ident_sp_var Lex_ident_ps Lex_ident_i_s_table Lex_ident_window Lex_ident_func Lex_ident_partition Lex_ident_with_element Lex_ident_rpl_filter Lex_ident_master_info Lex_ident_host Lex_ident_locale Lex_ident_plugin Lex_ident_engine Lex_ident_server Lex_ident_savepoint Lex_ident_charset engine_option_value::Name - All the mentioned Lex_ident_xxx classes implement a method streq(): if (ident1.streq(ident2)) do_equal(); This method works as a wrapper for CHARSET_INFO::streq(). - Changing a lot of "LEX_CSTRING name" to "Lex_ident_xxx name" in class members and in function/method parameters. - Replacing all calls like system_charset_info->coll->strcasecmp(ident1, ident2) to ident1.streq(ident2) - Taking advantage of the c++11 user defined literal operator for LEX_CSTRING (see m_strings.h) and Lex_ident_xxx (see lex_ident.h) data types. Use example: const Lex_ident_column primary_key_name= "PRIMARY"_Lex_ident_column; is now a shorter version of: const Lex_ident_column primary_key_name= Lex_ident_column({STRING_WITH_LEN("PRIMARY")});	2024-04-18 15:22:10 +04:00
Oleksandr Byelkin	cd28b2479c	Merge branch '11.1' into 11.2	2024-04-09 12:12:33 +02:00
Marko Mäkelä	683fbced6b	Merge 11.0 into 11.1	2024-03-28 12:15:36 +02:00
Alexander Barkov	929c2e06aa	MDEV-31531 Remove my_casedn_str() and my_caseup_str() Under terms of MDEV 27490 we'll add support for non-BMP identifiers and upgrade casefolding information to Unicode version 14.0.0. In Unicode-14.0.0 conversion to lower and upper cases can increase octet length of the string, so conversion won't be possible in-place any more. This patch removes virtual functions performing in-place casefolding: - my_charset_handler_st::casedn_str() - my_charset_handler_st::caseup_str() and fixes the code to use the non-inplace functions instead: - my_charset_handler_st::casedn() - my_charset_handler_st::caseup()	2024-02-28 22:20:29 +04:00
Marko Mäkelä	d73baa402a	Merge 10.11 into 11.0	2024-02-20 12:02:01 +02:00
Marko Mäkelä	466069b184	Merge 10.5 into 10.6	2024-02-08 10:38:53 +02:00
Marko Mäkelä	0381921e26	MDEV-33277 In-place upgrade causes invalid AUTO_INCREMENT values MDEV-33308 CHECK TABLE is modifying .frm file even if --read-only As noted in commit `d0ef1aaf61`, MySQL as well as older versions of MariaDB server would during ALTER TABLE ... IMPORT TABLESPACE write bogus values to the PAGE_MAX_TRX_ID field to pages of the clustered index, instead of letting that field remain 0. In commit `8777458a6e` this field was repurposed for PAGE_ROOT_AUTO_INC in the clustered index root page. To avoid trouble when upgrading from MySQL or older versions of MariaDB, we will try to detect and correct bogus values of PAGE_ROOT_AUTO_INC when opening a table for the first time from the SQL layer. btr_read_autoinc_with_fallback(): Add the parameters to mysql_version,max to indicate the TABLE_SHARE::mysql_version of the .frm file and the maximum value allowed for the type of the AUTO_INCREMENT column. In case the table was originally created in MySQL or an older version of MariaDB, read also the maximum value of the AUTO_INCREMENT column from the table and reset the PAGE_ROOT_AUTO_INC if it is above the limit. dict_table_t::get_index(const dict_col_t &) const: Find an index that starts with the specified column. ha_innobase::check_for_upgrade(): Return HA_ADMIN_FAILED if InnoDB needs upgrading but is in read-only mode. In this way, the call to update_frm_version() will be skipped. row_import_autoinc(): Adjust the AUTO_INCREMENT column at the end of ALTER TABLE...IMPORT TABLESPACE. This refinement was suggested by Debarun Banerjee. The changes outside InnoDB were developed by Michael 'Monty' Widenius: Added print_check_msg() service for easy reporting of check/repair messages in ENGINE=Aria and ENGINE=InnoDB. Fixed that CHECK TABLE do not update the .frm file under --read-only. Added 'handler_flags' to HA_CHECK_OPT as a way for storage engines to store state from handler::check_for_upgrade(). Reviewed by: Debarun Banerjee	2024-02-08 10:35:45 +02:00
Nikita Malyavin	b3f988d260	Add const to get_foreign_key_list/get_parent_foreign_key_list	2023-08-15 10:16:13 +02:00
Sergei Golubchik	d767ed5c89	remove handler::open_read_view() ht->start_consistent_snapshot() is also not a way, because some engines (e.g. rocksdb) only do it readonly. instead, downgrade the lock after reading the first row (which implicitly opens a read view).	2023-08-15 10:16:11 +02:00
Nikita Malyavin	ab4bfad206	MDEV-16329 [5/5] ALTER ONLINE TABLE * Log rows in online_alter_binlog. * Table online data is replicated within dedicated binlog file * Cached data is written on commit. * Versioning is fully supported. * Works both wit and without binlog enabled. * For now savepoints setup is forbidden while ONLINE ALTER goes on. Extra support is required. We can simply log the SAVEPOINT query events and replicate them together with row events. But it's not implemented for now. * Cache flipping: We want to care for the possible bottleneck in the online alter binlog reading/writing in advance. IO_CACHE does not provide anything better that sequential access, besides, only a single write is mutex-protected, which is not suitable, since we should write a transaction atomically. To solve this, a special layer on top Event_log is implemented. There are two IO_CACHE files underneath: one for reading, and one for writing. Once the read cache is empty, an exclusive lock is acquired (we can wait for a currently active transaction finish writing), and flip() is emitted, i.e. the write cache is reopened for read, and the read cache is emptied, and reopened for writing. This reminds a buffer flip that happens in accelerated graphics (DirectX/OpenGL/etc). Cache_flip_event_log is considered non-blocking for a single reader and a single writer in this sense, with the only lock held by reader during flip. An alternative approach by implementing a fair concurrent circular buffer is described in MDEV-24676. * Cache managers: We have two cache sinks: statement and transactional. It is important that the changes are first cached per-statement and per-transaction. If a statement fails, then only statement data is rolled back. The transaction moves along, however. Turns out, there's no guarantee that TABLE well persist in thd->open_tables to the transaction commit moment. If an error occurs, tables from statement are purged. Therefore, we can't store te caches in TABLE. Ideally, it should be handlerton, but we cut the corner and store it in THD in a list.	2023-08-15 10:16:11 +02:00
Yuchen Pei	9b431d714f	MDEV-26137 Improve import tablespace workflow. Allow ALTER TABLE ... IMPORT TABLESPACE without creating the table followed by discarding the tablespace. That is, assuming we want to import table t1 to t2, instead of CREATE TABLE t2 LIKE t1; ALTER TABLE t2 DISCARD TABLESPACE; FLUSH TABLES t1 FOR EXPORT; --copy_file $MYSQLD_DATADIR/test/t1.cfg $MYSQLD_DATADIR/test/t2.cfg --copy_file $MYSQLD_DATADIR/test/t1.ibd $MYSQLD_DATADIR/test/t2.ibd UNLOCK TABLES; ALTER TABLE t2 IMPORT TABLESPACE; We can simply do FLUSH TABLES t1 FOR EXPORT; --copy_file $MYSQLD_DATADIR/test/t1.cfg $MYSQLD_DATADIR/test/t2.cfg --copy_file $MYSQLD_DATADIR/test/t1.frm $MYSQLD_DATADIR/test/t2.frm --copy_file $MYSQLD_DATADIR/test/t1.ibd $MYSQLD_DATADIR/test/t2.ibd UNLOCK TABLES; ALTER TABLE t2 IMPORT TABLESPACE; We achieve this by creating a "stub" table in the second scenario while opening the table, where t2 does not exist but needs to import from t1. The "stub" table is similar to a table that is created but then instructed to discard its tablespace. We include tests with various row formats, encryption, with indexes and auto-increment.	2023-07-04 17:56:27 +10:00
Marko Mäkelä	7ca89af6f8	MDEV-30545 Remove innodb_defragment and related parameters The deprecated parameters will be removed: innodb_defragment innodb_defragment_n_pages innodb_defragment_stats_accuracy innodb_defragment_fill_factor_n_recs innodb_defragment_fill_factor innodb_defragment_frequency The mysql.innodb_index_stats.stat_name values 'n_page_split' and 'n_pages_freed' will lose their special meaning. The related changes to OPTIMIZE TABLE in InnoDB will be removed as well. The parameter innodb_optimize_fulltext_only will retain its special meaning in OPTIMIZE TABLE. Tested by: Matthias Leich	2023-03-11 10:45:35 +02:00
Monty	009db2288b	Fixed limit optimization in range optimizer The issue was that when limit is used, SQL_SELECT::test_quick_select would set the cost of table scan to be unreasonable high to force a range to be used. The problem with this approach was that range was used even when the cost of range, when it would only read 'limit rows' would be higher than the cost of a table scan. This patch fixes it by not accepting ranges when the range can never have a lower cost than a table scan, even if every row would match the WHERE clause.	2023-02-02 23:54:57 +03:00
Monty	b66cdbd1ea	Changing all cost calculation to be given in milliseconds This makes it easier to compare different costs and also allows the optimizer to optimizer different storage engines more reliably. - Added tests/check_costs.pl, a tool to verify optimizer cost calculations. - Most engine costs has been found with this program. All steps to calculate the new costs are documented in Docs/optimizer_costs.txt - User optimizer_cost variables are given in microseconds (as individual costs can be very small). Internally they are stored in ms. - Changed DISK_READ_COST (was DISK_SEEK_BASE_COST) from a hard disk cost (9 ms) to common SSD cost (400MB/sec). - Removed cost calculations for hard disks (rotation etc). - Changed the following handler functions to return IO_AND_CPU_COST. This makes it easy to apply different cost modifiers in ha_..time() functions for io and cpu costs. - scan_time() - rnd_pos_time() & rnd_pos_call_time() - keyread_time() - Enhanched keyread_time() to calculate the full cost of reading of a set of keys with a given number of ranges and optional number of blocks that need to be accessed. - Removed read_time() as keyread_time() + rnd_pos_time() can do the same thing and more. - Tuned cost for: heap, myisam, Aria, InnoDB, archive and MyRocks. Used heap table costs for json_table. The rest are using default engine costs. - Added the following new optimizer variables: - optimizer_disk_read_ratio - optimizer_disk_read_cost - optimizer_key_lookup_cost - optimizer_row_lookup_cost - optimizer_row_next_find_cost - optimizer_scan_cost - Moved all engine specific cost to OPTIMIZER_COSTS structure. - Changed costs to use 'records_out' instead of 'records_read' when recalculating costs. - Split optimizer_costs.h to optimizer_costs.h and optimizer_defaults.h. This allows one to change costs without having to compile a lot of files. - Updated costs for filter lookup. - Use a better cost estimate in best_extension_by_limited_search() for the sorting cost. - Fixed previous issues with 'filtered' explain column as we are now using 'records_out' (min rows seen for table) to calculate filtering. This greatly simplifies the filtering code in JOIN_TAB::save_explain_data(). This change caused a lot of queries to be optimized differently than before, which exposed different issues in the optimizer that needs to be fixed. These fixes are in the following commits. To not have to change the same test case over and over again, the changes in the test cases are done in a single commit after all the critical change sets are done. InnoDB changes: - Updated InnoDB to not divide big range cost with 2. - Added cost for InnoDB (innobase_update_optimizer_costs()). - Don't mark clustered primary key with HA_KEYREAD_ONLY. This will prevent that the optimizer is trying to use index-only scans on the clustered key. - Disabled ha_innobase::scan_time() and ha_innobase::read_time() and ha_innobase::rnd_pos_time() as the default engine cost functions now works good for InnoDB. Other things: - Added --show-query-costs (\Q) option to mysql.cc to show the query cost after each query (good when working with query costs). - Extended my_getopt with GET_ADJUSTED_VALUE which allows one to adjust the value that user is given. This is used to change cost from microseconds (user input) to milliseconds (what the server is internally using). - Added include/my_tracker.h ; Useful include file to quickly test costs of a function. - Use handler::set_table() in all places instead of 'table= arg'. - Added SHOW_OPTIMIZER_COSTS to sys variables. These are input and shown in microseconds for the user but stored as milliseconds. This is to make the numbers easier to read for the user (less pre-zeros). Implemented in 'Sys_var_optimizer_cost' class. - In test_quick_select() do not use index scans if 'no_keyread' is set for the table. This is what we do in other places of the server. - Added THD parameter to Unique::get_use_cost() and check_index_intersect_extension() and similar functions to be able to provide costs to called functions. - Changed 'records' to 'rows' in optimizer_trace. - Write more information to optimizer_trace. - Added INDEX_BLOCK_FILL_FACTOR_MUL (4) and INDEX_BLOCK_FILL_FACTOR_DIV (3) to calculate usage space of keys in b-trees. (Before we used numeric constants). - Removed code that assumed that b-trees has similar costs as binary trees. Replaced with engine calls that returns the cost. - Added Bitmap::find_first_bit() - Added timings to join_cache for ANALYZE table (patch by Sergei Petrunia). - Added records_init and records_after_filter to POSITION to remember more of what best_access_patch() calculates. - table_after_join_selectivity() changed to recalculate 'records_out' based on the new fields from best_access_patch() Bug fixes: - Some queries did not update last_query_cost (was 0). Fixed by moving setting thd->...last_query_cost in JOIN::optimize(). - Write '0' as number of rows for const tables with a matching row. Some internals: - Engine cost are stored in OPTIMIZER_COSTS structure. When a handlerton is created, we also created a new cost variable for the handlerton. We also create a new variable if the user changes a optimizer cost for a not yet loaded handlerton either with command line arguments or with SET @@global.engine.optimizer_cost_variable=xx. - There are 3 global OPTIMIZER_COSTS variables: default_optimizer_costs The default costs + changes from the command line without an engine specifier. heap_optimizer_costs Heap table costs, used for temporary tables tmp_table_optimizer_costs The cost for the default on disk internal temporary table (MyISAM or Aria) - The engine cost for a table is stored in table_share. To speed up accesses the handler has a pointer to this. The cost is copied to the table on first access. If one wants to change the cost one must first update the global engine cost and then do a FLUSH TABLES. This was done to be able to access the costs for an open table without any locks. - When a handlerton is created, the cost are updated the following way: See sql/keycaches.cc for details: - Use 'default_optimizer_costs' as a base - Call hton->update_optimizer_costs() to override with the engines default costs. - Override the costs that the user has specified for the engine. - One handler open, copy the engine cost from handlerton to TABLE_SHARE. - Call handler::update_optimizer_costs() to allow the engine to update cost for this particular table. - There are two costs stored in THD. These are copied to the handler when the table is used in a query: - optimizer_where_cost - optimizer_scan_setup_cost - Simply code in best_access_path() by storing all cost result in a structure. (Idea/Suggestion by Igor)	2023-02-02 23:54:45 +03:00
Monty	b6215b9b20	Update row and key fetch cost models to take into account data copy costs Before this patch, when calculating the cost of fetching and using a row/key from the engine, we took into account the cost of finding a row or key from the engine, but did not consistently take into account index only accessed, clustered key or covered keys for all access paths. The cost of the WHERE clause (TIME_FOR_COMPARE) was not consistently considered in best_access_path(). TIME_FOR_COMPARE was used in calculation in other places, like greedy_search(), but was in some cases (like scans) done an a different number of rows than was accessed. The cost calculation of row and index scans didn't take into account the number of rows that where accessed, only the number of accepted rows. When using a filter, the cost of index_only_reads and cost of accessing and disregarding 'filtered rows' where not taken into account, which made filters cost less than there actually where. To remedy the above, the following key & row fetch related costs has been added: - The cost of fetching and using a row is now split into different costs: - key + Row fetch cost (as before) but multiplied with the variable 'optimizer_cache_cost' (default to 0.5). This allows the user to tell the optimizer the likehood of finding the key and row in the engine cache. - ROW_COPY_COST, The cost copying a row from the engine to the sql layer or creating a row from the join_cache to the record buffer. Mostly affects table scan costs. - ROW_LOOKUP_COST, the cost of fetching a row by rowid. - KEY_COPY_COST the cost of finding the next key and copying it from the engine to the SQL layer. This is used when we calculate the cost index only reads. It makes index scans more expensive than before if they cover a lot of rows. (main.index_merge_myisam) - KEY_LOOKUP_COST, the cost of finding the first key in a range. This replaces the old define IDX_LOOKUP_COST, but with a higher cost. - KEY_NEXT_FIND_COST, the cost of finding the next key (and rowid). when doing a index scan and comparing the rowid to the filter. Before this cost was assumed to be 0. All of the above constants/variables are now tuned to be somewhat in proportion of executing complexity to each other. There is tuning need for these in the future, but that can wait until the above are made user variables as that will make tuning much easier. To make the usage of the above easy, there are new (not virtual) cost calclation functions in handler: - ha_read_time(), like read_time(), but take optimizer_cache_cost into account. - ha_read_and_copy_time(), like ha_read_time() but take into account ROW_COPY_TIME - ha_read_and_compare_time(), like ha_read_and_copy_time() but take TIME_FOR_COMPARE into account. - ha_rnd_pos_time(). Read row with row id, taking ROW_COPY_COST into account. This is used with filesort where we don't need to execute the WHERE clause again. - ha_keyread_time(), like keyread_time() but take optimizer_cache_cost into account. - ha_keyread_and_copy_time(), like ha_keyread_time(), but add KEY_COPY_COST. - ha_key_scan_time(), like key_scan_time() but take optimizer_cache_cost nto account. - ha_key_scan_and_compare_time(), like ha_key_scan_time(), but add KEY_COPY_COST & TIME_FOR_COMPARE. I also added some setup costs for doing different types of scans and creating temporary tables (on disk and in memory). This encourages the optimizer to not use these for simple 'a few row' lookups if there are adequate key lookup strategies. - TABLE_SCAN_SETUP_COST, cost of starting a table scan. - INDEX_SCAN_SETUP_COST, cost of starting an index scan. - HEAP_TEMPTABLE_CREATE_COST, cost of creating in memory temporary table. - DISK_TEMPTABLE_CREATE_COST, cost of creating an on disk temporary table. When calculating cost of fetching ranges, we had a cost of IDX_LOOKUP_COST (0.125) for doing a key div for a new range. This is now replaced with 'io_cost * KEY_LOOKUP_COST (1.0) * optimizer_cache_cost', which matches the cost we use for 'ref' and other key lookups. The effect is that the cost is now a bit higher when we have many ranges for a key. Allmost all calculation with TIME_FOR_COMPARE is now done in best_access_path(). 'JOIN::read_time' now includes the full cost for finding the rows in the table. In the result files, many of the changes are now again close to what they where before the "Update cost for hash and cached joins" commit, as that commit didn't fix the filter cost (too complex to do everything in one commit). The above changes showed a lot of a lot of inconsistencies in optimizer cost calculation. The main objective with the other changes was to do calculation as similar (and accurate) as possible and to make different plans more comparable. Detailed list of changes: - Calculate index_only_cost consistently and correctly for all scan and ref accesses. The row fetch_cost and index_only_cost now takes into account clustered keys, covered keys and index only accesses. - cost_for_index_read now returns both full cost and index_only_cost - Fixed cost calculation of get_sweep_read_cost() to match other similar costs. This is bases on the assumption that data is more often stored on SSD than a hard disk. - Replaced constant 2.0 with new define TABLE_SCAN_SETUP_COST. - Some scan cost estimates did not take into account TIME_FOR_COMPARE. Now all scan costs takes this into account. (main.show_explain) - Added session variable optimizer_cache_hit_ratio (default 50%). By adjusting this on can reduce or increase the cost of index or direct record lookups. The effect of the default is that key lookups is now a bit cheaper than before. See usage of 'optimizer_cache_cost' in handler.h. - JOIN_TAB::scan_time() did not take into account index only scans, which produced a wrong cost when index scan was used. Changed JOIN_TAB:::scan_time() to take into consideration clustered and covered keys. The values are now cached and we only have to call this function once. Other calls are changed to use the cached values. Function renamed to JOIN_TAB::estimate_scan_time(). - Fixed that most index cost calculations are done the same way and more close to 'range' calculations. The cost is now lower than before for small data sets and higher for large data sets as we take into account how many keys are read (main.opt_trace_selectivity, main.limit_rows_examined). - Ensured that index_scan_cost() == range(scan_of_all_rows_in_table_using_one_range) + MULTI_RANGE_READ_INFO_CONST. One effect of this is that if there is choice of doing a full index scan and a range-index scan over almost the whole table then index scan will be preferred (no range-read setup cost). (innodb.innodb, main.show_explain, main.range) - Fixed the EQ_REF and REF takes into account clustered and covered keys. This changes some plans to use covered or clustered indexes as these are much cheaper. (main.subselect_mat_cost, main.state_tables_innodb, main.limit_rows_examined) - Rowid filter setup cost and filter compare cost now takes into account fetching and checking the rowid (KEY_NEXT_FIND_COST). (main.partition_pruning heap.heap_btree main.log_state) - Added KEY_NEXT_FIND_COST to Range_rowid_filter_cost_info::lookup_cost to account of the time to find and check the next key value against the container - Introduced ha_keyread_time(rows) that takes into account finding the next row and copying the key value to 'record' (KEY_COPY_COST). - Introduced ha_key_scan_time() for calculating an index scan over all rows. - Added IDX_LOOKUP_COST to keyread_time() as a startup cost. - Added index_only_fetch_cost() as a convenience function to OPT_RANGE. - keyread_time() cost is slightly reduced to prefer shorter keys. (main.index_merge_myisam) - All of the above caused some index_merge combinations to be rejected because of cost (main.index_intersect). In some cases 'ref' where replaced with index_merge because of the low cost calculation of get_sweep_read_cost(). - Some index usage moved from PRIMARY to a covering index. (main.subselect_innodb) - Changed cost calculation of filter to take KEY_LOOKUP_COST and TIME_FOR_COMPARE into account. See sql_select.cc::apply_filter(). filter parameters and costs are now written to optimizer_trace. - Don't use matchings_records_in_range() to try to estimate the number of filtered rows for ranges. The reason is that we want to ensure that 'range' is calculated similar to 'ref'. There is also more work needed to calculate the selectivity when using ranges and ranges and filtering. This causes filtering column in EXPLAIN EXTENDED to be 100.00 for some cases where range cannot use filtering. (main.rowid_filter) - Introduced ha_scan_time() that takes into account the CPU cost of finding the next row and copying the row from the engine to 'record'. This causes costs of table scan to slightly increase and some test to changed their plan from ALL to RANGE or ALL to ref. (innodb.innodb_mysql, main.select_pkeycache) In a few cases where scan time of very small tables have lower cost than a ref or range, things changed from ref/range to ALL. (main.myisam, main.func_group, main.limit_rows_examined, main.subselect2) - Introduced ha_scan_and_compare_time() which is like ha_scan_time() but also adds the cost of the where clause (TIME_FOR_COMPARE). - Added small cost for creating temporary table for materialization. This causes some very small tables to use scan instead of materialization. - Added checking of the WHERE clause (TIME_FOR_COMPARE) of the accepted rows to ROR costs in get_best_ror_intersect() - Removed '- 0.001' from 'join->best_read' and optimize_straight_join() to ensure that the 'Last_query_cost' status variable contains the same value as the one that was calculated by the optimizer. - Take avg_io_cost() into account in handler::keyread_time() and handler::read_time(). This should have no effect as it's 1.0 by default, except for heap that overrides these functions. - Some 'ref_or_null' accesses changed to 'range' because of cost adjustments (main.order_by) - Added scan type "scan_with_join_cache" for optimizer_trace. This is just to show in the trace what kind of scan was used. - When using 'scan_with_join_cache' take into account number of preceding tables (as have to restore all fields for all previous table combination when checking the where clause) The new cost added is: (row_combinations * ROW_COPY_COST * number_of_cached_tables). This increases the cost of join buffering in proportion of the number of tables in the join buffer. One effect is that full scans are now done earlier as the cost is then smaller. (main.join_outer_innodb, main.greedy_optimizer) - Removed the usage of 'worst_seeks' in cost_for_index_read as it caused wrong plans to be created; It prefered JT_EQ_REF even if it would be much more expensive than a full table scan. A related issue was that worst_seeks only applied to full lookup, not to clustered or index only lookups, which is not consistent. This caused some plans to use index scan instead of eq_ref (main.union) - Changed federated block size from 4096 to 1500, which is the typical size of an IO packet. - Added costs for reading rows to Federated. Needed as there is no caching of rows in the federated engine. - Added ha_innobase::rnd_pos_time() cost function. - A lot of extra things added to optimizer trace - More costs, especially for materialization and index_merge. - Make lables more uniform - Fixed a lot of minor bugs - Added 'trace_started()' around a lot of trace blocks. - When calculating ORDER BY with LIMIT cost for using an index the cost did not take into account the number of row retrivals that has to be done or the cost of comparing the rows with the WHERE clause. The cost calculated would be just a fraction of the real cost. Now we calculate the cost as we do for ranges and 'ref'. - 'Using index for group-by' is used a bit more than before as now take into account the WHERE clause cost when comparing with 'ref' and prefer the method with fewer row combinations. (main.group_min_max). Bugs fixed: - Fixed that we don't calculate TIME_FOR_COMPARE twice for some plans, like in optimize_straight_join() and greedy_search() - Fixed bug in save_explain_data where we could test for the wrong index when displaying 'Using index'. This caused some old plans to show 'Using index'. (main.subselect_innodb, main.subselect2) - Fixed bug in get_best_ror_intersect() where 'min_cost' was not updated, and the cost we compared with was not the one that was used. - Fixed very wrong cost calculation for priority queues in check_if_pq_applicable(). (main.order_by now correctly uses priority queue) - When calculating cost of EQ_REF or REF, we added the cost of comparing the WHERE clause with the found rows, not all row combinations. This made ref and eq_ref to be regarded way to cheap compared to other access methods. - FORCE INDEX cost calculation didn't take into account clustered or covered indexes. - JT_EQ_REF cost was estimated as avg_io_cost(), which is half the cost of a JT_REF key. This may be true for InnoDB primary key, but not for other unique keys or other engines. Now we use handler function to calculate the cost, which allows us to handle consistently clustered, covered keys and not covered keys. - ha_start_keyread() didn't call extra_opt() if keyread was already enabled but still changed the 'keyread' variable (which is wrong). Fixed by not doing anything if keyread is already enabled. - multi_range_read_info_cost() didn't take into account io_cost when calculating the cost of ranges. - fix_semijoin_strategies_for_picked_join_order() used the wrong record_count when calling best_access_path() for SJ_OPT_FIRST_MATCH and SJ_OPT_LOOSE_SCAN. - Hash joins didn't provide correct best_cost to the upper level, which means that the cost for hash_joins more expensive than calculated in best_access_path (a difference of 10x * TIME_OF_COMPARE). This is fixed in the new code thanks to that we now include TIME_OF_COMPARE cost in 'read_time'. Other things: - Added some 'if (thd->trace_started())' to speed up code - Removed not used function Cost_estimate::is_zero() - Simplified testing of HA_POS_ERROR in get_best_ror_intersect(). (No cost changes) - Moved ha_start_keyread() from join_read_const_table() to join_read_const() to enable keyread for all types of JT_CONST tables. - Made a few very short functions inline in handler.h Notes: - In main.rowid_filter the join order of order and lineitem is swapped. This is because the cost of doing a range fetch of lineitem(98 rows) is almost as big as the whole join of order,lineitem. The filtering will also ensure that we only have to do very small key fetches of the rows in lineitem. - main.index_merge_myisam had a few changes where we are now using less keys for index_merge. This is because index scans are now more expensive than before. - handler->optimizer_cache_cost is updated in ha_external_lock(). This ensures that it is up to date per statements. Not an optimal solution (for locked tables), but should be ok for now. - 'DELETE FROM t1 WHERE t1.a > 0 ORDER BY t1.a' does not take cost of filesort into consideration when table scan is chosen. (main.myisam_explain_non_select_all) - perfschema.table_aggregate_global_* has changed because an update on a table with 1 row will now use table scan instead of key lookup. TODO in upcomming commits: - Fix selectivity calculation for ranges with and without filtering and when there is a ref access but scan is chosen. For this we have to store the lowest known value for 'accepted_records' in the OPT_RANGE structure. - Change that records_read does not include filtered rows. - test_if_cheaper_ordering() needs to be updated to properly calculate costs. This will fix tests like main.order_by_innodb, main.single_delete_update - Extend get_range_limit_read_cost() to take into considering cost_for_index_read() if there where no quick keys. This will reduce the computed cost for ORDER BY with LIMIT in some cases. (main.innodb_ext_key) - Fix that we take into account selectivity when counting the number of rows we have to read when considering using a index table scan to resolve ORDER BY. - Add new calculation for rnd_pos_time() where we take into account the benefit of reading multiple rows from the same page.	2023-02-02 21:43:30 +03:00
Marko Mäkelä	2ac1edb1c3	Merge 10.5 into 10.6	2022-11-08 17:37:22 +02:00
Marko Mäkelä	e572c745dc	MDEV-29504/MDEV-29849 TRUNCATE breaks FOREIGN KEY locking ha_innobase::referenced_by_foreign_key(): Protect the check with dict_sys.freeze(), to prevent races with TRUNCATE TABLE. The test innodb.instant_alter_crash has been adjusted for this additional locking. dict_table_is_referenced_by_foreign_key(): Removed (merged to the only caller). create_table_info_t::create_table(): Ignore missing indexes for FOREIGN KEY constraints if foreign_key_checks=0. create_table_info_t::create_table_update_dict(): Rewritten as a static function. Do not return any error. ha_innobase::create(): When trx!=nullptr and we are operating on a persistent table, do not rollback, commit, or release the data dictionary latch. ha_innobase::truncate(): Protect the entire critical section with an exclusive dict_sys.latch, so that ha_innobase::referenced_by_foreign_key() on referenced tables will return a consistent result. In case of a failure, invoke dict_load_foreigns() to restore also any FOREIGN KEY constraints. ha_innobase::free_foreign_key_create_info(): Define inline. lock_release(): Disregard innodb_evict_tables_on_commit_debug=ON when dict_sys.locked() holds. It would hold when fts_load_stopword() is invoked by create_table_info_t::create_table_update_dict(). dict_sys_t::locked(): Return whether the current thread is holding the exclusive dict_sys.latch. dict_sys_t::frozen_not_locked(): Return whether any thread is holding a shared dict_sys.latch. In the test main.mysql_upgrade, the InnoDB persistent statistics will no longer be recalculated in ha_innobase::open() as part of CHECK TABLE ... FOR UPGRADE. They were deleted earlier in the test. Tested by: Matthias Leich	2022-11-08 17:34:34 +02:00
Marko Mäkelä	a732d5e2ba	Merge 10.4 into 10.5	2022-11-08 17:01:28 +02:00
Alexander Barkov	ce443c8554	MDEV-29495 Generalize can_convert_xxx() hook engine API to cover any arbitrary data type	2022-10-27 11:48:46 +04:00
Jan Lindström	9fefd440b5	Merge 10.5 into 10.6	2022-09-05 14:05:30 +03:00
Jan Lindström	ba987a46c9	Merge 10.4 into 10.5	2022-09-05 13:28:56 +03:00
Daniele Sciascia	2917bd0d2c	Reduce compilation dependencies on wsrep_mysqld.h Making changes to wsrep_mysqld.h causes large parts of server code to be recompiled. The reason is that wsrep_mysqld.h is included by sql_class.h, even tough very little of wsrep_mysqld.h is needed in sql_class.h. This commit introduces a new header file, wsrep_on.h, which is meant to be included from sql_class.h, and contains only macros and variable declarations used to determine whether wsrep is enabled. Also, header wsrep.h should only contain definitions that are also used outside of sql/. Therefore, move WSREP_TO_ISOLATION* and WSREP_SYNC_WAIT macros to wsrep_mysqld.h. Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>	2022-08-31 11:05:23 +03:00
Marko Mäkelä	42cb400562	Merge 10.5 into 10.6	2022-03-11 13:35:35 +02:00
Marko Mäkelä	9047a908fe	Merge 10.4 into 10.5	2022-03-11 13:03:33 +02:00
Marko Mäkelä	fc8da65919	After-merge fix: clang -Winconsistent-missing-override The virtual member function that was added in commit `1766a18e06` needs to be declared "override".	2022-03-11 13:02:53 +02:00
Marko Mäkelä	be6f9593fe	Merge 10.5 into 10.6	2022-03-11 09:53:40 +02:00
Marko Mäkelä	81523baac6	Merge 10.4 into 10.5	2022-03-11 09:36:03 +02:00

1 2 3 4 5 ...

314 commits