mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-02-20 04:15:35 +01:00

Author	SHA1	Message	Date
Marko Mäkelä	4179f93d28	MDEV-18976 Implement OPT_PAGE_CHECKSUM log record for improved validation We will introduce an optional log record OPT_PAGE_CHECKSUM for recording page checksums, so that more inconsistencies on crash recovery may be caught. mtr_t::page_checksum(const buf_page_t&): Write OPT_PAGE_CHECKSUM (currently not for ROW_FORMAT=COMPRESSED pages). mtr_t::do_write(): Write OPT_PAGE_CHECKSUM records for all pages (currently, in debug builds only). mtr_t::is_logged(): Return whether log should be written. mtr_t::set_log_mode_sub(const mtr_t&): Set the logging mode of a sub-minitransaction when another mini-transaction is holding latches on some modified pages. When creating or freeing BLOB pages, we may only write OPT_PAGE_CHECKSUM records in the main mini-transaction, after all changes have been written to the log. MTR_LOG_SUB: Log mode for a sub-mini-transaction. mtr_t::free(): Define non-inline, and invoke MarkFreed. MarkFreed: For any matching page in the mini-transaction log, change the first entry to say MTR_MEMO_PAGE_X_MODIFY and any subsequent entries to MTR_MEMO_PAGE_X_FIX. FindModified: Simplify a condition. MTR_MEMO_MODIFY can only be set if MTR_MEMO_PAGE_X_FIX or MTR_MEMO_PAGE_SX_FIX are set. FindBlockX: Consider also MTR_MEMO_PAGE_X_MODIFY. recv_sys_t::parse(): Store OPT_PAGE_CHECKSUM records. log_phys_t::apply(): Validate OPT_PAGE_CHECKSUM records. log_phys_t::page_checksum(): Validate an OPT_PAGE_CHECKSUM record. Tested by: Matthias Leich	2022-06-06 14:05:01 +03:00
Marko Mäkelä	0b47c126e3	MDEV-13542: Crashing on corrupted page is unhelpful The approach to handling corruption that was chosen by Oracle in commit `177d8b0c12` is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table. We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table. This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace. We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted. Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them. buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests. btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure. btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages(). btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get(). FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE. dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls. trx_rseg_t::get(mtr_t): Replaces trx_rsegf_get(). trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed. trx_sys_create_sys_pages(): Merged with trx_sysf_create(). dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function. dir_pathname(): Replaces os_file_make_new_pathname(). row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record. btr_decryption_failed(): Report decryption failures dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove. dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted. dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once. dict_table_is_corrupted(): Remove. dict_index_t::is_btree(): Determine if the index is a valid B-tree. BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove. UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger assertion failures, but error codes being returned. buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict(). fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide. btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED. opt_search_plan_for_table(): Simplify the code. row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist. row_undo_mod_upd_exist_sec(): Simplify the code. row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql(). fut_get_ptr(): Replace with buf_page_get_gen() calls. buf_page_get_gen(): Return nullptr and err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data. buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr. fseg_page_is_allocated(): Replaces fseg_page_is_free(). fts_drop_common_tables(): Return an error if the transaction was rolled back. fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already. fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors. Clean up mtr_t::page_lock() buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch. buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch. mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi(). recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery. recv_recover_page(): Return whether the operation succeeded. recv_sys_t::recover_low(): Simplify the logic. Check for recovery error. Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.	2022-06-06 14:03:22 +03:00
Marko Mäkelä	75096c84b4	MDEV-28525 Some conditions around btr_latch_mode could be eliminated The types btr_latch_mode and mtr_memo_type_t are partly derived from rw_lock_type_t. Despite that, some code for converting between them is using conditions instead of bitwise arithmetics. Let us define btr_latch_mode in such a way that more conversions to rw_lock_type_t are possible by bitwise and. Some SPATIAL INDEX code that assumed !(BTR_MODIFY_TREE & BTR_MODIFY_LEAF) was adjusted.	2022-06-06 11:56:29 +03:00
Marko Mäkelä	1b03db11d2	MDEV-15528 fixup: Remove some dead code btr_page_split_and_insert(): Declare all parameters nonnull. btr_pessimistic_scrub() was removed in commit `a5584b13d1` (MDEV-15528).	2022-06-06 09:52:11 +03:00
Marko Mäkelä	2f8d0af883	Merge 10.5 into 10.6	2022-06-02 17:39:13 +03:00
Marko Mäkelä	5909e0ec31	Cleanup: btr_store_big_rec_extern_fields() does not really modify pcur	2022-06-02 17:22:16 +03:00
Sergei Golubchik	3bc98a4ec4	Merge branch '10.5' into 10.6	2022-05-10 14:01:23 +02:00
Sergei Golubchik	ef781162ff	Merge branch '10.4' into 10.5	2022-05-09 22:04:06 +02:00
Sergei Golubchik	a70a1cf3f4	Merge branch '10.3' into 10.4	2022-05-08 23:03:08 +02:00
Marko Mäkelä	57a9626fe4	Merge 10.5 into 10.6	2022-05-06 11:11:04 +03:00
Marko Mäkelä	26d46234e8	MDEV-28478: INSERT into SPATIAL INDEX in TEMPORARY table writes log This is based on commit `20ae4816bb` with some adjustments for MDEV-12353. row_ins_sec_index_entry_low(): If a separate mini-transaction is needed to adjust the minimum bounding rectangle (MBR) in the parent page, we must disable redo logging if the table is a temporary table. For temporary tables, no log is supposed to be written, because the temporary tablespace will be reinitialized on server restart. rtr_update_mbr_field(), rtr_merge_and_update_mbr(): Changed the return type to void and removed unreachable code. In older versions, these used to return a different value for temporary tables. page_id_t: Add constexpr to most member functions. mtr_t::log_write(): Catch log writes to invalid tablespaces so that the test case would crash without the fix to row_ins_sec_index_entry_low().	2022-05-06 10:12:31 +03:00
Marko Mäkelä	c844a5881a	MDEV-21452 fixup: Remove an unused variable	2022-05-03 13:31:59 +03:00
Marko Mäkelä	0806592ac8	MDEV-28422 Page split breaks a gap lock btr_insert_into_right_sibling(): Inherit any gap lock from the left sibling to the right sibling before inserting the record to the right sibling and updating the node pointer(s). lock_update_node_pointer(): Update locks in case a node pointer will move. Based on mysql/mysql-server@c7d93c274f	2022-04-27 13:38:08 +03:00
Marko Mäkelä	2ca1123464	MDEV-26217 Failing assertion: list.count > 0 in ut_list_remove or Assertion `lock->trx == this' failed in dberr_t trx_t::drop_table This follows up the previous fix in commit `c3c53926c4` (MDEV-26554). ha_innobase::delete_table(): Work around the insufficient metadata locking (MDL) during DML operations by acquiring exclusive InnoDB table locks on all child tables. Previously, this was only done on TRUNCATE and ALTER. ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke lock_update_delete() during change buffer operations. The revised trx_t::commit(std::vector<pfs_os_file_t>&) will hold exclusive lock_sys.latch while invoking fil_delete_tablespace(), which in turn may invoke ibuf_delete_rec(). dict_index_t::has_locking(): A new predicate, replacing the dummy !dict_table_is_locking_disabled(index->table). Used for skipping lock operations during ibuf_delete_rec(). trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks and remove the table from the cache while holding exclusive lock_sys.latch. trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds. trx_t::commit(): Reset dict_operation before invoking commit_in_memory() via commit_persist(). lock_release_on_drop(): Release locks while lock_sys.latch is exclusively locked. lock_table(): Add a parameter for a pointer to the table. We must not dereference the table before a lock_sys.latch has been acquired. If the pointer to the table does not match the table at that point, the table is invalid and DB_DEADLOCK will be returned. row_ins_foreign_check_on_constraint(): Improve the checks. Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed before commit `c5fd9aa562` (MDEV-25919). row_upd_check_references_constraints(), wsrep_row_upd_check_foreign_constraints(): Simplify checks.	2022-04-26 18:09:03 +03:00
Thirunarayanan Balathandayuthapani	4b80c11f52	MDEV-15250 UPSERT during ALTER TABLE results in 'Duplicate entry' error for alter - InnoDB DDL results in `Duplicate entry' if concurrent DML throws duplicate key error. The following scenario explains the problem connection con1: ALTER TABLE t1 FORCE; connection con2: INSERT INTO t1(pk, uk) VALUES (2, 2), (3, 2); In connection con2, InnoDB throws the 'DUPLICATE KEY' error because of unique index. Alter operation will throw the error when applying the concurrent DML log. - Inserting the duplicate key for unique index logs the insert operation for online ALTER TABLE. When insertion fails, transaction does rollback and it leads to logging of delete operation for online ALTER TABLE. While applying the insert log entries, alter operation encounters 'DUPLICATE KEY' error. - To avoid the above fake duplicate scenario, InnoDB should not write any log for online ALTER TABLE before DML transaction commit. - User thread which does DML can apply the online log if InnoDB ran out of online log and index is marked as completed. Set online log error if apply phase encountered any error. It can also clear all other indexes log, marks the newly added indexes as corrupted. - Removed the old online code which was a part of DML operations commit_inplace_alter_table() : Does apply the online log for the last batch of secondary index log and does frees the log for the completed index. trx_t::apply_online_log: Set to true while writing the undo log if the modified table has active DDL trx_t::apply_log(): Apply the DML changes to online DDL tables dict_table_t::is_active_ddl(): Returns true if the table has an active DDL dict_index_t::online_log_make_dummy(): Assign dummy value for clustered index online log to indicate the secondary indexes are being rebuild. dict_index_t::online_log_is_dummy(): Check whether the online log has dummy value ha_innobase_inplace_ctx::log_failure(): Handle the apply log failure for online DDL transaction row_log_mark_other_online_index_abort(): Clear out all other online index log after encountering the error during row_log_apply() row_log_get_error(): Get the error happened during row_log_apply() row_log_online_op(): Does apply the online log if index is completed and ran out of memory. Returns false if apply log fails UndorecApplier: Introduced a class to maintain the undo log record, latched undo buffer page, parse the undo log record, maintain the undo record type, info bits and update vector UndorecApplier::get_old_rec(): Get the correct version of the clustered index record that was modified by the current undo log record UndorecApplier::clear_undo_rec(): Clear the undo log related information after applying the undo log record UndorecApplier::log_update(): Handle the update, delete undo log and apply it on online indexes UndorecApplier::log_insert(): Handle the insert undo log and apply it on online indexes UndorecApplier::is_same(): Check whether the given roll pointer is generated by the current undo log record information trx_t::rollback_low(): Set apply_online_log for the transaction after partially rollbacked transaction has any active DDL prepare_inplace_alter_table_dict(): After allocating the online log, InnoDB does create fulltext common tables. Fulltext index doesn't allow the index to be online. So removed the dead code of online log removal Thanks to Marko Mäkelä for providing the initial prototype and Matthias Leich for testing the issue patiently.	2022-04-25 18:52:19 +05:30
Marko Mäkelä	8f8ba75855	MDEV-27234: Data dictionary recovery was not READ COMMITTED This also fixes MDEV-20198: Instant ALTER TABLE is not crash safe InnoDB dictionary recovery wrongly used the READ UNCOMMITTED isolation level, causing some mismatch. For example, if a table was renamed or replaced in a transaction, according to READ UNCOMMITTED the table might not exist at all. We implement READ COMMITTED isolation level for accessing the dictionary tables SYS_TABLES, SYS_COLUMNS, SYS_INDEXES, SYS_FIELDS, SYS_VIRTUAL, SYS_FOREIGN, SYS_FOREIGN_COLS. For most of these tables, no secondary index exists. For the secondary indexes (on SYS_TABLES.ID, SYS_FOREIGN.FOR_NAME, SYS_FOREIGN.REF_NAME), we will always look up the primary key in the clustered index and check if the record actually is a committed version. dict_check_sys_tables(): Recover tablespaces also from delete-marked committed records, so that if a matching .ibd file exists, it will be removed by fil_delete_tablespace() when the committed delete-marked SYS_INDEXES record of the clustered index is purged in row_purge_remove_clust_if_poss_low(). fil_ibd_open(): Change the Boolean parameter "validate" to a ternary one, to suppress error messages when the file might not exist. It is possible that a .ibd file was deleted and the server shut down before the SYS_INDEXES and SYS_TABLES records were purged. Hence, if dict_check_sys_tables() finds a committed delete-marked record, we must not complain if the tablespace file is not found. On Windows, we msut treat ERROR_PATH_NOT_FOUND (directory not found) in the same way as ERROR_FILE_NOT_FOUND. This fixes a few failures where a previous test successfully executed DROP DATABASE (and deleted all files and the directory), but a committed delete-marked SYS_TABLES record had not been purged before server restart. dict_getnext_system_low(): Do not filter out delete-marked records. dict_startscan_system(), dict_getnext_system(): Do filter out delete-marked records, for accessing the INFORMATION_SCHEMA tables. dict_sys_tables_rec_read(): Return the DB_TRX_ID of the committed version of the record. This is needed in dict_load_table_low(). dict_load_foreign_cols(), dict_load_foreign(): Add a parameter for the current transaction identifier. In some DDL operations, the FOREIGN KEY constraints are being loaded from the data dictionary before the DDL transaction has been committed. For SYS_FOREIGN and SYS_FOREIGN_COLS, we must implement the special case of READ COMMITTED that the changes of the uncommitted current transaction are visible. dict_load_foreign(): Validate the table name. We could find a SYS_FOREIGN.ID via a committed delete-marked secondary index record that does not match the REF_NAME or FOR_NAME of the secondary index record. dict_load_index_low(): Optionally take the table as a parameter, so that table->def_trx_id can be updated in case of a committed delete-marked SYS_INDEXES record corresponding to DROP INDEX, but not corresponding to an index stub of ADD INDEX. dict_load_indexes(): Do not update table->def_trx_id in case of delete-marked records. rec_is_metadata(), rec_offs_make_valid(), rec_get_offsets_func(), row_build_low(): Relax some assertions. We may now have !index->is_instant() even if a metadata record is present in the index. Previously, the recovery of instant ADD/DROP COLUMN assumed that READ UNCOMMITTED of the data dictionary will be performed. Now, we will have a READ COMMITTED copy of the data dictionary cache, and a READ UNCOMMITTED copy of the metadata record. btr_page_reorganize_low(): Correctly update the FIL_PAGE_TYPE when rolling back an instant ADD/DROP COLUMN operation. row_rec_to_index_entry_impl(): Relax some assertions, and disallow accessing "extra" fields. This fixes the recovery of a crash during an instant ADD COLUMN after a successful instant DROP COLUMN, in the test innodb.instant_alter_crash. Tested by: Matthias Leich	2022-03-28 08:37:51 +03:00
Vlad Lesin	202316a38f	Merge 10.5 into 10.6	2022-03-07 18:42:47 +03:00
Marko Mäkelä	2dce3bad9c	Merge 10.4 into 10.5	2022-03-07 09:26:50 +02:00
Marko Mäkelä	7b97020d40	Merge 10.3 into 10.4	2022-03-07 09:05:36 +02:00
Marko Mäkelä	02da00a98c	Merge 10.2 into 10.3	2022-03-04 14:29:36 +02:00
Marko Mäkelä	a92f07f4bd	MDEV-27993 Assertion failed in btr_page_reorganize_low() btr_cur_optimistic_insert(): Disregard DEBUG_DBUG injection to invoke btr_page_reorganize() if the page (and the table) is empty. Otherwise, an assertion would fail in btr_page_reorganize_low() because PAGE_MAX_TRX_ID is 0 in an empty secondary index leaf page.	2022-03-03 11:51:25 +02:00
Vlad Lesin	f6f055a191	Merge 10.3 into 10.4	2022-02-21 14:10:27 +03:00
Vlad Lesin	a6f258e47f	MDEV-20605 Awaken transaction can miss inserted by other transaction records due to wrong persistent cursor restoration Backported from 10.5 `20e9e804c1` and `5948d7602e`. sel_restore_position_for_mysql() moves forward persistent cursor position after btr_pcur_restore_position() call if cursor relative position is BTR_PCUR_ON and the cursor points to the record with NOT the same field values as in a stored record(and some other not important for this case conditions). It was done because btr_pcur_restore_position() sets page_cur_mode_t mode to PAGE_CUR_LE for cursor->rel_pos == BTR_PCUR_ON before opening cursor. So we are searching for the record less or equal to stored one. And if the found record is not equal to stored one, then it is less and we need to move cursor forward. But there can be a situation when the stored record was purged, but the new one with the same key but different value was inserted while row_search_mvcc() was suspended. In this case, when the thread is awaken, it will invoke sel_restore_position_for_mysql(), which, in turns, invoke btr_pcur_restore_position(), which will return false because found record don't match stored record, and sel_restore_position_for_mysql() will move forward cursor position. The above can lead to the case when awaken row_search_mvcc() do not see records inserted by other transactions while it slept. The mtr test case shows the example how it can be. The fix is to return special value from persistent cursor restoring function which would notify its caller that uniq fields of restored record and stored record are the same, and in this case sel_restore_position_for_mysql() don't move cursor forward. Delete-marked records are correctly processed in row_search_mvcc(). Non-unique secondary indexes are "uniquified" by adding the PK, the index->n_uniq should then be index->n_fields. So there is no need in additional checks in the fix. If transaction's readview can't see the changes made in secondary index record, it requests clustered index record in row_search_mvcc() to check its transaction id and get the correspondent record version. After this row_search_mvcc() commits mtr to preserve clustered index latching order, and starts mtr. Between those mtr commit and start secondary index pages are unlatched, and purge has the ability to remove stored in the cursor record, what causes rows duplication in result set for non-locking reads, as cursor position is restored to the previously visited record. To solve this the changes are just switched off for non-locking reads, it's quite simple solution, besides the changes don't make sense for non-locking reads. The more complex and effective from performance perspective solution is to create mtr savepoint before clustered record requesting and rolling back to that savepoint after that. See MDEV-27557. One more solution is to have per-record transaction id for secondary indexes. See MDEV-17598. If any of those is implemented, just remove select_lock_type argument in sel_restore_position_for_mysql().	2022-02-21 12:49:54 +03:00
Vlad Lesin	f2f22c382b	Merge 10.5 into 10.6	2022-02-14 18:30:51 +03:00
Vlad Lesin	20e9e804c1	MDEV-20605 Awaken transaction can miss inserted by other transaction records due to wrong persistent cursor restoration sel_restore_position_for_mysql() moves forward persistent cursor position after btr_pcur_restore_position() call if cursor relative position is BTR_PCUR_ON and the cursor points to the record with NOT the same field values as in a stored record(and some other not important for this case conditions). It was done because btr_pcur_restore_position() sets page_cur_mode_t mode to PAGE_CUR_LE for cursor->rel_pos == BTR_PCUR_ON before opening cursor. So we are searching for the record less or equal to stored one. And if the found record is not equal to stored one, then it is less and we need to move cursor forward. But there can be a situation when the stored record was purged, but the new one with the same key but different value was inserted while row_search_mvcc() was suspended. In this case, when the thread is awaken, it will invoke sel_restore_position_for_mysql(), which, in turns, invoke btr_pcur_restore_position(), which will return false because found record don't match stored record, and sel_restore_position_for_mysql() will move forward cursor position. The above can lead to the case when awaken row_search_mvcc() do not see records inserted by other transactions while it slept. The mtr test case shows the example how it can be. The fix is to return special value from persistent cursor restoring function which would notify its caller that uniq fields of restored record and stored record are the same, and in this case sel_restore_position_for_mysql() don't move cursor forward. Delete-marked records are correctly processed in row_search_mvcc(). Non-unique secondary indexes are "uniquified" by adding the PK, the index->n_uniq should then be index->n_fields. So there is no need in additional checks in the fix. If transaction's readview can't see the changes made in secondary index record, it requests clustered index record in row_search_mvcc() to check its transaction id and get the correspondent record version. After this row_search_mvcc() commits mtr to preserve clustered index latching order, and starts mtr. Between those mtr commit and start secondary index pages are unlatched, and purge has the ability to remove stored in the cursor record, what causes rows duplication in result set for non-locking reads, as cursor position is restored to the previously visited record. To solve this the changes are just switched off for non-locking reads, it's quite simple solution, besides the changes don't make sense for non-locking reads. The more complex and effective from performance perspective solution is to create mtr savepoint before clustered record requesting and rolling back to that savepoint after that. See MDEV-27557. One more solution is to have per-record transaction id for secondary indexes. See MDEV-17598. If any of those is implemented, just remove select_lock_type argument in sel_restore_position_for_mysql().	2022-02-14 17:35:04 +03:00
Marko Mäkelä	ba5ef63ae1	MDEV-27476 heap-use-after-free in buf_pool_t::is_block_field() This follows up commit `017d1b867b`. In commit `aaef2e1d8c` (MDEV-27058) some more problematic debug assertions were added. btr_search_update_block_hash_info(), trx_purge_truncate_history(): Use simpler assertions to check that an uncompressed page is present.	2022-01-12 12:34:07 +02:00
Marko Mäkelä	aaef2e1d8c	MDEV-27058: Reduce the size of buf_block_t and buf_page_t buf_page_t::frame: Moved from buf_block_t::frame. All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED pages will have frame=nullptr, while all 'fat' buf_block_t will have a non-null frame pointing to aligned innodb_page_size bytes. This eliminates the need for separate states for BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE. buf_page_t:🔒 Moved from buf_block_t::lock. That is, all block descriptors will have a page latch. The IO_PIN state that was used for discarding or creating the uncompressed page frame of a ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix and page X-latch. page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status of buf_page_t with a single std::atomic<uint32_t>. All modifications will use store(), fetch_add(), fetch_sub(). This space was previously wasted to alignment on 64-bit systems. We will use the following encoding that combines a state (partly read-fix or write-fix) and a buffer-fix count: buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED) buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY) buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH) buf_page_t::FREED=3 + fix: pages marked as freed in the file buf_page_t::UNFIXED=1U<<29 + fix: normal pages buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite) buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched) buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched) buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite) buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch. buf_page_t::read_complete(): Renamed from buf_page_read_complete(). Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch. buf_page_t::can_relocate(): If the page latch is being held or waited for, or the block is buffer-fixed or io-fixed, return false. (The condition on the page latch is new.) Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we will acquire the page latch before fix(), and unfix() before unlocking. buf_page_t::flush(): Replaces buf_flush_page(). Optimize the handling of FREED pages. buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held by the caller. buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates. buf_page_get_low(): Ignore guesses that are read-fixed because they may not yet be registered in buf_pool.page_hash and buf_pool.LRU. buf_page_optimistic_get(): Acquire latch before buffer-fixing. buf_page_make_young(): Leave read-fixed blocks alone, because they might not be registered in buf_pool.LRU yet. recv_sys_t::recover_deferred(), recv_sys_t::recover_low(): Possibly fix MDEV-26326, by holding a page X-latch instead of only buffer-fixing the page.	2021-11-18 17:47:19 +02:00
Marko Mäkelä	02e72f7b44	MDEV-26769 follow-up: Remove unnecessary page locking btr_cur_optimistic_latch_leaves(): Use transactional_shared_lock_guard. btr_cur_latch_leaves(): Avoid acquiring some page latches, because the changes are already blocked by index->lock. btr_cur_search_to_nth_level_func(): Remove a redundant variable retrying_for_search_prev=!!prev_tree_blocks, and avoid acquiring some page latches.	2021-11-18 17:44:33 +02:00
Marko Mäkelä	1f02280904	MDEV-26769 InnoDB does not support hardware lock elision This implements memory transaction support for: * Intel Restricted Transactional Memory (RTM), also known as TSX-NI (Transactional Synchronization Extensions New Instructions) * POWER v2.09 Hardware Trace Monitor (HTM) on GNU/Linux transactional_lock_guard, transactional_shared_lock_guard: RAII lock guards that try to elide the lock acquisition when transactional memory is available. buf_pool.page_hash: Try to elide latches whenever feasible. Related to the InnoDB change buffer and ROW_FORMAT=COMPRESSED tables, this is not always possible. In buf_page_get_low(), memory transactions only work reasonably well for validating a guessed block address. TMLockGuard, TMLockTrxGuard, TMLockMutexGuard: RAII lock guards that try to elide lock_sys.latch and related latches.	2021-10-22 12:38:45 +03:00
Marko Mäkelä	c091a0bc8d	MDEV-26826 Duplicated computations of buf_pool.page_hash addresses Since commit `bd5a6403ca` (MDEV-26033) we can actually calculate the buf_pool.page_hash cell and latch addresses while not holding buf_pool.mutex. buf_page_alloc_descriptor(): Remove the MEM_UNDEFINED. We now expect buf_page_t::hash to be zero-initialized. buf_pool_t::hash_chain: Dedicated data type for buf_pool.page_hash.array. buf_LRU_free_one_page(): Merged to the only caller buf_pool_t::corrupted_evict().	2021-10-22 12:33:37 +03:00
Marko Mäkelä	d95361107c	Merge 10.5 into 10.6	2021-09-24 14:38:52 +03:00
Marko Mäkelä	7e2b42324c	Merge 10.4 into 10.5	2021-09-24 08:42:23 +03:00
Marko Mäkelä	9024498e88	Merge 10.3 into 10.4	2021-09-22 18:26:54 +03:00
Marko Mäkelä	b46cf33ab8	Merge 10.2 into 10.3	2021-09-22 18:01:41 +03:00
Marko Mäkelä	3209bc667f	MDEV-26636: InnoDB defragmentation statistics cause races on TEMPORARY TABLE btr_defragment_save_defrag_stats_if_needed(): Do not save defragmentation statistics for temporary tables. They are exempt of defragmentation anyway (ha_innobase::optimize() never invokes defragmentation for them), and the user-visible names are not available inside InnoDB. Furthermore, InnoDB assumes that temporary tables are never accessed by other threads than the one that handles the session with which the temporary table is associated with. Furthermore, we simplify the test innodb.innodb_defrag_stats and include a test case that demonstrates that defragmentation statistics are no longer being saved for temporary tables.	2021-09-18 15:47:52 +03:00
Marko Mäkelä	1e9c922fa7	MDEV-26623 Possible race condition between statistics and bulk insert When computing statistics, let us play it safe and check whether an insert into an empty table is in progress, once we have acquired the root page latch. If yes, let us pretend that the table is empty, just like MVCC reads would do. It is unlikely that this race condition could lead into any crashes before MDEV-24621, because the index tree structure should be protected by the page latches. But, at least we can avoid some busy work and return earlier. As part of this, some code that is only used for statistics calculation is being moved into static functions in that compilation unit.	2021-09-17 14:40:41 +03:00
Marko Mäkelä	1a4a7dddb6	Cleanup: Make btr_root_block_get() more robust btr_root_block_get(): Check for index->page == FIL_NULL. btr_root_get(): Declare static. Other callers can invoke btr_root_block_get() directly. btr_get_size(): Remove conditions that are checked in btr_root_block_get().	2021-09-17 14:40:41 +03:00
Marko Mäkelä	277ba134ad	MDEV-26467: Avoid futile spin loops Typically, index_lock and fil_space_t::latch will be held for a longer time than the spin loop in latch acquisition would be waiting for. Let us avoid spin loops for those as well as dict_sys.latch, which could be held in exclusive mode for a longer time (while loading metadata into the buffer pool and the dictionary cache). Performance testing on a dual Intel Xeon E5-2630 v4 (2 NUMA nodes) suggests that the buffer pool page latch (block_lock) benefits from a spin loop in both read-only and read-write workloads where the working set is slightly larger than the buffer pool. Presumably, most contention would occur on leaf page latches. Contention on upper level pages in the buffer pool should intuitively last longer. We introduce srw_spin_lock and srw_spin_mutex to allow users of srw_lock or srw_mutex to opt in for the spin loop. On Microsoft Windows, a spin loop variant was and will not be available; srw_mutex and srw_lock will simply wrap SRWLOCK. That is, on Microsoft Windows, the parameters innodb_sync_spin_loops and innodb_spin_wait_delay will only affect block_lock.	2021-09-06 12:32:24 +03:00
Marko Mäkelä	7730dd392b	Merge 10.5 into 10.6	2021-09-06 10:31:32 +03:00
Marko Mäkelä	84c578c795	MDEV-26547 Restoring InnoDB buffer pool dump is single-threaded for no reason buf_read_page_background(): Remove the parameter "bool sync" and always actually initiate a page read in the background. buf_load(): Always submit asynchronous reads. This allows page checksums to be verified in concurrent threads as soon as the reads are completed.	2021-09-06 10:14:24 +03:00
Marko Mäkelä	c5fd9aa562	MDEV-25919: Lock tables before acquiring dict_sys.latch In commit `1bd681c8b3` (MDEV-25506 part 3) we introduced a "fake instant timeout" when a transaction would wait for a table or record lock while holding dict_sys.latch. This prevented a deadlock of the server but could cause bogus errors for operations on the InnoDB persistent statistics tables. A better fix is to ensure that whenever a transaction is being executed in the InnoDB internal SQL parser (which will for now require dict_sys.latch to be held), it will already have acquired all locks that could be required for the execution. So, we will acquire the following locks upfront, before acquiring dict_sys.latch: (1) MDL on the affected user table (acquired by the SQL layer) (2) If applicable (not for RENAME TABLE): InnoDB table lock (3) If persistent statistics are going to be modified: (3.a) MDL_SHARED on mysql.innodb_table_stats, mysql.innodb_index_stats (3.b) exclusive table locks on the statistics tables (4) Exclusive table locks on the InnoDB data dictionary tables (not needed in ANALYZE TABLE and the like) Note: Acquiring exclusive locks on the statistics tables may cause more locking conflicts between concurrent DDL operations. Notably, RENAME TABLE will lock the statistics tables even if no persistent statistics are enabled for the table. DROP DATABASE will only acquire locks on statistics tables if persistent statistics are enabled for the tables on which the SQL layer is invoking ha_innobase::delete_table(). For any "garbage collection" in innodb_drop_database(), a timeout while acquiring locks on the statistics tables will result in any statistics not being deleted for any tables that the SQL layer did not know about. If innodb_defragment=ON, information may be written to the statistics tables even for tables for which InnoDB persistent statistics are disabled. But, DROP TABLE will no longer attempt to delete that information if persistent statistics are not enabled for the table. This change should also fix the hangs related to InnoDB persistent statistics and STATS_AUTO_RECALC (MDEV-15020) as well as a bug that running ALTER TABLE on the statistics tables concurrently with running ALTER TABLE on InnoDB tables could cause trouble. lock_rec_enqueue_waiting(), lock_table_enqueue_waiting(): Do not issue a fake instant timeout error when the transaction is holding dict_sys.latch. Instead, assert that the dict_sys.latch is never being held here. lock_sys_tables(): A new function to acquire exclusive locks on all dictionary tables, in case DROP TABLE or similar operation is being executed. Locking non-hard-coded tables is optional to avoid a crash in row_merge_drop_temp_indexes(). The SYS_VIRTUAL table was introduced in MySQL 5.7 and MariaDB Server 10.2. Normally, we require all these dictionary tables to exist before executing any DDL, but the function row_merge_drop_temp_indexes() is an exception. When upgrading from MariaDB Server 10.1 or MySQL 5.6 or earlier, the table SYS_VIRTUAL would not exist at this point. ha_innobase::commit_inplace_alter_table(): Invoke log_write_up_to() while not holding dict_sys.latch. dict_sys_t::remove(), dict_table_close(): No longer try to drop index stubs that were left behind by aborted online ADD INDEX. Such indexes should be dropped from the InnoDB data dictionary by row_merge_drop_indexes() as part of the failed DDL operation. Stubs for aborted indexes may only be left behind in the data dictionary cache. dict_stats_fetch_from_ps(): Use a normal read-only transaction. ha_innobase::delete_table(), ha_innobase::truncate(), fts_lock_table(): While waiting for purge to stop using the table, do not hold dict_sys.latch. ha_innobase::delete_table(): Implement a work-around for the rollback of ALTER TABLE...ADD PARTITION. MDL_EXCLUSIVE would not be held if ALTER TABLE hits lock_wait_timeout while trying to upgrade the MDL due to a conflicting LOCK TABLES, such as in the first ALTER TABLE in the test case of Bug#53676 in parts.partition_special_innodb. Therefore, we must explicitly stop purge, because it would not be stopped by MDL. dict_stats_func(), btr_defragment_chunk(): Allocate a THD so that we can acquire MDL on the InnoDB persistent statistics tables. mysqltest_embedded: Invoke ha_pre_shutdown() before free_used_memory() in order to avoid ASAN heap-use-after-free related to acquire_thd(). trx_t::dict_operation_lock_mode: Changed the type to bool. row_mysql_lock_data_dictionary(), row_mysql_unlock_data_dictionary(): Implemented as macros. rollback_inplace_alter_table(): Apply an infinite timeout to lock waits. innodb_thd_increment_pending_ops(): Wrapper for thd_increment_pending_ops(). Never attempt async operation for InnoDB background threads, such as the trx_t::commit() in dict_stats_process_entry_from_recalc_pool(). lock_sys_t::cancel(trx_t*): Make dictionary transactions immune to KILL. lock_wait(): Make dictionary transactions immune to KILL, and to lock wait timeout when waiting for locks on dictionary tables. parts.partition_special_innodb: Use lock_wait_timeout=0 to instantly get ER_LOCK_WAIT_TIMEOUT. main.mdl: Filter out MDL on InnoDB persistent statistics tables Reviewed by: Thirunarayanan Balathandayuthapani	2021-08-31 13:54:44 +03:00
Marko Mäkelä	82b7c561b7	MDEV-24258 Merge dict_sys.mutex into dict_sys.latch In the parent commit, dict_sys.latch could theoretically have been replaced with a mutex. But, we can do better and merge dict_sys.mutex into dict_sys.latch. Generally, every occurrence of dict_sys.mutex_lock() will be replaced with dict_sys.lock(). The PERFORMANCE_SCHEMA instrumentation for dict_sys_mutex will be removed along with dict_sys.mutex. The dict_sys.latch will remain instrumented as dict_operation_lock. Some use of dict_sys.lock() will be replaced with dict_sys.freeze(), which we will reintroduce for the new shared mode. Most notably, concurrent table lookups are possible as long as the tables are present in the dict_sys cache. In particular, this will allow more concurrency among InnoDB purge workers. Because dict_sys.mutex will no longer 'throttle' the threads that purge InnoDB transaction history, a performance degradation may be observed unless innodb_purge_threads=1. The table cache eviction policy will become FIFO-like, similar to what happened to fil_system.LRU in commit `45ed9dd957`. The name of the list dict_sys.table_LRU will become somewhat misleading; that list contains tables that may be evicted, even though the eviction policy no longer is least-recently-used but first-in-first-out. (Note: Tables can never be evicted as long as locks exist on them or the tables are in use by some thread.) As demonstrated by the test perfschema.sxlock_func, there will be less contention on dict_sys.latch, because some previous use of exclusive latches will be replaced with shared latches. fts_parse_sql_no_dict_lock(): Replaced with pars_sql(). fts_get_table_name_prefix(): Merged to fts_optimize_create(). dict_stats_update_transient_for_index(): Deduplicated some code. ha_innobase::info_low(), dict_stats_stop_bg(): Use a combination of dict_sys.latch and table->stats_mutex_lock() to cover the changes of BG_STAT_SHOULD_QUIT, because the flag is being read in dict_stats_update_persistent() while not holding dict_sys.latch. row_discard_tablespace_for_mysql(): Protect stats_bg_flag by exclusive dict_sys.latch, like most other code does. row_quiesce_table_has_fts_index(): Remove unnecessary mutex acquisition. FLUSH TABLES...FOR EXPORT is protected by MDL. row_import::set_root_by_heuristic(): Remove unnecessary mutex acquisition. ALTER TABLE...IMPORT TABLESPACE is protected by MDL. row_ins_sec_index_entry_low(): Replace a call to dict_set_corrupted_index_cache_only(). Reads of index->type were not really protected by dict_sys.mutex, and writes (flagging an index corrupted) should be extremely rare. dict_stats_process_entry_from_defrag_pool(): Only freeze the dictionary, do not lock it exclusively. dict_stats_wait_bg_to_stop_using_table(), DICT_BG_YIELD: Remove trx. We can simply invoke dict_sys.unlock() and dict_sys.lock() directly. dict_acquire_mdl_shared()<trylock=false>: Assert that dict_sys.latch is only held in shared more, not exclusive mode. Only acquire it in exclusive mode if the table needs to be loaded to the cache. dict_sys_t::acquire(): Remove. Relocating elements in dict_sys.table_LRU would require holding an exclusive latch, which we want to avoid for performance reasons. dict_sys_t::allow_eviction(): Add the table first to dict_sys.table_LRU, to compensate for the removal of dict_sys_t::acquire(). This function is only invoked by INFORMATION_SCHEMA.INNODB_SYS_TABLESTATS. dict_table_open_on_id(), dict_table_open_on_name(): If dict_locked=false, try to acquire dict_sys.latch in shared mode. Only acquire the latch in exclusive mode if the table is not found in the cache. Reviewed by: Thirunarayanan Balathandayuthapani	2021-08-31 13:51:35 +03:00
Marko Mäkelä	49f95c4065	Merge 10.5 into 10.6	2021-08-23 11:21:33 +03:00
Marko Mäkelä	2c9f2a4c8c	Merge 10.4 into 10.5	2021-08-23 11:10:59 +03:00
Marko Mäkelä	2b66cd2493	Merge 10.3 into 10.4	2021-08-23 10:44:06 +03:00
Marko Mäkelä	cfbdb5d210	Merge 10.2 into 10.3	2021-08-23 10:14:01 +03:00
Marko Mäkelä	ca89489716	MDEV-26383 fixup: Consistently protect freed_indexes with autoinc_mutex To avoid potential race conditions between concurrent access to dict_table_t::freed_indexes, let us consistently use dict_table_t::autoinc_mutex. dict_table_remove_from_cache_low(): To avoid extensive hold time of table->autoinc_mutex, unconditionally free the FTS data structures.	2021-08-23 10:06:21 +03:00
Thirunarayanan Balathandayuthapani	08e5a3d2e3	MDEV-26383 ASAN heap-use-after-free failure in btr_search_lazy_free Problem: ======= The last AHI page for two indexes of an dropped table is being freed at the same time by two threads. One thread frees the table heap and other thread tries to access table heap again. It leads to asan failure in btr_search_lazy_free(). Solution: ======== InnoDB uses autoinc_mutex to avoid the race condition in btr_search_lazy_free()	2021-08-21 12:38:10 +05:30
Marko Mäkelä	f3fcf5f45c	Merge 10.5 to 10.6	2021-08-19 12:25:00 +03:00
Marko Mäkelä	4a25957274	Merge 10.4 into 10.5	2021-08-18 18:22:35 +03:00

1 2 3 4 5 ...

1032 commits