1. The merge aeccbbd926 overwrote lock0lock.cc, and the changes of
MDEV-29622 and MDEV-29635 were partially lost; this commit restores them.
2. innodb.deadlock_wait_thr_race test:
The following hang was found during testing.
There is a deadlock_report_before_lock_releasing sync point in
Deadlock::report(), which waits for the sel_cont signal while holding the
lock_sys latch. The signal must be issued after the "UPDATE t SET b = 100"
rollback, and that rollback is executing an undo record, which is blocked
on a dict_sys latch request. dict_sys is latched by the statistics update
thread (dict_stats_save()), and during that update the lock_sys latch is
requested and cannot be acquired, because Deadlock::report() holds it. We
have to disable the statistics update to make the test stable.
But even if the statistics update is disabled, and a transaction with a
consistent snapshot is started at the very beginning of the test to
prevent purging, purge can still be invoked for system tables, and it
tries to open a system table by id, which causes a dict_sys.freeze() call
and dict_sys latching. In combination with lock_sys::xx_lock(), this
causes the same deadlock as described above. We need to disable purging
globally for the test as well.
All the above also applies to the innodb.deadlock_wait_lock_race test.
To prevent an ASAN heap-use-after-poison failure in the MDEV-16549 part of
./mtr --repeat=6 main.derived
the initialization of Name_resolution_context was cleaned up.
The lock is created during page splitting after moving records and
locks (lock_move_rec_list_(start|end)()) to the new page, and inheriting
the locks to the supremum of the left page from the successor of the
infimum on the right page.
There is no need for such inheritance at the READ COMMITTED isolation
level for not-gap locks, so the fix is to add the corresponding condition
to the gap lock inheritance function.
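A self-contained sketch of the added condition (the types and names below
are simplified stand-ins for the InnoDB ones, not the real declarations):

  enum class iso_level { READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ };

  struct lock_sketch {
    iso_level isolation;  // isolation level of the lock owner
    bool is_not_gap;      // LOCK_REC_NOT_GAP: locks the record only
  };

  // Should a lock on a removed/moved record be inherited as a gap lock
  // on the successor record?
  bool inherits_to_gap(const lock_sketch &l) {
    // A not-gap lock of a READ COMMITTED (or weaker) transaction
    // protects only the record itself, so there is nothing to inherit
    // to the gap.
    return !(l.is_not_gap && l.isolation <= iso_level::READ_COMMITTED);
  }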
One more fix is to forbid gap lock inheritance if XA was prepared. Use the
most significant bit of trx_t::n_ref to indicate that gap lock inheritance
is forbidden. This fix is based on
mysql/mysql-server@b063e52a83
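A minimal sketch of the bit-packing idea with a 32-bit word (the real
member is trx_t::n_ref; this class is only illustrative):

  #include <atomic>
  #include <cstdint>

  class ref_count_with_flag {
    std::atomic<uint32_t> word{0};
    // Most significant bit flags "gap lock inheritance forbidden".
    static constexpr uint32_t NO_GAP_INHERIT= 1U << 31;
  public:
    void ref()   { word.fetch_add(1, std::memory_order_relaxed); }
    void unref() { word.fetch_sub(1, std::memory_order_relaxed); }
    // Set on XA PREPARE: gap locks must no longer be inherited.
    void forbid_gap_inheritance()
    { word.fetch_or(NO_GAP_INHERIT, std::memory_order_relaxed); }
    bool gap_inheritance_forbidden() const
    { return word.load(std::memory_order_relaxed) & NO_GAP_INHERIT; }
    // The reference count itself occupies the remaining 31 bits.
    uint32_t refs() const
    { return word.load(std::memory_order_relaxed) & ~NO_GAP_INHERIT; }
  };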
Suppose we have two transactions, trx 1 and trx 2.
trx 2 does deadlock resolution from lock_wait(): it sets
victim->lock.was_chosen_as_deadlock_victim=true for trx 1, but has not
yet invoked lock_cancel_waiting_and_release().
trx 1 checks the flag in lock_trx_handle_wait() and starts rollback
from row_mysql_handle_errors(). It can change trx->lock.wait_thr and
trx->state because it holds trx_t::mutex; trx 2 has not yet requested
that mutex, as lock_cancel_waiting_and_release() has not yet been called.
After that, trx 1 tries to release its locks in trx_t::rollback_low(),
invoking trx_t::rollback_finish(). lock_release() blocks trying to
acquire lock_sys.rd_lock(SRW_LOCK_CALL) in lock_release_try(), because
lock_sys is held by trx 2: deadlock resolution works under
lock_sys.wr_lock(SRW_LOCK_CALL); see Deadlock::report() for details.
trx 2 executes lock_cancel_waiting_and_release() for the deadlock victim,
i.e. for trx 1. lock_cancel_waiting_and_release() contains some
trx->lock.wait_thr and trx->state assertions, which will fail, because
trx 1 has changed them during rollback execution.
So, according to the above scenario, it is legal to have
trx->lock.wait_thr==0 and trx->state!=TRX_STATE_ACTIVE in
lock_cancel_waiting_and_release() if it was invoked from
Deadlock::report(), and the fix is simply to change the assertion
conditions.
There is also a lock_wait() cleanup around trx->error_state.
If trx->error_state can be changed by a thread other than its owner, it
must be protected by lock_sys.wait_mutex, as lock_wait() uses
trx->lock.cond together with that mutex.
Also, if trx->error_state was changed before the lock_sys.wait_mutex
acquisition, it could be reset by the code that follows, which is wrong.
We also need to check trx->error_state before entering the wait loop;
otherwise trx->error_state could be set before the lock_sys.wait_mutex
acquisition, yet the thread would still be waiting on trx->lock.cond.
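A sketch of the corrected waiting pattern, with std::mutex and
std::condition_variable standing in for lock_sys.wait_mutex and
trx->lock.cond (illustrative, not the actual lock_wait() code):

  #include <condition_variable>
  #include <mutex>

  enum class db_err { SUCCESS, DEADLOCK };

  struct trx_wait_sketch {
    std::mutex wait_mutex;         // stands in for lock_sys.wait_mutex
    std::condition_variable cond;  // stands in for trx->lock.cond
    db_err error_state= db_err::SUCCESS;
    bool wait_lock= true;          // "still waiting for a lock"

    db_err wait() {
      std::unique_lock<std::mutex> g(wait_mutex);
      // Check error_state before the first wait: another thread may
      // have set it before we acquired wait_mutex, and without this
      // check we would sleep on cond with nobody left to signal us.
      while (wait_lock && error_state == db_err::SUCCESS)
        cond.wait(g);
      // Return the shared error_state itself; overwriting it with a
      // stale local copy could lose a DB_DEADLOCK set by the canceller.
      return error_state;
    }
  };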
Returning DB_SUCCESS unconditionally when !trx->lock.wait_lock in
lock_trx_handle_wait() is wrong. Even if
trx->lock.was_chosen_as_deadlock_victim was not set before the first
check in lock_trx_handle_wait(), it can be set after the check, and
trx->lock.wait_lock can be reset by another thread in
lock_reset_lock_and_trx_wait() if the transaction was chosen as a
deadlock victim. In this case lock_trx_handle_wait() would return
DB_SUCCESS even though the transaction was marked as a deadlock victim,
and it would continue execution instead of rolling back.
The fix is to check trx->lock.was_chosen_as_deadlock_victim once more if
trx->lock.wait_lock was reset, because trx->lock.wait_lock can be reset
only after trx->lock.was_chosen_as_deadlock_victim has been set, if the
transaction was chosen as a deadlock victim.
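In outline, the fixed check looks like this (self-contained sketch; the
real function is lock_trx_handle_wait() and the real types differ):

  #include <atomic>

  enum class db_err { SUCCESS, DEADLOCK, LOCK_WAIT };

  struct trx_sketch {
    std::atomic<bool> was_chosen_as_deadlock_victim{false};
    std::atomic<void*> wait_lock{nullptr};
  };

  db_err handle_wait(trx_sketch &trx) {
    if (trx.was_chosen_as_deadlock_victim.load())
      return db_err::DEADLOCK;
    if (!trx.wait_lock.load())
      // wait_lock is reset only after the victim flag has been set, so
      // a second check here cannot miss a concurrent deadlock choice.
      return trx.was_chosen_as_deadlock_victim.load()
        ? db_err::DEADLOCK : db_err::SUCCESS;
    return db_err::LOCK_WAIT;  // still waiting for the lock
  }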
Until now, the attribute EXTENDED of CHECK TABLE was ignored by InnoDB,
and InnoDB only counted the records in each index according
to the current read view. Unless the attribute QUICK was specified, the
function btr_validate_index() would be invoked to validate the B-tree
structure (the sibling and child links between index pages).
The EXTENDED check will not only count all index records according to the
current read view, but also ensure that any delete-marked records in the
clustered index are waiting for the purge of history, and that all
secondary index records point to a version of the clustered index record
that is waiting for the purge of history. In other words, no index may
contain orphan records. Normal MVCC reads and the non-EXTENDED version
of CHECK TABLE would ignore these orphans.
Unpurged records merely result in warnings (at most one per index),
not errors, and no indexes will be flagged as corrupted due to such
garbage. It will remain possible to SELECT data from such indexes or
tables (which will skip such records) or to rebuild the table to
reclaim some space.
We introduce purge_sys.end_view that will be (almost) a copy of
purge_sys.view at the end of a batch of purging committed transaction
history. It is not an exact copy, because if the size of a purge batch
is limited by innodb_purge_batch_size, some records that
purge_sys.view would allow to be purged will be left over for
subsequent batches.
The purge_sys.view is relevant in the purge of committed transaction
history, to determine if records are safe to remove. The new
purge_sys.end_view is relevant in MVCC operations and in
CHECK TABLE ... EXTENDED. It tells which undo log records are
safe to access (have not been discarded at the end of a purge batch).
purge_sys.clone_oldest_view<true>(): In trx_lists_init_at_db_start(),
clone the oldest read view similar to purge_sys_t::clone_end_view()
so that CHECK TABLE ... EXTENDED will not report bogus failures between
InnoDB restart and the completed purge of committed transaction history.
purge_sys_t::is_purgeable(): Replaces purge_sys_t::changes_visible()
in the case that purge_sys.latch will not be held by the caller.
Among other things, this guards access to BLOBs. It is not safe to
dereference any BLOBs of a delete-marked purgeable record, because
they may have already been freed.
purge_sys_t::view_guard::view(): Return a reference to purge_sys.view
that will be protected by purge_sys.latch, held by purge_sys_t::view_guard.
purge_sys_t::end_view_guard::view(): Return a reference to
purge_sys.end_view while it is protected by purge_sys.end_latch.
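Both guards follow the usual RAII pattern; a simplified sketch, with
std::shared_mutex standing in for the purge_sys latches (illustrative
names, not the real class layout):

  #include <shared_mutex>

  struct read_view_sketch { /* ... */ };

  struct purge_sys_sketch {
    mutable std::shared_mutex latch;  // stands in for purge_sys.latch
    read_view_sketch view;

    struct view_guard {
      const purge_sys_sketch &p;
      explicit view_guard(const purge_sys_sketch &p_) : p(p_)
      { p.latch.lock_shared(); }
      ~view_guard() { p.latch.unlock_shared(); }
      // The reference is valid only for the lifetime of the guard.
      const read_view_sketch &view() const { return p.view; }
    };
  };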
Whenever a thread needs to retrieve an older version of a clustered
index record, it will hold a page latch on the clustered index page
and potentially also on a secondary index page that points to the
clustered index page. If these pages contain purgeable records that
would be accessed by a currently running purge batch, the progress of
the purge batch would be blocked by the page latches. Hence, it is
safe to make a copy of purge_sys.end_view while holding an index page
latch, and consult the copy of the view to determine whether a record
should already have been purged.
btr_validate_index(): Remove a redundant check.
row_check_index_match(): Check if a secondary index record and a
version of a clustered index record match each other.
row_check_index(): Replaces row_scan_index_for_mysql().
Count the records in each index directly, duplicating the relevant
logic from row_search_mvcc(). Initialize check_table_extended_view
for CHECK ... EXTENDED while holding an index leaf page latch.
If we encounter an orphan record, the copy of purge_sys.end_view that
we make is safe for visibility checks, and trx_undo_get_undo_rec() will
check for the safety to access each undo log record. Should that check
fail, we should return DB_MISSING_HISTORY to report a corrupted index.
The EXTENDED check tries to match each secondary index record with
every available clustered index record version, by duplicating the logic
of row_vers_build_for_consistent_read() and invoking
trx_undo_prev_version_build() directly.
Before invoking row_check_index_match() on delete-marked clustered index
record versions, we will consult purge_sys.is_purgeable() in order to
avoid accessing freed BLOBs.
We will always check that the DB_TRX_ID or PAGE_MAX_TRX_ID does not
exceed the global maximum. Orphan secondary index records will be
flagged only if everything up to PAGE_MAX_TRX_ID has been purged.
We warn also about clustered index records whose nonzero DB_TRX_ID
should have been reset in purge or rollback.
trx_set_rw_mode(): Move an assertion from ReadView::set_creator_trx_id().
trx_undo_prev_version_build(): Remove two debug-only parameters,
and return an error code instead of a Boolean.
trx_undo_get_undo_rec(): Return a pointer to the undo log record,
or nullptr if one cannot be retrieved. Instead of consulting the
purge_sys.view, consult the purge_sys.end_view to determine which
records can be accessed.
trx_undo_get_rec_if_purgeable(): A variant of trx_undo_get_undo_rec()
that will consult purge_sys.view instead of purge_sys.end_view.
TRX_UNDO_CHECK_PURGEABILITY: A new parameter to
trx_undo_prev_version_build(), passed by row_vers_old_has_index_entry()
so that purge_sys.view instead of purge_sys.end_view will be consulted
to determine whether a secondary index record may be safely purged.
row_upd_changes_disowned_external(): Remove. This should be more
expensive than briefly latching purge_sys in trx_undo_prev_version_build()
(which may make use of transactional memory).
row_sel_reset_old_vers_heap(): New function, split from
row_sel_build_prev_vers_for_mysql().
row_sel_build_prev_vers_for_mysql(): Reorder some parameters
to simplify the call to row_sel_reset_old_vers_heap().
row_search_for_mysql(): Replaced with direct calls to row_search_mvcc().
sel_node_get_nth_plan(): Define inline in row0sel.h
open_step(): Define at the call site, in simplified form.
sel_node_reset_cursor(): Merged with the only caller open_step().
---
ReadViewBase::check_trx_id_sanity(): Remove.
Let us handle "future" DB_TRX_ID in a more meaningful way:
row_sel_clust_sees(): Return DB_SUCCESS if the record is visible,
DB_SUCCESS_LOCKED_REC if it is invisible, and DB_CORRUPTION if
the DB_TRX_ID is in the future.
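Schematically, with the read view simplified to a single visibility
watermark (a sketch of the three-way contract, not the real ReadView
logic):

  #include <cstdint>

  enum class db_err { SUCCESS, SUCCESS_LOCKED_REC, CORRUPTION };

  // Simplified check in the spirit of row_sel_clust_sees().
  db_err clust_sees(uint64_t rec_trx_id,
                    uint64_t view_up_limit_id,  // ids below are visible
                    uint64_t max_trx_id)        // global maximum
  {
    if (rec_trx_id >= max_trx_id)
      return db_err::CORRUPTION;        // DB_TRX_ID "in the future"
    return rec_trx_id < view_up_limit_id
      ? db_err::SUCCESS                 // this version is visible
      : db_err::SUCCESS_LOCKED_REC;     // invisible: need older version
  }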
row_undo_mod_must_purge(), row_undo_mod_clust(): Silently ignore
corrupted DB_TRX_ID. We are in ROLLBACK, and we should have noticed
that corruption when we were about to modify the record in the first
place (leading us to refuse the operation).
row_vers_build_for_consistent_read(): Return DB_CORRUPTION if
DB_TRX_ID is in the future.
Tested by: Matthias Leich
Reviewed by: Vladislav Lesin
The test is unstable because 'UPDATE t SET b = 100' latches a page and
waits for the 'upd_cont' signal at the lock_trx_handle_wait_enter sync
point, then purge requests RW_X_LATCH on the same page, and then 'SELECT
* FROM t WHERE a = 10 FOR UPDATE' requests RW_S_LATCH, waiting for the
RW_X_LATCH requested by purge. 'UPDATE t SET b = 100' cannot release the
page latch, as it waits for the upd_cont signal, which must be emitted
after 'SELECT * FROM t WHERE a = 10 FOR UPDATE' has acquired RW_S_LATCH.
So we have a deadlock, which is resolved by finishing the debug sync
point wait on timeout; 'UPDATE t SET b = 100' then releases its record
locks by rolling back the transaction, and 'SELECT * FROM t WHERE a = 10
FOR UPDATE' finishes successfully instead of ending with a lock wait
timeout.
The fix is to forbid purging during the test by opening a read view in a
separate connection before the first insert into the table.
Besides, the 'lock_wait_end' sync point is not needed, as it is enough to
wait for the end of the SELECT execution to let the UPDATE continue.
lock_place_prdt_page_lock(): Do not place locks on temporary tables.
Temporary tables can only be accessed from one connection, so
it does not make any sense to acquire any transactional locks on them.
The issue is that trx_t::lock.was_chosen_as_deadlock_victim can be reset
before the transaction checks it and sets trx_t::error_state.
The fix is to reset trx_t::lock.was_chosen_as_deadlock_victim only in
trx_t::commit_in_memory(), which is invoked on full rollback. There is
also no need for a separate bit in
trx_t::lock.was_chosen_as_deadlock_victim to flag that a transaction was
chosen as a victim of Galera conflict resolution; the same variable can
be used for both cases, except in debug builds, where we need to
distinguish deadlock victims from Galera abort victims for debug checks.
Also, there is no need to check for deadlock in
lock_table_enqueue_waiting() for Galera, as the corresponding check is
present in lock_wait().
The local variable "error_state" in lock_wait() was replaced with
trx->error_state, because before the replacement
lock_sys_t::cancel<false>(trx, lock) and lock_sys.deadlock_check() could
change trx->error_state, which could then be overwritten with the local
"error_state" variable value.
The lock_wait_suspend_thread_enter DEBUG_SYNC point name was misleading,
because lock_wait_suspend_thread was eliminated in e71e613. The sync
point was renamed to lock_wait_start.
Reviewed by: Marko Mäkelä, Jan Lindström.
The function rec_get_offsets_func() used to hit ut_error
due to an invalid rec_get_status() value of a
ROW_FORMAT!=REDUNDANT record. This fix is twofold:
We will not only avoid a crash on corruption in this case,
but we will also make more effort to validate each record
every time we are iterating over index page records.
rec_get_offsets_func(): Do not crash on a corrupted record.
page_rec_get_nth(): Return nullptr on error.
page_dir_slot_get_rec_validate(): Like page_dir_slot_get_rec(),
but validate the pointer and return nullptr on error.
page_cur_search_with_match(), page_cur_search_with_match_bytes(),
page_dir_split_slot(), page_cur_move_to_next():
Indicate failure in a return value.
page_cur_search(): Replaced with page_cur_search_with_match().
rec_get_next_ptr_const(), rec_get_next_ptr(): Replaced with
page_rec_get_next_low().
TODO: rtr_page_split_initialize_nodes(), rtr_update_mbr_field(),
and possibly other SPATIAL INDEX functions fail to properly handle
errors.
Reviewed by: Thirunarayanan Balathandayuthapani
Tested by: Matthias Leich
Performance tested by: Axel Schwenke
SHOW ENGINE INNODB STATUS and SHOW GLOBAL VARIABLES were blocking on the
locks used to access the history length in MDEV-29141. While the reason
for the blockage was elsewhere, we should make these monitoring commands
less blocking, as there is a trx_sys.history_size_approx() function that
can be used.
SHOW ENGINE INNODB STATUS, SHOW GLOBAL STATUS LIKE
'innodb_history_list_length', and the InnoDB Monitors can use
trx_sys.history_size_approx().
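The monitoring path can then read an approximate value without taking any
latch; a sketch of the idea, assuming a simple relaxed atomic counter
(member names are illustrative, not the real trx_sys layout):

  #include <atomic>
  #include <cstddef>

  struct trx_sys_sketch {
    std::atomic<size_t> rseg_history_len{0};

    // The purge coordinator and committing transactions update the
    // counter with relaxed atomic operations; no latch is involved.
    void history_add(size_t n)
    { rseg_history_len.fetch_add(n, std::memory_order_relaxed); }
    void history_remove(size_t n)
    { rseg_history_len.fetch_sub(n, std::memory_order_relaxed); }

    // May be slightly stale; good enough for monitoring output.
    size_t history_size_approx() const
    { return rseg_history_len.load(std::memory_order_relaxed); }
  };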
Table_cache_instance: Define the structure aligned at
the CPU cache line, and remove a pad[] data member.
Krunal Bauskar reported this to improve performance on ARMv8.
aligned_malloc(): Wrapper for the Microsoft _aligned_malloc()
and the ISO/IEC 9899:2011 <stdlib.h> aligned_alloc().
Note: The parameters are in the Microsoft order (size, alignment),
opposite of aligned_alloc(alignment, size).
Note: The standard defines that size must be an integer multiple
of alignment. It is enforced by AddressSanitizer but not by GNU libc
on Linux.
aligned_free(): Wrapper for the Microsoft _aligned_free() and
the standard free().
HAVE_ALIGNED_ALLOC: A new test. Unfortunately, support for
aligned_alloc() may still be missing on some platforms.
We will fall back to posix_memalign() for those cases.
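A sketch of the resulting fallback chain (HAVE_ALIGNED_ALLOC stands for
the new CMake test mentioned above; error handling is simplified, and
this is an illustration rather than the actual implementation):

  #include <cstddef>
  #include <cstdlib>
  #ifdef _WIN32
  # include <malloc.h>
  #endif

  // Microsoft parameter order: (size, alignment).
  void *aligned_malloc(size_t size, size_t alignment) {
  #ifdef _WIN32
    return _aligned_malloc(size, alignment);
  #elif defined HAVE_ALIGNED_ALLOC
    // ISO C11/C++17 order is (alignment, size); size should be an
    // integer multiple of alignment.
    return aligned_alloc(alignment, size);
  #else
    // alignment must be a power of two multiple of sizeof(void*).
    void *p;
    return posix_memalign(&p, alignment, size) ? nullptr : p;
  #endif
  }

  void aligned_free(void *ptr) {
  #ifdef _WIN32
    _aligned_free(ptr);
  #else
    free(ptr);
  #endif
  }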
HAVE_MEMALIGN: Remove, along with any use of the nonstandard memalign().
PFS_ALIGNEMENT (sic): Removed; we will use CPU_LEVEL1_DCACHE_LINESIZE.
PFS_ALIGNED: Defined using the C++11 keyword alignas.
buf_pool_t::page_hash_table::create(), lock_sys_t::hash_table::create(),
lock_sys_t::hash_table::resize(): Pad the allocation size to an
integer multiple of the alignment.
Reviewed by: Vladislav Vaintroub
The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.
We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.
This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.
We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.
Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.
buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.
btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.
btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().
FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.
trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
trx_sys_create_sys_pages(): Merged with trx_sysf_create().
dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.
dir_pathname(): Replaces os_file_make_new_pathname().
row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.
btr_decryption_failed(): Report decryption failures.
dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.
dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.
dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.
dict_table_is_corrupted(): Remove.
dict_index_t::is_btree(): Determine if the index is a valid B-tree.
BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures, but error codes being returned.
buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().
fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.
btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.
opt_search_plan_for_table(): Simplify the code.
row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.
row_undo_mod_upd_exist_sec(): Simplify the code.
row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().
fut_get_ptr(): Replace with buf_page_get_gen() calls.
buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.
buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.
fseg_page_is_allocated(): Replaces fseg_page_is_free().
fts_drop_common_tables(): Return an error if the transaction
was rolled back.
fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.
fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.
Clean up mtr_t::page_lock()
buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.
buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.
recv_recover_page(): Return whether the operation succeeded.
recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.
srv_do_purge(): In commit edde1f6e0d,
when the de facto 32-bit trx_sys_t::history_size() was replaced with the
32-bit trx_sys.rseg_history_len, some more variables were changed
from ulint (size_t) to uint32_t.
The history list length is the number of committed transactions whose
undo logs are waiting to be purged. Each TRX_RSEG_HISTORY list stores
the number of its entries in a 32-bit field, and each transaction
will occupy at least one undo log page. It is conceivable that the
length of each TRX_RSEG_HISTORY list may approach the maximum
representable number. That number cannot be exceeded, because the
rollback segment header is allocated from the same tablespace as
the undo log header pages it is pointing to, and because the page
numbers of a tablespace are stored in 32 bits. In any case, it is
possible that the total number of unpurged committed transactions
cannot be represented in 32 bits; with 128 rollback segments and undo
tablespaces, it may require up to 39 bits.
lock_validate() accumulates page ids while holding the lock_sys->mutex,
then releases the latch and invokes lock_rec_block_validate() for each
page. Another thread can add or remove locks and change pages between
the release of the latch in lock_validate() and its acquisition in
lock_rec_validate_page().
lock_rec_validate_page() can invoke lock_rec_queue_validate() for a
non-locked supremum, which can cause a ut_ad(page_rec_is_leaf(rec))
failure in lock_rec_queue_validate().
The fix is to invoke lock_rec_queue_validate() only for locked records
in lock_rec_validate_page().
The error message in lock_rec_block_validate() is not necessary, as the
BUF_GET_POSSIBLY_FREED mode is used to get the block from the buffer
pool, and it is not an error if a block was evicted.
A test case would require a new debug sync point. I think it is not
necessary, as the fixed code is debug-only.
lock_sec_rec_read_check_and_lock(): Remove a redundant check
for trx_sys.get_min_trx_id(). This would be checked in
lock_sec_rec_some_has_impl() anyway. Also, test the
cheapest conditions first.
lock_sec_rec_some_has_impl(): Replace trx_sys.get_min_trx_id()
with trx_sys.find_same_or_older() that is much easier to evaluate.
Inspired by mysql/mysql-server@0a0c0db97e
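The difference can be sketched as follows: an existence check may stop at
the first qualifying transaction, whereas computing a minimum must visit
every active transaction (self-contained illustration, not the real
trx_sys API):

  #include <cstdint>
  #include <vector>

  // Is there any active transaction with id <= trx_id?
  bool find_same_or_older(const std::vector<uint64_t> &active_ids,
                          uint64_t trx_id) {
    for (uint64_t id : active_ids)
      if (id <= trx_id)
        return true;   // early exit: no need to scan further
    return false;
  }

  // A minimum must always visit every element before it can be compared.
  uint64_t get_min_trx_id(const std::vector<uint64_t> &active_ids) {
    uint64_t m= ~0ULL;
    for (uint64_t id : active_ids)
      if (id < m) m= id;
    return m;
  }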
btr_insert_into_right_sibling(): Inherit any gap lock from the
left sibling to the right sibling before inserting the record
to the right sibling and updating the node pointer(s).
lock_update_node_pointer(): Update locks in case a node pointer
will move.
Based on mysql/mysql-server@c7d93c274f
This follows up the previous fix in
commit c3c53926c4 (MDEV-26554).
ha_innobase::delete_table(): Work around the insufficient
metadata locking (MDL) during DML operations by acquiring exclusive
InnoDB table locks on all child tables. Previously, this was only
done on TRUNCATE and ALTER.
ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke
lock_update_delete() during change buffer operations.
The revised trx_t::commit(std::vector<pfs_os_file_t>&) will
hold exclusive lock_sys.latch while invoking fil_delete_tablespace(),
which in turn may invoke ibuf_delete_rec().
dict_index_t::has_locking(): A new predicate, replacing the dummy
!dict_table_is_locking_disabled(index->table). Used for skipping lock
operations during ibuf_delete_rec().
trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks
and remove the table from the cache while holding exclusive
lock_sys.latch.
trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds.
trx_t::commit(): Reset dict_operation before invoking commit_in_memory()
via commit_persist().
lock_release_on_drop(): Release locks while lock_sys.latch is
exclusively locked.
lock_table(): Add a parameter for a pointer to the table.
We must not dereference the table before a lock_sys.latch has
been acquired. If the pointer to the table does not match the table
at that point, the table is invalid and DB_DEADLOCK will be returned.
row_ins_foreign_check_on_constraint(): Improve the checks.
Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed
before commit c5fd9aa562 (MDEV-25919).
row_upd_check_references_constraints(),
wsrep_row_upd_check_foreign_constraints(): Simplify checks.
The fix in commit 6e390a62ba (MDEV-26772)
was a step in the right direction, but it was implemented incorrectly.
When an InnoDB persistent statistics table cannot be locked immediately,
we must not let row_mysql_handle_errors() roll back the transaction.
for an immediate return of DB_LOCK_WAIT in case of a conflict.
ha_innobase::delete_table(), ha_innobase::rename_table():
Pass no_wait=true to lock_table_for_trx() when needed,
instead of temporarily setting THDVAR(thd, lock_wait_timeout) to 0.
MDEV-27025 allows records to be inserted before the record on which a
DELETE is locked; as a result the DELETE misses those records, which
causes a serious ACID violation.
Revert MDEV-27025 and MDEV-27550. A test which shows the scenario of the
ACID violation is added.
The code was backported from the 10.6 commit bd03c0e516.
See that commit message for details.
Apart from the above commit, trx_lock_t::wait_trx was also backported
from MDEV-24738. trx_lock_t::wait_trx is protected by lock_sys.wait_mutex
in 10.6, but that mutex was implemented only in MDEV-24789. As there is
no need to backport MDEV-24789 for MDEV-27025,
trx_lock_t::wait_trx is protected by the same mutexes as
trx_lock_t::wait_lock.
This fix should not break innodb-lock-schedule-algorithm=VATS. That
algorithm uses an Eldest-Transaction-First (ETF) heuristic, which prefers
older transactions over newer ones. In this fix we insert the granted
lock just before the last granted lock of the same transaction, which
does not change the transactions' execution order.
The changes in lock_rec_create_low() should not break Galera Cluster;
there is a big "if" branch for WSREP there. This branch is necessary to
provide the correct transaction execution order, and should not be
changed for the current bug fix.
When a lock is checked for conflicts, ignore other locks on the record
if they wait for the requesting transaction.
lock_rec_has_to_wait_in_queue() iterates not over all the locks for the
page, but only over the locks located before the waiting lock in the
queue. So there is an invariant: any lock in the queue can wait only for
a lock located before it in the queue.
In the case when a conflicting lock waits for the transaction of the
requesting lock, we need to place the requesting lock before that waiting
lock in the queue to preserve the invariant. That is why we look for the
first lock waiting for the requesting transaction, and place the new lock
just after the last granted lock of the requesting transaction that
precedes it.
Example: given the queue
trx1 waiting lock, trx1 granted lock, ..., trx2 lock (waiting for trx1)
the new lock is placed after "trx1 granted lock" and before the trx2 lock.
There are also implicit locks which are lazily converted to explicit
ones; the newly created explicit lock must also be placed at the correct
position in the queue. All explicit locks converted from implicit ones
are placed just after the last non-waiting lock of the same transaction
and before the first lock waiting for that transaction.
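A self-contained sketch of choosing that insertion point, with a queue
element reduced to a transaction id, a waiting flag, and the transaction
it waits for (illustrative, not the real lock queue structures):

  #include <iterator>
  #include <list>

  struct qlock {
    int trx_id;
    bool waiting;
    int waits_for;  // -1 if not waiting for anybody
  };

  std::list<qlock>::iterator
  insertion_point(std::list<qlock> &queue, int requesting_trx) {
    auto pos= queue.begin();
    for (auto it= queue.begin(); it != queue.end(); ++it) {
      if (it->waiting && it->waits_for == requesting_trx)
        break;  // must be placed before the first lock waiting for us
      if (!it->waiting && it->trx_id == requesting_trx)
        pos= std::next(it);  // just after our last granted lock so far
    }
    return pos;  // queue.insert(pos, new_granted_lock)
  }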
Code review and cleanup were done by Marko Mäkelä.
In commit 18535a4028 (MDEV-24811)
the implementation of innodb_evict_tables_on_commit_debug
depended on dict_sys.mutex and SAFE_MUTEX.
That is no longer the case.
SAFE_MUTEX is not available on Microsoft Windows.
In commit c5fd9aa562 (MDEV-25919)
an incorrect change to lock_release() was applied.
The setting innodb_evict_tables_on_commit_debug=on should only be
applied to normal transactions, not DDL transactions in the likes of
CREATE TABLE, nor transactions that are holding dict_sys.latch,
such as dict_stats_save().
buf_page_t::frame: Moved from buf_block_t::frame.
All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED
pages will have frame=nullptr, while all 'fat' buf_block_t
will have a non-null frame pointing to aligned innodb_page_size bytes.
This eliminates the need for separate states for
BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE.
buf_page_t::lock: Moved from buf_block_t::lock. That is, all block
descriptors will have a page latch. The IO_PIN state that was used
for discarding or creating the uncompressed page frame of a
ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix
and page X-latch.
page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status
of buf_page_t with a single std::atomic<uint32_t>. All modifications
will use store(), fetch_add(), fetch_sub(). This space was previously
wasted to alignment on 64-bit systems. We will use the following encoding
that combines a state (partly read-fix or write-fix) and a buffer-fix
count:
buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED)
buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY)
buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH)
buf_page_t::FREED=3 + fix: pages marked as freed in the file
buf_page_t::UNFIXED=1U<<29 + fix: normal pages
buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge
buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite)
buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched)
buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched)
buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf
buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite)
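A sketch of this encoding, with a 3-bit state in the most significant
bits of the word and a 29-bit buffer-fix count below it (accessor names
are illustrative and only loosely follow buf_page_t):

  #include <atomic>
  #include <cstdint>

  // States below UNFIXED (NOT_USED=0, MEMORY=1, REMOVE_HASH=2,
  // FREED=3+fix) do not use the 3-bit mask.
  struct page_state_sketch {
    std::atomic<uint32_t> fix{0};

    static constexpr uint32_t UNFIXED=    1U << 29;
    static constexpr uint32_t READ_FIX=   4U << 29;
    static constexpr uint32_t WRITE_FIX=  5U << 29;
    static constexpr uint32_t STATE_MASK= 7U << 29;

    // Buffer-fixing is a single atomic increment; the state bits in
    // the same word are left intact.
    uint32_t fix_page() { return fix.fetch_add(1) + 1; }
    uint32_t unfix()    { return fix.fetch_sub(1) - 1; }

    static bool is_read_fixed(uint32_t v)
    { return (v & STATE_MASK) == READ_FIX; }
    // WRITE_FIX, WRITE_FIX_IBUF=6U<<29, WRITE_FIX_REINIT=7U<<29
    static bool is_write_fixed(uint32_t v)
    { return (v & STATE_MASK) >= WRITE_FIX; }
  };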
buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to
UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch.
buf_page_t::read_complete(): Renamed from buf_page_read_complete().
Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch.
buf_page_t::can_relocate(): If the page latch is being held or waited for,
or the block is buffer-fixed or io-fixed, return false. (The condition
on the page latch is new.)
Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we
will acquire the page latch before fix(), and unfix() before unlocking.
buf_page_t::flush(): Replaces buf_flush_page(). Optimize the
handling of FREED pages.
buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held
by the caller.
buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates.
buf_page_get_low(): Ignore guesses that are read-fixed because they
may not yet be registered in buf_pool.page_hash and buf_pool.LRU.
buf_page_optimistic_get(): Acquire latch before buffer-fixing.
buf_page_make_young(): Leave read-fixed blocks alone, because they
might not be registered in buf_pool.LRU yet.
recv_sys_t::recover_deferred(), recv_sys_t::recover_low():
Possibly fix MDEV-26326, by holding a page X-latch instead of
only buffer-fixing the page.
Mutex order violation when wsrep bf thread kills a conflicting trx,
the stack is
wsrep_thd_LOCK()
wsrep_kill_victim()
lock_rec_other_has_conflicting()
lock_clust_rec_read_check_and_lock()
row_search_mvcc()
ha_innobase::index_read()
ha_innobase::rnd_pos()
handler::ha_rnd_pos()
handler::rnd_pos_by_record()
handler::ha_rnd_pos_by_record()
Rows_log_event::find_row()
Update_rows_log_event::do_exec_row()
Rows_log_event::do_apply_event()
Log_event::apply_event()
wsrep_apply_events()
and mutexes are taken in the order
lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data
When a normal KILL statement is executed, the stack is
innobase_kill_query()
kill_handlerton()
plugin_foreach_with_mask()
ha_kill_query()
THD::awake()
kill_one_thread()
and mutexes are
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
This patch is the plan D variant for fixing the potential mutex locking
order violation exercised by BF aborting and KILL command execution.
In this approach, KILL command is replicated as TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting to happen in parallel with KILL command execution
either. Potential mutex deadlocks between the different mutex
access paths with KILL command execution and BF aborting cannot
therefore happen.
TOI replication is used, in this approach, purely as a means
to provide isolated KILL command execution in the first node.
The KILL command should not (and must not) be applied in secondary
nodes. In this patch, we ensure this by skipping KILL
execution in secondary nodes, in the applying phase, where we
bail out if the applier thread is trying to execute a KILL command.
This is effective, but the applying of the KILL command
could be skipped much earlier as well.
This also fixes unprotected calls to wsrep_thd_abort() that use
wsrep_abort_transaction(); these calls are now made while holding
THD::LOCK_thd_data.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
Similar to commit f7684f0ca5 (MDEV-26855)
we will try to enable the adaptive spinloop for lock_sys.wait_mutex
on ARMv8.
Enabling any form of spinloop for lock_sys.wait_mutex did not show a
significant improvement in our tests on AMD64.
Spinning can be argued to be a hack to reduce the impact on mutex
contention. It would be better to adjust the code to reduce
contention in the first place.
rollback_inplace_alter_table(): Tolerate a case where the transaction
is not in an active state. If ha_innobase::commit_inplace_alter_table()
failed with a deadlock, the transaction would already have been
rolled back. This omission of error handling was introduced in
commit 1bd681c8b3 (MDEV-25506 part 3).
After commit c3c53926c4 (MDEV-26554)
it became easier to trigger DB_DEADLOCK during exclusive table lock
acquisition in ha_innobase::commit_inplace_alter_table().
lock_table_low(): Add DBUG injection "innodb_table_deadlock".
This implements memory transaction support for:
* Intel Restricted Transactional Memory (RTM), also known as TSX-NI
(Transactional Synchronization Extensions New Instructions)
* POWER v2.09 Hardware Trace Monitor (HTM) on GNU/Linux
transactional_lock_guard, transactional_shared_lock_guard:
RAII lock guards that try to elide the lock acquisition
when transactional memory is available.
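A self-contained sketch of the lock elision idea for Intel RTM (assumes
an RTM-capable CPU and compilation with -mrtm; this illustrates the
technique, not the actual implementation, which also covers POWER HTM):

  #include <atomic>
  #include <immintrin.h>

  struct spin_mutex {
    std::atomic<bool> locked{false};
    void lock()   { while (locked.exchange(true)) {} }
    void unlock() { locked.store(false); }
    bool is_locked() const { return locked.load(); }
  };

  struct elided_guard {
    spin_mutex &m;
    bool elided= false;
    explicit elided_guard(spin_mutex &m_) : m(m_) {
      if (_xbegin() == _XBEGIN_STARTED) {
        // Read the lock word inside the transaction: if another thread
        // takes the lock, the read-set conflict aborts us, and we retry
        // on the fallback path below.
        if (!m.is_locked()) { elided= true; return; }
        _xabort(0);  // lock already held: do not run transactionally
      }
      m.lock();  // fallback path: acquire the lock for real
    }
    ~elided_guard() { if (elided) _xend(); else m.unlock(); }
  };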
buf_pool.page_hash: Try to elide latches whenever feasible. Because of
the InnoDB change buffer and ROW_FORMAT=COMPRESSED tables, this is not
always possible.
In buf_page_get_low(), memory transactions only work reasonably
well for validating a guessed block address.
TMLockGuard, TMLockTrxGuard, TMLockMutexGuard: RAII lock guards
that try to elide lock_sys.latch and related latches.
In a stress test campaign of a 10.6-based branch by Matthias Leich,
a deadlock between two InnoDB threads occurred, involving
lock_sys.wait_mutex and a dict_table_t::lock_mutex.
The cause of the hang is a latching order violation in
lock_sys_t::cancel(). That function and the latching order
violation were originally introduced in
commit 8d16da1487 (MDEV-24789).
lock_sys_t::cancel(): Invoke table->lock_mutex_trylock() in order
to avoid a deadlock. If that fails, release lock_sys.wait_mutex,
and acquire both latches. In that way, we will be obeying the
latching order and no hangs will occur.
This hang should mostly affect DDL operations. DML operations will
acquire only IX or IS table locks, which are compatible with each other.
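The fix in lock_sys_t::cancel() follows the classic trylock-and-back-off
pattern; a sketch with std::mutex standing in for lock_sys.wait_mutex and
dict_table_t::lock_mutex (illustrative, not the actual code):

  #include <mutex>

  // Correct latching order: table_mutex before wait_mutex.
  void cancel_under_both(std::mutex &wait_mutex, std::mutex &table_mutex) {
    std::unique_lock<std::mutex> w(wait_mutex);
    if (!table_mutex.try_lock()) {
      // We hold wait_mutex but need table_mutex, which would violate
      // the latching order: back off and retake both in the right order.
      w.unlock();
      table_mutex.lock();
      w.lock();
    }
    // ... cancel the lock wait while holding both latches ...
    table_mutex.unlock();
  }  // wait_mutex is released by the unique_lock destructor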