When using the default innodb_log_buffer_size=2m, mariadb-backup --backup
would spend a lot of time re-reading and re-parsing the log. For reads,
it would be beneficial to memory-map the entire ib_logfile0 to the
address space (typically 48 bits or 256 TiB) and read it from there,
both during --backup and --prepare.
We will introduce the Boolean read-only parameter innodb_log_file_mmap
that will be OFF by default on most platforms, to avoid aggressive
read-ahead of the entire ib_logfile0 when only a tiny portion would be
accessed. On Linux and FreeBSD the default is innodb_log_file_mmap=ON,
because those platforms define a specific mmap(2) option for enabling
such read-ahead and therefore it can be assumed that the default would
be on-demand paging. This parameter will only affect the initial
InnoDB startup and recovery. Any writes to the log will use regular I/O,
except when the ib_logfile0 is stored in a specially configured file system
that is backed by persistent memory (Linux "mount -o dax").
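As a minimal illustration (a sketch assuming POSIX mmap(2); the helper
name and signature are hypothetical, not the actual ::log_mmap()
interface):

  #include <sys/mman.h>
  #include <cstddef>

  // Hypothetical helper: map ib_logfile0 read-only. Without an explicit
  // read-ahead flag such as MAP_POPULATE (Linux) or MAP_PREFAULT_READ
  // (FreeBSD), pages are faulted in on demand, which is why
  // innodb_log_file_mmap=ON is a safe default on those platforms.
  static void *log_mmap_readonly(int fd, size_t size)
  {
    void *ptr= mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    return ptr == MAP_FAILED ? nullptr : ptr;
  }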
We also experimented with allowing writes of the ib_logfile0 via a
memory mapping and decided against it. A fundamental problem would be
unnecessary read-before-write in case of a major page fault, that is,
when a new, not yet cached, virtual memory page in the circular
ib_logfile0 is being written to. There appears to be no way to tell
the operating system that we do not care about the previous contents of
the page, or that the page fault handler should just zero it out.
Many references to HAVE_PMEM have been replaced with references to
HAVE_INNODB_MMAP.
The predicate log_sys.is_pmem() has been replaced with
log_sys.is_mmap() && !log_sys.is_opened().
Memory-mapped regular files differ from MAP_SYNC (PMEM) mappings in
that an open file handle to ib_logfile0 will be retained. In both
code paths, log_sys.is_mmap() will hold. Holding a file handle open will
allow log_t::clear_mmap() to disable the interface with fewer operations.
It should be noted that ever since
commit 685d958e38 (MDEV-14425)
most 64-bit Linux platforms in our CI
(s390x a.k.a. IBM System Z being a notable exception) read and write
/dev/shm/*/ib_logfile0 via a memory mapping, pretending that it is
persistent memory (mount -o dax). So, the memory mapping based log
parsing that this change is enabling by default on Linux and FreeBSD
has already been extensively tested on Linux.
::log_mmap(): If a log cannot be opened as PMEM and the desired access
is read-only, try to open a read-only memory mapping.
xtrabackup_copy_mmap_snippet(), xtrabackup_copy_mmap_logfile():
Copy the InnoDB log in mariadb-backup --backup from a memory
mapped file.
log_t: Define buf_size, max_buf_free as 32-bit and next_checkpoint_no
as byte (we only need a bit) and rearrange some data members,
so that on AMD64 we can fit log_sys.latch and log_sys.log in
the same 64-byte cache line.
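Schematically, the packing idea looks like this (the member types are
stand-ins, not the actual log_t definition):

  #include <atomic>
  #include <cstdint>

  // Illustration only: narrowing buf_size and max_buf_free to 32 bits
  // and next_checkpoint_no to a byte leaves room for the latch and the
  // log file handle within a single 64-byte cache line on AMD64.
  struct log_layout_sketch
  {
    alignas(64) std::atomic<uint32_t> latch; // stand-in for log_sys.latch
    uint32_t buf_size;                       // was a 64-bit size_t
    uint32_t max_buf_free;                   // was a 64-bit size_t
    uint8_t next_checkpoint_no;              // only one bit is needed
    int log;                                 // stand-in for the file handle
  };
  static_assert(sizeof(log_layout_sketch) == 64, "fits one cache line");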
mtr_t::commit_log(), mtr_t::commit_logger: A part of mtr_t::commit()
split into a separate function, so that when running on a memory-mapped
log file we will not unnecessarily invoke log_sys.get_write_target()
or log_sys.is_pmem().
Reviewed by: Vladislav Vaintroub
Tested by: Matthias Leich
innodb_log_spin_wait_delay_update(): Always acquire log_sys.latch
to protect the change of mtr_t::spin_wait_delay.
log_t::lock_lsn(): In the general case, actually use
mtr_t::spin_wait_delay as it was intended; the x86-specific
log_t::lock_lsn_bts() already used mtr_t::spin_wait_delay.
The log_sys.lsn_lock is a very contended resource with a small
critical section in log_sys.append_prepare(). On many processor
microarchitectures, replacing the system call based log_sys.lsn_lock
with a pure spin lock would fare worse during high concurrency workloads,
wasting a significant amount of CPU cycles in the spin loop.
On other microarchitectures, we would see a significant amount of time
being spent in native_queued_spin_lock_slowpath() in the Linux kernel,
plus context switching between user and kernel address space. This was
pointed out by Steve Shaw from Intel Corporation.
Depending on the workload and the hardware implementation, it may be
useful to use a pure spin lock in log_sys.append_prepare().
We will introduce a parameter. The statement
SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=50;
would enable a spin lock that will execute that many MY_RELAX_CPU()
operations (such as the x86 PAUSE instruction) between successive
attempts of acquiring the spin lock. The use of a system call based
log_sys.lsn_lock (which is the default setting) can be enabled by
SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=0;
This patch will also introduce #ifdef LOG_LATCH_DEBUG
(part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON) for more accurate
tracking of log_sys.latch ownership and reorganize the fields of
log_sys to improve the locality of reference and to reduce the
chances of false sharing.
When a spin lock is being used, it will be maintained in the
most significant bit of log_sys.buf_free. This is useful, because that is
one of the fields that is covered by the lock. For IA-32 or AMD64, we
implement the spin lock specially via log_t::lsn_lock_bts(), employing the
i386 LOCK BTS instruction. A straightforward std::atomic::fetch_or() would
translate into an inefficient loop around LOCK CMPXCHG.
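A portable sketch of the lock-bit logic; the actual x86 implementation
issues a single LOCK BTS, while the std::atomic::fetch_or() used here
may compile into exactly such a CMPXCHG loop:

  #include <atomic>
  #include <cstdint>

  static std::atomic<uint32_t> buf_free_sketch;
  static constexpr uint32_t LSN_LOCK_BIT= 1U << 31;

  static void lsn_lock_sketch() noexcept
  {
    // Spin until the most significant bit was previously clear,
    // executing innodb_log_spin_wait_delay MY_RELAX_CPU() operations
    // (omitted here) between attempts.
    while (buf_free_sketch.fetch_or(LSN_LOCK_BIT,
                                    std::memory_order_acquire) &
           LSN_LOCK_BIT)
      ;
  }

  static void lsn_unlock_sketch() noexcept
  {
    buf_free_sketch.fetch_and(~LSN_LOCK_BIT, std::memory_order_release);
  }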
mtr_t::spin_wait_delay: The value of innodb_log_spin_wait_delay.
mtr_t::finisher: Pointer to the currently used mtr_t::finish_write()
implementation. This avoids introducing conditional branches.
We no longer invoke log_sys.is_pmem() at the mini-transaction level,
but we would do that in log_write_up_to().
mtr_t::finisher_update(): Update finisher when spin_wait_delay is
changed from or to 0 (the spin lock is changed to log_sys.lsn_lock or
vice versa).
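Schematically (with stand-in signatures, not the actual mtr_t
interface):

  // The pointer is reassigned in finisher_update(), so the commit path
  // calls *finisher without testing spin_wait_delay or log_sys.is_pmem().
  struct mtr_sketch
  {
    static inline unsigned spin_wait_delay= 0;
    static void finish_write_spin() {}  // would use the buf_free spin lock
    static void finish_write_mutex() {} // would use log_sys.lsn_lock
    static inline void (*finisher)()= finish_write_mutex;

    static void finisher_update() noexcept
    {
      finisher= spin_wait_delay ? finish_write_spin : finish_write_mutex;
    }
  };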
When innodb_undo_log_truncate=ON causes an InnoDB undo tablespace
to be truncated, we must guarantee that the undo tablespace will
be rebuilt atomically: After mtr_t::commit_shrink() has durably
written the mini-transaction that rebuilds the undo tablespace,
we must not write any old pages to the tablespace.
To guarantee this, in trx_purge_truncate_history() we used to
traverse the entire buf_pool.flush_list in order to acquire
exclusive latches on all pages for the undo tablespace that
reside in the buffer pool, so that those pages cannot be written
and will be evicted during mtr_t::commit_shrink(). But this
traversal may interfere with the page writing activity of
buf_flush_page_cleaner(). It would be better to lazily discard
the old pages of the truncated undo tablespace.
fil_space_t::is_being_truncated, fil_space_t::clear_stopping(): Remove.
fil_space_t::create_lsn: A new field, identifying the LSN of the
latest rebuild of a tablespace.
buf_page_t::flush(), buf_flush_try_neighbors(): Evict pages whose
FIL_PAGE_LSN is below fil_space_t::create_lsn.
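The resulting check is essentially the following (stand-in types, not
the actual buf_page_t::flush() code):

  #include <cstdint>

  // A page whose FIL_PAGE_LSN precedes the latest rebuild LSN belongs
  // to the pre-truncation incarnation of the tablespace: it must be
  // evicted instead of written back.
  static bool evict_instead_of_write(uint64_t fil_page_lsn,
                                     uint64_t create_lsn) noexcept
  {
    return fil_page_lsn < create_lsn;
  }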
mtr_t::commit_shrink(): Update fil_space_t::create_lsn and
fil_space_t::size right before the log is durably written and the
tablespace file is being truncated.
fsp_page_create(), trx_purge_truncate_history(): Simplify the logic.
Reviewed by: Thirunarayanan Balathandayuthapani, Vladislav Lesin
Performance tested by: Axel Schwenke
Correctness tested by: Matthias Leich
The log_sys.lsn_lock that was introduced in
commit a635c40648
had better be located in the same cache line as log_sys.latch,
so that log_t::append_prepare() only needs to modify the first two
cache lines where log_sys is stored.
log_t::lsn_lock: On Linux, change the type from pthread_mutex_t to
something that may be as small as 32 bits, to pack more data members
in the same cache line. On Microsoft Windows, CRITICAL_SECTION works
better.
log_t::check_flush_or_checkpoint_: Renamed to need_checkpoint.
There is no need to pause all writer threads in log_free_check() when
we only need to write log_sys.buf to ib_logfile0. That will be done in
mtr_t::commit().
log_t::append_prepare_wait(): Make the member function non-static
to simplify the call interface, and add a parameter for the LSN.
log_t::append_prepare(): Invoke append_prepare_wait() at most once.
Only set_check_for_checkpoint() if a log checkpoint needs to
be written. If the log buffer needs to be written, we will take care
of it ourselves later in our caller. This will reduce interference
with log_free_check() in other threads.
mtr_t::commit(): Call log_write_up_to() if needed.
log_t::get_write_target(): Return a log_write_up_to() target
to mtr_t::commit().
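A hedged sketch of this division of labour; apart from
get_write_target() and log_write_up_to(), the names are stand-ins:

  #include <cstdint>
  typedef uint64_t lsn_t;

  struct log_sketch
  {
    bool need_checkpoint= false; // renamed from check_flush_or_checkpoint_
    lsn_t write_target= 0;       // 0: no need to write log_sys.buf yet

    lsn_t get_write_target() const noexcept { return write_target; }
  };

  static void mtr_commit_tail(log_sketch &log)
  {
    // The committing thread writes out log_sys.buf itself, instead of
    // stalling all writers in log_free_check().
    if (lsn_t target= log.get_write_target())
      (void) target; // here: log_write_up_to(target, false)
  }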
buf_flush_ahead(): If we are in furious flushing, call
log_sys.set_check_for_checkpoint() so that all writers will wait
in log_free_check() until the checkpoint is done. Otherwise,
the test innodb.insert_into_empty could occasionally report
an error "Crash recovery is broken".
log_check_margins(): Replaced by log_free_check().
log_flush_margin(): Removed. This is part of mtr_t::commit()
and other operations that write log.
log_t::create(), log_t::attach(): Guarantee that buf_free < max_buf_free
will always hold on PMEM, to satisfy an assumption of
log_t::get_write_target().
log_write_up_to(): Assert lsn!=0. Calls with lsn==0 would not be
incorrect, but it is cheaper to test that single unlikely condition in
mtr_t::commit() than to test several conditions in log_write_up_to().
innodb_drop_database(), unlock_and_close_files(): Check the LSN before
calling log_write_up_to().
ha_innobase::commit_inplace_alter_table(): Remove redundant calls to
log_write_up_to() after calling unlock_and_close_files().
Reviewed by: Vladislav Vaintroub
Stress tested by: Matthias Leich
Performance tested by: Steve Shaw
InnoDB was violating the write-ahead-logging protocol when a file
was being deleted, like this:
1. fil_delete_tablespace() sets the fil_space_t::STOPPING flag.
2. The buf_flush_page_cleaner() thread discards some changed pages for
this tablespace and advances the log checkpoint a little.
3. The server process is killed before fil_delete_tablespace() wrote
a FILE_DELETE record.
4. Recovery will try to apply log to pages of the tablespace, because
there was no FILE_DELETE record. This will fail, because some pages
that had been modified since the latest checkpoint had not been written
by the page cleaner.
Page writes must not be stopped before a FILE_DELETE record has been
durably written.
fil_space_t::drop(): Replaces fil_space_t::check_pending_operations().
Add the parameter detached_handle, and return a tablespace pointer
if this thread was the first one to stop I/O on the tablespace.
mtr_t::commit_file(): Remove the parameter detached_handle, and
move some handling to fil_space_t::drop().
fil_space_t: STOPPING_READS, STOPPING_WRITES: Separate flags for STOPPING.
We want to stop reads (and encryption) before stopping page writes.
fil_space_t::is_stopping_writes(), fil_space_t::get_for_write():
Special accessors for the write path.
fil_space_t::flush_low(): Ignore the STOPPING_READS flag and only
stop if STOPPING_WRITES is set, to avoid an infinite loop in
fil_flush_file_spaces() that could occasionally be reproduced by
running the test encryption.create_or_replace.
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
trx_purge_truncate_rseg_history(): If all other conditions for
invoking trx_purge_remove_log_hdr() hold, but the state is
TRX_UNDO_CACHED instead of TRX_UNDO_TO_PURGE, detach and free it.
Tested by: Matthias Leich
fil_delete_tablespace() stores the file handle in a local variable and
calls mtr_t::commit_file()=>fil_system_t::detach(..., detach_handle=true),
which sets space->chain.start->handle = OS_FILE_CLOSED.
fil_system_t::detach() is invoked under fil_system.mutex.
But before the mutex is acquired, some parallel thread can change
space->chain.start->handle, and fil_delete_tablespace() would then
return the stale value stored in its local variable, that is, a wrong
handle.
The file handle can be closed, for example, from buf_flush_space() when
the innodb_open_files limit is exceeded and fil_space_t::get() causes a
fil_space_t::try_to_close() call.
fil_space_t::try_to_close() is executed under fil_system.mutex, and
mtr_t::commit_file() acquires that mutex for the fil_system_t::detach()
call. fil_system_t::detach() returns the detached file handle if its
argument detach_handle is true. The fix is to let mtr_t::commit_file()
pass that detached file handle to fil_delete_tablespace().
This is a follow-up to
commit de4030e4d4 (MDEV-30400),
which fixed some hangs related to B-tree split or merge.
btr_root_block_get(): Use and update the root page guess. This is just
a minor performance optimization, not affecting correctness.
btr_validate_level(): Remove the parameter "lockout", and always
acquire an exclusive dict_index_t::lock in CHECK TABLE without QUICK.
This is needed in order to avoid latching order violation in
btr_page_get_father_node_ptr_for_validate().
btr_cur_need_opposite_intention(): Return true in case
btr_cur_compress_recommendation() would hold later during the
mini-transaction, or if a page underflow or overflow is possible.
If we return true, our caller will escalate to acquiring an exclusive
dict_index_t::lock, to prevent a latching order violation and deadlock
during btr_compress() or btr_page_split_and_insert().
btr_cur_t::search_leaf(), btr_cur_t::open_leaf():
Also invoke btr_cur_need_opposite_intention() on the leaf page.
btr_cur_t::open_leaf(): When escalating to exclusive index locking,
acquire exclusive latches on all pages as well.
innobase_instant_try(): Return an error code if the root page cannot
be retrieved.
In addition to the normal stress testing with Random Query Generator
(RQG), this has been tested with
./mtr --mysqld=--loose-innodb-limit-optimistic-insert-debug=2
but with the injection in btr_cur_optimistic_insert() for non-leaf pages
adjusted so that it would use the value 3. (Otherwise, infinite page
splits could occur in some mtr tests.)
Tested by: Matthias Leich
This is a partial revert of
commit 8b6a308e46 (MDEV-29883)
and a follow-up to the
merge commit 394fc71f4f (MDEV-24569).
The latching order related to any operation that accesses the allocation
metadata of an InnoDB index tree is as follows:
1. Acquire dict_index_t::lock in non-shared mode.
2. Acquire the index root page latch in non-shared mode.
3. Possibly acquire further index page latches. Unless an exclusive
dict_index_t::lock is held, this must follow the root-to-leaf,
left-to-right order.
4. Acquire a *non-shared* fil_space_t::latch.
5. Acquire latches on the allocation metadata pages.
6. Possibly allocate and write some pages, or free some pages.
btr_get_size_and_reserved(), dict_stats_update_transient_for_index(),
dict_stats_analyze_index(): Acquire an exclusive fil_space_t::latch
in order to avoid a deadlock in fseg_n_reserved_pages() in case of
concurrent access to multiple indexes sharing the same "inode page".
fseg_page_is_allocated(): Acquire an exclusive fil_space_t::latch
in order to avoid deadlocks. All callers are holding latches
on a buffer pool page, or an index, or both.
Before commit edbde4a11f (MDEV-24167)
a third mode was available that would not conflict with the shared
fil_space_t::latch acquired by ha_innobase::info_low(),
i_s_sys_tablespaces_fill_table(),
or i_s_tablespaces_encryption_fill_table().
Because those calls should be rather rare, it makes sense to use
the simple rw_lock with only shared and exclusive modes.
fil_crypt_get_page_throttle(): Avoid invoking fseg_page_is_allocated()
on an allocation bitmap page (which can never be freed), to avoid
acquiring a shared latch on top of an exclusive one.
mtr_t::s_lock_space(), MTR_MEMO_SPACE_S_LOCK: Remove.
This also fixes part of MDEV-29835 Partial server freeze
which is caused by violations of the latching order that was
defined in https://dev.mysql.com/worklog/task/?id=6326
(WL#6326: InnoDB: fix index->lock contention). Unless the
current thread is holding an exclusive dict_index_t::lock,
it must acquire page latches in a strict parent-to-child,
left-to-right order. Not all cases of MDEV-29835 are fixed yet.
Failure to follow the correct latching order will cause deadlocks
of threads due to lock order inversion.
As part of these changes, the BTR_MODIFY_TREE mode is modified
so that an Update latch (U a.k.a. SX) will be acquired on the
root page, and eXclusive latches (X) will be acquired on all pages
leading to the leaf page, as well as any left and right siblings
of the pages along the path. The DEBUG_SYNC test innodb.innodb_wl6326
will be removed, because at the time the DEBUG_SYNC point is hit,
the thread is actually holding several page latches that will be
blocking a concurrent SELECT statement.
We also remove double bookkeeping that was caused by excessive
information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo
store information of latched pages, and ensure that
mtr_memo_slot_t::object is never a null pointer.
The tree_blocks[] and tree_savepoints[] were redundant.
buf_page_get_low(): If innodb_change_buffering_debug=1, to avoid
a hang, do not try to evict blocks if we are holding a latch on
a modified page. The test innodb.innodb-change-buffer-recovery
will be removed, because change buffering may no longer be forced
by debug injection when the change buffer comprises multiple pages.
Remove a debug assertion that could fail when
innodb_change_buffering_debug=1 fails to evict a page.
For other cases, the assertion is redundant, because we already
checked that right after the got_block: label. The test
innodb.innodb-change-buffer-recovery will be removed because,
due to this change, we will be unable to evict the desired page.
mtr_t::lock_register(): Register a change of a page latch
on an unmodified buffer-fixed block.
mtr_t::x_latch_at_savepoint(), mtr_t::sx_latch_at_savepoint():
Replaced by the use of mtr_t::upgrade_buffer_fix(), which now
also handles RW_S_LATCH.
mtr_t::set_modified(): For temporary tables, invoke
buf_page_t::set_modified() here and not in mtr_t::commit().
We will never set the MTR_MEMO_MODIFY flag on other than
persistent data pages, nor set mtr_t::m_modifications when
temporary data pages are modified.
mtr_t::commit(): Only invoke the buf_flush_note_modification() loop
if persistent data pages were modified.
mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo.
This avoids many redundant entries in mtr_t::m_memo, as well as
redundant calls to buf_page_get_gen() for blocks that had already
been looked up in a mini-transaction.
btr_get_latched_root(): Return a pointer to an already latched root page.
This replaces btr_root_block_get() in cases where the mini-transaction
has already latched the root page.
btr_page_get_parent(): Fetch a parent page that was already latched
in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched().
If needed, upgrade the root page U latch to X.
This avoids bloating mtr_t::m_memo as well as performing redundant
buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for
B-tree defragmentation, we will invoke btr_cur_search_to_nth_level().
btr_cur_search_to_nth_level(): This will only be used for non-leaf
(level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE
or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be
removed altogether, or retained for the case of
CHECK TABLE without QUICK.
btr_cur_t::left_block: Remove. btr_pcur_move_backward_from_page()
can retrieve the left sibling from the end of mtr_t::m_memo.
btr_cur_t::open_leaf(): Some clean-up.
btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level()
for searches to level=0 (the leaf level). We will never release
parent page latches before acquiring leaf page latches. If we need to
temporarily release the level=1 page latch in the BTR_SEARCH_PREV or
BTR_MODIFY_PREV latch_mode, we will reposition the cursor on the
child node pointer so that we will land on the correct leaf page.
btr_cur_t::pessimistic_search_leaf(): Implement new BTR_MODIFY_TREE
latching logic in the case that page splits or merges will be needed.
The parent pages (and their siblings) should already be latched on
the first dive to the leaf and be present in mtr_t::m_memo; there
should be no need for BTR_CONT_MODIFY_TREE. This pre-latching almost
suffices; it must be revised in MDEV-29835 and work-arounds removed
for cases where mtr_t::get_already_latched() fails to find a block.
rtr_search_to_nth_level(): A SPATIAL INDEX version of
btr_cur_search_to_nth_level() that can search to any level
(including the leaf level).
rtr_search_leaf(), rtr_insert_leaf(): Wrappers for
rtr_search_to_nth_level().
rtr_search(): Replaces rtr_pcur_open().
rtr_latch_leaves(): Replaces btr_cur_latch_leaves(). Note that unlike
in the B-tree code, there is no error handling in case the sibling
pages are corrupted.
rtr_cur_restore_position(): Remove an unused constant parameter.
btr_pcur_open_on_user_rec(): Remove the constant parameter
mode=PAGE_CUR_GE.
row_ins_clust_index_entry_low(): Use a new
mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page
when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC.
BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove.
BTR_CONT_MODIFY_TREE: Note that this is only used by
rtr_search_to_nth_level().
btr_pcur_optimistic_latch_leaves(): Replaces
btr_cur_optimistic_latch_leaves().
ibuf_delete_rec(): Acquire exclusive ibuf.index->lock in order
to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV).
btr_blob_log_check_t(): Acquire a U latch on the root page,
so that btr_page_alloc() in btr_store_big_rec_extern_fields()
will avoid a deadlock.
btr_store_big_rec_extern_fields(): Assert that the root page latch
is being held.
Tested by: Matthias Leich
Reviewed by: Vladislav Lesin
This also fixes part of MDEV-29835 Partial server freeze
which is caused by violations of the latching order that was
defined in https://dev.mysql.com/worklog/task/?id=6326
(WL#6326: InnoDB: fix index->lock contention). Unless the
current thread is holding an exclusive dict_index_t::lock,
it must acquire page latches in a strict parent-to-child,
left-to-right order. Not all cases are fixed yet. Failure to
follow the correct latching order will cause deadlocks of threads
due to lock order inversion.
As part of these changes, the BTR_MODIFY_TREE mode is modified
so that an Update latch (U a.k.a. SX) will be acquired on the
root page, and eXclusive latches (X) will be acquired on all pages
leading to the leaf page, as well as any left and right siblings
of the pages along the path. The test innodb.innodb_wl6326
will be removed, because at the time the DEBUG_SYNC point is hit,
the thread is actually holding several page latches that will be
blocking a concurrent SELECT statement.
We also remove double bookkeeping that was caused by excessive
information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo
store information of latched pages, and ensure that
mtr_memo_slot_t::object is never a null pointer.
The tree_blocks[] and tree_savepoints[] were redundant.
mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo.
This avoids many redundant entries in mtr_t::m_memo, as well as
redundant calls to buf_page_get_gen() for blocks that had already
been looked up in a mini-transaction.
btr_get_latched_root(): Return a pointer to an already latched root page.
This replaces btr_root_block_get() in cases where the mini-transaction
has already latched the root page.
btr_page_get_parent(): Fetch a parent page that was already latched
in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched().
If needed, upgrade the root page U latch to X.
This avoids bloating mtr_t::m_memo as well as redundant
buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for
B-tree defragmentation, we will invoke btr_cur_search_to_nth_level().
btr_cur_search_to_nth_level(): This will only be used for non-leaf
(level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE
or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be
removed altogether, or retained for the case of
CHECK TABLE without QUICK.
btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level()
for searches to level=0 (the leaf level).
btr_cur_t::pessimistic_search_leaf(): Implement the new
BTR_MODIFY_TREE latching logic in the case that page splits
or merges will be needed. The parent pages (and their siblings)
should already be latched on the first dive to the leaf and be
present in mtr_t::m_memo; there should be no need for
BTR_CONT_MODIFY_TREE. This pre-latching almost suffices;
MDEV-29835 will have to revise it and remove work-arounds where
mtr_t::get_already_latched() fails to find a block.
rtr_search_to_nth_level(): A SPATIAL INDEX version of
btr_cur_search_to_nth_level() that can search to any level
(including the leaf level).
rtr_search_leaf(), rtr_insert_leaf(): Wrappers for
rtr_search_to_nth_level().
rtr_search(): Replaces rtr_pcur_open().
rtr_cur_restore_position(): Remove an unused constant parameter.
btr_pcur_open_on_user_rec(): Remove the constant parameter
mode=PAGE_CUR_GE.
btr_cur_latch_leaves(): Update a pre-existing mtr_t::m_memo entry
for the current leaf page.
row_ins_clust_index_entry_low(): Use a new
mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page
when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC.
btr_cur_t::open_leaf(): Some clean-up.
mtr_t::lock_register(): Register a page latch on a buffer-fixed block.
BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove.
BTR_CONT_MODIFY_TREE: Note that this is only used by
rtr_search_to_nth_level().
btr_pcur_optimistic_latch_leaves(): Replaces
btr_cur_optimistic_latch_leaves().
ibuf_delete_rec(): Acquire ibuf.index->lock.u_lock() in order
to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV).
Tested by: Matthias Leich
To avoid heap memory allocation overhead for mtr_t::m_memo,
we will allocate a small number of elements statically in
mtr_t::m_memo::small. Only if that preallocated data is
insufficient, we will invoke my_alloc() or my_realloc() for
more storage. The implementation of the data structure is
inspired by llvm::SmallVector.
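A hedged sketch of the underlying idea for trivially copyable elements;
the actual implementation manages mtr_memo_slot_t and calls my_alloc()
and my_realloc():

  #include <cstddef>
  #include <cstdlib>
  #include <cstring>

  template<typename T, size_t N>
  struct small_vector_sketch   // T must be trivially copyable here
  {
    T small[N];                // preallocated storage; no heap needed
    T *data= small;            // points into small[] or to a heap block
    size_t size= 0, capacity= N;

    void push_back(const T &v)
    {
      if (size == capacity)
      {
        // Grow: move to (or enlarge) a heap block, like my_realloc();
        // allocation failure handling is omitted in this sketch.
        capacity*= 2;
        T *heap= static_cast<T*>(malloc(capacity * sizeof(T)));
        memcpy(heap, data, size * sizeof(T));
        if (data != small)
          free(data);
        data= heap;
      }
      data[size++]= v;
    }

    ~small_vector_sketch() { if (data != small) free(data); }
  };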
The purpose of the change buffer was to reduce random disk access,
which could be useful on rotational storage, but maybe less so on
solid-state storage.
When we wished to
(1) insert a record into a non-unique secondary index,
(2) delete-mark a secondary index record,
(3) delete a secondary index record as part of purge (but not ROLLBACK),
and the B-tree leaf page where the record belongs is not in the buffer
pool, we inserted a record into the change buffer B-tree, indexed by
the page identifier. When the page was eventually read into the buffer
pool, we looked up the change buffer B-tree for any modifications to
the page and applied them upon the completion of the read operation.
This was called the insert buffer merge.
We remove the change buffer, because it has been the source of
various hard-to-reproduce corruption bugs, including those fixed in
commit 5b9ee8d819 and
commit 165564d3c3 but not limited to them.
A downgrade will fail with a clear message starting with
commit db14eb16f9 (MDEV-30106).
buf_page_t::state: Merge IBUF_EXIST to UNFIXED and
WRITE_FIX_IBUF to WRITE_FIX.
buf_pool_t::watch[]: Remove.
trx_t: Move isolation_level, check_foreigns, check_unique_secondary,
bulk_insert into the same bit-field. The only purpose of
trx_t::check_unique_secondary is to enable bulk insert into an
empty table. It no longer enables insert buffering for UNIQUE INDEX.
btr_cur_t::thr: Remove. This field was originally needed for change
buffering. Later, its use was extended to cover SPATIAL INDEX.
Much of the time, rtr_info::thr holds this field. When it does not,
we will add parameters to SPATIAL INDEX specific functions.
ibuf_upgrade_needed(): Check if the change buffer needs to be upgraded.
ibuf_upgrade(): Merge and upgrade the change buffer after all redo log
has been applied. Free any pages consumed by the change buffer, and
zero out the change buffer root page to mark the upgrade completed,
and to prevent a downgrade to an earlier version.
dict_load_tablespaces(): Renamed from
dict_check_tablespaces_and_store_max_id(). This needs to be invoked
before ibuf_upgrade().
btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics.
The change buffer merge does not need this function anymore.
btr_page_alloc(): Renamed from btr_page_alloc_low(). We no longer
allocate any change buffer pages.
row_search_index_entry(), btr_lift_page_up(): Add a parameter thr
for the SPATIAL INDEX case.
rtr_page_split_and_insert(): Specialized from btr_page_split_and_insert().
rtr_root_raise_and_insert(): Specialized from btr_root_raise_and_insert().
Note: The support for upgrading from the MySQL 3.23 or MySQL 4.0
change buffer format that predates the MySQL 4.1 introduction of
the option innodb_file_per_table was removed in MySQL 5.6.5
as part of mysql/mysql-server@69b6241a79
and MariaDB 10.0.11 as part of 1d0f70c2f8.
In the tests innodb.log_upgrade and innodb.log_corruption, we create
valid (upgraded) change buffer pages.
Tested by: Matthias Leich
btr_cur_t: Zero-initialize all fields in the default constructor.
btr_cur_t::index: Remove; it duplicated page_cur.index.
Many functions: Remove arguments that were duplicating
page_cur_t::index and page_cur_t::block.
page_cur_open_level(), btr_pcur_open_level(): Replaces
btr_cur_open_at_index_side() for dict_stats_analyze_index().
At the end, release all latches except the dict_index_t::lock
and the buf_page_t::lock on the requested page.
dict_stats_analyze_index(): Rely on mtr_t::rollback_to_savepoint()
to release all uninteresting page latches.
btr_search_guess_on_hash(): Simplify the logic, and invoke
mtr_t::rollback_to_savepoint().
We will use plain C++ std::vector<mtr_memo_slot_t> for mtr_t::m_memo.
In this way, we can avoid setting mtr_memo_slot_t::object to nullptr
and instead just remove garbage from m_memo.
mtr_t::rollback_to_savepoint(): Shrink the vector. We will be needing this
in dict_stats_analyze_index(), where we will release page latches and
only retain the index->lock in mtr_t::m_memo.
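Schematically, on a plain vector (slot_sketch stands in for
mtr_memo_slot_t):

  #include <cstddef>
  #include <vector>

  struct slot_sketch { void *object; unsigned type; };

  struct memo_sketch
  {
    std::vector<slot_sketch> m_memo;

    size_t get_savepoint() const noexcept { return m_memo.size(); }

    void rollback_to_savepoint(size_t savepoint)
    {
      // Release the latches recorded in m_memo[savepoint..] here,
      // then shrink the vector; no slot is left behind with
      // object == nullptr.
      m_memo.resize(savepoint);
    }
  };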
mtr_t::release_last_page(): Release the last acquired page latch.
Replaces btr_leaf_page_release().
mtr_t::release(const buf_block_t&): Release a single page latch.
Used in btr_pcur_move_backward_from_page().
mtr_t::memo_release(): Replaced with mtr_t::release().
mtr_t::upgrade_buffer_fix(): Acquire a latch for a buffer-fixed page.
This replaces the double bookkeeping in btr_cur_t::open_leaf().
Reviewed by: Vladislav Lesin
btr_cur_search_to_nth_level(): Simply acquire a latch on the already
buffer-fixed page. There is no need to release the buffer-fix and
re-lookup the page.
The index root page contains the fields BTR_SEG_TOP and BTR_SEG_LEAF
which keep track of allocated pages in the index tree. These fields
are normally protected by an Update latch, so that concurrent read
access to other parts of the page will be possible.
When the index root page is already exclusively latched in the
mini-transaction, we must not try to acquire a lower-grade Update latch.
In fact, when the root page is already X or U latched in the
mini-transaction, there is no point in acquiring another latch.
Moreover, after a U latch was acquired on top of an X-latch,
mtr_t::defer_drop_ahi() would trigger an assertion failure or
lock corruption in block->page.lock.u_x_upgrade() because X locks
already exist on the block.
This problem may have been introduced in
commit 03ca6495df (MDEV-24142).
btr_page_alloc_low(), btr_page_free(): Initially buffer-fix the root page.
If it is already U or X latched, release the buffer-fix. Else, upgrade
the buffer-fix to a U latch.
mtr_t::u_lock_register(): Upgrade a buffer-fix to U latch.
mtr_t::have_u_or_x_latch(): Check if U or X latches are already
registered in the mini-transaction.
There was a race condition between log_checkpoint_low() and
deleting or renaming data files. The scenario is as follows:
1. The buffer pool does not contain dirty pages.
2. A FILE_DELETE or FILE_RENAME record is written.
3. The checkpoint LSN will be moved ahead of the write of the record.
4. The server is killed before the file is actually renamed or deleted.
We will prevent this race condition by ensuring that a log checkpoint
cannot occur between the durable write and the file system operation:
1. Durably write the FILE_DELETE or FILE_RENAME record.
2. Perform the file system operation.
3. Allow any log checkpoint to proceed.
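A hedged sketch of this ordering, with hypothetical stub names; the
real serialization against log checkpoints is implemented in
mtr_t::commit_file():

  static void write_and_sync_file_record() {} // FILE_DELETE or FILE_RENAME
  static void file_system_operation() {}      // unlink() or rename()

  static void commit_file_sketch()
  {
    // No log checkpoint may intervene between the two steps; otherwise
    // the checkpoint LSN could advance past the record before the file
    // operation becomes durable, recreating the race described above.
    write_and_sync_file_record(); // 1. durable write of the record
    file_system_operation();      // 2. the file system operation
    // 3. from here on, a log checkpoint may proceed
  }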
mtr_t::commit_file(): Implement the DELETE or RENAME logic.
fil_delete_tablespace(): Delegate some of the logic to
mtr_t::commit_file().
fil_space_t::rename(): Delegate some logic to mtr_t::commit_file().
Remove the debug injection point fil_rename_tablespace_failure_2
because we do test RENAME failures without any debug injection.
fil_name_write_rename_low(), fil_name_write_rename(): Remove.
Tested by: Matthias Leich