mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-29 10:14:19 +01:00

Author	SHA1	Message	Date
Marko Mäkelä	e039720bf3	MDEV-32096 Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait lock_sys_t::cancel(trx_t*): Remove, and merge to its only caller innobase_kill_query(). innobase_kill_query(): Before reading trx->lock.wait_lock, do acquire lock_sys.wait_mutex, like we did before commit `e71e613353` (MDEV-24671). In this way, we should not miss a recently started lock wait by the killee transaction. lock_rec_lock(): Add a DEBUG_SYNC "lock_rec" for the test case. lock_wait(): Invoke trx_is_interrupted() before entering the wait, in case innobase_kill_query() was invoked some time earlier and some longer-running operation did not check for interrupts. As suggested by Vladislav Lesin, do not overwrite trx->error_state==DB_INTERRUPTED with DB_SUCCESS. This would avoid a call to trx_is_interrupted() when the test is modified to use the DEBUG_SYNC point lock_wait_start instead of lock_rec. Avoid some redundant loads of trx->lock.wait_lock; cache the value in the local variable wait_lock. Deadlock::check_and_resolve(): Take wait_lock as a parameter and return wait_lock (or -1 or nullptr). We only need to reload trx->lock.wait_lock if lock_sys.wait_mutex had been released and reacquired. trx_t::error_state: Correctly document the data member. trx_lock_t::was_chosen_as_deadlock_victim: Clarify that other threads may set the field (or flags in it) while holding lock_sys.wait_mutex. Thanks to Johannes Baumgarten for reporting the problem and testing the fix, as well as to Kristian Nielsen for suggesting the fix. Reviewed by: Vladislav Lesin Tested by: Matthias Leich	2023-09-11 14:51:02 +03:00
Marko Mäkelä	9d1466522e	MDEV-32029 Assertion failures in log_sort_flush_list upon crash recovery In commit `0d175968d1` (MDEV-31354) we only waited that no buf_pool.flush_list writes are in progress. The buf_flush_page_cleaner() thread could still initiate page writes from the buf_pool.LRU list while only holding buf_pool.mutex, not buf_pool.flush_list_mutex. This is something that was changed in commit `a55b951e60` (MDEV-26827). log_sort_flush_list(): Wait for the buf_flush_page_cleaner() thread to be completely idle, including LRU flushing. buf_flush_page_cleaner(): Always broadcast buf_pool.done_flush_list when becoming idle, so that log_sort_flush_list() will be woken up. Also, ensure that buf_pool.n_flush_inc() or buf_pool.flush_list_set_active() has been invoked before any page writes are initiated. buf_flush_try_neighbors(): Release buf_pool.mutex here and not in the callers, to avoid code duplication. Make innodb_flush_neighbors=ON obey the innodb_io_capacity limit.	2023-08-30 14:40:13 +03:00
Marko Mäkelä	f7780a8eb8	MDEV-30100: Assertion purge_sys.tail.trx_no <= purge_sys.rseg->last_trx_no() trx_t::commit_empty(): A special case of transaction "commit" when the transaction was actually rolled back or the persistent undo log is empty. In this case, we need to change the undo log header state to TRX_UNDO_CACHED and move the undo log from rseg->undo_list to rseg->undo_cached for fast reuse. Furthermore, unless this is the only undo log record in the page, we will remove the record and rewind TRX_UNDO_PAGE_START, TRX_UNDO_PAGE_FREE, TRX_UNDO_LAST_LOG. We must also ensure that the system-wide transaction identifier will be persisted up to this->id, so that there will not be warnings or errors due to a PAGE_MAX_TRX_ID being too large. We might have modified secondary index pages before being rolled back, and any changes of PAGE_MAX_TRX_ID are never rolled back. Even though it is not going to be written persistently anywhere, we will invoke trx_sys.assign_new_trx_no(this), so that in the test innodb.instant_alter everything will be purged as expected. trx_t::write_serialisation_history(): Renamed from trx_write_serialisation_history(). If there is no undo log, invoke commit_empty(). trx_purge_add_undo_to_history(): Simplify an assertion and remove a comment. This function will not be invoked on an empty undo log anymore. trx_undo_header_create(): Add a debug assertion. trx_undo_mem_create_at_db_start(): Remove a duplicated assignment. Reviewed by: Vladislav Lesin Tested by: Matthias Leich	2023-08-25 13:41:54 +03:00
Marko Mäkelä	4ff5311dec	MDEV-30100 preparation: Simplify InnoDB transaction commit further trx_commit_complete_for_mysql(): Remove some conditions. We will rely on trx_t::commit_lsn. trx_t::must_flush_log_later: Remove. trx_commit_complete_for_mysql() can simply check for trx_t::flush_log_later. trx_t::commit_in_memory(): Set commit_lsn=0 if the log was written. trx_flush_log_if_needed_low(): Renamed to trx_flush_log_if_needed(). Assert that innodb_flush_log_at_trx_commit!=0 was checked by the caller and that the transaction is not in XA PREPARE state. Unconditionally flush the log for data dictionary transactions, to ensure the correct processing of ddl_recovery.log. trx_write_serialisation_history(): Move some code from trx_purge_add_undo_to_history(). trx_prepare(): Invoke log_write_up_to() directly if needed. innobase_commit_ordered_2(): Simplify some conditions. A read-write transaction will always carry nonzero trx_t::id. Let us unconditionally reset mysql_log_file_name, flush_log_later after trx_t::commit() was invoked.	2023-08-25 13:23:21 +03:00
Marko Mäkelä	f4bbea90f1	MDEV-30100 preparation: Simplify InnoDB transaction commit trx_commit_cleanup(): Clean up any temporary undo log. Replaces trx_undo_commit_cleanup() and trx_undo_seg_free(). trx_write_serialisation_history(): Commit the mini-transaction. Do not touch temporary undo logs. Assume that a persistent rollback segment has been assigned. trx_serialise(): Merged into trx_write_serialisation_history(). trx_t::commit_low(): Correct some comments and assertions. trx_t::commit_persist(): Only invoke commit_low() on a mini-transaction if the persistent state needs to change.	2023-08-25 13:16:54 +03:00
Marko Mäkelä	a60462d93e	Remove bogus references to replaced Google contributions In commit `03ca6495df` and commit `ff5d306e29` we forgot to remove some Google copyright notices related to a contribution of using atomic memory access in the old InnoDB mutex_t and rw_lock_t implementation. The copyright notices had been mostly added in commit `c6232c06fa` due to commit `a1bb700fd2`. The following Google contributions remain: * some logic related to the parameter innodb_io_capacity * innodb_encrypt_tables, added in MariaDB Server 10.1	2023-08-21 15:51:16 +03:00
Marko Mäkelä	6cc88c3db1	Clean up buf0buf.inl Let us move some #include directives from buf0buf.inl to the compilation units where they are really used.	2023-08-21 15:51:10 +03:00
Oleksandr Byelkin	6bf8483cac	Merge branch '10.5' into 10.6	2023-08-01 15:08:52 +02:00
Marko Mäkelä	72928e640e	MDEV-27593: Crashing on I/O error is unhelpful buf_page_t::write_complete(), buf_page_write_complete(), IORequest::write_complete(): Add a parameter for passing an error code. If an error occurred, we will release the io-fix, buffer-fix and page latch but not reset the oldest_modification field. The block would remain in buf_pool.LRU and possibly buf_pool.flush_list, to be written again later, by buf_flush_page_cleaner(). If all page writes start consistently failing, all write threads should eventually hang in log_free_check() because the log checkpoint cannot be advanced to make room in the circular write-ahead-log ib_logfile0. IORequest::read_complete(): Add a parameter for passing an error code. If a read operation fails, we report the error and discard the page, just like we would do if the page checksum was not validated or the page could not be decrypted. This only affects asynchronous reads, due to linear or random read-ahead or crash recovery. When buf_page_get_low() invokes buf_read_page(), that will be a synchronous read, not involving this code. This was tested by randomly injecting errors in write_io_callback() and read_io_callback(), like this: if (!ut_rnd_interval(100)) cb->m_err= 42;	2023-08-01 14:39:29 +03:00
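The write-error handling described in MDEV-27593 can be sketched with a toy page descriptor. This is only an illustration under stated assumptions: `ToyPage`, `start_write()` and `flush_once()` are hypothetical names, not InnoDB code, and the real `buf_page_t::write_complete()` also releases a buffer-fix and a page latch. The sketch shows only the one rule from the message: on error the I/O fix is released, but `oldest_modification` is kept so that the page stays dirty and can be written again later.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a buffer pool page descriptor; all names are hypothetical.
struct ToyPage
{
  std::uint64_t oldest_modification = 0; // nonzero while the page is dirty
  bool io_fixed = false;                 // set while a write is in flight

  bool is_dirty() const { return oldest_modification != 0; }

  void start_write() { io_fixed = true; }

  // Completion callback: always release the I/O fix, but only mark
  // the page clean if the write succeeded (err == 0).
  void write_complete(int err)
  {
    io_fixed = false;
    if (!err)
      oldest_modification = 0;
  }
};

// Write a dirty page once; inject_err simulates the result reported by
// the I/O layer. Returns true if the page still needs to be written again.
bool flush_once(ToyPage &page, int inject_err)
{
  page.start_write();
  page.write_complete(inject_err);
  return page.is_dirty();
}
```

Calling `flush_once(page, 42)` on a dirty page leaves it dirty, which mirrors the injected `cb->m_err= 42` in the commit message; a later `flush_once(page, 0)` marks it clean.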
Oleksandr Byelkin	7564be1352	Merge branch '10.4' into 10.5	2023-07-26 16:02:57 +02:00
Marko Mäkelä	b102872ad5	MDEV-31767 InnoDB tables are being flagged as corrupted on an I/O bound server The main problem is that ever since commit `aaef2e1d8c` removed the function buf_wait_for_read(), it is not safe to invoke buf_page_get_low() with RW_NO_LATCH, that is, only buffer-fixing the page. If a page read (or decryption or decompression) is in progress, there would be a race condition when executing consistency checks, and a page would wrongly be flagged as corrupted. Furthermore, if the page is actually corrupted and the initial access to it was with RW_NO_LATCH (only buffer-fixing), the page read handler would likely end up in an infinite loop in buf_pool_t::corrupted_evict(). It is not safe to invoke mtr_t::upgrade_buffer_fix() on a block on which a page latch was not initially acquired in buf_page_get_low(). btr_block_reget(): Remove the constant parameter rw_latch=RW_X_LATCH. btr_block_get(): Assert that RW_NO_LATCH is not being used, and change the parameter type of rw_latch. btr_pcur_move_to_next_page(), innobase_table_is_empty(): Adjust for the parameter type change of btr_block_get(). btr_root_block_get(): If mode==RW_NO_LATCH, do not check the integrity of the page, because it is not safe to do so. btr_page_alloc_low(), btr_page_free(): If the root page latch is not previously held by the mini-transaction, invoke btr_root_block_get() again with the proper latching mode. btr_latch_prev(): Helper function to safely acquire a latch on a preceding sibling page while holding a latch on a B-tree page. To avoid deadlocks, we must not wait for the latch while holding a latch on the current page, because another thread may be waiting for our page latch when moving to the next page from our preceding sibling page. If s_lock_try() or x_lock_try() on the preceding page fails, we must release the current page latch, and wait for the latch on the preceding page as well as the current page, in that order.
Page splits or merges will be prevented by the parent page latch that we are holding. btr_cur_t::search_leaf(): Make use of btr_latch_prev(). btr_cur_t::open_leaf(): Make use of btr_latch_prev(). Do not invoke mtr_t::upgrade_buffer_fix() (when latch_mode == BTR_MODIFY_TREE), because we will already have acquired all page latches upfront. btr_cur_t::pessimistic_search_leaf(): Do acquire an exclusive index latch before accessing the page. Make use of btr_latch_prev().	2023-07-25 11:40:58 +03:00
Oleksandr Byelkin	f52954ef42	Merge commit '10.4' into 10.5	2023-07-20 11:54:52 +02:00
Vlad Lesin	090a84366a	MDEV-29311 Server Status Innodb_row_lock_time% is reported in seconds Before MDEV-24671, the wait time was derived from my_interval_timer() / 1000 (nanoseconds converted to microseconds, and not microseconds to milliseconds like I must have assumed). The lock_sys.wait_time and lock_sys.wait_time_max are already in milliseconds; we should not divide them by 1000. In MDEV-24738 the millisecond counts lock_sys.wait_time and lock_sys.wait_time_max were changed to a 32-bit type. That would overflow in 49.7 days. Keep using a 64-bit type for those millisecond counters. Reviewed by: Marko Mäkelä	2023-07-10 12:42:46 +03:00
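The "49.7 days" figure in MDEV-29311 follows from the range of a 32-bit unsigned counter: 2^32 milliseconds divided by 86,400,000 milliseconds per day. A minimal sketch of that arithmetic follows; `days_until_wrap()` and `add_wait()` are hypothetical helpers written for this illustration, not functions from the server.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Number of milliseconds in one day.
constexpr std::uint64_t ms_per_day = 24ULL * 60 * 60 * 1000;

// Days until an unsigned millisecond counter of type T wraps around to 0.
template <typename T>
constexpr double days_until_wrap()
{
  return (static_cast<double>(std::numeric_limits<T>::max()) + 1.0)
    / ms_per_day;
}

// Hypothetical helper: accumulate a wait time in a counter of type T.
// Unsigned arithmetic wraps silently on overflow.
template <typename T>
constexpr T add_wait(T total, T wait_ms)
{
  return static_cast<T>(total + wait_ms);
}
```

With `std::uint32_t` the wrap happens after roughly 49.7 days of accumulated wait time, while a `std::uint64_t` counter keeps counting past 2^32 milliseconds.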
Monty	99bd226059	MDEV-31558 Add InnoDB engine information to the slow query log The new statistics is enabled by adding the "engine", "innodb" or "full" option to --log-slow-verbosity Example output: # Pages_accessed: 184 Pages_read: 95 Pages_updated: 0 Old_rows_read: 1 # Pages_read_time: 17.0204 Engine_time: 248.1297 Page_read_time is time doing physical reads inside a storage engine. (Writes cannot be tracked as these are usually done in the background). Engine_time is the time spent inside the storage engine for the full duration of the read/write/update calls. It uses the same code as 'analyze statement' for calculating the time spent. The engine statistics is done with a generic interface that should be easy for any engine to use. It can also easily be extended to provide even more statistics. Currently only InnoDB has counters for Pages_% and Undo_% status. Engine_time works for all engines. Implementation details: class ha_handler_stats holds all engine stats. This class is included in handler and THD classes. While a query is running, all statistics is updated in the handler. In close_thread_tables() the statistics is added to the THD. handler::handler_stats is a pointer to where statistics should be collected. This is set to point to handler::active_handler_stats if stats are requested. If not, it is set to 0. handler_stats has also an element, 'active' that is 1 if stats are requested. This is to allow engines to avoid doing any 'if's while updating the statistics. Cloned or partition tables have the pointer set to the base table if status are requested. There is a small performance impact when using --log-slow-verbosity=engine: - All engine calls in 'select' will be timed. - IO calls for InnoDB reads will be timed. - Incrementation of counters are done on local variables and accesses are inline, so these should have very little impact. - Statistics has to be reset for each statement for the THD and each used handler. 
This is only 40 bytes, which should be negligible. - For partition tables we have to loop over all partitions to update the handler_status as part of table_init(). Can be optimized in the future to only do this if log-slow-verbosity changes. For this to work we have to update handler_status for all opened partitions and also for all partitions opened in the future. Other things: - Added options 'engine' and 'full' to log-slow-verbosity. - Some of the new files in the test suite come from Percona Server, which has similar status information. - buf_page_optimistic_get(): Do not increment any counter, since we are only validating a pointer, not performing any buf_pool.page_hash lookup. - Added THD argument to save_explain_data_intern(). - Switched arguments for save_explain_.*_data() to always have THD first (generates better code as other functions also have THD first).	2023-07-07 12:53:18 +03:00
Vlad Lesin	1bfd3cc457	MDEV-10962 Deadlock with 3 concurrent DELETEs by unique key PROBLEM: A deadlock was possible when a transaction tried to "upgrade" an already held Record Lock to Next Key Lock. SOLUTION: This patch is based on observations that: (1) a Next Key Lock is equivalent to Record Lock combined with Gap Lock (2) a GAP Lock never has to wait for any other lock In case we request a Next Key Lock, we check if we already own a Record Lock of equal or stronger mode, and if so, then we change the requested lock type to GAP Lock, which we either already have, or can be granted immediately, as GAP locks don't conflict with any other lock types. (We don't consider Insert Intention Locks a Gap Lock in above statements). The reason why we don't upgrade Record Lock to Next Key Lock is the following. Imagine a transaction which does something like this: for each row { request lock in LOCK_X\|LOCK_REC_NOT_GAP mode request lock in LOCK_S mode } If we upgraded lock from Record Lock to Next Key lock, there would be created only two lock_t structs for each page, one for LOCK_X\|LOCK_REC_NOT_GAP mode and one for LOCK_S mode, and then used their bitmaps to mark all records from the same page. The situation would look like this: request lock in LOCK_X\|LOCK_REC_NOT_GAP mode on row 1: // -> creates new lock_t for LOCK_X\|LOCK_REC_NOT_GAP mode and sets bit for // 1 request lock in LOCK_S mode on row 1: // -> notices that we already have LOCK_X\|LOCK_REC_NOT_GAP on the row 1, // so it upgrades it to X request lock in LOCK_X\|LOCK_REC_NOT_GAP mode on row 2: // -> creates a new lock_t for LOCK_X\|LOCK_REC_NOT_GAP mode (because we // don't have any after we've upgraded!) and sets bit for 2 request lock in LOCK_S mode on row 2: // -> notices that we already have LOCK_X\|LOCK_REC_NOT_GAP on the row 2, // so it upgrades it to X ...etc...etc.. Each iteration of the loop creates a new lock_t struct, and in the end we have a lot (one for each record!)
of LOCK_X locks, each with single bit set in the bitmap. Soon we run out of space for lock_t structs. If we create LOCK_GAP instead of lock upgrading, the above scenario works like the following: // -> creates new lock_t for LOCK_X\|LOCK_REC_NOT_GAP mode and sets bit for // 1 request lock in LOCK_S mode on row 1: // -> notices that we already have LOCK_X\|LOCK_REC_NOT_GAP on the row 1, // so it creates LOCK_S\|LOCK_GAP only and sets bit for 1 request lock in LOCK_X\|LOCK_REC_NOT_GAP mode on row 2: // -> reuses the lock_t for LOCK_X\|LOCK_REC_NOT_GAP by setting bit for 2 request lock in LOCK_S mode on row 2: // -> notices that we already have LOCK_X\|LOCK_REC_NOT_GAP on the row 2, // so it reuses LOCK_S\|LOCK_GAP setting bit for 2 In the end we have just two locks per page, one for each mode: LOCK_X\|LOCK_REC_NOT_GAP and LOCK_S\|LOCK_GAP. Another benefit of this solution is that it avoids not-entirely const-correct, (and otherwise looking risky) "upgrading". The fix was ported from mysql/mysql-server@bfba840dfa mysql/mysql-server@75cefdb1f7 Reviewed by: Marko Mäkelä	2023-07-06 15:06:10 +03:00
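The lock_t growth described in MDEV-10962 can be reproduced with a toy model. This is a sketch under stated assumptions, not InnoDB's lock manager: `ToyLock`, `ToyPageLocks` and the two driver functions are hypothetical, a `std::set` of row numbers stands in for the lock bitmap, and conflicts and waiting are not modelled at all. It only counts how many lock structs a single transaction ends up with on one page under each strategy.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy lock modes; the names only mimic the ones in the commit message.
enum class Mode { X_REC_NOT_GAP, X_NEXT_KEY, S_GAP };

// One toy "lock_t": a mode plus the set of rows it covers on a page.
struct ToyLock
{
  Mode mode;
  std::set<unsigned> rows;
};

class ToyPageLocks
{
public:
  // Set the bit for a row in an existing lock of this mode, or create one.
  void request(Mode mode, unsigned row)
  {
    for (ToyLock &l : locks_)
      if (l.mode == mode) { l.rows.insert(row); return; }
    locks_.push_back(ToyLock{mode, {row}});
  }

  // Old behaviour: turn the X|REC_NOT_GAP lock that covers the row
  // into a next-key X lock, so that it can no longer be reused.
  void upgrade(unsigned row)
  {
    for (ToyLock &l : locks_)
      if (l.mode == Mode::X_REC_NOT_GAP && l.rows.count(row))
      { l.mode = Mode::X_NEXT_KEY; return; }
  }

  std::size_t size() const { return locks_.size(); }

private:
  std::vector<ToyLock> locks_;
};

// Strategy 1: upgrade the record lock when LOCK_S is requested.
std::size_t locks_with_upgrade(unsigned n_rows)
{
  ToyPageLocks page;
  for (unsigned row = 1; row <= n_rows; row++)
  {
    page.request(Mode::X_REC_NOT_GAP, row);
    page.upgrade(row);
  }
  return page.size();
}

// Strategy 2: request only a gap lock when the record lock is already held.
std::size_t locks_with_gap(unsigned n_rows)
{
  ToyPageLocks page;
  for (unsigned row = 1; row <= n_rows; row++)
  {
    page.request(Mode::X_REC_NOT_GAP, row);
    page.request(Mode::S_GAP, row);
  }
  return page.size();
}
```

In this model the upgrading strategy ends with one struct per row, while the gap-lock strategy ends with two structs per page regardless of the number of rows, which matches the outcome stated in the commit message.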
Marko Mäkelä	3d90143859	MDEV-31559 btr_search_hash_table_validate() does not check if CHECK TABLE is killed btr_search_hash_table_validate(), btr_search_validate(): Add the parameter THD for checking if the statement has been killed. Any non-QUICK CHECK TABLE will validate the entire adaptive hash index for all InnoDB tables, which may be extremely slow when running multiple concurrent CHECK TABLE.	2023-06-30 17:07:21 +03:00
Marko Mäkelä	493083833b	Merge 10.5 into 10.6	2023-06-26 17:11:38 +03:00
Thirunarayanan Balathandayuthapani	bd076d4dff	MDEV-31442 page_cleaner thread aborts while releasing the tablespace - InnoDB shouldn't acquire the tablespace when it is being stopped or closed	2023-06-16 14:58:48 +05:30
Thirunarayanan Balathandayuthapani	841e905f20	MDEV-31442 page_cleaner thread aborts while releasing the tablespace After further I/O on a tablespace has been stopped (for example due to DROP TABLE or an operation that rebuilds a table), page cleaner thread tries to flush the pending writes for the tablespace and releases the tablespace reference even though it was not acquired. fil_space_t::flush(): Don't release the tablespace when it is being stopped and closed Thanks to Marko Mäkelä for suggesting this patch.	2023-06-09 18:15:33 +05:30
Marko Mäkelä	80585c9d6f	Merge 10.5 into 10.6	2023-06-08 10:42:56 +03:00
Marko Mäkelä	c25b496724	MDEV-31382 SET GLOBAL innodb_undo_log_truncate=ON has no effect on logically empty undo logs innodb_undo_log_truncate_update(): A callback function. If SET GLOBAL innodb_undo_log_truncate=ON, invoke srv_wake_purge_thread_if_not_active(). srv_wake_purge_thread_if_not_active(): If innodb_undo_log_truncate=ON, always wake up the purge subsystem. srv_do_purge(): If the history is empty, invoke trx_purge_truncate_history() in order to free undo log pages. trx_purge_truncate_history(): If head.trx_no==0, consider the cached undo logs to be free. trx_purge(): Remove the parameter "bool truncate" and let the caller invoke trx_purge_truncate_history() directly. Reviewed by: Vladislav Lesin	2023-06-08 09:18:21 +03:00
Marko Mäkelä	3e40f9a7f3	MDEV-31355 innodb_undo_log_truncate=ON fails to wait for purge of enough transaction history purge_sys_t::sees(): Wrapper for view.sees(). trx_purge_truncate_history(): Invoke purge_sys.sees() instead of comparing to head.trx_no, to determine if undo pages can be safely freed. The test innodb.cursor-restore-locking was adjusted by Vladislav Lesin, as was the debug instrumentation in row_purge_del_mark(). Reviewed by: Vladislav Lesin	2023-06-08 09:17:52 +03:00
Marko Mäkelä	a6c0a27696	MDEV-31362 recv_sys_t::apply(bool): Assertion `!last_batch \|\| recovered_lsn == scanned_lsn' failed recv_sys_t::apply(): Remove a bogus debug assertion that had been added in commit `f2c17cc9d9` (MDEV-29911). It is perfectly normal that when the server was killed in the middle of writing multiple redo log blocks, the recovery would end such that recv_sys.scanned_lsn will point to the end of the last complete 512-byte log block, but recv_sys.recovered_lsn will be less than that. Also, correct the function comment of recv_sys_t::parse().	2023-05-30 17:21:49 +03:00
Marko Mäkelä	e38c075aa0	MDEV-31346 trx_purge_add_undo_to_history() is not optimal trx_undo_set_state_at_finish(): Merge to its only caller, trx_purge_add_undo_to_history(). trx_purge_add_undo_to_history(): Evaluate the condition related to TRX_UNDO_STATE only once. Tested by: Matthias Leich	2023-05-26 16:39:46 +03:00
Teemu Ollakka	f307160218	MDEV-29293 MariaDB stuck on starting commit state This commit contains a merge from 10.5-MDEV-29293-squash into 10.6. Although the bug MDEV-29293 was not reproducible with 10.6, the fix contains several improvements for wsrep KILL query and BF abort handling, and addresses the following issues: * MDEV-30307 KILL command issued inside a transaction is problematic for galera replication: This commit will remove KILL TOI replication, so Galera side transaction context is not lost during KILL. * MDEV-21075 KILL QUERY maintains nodes data consistency but breaks GTID sequence: This is fixed as well as KILL does not use TOI, and thus does not change GTID state. * MDEV-30372 Assertion in wsrep-lib state: This was caused by BF abort or KILL when local transaction was in the middle of group commit. This commit disables THD::killed handling during commit, so the problem is avoided. * MDEV-30963 Assertion failure !lock.was_chosen_as_deadlock_victim in trx0trx.h:1065: The assertion happened when the victim was BF aborted via MDL while it was committing. This commit changes MDL BF aborts so that transactions which are committing cannot be BF aborted via MDL. The RQG grammar attached in the issue could not reproduce the crash anymore. Original commit message from 10.5 fix: MDEV-29293 MariaDB stuck on starting commit state The problem seems to be a deadlock between KILL command execution and BF abort issued by an applier, where: * KILL has locked victim's LOCK_thd_kill and LOCK_thd_data. * Applier has innodb side global lock mutex and victim trx mutex. * KILL is calling innobase_kill_query, and is blocked by innodb global lock mutex. * Applier is in wsrep_innobase_kill_one_trx and is blocked by victim's LOCK_thd_kill. The fix in this commit removes the TOI replication of KILL command and makes KILL execution less intrusive operation. Aborting the victim happens now by using awake_no_mutex() and ha_abort_transaction(). 
If the KILL happens when the transaction is committing, the KILL operation is postponed to happen after the statement has completed, so that KILL does not interrupt commit processing. Notable changes in this commit: * wsrep client connection's error state may remain sticky after client connection is closed. This error message will then pop up for the next client session issuing first SQL statement. This problem arose with test galera.galera_bf_kill. The fix is to reset wsrep client error state, before a THD is reused for next connection. * Release THD locks in wsrep_abort_transaction when locking innodb mutexes. This guarantees same locking order as with applier BF aborting. * BF abort from MDL was changed to do BF abort on server/wsrep-lib side first, and only then do the BF abort on InnoDB side. This removes the need to call back from InnoDB for BF aborts which originate from MDL and simplifies the locking. * Removed wsrep_thd_set_wsrep_aborter() from service_wsrep.h. The manipulation of the wsrep_aborter can be done solely on server side. Moreover, it is now debug only variable and could be excluded from optimized builds. * Remove LOCK_thd_kill from wsrep_thd_LOCK/UNLOCK to allow more fine grained locking for SR BF abort which may require locking of victim LOCK_thd_kill. Added explicit call for wsrep_thd_kill_LOCK/UNLOCK where appropriate. * Wsrep-lib was updated to version which allows external locking for BF abort calls. Changes to MTR tests: * Disable galera_bf_abort_group_commit. This test is going to be removed (MDEV-30855). * Make galera_var_retry_autocommit result more readable by echoing cases and expectations into result. Only one expected result for reap to verify that server returns expected status for query. * Record galera_gcache_recover_manytrx as result file was incomplete. Trivial change. * Make galera_create_table_as_select more deterministic: Wait until CTAS execution has reached MDL wait for multi-master conflict case.
Expected error from multi-master conflict is ER_QUERY_INTERRUPTED. This is because CTAS does not yet have open wsrep transaction when it is waiting for MDL, query gets interrupted instead of BF aborted. This should be addressed in separate task. * A new test galera_bf_abort_registering to check that registering trx gets BF aborted through MDL. * A new test galera_kill_group_commit to verify correct behavior when KILL is executed while the transaction is committing. Co-authored-by: Seppo Jaakola <seppo.jaakola@iki.fi> Co-authored-by: Jan Lindström <jan.lindstrom@galeracluster.com> Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>	2023-05-22 00:42:05 +02:00
Vlad Lesin	b54e7b0cea	MDEV-31185 rw_trx_hash_t::find() unpins pins too early rw_trx_hash_t::find() acquires element->mutex, then unpins pins, used for lf_hash element search. After that the "element" can be deallocated and reused by some other thread. If we take a look at the rw_trx_hash_t::insert()->lf_hash_insert()->lf_alloc_new() calls, we will not find any element->mutex acquisition, as it was not initialized yet before its allocation. rw_trx_hash_t::insert() can reuse the chunk, unpinned in rw_trx_hash_t::find(). The scenario is the following: 1. Thread 1 has just executed lf_hash_search() in rw_trx_hash_t::find(), but has not acquired element->mutex yet. 2. Thread 2 has removed the element from hash table with rw_trx_hash_t::erase() call. 3. Thread 1 acquired element->mutex and unpinned pin 2 with the lf_hash_search_unpin(pins) call. 4. Some thread purged memory of the element. 5. Thread 3 reused the memory for the element, filled element->id, element->trx. 6. Thread 1 crashes with failed "DBUG_ASSERT(trx_id == trx->id)" assertion. Note that trx_t objects are also reused, see the code around trx_pools for details. The fix is to invoke "lf_hash_search_unpin(pins);" after element->trx is stored in local variable in rw_trx_hash_t::find(). Reviewed by: Nikita Malyavin, Marko Mäkelä.	2023-05-19 15:50:20 +03:00
Marko Mäkelä	d2420669bd	MDEV-31309 Innodb_buffer_pool_read_requests is not updated correctly srv_export_innodb_status(): Update export_vars.innodb_buffer_pool_read_requests as it was done before commit `a55b951e60` (MDEV-26827). If innodb_status_variables[] pointed to a sharded variable, it would only access the first shard.	2023-05-19 15:38:48 +03:00
Marko Mäkelä	f2c17cc9d9	MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress This is a 10.6 port of commit `2f9e264781` from MariaDB Server 10.9 that is missing some optimization due to a more complex redo log format and recovery logic (which was simplified in commit `685d958e38`). The progress reporting of InnoDB crash recovery was rather intermittent. Nothing was reported during the single-threaded log record parsing, which could consume minutes when parsing a large log. During log application, there only was progress reporting in background threads that would be invoked on data page read completion. The progress reporting here will be detailed like this: InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799 InnoDB: Read redo log up to LSN=1963895808 InnoDB: Multi-batch recovery needed at LSN 2534560930 InnoDB: Read redo log up to LSN=3312233472 InnoDB: Read redo log up to LSN=1599646720 InnoDB: Read redo log up to LSN=2160831488 InnoDB: To recover: LSN 2806789376/2806819840; 195082 pages InnoDB: To recover: LSN 2806789376/2806819840; 63507 pages InnoDB: Read redo log up to LSN=3195776000 InnoDB: Read redo log up to LSN=3687099392 InnoDB: Read redo log up to LSN=4165315584 InnoDB: To recover: LSN 4374395699/4374440960; 241454 pages InnoDB: To recover: LSN 4374395699/4374440960; 123701 pages InnoDB: Read redo log up to LSN=4508724224 InnoDB: Read redo log up to LSN=5094550528 InnoDB: To recover: 205230 pages The previous messages "Starting a batch to recover" or "Starting a final batch to recover" will be replaced by "To recover: ... pages" messages. If a batch lasts longer than 15 seconds, then there will be progress reports every 15 seconds, showing the number of remaining pages. For the non-final batch, the "To recover:" message includes two end LSN: that of the batch, and of the recovered log. This is the primary measure of progress. The batch will end once the number of pages to recover reaches 0. 
If recovery is possible in a single batch, the output will look like this, with a shorter "To recover:" message that counts only the remaining pages: InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799 InnoDB: Read redo log up to LSN=1984539648 InnoDB: Read redo log up to LSN=2710875136 InnoDB: Read redo log up to LSN=3358895104 InnoDB: Read redo log up to LSN=3965299712 InnoDB: Read redo log up to LSN=4557417472 InnoDB: Read redo log up to LSN=5219527680 InnoDB: To recover: 450915 pages We will also speed up recovery by improving the memory management and implementing multi-threaded recovery of data pages that will not need to be read into the buffer pool ("fake read"). Log application in the "fake read" threads will be protected by an atomic being_recovered field and exclusive buf_page_t::lock. Recovery will reserve for data pages two thirds of the buffer pool, or 256 pages, whichever is smaller. Previously, we could only use at most one third of the buffer pool for buffered log records. This would typically mean that with large buffer pools, recovery unnecessarily consisted of multiple batches. If recovery runs out of memory, it will "roll back" or "rewind" the current mini-transaction. The recv_sys.recovered_lsn and recv_sys.pages will correspond to the "out of memory LSN", at the end of the previous complete mini-transaction. If recovery runs out of memory while executing the final recovery batch, we can simply invoke recv_sys.apply(false) to make room, and resume parsing. If recovery runs out of memory before the final batch, we will scan the redo log to the end and check for any missing or inconsistent files. In this version of the patch, we will throw away any previously buffered recv_sys.pages and rescan the log from the checkpoint onwards. recv_sys_t::pages_it: A cached iterator to recv_sys.pages. recv_sys_t::is_memory_exhausted(): Remove. We will have out-of-memory handling deep inside recv_sys_t::parse().
recv_sys_t::rewind(), page_recv_t::recs_t::rewind(): Remove all log starting with a specific LSN. IORequest::write_complete(), IORequest::read_complete(): Replaces fil_aio_callback(). read_io_callback(), write_io_callback(): Replaces io_callback(). IORequest::fake_read_complete(), fake_io_callback(), os_fake_read(): Process a "fake read" request for concurrent recovery. recv_sys_t::apply_batch(): Choose a number of successive pages for a recovery batch. recv_sys_t::erase(recv_sys_t::map::iterator): Remove log records for a page whose recovery is not in progress. Log application threads will not invoke this; they will only set being_recovered=-1 to indicate that the entry is no longer needed. recv_sys_t::garbage_collect(): Remove all being_recovered=-1 entries. recv_sys_t::wait_for_pool(): Wait for some space to become available in the buffer pool. mlog_init_t::mark_ibuf_exist(): Avoid calls to recv_sys::recover_low() via ibuf_page_exists() and buf_page_get_low(). Such calls would lead to double locking of recv_sys.mutex, which depending on implementation could cause a deadlock. We will use lower-level calls to look up index pages. buf_LRU_block_remove_hashed(): Disable consistency checks for freed ROW_FORMAT=COMPRESSED pages. Their contents could be uninitialized garbage. This fixes an occasional failure of the test innodb.innodb_bulk_create_index_debug. Tested by: Matthias Leich	2023-05-19 15:20:07 +03:00
Marko Mäkelä	347e22fbf8	Merge bb-10.6-release into 10.6	2023-05-19 14:23:53 +03:00
Marko Mäkelä	06d555a41a	Merge bb-10.5-release into 10.5	2023-05-19 14:23:04 +03:00
Marko Mäkelä	37492960f3	Merge 10.5 into 10.6	2023-05-19 12:24:58 +03:00
Marko Mäkelä	e0084b9d31	MDEV-31234 InnoDB does not free UNDO after the fix of MDEV-30671 trx_purge_truncate_history(): Only call trx_purge_truncate_rseg_history() if the rollback segment is safe to process. This will avoid leaking undo log pages that are not yet ready to be processed. This fixes a regression that was introduced in commit `0de3be8cfd` (MDEV-30671). trx_sys_t::any_active_transactions(): Separately count XA PREPARE transactions. srv_purge_should_exit(): Terminate slow shutdown if the history size does not change and XA PREPARE transactions exist in the system. This will avoid a hang of the test innodb.recovery_shutdown. Tested by: Matthias Leich	2023-05-19 12:19:26 +03:00
Marko Mäkelä	a3e5b5c4db	Merge 10.5 into 10.6	2023-05-15 09:02:32 +03:00
Marko Mäkelä	477285c8ea	MDEV-31253 Freed data pages are not always being scrubbed fil_space_t::flush_freed(): Renamed from buf_flush_freed_pages(); this is a backport of `aa45850687` from 10.6. Invoke log_write_up_to() on last_freed_lsn, instead of avoiding the operation when the log has not yet been written. A more costly alternative would be that log_checkpoint() would invoke this function on every affected tablespace.	2023-05-12 14:57:14 +03:00
Marko Mäkelä	4a668c1892	MDEV-29401 InnoDB history list length increased in 10.6 compared to 10.5 The InnoDB buffer pool and locking were heavily refactored in MariaDB Server 10.6. Among other things, dict_sys.mutex was removed, and the contended lock_sys.mutex was replaced with a combination of lock_sys.latch and distributed latches in hash tables. Also, a default value was changed to innodb_flush_method=O_DIRECT to improve performance in write-heavy workloads. One thing where an adjustment was missing is around the parameters innodb_max_purge_lag (number of committed transactions waiting to be purged), and innodb_max_purge_lag_delay (maximum number of microseconds to delay a DML operation). purge_coordinator_state::do_purge(): Pass the history_size to trx_purge() and reset srv_dml_needed_delay if the history is empty. Keep executing the loop non-stop as long as srv_dml_needed_delay is set. trx_purge_dml_delay(): Made part of trx_purge(). Set srv_dml_needed_delay=0 when nothing can be purged (!n_pages_handled). row_mysql_delay_if_needed(): Mimic the logic of innodb_max_purge_lag_wait_update(). Reviewed by: Thirunarayanan Balathandayuthapani	2023-04-27 17:11:32 +03:00
Marko Mäkelä	5740638c4c	MDEV-31132 Deadlock between DDL and purge of InnoDB history log_free_check(): Assert that the caller must not hold exclusive lock_sys.latch. This was the case for calls from ibuf_delete_for_discarded_space(). This caused a deadlock with another thread that would be holding a latch on a dirty page that would need to be written so that the checkpoint would advance and log_free_check() could return. That other thread was waiting for a shared lock_sys.latch. fil_delete_tablespace(): Do not invoke ibuf_delete_for_discarded_space() because in DDL operations, we will be holding exclusive lock_sys.latch. trx_t::commit(std::vector<pfs_os_file_t>&), innodb_drop_database(), row_purge_remove_clust_if_poss_low(), row_undo_ins_remove_clust_rec(), row_discard_tablespace_for_mysql(): Invoke ibuf_delete_for_discarded_space() on the deleted tablespaces after releasing all latches.	2023-04-26 12:08:59 +03:00
Marko Mäkelä	818d5e4814	Merge 10.5 into 10.6	2023-04-25 13:10:33 +03:00
Marko Mäkelä	50f3b7d164	MDEV-31124 Innodb_data_written miscounts doublewrites When commit `a5a2ef079c` implemented asynchronous doublewrite, the writes via the doublewrite buffer started to be counted incorrectly, without multiplying them by innodb_page_size. srv_export_innodb_status(): Correctly count the Innodb_data_written. buf_dblwr_t: Remove submitted(), because it is close to written() and only Innodb_data_written was interested in it. According to its name, it should count completed and not submitted writes. Tested by: Axel Schwenke	2023-04-25 12:17:06 +03:00
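The miscount described in MDEV-31124 comes down to adding a page count to a byte count. A minimal sketch of the corrected arithmetic, assuming a hypothetical byte counter for ordinary writes and a hypothetical page counter for doublewrite writes (the function and parameter names are illustrative, not the names used in srv_export_innodb_status()):

```cpp
#include <cstdint>

// Hypothetical helper: combine a byte counter with a page counter.
// The page counter must be scaled by the page size before it is added.
constexpr std::uint64_t data_written_bytes(std::uint64_t other_bytes,
                                           std::uint64_t dblwr_pages,
                                           std::uint64_t page_size)
{
  return other_bytes + dblwr_pages * page_size;
}

// The miscount: the page counter is added as if it were already in bytes.
constexpr std::uint64_t data_written_miscounted(std::uint64_t other_bytes,
                                                std::uint64_t dblwr_pages)
{
  return other_bytes + dblwr_pages;
}
```

With a 16384-byte page, three doublewrite pages should contribute 49152 bytes rather than 3.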
Oleksandr Byelkin	1d74927c58	Merge branch '10.4' into 10.5	2023-04-24 12:43:47 +02:00
Marko Mäkelä	0976afec88	MDEV-31114 Assertion !...is_waiting() failed in os_aio_wait_until_no_pending_writes() os_aio_wait_until_no_pending_reads(), os_aio_wait_until_no_pending_writes(): Add a Boolean parameter to indicate whether the wait should be declared in the thread pool. buf_flush_wait(): The callers have already declared a wait, so let us avoid doing that again, just call os_aio_wait_until_no_pending_writes(false). buf_flush_wait_flushed(): Do not declare a wait in the rare case that the buf_flush_page_cleaner thread has been shut down already. buf_flush_page_cleaner(), buf_flush_buffer_pool(): In the code that runs during shutdown, do not declare waits. buf_flush_buffer_pool(): Remove a debug assertion that might fail. What really matters here is buf_pool.flush_list.count==0. buf_read_recv_pages(), srv_prepare_to_delete_redo_log_file(): Do not declare waits during InnoDB startup.	2023-04-24 09:57:58 +03:00
Thirunarayanan Balathandayuthapani	2c567b2fa3	MDEV-30996 insert.. select in presence of full text index freezes all other commits at commit time - This patch does the following: git revert --no-commit `673243c893` git revert --no-commit `6c669b9586` git revert --no-commit `bacaf2d4f4` git checkout HEAD mysql-test git revert --no-commit `1fd7d3a9ad` The above commands revert MDEV-29277, MDEV-25581, MDEV-29342. When binlog is enabled, a transaction takes a lot of time to do the sync operation on an InnoDB FTS table. This blocks the commit of other transactions. To avoid this, remove the fulltext sync operation during transaction commit. So reverted MDEV-25581 related patches. We filed MDEV-31105 to avoid the memory consumption problem during fulltext sync operation.	2023-04-24 11:06:56 +05:30
Alexander Barkov	9f98a2acd7	MDEV-30968 mariadb-backup does not copy Aria logs if aria_log_dir_path is used - `mariadb-backup --backup` was fixed to fetch the value of the @@aria_log_dir_path server variable and copy aria_log* files from @@aria_log_dir_path directory to the backup directory. Absolute and relative (to --datadir) paths are supported. Before this change aria_log* files were copied to the backup only if they were in the default location in @@datadir. - `mariadb-backup --copy-back` now understands a new my.cnf and command line parameter --aria-log-dir-path. `mariadb-backup --copy-back` in the main loop in copy_back() (when copying back from the backup directory to --datadir) was fixed to ignore all aria_log* files. A new function copy_back_aria_logs() was added. It consists of a separate loop copying back aria_log* files from the backup directory to the directory specified in --aria-log-dir-path. Absolute and relative (to --datadir) paths are supported. If --aria-log-dir-path is not specified, aria_log* files are copied to --datadir by default. - The function is_absolute_path() was fixed to understand MTR style paths on Windows with forward slashes, e.g. --aria-log-dir-path=D:/Buildbot/amd64-windows/build/mysql-test/var/...	2023-04-21 19:08:35 +04:00
Marko Mäkelä	51e62cb3b3	MDEV-26782 InnoDB temporary tablespace: reclaiming of free space does not work The motivation of this change is to allow undo pages for temporary tables to be marked free as often as possible, so that we can avoid buf_pool.LRU eviction (and writes) of undo pages that contain data that is no longer needed. For temporary tables, no MVCC or purge of history is needed, and reusing cached undo log pages might not help that much. It is possible that this may cause some performance regression due to more frequent allocation and freeing of undo log pages, but I only measured a performance improvement. trx_write_serialisation_history(): Never cache temporary undo log pages. trx_undo_reuse_cached(): Assert that the rollback segment is persistent. trx_undo_assign_low(): Add template<bool is_temp>. Never invoke trx_undo_reuse_cached() for temporary tables. Tested by: Matthias Leich	2023-04-21 17:58:26 +03:00
Marko Mäkelä	86767bcc0f	MDEV-29593 Purge misses a chance to free not-yet-reused undo pages trx_purge_truncate_rseg_history(): If all other conditions for invoking trx_purge_remove_log_hdr() hold, but the state is TRX_UNDO_CACHED instead of TRX_UNDO_TO_PURGE, detach and free it. Tested by: Matthias Leich	2023-04-21 17:58:09 +03:00
Marko Mäkelä	485a1b1f11	MDEV-30863 Server freeze, all threads in trx_assign_rseg_low() trx_assign_rseg_low(): Simplify the debug check. trx_rseg_t::reinit(): Reset the skip_allocation() flag. This logic was broken in the merge commit `3e2ad0e918` of commit `0de3be8cfd` (that is, innodb_undo_log_truncate=ON would never be "completed"). Tested by: Matthias Leich	2023-04-18 14:54:40 +03:00
Vlad Lesin	71f16c836f	MDEV-31049 fil_delete_tablespace() returns wrong file handle if tablespace was closed by parallel thread fil_delete_tablespace() stores the file handle in a local variable and calls mtr_t::commit_file()=>fil_system_t::detach(..., detach_handle=true), which sets space->chain.start->handle = OS_FILE_CLOSED. fil_system_t::detach() is invoked under fil_system.mutex. But before the mutex is acquired, some parallel thread can change space->chain.start->handle. fil_delete_tablespace() returns the value stored in the local variable, i.e. a wrong value. The file handle can be closed, for example, from buf_flush_space() when the limit of innodb_open_files is exceeded and fil_space_t::get() causes a fil_space_t::try_to_close() call. fil_space_t::try_to_close() is executed under fil_system.mutex. And mtr_t::commit_file() locks it for the fil_system_t::detach() call. fil_system_t::detach() returns the detached file handle if its argument detach_handle is true. The fix is to let mtr_t::commit_file() pass that detached file handle to fil_delete_tablespace().	2023-04-14 10:42:12 +03:00
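The race in MDEV-31049 follows a common pattern: a value copied before acquiring a mutex can be stale by the time the protected operation runs, so the value should be taken from the operation that ran under the mutex. A minimal single-threaded sketch of that pattern, assuming a hypothetical `file_node` type with an integer handle and `FILE_CLOSED` standing in for OS_FILE_CLOSED (none of these names are from the InnoDB source):

```cpp
#include <mutex>

// Hypothetical stand-in for OS_FILE_CLOSED.
constexpr int FILE_CLOSED= -1;

// Hypothetical node holding a file handle protected by a mutex.
struct file_node
{
  std::mutex mutex;
  int handle= FILE_CLOSED;

  // Models a parallel thread closing the file under the mutex.
  void try_to_close()
  {
    std::lock_guard<std::mutex> guard(mutex);
    handle= FILE_CLOSED;
  }

  // Detach the handle under the mutex and return whatever value
  // was current at that moment. Callers should use this return value
  // instead of a copy that was read before the mutex was acquired.
  int detach()
  {
    std::lock_guard<std::mutex> guard(mutex);
    const int detached= handle;
    handle= FILE_CLOSED;
    return detached;
  }
};
```

If a caller reads `handle` first and another thread then runs `try_to_close()`, the early copy still names the old handle, while `detach()` correctly reports `FILE_CLOSED`.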
Vlad Lesin	0cca8166f3	MDEV-30775 Performance regression in fil_space_t::try_to_close() introduced in MDEV-23855 Post-push fix. 10.5 MDEV-30775 fix inserts just opened tablespace just after the element which fil_system.space_list_last_opened points to. In MDEV-25223 fil_system_t::space_list was changed from UT_LIST to ilist. ilist<...>::insert(iterator pos, reference value) inserts element to list before pos. But it was not taken into account during 10.5->10.6 merge in `85cbfaefee`, and the fix does not work properly, i.e. it inserted just opened tablespace to the position preceding fil_system.space_list_last_opened.	2023-04-14 10:41:59 +03:00
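The insert-before semantics that the post-push fix for MDEV-30775 refers to can be illustrated with std::list, whose insert(pos, value) also places the new element before pos. This is a sketch only: std::list stands in for ilist, and `space_list`, `insert_before()` and `insert_after()` are hypothetical names, not the InnoDB code.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical stand-in for fil_system.space_list.
using space_list= std::list<int>;

// std::list::insert(pos, value) places value immediately BEFORE pos.
inline space_list::iterator insert_before(space_list &list,
                                          space_list::iterator pos, int value)
{
  return list.insert(pos, value);
}

// To place value AFTER pos, advance the iterator by one first.
// pos must refer to an element of the list, not to end().
inline space_list::iterator insert_after(space_list &list,
                                         space_list::iterator pos, int value)
{
  assert(pos != list.end());
  return list.insert(std::next(pos), value);
}
```

Given a list {1, 2, 3} and an iterator to 2, `insert_before()` with 10 yields {1, 10, 2, 3}, and a following `insert_after()` with 20 on the same iterator yields {1, 10, 2, 20, 3}.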
Thirunarayanan Balathandayuthapani	2ddfb83807	MDEV-29273 Race condition between drop table and closing of table - This issue is caused by a race condition between the drop thread and fil_encrypt_thread. fil_encrypt_thread closes the tablespace if the number of opened files exceeds innodb_open_files. fil_node_open_file() closes tablespaces that are open and have no pending operations. At that time, InnoDB drop tries to write the redo log for the file delete operation. It throws the bad file descriptor error. - When trying to close the file, InnoDB should check whether the table is going to be dropped.	2023-04-12 19:07:59 +05:30
Marko Mäkelä	a091d6ac4e	MDEV-26827 fixup: Do not duplicate io_slots::pending_io_count() os_aio_pending_reads_approx(), os_aio_pending_reads(): Replaces buf_pool.n_pend_reads. os_aio_pending_writes(): Replaces buf_dblwr.pending_writes(). buf_dblwr_t::write_cond, buf_dblwr_t::writes_pending: Remove.	2023-04-12 13:49:57 +03:00
Marko Mäkelä	5bada1246d	Merge 10.5 into 10.6	2023-04-11 16:15:19 +03:00


4052 commits