mariadb

mirror of https://github.com/MariaDB/server.git synced 2026-05-16 03:47:17 +02:00

Author	SHA1	Message	Date
mariadb-DebarunBanerjee	8047c8bc71	MDEV-28800 SIGABRT due to running out of memory for InnoDB locks This regression was introduced in 10.6 by the following commit: commit `898dcf93a8` (Cleanup the lock creation) It removed one important optimization for lock bitmap pre-allocation. We pre-allocate about 8 bytes of extra space along with every lock object to adjust for similar locks on newly created records on the same page by the same transaction. When it is exhausted, a new lock object is created with a similar 8-byte pre-allocation. With this optimization removed we are left with only a 1-byte pre-allocation. When a large number of records are inserted and locked in a single page, we end up creating too many new locks almost in n^2 order. Fix-1: Bring back LOCK_PAGE_BITMAP_MARGIN for pre-allocation. Fix-2: Use the extra space (40 bytes) for bitmap in trx->lock.rec_pool.	2024-05-20 21:19:13 +05:30
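The effect of the pre-allocation margin described in MDEV-28800 can be illustrated with a toy model. This is only a sketch under stated assumptions, not the InnoDB code: the function `simulate_inserts()` and the struct `SimResult` are hypothetical, and the sizing rule (one bit per record present at creation time, rounded up to bytes, plus the margin) is a simplification. In this model every new lock object carries a bitmap sized for the whole page so far, so the total bitmap memory grows roughly quadratically with the number of inserted records, and a larger margin reduces both the number of lock objects and the memory.

```cpp
#include <cassert>
#include <cstddef>

// Result of the toy simulation (hypothetical type, for illustration only).
struct SimResult {
  std::size_t locks;         // number of lock objects created
  std::size_t bitmap_bytes;  // sum of all bitmap sizes, in bytes
};

// Insert and lock `inserts` records, one at a time, on a single page.
// A new lock object is created whenever the record number no longer fits
// in the bitmap of the most recently created lock object.
SimResult simulate_inserts(std::size_t inserts, std::size_t margin_bytes) {
  assert(margin_bytes > 0);
  SimResult r{0, 0};
  std::size_t capacity_bits = 0;  // bits available in the newest bitmap
  for (std::size_t heap_no = 0; heap_no < inserts; ++heap_no) {
    if (heap_no >= capacity_bits) {
      // Assumed sizing rule: cover all records so far, plus the margin.
      const std::size_t bytes = (heap_no + 1 + 7) / 8 + margin_bytes;
      capacity_bits = bytes * 8;
      ++r.locks;
      r.bitmap_bytes += bytes;
    }
  }
  return r;
}
```

Comparing `simulate_inserts(n, 1)` with `simulate_inserts(n, 8)` for a growing `n` shows how the smaller margin multiplies the number of lock objects in this model.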
Marko Mäkelä	4aa92911c7	MDEV-33802 Weird read view after ROLLBACK of another transaction Even after commit `b8a6719889` there is an anomaly where a locking read could return inconsistent results. If a locking read would have to wait for a record lock, then by the definition of a read view, the modifications made by the current lock holder cannot be visible in the read view. This is because the read view must exclude any transactions that had not been committed at the time when the read view was created. lock_rec_convert_impl_to_expl_for_trx(), lock_rec_convert_impl_to_expl(): Return an unsafe-to-dereference pointer to a transaction that holds or held the lock, or nullptr if the lock was available. lock_clust_rec_modify_check_and_lock(), lock_sec_rec_read_check_and_lock(), lock_clust_rec_read_check_and_lock(): Return DB_RECORD_CHANGED if innodb_strict_isolation=ON and the lock was being held by another transaction. The test case, which is based on a bug report by Zhuang Liu, covers the function lock_sec_rec_read_check_and_lock(). Reviewed by: Vladislav Lesin	2024-04-09 12:50:24 +03:00
Jan Lindström	b762541dd6	MDEV-33278 : Assertion failure in thd_get_thread_id at lock_wait_wsrep Problem is that not all conflicting transactions have THD object. Therefore, it must be checked that victim has THD before its identification is added to victim list as victim's thread identification is later requested using thd_get_thread_id function that requires that we have valid pointer to THD object in trx->mysql_thd. Victim might not have trx->mysql_thd in two cases: (1) An incomplete transaction that was recovered from undo logs on server startup (and not yet rolled back). (2) Transaction that is in XA PREPARE state and whose client connection was disconnected. Neither of these can complete before lock_wait_wsrep() releases lock_sys.latch. (1) trx_t::commit_in_memory() is clearing both trx_t::state and trx_t::is_recovered before it invokes lock_release(trx_t*) (which would be blocked by the exclusive lock_sys.latch that we are holding here). Hence, it is not possible to write a debug assertion to document this scenario. (2) If it is in XA PREPARE state, it would eventually be rolled back and the lock conflict would be resolved when an XA COMMIT or XA ROLLBACK statement is executed in some other connection. Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>	2024-03-26 02:06:51 +01:00
Marko Mäkelä	17e59ed3aa	MDEV-33454 release row locks for non-modified rows at XA PREPARE From the correctness point of view, it should be safe to release all locks on index records that were not modified by the transaction. Doing so should make the locks after XA PREPARE fully compatible with what would happen if the server were restarted: InnoDB table IX locks and exclusive record locks would be resurrected based on undo log records. Concurrently running transactions that are waiting for a lock may invoke lock_rec_convert_impl_to_expl() to create an explicit record lock object on behalf of the lock-owning transaction so that they can attach their waiting lock request on the explicit record lock object. Explicit locks would be released by trx_t::release_locks() during commit or rollback. Any clustered index record whose DB_TRX_ID belongs to a transaction that is in active or XA PREPARE state will be implicitly locked by that transaction. On XA PREPARE, we can release explicit exclusive locks on records whose DB_TRX_ID does not match the current transaction identifier. lock_rec_unlock_unmodified(): Release record locks that are not implicitly held by the current transaction. lock_release_on_prepare_try(), lock_release_on_prepare(): Invoke lock_rec_unlock_unmodified(). row_trx_id_offset(): Declare non-static. lock_rec_unlock(): Replaces lock_rec_unlock_supremum(). Reviewed by: Vladislav Lesin	2024-03-22 14:33:48 +02:00
Marko Mäkelä	b8a6719889	MDEV-26642/MDEV-26643/MDEV-32898 Implement innodb_snapshot_isolation https://jepsen.io/analyses/mysql-8.0.34 highlights that the transaction isolation levels in the InnoDB storage engine do not correspond to any widely accepted definitions, such as "Generalized Isolation Level Definitions" https://pmg.csail.mit.edu/papers/icde00.pdf (PL-1 = READ UNCOMMITTED, PL-2 = READ COMMITTED, PL-2.99 = REPEATABLE READ, PL-3 = SERIALIZABLE). Only READ UNCOMMITTED in InnoDB seems to match the above definition. The issue is that InnoDB does not detect write/write conflicts (Section 4.4.3, Definition 6) in the above. It appears that as soon as we implement write/write conflict detection (SET SESSION innodb_snapshot_isolation=ON), the default isolation level (SET TRANSACTION ISOLATION LEVEL REPEATABLE READ) will become Snapshot Isolation (similar to Postgres), as defined in Section 4.2 of "A Critique of ANSI SQL Isolation Levels", MSR-TR-95-51, June 1995 https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf Locking reads inside InnoDB used to read the latest committed version, ignoring what should actually be visible to the transaction. The added test innodb.lock_isolation illustrates this. The statement UPDATE t SET a=3 WHERE b=2; is executed in a transaction that was started before a read view or a snapshot of the current transaction was created, and committed before the current transaction attempts to execute UPDATE t SET b=3; If SET innodb_snapshot_isolation=ON is in effect when the second transaction was started, the second transaction will be aborted with the error ER_CHECKREAD. By default (innodb_snapshot_isolation=OFF), the second transaction would execute inconsistently, displaying an incorrect SELECT COUNT(*) FROM t in its read view. 
If innodb_snapshot_isolation=ON, if an attempt to acquire a lock on a record that does not exist in the current read view is made, an error DB_RECORD_CHANGED (HA_ERR_RECORD_CHANGED, ER_CHECKREAD) will be raised. This error will be treated in the same way as a deadlock: the transaction will be rolled back. lock_clust_rec_read_check_and_lock(): If the current transaction has a read view where the record is not visible and innodb_snapshot_isolation=ON, fail before trying to acquire the lock. row_sel_build_committed_vers_for_mysql(): If innodb_snapshot_isolation=ON, disable the "semi-consistent read" logic that had been implemented by myself on the directions of Heikki Tuuri in order to address https://bugs.mysql.com/bug.php?id=3300 that was motivated by a customer wanting UPDATE to skip locked rows that do not match the WHERE condition. It looks like my changes were included in the MySQL 5.1.5 commit `ad126d90e0`; at that time, employees of Innobase Oy (a recent acquisition of Oracle) had lost write access to the repository. The only reason why we set innodb_snapshot_isolation=OFF by default is backward compatibility with applications, such as the one that motivated the implementation of "semi-consistent read" back in 2005. In a later major release, we can default to innodb_snapshot_isolation=ON. Thanks to Peter Alvaro, Kyle Kingsbury and Alexey Gotsman for their work on https://github.com/jepsen-io/ and to Kyle and Alexey for explanations and some testing of this fix. Thanks to Vladislav Lesin for the initial test for MDEV-26643, as well as reviewing these changes.	2024-03-20 09:48:03 +02:00
Marko Mäkelä	c3a00dfa53	Merge 10.5 into 10.6	2024-03-12 09:19:57 +02:00
mariadb-DebarunBanerjee	afe9632913	MDEV-33593 Auto increment deadlock error causes ASSERT in subsequent save point The issue here is ha_innobase::get_auto_increment() could cause a deadlock involving auto-increment lock and rollback the transaction implicitly. For such cases, storage engines usually call thd_mark_transaction_to_rollback() to inform SQL engine about it which in turn takes appropriate actions and close the transaction. In innodb, we call it while converting Innodb error code to MySQL. However, since ::innobase_get_autoinc() returns void, we skip the call for error code conversion and also miss marking the transaction for rollback for deadlock error. We assert eventually while releasing a savepoint as the transaction state is not active. Since convert_error_code_to_mysql() is handling some generic error handling part, like invoking the callback when needed, we should call that function in ha_innobase::get_auto_increment() even if we don't return the resulting mysql error code back.	2024-03-07 21:54:06 +05:30
Marko Mäkelä	b2654ba826	MDEV-32899 InnoDB is holding shared dict_sys.latch while waiting for FOREIGN KEY child table lock on DDL lock_table_children(): A new function to lock all child tables of a table. We will only hold dict_sys.latch while traversing dict_table_t::referenced_set. To prevent a race condition with std::set::erase() we will copy the pointers to the child tables to a local vector. Once we have acquired MDL and references to all child tables, we can safely release dict_sys.latch, wait for the locks, and finally release the references. dict_acquire_mdl_shared(): A new variant that takes mdl_context as a parameter. lock_table_for_trx(): Assert that we are not holding dict_sys.latch. ha_innobase::truncate(): When foreign_key_checks=ON, assert that no child tables exist (other than the current table). In any case, we will invoke lock_table_children() so that the child table metadata can be safely updated. (It is possible that a child table is being created concurrently with TRUNCATE TABLE.) ha_innobase::delete_table(): Before and after acquiring exclusive locks on the current table as well as all child tables, check that FOREIGN KEY constraints will not be violated. In this way, we can reject impossible DROP TABLE without having to wait for locks first. This fixes up commit `2ca1123464` (MDEV-26217) and commit `c3c53926c4` (MDEV-26554).	2024-02-08 14:22:35 +11:00
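The pattern that MDEV-32899 describes for lock_table_children() (copy the child table pointers while holding the latch, then release the latch before any blocking wait) can be sketched as follows. This is a simplified illustration, not the InnoDB implementation: `Table`, `Dictionary`, `acquire_children()` and `lock_children()` are hypothetical names, `std::shared_mutex` stands in for dict_sys.latch, and setting a flag stands in for waiting for a table lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>
#include <shared_mutex>
#include <vector>

struct Table {
  std::atomic<int> ref_count{0};    // keeps the table alive while referenced
  std::atomic<bool> locked{false};  // stands in for a granted table lock
};

struct Dictionary {
  std::shared_mutex latch;    // plays the role of dict_sys.latch
  std::set<Table*> children;  // plays the role of referenced_set
};

// Copy the child pointers to a local vector while holding the shared latch,
// taking a reference on each so that it cannot disappear afterwards.
std::vector<Table*> acquire_children(Dictionary& dict) {
  std::vector<Table*> copy;
  std::shared_lock<std::shared_mutex> guard(dict.latch);
  copy.reserve(dict.children.size());
  for (Table* t : dict.children) {
    assert(t != nullptr);
    t->ref_count.fetch_add(1);
    copy.push_back(t);
  }
  return copy;  // the latch is released when guard goes out of scope
}

// Lock all children without holding the latch during the (possibly long) wait.
std::size_t lock_children(Dictionary& dict) {
  std::vector<Table*> children = acquire_children(dict);
  for (Table* t : children) t->locked = true;           // blocking step in real code
  for (Table* t : children) t->ref_count.fetch_sub(1);  // release the references
  return children.size();
}
```

Because the loop that waits for locks iterates over the local vector, a concurrent erase from the shared set cannot invalidate the iteration, which is the race the commit message mentions.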
Marko Mäkelä	5f2dcd112b	MDEV-24167 fixup: srw_lock_debug instrumentation While the index_lock and block_lock include debug instrumentation to keep track of shared lock holders, such instrumentation was never part of the simpler srw_lock, and therefore some users of the class implemented a limited form of bookkeeping. srw_lock_debug encapsulates srw_lock and adds the data members writer, readers_lock, and readers to keep track of the threads that hold the exclusive latch or any shared latches. The debug checks are available also with SUX_LOCK_GENERIC (in environments that do not implement a futex-like system call). dict_sys_t::latch: Use srw_lock_debug in debug builds. This makes the debug fields latch_ex, latch_readers redundant. fil_space_t::latch: Use srw_lock_debug in debug builds. This makes the debug field latch_count redundant. The field latch_owner must be preserved, because fil_space_t::is_owner() is being used in all builds. lock_sys_t::latch: Use srw_lock_debug in debug builds. This makes the debug fields writer, readers redundant. lock_sys_t::is_holder(): A new debug predicate to check if the current thread is holding lock_sys.latch in any mode. trx_rseg_t::latch: Use srw_lock_debug in debug builds.	2024-02-08 14:22:35 +11:00
Marko Mäkelä	21560bee9d	Revert "MDEV-32899 InnoDB is holding shared dict_sys.latch while waiting for FOREIGN KEY child table lock on DDL" This reverts commit `569da6a7ba`, commit `768a736174`, and commit `ba6bf7ad9e` because of a regression that was filed as MDEV-33104.	2024-01-19 12:46:11 +02:00
Marko Mäkelä	ba6bf7ad9e	MDEV-32899 instrumentation In debug builds, let us declare dict_sys.latch as index_lock instead of srw_lock, so that we will benefit from the full tracking of lock ownership. lock_table_for_trx(): Assert that the current thread is not holding dict_sys.latch. If the dict_sys.unfreeze() call were moved to the end of lock_table_children(), this assertion would fail in the test innodb.innodb and many other tests that use FOREIGN KEY.	2023-11-29 10:48:10 +02:00
Marko Mäkelä	569da6a7ba	MDEV-32899 InnoDB is holding shared dict_sys.latch while waiting for FOREIGN KEY child table lock on DDL lock_table_children(): A new function to lock all child tables of a table. We will only hold dict_sys.latch while traversing dict_table_t::referenced_set. To prevent a race condition with std::set::erase() we will copy the pointers to the child tables to a local vector. Once we have acquired references to all child tables, we can safely release dict_sys.latch, wait for the locks, and finally release the references. This fixes up commit `2ca1123464` (MDEV-26217) and commit `c3c53926c4` (MDEV-26554).	2023-11-28 15:50:41 +02:00
Marko Mäkelä	b78b77e77d	MDEV-32530 Race condition in lock_wait_rpl_report() After acquiring lock_sys.latch, always load trx->lock.wait_lock. It could have been changed by another thread that did lock_rec_move() and released lock_sys.latch right before lock_sys.wr_lock_try() succeeded. This regression was introduced in commit `e039720bf3` (MDEV-32096). Reviewed by: Vladislav Lesin	2023-10-24 14:33:14 +03:00
Vlad Lesin	18fa00a54c	MDEV-32272 lock_release_on_prepare_try() does not release lock if supremum bit is set along with other bits set in lock's bitmap The error is caused by MDEV-30165 fix with the following commit: `d13a57ae81` There is a logical error in lock_release_on_prepare_try(): if (supremum_bit) lock_rec_unlock_supremum(*cell, lock); else lock_rec_dequeue_from_page(lock, false); Because there can be other bits set in the lock's bitmap, and the lock type can be suitable for releasing criteria, but the above logic releases only supremum bit of the lock. The fix is to release the lock if it meets the release criteria, and otherwise to unlock supremum if supremum is locked. There is also a test for the case, which was reported by QA team. I placed it in separate files, because it requires a debug build. Reviewed by: Marko Mäkelä	2023-10-13 16:29:04 +03:00
Jan Lindström	076df87b4c	MDEV-30217 : Assertion `mode_ == m_local \|\| transaction_.is_streaming()' failed in int wsrep::client_state::bf_abort(wsrep::seqno) Problem was that brute force (BF) thread requested conflicting lock and was trying to kill victim transaction, but this victim was also brute force thread. However, this victim was not actually holding conflicting lock, instead both brute force transaction and victim transaction had insert intention locks. We should not kill brute force victim transaction if requesting lock does not need to wait. Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>	2023-09-25 16:38:55 +02:00
Vlad Lesin	d13a57ae81	Merge 10.5 into 10.6.	2023-09-22 15:21:15 +03:00
Vlad Lesin	95730372bd	MDEV-30165 X-lock on supremum for prepared transaction for RR trx_t::set_skip_lock_inheritance() must be invoked at the very beginning of lock_release_on_prepare(). Currently trx_t::set_skip_lock_inheritance() is invoked at the end of lock_release_on_prepare() when lock_sys and trx are released, and there can be a case when locks on prepare are released, but "not inherit gap locks" bit has not yet been set, and page split inherits lock to supremum. Also reset supremum bit and rebuild waiting queue when XA is prepared. Reviewed by: Marko Mäkelä	2023-09-21 20:07:53 +03:00
Marko Mäkelä	4a8291fc5f	MDEV-30531 Corrupt index(es) on busy table when using FOREIGN KEY lock_wait(): Never return the transient error code DB_LOCK_WAIT. In commit `78a04a4c22` (MDEV-29869) some assignments trx->error_state = DB_SUCCESS were removed, and it was possible that the field was left at its initial value DB_LOCK_WAIT. The test case for this is nondeterministic; without this fix, it would only occasionally fail. Reviewed by: Vladislav Lesin	2023-09-11 14:52:05 +03:00
Marko Mäkelä	e039720bf3	MDEV-32096 Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait lock_sys_t::cancel(trx_t*): Remove, and merge to its only caller innobase_kill_query(). innobase_kill_query(): Before reading trx->lock.wait_lock, do acquire lock_sys.wait_mutex, like we did before commit `e71e613353` (MDEV-24671). In this way, we should not miss a recently started lock wait by the killee transaction. lock_rec_lock(): Add a DEBUG_SYNC "lock_rec" for the test case. lock_wait(): Invoke trx_is_interrupted() before entering the wait, in case innobase_kill_query() was invoked some time earlier and some longer-running operation did not check for interrupts. As suggested by Vladislav Lesin, do not overwrite trx->error_state==DB_INTERRUPTED with DB_SUCCESS. This would avoid a call to trx_is_interrupted() when the test is modified to use the DEBUG_SYNC point lock_wait_start instead of lock_rec. Avoid some redundant loads of trx->lock.wait_lock; cache the value in the local variable wait_lock. Deadlock::check_and_resolve(): Take wait_lock as a parameter and return wait_lock (or -1 or nullptr). We only need to reload trx->lock.wait_lock if lock_sys.wait_mutex had been released and reacquired. trx_t::error_state: Correctly document the data member. trx_lock_t::was_chosen_as_deadlock_victim: Clarify that other threads may set the field (or flags in it) while holding lock_sys.wait_mutex. Thanks to Johannes Baumgarten for reporting the problem and testing the fix, as well as to Kristian Nielsen for suggesting the fix. Reviewed by: Vladislav Lesin Tested by: Matthias Leich	2023-09-11 14:51:02 +03:00
Kristian Nielsen	7c9837ce74	Merge 10.4 into 10.5 Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-15 18:02:18 +02:00
Kristian Nielsen	805e0668c9	MDEV-31482: Lock wait timeout with INSERT-SELECT, autoinc, and statement-based replication Remove the exception that InnoDB does not report auto-increment locks waits to the parallel replication. There was an assumption that these waits could not cause conflicts with in-order parallel replication and thus need not be reported. However, this assumption is wrong and it is possible to get conflicts that lead to hangs for the duration of --innodb-lock-wait-timeout. This can be seen with three transactions: 1. T1 is waiting for T3 on an autoinc lock 2. T2 is waiting for T1 to commit 3. T3 is waiting on a normal row lock held by T2 Here, T3 needs to be deadlock killed on the wait by T1. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-15 16:40:02 +02:00
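The three-transaction scenario in MDEV-31482 forms a cycle in a wait-for graph: T1 waits for T3, T2 waits for T1, T3 waits for T2. The sketch below is a generic cycle check over such a graph, not the detection code used by InnoDB or by parallel replication; `in_wait_cycle()` is a hypothetical helper, and it assumes that each transaction waits for at most one other transaction. It shows why the cycle is invisible when the auto-increment wait of T1 is not reported.

```cpp
#include <cassert>
#include <map>
#include <set>

// waits_for[a] == b means "transaction a is waiting for transaction b".
// Returns true if following the wait edges from `start` leads back to `start`.
bool in_wait_cycle(const std::map<int, int>& waits_for, int start) {
  assert(start > 0);  // transaction numbers are positive in this sketch
  std::set<int> seen;
  int cur = start;
  for (;;) {
    auto it = waits_for.find(cur);
    if (it == waits_for.end()) return false;  // the chain ends: no cycle
    cur = it->second;
    if (cur == start) return true;            // came back to the start
    if (!seen.insert(cur).second) return false;  // a cycle that excludes start
  }
}
```

With all three edges present, T3 is found to be part of a cycle and can be chosen as the victim. If the edge from T1 to T3 (the auto-increment lock wait) is left out, the chain from T3 ends at T1 and no cycle is detected, which corresponds to the hang for the duration of the lock wait timeout.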
Kristian Nielsen	18acbaf416	MDEV-31655: Parallel replication deadlock victim preference code errorneously removed Restore code to make InnoDB choose the second transaction as a deadlock victim if two transactions deadlock that need to commit in-order for parallel replication. This code was erroneously removed when VATS was implemented in InnoDB. Also add a test case for InnoDB choosing the right deadlock victim. Also fixes this bug, with testcase that reliably reproduces: MDEV-28776: rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master Reviewed-by: Marko Mäkelä <marko.makela@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-15 16:39:49 +02:00
Kristian Nielsen	900c4d6920	MDEV-31655: Parallel replication deadlock victim preference code errorneously removed Restore code to make InnoDB choose the second transaction as a deadlock victim if two transactions deadlock that need to commit in-order for parallel replication. This code was erroneously removed when VATS was implemented in InnoDB. Also add a test case for InnoDB choosing the right deadlock victim. Also fixes this bug, with testcase that reliably reproduces: MDEV-28776: rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master Note: This should be null-merged to 10.6, as a different fix is needed there due to InnoDB locking code changes. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-15 16:35:30 +02:00
Kristian Nielsen	920789e9d4	MDEV-31482: Lock wait timeout with INSERT-SELECT, autoinc, and statement-based replication Remove the exception that InnoDB does not report auto-increment locks waits to the parallel replication. There was an assumption that these waits could not cause conflicts with in-order parallel replication and thus need not be reported. However, this assumption is wrong and it is possible to get conflicts that lead to hangs for the duration of --innodb-lock-wait-timeout. This can be seen with three transactions: 1. T1 is waiting for T3 on an autoinc lock 2. T2 is waiting for T1 to commit 3. T3 is waiting on a normal row lock held by T2 Here, T3 needs to be deadlock killed on the wait by T1. Note: This should be null-merged to 10.6, as a different fix is needed there due to InnoDB lock code changes. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-15 16:34:09 +02:00
Oleksandr Byelkin	6bf8483cac	Merge branch '10.5' into 10.6	2023-08-01 15:08:52 +02:00
Oleksandr Byelkin	7564be1352	Merge branch '10.4' into 10.5	2023-07-26 16:02:57 +02:00
Vlad Lesin	090a84366a	MDEV-29311 Server Status Innodb_row_lock_time% is reported in seconds Before MDEV-24671, the wait time was derived from my_interval_timer() / 1000 (nanoseconds converted to microseconds, and not microseconds to milliseconds like I must have assumed). The lock_sys.wait_time and lock_sys.wait_time_max are already in milliseconds; we should not divide them by 1000. In MDEV-24738 the millisecond counts lock_sys.wait_time and lock_sys.wait_time_max were changed to a 32-bit type. That would overflow in 49.7 days. Keep using a 64-bit type for those millisecond counters. Reviewed by: Marko Mäkelä	2023-07-10 12:42:46 +03:00
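The 49.7-day figure in MDEV-29311 follows from dividing the range of a 32-bit counter, 2^32 milliseconds, by the 86,400,000 milliseconds in a day. The helper `ms_counter_range_days()` below is a hypothetical function written only to check that arithmetic.

```cpp
#include <cassert>
#include <cmath>

// Number of days until an unsigned millisecond counter of the given
// width wraps around: 2^bits milliseconds expressed in days.
double ms_counter_range_days(int bits) {
  assert(bits > 0);
  const double ms_per_day = 24.0 * 60.0 * 60.0 * 1000.0;  // 86,400,000
  return std::ldexp(1.0, bits) / ms_per_day;
}
```

For 32 bits the result is about 49.71 days, whereas a 64-bit counter lasts on the order of 10^11 days, which is why the commit keeps the 64-bit type.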
Vlad Lesin	1bfd3cc457	MDEV-10962 Deadlock with 3 concurrent DELETEs by unique key PROBLEM: A deadlock was possible when a transaction tried to "upgrade" an already held Record Lock to Next Key Lock. SOLUTION: This patch is based on observations that: (1) a Next Key Lock is equivalent to Record Lock combined with Gap Lock (2) a GAP Lock never has to wait for any other lock In case we request a Next Key Lock, we check if we already own a Record Lock of equal or stronger mode, and if so, then we change the requested lock type to GAP Lock, which we either already have, or can be granted immediately, as GAP locks don't conflict with any other lock types. (We don't consider Insert Intention Locks a Gap Lock in above statements). The reason why we don't upgrade Record Lock to Next Key Lock is the following. Imagine a transaction which does something like this: for each row { request lock in LOCK_X\|LOCK_REC_NOT_GAP mode request lock in LOCK_S mode } If we upgraded lock from Record Lock to Next Key lock, there would be created only two lock_t structs for each page, one for LOCK_X\|LOCK_REC_NOT_GAP mode and one for LOCK_S mode, and then used their bitmaps to mark all records from the same page. The situation would look like this: request lock in LOCK_X\|LOCK_REC_NOT_GAP mode on row 1: // -> creates new lock_t for LOCK_X\|LOCK_REC_NOT_GAP mode and sets bit for // 1 request lock in LOCK_S mode on row 1: // -> notices that we already have LOCK_X\|LOCK_REC_NOT_GAP on the row 1, // so it upgrades it to X request lock in LOCK_X\|LOCK_REC_NOT_GAP mode on row 2: // -> creates a new lock_t for LOCK_X\|LOCK_REC_NOT_GAP mode (because we // don't have any after we've upgraded!) and sets bit for 2 request lock in LOCK_S mode on row 2: // -> notices that we already have LOCK_X\|LOCK_REC_NOT_GAP on the row 2, // so it upgrades it to X ...etc...etc.. Each iteration of the loop creates a new lock_t struct, and in the end we have a lot (one for each record!) 
of LOCK_X locks, each with single bit set in the bitmap. Soon we run out of space for lock_t structs. If we create LOCK_GAP instead of lock upgrading, the above scenario works like the following: // -> creates new lock_t for LOCK_X\|LOCK_REC_NOT_GAP mode and sets bit for // 1 request lock in LOCK_S mode on row 1: // -> notices that we already have LOCK_X\|LOCK_REC_NOT_GAP on the row 1, // so it creates LOCK_S\|LOCK_GAP only and sets bit for 1 request lock in LOCK_X\|LOCK_REC_NOT_GAP mode on row 2: // -> reuses the lock_t for LOCK_X\|LOCK_REC_NOT_GAP by setting bit for 2 request lock in LOCK_S mode on row 2: // -> notices that we already have LOCK_X\|LOCK_REC_NOT_GAP on the row 2, // so it reuses LOCK_S\|LOCK_GAP setting bit for 2 In the end we have just two locks per page, one for each mode: LOCK_X\|LOCK_REC_NOT_GAP and LOCK_S\|LOCK_GAP. Another benefit of this solution is that it avoids not-entirely const-correct, (and otherwise looking risky) "upgrading". The fix was ported from mysql/mysql-server@bfba840dfa mysql/mysql-server@75cefdb1f7 Reviewed by: Marko Mäkelä	2023-07-06 15:06:10 +03:00
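The two scenarios walked through in MDEV-10962 can be replayed with a small model that counts lock objects per page. This is an illustrative sketch, not the InnoDB lock manager: `Mode`, `LockT`, `set_bit()` and `count_locks()` are hypothetical names, and a `std::set<int>` stands in for the lock bitmap. Each row is locked first in LOCK_X|LOCK_REC_NOT_GAP mode and then in LOCK_S mode, either by upgrading the existing record lock or by adding a gap-only lock.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

enum class Mode { X_REC_NOT_GAP, X_NEXT_KEY, S_GAP };

struct LockT {
  Mode mode;
  std::set<int> rows;  // stands in for the per-page record bitmap
};

// Reuse an existing lock object of the same mode, or create a new one.
void set_bit(std::vector<LockT>& locks, Mode mode, int row) {
  for (LockT& l : locks) {
    if (l.mode == mode) { l.rows.insert(row); return; }
  }
  locks.push_back(LockT{mode, {row}});
}

// Returns the number of lock objects after locking rows 1..n_rows.
std::size_t count_locks(int n_rows, bool upgrade) {
  assert(n_rows >= 0);
  std::vector<LockT> locks;
  for (int row = 1; row <= n_rows; ++row) {
    set_bit(locks, Mode::X_REC_NOT_GAP, row);
    if (upgrade) {
      // Old behaviour: turn the record lock covering this row into a
      // next-key lock, so no X_REC_NOT_GAP object remains for reuse.
      for (LockT& l : locks) {
        if (l.mode == Mode::X_REC_NOT_GAP && l.rows.count(row)) {
          l.mode = Mode::X_NEXT_KEY;
          break;
        }
      }
    } else {
      // Fixed behaviour: request only the gap part, in a shared object.
      set_bit(locks, Mode::S_GAP, row);
    }
  }
  return locks.size();
}
```

In this model, upgrading yields one lock object per row, while the gap-lock approach ends with two objects per page regardless of the number of rows, matching the outcome the commit message describes.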
Teemu Ollakka	f307160218	MDEV-29293 MariaDB stuck on starting commit state This commit contains a merge from 10.5-MDEV-29293-squash into 10.6. Although the bug MDEV-29293 was not reproducible with 10.6, the fix contains several improvements for wsrep KILL query and BF abort handling, and addresses the following issues: * MDEV-30307 KILL command issued inside a transaction is problematic for galera replication: This commit will remove KILL TOI replication, so Galera side transaction context is not lost during KILL. * MDEV-21075 KILL QUERY maintains nodes data consistency but breaks GTID sequence: This is fixed as well as KILL does not use TOI, and thus does not change GTID state. * MDEV-30372 Assertion in wsrep-lib state: This was caused by BF abort or KILL when local transaction was in the middle of group commit. This commit disables THD::killed handling during commit, so the problem is avoided. * MDEV-30963 Assertion failure !lock.was_chosen_as_deadlock_victim in trx0trx.h:1065: The assertion happened when the victim was BF aborted via MDL while it was committing. This commit changes MDL BF aborts so that transactions which are committing cannot be BF aborted via MDL. The RQG grammar attached in the issue could not reproduce the crash anymore. Original commit message from 10.5 fix: MDEV-29293 MariaDB stuck on starting commit state The problem seems to be a deadlock between KILL command execution and BF abort issued by an applier, where: * KILL has locked victim's LOCK_thd_kill and LOCK_thd_data. * Applier has innodb side global lock mutex and victim trx mutex. * KILL is calling innobase_kill_query, and is blocked by innodb global lock mutex. * Applier is in wsrep_innobase_kill_one_trx and is blocked by victim's LOCK_thd_kill. The fix in this commit removes the TOI replication of KILL command and makes KILL execution less intrusive operation. Aborting the victim happens now by using awake_no_mutex() and ha_abort_transaction(). 
If the KILL happens when the transaction is committing, the KILL operation is postponed to happen after the statement has completed in order to avoid KILL to interrupt commit processing. Notable changes in this commit: * wsrep client connection's error state may remain sticky after client connection is closed. This error message will then pop up for the next client session issuing first SQL statement. This problem arose with test galera.galera_bf_kill. The fix is to reset wsrep client error state, before a THD is reused for next connection. * Release THD locks in wsrep_abort_transaction when locking innodb mutexes. This guarantees same locking order as with applier BF aborting. * BF abort from MDL was changed to do BF abort on server/wsrep-lib side first, and only then do the BF abort on InnoDB side. This removes the need to call back from InnoDB for BF aborts which originate from MDL and simplifies the locking. * Removed wsrep_thd_set_wsrep_aborter() from service_wsrep.h. The manipulation of the wsrep_aborter can be done solely on server side. Moreover, it is now debug only variable and could be excluded from optimized builds. * Remove LOCK_thd_kill from wsrep_thd_LOCK/UNLOCK to allow more fine grained locking for SR BF abort which may require locking of victim LOCK_thd_kill. Added explicit call for wsrep_thd_kill_LOCK/UNLOCK where appropriate. * Wsrep-lib was updated to version which allows external locking for BF abort calls. Changes to MTR tests: * Disable galera_bf_abort_group_commit. This test is going to be removed (MDEV-30855). * Make galera_var_retry_autocommit result more readable by echoing cases and expectations into result. Only one expected result for reap to verify that server returns expected status for query. * Record galera_gcache_recover_manytrx as result file was incomplete. Trivial change. * Make galera_create_table_as_select more deterministic: Wait until CTAS execution has reached MDL wait for multi-master conflict case. 
Expected error from multi-master conflict is ER_QUERY_INTERRUPTED. This is because CTAS does not yet have open wsrep transaction when it is waiting for MDL, query gets interrupted instead of BF aborted. This should be addressed in separate task. * A new test galera_bf_abort_registering to check that registering trx gets BF aborted through MDL. * A new test galera_kill_group_commit to verify correct behavior when KILL is executed while the transaction is committing. Co-authored-by: Seppo Jaakola <seppo.jaakola@iki.fi> Co-authored-by: Jan Lindström <jan.lindstrom@galeracluster.com> Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>	2023-05-22 00:42:05 +02:00
Marko Mäkelä	4105017a58	MDEV-30357 Performance regression in locking reads from secondary indexes lock_sec_rec_some_has_impl(): Remove a harmful condition that caused the performance regression and should not have been added in commit `b6e41e3872` in the first place. Locking transactions that have not modified any persistent tables can carry the transaction identifier 0. trx_t::max_inactive_id: A cache for trx_sys_t::find_same_or_older(). The value is not reset on transaction commit so that previous results can be reused for subsequent transactions. The smallest active transaction ID can only increase over time, not decrease. trx_sys_t::find_same_or_older(): Remember the maximum previous id for which rw_trx_hash.iterate() returned false, to avoid redundant iterations. lock_sec_rec_read_check_and_lock(): Add an early return in case we are already holding a covering table lock. lock_rec_convert_impl_to_expl(): Add a template parameter to avoid a redundant run-time check on whether the index is secondary. lock_rec_convert_impl_to_expl_for_trx(): Move some code from lock_rec_convert_impl_to_expl(), to reduce code duplication due to the added template parameter. Reviewed by: Vladislav Lesin Tested by: Matthias Leich	2023-03-16 16:00:45 +02:00
Vlad Lesin	a474e3278c	MDEV-27701 Race on trx->lock.wait_lock between lock_rec_move() and lock_sys_t::cancel() The initial issue was in assertion failure, which checked the equality of lock to cancel with trx->lock.wait_lock in lock_sys_t::cancel(). If we analyze lock_sys_t::cancel() code from the perspective of trx->lock.wait_lock racing, we won't find the error there, except the cases when we need to reload it after the corresponding latches acquiring. So the fix is just to remove the assertion and reload trx->lock.wait_lock after acquiring necessary latches. Reviewed by: Marko Mäkelä <marko.makela@mariadb.com>	2023-02-20 20:31:24 +03:00
Marko Mäkelä	a8c5635cf1	Merge 10.5 into 10.6	2023-01-17 20:02:29 +02:00
Jan Lindström	179c283372	Merge branch 10.4 into 10.5	2023-01-14 08:25:57 +02:00
sjaakola	a44d896f98	10.4-MDEV-29684 Fixes for cluster wide write conflict resolving If two high priority threads have a lock conflict, we look at the order of these transactions and honor the earlier transaction. The for_locking parameter in lock_rec_has_to_wait() has become obsolete and is now removed from the code. Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>	2023-01-14 07:50:04 +02:00
Marko Mäkelä	3386b30975	Merge 10.5 into 10.6	2023-01-13 10:45:41 +02:00
Marko Mäkelä	73ecab3d26	Merge 10.4 into 10.5	2023-01-13 10:18:30 +02:00
Marko Mäkelä	71e8e4934d	Merge 10.3 into 10.4	2023-01-13 09:28:25 +02:00
Marko Mäkelä	b218dfead2	Remove an unused parameter lock_rec_has_to_wait(): Remove the unused parameter for_locking that had been originally added in commit `df4dd593f2`	2023-01-11 08:37:27 +02:00
Marko Mäkelä	e572c745dc	MDEV-29504/MDEV-29849 TRUNCATE breaks FOREIGN KEY locking ha_innobase::referenced_by_foreign_key(): Protect the check with dict_sys.freeze(), to prevent races with TRUNCATE TABLE. The test innodb.instant_alter_crash has been adjusted for this additional locking. dict_table_is_referenced_by_foreign_key(): Removed (merged to the only caller). create_table_info_t::create_table(): Ignore missing indexes for FOREIGN KEY constraints if foreign_key_checks=0. create_table_info_t::create_table_update_dict(): Rewritten as a static function. Do not return any error. ha_innobase::create(): When trx!=nullptr and we are operating on a persistent table, do not rollback, commit, or release the data dictionary latch. ha_innobase::truncate(): Protect the entire critical section with an exclusive dict_sys.latch, so that ha_innobase::referenced_by_foreign_key() on referenced tables will return a consistent result. In case of a failure, invoke dict_load_foreigns() to restore also any FOREIGN KEY constraints. ha_innobase::free_foreign_key_create_info(): Define inline. lock_release(): Disregard innodb_evict_tables_on_commit_debug=ON when dict_sys.locked() holds. It would hold when fts_load_stopword() is invoked by create_table_info_t::create_table_update_dict(). dict_sys_t::locked(): Return whether the current thread is holding the exclusive dict_sys.latch. dict_sys_t::frozen_not_locked(): Return whether any thread is holding a shared dict_sys.latch. In the test main.mysql_upgrade, the InnoDB persistent statistics will no longer be recalculated in ha_innobase::open() as part of CHECK TABLE ... FOR UPGRADE. They were deleted earlier in the test. Tested by: Matthias Leich	2022-11-08 17:34:34 +02:00
Vlad Lesin	2f421688c6	MDEV-28709 unexpected X lock on Supremum in READ COMMITTED Post-push fix. The flag of transaction which indicates that it's necessary to forbid gap lock inheritance after XA PREPARE could be inverted if lock_release_on_prepare_try() is invoked several times. The fix is to toggle it on lock_release_on_prepare() exit.	2022-10-28 10:04:48 +03:00
Vlad Lesin	78a04a4c22	MDEV-29869 mtr failure: innodb.deadlock_wait_thr_race 1. The merge `aeccbbd926` has overwritten lock0lock.cc, and the changes of MDEV-29622 and MDEV-29635 were partially lost; this commit restores the changes. 2. innodb.deadlock_wait_thr_race test: The following hang was found during testing. There is a deadlock_report_before_lock_releasing sync point in Deadlock::report(), which is waiting for the sel_cont signal under the lock_sys_t lock. The signal must be issued after the "UPDATE t SET b = 100" rollback, and that rollback is executing an undo record, which is blocked on a dict_sys latch request. dict_sys is locked by the thread of the statistics update (dict_stats_save()), and during that update the lock_sys lock is requested, and can't be acquired as Deadlock::report() holds it. We have to disable the statistics update to make the test stable. But even if the statistics update is disabled, and a transaction with a consistent snapshot is started at the very beginning of the test to prevent purging, the purge can still be invoked for system tables, and it tries to open a system table by id, which causes a dict_sys.freeze() call and dict_sys latching. That, in combination with lock_sys::xx_lock(), causes the same deadlock as described above. We need to disable purging globally for the test as well. All of the above also applies to the innodb.deadlock_wait_lock_race test.	2022-10-26 12:15:40 +03:00
Marko Mäkelä	aeccbbd926	Merge 10.5 into 10.6 To prevent ASAN heap-use-after-poison in the MDEV-16549 part of ./mtr --repeat=6 main.derived the initialization of Name_resolution_context was cleaned up.	2022-10-25 14:25:42 +03:00
Vlad Lesin	8128a46827	MDEV-28709 unexpected X lock on Supremum in READ COMMITTED The lock is created during page splitting after moving records and locks (lock_move_rec_list_(start|end)()) to the new page, and inheriting the locks to the supremum of the left page from the successor of the infimum on the right page. There is no need for such inheritance for the READ COMMITTED isolation level and not-gap locks, so the fix is to add the corresponding condition in the gap lock inheritance function. One more fix is to forbid gap lock inheritance if XA was prepared. Use the most significant bit of trx_t::n_ref to indicate that gap lock inheritance is forbidden. This fix is based on mysql/mysql-server@b063e52a83	2022-10-25 00:52:10 +03:00
Vlad Lesin	9c04d66d11	MDEV-29622 Wrong assertions in lock_cancel_waiting_and_release() for deadlock resolving caller Suppose we have two transactions, trx 1 and trx 2. trx 2 does deadlock resolving from lock_wait(), it sets victim->lock.was_chosen_as_deadlock_victim=true for trx 1, but has not yet invoked lock_cancel_waiting_and_release(). trx 1 checks the flag in lock_trx_handle_wait(), and starts rollback from row_mysql_handle_errors(). It can change trx->lock.wait_thr and trx->state as it holds trx_t::mutex, but trx 2 has not yet requested it, as lock_cancel_waiting_and_release() has not yet been called. After that trx 1 tries to release locks in trx_t::rollback_low(), invoking trx_t::rollback_finish(). lock_release() is blocked on trying to acquire lock_sys.rd_lock(SRW_LOCK_CALL) in lock_release_try(), as lock_sys is blocked by trx 2, as deadlock resolution works under lock_sys.wr_lock(SRW_LOCK_CALL), see Deadlock::report() for details. trx 2 executes lock_cancel_waiting_and_release() for the deadlock victim, i.e. for trx 1. lock_cancel_waiting_and_release() contains some trx->lock.wait_thr and trx->state assertions, which will fail, because trx 1 has changed them during rollback execution. So, according to the above scenario, it's legal to have trx->lock.wait_thr==0 and trx->state!=TRX_STATE_ACTIVE in lock_cancel_waiting_and_release(), if it was invoked from Deadlock::report(), and the fix is just to change the assertion conditions. There is also lock_wait() cleanup around trx->error_state. If trx->error_state can be changed by a thread other than the owning thread, it must be protected with lock_sys.wait_mutex, as lock_wait() uses trx->lock.cond along with that mutex. Also if trx->error_state was changed before the lock_sys.wait_mutex acquisition, then it could be reset with the following code, which is wrong.
Also we need to check trx->error_state before entering the waiting loop, otherwise it can be the case that trx->error_state was set before the lock_sys.wait_mutex acquisition, but the thread will be waiting on trx->lock.cond.	2022-10-21 10:55:19 +03:00
Vlad Lesin	acebe35719	MDEV-29635 race on trx->lock.wait_lock in deadlock resolution Returning DB_SUCCESS unconditionally if !trx->lock.wait_lock in lock_trx_handle_wait() is wrong. Because even if trx->lock.was_chosen_as_deadlock_victim was not set before the first check in lock_trx_handle_wait(), it can be set after the check, and trx->lock.wait_lock can be reset by another thread from lock_reset_lock_and_trx_wait() if the transaction was chosen as deadlock victim. In this case lock_trx_handle_wait() will return DB_SUCCESS even the transaction was marked as deadlock victim, and continue execution instead of rolling back. The fix is to check trx->lock.was_chosen_as_deadlock_victim once more if trx->lock.wait_lock is reset, as trx->lock.wait_lock can be reset only after trx->lock.was_chosen_as_deadlock_victim was set if the transaction was chosen as deadlock victim.	2022-10-21 10:55:18 +03:00
Marko Mäkelä	ab0190101b	MDEV-24402: InnoDB CHECK TABLE ... EXTENDED Until now, the attribute EXTENDED of CHECK TABLE was ignored by InnoDB, and InnoDB only counted the records in each index according to the current read view. Unless the attribute QUICK was specified, the function btr_validate_index() would be invoked to validate the B-tree structure (the sibling and child links between index pages). The EXTENDED check will not only count all index records according to the current read view, but also ensure that any delete-marked records in the clustered index are waiting for the purge of history, and that all secondary index records point to a version of the clustered index record that is waiting for the purge of history. In other words, no index may contain orphan records. Normal MVCC reads and the non-EXTENDED version of CHECK TABLE would ignore these orphans. Unpurged records merely result in warnings (at most one per index), not errors, and no indexes will be flagged as corrupted due to such garbage. It will remain possible to SELECT data from such indexes or tables (which will skip such records) or to rebuild the table to reclaim some space. We introduce purge_sys.end_view that will be (almost) a copy of purge_sys.view at the end of a batch of purging committed transaction history. It is not an exact copy, because if the size of a purge batch is limited by innodb_purge_batch_size, some records that purge_sys.view would allow to be purged will be left over for subsequent batches. The purge_sys.view is relevant in the purge of committed transaction history, to determine if records are safe to remove. The new purge_sys.end_view is relevant in MVCC operations and in CHECK TABLE ... EXTENDED. It tells which undo log records are safe to access (have not been discarded at the end of a purge batch). purge_sys.clone_oldest_view<true>(): In trx_lists_init_at_db_start(), clone the oldest read view similar to purge_sys_t::clone_end_view() so that CHECK TABLE ... 
EXTENDED will not report bogus failures between InnoDB restart and the completed purge of committed transaction history. purge_sys_t::is_purgeable(): Replaces purge_sys_t::changes_visible() in the case that purge_sys.latch will not be held by the caller. Among other things, this guards access to BLOBs. It is not safe to dereference any BLOBs of a delete-marked purgeable record, because they may have already been freed. purge_sys_t::view_guard::view(): Return a reference to purge_sys.view that will be protected by purge_sys.latch, held by purge_sys_t::view_guard. purge_sys_t::end_view_guard::view(): Return a reference to purge_sys.end_view while it is protected by purge_sys.end_latch. Whenever a thread needs to retrieve an older version of a clustered index record, it will hold a page latch on the clustered index page and potentially also on a secondary index page that points to the clustered index page. If these pages contain purgeable records that would be accessed by a currently running purge batch, the progress of the purge batch would be blocked by the page latches. Hence, it is safe to make a copy of purge_sys.end_view while holding an index page latch, and consult the copy of the view to determine whether a record should already have been purged. btr_validate_index(): Remove a redundant check. row_check_index_match(): Check if a secondary index record and a version of a clustered index record match each other. row_check_index(): Replaces row_scan_index_for_mysql(). Count the records in each index directly, duplicating the relevant logic from row_search_mvcc(). Initialize check_table_extended_view for CHECK ... EXTENDED while holding an index leaf page latch. If we encounter an orphan record, the copy of purge_sys.end_view that we make is safe for visibility checks, and trx_undo_get_undo_rec() will check for the safety to access each undo log record. Should that check fail, we should return DB_MISSING_HISTORY to report a corrupted index. 
The EXTENDED check tries to match each secondary index record with every available clustered index record version, by duplicating the logic of row_vers_build_for_consistent_read() and invoking trx_undo_prev_version_build() directly. Before invoking row_check_index_match() on delete-marked clustered index record versions, we will consult purge_sys.is_purgeable() in order to avoid accessing freed BLOBs. We will always check that the DB_TRX_ID or PAGE_MAX_TRX_ID does not exceed the global maximum. Orphan secondary index records will be flagged only if everything up to PAGE_MAX_TRX_ID has been purged. We warn also about clustered index records whose nonzero DB_TRX_ID should have been reset in purge or rollback. trx_set_rw_mode(): Move an assertion from ReadView::set_creator_trx_id(). trx_undo_prev_version_build(): Remove two debug-only parameters, and return an error code instead of a Boolean. trx_undo_get_undo_rec(): Return a pointer to the undo log record, or nullptr if one cannot be retrieved. Instead of consulting the purge_sys.view, consult the purge_sys.end_view to determine which records can be accessed. trx_undo_get_rec_if_purgeable(): A variant of trx_undo_get_undo_rec() that will consult purge_sys.view instead of purge_sys.end_view. TRX_UNDO_CHECK_PURGEABILITY: A new parameter to trx_undo_prev_version_build(), passed by row_vers_old_has_index_entry() so that purge_sys.view instead of purge_sys.end_view will be consulted to determine whether a secondary index record may be safely purged. row_upd_changes_disowned_external(): Remove. This should be more expensive than briefly latching purge_sys in trx_undo_prev_version_build() (which may make use of transactional memory). row_sel_reset_old_vers_heap(): New function, split from row_sel_build_prev_vers_for_mysql(). row_sel_build_prev_vers_for_mysql(): Reorder some parameters to simplify the call to row_sel_reset_old_vers_heap(). row_search_for_mysql(): Replaced with direct calls to row_search_mvcc(). 
sel_node_get_nth_plan(): Define inline in row0sel.h open_step(): Define at the call site, in simplified form. sel_node_reset_cursor(): Merged with the only caller open_step(). --- ReadViewBase::check_trx_id_sanity(): Remove. Let us handle "future" DB_TRX_ID in a more meaningful way: row_sel_clust_sees(): Return DB_SUCCESS if the record is visible, DB_SUCCESS_LOCKED_REC if it is invisible, and DB_CORRUPTION if the DB_TRX_ID is in the future. row_undo_mod_must_purge(), row_undo_mod_clust(): Silently ignore corrupted DB_TRX_ID. We are in ROLLBACK, and we should have noticed that corruption when we were about to modify the record in the first place (leading us to refuse the operation). row_vers_build_for_consistent_read(): Return DB_CORRUPTION if DB_TRX_ID is in the future. Tested by: Matthias Leich Reviewed by: Vladislav Lesin	2022-10-21 10:02:54 +03:00
Vlad Lesin	5ab78cf340	MDEV-29515 innodb.deadlock_victim_race is unstable The test is unstable because 'UPDATE t SET b = 100' latches a page and waits for the 'upd_cont' signal in the lock_trx_handle_wait_enter sync point, then purge requests RW_X_LATCH on the same page, and then 'SELECT * FROM t WHERE a = 10 FOR UPDATE' requests RW_S_LATCH, waiting for the RW_X_LATCH requested by purge. 'UPDATE t SET b = 100' can't release the page latch as it waits for the upd_cont signal, which must be emitted after 'SELECT * FROM t WHERE a = 10 FOR UPDATE' has acquired RW_S_LATCH. So we have a deadlock, which is resolved by finishing the debug sync point wait by timeout, and the 'UPDATE t SET b = 100' releases its record locks rolling back the transaction, and 'SELECT * FROM t WHERE a = 10 FOR UPDATE' is finished successfully instead of finishing by lock wait timeout. The fix is to forbid purging during the test by opening a read view in a separate connection before the first insert into the table. Besides, the 'lock_wait_end' syncpoint is not needed, as it is enough to wait for the end of the SELECT execution to let the UPDATE continue.	2022-09-19 16:57:58 +03:00
Vlad Lesin	8ff1096999	MDEV-29081 trx_t::lock.was_chosen_as_deadlock_victim race in lock_wait_end() The issue is that trx_t::lock.was_chosen_as_deadlock_victim can be reset before the transaction checks it and sets trx_t::error_state. The fix is to reset trx_t::lock.was_chosen_as_deadlock_victim only in trx_t::commit_in_memory(), which is invoked on full rollback. There is also no need to have a separate bit in trx_t::lock.was_chosen_as_deadlock_victim to flag that the transaction was chosen as a victim of Galera conflict resolution; the same variable can be used for both cases except in debug builds. For debug builds we need to distinguish deadlock and Galera's abort victims for debug checks. Also there is no need to check for deadlock in lock_table_enqueue_waiting() for Galera as the corresponding check is present in lock_wait(). Local variable "error_state" in lock_wait() was replaced with trx->error_state, because before the replacement lock_sys_t::cancel<false>(trx, lock) and lock_sys.deadlock_check() could change trx->error_state, which then could be overwritten with the local "error_state" variable value. The lock_wait_suspend_thread_enter DEBUG_SYNC point name is misleading, because lock_wait_suspend_thread was eliminated in `e71e613`. It was renamed to lock_wait_start. Reviewed by: Marko Mäkelä, Jan Lindström.	2022-08-24 17:06:57 +03:00
Marko Mäkelä	63478e72de	MDEV-21098: Assertion failure in rec_get_offsets_func() The function rec_get_offsets_func() used to hit ut_error due to an invalid rec_get_status() value of a ROW_FORMAT!=REDUNDANT record. This fix is twofold: We will not only avoid a crash on corruption in this case, but we will also make more effort to validate each record every time we are iterating over index page records. rec_get_offsets_func(): Do not crash on a corrupted record. page_rec_get_nth(): Return nullptr on error. page_dir_slot_get_rec_validate(): Like page_dir_slot_get_rec(), but validate the pointer and return nullptr on error. page_cur_search_with_match(), page_cur_search_with_match_bytes(), page_dir_split_slot(), page_cur_move_to_next(): Indicate failure in a return value. page_cur_search(): Replaced with page_cur_search_with_match(). rec_get_next_ptr_const(), rec_get_next_ptr(): Replaced with page_rec_get_next_low(). TODO: rtr_page_split_initialize_nodes(), rtr_update_mbr_field(), and possibly other SPATIAL INDEX functions fail to properly handle errors. Reviewed by: Thirunarayanan Balathandayuthapani Tested by: Matthias Leich Performance tested by: Axel Schwenke	2022-08-01 11:25:50 +03:00
Daniel Black	2658410afc	MDEV-29187: Deadlock output in InnoDB status always shows transaction (0) At some point the incrementing of the transaction counter got dropped. Thanks Agustin for the bug report.	2022-07-28 16:22:43 +10:00
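
The MDEV-30357 entry above describes trx_t::max_inactive_id as a cache for trx_sys_t::find_same_or_older() that is never reset, relying on the fact that the smallest active transaction ID can only increase. The following is a minimal sketch of that idea, not the InnoDB implementation: ActiveTrxRegistry, CachedLookup, scan_same_or_older() and full_scans are hypothetical names invented here, and a std::set stands in for rw_trx_hash.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for the registry of active read-write transactions.
struct ActiveTrxRegistry {
  std::set<std::uint64_t> active_ids;  // ids of currently active transactions
  unsigned full_scans = 0;             // counts how often the expensive scan ran

  // Expensive check: is any active transaction id <= `id`?
  bool scan_same_or_older(std::uint64_t id) {
    ++full_scans;
    return !active_ids.empty() && *active_ids.begin() <= id;
  }
};

// Hypothetical per-caller cache, loosely modelled on trx_t::max_inactive_id.
struct CachedLookup {
  std::uint64_t max_inactive_id = 0;  // deliberately never reset between uses

  bool find_same_or_older(ActiveTrxRegistry &registry, std::uint64_t id) {
    assert(id != 0);  // assumption of this sketch: ids passed in start at 1
    if (id <= max_inactive_id)
      return false;  // an earlier scan already showed nothing this old is active
    const bool found = registry.scan_same_or_older(id);
    if (!found)
      max_inactive_id = id;  // largest id for which the scan came back empty
    return found;
  }
};
```

The cached answer stays valid only under the invariant stated in the commit message: because the smallest active ID never decreases, an id that once had no same-or-older active transaction cannot acquire one later.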

586 commits