mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-31 02:51:44 +01:00

Author	SHA1	Message	Date
Vlad Lesin	a474e3278c	MDEV-27701 Race on trx->lock.wait_lock between lock_rec_move() and lock_sys_t::cancel() The initial issue was an assertion failure in lock_sys_t::cancel(), where the assertion checked that the lock to cancel equals trx->lock.wait_lock. If we analyze the lock_sys_t::cancel() code from the perspective of races on trx->lock.wait_lock, we won't find an error there, except in the cases where trx->lock.wait_lock needs to be reloaded after acquiring the corresponding latches. So the fix is just to remove the assertion and reload trx->lock.wait_lock after acquiring the necessary latches. Reviewed by: Marko Mäkelä <marko.makela@mariadb.com>	2023-02-20 20:31:24 +03:00
Sergei Golubchik	3bc98a4ec4	Merge branch '10.5' into 10.6	2022-05-10 14:01:23 +02:00
Sergei Golubchik	ef781162ff	Merge branch '10.4' into 10.5	2022-05-09 22:04:06 +02:00
Marko Mäkelä	0806592ac8	MDEV-28422 Page split breaks a gap lock btr_insert_into_right_sibling(): Inherit any gap lock from the left sibling to the right sibling before inserting the record to the right sibling and updating the node pointer(s). lock_update_node_pointer(): Update locks in case a node pointer will move. Based on mysql/mysql-server@c7d93c274f	2022-04-27 13:38:08 +03:00
Marko Mäkelä	2ca1123464	MDEV-26217 Failing assertion: list.count > 0 in ut_list_remove or Assertion `lock->trx == this' failed in dberr_t trx_t::drop_table This follows up the previous fix in commit `c3c53926c4` (MDEV-26554). ha_innobase::delete_table(): Work around the insufficient metadata locking (MDL) during DML operations by acquiring exclusive InnoDB table locks on all child tables. Previously, this was only done on TRUNCATE and ALTER. ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke lock_update_delete() during change buffer operations. The revised trx_t::commit(std::vector<pfs_os_file_t>&) will hold exclusive lock_sys.latch while invoking fil_delete_tablespace(), which in turn may invoke ibuf_delete_rec(). dict_index_t::has_locking(): A new predicate, replacing the dummy !dict_table_is_locking_disabled(index->table). Used for skipping lock operations during ibuf_delete_rec(). trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks and remove the table from the cache while holding exclusive lock_sys.latch. trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds. trx_t::commit(): Reset dict_operation before invoking commit_in_memory() via commit_persist(). lock_release_on_drop(): Release locks while lock_sys.latch is exclusively locked. lock_table(): Add a parameter for a pointer to the table. We must not dereference the table before a lock_sys.latch has been acquired. If the pointer to the table does not match the table at that point, the table is invalid and DB_DEADLOCK will be returned. row_ins_foreign_check_on_constraint(): Improve the checks. Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed before commit `c5fd9aa562` (MDEV-25919). row_upd_check_references_constraints(), wsrep_row_upd_check_foreign_constraints(): Simplify checks.	2022-04-26 18:09:03 +03:00
Marko Mäkelä	d1edb011ee	Cleanup: Remove os0thread Let us use the common pthread_t wrapper for Microsoft Windows. This fixes up commit `dbe941e06f`	2022-04-19 13:49:52 +03:00
Marko Mäkelä	2aed566d22	Cleanup: alignas(CPU_LEVEL1_DCACHE_LINESIZE) Let us replace all use of MY_ALIGNED in InnoDB with C++11 alignas. CACHE_LINE_SIZE: Replaced with CPU_LEVEL1_DCACHE_LINESIZE.	2022-04-14 10:40:26 +03:00
Marko Mäkelä	8840583a92	MDEV-27909 InnoDB: Failing assertion: state == TRX_STATE_NOT_STARTED ... on DDL The fix in commit `6e390a62ba` (MDEV-26772) was a step in the right direction, but implemented incorrectly. When an InnoDB persistent statistics table cannot be locked immediately, we must not let row_mysql_handle_errors() roll back the transaction. lock_table_for_trx(): Add the parameter no_wait (default false) for an immediate return of DB_LOCK_WAIT in case of a conflict. ha_innobase::delete_table(), ha_innobase::rename_table(): Pass no_wait=true to lock_table_for_trx() when needed, instead of temporarily setting THDVAR(thd, lock_wait_timeout) to 0.	2022-03-18 10:52:08 +02:00
Marko Mäkelä	ed20e5b111	After-merge fixes	2022-03-08 09:04:03 +02:00
Vlad Lesin	202316a38f	Merge 10.5 into 10.6	2022-03-07 18:42:47 +03:00
Vlad Lesin	0b92c7b0e0	Merge 10.4 into 10.5	2022-03-07 17:16:11 +03:00
Vlad Lesin	86c1bf118a	MDEV-27992 DELETE fails to delete record after blocking is released MDEV-27025 allows records to be inserted before the record on which DELETE is blocked; as a result, the DELETE misses those records, which causes a serious ACID violation. Revert MDEV-27025 and MDEV-27550. A test that demonstrates the ACID violation scenario is added.	2022-03-07 16:42:05 +03:00
Vlad Lesin	5f001bd7b8	MDEV-27025 insert-intention lock conflicts with waiting ORDINARY lock The code was backported from 10.5 `be8113861c` commit. See that commit message for details.	2022-02-21 12:49:54 +03:00
Oleksandr Byelkin	f5c5f8e41e	Merge branch '10.5' into 10.6	2022-02-03 17:01:31 +01:00
Oleksandr Byelkin	cf63eecef4	Merge branch '10.4' into 10.5	2022-02-01 20:33:04 +01:00
Oleksandr Byelkin	41a163ac5c	Merge branch '10.2' into 10.3	2022-01-29 15:41:05 +01:00
Vlad Lesin	be8113861c	MDEV-27025 insert-intention lock conflicts with waiting ORDINARY lock The code was backported from 10.6 `bd03c0e516` commit. See that commit message for details. Apart from the above commit, trx_lock_t::wait_trx was also backported from MDEV-24738. trx_lock_t::wait_trx is protected with lock_sys.wait_mutex in 10.6, but that mutex was implemented only in MDEV-24789. As there is no need to backport MDEV-24789 for MDEV-27025, trx_lock_t::wait_trx is protected with the same mutexes as trx_lock_t::wait_lock. This fix should not break innodb-lock-schedule-algorithm=VATS. This algorithm uses an Eldest-Transaction-First (ETF) heuristic, which prefers older transactions over new ones. In this fix we just insert the granted lock just before the last granted lock of the same transaction, which does not change the execution order of transactions. The changes in lock_rec_create_low() should not break Galera Cluster, because there is a big "if" branch for WSREP. This branch is necessary to provide the correct transaction execution order, and should not be changed for the current bug fix.	2022-01-18 18:15:10 +03:00
Vlad Lesin	bd03c0e516	MDEV-27025 insert-intention lock conflicts with waiting ORDINARY lock When a lock is checked for conflicts, ignore other locks on the record if they wait for the requesting transaction. lock_rec_has_to_wait_in_queue() does not iterate over all locks for the page, but only over the locks located before the waiting lock in the queue. So there is an invariant: any lock in the queue can wait only for a lock that is located before it in the queue. When a conflicting lock waits for the transaction of the requesting lock, we need to place the requesting lock before the waiting lock in the queue to preserve the invariant. That is why we look for the first lock that waits for the requesting transaction, and place the new lock just after the last granted lock of the requesting transaction that precedes it. Example: in the queue "trx1 waiting lock, trx1 granted lock, ..., trx2 lock - waiting for trx1", the new lock of trx1 is placed just after "trx1 granted lock". There are also implicit locks which are lazily converted to explicit ones, and we need to place the newly created explicit lock in the correct place in the queue. All explicit locks converted from implicit ones are placed just after the last non-waiting lock of the same transaction before the first lock that waits for the transaction. Code review and cleanup was made by Marko Mäkelä.	2022-01-18 15:18:42 +03:00
Vladislav Vaintroub	47e18af906	MDEV-27494 Rename .ic files to .inl	2022-01-17 16:41:51 +01:00
Marko Mäkelä	1f02280904	MDEV-26769 InnoDB does not support hardware lock elision This implements memory transaction support for: * Intel Restricted Transactional Memory (RTM), also known as TSX-NI (Transactional Synchronization Extensions New Instructions) * POWER v2.09 Hardware Trace Monitor (HTM) on GNU/Linux transactional_lock_guard, transactional_shared_lock_guard: RAII lock guards that try to elide the lock acquisition when transactional memory is available. buf_pool.page_hash: Try to elide latches whenever feasible. Related to the InnoDB change buffer and ROW_FORMAT=COMPRESSED tables, this is not always possible. In buf_page_get_low(), memory transactions only work reasonably well for validating a guessed block address. TMLockGuard, TMLockTrxGuard, TMLockMutexGuard: RAII lock guards that try to elide lock_sys.latch and related latches.	2021-10-22 12:38:45 +03:00
Marko Mäkelä	9c5835e067	Merge 10.5 into 10.6	2021-10-18 16:36:24 +03:00
Marko Mäkelä	18eab4a832	MDEV-26682 Replication timeouts with XA PREPARE The purpose of non-exclusive locks in a transaction is to guarantee that the records covered by those locks remain that way until the transaction is committed. (The purpose of gap locks is to ensure that a record that was nonexistent will remain that way.) Once a transaction has reached the XA PREPARE state, the only allowed further actions are XA ROLLBACK or XA COMMIT. Therefore, it can be argued that only the exclusive locks that the XA PREPARE transaction is holding are essential. Furthermore, InnoDB never preserved explicit locks across server restart. For XA PREPARE transactions, we will only recover implicit exclusive locks for records that had been modified. Because of the fact that XA PREPARE followed by a server restart will cause some locks to be lost, we might as well always release all non-exclusive locks during the execution of an XA PREPARE statement. lock_release_on_prepare(): Release non-exclusive locks on XA PREPARE. trx_prepare(): Invoke lock_release_on_prepare() unless the isolation level is SERIALIZABLE or this is an internal distributed transaction with the binlog (not an actual XA PREPARE statement). This has been discussed with Sergei Golubchik and Andrei Elkin. Reviewed by: Sergei Golubchik	2021-10-18 12:49:10 +03:00
Krunal Bauskar	48bbc44733	MDEV-26609 : Avoid deriving ELEMENT_PER_LATCH from cacheline * The buffer pool has latches that protect access to pages. * There is one latch per N pages (check page_hash_table for more details). * N is calculated based on the cacheline size. * For example, if the cacheline size is 64, then 7 page pointers + 1 latch can be hosted on the same cacheline; if it is 128, then 15 page pointers + 1 latch can be hosted on the same cacheline. * ARM generally has a wider cacheline, so on ARM 1 latch is used to access 15 pages, whereas on x86 1 latch is used to access 7 pages. Naturally, the contention is higher in the ARM case. * This patch helps relax the contention by limiting the elements per cacheline to 7 (+ 1 latch slot). For a wider cacheline (say 128), the remaining 8 slots are kept empty. This ensures that no 2 latches share the same cacheline, to avoid latch-level contention. Based on a suggestion from Marko, the same logic is now extended to lock_sys_t::hash_table.	2021-09-17 11:58:49 +03:00
Marko Mäkelä	277ba134ad	MDEV-26467: Avoid futile spin loops Typically, index_lock and fil_space_t::latch will be held for a longer time than the spin loop in latch acquisition would be waiting for. Let us avoid spin loops for those as well as dict_sys.latch, which could be held in exclusive mode for a longer time (while loading metadata into the buffer pool and the dictionary cache). Performance testing on a dual Intel Xeon E5-2630 v4 (2 NUMA nodes) suggests that the buffer pool page latch (block_lock) benefits from a spin loop in both read-only and read-write workloads where the working set is slightly larger than the buffer pool. Presumably, most contention would occur on leaf page latches. Contention on upper level pages in the buffer pool should intuitively last longer. We introduce srw_spin_lock and srw_spin_mutex to allow users of srw_lock or srw_mutex to opt in for the spin loop. On Microsoft Windows, a spin loop variant was not and will not be available; srw_mutex and srw_lock will simply wrap SRWLOCK. That is, on Microsoft Windows, the parameters innodb_sync_spin_loops and innodb_spin_wait_delay will only affect block_lock.	2021-09-06 12:32:24 +03:00
Marko Mäkelä	c5fd9aa562	MDEV-25919: Lock tables before acquiring dict_sys.latch In commit `1bd681c8b3` (MDEV-25506 part 3) we introduced a "fake instant timeout" when a transaction would wait for a table or record lock while holding dict_sys.latch. This prevented a deadlock of the server but could cause bogus errors for operations on the InnoDB persistent statistics tables. A better fix is to ensure that whenever a transaction is being executed in the InnoDB internal SQL parser (which will for now require dict_sys.latch to be held), it will already have acquired all locks that could be required for the execution. So, we will acquire the following locks upfront, before acquiring dict_sys.latch: (1) MDL on the affected user table (acquired by the SQL layer) (2) If applicable (not for RENAME TABLE): InnoDB table lock (3) If persistent statistics are going to be modified: (3.a) MDL_SHARED on mysql.innodb_table_stats, mysql.innodb_index_stats (3.b) exclusive table locks on the statistics tables (4) Exclusive table locks on the InnoDB data dictionary tables (not needed in ANALYZE TABLE and the like) Note: Acquiring exclusive locks on the statistics tables may cause more locking conflicts between concurrent DDL operations. Notably, RENAME TABLE will lock the statistics tables even if no persistent statistics are enabled for the table. DROP DATABASE will only acquire locks on statistics tables if persistent statistics are enabled for the tables on which the SQL layer is invoking ha_innobase::delete_table(). For any "garbage collection" in innodb_drop_database(), a timeout while acquiring locks on the statistics tables will result in any statistics not being deleted for any tables that the SQL layer did not know about. If innodb_defragment=ON, information may be written to the statistics tables even for tables for which InnoDB persistent statistics are disabled. 
But, DROP TABLE will no longer attempt to delete that information if persistent statistics are not enabled for the table. This change should also fix the hangs related to InnoDB persistent statistics and STATS_AUTO_RECALC (MDEV-15020) as well as a bug that running ALTER TABLE on the statistics tables concurrently with running ALTER TABLE on InnoDB tables could cause trouble. lock_rec_enqueue_waiting(), lock_table_enqueue_waiting(): Do not issue a fake instant timeout error when the transaction is holding dict_sys.latch. Instead, assert that the dict_sys.latch is never being held here. lock_sys_tables(): A new function to acquire exclusive locks on all dictionary tables, in case DROP TABLE or similar operation is being executed. Locking non-hard-coded tables is optional to avoid a crash in row_merge_drop_temp_indexes(). The SYS_VIRTUAL table was introduced in MySQL 5.7 and MariaDB Server 10.2. Normally, we require all these dictionary tables to exist before executing any DDL, but the function row_merge_drop_temp_indexes() is an exception. When upgrading from MariaDB Server 10.1 or MySQL 5.6 or earlier, the table SYS_VIRTUAL would not exist at this point. ha_innobase::commit_inplace_alter_table(): Invoke log_write_up_to() while not holding dict_sys.latch. dict_sys_t::remove(), dict_table_close(): No longer try to drop index stubs that were left behind by aborted online ADD INDEX. Such indexes should be dropped from the InnoDB data dictionary by row_merge_drop_indexes() as part of the failed DDL operation. Stubs for aborted indexes may only be left behind in the data dictionary cache. dict_stats_fetch_from_ps(): Use a normal read-only transaction. ha_innobase::delete_table(), ha_innobase::truncate(), fts_lock_table(): While waiting for purge to stop using the table, do not hold dict_sys.latch. ha_innobase::delete_table(): Implement a work-around for the rollback of ALTER TABLE...ADD PARTITION. 
MDL_EXCLUSIVE would not be held if ALTER TABLE hits lock_wait_timeout while trying to upgrade the MDL due to a conflicting LOCK TABLES, such as in the first ALTER TABLE in the test case of Bug#53676 in parts.partition_special_innodb. Therefore, we must explicitly stop purge, because it would not be stopped by MDL. dict_stats_func(), btr_defragment_chunk(): Allocate a THD so that we can acquire MDL on the InnoDB persistent statistics tables. mysqltest_embedded: Invoke ha_pre_shutdown() before free_used_memory() in order to avoid ASAN heap-use-after-free related to acquire_thd(). trx_t::dict_operation_lock_mode: Changed the type to bool. row_mysql_lock_data_dictionary(), row_mysql_unlock_data_dictionary(): Implemented as macros. rollback_inplace_alter_table(): Apply an infinite timeout to lock waits. innodb_thd_increment_pending_ops(): Wrapper for thd_increment_pending_ops(). Never attempt async operation for InnoDB background threads, such as the trx_t::commit() in dict_stats_process_entry_from_recalc_pool(). lock_sys_t::cancel(trx_t*): Make dictionary transactions immune to KILL. lock_wait(): Make dictionary transactions immune to KILL, and to lock wait timeout when waiting for locks on dictionary tables. parts.partition_special_innodb: Use lock_wait_timeout=0 to instantly get ER_LOCK_WAIT_TIMEOUT. main.mdl: Filter out MDL on InnoDB persistent statistics tables Reviewed by: Thirunarayanan Balathandayuthapani	2021-08-31 13:54:44 +03:00
Marko Mäkelä	a7d68e7a0f	MDEV-25791: Remove UNIV_INTERN Back in 2006 or 2007, when MySQL AB and Innobase Oy existed as separately controlled entities (Innobase had been acquired by Oracle Corporation), MySQL 5.1 introduced a storage engine plugin interface and Oracle made use of it by distributing a separate InnoDB Plugin, which would contain some more bug fixes and improvements, compared to the version of InnoDB that was statically linked with the mysqld server that was distributed by MySQL AB. The built-in InnoDB would export global symbols, which would clash with the symbols of the dynamic InnoDB Plugin (which was supposed to override the built-in one when present). The solution to this problem was to declare all global symbols with UNIV_INTERN, so that they would get the GCC function attribute that specifies hidden visibility. Later, in MariaDB Server, something based on Percona XtraDB (a fork of MySQL InnoDB) became the statically linked implementation, and something closer to MySQL InnoDB was available as a dynamic plugin. Starting with version 10.2, MariaDB Server includes only one InnoDB implementation, and hence any reason to have the UNIV_INTERN definition was lost. btr_get_size_and_reserved(): Move to the same compilation unit with the only caller. innodb_set_buf_pool_size(): Remove. Modify innobase_buffer_pool_size directly. fil_crypt_calculate_checksum(): Merge to the only caller. ha_innobase::innobase_reset_autoinc(): Merge to the only caller. thd_query_start_micro(): Remove. Call thd_start_utime() directly.	2021-05-27 13:28:08 +03:00
Marko Mäkelä	c366845a0b	MDEV-25691: Simplify handlerton::drop_database for InnoDB The implementation of handlerton::drop_database in InnoDB is unnecessarily complex. The minimal implementation should check that no conflicting locks or references exist on the tables, delete all table metadata in a single transaction, and finally delete the tablespaces. Note: DROP DATABASE will delete each individual table that the SQL layer knows about, one table per transaction. The handlerton::drop_database is basically a final cleanup step for removing any garbage that could have been left behind in InnoDB due to some bug, or not having atomic DDL in the past. hash_node_t: Remove. Use the proper data type name in pointers. dict_drop_index_tree(): Do not take the table as a parameter. Instead, return the tablespace ID if the tablespace should be dropped (we are dropping a clustered index tree). fil_delete_tablespace(), fil_system_t::detach(): Return a single detached file handle. Multi-file tablespaces cannot be deleted via this interface. ha_innobase::delete_table(): Remove a work-around for non-atomic DDL and do not try to drop tables with similar-looking name. innodb_drop_database(): Complete rewrite. innobase_drop_database(), dict_get_first_table_name_in_db(), row_drop_database_for_mysql(), drop_all_foreign_keys_in_db(): Remove. row_purge_remove_clust_if_poss_low(), row_undo_ins_remove_clust_rec(): If the tablespace is to be deleted, try to evict the table definition from the cache. Failing that, set dict_table_t::space to nullptr. lock_release_on_rollback(): On the rollback of CREATE TABLE, release all locks that the transaction had on the table, to avoid heap-use-after-free.	2021-05-18 12:53:40 +03:00
Marko Mäkelä	a29618f3bd	MDEV-25522: Purge of aborted ADD INDEX leaves orphan locks behind lock_discard_for_index(): New function, to discard locks for an index whose index tree has been purged. By definition, such indexes must be ones for which the MDL upgrade failed in inplace ALTER TABLE and the ADD INDEX operation was never committed. Note: Because we do not support online ADD SPATIAL INDEX, we only have to traverse the lock_sys.rec_hash for B-trees and not the hash tables for R-trees. row_purge_remove_clust_if_poss_low(): Invoke lock_discard_for_index() if necessary before dropping a B-tree for a SYS_INDEXES record.	2021-04-28 17:24:44 +03:00
Marko Mäkelä	8751aa7397	MDEV-25404: ssux_lock_low: Introduce a separate writer mutex Having both readers and writers use a single lock word in futex system calls caused performance regression compared to SRW_LOCK_DUMMY (mutex and 2 condition variables). A contributing factor is that we did not accurately keep track of the number of waiting threads and thus had to invoke system calls to wake up any waiting threads. SUX_LOCK_GENERIC: Renamed from SRW_LOCK_DUMMY. This is the original implementation, with rw_lock (std::atomic<uint32_t>), a mutex and two condition variables. Using a separate writer mutex (as described below) is not possible, because the mutex ownership in a buf_block_t::lock must be able to transfer from a write submitter thread to an I/O completion thread, and pthread_mutex_lock() may assume that the submitter thread is recursively acquiring the mutex that it already holds, while in reality the I/O completion thread is the real owner. POSIX does not define an interface for requesting a mutex to be non-recursive. On Microsoft Windows, srw_lock_low will remain a simple wrapper of SRWLOCK. On 32-bit Microsoft Windows, sizeof(SRWLOCK)=4 while sizeof(srw_lock_low)=8. On other platforms, srw_lock_low is an alias of ssux_lock_low, the Simple (non-recursive) Shared/Update/eXclusive lock. In the futex-based implementation of ssux_lock_low (Linux, OpenBSD, Microsoft Windows), we shall use a dedicated mutex for exclusive requests (writer), and have a WRITER flag in the 'readers' lock word to inform that a writer is holding the lock or waiting for the lock to be granted. When the WRITER flag is set, all lock requests must acquire the writer mutex. Normally, shared (S) lock requests simply perform a compare-and-swap on the 'readers' word. Update locks are implemented as a combination of writer mutex and a normal counter in the 'readers' lock word. The conflict between U and X locks is guaranteed by the writer mutex. 
Unlike SUX_LOCK_GENERIC, wr_u_downgrade() will not wake up any pending rd_lock() waits. They will wait until u_unlock() releases the writer mutex. The ssux_lock_low is always wrapped by sux_lock (with a recursion count of U and X locks), used for dict_index_t::lock and buf_block_t::lock. Their memory footprint for the futex-based implementation will increase by sizeof(srw_mutex), or 4 bytes. This change addresses a performance regression in read-only benchmarks, such as sysbench oltp_read_only. Also write performance was improved. On 32-bit Linux and OpenBSD, lock_sys_t::hash_table will allocate two hash table elements for each srw_lock (14 instead of 15 hash table cells per 64-byte cache line on IA-32). On Microsoft Windows, sizeof(SRWLOCK)==sizeof(void*) and there is no change. Reviewed by: Vladislav Vaintroub Tested by: Axel Schwenke and Vladislav Vaintroub	2021-04-19 18:15:49 +03:00
sjaakola	a1e70388c4	MDEV-24966 Galera multi-master regression After the merging of MDEV-24915, 10.6 branch has regressions with handling of concurrent write load against two or more cluster nodes. These regressions may surface as cluster hanging, node crashes or data inconsistency. With some test scenarios, the only visible symptom could be that the BF victim aborting happens only by innodb lock wait timeout expiration. This would result only to poor performance (by default 50 sec hang for each BF conflict), and could be somewhat difficult to diagnose. This pull request has following fixes to handle concurrent write load from multiple nodes: In lock_wait_wsrep_kill(), the victim trx was expected to be only in TRX_STATE_ACTIVE state. With the delayed BF conflict handling, it can happen that victim has advanced into pre commit state. This was fixed by choosing victim both in TRX_STATE_ACTIVE and TRX_STATE_PREPARED states. Victim transaction may be in several different states at the time of detected lock conflict, and due to delayed BF aborting practice in MDEV-24915, the victim may advance further before the actual BF aborting takes place. The BF aborting in MDEV-24915 did not wake the victim, if it was in the state of waiting for some other lock (than the one that was blocking the high priority thread). This anomaly caused the innodb lock wait timeout expiration delays and poor performance symptom. To fix this, lock_wait_wsrep_kill() now looks if victim is in lock waiting state, and uses lock_cancel_waiting_and_release() to cancel this lock wait. wsrep_bf_abort() checks if the victim has active transaction (in wsrep-lib), and starts a new transaction if there was no active transaction before. Due to late BF aborting, the victim may have e.g. failed in certification and is already aborting or has aborted at this stage. This has caused problems in testing where BF aborter tries to BF abort himself. 
The fix in wsrep_bf_abort() now skips the BF abort if the victim is aborting or has aborted. The victim may not have started a transaction yet in wsrep context, but it may have acquired MDL locks (due to DDL execution), and this has caused a BF conflict. Such a case does not require aborting in wsrep or replication provider state. BF aborting could cause a BF-BF conflict scenario, if the victim was already aborted and changed to a replayer having high priority as well. This BF-BF conflict scenario is now avoided in lock_wait_wsrep(), where we now check if the blocking lock holder is also high priority and is ordered before; the caller should wait for the lock in this situation. The natural InnoDB deadlock resolving algorithm could pick a BF thread as the deadlock victim. This is fixed by giving maximum weight to BF threads in Deadlock::report(). MDEV-24341 has changed execution paths in do_command() and this affects BF aborted victim execution. This PR fixes one assert in do_command(): DBUG_ASSERT(!thd->async_state.pending_ops()) Which fired if the thd was BF aborted earlier. This assert is now changed to allow pending_ops() if thd was BF aborted before. With these fixes, a long-term highly conflicting write load could be run against a two-node cluster. If binlogging is configured, log_slave_updates should also be set.	2021-04-13 14:58:54 +03:00
Marko Mäkelä	7c524d4414	MDEV-25371 Potential hang in wsrep_is_BF_lock_timeout() In commit `e71e613353` (MDEV-24671), lock_sys.wait_mutex was moved above lock_sys.mutex (which was later replaced with lock_sys.latch) in the latching order. In commit `7cf4419fc4` (MDEV-24789), a potential hang was introduced to Galera. The function lock_wait() would hold lock_sys.wait_mutex while invoking wsrep_is_BF_lock_timeout(), which in turn could acquire LockMutexGuard for some diagnostic printout. wsrep_is_BF_lock_timeout(): Do not invoke trx_print_latched() or LockMutexGuard. lock_sys_t::wr_lock(), lock_sys_t::rd_lock(): Assert that the current thread is not holding lock_sys.wait_mutex. Unfortunately, RW-locks are not covered by SAFE_MUTEX. Reviewed by: Jan Lindström	2021-04-08 13:32:16 +03:00
Marko Mäkelä	8ea923f55b	MDEV-24818: Optimize multi-statement INSERT into an empty table If the user "opts in" (as in the parent commit `92b2a911e5`), we can optimize multiple INSERT statements to use table-level locking and undo logging. There will be a change of behavior: CREATE TABLE t(a PRIMARY KEY) ENGINE=InnoDB; SET foreign_key_checks=0, unique_checks=0; BEGIN; INSERT INTO t SET a=1; INSERT INTO t SET a=1; COMMIT; will end up with an empty table, because in case of an error, the entire transaction will be rolled back, instead of rolling back the failing statement. Previously, the second INSERT statement would have been logged row by row, and only that second statement would have been rolled back, leaving the first INSERT intact. lock_table_x_unlock(), trx_mod_table_time_t::WAS_BULK: Remove. Because we cannot really support statement rollback in this optimized mode, we will not optimize the locking. The exclusive table lock will be held until the end of the transaction.	2021-03-16 15:21:34 +02:00
Marko Mäkelä	bda8a2a63a	MDEV-24671 fixup: Merge lock_sys_t::wait_pending into wait_count The maximum number of concurrently waiting transactions is one less than the maximum number of concurrent transactions. A 45-bit cumulative counter of lock waits will support more than one million lock waits per second for a year.	2021-03-09 09:05:26 +02:00
Marko Mäkelä	1c7d4f8de7	MDEV-25016 Race condition between lock_sys_t::cancel() and page split or merge In commit `8d16da1487` (MDEV-24789) we accidentally introduced a race condition. During the time a waiting lock request is being removed, the request might be moved to another page due to a concurrent page split or merge. To prevent this, we must hold exclusive lock_sys.latch when releasing a record lock. lock_release_autoinc_locks(): Avoid a potential hang. No dict_table_t::lock_mutex must be waited for while already holding lock_sys.wait_mutex or trx_t::mutex. lock_cancel_waiting_and_release(): Correctly handle AUTO_INCREMENT locks.	2021-03-03 13:49:49 +02:00
Marko Mäkelä	8513007c84	Cleanup: Remove some lock accessor functions	2021-03-02 14:26:57 +02:00
Marko Mäkelä	8d16da1487	MDEV-24789: Reduce lock_sys mutex contention further lock_sys_t::deadlock_check(): Assume that only lock_sys.wait_mutex is being held by the caller. lock_sys_t::rd_lock_try(): New function. lock_sys_t::cancel(trx_t): Kill an active transaction that may be holding a lock. lock_sys_t::cancel(trx_t, lock_t*): Cancel a waiting lock request. lock_trx_handle_wait(): Avoid acquiring mutexes in some cases, and never acquire lock_sys.latch in exclusive mode. This function is only invoked in a semi-consistent read (locking a clustered index record only if it matches the search condition). Normally, lock_wait() will take care of lock waits. lock_wait(): Invoke the new function lock_sys_t::cancel() at the end, to avoid acquiring exclusive lock_sys.latch. lock_rec_other_trx_holds_expl(): Use LockGuard instead of LockMutexGuard. lock_release_autoinc_locks(): Explicitly acquire table->lock_mutex, in case only a shared lock_sys.latch is being held. Deadlock::report() will still hold exclusive lock_sys.latch while invoking lock_cancel_waiting_and_release(). lock_cancel_waiting_and_release(): Acquire trx->mutex in this function, instead of expecting the caller to do so. lock_unlock_table_autoinc(): Only acquire shared lock_sys.latch. lock_table_has_locks(): Do not acquire lock_sys.latch at all. Deadlock::check_and_resolve(): Only acquire shared lock_sys.latch for invoking lock_sys_t::cancel(trx, wait_lock). innobase_query_caching_table_check_low(), row_drop_tables_for_mysql_in_background(): Do not acquire lock_sys.latch.	2021-03-02 14:26:33 +02:00
Marko Mäkelä	7cf4419fc4	MDEV-24789: Reduce lock_sys.wait_mutex contention A performance regression was introduced by commit `e71e613353` (MDEV-24671) and mostly addressed by commit `455514c800`. The regression is likely caused by increased contention on lock_sys.latch (former lock_sys.mutex), possibly indirectly caused by contention on lock_sys.wait_mutex. This change aims to reduce both, but further improvements will be needed. lock_wait(): Minimize the lock_sys.wait_mutex hold time. lock_sys_t::deadlock_check(): Add a parameter for indicating whether lock_sys.latch is exclusively locked. trx_t::was_chosen_as_deadlock_victim: Always use atomics. lock_wait_wsrep(): Assume that no mutex is being held. Deadlock::report(): Always kill the victim transaction. lock_sys_t::timeout: New counter to back MONITOR_TIMEOUT.	2021-02-26 14:58:48 +02:00
Marko Mäkelä	21987e5919	MDEV-20612 fixup: Reduce hash table lookups Let us calculate the hash table cell address while we are calculating the latch address, to avoid repeated computations of the address. The latch address can be derived from the cell address with a simple bitmask operation.	2021-02-24 14:47:42 +02:00
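The bitmask derivation mentioned in the commit above can be illustrated with a minimal sketch. It assumes a 64-byte cache line and a layout in which the first slot of every cache-line-sized group holds the latch and the remaining slots hold hash cells. The names `Slot`, `Line`, `cell_at` and `latch_of` are hypothetical and are not taken from the MariaDB source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines; the real value is platform dependent.
constexpr std::size_t LINE = 64;
constexpr std::size_t SLOTS_PER_LINE = LINE / sizeof(void *);

// Hypothetical slot: either a hash cell (bucket head) or a latch word.
union Slot {
  void *cell;
  std::uintptr_t latch_word;
};

// One cache line: slots[0] is the latch, slots[1..] are hash cells.
struct alignas(LINE) Line {
  Slot slots[SLOTS_PER_LINE];
};
static_assert(sizeof(Line) == LINE, "one group must fill exactly one line");

// Map a logical cell number to its slot, skipping the latch slots.
Slot *cell_at(Line *table, std::size_t n) {
  const std::size_t per_line = SLOTS_PER_LINE - 1;
  return &table[n / per_line].slots[1 + n % per_line];
}

// Derive the latch address from a cell address by clearing the low bits.
Slot *latch_of(Slot *cell) {
  assert(cell);
  const auto addr = reinterpret_cast<std::uintptr_t>(cell);
  return reinterpret_cast<Slot *>(addr & ~static_cast<std::uintptr_t>(LINE - 1));
}
```

Once `cell_at()` has computed the cell address, `latch_of()` needs no further division or modulo, which is the kind of repeated computation the commit message says it avoids.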
Marko Mäkelä	43b239a081	MDEV-24915 Galera conflict resolution is unnecessarily complex The fix of MDEV-23328 introduced a background thread for killing conflicting transactions. Thanks to the refactoring that was conducted in MDEV-24671, the high-priority ("brute-force") applier thread can kill the conflicting transactions itself, before waiting for the locks to be finally released (after the conflicting transactions have been rolled back). This also allows us to remove the hack LockGGuard that had to be added in MDEV-20612, and remove Galera-related function parameters from lock creation.	2021-02-18 12:16:51 +02:00
Marko Mäkelä	c68007d958	MDEV-24738 Improve the InnoDB deadlock checker A new configuration parameter innodb_deadlock_report is introduced: * innodb_deadlock_report=off: Do not report any details of deadlocks. * innodb_deadlock_report=basic: Report transactions and waiting locks. * innodb_deadlock_report=full (default): Report also the blocking locks. The improved deadlock checker will consider all involved transactions in one loop, even if the deadlock loop includes several transactions. The theoretical maximum number of transactions that can be involved in a deadlock is `innodb_page_size` * 8, limited by the persistent data structures. Note: Similar to mysql/mysql-server@3859219875 our deadlock checker will consider at most one blocking transaction for each waiting transaction. The new field trx->lock.wait_trx will be nullptr if and only if trx->lock.wait_lock is nullptr. Note that trx->lock.wait_lock->trx == trx (the waiting transaction), while trx->lock.wait_trx points to one of the transactions whose lock is conflicting with trx->lock.wait_lock. Considering only one blocking transaction will greatly simplify our deadlock checker, but it may also make the deadlock checker blind to some deadlocks where the deadlock cycle is 'hidden' by the fact that the registered trx->lock.wait_trx is not actually waiting for any InnoDB lock, but something else. So, instead of deadlocks, sometimes lock wait timeout may be reported. To improve on this, whenever trx->lock.wait_trx is changed, we will register further 'candidate' transactions in Deadlock::to_check(), and check for 'revealed' deadlocks as soon as possible, in lock_release() and innobase_kill_query(). The old DeadlockChecker was holding lock_sys.latch, even though using lock_sys.wait_mutex should be less contended (and thus preferred) in the likely case that no deadlock is present. lock_wait(): Defer the deadlock check to this function, instead of executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting(). 
DeadlockChecker: Complete rewrite: (1) Explicitly keep track of transactions that are being waited for, in trx->lock.wait_trx, protected by lock_sys.wait_mutex. Previously, we were painstakingly traversing the lock heaps while blocking concurrent registration or removal of any locks (even uncontended ones). (2) Use Brent's cycle-detection algorithm for deadlock detection, traversing each trx->lock.wait_trx edge at most 2 times. (3) If a deadlock is detected, release lock_sys.wait_mutex, acquire LockMutexGuard, re-acquire lock_sys.wait_mutex and re-invoke find_cycle() to find out whether the deadlock is still present. (4) Display information on all transactions that are involved in the deadlock, and choose a victim to be rolled back. lock_sys.deadlocks: Replaces lock_deadlock_found. Protected by wait_mutex. Deadlock::find_cycle(): Quickly find a cycle of trx->lock.wait_trx... using Brent's cycle detection algorithm. Deadlock::report(): Report a deadlock cycle that was found by Deadlock::find_cycle(), and choose a victim with the least weight. Altogether, we may traverse each trx->lock.wait_trx edge up to 5 times (2*find_cycle()+1 time for reporting and choosing the victim). Deadlock::check_and_resolve(): Find and resolve a deadlock. lock_wait_rpl_report(): Report the waits-for information to replication. This used to be executed as part of DeadlockChecker. Replication must know the waits-for relations even if no deadlocks are present in InnoDB. Reviewed by: Vladislav Vaintroub	2021-02-17 12:44:08 +02:00
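The cycle search described in the commit above can be sketched as follows. This is a generic rendering of Brent's cycle-detection algorithm over a chain with at most one outgoing edge per node, not the actual `Deadlock::find_cycle()` code. The `Trx` struct is a hypothetical stand-in that keeps only the single `wait_trx` edge, and all locking (such as holding `lock_sys.wait_mutex`) is omitted.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a transaction: only the waits-for edge is kept.
struct Trx {
  Trx *wait_trx = nullptr;  // the one transaction this one is waiting for
};

// Brent's cycle detection along the wait_trx chain starting at `start`.
// Returns some transaction that lies on a cycle, or nullptr if the chain
// ends (that is, some transaction in the chain is not waiting at all).
Trx *find_cycle(Trx *start) {
  assert(start);
  Trx *tortoise = start;
  Trx *hare = start->wait_trx;
  std::size_t power = 1, length = 1;
  while (hare && hare != tortoise) {
    if (length == power) {
      // Teleport the tortoise to the hare and double the search window.
      tortoise = hare;
      power *= 2;
      length = 0;
    }
    hare = hare->wait_trx;
    ++length;
  }
  return hare;
}
```

Only the hare follows edges here; the tortoise is repositioned by assignment. Note that the returned transaction need not be `start`: a chain may lead into a cycle that does not include the transaction that began waiting.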
Marko Mäkelä	26d6224dd6	MDEV-20612: Enable concurrent lock_release() lock_release_try(): Try to release locks while only holding shared lock_sys.latch. lock_release(): If 5 attempts of lock_release_try() fail, proceed to acquire exclusive lock_sys.latch.	2021-02-12 17:44:58 +02:00
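The retry-then-escalate pattern in the commit above can be sketched with standard C++17 primitives. This is an illustration under assumptions, not the InnoDB implementation: `std::shared_mutex` stands in for `lock_sys.latch`, and the callbacks `try_release` and `release_all` as well as the function name `release_with_fallback` are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <shared_mutex>

// Try an optimistic release under a shared latch a bounded number of times;
// if every attempt fails, fall back to the exclusive latch.
// Returns true if an optimistic attempt succeeded, false after the fallback.
bool release_with_fallback(std::shared_mutex &latch,
                           const std::function<bool()> &try_release,
                           const std::function<void()> &release_all,
                           int max_attempts = 5) {
  assert(max_attempts >= 0);
  for (int i = 0; i < max_attempts; ++i) {
    // Shared mode: other threads may hold the latch in shared mode too.
    std::shared_lock<std::shared_mutex> shared(latch);
    if (try_release())
      return true;
    // The shared latch is dropped here before the next attempt.
  }
  // Exclusive mode: no other holder exists, so the release cannot fail.
  std::unique_lock<std::shared_mutex> exclusive(latch);
  release_all();
  return false;
}
```

Bounding the number of optimistic attempts keeps the common case cheap while still guaranteeing progress when contention prevents the shared-mode path from completing.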
Marko Mäkelä	b08448de64	MDEV-20612: Partition lock_sys.latch We replace the old lock_sys.mutex (which was renamed to lock_sys.latch) with a combination of a global lock_sys.latch and table or page hash lock mutexes. The global lock_sys.latch can be acquired in exclusive mode, or it can be acquired in shared mode and another mutex will be acquired to protect the locks for a particular page or a table. This is inspired by mysql/mysql-server@1d259b87a6 but the optimization of lock_release() will be done in the next commit. Also, we will interleave mutexes with the hash table elements, similar to how buf_pool.page_hash was optimized in commit `5155a300fa` (MDEV-22871). dict_table_t::autoinc_trx: Use Atomic_relaxed. dict_table_t::autoinc_mutex: Use srw_mutex in order to reduce the memory footprint. On 64-bit Linux or OpenBSD, both this and the new dict_table_t::lock_mutex should be 32 bits and be stored in the same 64-bit word. On Microsoft Windows, the underlying SRWLOCK is 32 or 64 bits, and on other systems, sizeof(pthread_mutex_t) can be much larger. ib_lock_t::trx_locks, trx_lock_t::trx_locks: Document the new rules. Writers must assert lock_sys.is_writer() \|\| trx->mutex_is_owner(). LockGuard: A RAII wrapper for acquiring a page hash table lock. LockGGuard: Like LockGuard, but when Galera Write-Set Replication is enabled, we must acquire all shards, for updating arbitrary trx_locks. LockMultiGuard: A RAII wrapper for acquiring two page hash table locks. lock_rec_create_wsrep(), lock_table_create_wsrep(): Special Galera conflict resolution in non-inlined functions in order to keep the common code paths shorter. lock_sys_t::prdt_page_free_from_discard(): Refactored from lock_prdt_page_free_from_discard() and lock_rec_free_all_from_discard_page(). trx_t::commit_tables(): Replaces trx_update_mod_tables_timestamp(). lock_release(): Let trx_t::commit_tables() invalidate the query cache for those tables that were actually modified by the transaction. 
Merge lock_check_dict_lock() to lock_release(). We must never release lock_sys.latch while holding any lock_sys_t::hash_latch. Violating this rule could lead to memory corruption if the buffer pool is resized between the time lock_sys.latch is released and the hash_latch is released.	2021-02-12 17:44:32 +02:00
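The two-level scheme described in the commit above (a global latch held in shared mode plus a per-shard mutex) can be sketched as an RAII guard. This is a simplified illustration, not the actual `LockGuard`: the `LockSys` struct, the fixed shard count of 16, the modulo mapping, and the name `ShardGuard` are all assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>

// Hypothetical lock system: one global latch plus a fixed set of shards.
struct LockSys {
  std::shared_mutex latch;           // global: shared or exclusive
  std::array<std::mutex, 16> shard;  // assumed shard count
};

// RAII guard: shared global latch first, then the mutex of one shard.
class ShardGuard {
  std::size_t index_;
  // Declaration order matters: members are destroyed in reverse order, so
  // the shard mutex is always released before the global latch.
  std::shared_lock<std::shared_mutex> global_;
  std::unique_lock<std::mutex> shard_;

public:
  ShardGuard(LockSys &sys, std::size_t page_id)
      : index_(page_id % sys.shard.size()),
        global_(sys.latch),
        shard_(sys.shard[index_]) {
    assert(global_.owns_lock() && shard_.owns_lock());
  }
  std::size_t shard_index() const { return index_; }
};
```

Threads that touch different shards proceed in parallel, while an operation that needs a consistent view of everything can still take `latch` exclusively. The member order encodes the rule stated above that the global latch must not be released while a shard latch is still held.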
Marko Mäkelä	b01d8e1a33	MDEV-20612: Replace lock_sys.mutex with lock_sys.latch For now, we will acquire the lock_sys.latch only in exclusive mode, that is, use it as a mutex. This is preparation for the next commit where we will introduce a less intrusive alternative, combining a shared lock_sys.latch with dict_table_t::lock_mutex or a mutex embedded in lock_sys.rec_hash, lock_sys.prdt_hash, or lock_sys.prdt_page_hash.	2021-02-11 14:52:10 +02:00
Marko Mäkelä	2e64513fba	MDEV-20612 preparation: Fewer calls to buf_page_t::id()	2021-02-11 12:48:07 +02:00
Marko Mäkelä	786bc312b8	Cleanup: Replace mysql_cond_t with pthread_cond_t Let us avoid the memory overhead and the dead duplicated code for each use of never-instrumented condition variables in InnoDB.	2021-02-07 13:21:18 +02:00
Marko Mäkelä	74ab97f58f	Cleanup: Remove lock_trx_lock_list_init(), lock_table_get_n_locks()	2021-02-07 11:18:21 +02:00
Marko Mäkelä	465bdabb7a	Cleanup: Reduce some lock_sys.mutex contention lock_table(): Remove the constant parameter flags=0. lock_table_resurrect(): Merge lock_table_ix_resurrect() and lock_table_x_resurrect(). lock_rec_lock(): Only acquire LockMutexGuard if lock_table_has() does not hold.	2021-02-05 13:14:50 +02:00
Marko Mäkelä	5f46385764	MDEV-24731 Excessive mutex contention in DeadlockChecker::check_and_resolve() The DeadlockChecker expects to be able to freeze the waits-for graph. Hence, it is best executed somewhere where we are not holding any additional mutexes. lock_wait(): Defer the deadlock check to this function, instead of executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting(). DeadlockChecker::trx_rollback(): Merge with the only caller, check_and_resolve(). LockMutexGuard: RAII accessor for lock_sys.mutex. lock_sys.deadlocks: Replaces lock_deadlock_found. trx_t: Clean up some comments.	2021-02-04 16:38:07 +02:00
Marko Mäkelä	68b2819342	Cleanup: Remove many C-style lock_get_ accessors Let us prefer member functions to the old C-style accessor functions. Also, prefer bitwise AND operations for checking multiple flags.	2021-01-27 18:41:58 +02:00
Marko Mäkelä	cbb0a60c57	Cleanup: Remove lock_get_size()	2021-01-27 18:02:11 +02:00

178 commits