The initial issue was an assertion failure in lock_sys_t::cancel(),
which checked that the lock to cancel is equal to trx->lock.wait_lock.
If we analyze the lock_sys_t::cancel() code from the perspective of
races on trx->lock.wait_lock, we won't find an error there, except for
the cases where it must be reloaded after acquiring the corresponding
latches.
So the fix is just to remove the assertion and reload
trx->lock.wait_lock after acquiring the necessary latches.
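The "reload after acquiring the latch" pattern can be sketched as follows.
This is only an illustration and not the InnoDB code: the simplified trx_t,
the single std::mutex standing in for the relevant latches, and the function
cancel_wait() are all assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct lock_t { int id; };

// Hypothetical, simplified transaction object.
struct trx_t {
  std::mutex mutex;                          // stands in for the protecting latches
  std::atomic<lock_t*> wait_lock{nullptr};   // lock request being waited for, if any
};

// Cancel the current lock wait, if there still is one.
// Any value of wait_lock that the caller read before acquiring the latch
// may be stale, so nothing is asserted about it; the field is reloaded.
bool cancel_wait(trx_t &trx)
{
  std::lock_guard<std::mutex> g(trx.mutex);
  lock_t *lock = trx.wait_lock.load(std::memory_order_relaxed); // reload under latch
  if (!lock)
    return false;             // another thread already resolved the wait
  trx.wait_lock.store(nullptr, std::memory_order_relaxed);
  return true;
}
```

The second call on the same transaction simply returns false instead of
tripping an assertion.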
Reviewed by: Marko Mäkelä <marko.makela@mariadb.com>
btr_insert_into_right_sibling(): Inherit any gap lock from the
left sibling to the right sibling before inserting the record
to the right sibling and updating the node pointer(s).
lock_update_node_pointer(): Update locks in case a node pointer
will move.
Based on mysql/mysql-server@c7d93c274f
This follows up the previous fix in
commit c3c53926c4 (MDEV-26554).
ha_innobase::delete_table(): Work around the insufficient
metadata locking (MDL) during DML operations by acquiring exclusive
InnoDB table locks on all child tables. Previously, this was only
done on TRUNCATE and ALTER.
ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke
lock_update_delete() during change buffer operations.
The revised trx_t::commit(std::vector<pfs_os_file_t>&) will
hold exclusive lock_sys.latch while invoking fil_delete_tablespace(),
which in turn may invoke ibuf_delete_rec().
dict_index_t::has_locking(): A new predicate, replacing the dummy
!dict_table_is_locking_disabled(index->table). Used for skipping lock
operations during ibuf_delete_rec().
trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks
and remove the table from the cache while holding exclusive
lock_sys.latch.
trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds.
trx_t::commit(): Reset dict_operation before invoking commit_in_memory()
via commit_persist().
lock_release_on_drop(): Release locks while lock_sys.latch is
exclusively locked.
lock_table(): Add a parameter for a pointer to the table.
We must not dereference the table before a lock_sys.latch has
been acquired. If the pointer to the table does not match the table
at that point, the table is invalid and DB_DEADLOCK will be returned.
row_ins_foreign_check_on_constraint(): Improve the checks.
Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed
before commit c5fd9aa562 (MDEV-25919).
row_upd_check_references_constraints(),
wsrep_row_upd_check_foreign_constraints(): Simplify checks.
The fix in commit 6e390a62ba (MDEV-26772)
was a step in the right direction, but implemented incorrectly.
When an InnoDB persistent statistics table cannot be locked immediately,
we must not let row_mysql_handle_errors() roll back the transaction.
lock_table_for_trx(): Add the parameter no_wait (default false)
for an immediate return of DB_LOCK_WAIT in case of a conflict.
ha_innobase::delete_table(), ha_innobase::rename_table():
Pass no_wait=true to lock_table_for_trx() when needed,
instead of temporarily setting THDVAR(thd, lock_wait_timeout) to 0.
MDEV-27025 allows records to be inserted before the record on which DELETE
is locked. As a result, the DELETE misses those records, which causes a
serious ACID violation.
Revert MDEV-27025, MDEV-27550. A test that demonstrates the ACID violation
scenario is added.
The code was backported from the 10.6 commit bd03c0e516.
See that commit message for details.
Apart from the above commit trx_lock_t::wait_trx was also backported from
MDEV-24738. trx_lock_t::wait_trx is protected with lock_sys.wait_mutex
in 10.6, but that mutex was implemented only in MDEV-24789. As there is no
need to backport MDEV-24789 for MDEV-27025,
trx_lock_t::wait_trx is protected with the same mutexes as
trx_lock_t::wait_lock.
This fix should not break innodb-lock-schedule-algorithm=VATS. This
algorithm uses an Eldest-Transaction-First (ETF) heuristic, which prefers
older transactions over new ones. In this fix we just insert a granted lock
just before the last granted lock of the same transaction, which does not
change the execution order of transactions.
The changes in lock_rec_create_low() should not break Galera Cluster,
because there is a big "if" branch for WSREP. This branch is necessary to provide
the correct transactions execution order, and should not be changed for
the current bug fix.
When a lock is checked for conflicts, ignore other locks on the record if
they wait for the requesting transaction.
lock_rec_has_to_wait_in_queue() does not iterate over all locks for
the page, but only over the locks located before the waiting lock in the
queue. So there is an invariant: any lock in the queue can only wait for
a lock that is located before it in the queue.
When a conflicting lock waits for the transaction of the requesting
lock, we need to place the requesting lock before the waiting lock in
the queue to preserve the invariant. That is why we look for the first
lock that is waiting for the requesting transaction, and place the new
lock just after the last granted lock of the requesting transaction
that precedes it.
Example:
trx1 waiting lock, trx1 granted lock, ..., trx2 lock - waiting for trx1
place new lock here -----------------^
There are also implicit locks which are lazily converted to explicit
ones, and we need to place the newly created explicit lock in the correct
place in the queue. All explicit locks converted from implicit ones are
placed just after the last non-waiting lock of the same transaction before
the first lock that is waiting for the transaction.
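The placement rule can be illustrated with a small sketch over a plain
vector. The Lock struct, its waits_for field and insert_position() are
hypothetical names for this example only. The two fallbacks (appending
when nobody waits for the transaction, and inserting directly before the
waiter when the transaction has no earlier granted lock) are assumptions
that the text above does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified record lock queue entry.
struct Lock {
  int  trx;        // owning transaction id
  bool waiting;    // true if the request has not been granted yet
  int  waits_for;  // id of the transaction being waited for, or -1
};

// Index at which a new lock of transaction `trx` would be inserted.
std::size_t insert_position(const std::vector<Lock> &queue, int trx)
{
  // Find the first lock that is waiting for the requesting transaction.
  std::size_t limit = queue.size();
  for (std::size_t i = 0; i < queue.size(); i++)
    if (queue[i].waiting && queue[i].waits_for == trx) {
      limit = i;
      break;
    }
  if (limit == queue.size())
    return limit;             // assumption: nobody waits for us, append
  // Just after the last granted lock of trx that precedes that waiter.
  for (std::size_t i = limit; i-- > 0; )
    if (queue[i].trx == trx && !queue[i].waiting)
      return i + 1;
  return limit;               // assumption: directly before the waiter
}
```

For the queue from the example above (trx1 waiting, trx1 granted, another
granted lock, trx2 waiting for trx1), the function returns index 2, which
is the slot right after the granted trx1 lock.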
Code review and cleanup were done by Marko Mäkelä.
This implements memory transaction support for:
* Intel Restricted Transactional Memory (RTM), also known as TSX-NI
(Transactional Synchronization Extensions New Instructions)
* POWER v2.09 Hardware Transactional Memory (HTM) on GNU/Linux
transactional_lock_guard, transactional_shared_lock_guard:
RAII lock guards that try to elide the lock acquisition
when transactional memory is available.
buf_pool.page_hash: Try to elide latches whenever feasible.
Related to the InnoDB change buffer and ROW_FORMAT=COMPRESSED
tables, this is not always possible.
In buf_page_get_low(), memory transactions only work reasonably
well for validating a guessed block address.
TMLockGuard, TMLockTrxGuard, TMLockMutexGuard: RAII lock guards
that try to elide lock_sys.latch and related latches.
The purpose of non-exclusive locks in a transaction is to guarantee
that the records covered by those locks remain as they are until
the transaction is committed. (The purpose of gap locks is to ensure
that a record that was nonexistent will remain that way.)
Once a transaction has reached the XA PREPARE state, the only allowed
further actions are XA ROLLBACK or XA COMMIT. Therefore, it can be
argued that only the exclusive locks that the XA PREPARE transaction
is holding are essential.
Furthermore, InnoDB never preserved explicit locks across server restart.
For XA PREPARE transactions, we will only recover implicit exclusive locks
for records that had been modified.
Because of the fact that XA PREPARE followed by a server restart will
cause some locks to be lost, we might as well always release all
non-exclusive locks during the execution of an XA PREPARE statement.
lock_release_on_prepare(): Release non-exclusive locks on XA PREPARE.
trx_prepare(): Invoke lock_release_on_prepare() unless the
isolation level is SERIALIZABLE or this is an internal distributed
transaction with the binlog (not actual XA PREPARE statement).
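As a rough sketch of the idea, releasing on prepare amounts to filtering
the lock list of the transaction. The Lock struct, the Isolation enum and
release_on_prepare() are hypothetical simplifications for this example.
The internal binlog transaction case mentioned above is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified lock descriptor.
struct Lock {
  int  rec;        // identifies the locked record
  bool exclusive;  // true for an exclusive lock
};

enum class Isolation { READ_COMMITTED, REPEATABLE_READ, SERIALIZABLE };

// Drop every non-exclusive lock, unless the isolation level is SERIALIZABLE.
void release_on_prepare(std::vector<Lock> &locks, Isolation iso)
{
  if (iso == Isolation::SERIALIZABLE)
    return;                                // keep all locks until commit
  locks.erase(std::remove_if(locks.begin(), locks.end(),
                             [](const Lock &l) { return !l.exclusive; }),
              locks.end());
}
```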
This has been discussed with Sergei Golubchik and Andrei Elkin.
Reviewed by: Sergei Golubchik
* The buffer pool has latches that protect access to pages.
* There is a latch per N pages.
(check page_hash_table for more details)
* N is calculated based on the cacheline size.
* For example, if the cacheline size is
: 64, then 7 page pointers + 1 latch can be hosted on the same cacheline
: 128, then 15 page pointers + 1 latch can be hosted on the same cacheline
* ARM generally has a wider cacheline, so on ARM 1 latch is used
to access 15 pages, whereas on x86 1 latch is used to access 7 pages.
Naturally, there is more contention in the ARM case.
* This patch helps relax this contention by limiting the elements
per cacheline to 7 (+ 1 latch slot).
For a wider cacheline (say 128), the remaining 8 slots are kept empty.
This ensures that no 2 latches are on the same cacheline, to avoid
latch-level contention.
Based on suggestion from Marko, the same logic is now extended to
lock_sys_t::hash_table.
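The resulting slot layout can be sketched as an index mapping. This is an
illustration under the assumption of 8-byte pointers. The names SLOT,
ELEMENTS_PER_LATCH and pad() are chosen for the example and are not
claimed to match the actual implementation.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t SLOT = 8;                            // assumed sizeof(void*)
constexpr std::size_t ELEMENTS_PER_LATCH = 64 / SLOT - 1;  // 7 cells per latch

// Map the logical cell number h to its slot in an array where every cache
// line starts with one latch slot, followed by 7 cells, followed by empty
// padding slots if the cache line is wider than 64 bytes.
constexpr std::size_t pad(std::size_t h, std::size_t cacheline)
{
  return (h / ELEMENTS_PER_LATCH) * (cacheline / SLOT)  // start of the cache line
    + 1                                                 // skip the latch slot
    + h % ELEMENTS_PER_LATCH;                           // position among the 7 cells
}
```

With a 64-byte cache line, cells 0..6 occupy slots 1..7 and cell 7 moves
to slot 9, because slot 8 holds the next latch. With a 128-byte cache
line, cell 7 moves to slot 17: slots 8..15 stay empty and slot 16 holds
the next latch.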
Typically, index_lock and fil_space_t::latch will be held for a longer
time than the spin loop in latch acquisition would be waiting for.
Let us avoid spin loops for those as well as dict_sys.latch, which
could be held in exclusive mode for a longer time (while loading
metadata into the buffer pool and the dictionary cache).
Performance testing on a dual Intel Xeon E5-2630 v4 (2 NUMA nodes)
suggests that the buffer pool page latch (block_lock) benefits from a
spin loop in both read-only and read-write workloads where the working
set is slightly larger than the buffer pool. Presumably, most contention
would occur on leaf page latches. Contention on upper level pages in
the buffer pool should intuitively last longer.
We introduce srw_spin_lock and srw_spin_mutex to allow users of
srw_lock or srw_mutex to opt in for the spin loop.
On Microsoft Windows, a spin loop variant was not and will not be available;
srw_mutex and srw_lock will simply wrap SRWLOCK.
That is, on Microsoft Windows, the parameters innodb_sync_spin_loops
and innodb_spin_wait_delay will only affect block_lock.
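The opt-in can be pictured as a compile-time switch on the mutex type.
The following is a minimal sketch built on std::mutex, not the actual
srw_mutex implementation. sketch_mutex, spin_rounds and the use of
std::this_thread::yield() as the delay are assumptions for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical mutex wrapper: when `spin` is true, poll for a while before
// falling back to a blocking acquisition.
template<bool spin> class sketch_mutex {
  std::mutex m;
public:
  // Stands in for a tunable such as innodb_sync_spin_loops.
  static constexpr unsigned spin_rounds = 30;

  void lock()
  {
    if (spin)
      for (unsigned i = 0; i < spin_rounds; i++) {
        if (m.try_lock())
          return;                      // acquired during the spin phase
        std::this_thread::yield();     // stands in for a CPU pause delay
      }
    m.lock();                          // blocking acquisition
  }
  bool try_lock() { return m.try_lock(); }
  void unlock() { m.unlock(); }
};

using sketch_spin_mutex  = sketch_mutex<true>;   // opts in to spinning
using sketch_plain_mutex = sketch_mutex<false>;  // always blocks right away
```

A latch that is typically held for a long time would use the plain
variant, so that waiters do not burn CPU cycles in a loop that is unlikely
to succeed.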
In commit 1bd681c8b3 (MDEV-25506 part 3)
we introduced a "fake instant timeout" when a transaction would wait
for a table or record lock while holding dict_sys.latch. This prevented
a deadlock of the server but could cause bogus errors for operations
on the InnoDB persistent statistics tables.
A better fix is to ensure that whenever a transaction is being
executed in the InnoDB internal SQL parser (which will for now
require dict_sys.latch to be held), it will already have acquired
all locks that could be required for the execution. So, we will
acquire the following locks upfront, before acquiring dict_sys.latch:
(1) MDL on the affected user table (acquired by the SQL layer)
(2) If applicable (not for RENAME TABLE): InnoDB table lock
(3) If persistent statistics are going to be modified:
(3.a) MDL_SHARED on mysql.innodb_table_stats, mysql.innodb_index_stats
(3.b) exclusive table locks on the statistics tables
(4) Exclusive table locks on the InnoDB data dictionary tables
(not needed in ANALYZE TABLE and the like)
Note: Acquiring exclusive locks on the statistics tables may cause
more locking conflicts between concurrent DDL operations.
Notably, RENAME TABLE will lock the statistics tables
even if no persistent statistics are enabled for the table.
DROP DATABASE will only acquire locks on statistics tables if
persistent statistics are enabled for the tables on which the
SQL layer is invoking ha_innobase::delete_table().
For any "garbage collection" in innodb_drop_database(), a timeout
while acquiring locks on the statistics tables will result in any
statistics not being deleted for any tables that the SQL layer
did not know about.
If innodb_defragment=ON, information may be written to the statistics
tables even for tables for which InnoDB persistent statistics are
disabled. But, DROP TABLE will no longer attempt to delete that
information if persistent statistics are not enabled for the table.
This change should also fix the hangs related to InnoDB persistent
statistics and STATS_AUTO_RECALC (MDEV-15020) as well as
a bug that running ALTER TABLE on the statistics tables
concurrently with running ALTER TABLE on InnoDB tables could
cause trouble.
lock_rec_enqueue_waiting(), lock_table_enqueue_waiting():
Do not issue a fake instant timeout error when the transaction
is holding dict_sys.latch. Instead, assert that the dict_sys.latch
is never being held here.
lock_sys_tables(): A new function to acquire exclusive locks on all
dictionary tables, in case DROP TABLE or similar operation is
being executed. Locking non-hard-coded tables is optional to avoid
a crash in row_merge_drop_temp_indexes(). The SYS_VIRTUAL table was
introduced in MySQL 5.7 and MariaDB Server 10.2. Normally, we require
all these dictionary tables to exist before executing any DDL, but
the function row_merge_drop_temp_indexes() is an exception.
When upgrading from MariaDB Server 10.1 or MySQL 5.6 or earlier,
the table SYS_VIRTUAL would not exist at this point.
ha_innobase::commit_inplace_alter_table(): Invoke
log_write_up_to() while not holding dict_sys.latch.
dict_sys_t::remove(), dict_table_close(): No longer try to
drop index stubs that were left behind by aborted online ADD INDEX.
Such indexes should be dropped from the InnoDB data dictionary by
row_merge_drop_indexes() as part of the failed DDL operation.
Stubs for aborted indexes may only be left behind in the
data dictionary cache.
dict_stats_fetch_from_ps(): Use a normal read-only transaction.
ha_innobase::delete_table(), ha_innobase::truncate(), fts_lock_table():
While waiting for purge to stop using the table,
do not hold dict_sys.latch.
ha_innobase::delete_table(): Implement a work-around for the rollback
of ALTER TABLE...ADD PARTITION. MDL_EXCLUSIVE would not be held if
ALTER TABLE hits lock_wait_timeout while trying to upgrade the MDL
due to a conflicting LOCK TABLES, such as in the first ALTER TABLE
in the test case of Bug#53676 in parts.partition_special_innodb.
Therefore, we must explicitly stop purge, because it would not be
stopped by MDL.
dict_stats_func(), btr_defragment_chunk(): Allocate a THD so that
we can acquire MDL on the InnoDB persistent statistics tables.
mysqltest_embedded: Invoke ha_pre_shutdown() before free_used_memory()
in order to avoid ASAN heap-use-after-free related to acquire_thd().
trx_t::dict_operation_lock_mode: Changed the type to bool.
row_mysql_lock_data_dictionary(), row_mysql_unlock_data_dictionary():
Implemented as macros.
rollback_inplace_alter_table(): Apply an infinite timeout to lock waits.
innodb_thd_increment_pending_ops(): Wrapper for
thd_increment_pending_ops(). Never attempt async operation for
InnoDB background threads, such as the trx_t::commit() in
dict_stats_process_entry_from_recalc_pool().
lock_sys_t::cancel(trx_t*): Make dictionary transactions immune to KILL.
lock_wait(): Make dictionary transactions immune to KILL, and to
lock wait timeout when waiting for locks on dictionary tables.
parts.partition_special_innodb: Use lock_wait_timeout=0 to instantly
get ER_LOCK_WAIT_TIMEOUT.
main.mdl: Filter out MDL on InnoDB persistent statistics tables
Reviewed by: Thirunarayanan Balathandayuthapani
Back in 2006 or 2007, when MySQL AB and Innobase Oy existed as
separately controlled entities (Innobase had been acquired by
Oracle Corporation), MySQL 5.1 introduced a storage engine plugin
interface and Oracle made use of it by distributing a separate
InnoDB Plugin, which would contain some more bug fixes and
improvements, compared to the version of InnoDB that was statically
linked with the mysqld server that was distributed by MySQL AB.
The built-in InnoDB would export global symbols, which would clash
with the symbols of the dynamic InnoDB Plugin (which was supposed
to override the built-in one when present).
The solution to this problem was to declare all global symbols with
UNIV_INTERN, so that they would get the GCC function attribute that
specifies hidden visibility.
Later, in MariaDB Server, something based on Percona XtraDB (a fork of
MySQL InnoDB) became the statically linked implementation, and something
closer to MySQL InnoDB was available as a dynamic plugin. Starting with
version 10.2, MariaDB Server includes only one InnoDB implementation,
and hence any reason to have the UNIV_INTERN definition was lost.
btr_get_size_and_reserved(): Move to the same compilation unit with
the only caller.
innodb_set_buf_pool_size(): Remove. Modify innobase_buffer_pool_size
directly.
fil_crypt_calculate_checksum(): Merge to the only caller.
ha_innobase::innobase_reset_autoinc(): Merge to the only caller.
thd_query_start_micro(): Remove. Call thd_start_utime() directly.
The implementation of handlerton::drop_database in InnoDB is
unnecessarily complex. The minimal implementation should check
that no conflicting locks or references exist on the tables,
delete all table metadata in a single transaction, and finally
delete the tablespaces.
Note: DROP DATABASE will delete each individual table that the
SQL layer knows about, one table per transaction.
The handlerton::drop_database is basically a final cleanup step
for removing any garbage that could have been left behind
in InnoDB due to some bug, or not having atomic DDL in the past.
hash_node_t: Remove. Use the proper data type name in pointers.
dict_drop_index_tree(): Do not take the table as a parameter.
Instead, return the tablespace ID if the tablespace should be dropped
(we are dropping a clustered index tree).
fil_delete_tablespace(), fil_system_t::detach(): Return a single
detached file handle. Multi-file tablespaces cannot be deleted
via this interface.
ha_innobase::delete_table(): Remove a work-around for non-atomic DDL
and do not try to drop tables with similar-looking names.
innodb_drop_database(): Complete rewrite.
innobase_drop_database(), dict_get_first_table_name_in_db(),
row_drop_database_for_mysql(), drop_all_foreign_keys_in_db(): Remove.
row_purge_remove_clust_if_poss_low(), row_undo_ins_remove_clust_rec():
If the tablespace is to be deleted, try to evict the table definition
from the cache. Failing that, set dict_table_t::space to nullptr.
lock_release_on_rollback(): On the rollback of CREATE TABLE, release all
locks that the transaction had on the table, to avoid heap-use-after-free.
lock_discard_for_index(): New function, to discard locks for an
index whose index tree has been purged. By definition, such indexes
must be ones for which the MDL upgrade failed in inplace ALTER TABLE
and the ADD INDEX operation was never committed.
Note: Because we do not support online ADD SPATIAL INDEX, we only
have to traverse the lock_sys.rec_hash for B-trees and not the
hash tables for R-trees.
row_purge_remove_clust_if_poss_low(): Invoke lock_discard_for_index()
if necessary before dropping a B-tree for a SYS_INDEXES record.
Having both readers and writers use a single lock word in
futex system calls caused performance regression compared to
SRW_LOCK_DUMMY (mutex and 2 condition variables).
A contributing factor is that we did not accurately keep
track of the number of waiting threads and thus had to invoke
system calls to wake up any waiting threads.
SUX_LOCK_GENERIC: Renamed from SRW_LOCK_DUMMY. This is the
original implementation, with rw_lock (std::atomic<uint32_t>),
a mutex and two condition variables. Using a separate writer
mutex (as described below) is not possible, because the mutex ownership
in a buf_block_t::lock must be able to transfer from a write submitter
thread to an I/O completion thread, and pthread_mutex_lock() may assume
that the submitter thread is recursively acquiring the mutex that it
already holds, while in reality the I/O completion thread is the real
owner. POSIX does not define an interface for requesting a mutex to
be non-recursive.
On Microsoft Windows, srw_lock_low will remain a simple wrapper of
SRWLOCK. On 32-bit Microsoft Windows, sizeof(SRWLOCK)=4 while
sizeof(srw_lock_low)=8.
On other platforms, srw_lock_low is an alias of ssux_lock_low,
the Simple (non-recursive) Shared/Update/eXclusive lock.
In the futex-based implementation of ssux_lock_low (Linux, OpenBSD,
Microsoft Windows), we shall use a dedicated mutex for exclusive
requests (writer), and have a WRITER flag in the 'readers' lock word
to inform that a writer is holding the lock or waiting for the lock to
be granted. When the WRITER flag is set, all lock requests must acquire
the writer mutex. Normally, shared (S) lock requests simply perform a
compare-and-swap on the 'readers' word.
Update locks are implemented as a combination of writer mutex
and a normal counter in the 'readers' lock word. The conflict between
U and X locks is guaranteed by the writer mutex.
Unlike SUX_LOCK_GENERIC, wr_u_downgrade() will not wake up any pending
rd_lock() waits. They will wait until u_unlock() releases the writer mutex.
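The interplay of the writer mutex and the 'readers' word can be sketched
as below. This is a deliberately simplified model, not the real
ssux_lock_low: it uses std::mutex and a yield loop instead of futex waits,
and the class name sketch_ssux_lock is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>

class sketch_ssux_lock {
  static constexpr uint32_t WRITER = 1U << 31;
  std::mutex writer;                 // held by the U or X holder
  std::atomic<uint32_t> readers{0};  // count of S and U holders, plus WRITER flag
public:
  // Fast path for S: compare-and-swap, unless the WRITER flag is set.
  bool rd_lock_try()
  {
    uint32_t r = readers.load(std::memory_order_relaxed);
    while (!(r & WRITER))
      if (readers.compare_exchange_weak(r, r + 1, std::memory_order_acquire,
                                        std::memory_order_relaxed))
        return true;
    return false;
  }
  void rd_lock()
  {
    if (rd_lock_try())
      return;
    // Slow path: queue up behind the writer mutex. While we hold it,
    // nobody can set the WRITER flag, so the retry succeeds.
    std::lock_guard<std::mutex> g(writer);
    bool ok = rd_lock_try();
    assert(ok);
    (void) ok;
  }
  void rd_unlock() { readers.fetch_sub(1, std::memory_order_release); }

  void u_lock()
  {
    writer.lock();                                   // excludes other U and X
    readers.fetch_add(1, std::memory_order_acquire); // U is counted like a reader
  }
  void u_unlock()
  {
    readers.fetch_sub(1, std::memory_order_release);
    writer.unlock();                                 // lets pending requests proceed
  }

  void wr_lock()
  {
    writer.lock();
    readers.fetch_or(WRITER, std::memory_order_acquire);    // block new readers
    while (readers.load(std::memory_order_acquire) != WRITER)
      std::this_thread::yield();                            // wait for readers to drain
  }
  void wr_unlock()
  {
    readers.store(0, std::memory_order_release);
    writer.unlock();
  }
  // X to U: clear the WRITER flag, keep the writer mutex.
  void wr_u_downgrade() { readers.store(1, std::memory_order_release); }
};
```

In this model, a reader that entered the slow path while X was held stays
blocked on the writer mutex across wr_u_downgrade() and only proceeds at
u_unlock(), which mirrors the behavior described above. New readers that
arrive after the downgrade take the fast path.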
The ssux_lock_low is always wrapped by sux_lock (with a recursion count
of U and X locks), used for dict_index_t::lock and buf_block_t::lock.
Their memory footprint for the futex-based implementation will increase
by sizeof(srw_mutex), or 4 bytes.
This change addresses a performance regression in read-only benchmarks,
such as sysbench oltp_read_only. Also write performance was improved.
On 32-bit Linux and OpenBSD, lock_sys_t::hash_table will allocate
two hash table elements for each srw_lock (14 instead of 15 hash
table cells per 64-byte cache line on IA-32). On Microsoft Windows,
sizeof(SRWLOCK)==sizeof(void*) and there is no change.
Reviewed by: Vladislav Vaintroub
Tested by: Axel Schwenke and Vladislav Vaintroub
After the merging of MDEV-24915, the 10.6 branch has regressions in the handling of
concurrent write load against two or more cluster nodes. These regressions may
surface as cluster hangs, node crashes or data inconsistency. In some test
scenarios, the only visible symptom could be that BF victim aborting happens
only when the innodb lock wait timeout expires. This would only result in poor
performance (by default a 50 sec hang for each BF conflict), and could be somewhat
difficult to diagnose.
This pull request has the following fixes to handle concurrent write load from
multiple nodes:
In lock_wait_wsrep_kill(), the victim trx was expected to be only in
TRX_STATE_ACTIVE state. With the delayed BF conflict handling, it can happen
that victim has advanced into pre commit state. This was fixed by choosing
victim both in TRX_STATE_ACTIVE and TRX_STATE_PREPARED states.
Victim transaction may be in several different states at the time of detected
lock conflict, and due to delayed BF aborting practice in MDEV-24915, the victim
may advance further before the actual BF aborting takes place. The BF aborting
in MDEV-24915 did not wake the victim, if it was in the state of waiting for
some other lock (than the one that was blocking the high priority thread).
This anomaly caused the innodb lock wait timeout expiration delays and poor
performance symptom. To fix this, lock_wait_wsrep_kill() now checks whether
the victim is in a lock wait state, and uses lock_cancel_waiting_and_release()
to cancel this lock wait.
wsrep_bf_abort() checks if the victim has active transaction (in wsrep-lib),
and starts a new transaction if there was no active transaction before.
Due to late BF aborting, the victim may have e.g. failed in certification
and is already aborting or has aborted at this stage. This has caused
problems in testing where the BF aborter tries to BF abort itself.
The fix in wsrep_bf_abort() now skips the BF abort, if victim is aborting
or has aborted. Victim may not have started transaction yet in wsrep context,
but it may have acquired MDL locks (due to DDL execution), and this has
caused BF conflict. Such case does not require aborting in wsrep or
replication provider state.
BF aborting could cause BF-BF conflict scenario, if victim was already aborted
and changed to replayer having high priority as well. This BF-BF conflict
scenario is now avoided in lock_wait_wsrep(), where we now check whether the
blocking lock holder is also high priority and is ordered before the caller;
in this situation, the caller should wait for the lock.
The natural innodb deadlock resolving algorithm could pick a BF thread as the
deadlock victim. This is fixed by giving max weight to BF threads in
Deadlock::report().
MDEV-24341 has changed execution paths in do_command() and this affects BF
aborted victim execution. This PR fixes one assert in do_command():
DBUG_ASSERT(!thd->async_state.pending_ops())
which fired if the thd was BF aborted earlier. This assert is now changed
to allow pending_ops() if thd was BF aborted before.
With these fixes, a long-term, highly conflicting write load could be run against
a two-node cluster. If binlogging is configured, log_slave_updates should
also be set.
In commit e71e613353 (MDEV-24671),
lock_sys.wait_mutex was moved above lock_sys.mutex
(which was later replaced with lock_sys.latch) in the latching order.
In commit 7cf4419fc4 (MDEV-24789),
a potential hang was introduced to Galera. The function lock_wait()
would hold lock_sys.wait_mutex while invoking wsrep_is_BF_lock_timeout(),
which in turn could acquire LockMutexGuard for some diagnostic printout.
wsrep_is_BF_lock_timeout(): Do not invoke trx_print_latched() or
LockMutexGuard.
lock_sys_t::wr_lock(), lock_sys_t::rd_lock(): Assert that the current
thread is not holding lock_sys.wait_mutex.
Unfortunately, RW-locks are not covered by SAFE_MUTEX.
Reviewed by: Jan Lindström
If the user "opts in" (as in the parent
commit 92b2a911e5),
we can optimize multiple INSERT statements to use table-level locking
and undo logging.
There will be a change of behavior:
CREATE TABLE t(a PRIMARY KEY) ENGINE=InnoDB;
SET foreign_key_checks=0, unique_checks=0;
BEGIN; INSERT INTO t SET a=1; INSERT INTO t SET a=1; COMMIT;
will end up with an empty table, because in case of an error,
the entire transaction will be rolled back, instead of rolling
back the failing statement. Previously, the second INSERT statement
would have been logged row by row, and only that second statement
would have been rolled back, leaving the first INSERT intact.
lock_table_x_unlock(), trx_mod_table_time_t::WAS_BULK: Remove.
Because we cannot really support statement rollback in this
optimized mode, we will not optimize the locking. The exclusive
table lock will be held until the end of the transaction.
The maximum number of concurrently waiting transactions is one less
than the maximum number of concurrent transactions.
A 45-bit cumulative counter of lock waits will support more than
one million lock waits per second for a year.
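The arithmetic behind that claim can be checked at compile time. The
constant names are made up for this example, and a year is taken as 365
days.

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct values of a 45-bit counter.
constexpr uint64_t counter_capacity = uint64_t{1} << 45;
// Seconds in a 365-day year.
constexpr uint64_t seconds_per_year = 365ULL * 24 * 60 * 60;
// Sustained rate of lock waits that would take a full year to overflow it.
constexpr uint64_t waits_per_second = counter_capacity / seconds_per_year;

static_assert(waits_per_second > 1000000,
              "a 45-bit counter lasts a year at over a million waits per second");
```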
In commit 8d16da1487 (MDEV-24789)
we accidentally introduced a race condition. During the time a
waiting lock request is being removed, the request might be
moved to another page due to a concurrent page split or merge.
To prevent this, we must hold exclusive lock_sys.latch when releasing
a record lock.
lock_release_autoinc_locks(): Avoid a potential hang.
No dict_table_t::lock_mutex must be waited for while already holding
lock_sys.wait_mutex or trx_t::mutex.
lock_cancel_waiting_and_release(): Correctly handle AUTO_INCREMENT locks.
lock_sys_t::deadlock_check(): Assume that only lock_sys.wait_mutex
is being held by the caller.
lock_sys_t::rd_lock_try(): New function.
lock_sys_t::cancel(trx_t*): Kill an active transaction that may be
holding a lock.
lock_sys_t::cancel(trx_t*, lock_t*): Cancel a waiting lock request.
lock_trx_handle_wait(): Avoid acquiring mutexes in some cases,
and never acquire lock_sys.latch in exclusive mode.
This function is only invoked in a semi-consistent read
(locking a clustered index record only if it matches the search condition).
Normally, lock_wait() will take care of lock waits.
lock_wait(): Invoke the new function lock_sys_t::cancel() at the end,
to avoid acquiring exclusive lock_sys.latch.
lock_rec_other_trx_holds_expl(): Use LockGuard instead of LockMutexGuard.
lock_release_autoinc_locks(): Explicitly acquire table->lock_mutex,
in case only a shared lock_sys.latch is being held. Deadlock::report()
will still hold exclusive lock_sys.latch while invoking
lock_cancel_waiting_and_release().
lock_cancel_waiting_and_release(): Acquire trx->mutex in this function,
instead of expecting the caller to do so.
lock_unlock_table_autoinc(): Only acquire shared lock_sys.latch.
lock_table_has_locks(): Do not acquire lock_sys.latch at all.
Deadlock::check_and_resolve(): Only acquire shared lock_sys.latch
for invoking lock_sys_t::cancel(trx, wait_lock).
innobase_query_caching_table_check_low(),
row_drop_tables_for_mysql_in_background(): Do not acquire lock_sys.latch.
A performance regression was introduced by
commit e71e613353 (MDEV-24671)
and mostly addressed by
commit 455514c800.
The regression is likely caused by increased contention on
lock_sys.latch (former lock_sys.mutex), possibly indirectly
caused by contention on lock_sys.wait_mutex. This change aims to
reduce both, but further improvements will be needed.
lock_wait(): Minimize the lock_sys.wait_mutex hold time.
lock_sys_t::deadlock_check(): Add a parameter for indicating
whether lock_sys.latch is exclusively locked.
trx_t::was_chosen_as_deadlock_victim: Always use atomics.
lock_wait_wsrep(): Assume that no mutex is being held.
Deadlock::report(): Always kill the victim transaction.
lock_sys_t::timeout: New counter to back MONITOR_TIMEOUT.
Let us calculate the hash table cell address while we are calculating
the latch address, to avoid repeated computations of the address.
The latch address can be derived from the cell address with a simple
bitmask operation.
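A minimal sketch of that bitmask derivation follows. It assumes that the
array is aligned to the cache line size and that the latch occupies the
first slot of each cache line. LINE, slot and latch_of() are hypothetical
names for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t LINE = 64;   // assumed cache line size in bytes

// One pointer-sized slot; holds either a latch or a hash table cell.
struct slot { void *p; };

// Address of the first slot in the cache line that contains `cell`.
// Valid only if the slot array itself is aligned to LINE bytes.
inline slot *latch_of(slot *cell)
{
  return reinterpret_cast<slot*>(reinterpret_cast<uintptr_t>(cell) &
                                 ~uintptr_t(LINE - 1));
}
```

Once the cell address has been computed from the hash value, no second
division or modulo is needed to find the latch that protects it.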
The fix of MDEV-23328 introduced a background thread for
killing conflicting transactions.
Thanks to the refactoring that was conducted in MDEV-24671,
the high-priority ("brute-force") applier thread can kill the
conflicting transactions itself, before waiting for the
locks to be finally released (after the conflicting transactions
have been rolled back).
This also allows us to remove the hack LockGGuard that had to
be added in MDEV-20612, and remove Galera-related function
parameters from lock creation.
A new configuration parameter innodb_deadlock_report is introduced:
* innodb_deadlock_report=off: Do not report any details of deadlocks.
* innodb_deadlock_report=basic: Report transactions and waiting locks.
* innodb_deadlock_report=full (default): Report also the blocking locks.
The improved deadlock checker will consider all involved transactions
in one loop, even if the deadlock loop includes several transactions.
The theoretical maximum number of transactions that can be involved in
a deadlock is `innodb_page_size` * 8, limited by the persistent data
structures.
Note: Similar to
mysql/mysql-server@3859219875
our deadlock checker will consider at most one blocking transaction
for each waiting transaction. The new field trx->lock.wait_trx will be
nullptr if and only if trx->lock.wait_lock is nullptr. Note that
trx->lock.wait_lock->trx == trx (the waiting transaction), while
trx->lock.wait_trx points to one of the transactions whose lock is
conflicting with trx->lock.wait_lock.
Considering only one blocking transaction will greatly simplify
our deadlock checker, but it may also make the deadlock checker
blind to some deadlocks where the deadlock cycle is 'hidden' by
the fact that the registered trx->lock.wait_trx is not actually
waiting for any InnoDB lock, but something else. So, instead of
deadlocks, sometimes lock wait timeout may be reported.
To improve on this, whenever trx->lock.wait_trx is changed, we
will register further 'candidate' transactions in Deadlock::to_check(),
and check for 'revealed' deadlocks as soon as possible, in lock_release()
and innobase_kill_query().
The old DeadlockChecker was holding lock_sys.latch, even though using
lock_sys.wait_mutex should be less contended (and thus preferred)
in the likely case that no deadlock is present.
lock_wait(): Defer the deadlock check to this function, instead of
executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting().
DeadlockChecker: Complete rewrite:
(1) Explicitly keep track of transactions that are being waited for,
in trx->lock.wait_trx, protected by lock_sys.wait_mutex. Previously,
we were painstakingly traversing the lock heaps while blocking
concurrent registration or removal of any locks (even uncontended ones).
(2) Use Brent's cycle-detection algorithm for deadlock detection,
traversing each trx->lock.wait_trx edge at most 2 times.
(3) If a deadlock is detected, release lock_sys.wait_mutex,
acquire LockMutexGuard, re-acquire lock_sys.wait_mutex and re-invoke
find_cycle() to find out whether the deadlock is still present.
(4) Display information on all transactions that are involved in the
deadlock, and choose a victim to be rolled back.
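The cycle search in step (2) can be illustrated with a minimal sketch.
It assumes a simplified stand-in type (Trx here is hypothetical, not the
real trx_t) in which each transaction has at most one outgoing waits-for
edge, as described for trx->lock.wait_trx. It is not the actual
Deadlock::find_cycle() code, and it omits all mutex handling.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for trx_t: a waiting transaction records at most
// one blocking transaction (compare trx->lock.wait_trx).
struct Trx {
  Trx *wait_trx = nullptr;
};

// Brent's cycle detection on a graph with one outgoing edge per node.
// Returns a transaction that lies on a cycle reachable from start,
// or nullptr if the waits-for chain ends.
Trx *find_cycle(Trx *start) {
  Trx *tortoise = start;
  Trx *hare = start;
  std::size_t power = 1;   // current search window, doubled when exhausted
  std::size_t length = 0;  // steps taken since the tortoise was moved
  while ((hare = hare->wait_trx) != nullptr) {
    if (hare == tortoise)
      return hare;         // the hare came back to the tortoise: a cycle
    if (++length == power) {
      tortoise = hare;     // teleport the tortoise and widen the window
      power *= 2;
      length = 0;
    }
  }
  return nullptr;          // reached a transaction that is not waiting
}
```

Because the tortoise only moves in power-of-two jumps, the hare does all
the pointer chasing, which keeps the number of edge traversals small.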
lock_sys.deadlocks: Replaces lock_deadlock_found. Protected by wait_mutex.
Deadlock::find_cycle(): Quickly find a cycle of trx->lock.wait_trx...
using Brent's cycle detection algorithm.
Deadlock::report(): Report a deadlock cycle that was found by
Deadlock::find_cycle(), and choose a victim with the least weight.
Altogether, we may traverse each trx->lock.wait_trx edge up to 5
times (at most 2 times in each of the 2 find_cycle() invocations, plus
1 time for reporting and choosing the victim).
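The single reporting pass can be sketched as one walk around the cycle
that remembers the lightest transaction. The WaitingTrx type and its
weight field are assumptions made for illustration; how the real
Deadlock::report() computes weights is not covered here.

```cpp
#include <cassert>

// Hypothetical stand-in for a transaction on a deadlock cycle.
struct WaitingTrx {
  WaitingTrx *wait_trx = nullptr;  // the one registered blocker
  unsigned weight = 0;             // assumed cost of rolling back
};

// Walk the cycle once, starting from a transaction that is known to be
// on it, and return the transaction with the least weight.
// Precondition: on_cycle really lies on a cycle of wait_trx edges.
WaitingTrx *choose_victim(WaitingTrx *on_cycle) {
  WaitingTrx *victim = on_cycle;
  for (WaitingTrx *t = on_cycle->wait_trx; t != on_cycle; t = t->wait_trx)
    if (t->weight < victim->weight)
      victim = t;
  return victim;
}
```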
Deadlock::check_and_resolve(): Find and resolve a deadlock.
lock_wait_rpl_report(): Report the waits-for information to
replication. This used to be executed as part of DeadlockChecker.
Replication must know the waits-for relations even if no deadlocks
are present in InnoDB.
Reviewed by: Vladislav Vaintroub
lock_release_try(): Try to release locks while only holding
shared lock_sys.latch.
lock_release(): If 5 attempts at lock_release_try() fail,
proceed to acquire exclusive lock_sys.latch.
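The retry-then-escalate pattern can be sketched as follows. This is a
minimal illustration under stated assumptions: ShardLatch, HeldLock,
release_try() and release_all() are hypothetical names, and a
std::shared_mutex stands in for lock_sys.latch. The real functions
operate on the lock hash tables and on trx->lock.trx_locks.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical try-lockable latch that protects one hash shard or table.
struct ShardLatch {
  std::atomic<bool> busy{false};
  bool try_lock() { return !busy.exchange(true, std::memory_order_acquire); }
  void unlock() { busy.store(false, std::memory_order_release); }
};

// Hypothetical record of one lock held by the committing transaction.
struct HeldLock {
  ShardLatch *shard;
  bool held;
};

// One optimistic pass: hold the global latch in shared mode and release
// every lock whose shard latch can be acquired without waiting.
bool release_try(std::shared_mutex &global_latch,
                 std::vector<HeldLock> &locks) {
  std::shared_lock<std::shared_mutex> shared(global_latch);
  bool all_released = true;
  for (HeldLock &l : locks) {
    if (!l.held)
      continue;
    if (l.shard->try_lock()) {
      l.held = false;
      l.shard->unlock();
    } else {
      all_released = false;  // contended shard: leave it for a retry
    }
  }
  return all_released;
}

// Returns true if the exclusive fallback was needed.
bool release_all(std::shared_mutex &global_latch,
                 std::vector<HeldLock> &locks) {
  for (int attempt = 0; attempt < 5; attempt++)
    if (release_try(global_latch, locks))
      return false;
  // Assumption of this sketch: shard latches are only taken under the
  // shared global latch, so the exclusive holder needs no shard latch.
  std::unique_lock<std::shared_mutex> exclusive(global_latch);
  for (HeldLock &l : locks)
    l.held = false;
  return true;
}
```

The optimistic passes never block on a shard latch, so a committing
transaction does not stall other threads unless contention persists.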
We replace the old lock_sys.mutex (which was renamed to lock_sys.latch)
with a combination of a global lock_sys.latch and table or page hash lock
mutexes.
The global lock_sys.latch can be acquired in exclusive mode, or
it can be acquired in shared mode and another mutex will be acquired
to protect the locks for a particular page or a table.
This is inspired by
mysql/mysql-server@1d259b87a6
but the optimization of lock_release() will be done in the next commit.
Also, we will interleave mutexes with the hash table elements, similar
to how buf_pool.page_hash was optimized
in commit 5155a300fa (MDEV-22871).
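A minimal sketch of the two-level scheme, assuming a fixed number of
shards and std::shared_mutex in place of lock_sys.latch. The names
ShardedLockTable, ShardGuard, add_lock() and total_locks() are
hypothetical, and the per-shard counter merely stands in for the lock
lists. The real LockGuard operates on lock_sys hash table cells.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>

// Hypothetical lock table: one global latch plus one mutex per shard.
struct ShardedLockTable {
  static constexpr std::size_t N_SHARDS = 16;
  std::shared_mutex latch;                        // global latch
  std::array<std::mutex, N_SHARDS> shard_mutex;   // per-shard protection
  std::array<int, N_SHARDS> n_locks{};            // stand-in for lock lists
  static std::size_t shard_of(std::size_t page_id) {
    return page_id % N_SHARDS;
  }
};

// RAII guard: global latch in shared mode, then the shard mutex.
// Members are destroyed in reverse declaration order, so the shard
// mutex is always released before the global latch.
class ShardGuard {
  std::shared_lock<std::shared_mutex> global_;
  std::unique_lock<std::mutex> shard_;
public:
  ShardGuard(ShardedLockTable &t, std::size_t page_id)
    : global_(t.latch),
      shard_(t.shard_mutex[ShardedLockTable::shard_of(page_id)]) {}
};

// Common path: only the affected shard is serialized.
void add_lock(ShardedLockTable &t, std::size_t page_id) {
  ShardGuard guard(t, page_id);
  t.n_locks[ShardedLockTable::shard_of(page_id)]++;
}

// Rare path: the exclusive global latch covers every shard at once.
int total_locks(ShardedLockTable &t) {
  std::unique_lock<std::shared_mutex> exclusive(t.latch);
  int total = 0;
  for (int n : t.n_locks)
    total += n;
  return total;
}
```

Operations on different shards can proceed concurrently, while anything
that needs a consistent view of all locks takes the global latch in
exclusive mode.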
dict_table_t::autoinc_trx: Use Atomic_relaxed.
dict_table_t::autoinc_mutex: Use srw_mutex in order to reduce the
memory footprint. On 64-bit Linux or OpenBSD, both this and the new
dict_table_t::lock_mutex should be 32 bits and be stored in the same
64-bit word. On Microsoft Windows, the underlying SRWLOCK is 32 or 64
bits, and on other systems, sizeof(pthread_mutex_t) can be much larger.
ib_lock_t::trx_locks, trx_lock_t::trx_locks: Document the new rules.
Writers must assert lock_sys.is_writer() || trx->mutex_is_owner().
LockGuard: A RAII wrapper for acquiring a page hash table lock.
LockGGuard: Like LockGuard, but when Galera Write-Set Replication
is enabled, we must acquire all shards, for updating arbitrary trx_locks.
LockMultiGuard: A RAII wrapper for acquiring two page hash table locks.
lock_rec_create_wsrep(), lock_table_create_wsrep(): Special
Galera conflict resolution in non-inlined functions in order
to keep the common code paths shorter.
lock_sys_t::prdt_page_free_from_discard(): Refactored from
lock_prdt_page_free_from_discard() and
lock_rec_free_all_from_discard_page().
trx_t::commit_tables(): Replaces trx_update_mod_tables_timestamp().
lock_release(): Let trx_t::commit_tables() invalidate the query cache
for those tables that were actually modified by the transaction.
Merge lock_check_dict_lock() to lock_release().
We must never release lock_sys.latch while holding any
lock_sys_t::hash_latch. Violating this rule could lead to
memory corruption if the buffer pool is resized between
the time lock_sys.latch is released and the hash_latch is released.
For now, we will acquire the lock_sys.latch only in exclusive mode,
that is, use it as a mutex.
This is preparation for the next commit where we will introduce
a less intrusive alternative, combining a shared lock_sys.latch
with dict_table_t::lock_mutex or a mutex embedded in
lock_sys.rec_hash, lock_sys.prdt_hash, or lock_sys.prdt_page_hash.
lock_table(): Remove the constant parameter flags=0.
lock_table_resurrect(): Merge lock_table_ix_resurrect() and
lock_table_x_resurrect().
lock_rec_lock(): Only acquire LockMutexGuard if lock_table_has()
does not hold.
The DeadlockChecker expects to be able to freeze the waits-for graph.
Hence, it is best executed somewhere where we are not holding any
additional mutexes.
lock_wait(): Defer the deadlock check to this function, instead
of executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting().
DeadlockChecker::trx_rollback(): Merge with the only caller,
check_and_resolve().
LockMutexGuard: RAII accessor for lock_sys.mutex.
lock_sys.deadlocks: Replaces lock_deadlock_found.
trx_t: Clean up some comments.