Typically, index_lock and fil_space_t::latch will be held for a longer
time than a spin loop in latch acquisition would wait.
Let us avoid spin loops for those as well as dict_sys.latch, which
could be held in exclusive mode for a longer time (while loading
metadata into the buffer pool and the dictionary cache).
Performance testing on a dual Intel Xeon E5-2630 v4 (2 NUMA nodes)
suggests that the buffer pool page latch (block_lock) benefits from a
spin loop in both read-only and read-write workloads where the working
set is slightly larger than the buffer pool. Presumably, most contention
would occur on leaf page latches. Contention on upper level pages in
the buffer pool should intuitively last longer.
We introduce srw_spin_lock and srw_spin_mutex to allow users of
srw_lock or srw_mutex to opt in for the spin loop.
On Microsoft Windows, a spin loop variant is not and will not be available;
srw_mutex and srw_lock will simply wrap SRWLOCK.
That is, on Microsoft Windows, the parameters innodb_sync_spin_loops
and innodb_spin_wait_delay will only affect block_lock.
srw_mutex::wait_and_lock(): In the spin loop, we will try to poll
for non-conflicting lock word state by reads, avoiding any writes.
We invoke explicit std::atomic_thread_fence(std::memory_order_acquire)
before returning. The individual operations on the lock word
can use memory_order_relaxed.
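A minimal sketch of that spin loop, assuming the HOLDER bit of the
srw_mutex lock word and InnoDB's ut_delay()/srv_spin_wait_delay pause
(names and structure are illustrative, not the exact source):

  bool spin_lock_try(std::atomic<uint32_t> &lk_word, unsigned spin_rounds)
  {
    for (unsigned i= 0; i < spin_rounds; i++)
    {
      uint32_t lk= lk_word.load(std::memory_order_relaxed);
      // Write to the lock word only when it looks acquirable.
      if (!(lk & HOLDER) &&
          !(lk_word.fetch_or(HOLDER, std::memory_order_relaxed) & HOLDER))
      {
        // All operations above were relaxed; publish the acquisition once.
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
      }
      ut_delay(srv_spin_wait_delay);
    }
    return false;  // caller falls back to the futex-based wait
  }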
srw_mutex::lock: Document that the value for a single writer is
HOLDER+1 instead of HOLDER.
srw_mutex::wr_lock_try(), srw_mutex::wr_unlock(): Adjust the value
of the lock word of a single writer from HOLDER to HOLDER+1.
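A hedged sketch of the adjusted fast paths (member and helper names are
assumptions; the point is that the holder counts itself, so an uncontended
writer owns the value HOLDER + 1):

  bool wr_lock_try()
  {
    uint32_t lk= 0;
    return lock.compare_exchange_strong(lk, HOLDER + 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed);
  }

  void wr_unlock()
  {
    // If anything remains after subtracting our own HOLDER + 1, at least
    // one waiter is registered and must be woken up (futex wake, assumed).
    if (lock.fetch_sub(HOLDER + 1, std::memory_order_release) != HOLDER + 1)
      wake_one();
  }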
The U-to-X upgrade turned out to be incorrect. A debug assertion
failed in wr_wait(), called from mtr_defer_drop_ahi() in a stress
test with innodb_adaptive_hash_index=ON.
A correct upgrade procedure ought to be readers.fetch_add(WRITER-1)
to register ourselves as a WRITER (or waiting writer) and to release
the reference that was being held for the U lock.
Thanks to Matthias Leich for catching the problem.
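A hedged sketch of that corrected upgrade (the wr_wait() argument and
member names are assumptions; the writer mutex is already held because of
the U lock):

  void u_wr_upgrade()
  {
    // Register as WRITER and drop the single reference held for the U lock.
    uint32_t lk= readers.fetch_add(WRITER - 1, std::memory_order_acquire);
    if (lk != 1)         // other readers exist: wait until they are gone
      wr_wait(lk - 1);
  }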
Having both readers and writers use a single lock word in
futex system calls caused a performance regression compared to
SRW_LOCK_DUMMY (mutex and 2 condition variables).
A contributing factor is that we did not accurately keep
track of the number of waiting threads and thus had to invoke
system calls to wake up any waiting threads.
SUX_LOCK_GENERIC: Renamed from SRW_LOCK_DUMMY. This is the
original implementation, with rw_lock (std::atomic<uint32_t>),
a mutex and two condition variables. Using a separate writer
mutex (as described below) is not possible, because the mutex ownership
in a buf_block_t::lock must be able to transfer from a write submitter
thread to an I/O completion thread, and pthread_mutex_lock() may assume
that the submitter thread is recursively acquiring the mutex that it
already holds, while in reality the I/O completion thread is the real
owner. POSIX does not define an interface for requesting a mutex to
be non-recursive.
On Microsoft Windows, srw_lock_low will remain a simple wrapper of
SRWLOCK. On 32-bit Microsoft Windows, sizeof(SRWLOCK)=4 while
sizeof(srw_lock_low)=8.
On other platforms, srw_lock_low is an alias of ssux_lock_low,
the Simple (non-recursive) Shared/Update/eXclusive lock.
In the futex-based implementation of ssux_lock_low (Linux, OpenBSD,
Microsoft Windows), we shall use a dedicated mutex for exclusive
requests (writer), and have a WRITER flag in the 'readers' lock word
to inform that a writer is holding the lock or waiting for the lock to
be granted. When the WRITER flag is set, all lock requests must acquire
the writer mutex. Normally, shared (S) lock requests simply perform a
compare-and-swap on the 'readers' word.
Update locks are implemented as a combination of the writer mutex
and a normal counter in the 'readers' lock word. The conflict between
U and X locks is guaranteed by the writer mutex.
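A hedged sketch of the fast paths under this design (member and helper
names are illustrative, not the exact source):

  void rd_lock()
  {
    uint32_t lk= 0;
    while (!readers.compare_exchange_weak(lk, lk + 1,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed))
    {
      if (lk & WRITER)     // a writer holds or is waiting for the lock
        return rd_wait();  // slow path: goes through the writer mutex
      // otherwise retry with the refreshed value of lk
    }
  }

  void u_lock()
  {
    writer.wr_lock();                                 // dedicated mutex
    readers.fetch_add(1, std::memory_order_acquire);  // count like a reader
  }

  void wr_lock()
  {
    writer.wr_lock();
    // Announce the writer; wait for already granted S or U references.
    if (uint32_t lk= readers.fetch_or(WRITER, std::memory_order_acquire))
      wr_wait(lk);
  }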
Unlike SUX_LOCK_GENERIC, wr_u_downgrade() will not wake up any pending
rd_lock() waits. They will wait until u_unlock() releases the writer mutex.
The ssux_lock_low is always wrapped by sux_lock (with a recursion count
of U and X locks), used for dict_index_t::lock and buf_block_t::lock.
Their memory footprint for the futex-based implementation will increase
by sizeof(srw_mutex), or 4 bytes.
This change addresses a performance regression in read-only benchmarks,
such as sysbench oltp_read_only. Write performance was also improved.
On 32-bit Linux and OpenBSD, lock_sys_t::hash_table will allocate
two hash table elements for each srw_lock (14 instead of 15 hash
table cells per 64-byte cache line on IA-32). On Microsoft Windows,
sizeof(SRWLOCK)==sizeof(void*) and there is no change.
Reviewed by: Vladislav Vaintroub
Tested by: Axel Schwenke and Vladislav Vaintroub
On Linux, OpenBSD and Microsoft Windows, srw_mutex was an alias for a
rw-lock while we only need mutex functionality. Let us implement a
futex-based mutex with one bit for HOLDER and 31 bits for counting
waiting requests.
srw_lock::wr_unlock() can avoid waking up a waiter when no waiting
requests exist. (Previously, we only had the 1-bit rw_lock::WRITER_WAITING
flag, which could be wrongly cleared when multiple wr_lock() waiters existed.
Now we have no problem with up to 2,147,483,648 conflicting threads.)
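A hedged sketch of the slow path of such a mutex (futex_wait() stands for
the platform primitive, e.g. SYS_futex or WaitOnAddress; not the exact
source):

  void wait_and_lock()
  {
    // Register in the 31-bit waiting-request counter; this count is what
    // lets wr_unlock() skip the wake-up system call when nobody is waiting.
    uint32_t lk= 1 + lock.fetch_add(1, std::memory_order_relaxed);
    for (;;)
    {
      if (lk & HOLDER)
      {
        futex_wait(&lock, lk);    // sleep until the lock word changes
        lk= lock.load(std::memory_order_relaxed);
      }
      else if (lock.compare_exchange_weak(lk, lk | HOLDER,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed))
        return;                   // acquired; we stay counted in the word
    }
  }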
On 64-bit Microsoft Windows, the advantage is that
sizeof(srw_mutex) is 4, while sizeof(SRWLOCK) would be 8.
Reviewed by: Vladislav Vaintroub
mtr_defer_drop_ahi(): Upgrade the U lock to X lock and downgrade
it back to U lock in case the adaptive hash index needs to be dropped.
This regression was introduced in
commit 03ca6495df (MDEV-24142).
WaitOnAddress() turns out to be too CPU-heavy for the specific scenario,
which makes it prominent in profiler output on several benchmarks with
contended sux_lock.
The condition variable implementation does not show the same behavior.
Thus, let us define SRWLOCK_DUMMY for Microsoft Windows.
srw_mutex should remain mapped to SRWLOCK on Windows (since SRWLOCK is
smaller).
This conceptually reverts commit 1fdc161d8f
and reintroduces an option for srw_lock to wrap a native implementation.
The srw_lock and srw_lock_low differ from ssux_lock and ssux_lock_low
in that Slim SUX locks support three modes (Shared, Update, eXclusive)
while Slim RW locks support only two (Read, Write).
On Microsoft Windows, the srw_lock will be implemented by SRWLOCK.
On Linux and OpenBSD, it will be implemented by rw_lock and the
futex system call, just like earlier.
On other systems, or if SRW_LOCK_DUMMY is defined on anything other
than Microsoft Windows, rw_lock_t will be used.
ssux_lock_low::read_lock(), ssux_lock_low::update_lock(): Correct
the SRW_LOCK_DUMMY implementation to prevent hangs. The intention of
commit 1fdc161d8f seems to have been
to use do ... while loops, but the 'do' keyword was missing. This total
breakage was missed in commit 260161fc9f,
which did reduce the probability of the hangs.
ssux_lock_low::u_unlock(): In the SRW_LOCK_DUMMY implementation
(based on a mutex and two condition variables), always invoke
writer_wake() in order to ensure that a waiting update_lock()
will be woken up.
ssux_lock_low::writer_wait(), ssux_lock_low::readers_wait():
In the SRW_LOCK_DUMMY implementation, keep waiting for the signal
until the lock word has changed. The 'if' has been changed back to 'while'
in order to avoid hangs.
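A hedged sketch of the corrected wait (member names are illustrative; the
point is the 'while', which re-checks the lock word after every wakeup,
including spurious ones):

  void writer_wait(uint32_t lk)
  {
    pthread_mutex_lock(&mutex);
    while (word.load(std::memory_order_relaxed) == lk)
      pthread_cond_wait(&cond_exclusive, &mutex);
    pthread_mutex_unlock(&mutex);
  }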
sux_lock::recursive: Move right after the 32-bit sux_lock::lock.
This will reduce sizeof(block_lock) from 24 to 16 bytes on
64-bit systems with CMAKE_BUILD_TYPE=RelWithDebInfo. This may be
significant, because there will be one buf_block_t::lock for each
buffer pool page descriptor.
We still have some potential for savings, with sizeof(buf_page_t)==112
and sizeof(buf_block_t)==184 on a GNU/Linux AMD64 system.
Note: On GNU/Linux AMD64, sizeof(index_lock) remains 32 bytes
(16 with PLUGIN_PERFSCHEMA=NO) even though it would fit in 24 bytes.
This is because sizeof(srw_lock) includes 4 bytes of padding
(to 16 bytes) that index_lock_t::recursive cannot reuse. So,
in total 4+4 bytes will be lost to padding. This is rather
insignificant compared to sizeof(dict_index_t)==400.
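The saving is plain structure padding; a generic LP64 illustration (member
names and types simplified, not the actual sux_lock definition):

  struct layout_before { std::atomic<uint32_t> lock;
                         std::atomic<uint64_t> writer;  // 4-byte hole before
                         uint32_t recursive; };         // sizeof == 24
  struct layout_after  { std::atomic<uint32_t> lock;
                         uint32_t recursive;            // fills the hole
                         std::atomic<uint64_t> writer; }; // sizeof == 16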
Let us remove sux_lock::waits and the associated bookkeeping.
Starting with commit 1669c8890c,
the PERFORMANCE_SCHEMA instrumentation interface has been keeping
track of lock waits.
The view INFORMATION_SCHEMA.INNODB_MUTEXES only exported counts
of rw-lock waits.
Also, SHOW ENGINE INNODB MUTEX will no longer export any information
about rw-locks.
InnoDB buffer pool block and index tree latches depend on a
special kind of read-update-write lock that allows reentrant
(recursive) acquisition of the 'update' and 'write' locks
as well as an upgrade from 'update' lock to 'write' lock.
The 'update' lock allows any number of reader locks from
other threads, but no concurrent 'update' or 'write' lock.
If there were no requirement to support an upgrade from 'update'
to 'write', we could compose the lock out of two srw_lock
(implemented as any type of native rw-lock, such as SRWLOCK on
Microsoft Windows). Removing this requirement is very difficult,
so in commit f7e7f487d4b06695f91f6fbeb0396b9d87fc7bbf we
implemented an 'update' mode for our srw_lock.
Re-entrant or recursive locking is mostly needed when writing or
freeing BLOB pages, but also in crash recovery or when merging
buffered changes to an index page. The re-entrancy allows us to
attach a previously acquired page to a sub-mini-transaction that
will be committed before whatever else is holding the page latch.
The SUX lock supports Shared ('read'), Update, and eXclusive ('write')
locking modes. The S latches are not re-entrant, but a single S latch
may be acquired even if the thread already holds an U latch.
The idea of the U latch is to allow a write of something that concurrent
readers do not care about (such as the contents of BTR_SEG_LEAF,
BTR_SEG_TOP and other page allocation metadata structures, or
the MDEV-6076 PAGE_ROOT_AUTO_INC). (The PAGE_ROOT_AUTO_INC field
is only updated when a dict_table_t for the table exists, and only
read when a dict_table_t for the table is being added to dict_sys.)
block_lock::u_lock_try(bool for_io=true) is used in buf_flush_page()
to allow concurrent readers but no concurrent modifications while the
page is being written to the data file. That latch will be released
by buf_page_write_complete() in a different thread. Hence, we use
the special lock owner value FOR_IO.
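A hedged sketch of that FOR_IO pattern (simplified pseudocode of what
buf_flush_page() and buf_page_write_complete() do, not their bodies):

  if (block->lock.u_lock_try(true))  // for_io=true: lock owner becomes FOR_IO
  {
    // Submit the asynchronous page write; concurrent readers remain
    // allowed, but nobody may modify the page until the write completes.
  }
  // ... later, buf_page_write_complete() in an I/O completion thread
  // releases the FOR_IO U latch once the data file write has finished.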
The index_lock::u_lock() improves concurrency on operations that
involve non-leaf index pages.
The interface has been cleaned up a little. We will use
x_lock_recursive() instead of x_lock() when we know that a
lock is already held by the current thread. Similarly,
a lock upgrade from U to X is only allowed via u_x_upgrade()
or x_lock_upgraded() but not via x_lock().
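Illustrative usage only (assuming the member functions named above, plus
u_lock() and x_unlock()):

  index->lock.u_lock();            // allow concurrent readers
  /* modify non-leaf allocation metadata */
  index->lock.u_x_upgrade();       // now exclude readers as well
  /* critical part */
  index->lock.x_unlock();

  // When the current thread is known to hold the latch already:
  index->lock.x_lock_recursive();  // instead of x_lock()
  /* ... */
  index->lock.x_unlock();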
We will disable the LatchDebug and sync_array interfaces to
InnoDB rw-locks.
The SEMAPHORES section of SHOW ENGINE INNODB STATUS output
will no longer include any information about InnoDB rw-locks,
only TTASEventMutex (cmake -DMUTEXTYPE=event) waits.
This will make a part of the 'innotop' script dead code.
The block_lock buf_block_t::lock will not be covered by any
PERFORMANCE_SCHEMA instrumentation.
SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES
will no longer output source code file names or line numbers.
The dict_index_t::lock will be identified by index and table names,
which should be much more useful. PERFORMANCE_SCHEMA is lumping
information about all dict_index_t::lock together as
event_name='wait/synch/sxlock/innodb/index_tree_rw_lock'.
buf_page_free(): Remove the file,line parameters. The sux_lock will
not store such diagnostic information.
buf_block_dbg_add_level(): Define as empty macro, to be removed
in a subsequent commit.
Unless the build was configured with cmake -DPLUGIN_PERFSCHEMA=NO,
the index_lock dict_index_t::lock will be instrumented via
PERFORMANCE_SCHEMA. Similar to
commit 1669c8890c
we will distinguish lock waits by registering shared_lock,exclusive_lock
events instead of try_shared_lock,try_exclusive_lock.
Actual 'try' operations will not be instrumented at all.
rw_lock_list: Remove. After MDEV-24167, this only covered
buf_block_t::lock and dict_index_t::lock. We will output their
information by traversing buf_pool or dict_sys.
The PERFORMANCE_SCHEMA insists on distinguishing read-update-write
locks from read-write locks, so we must add
template<bool support_u_lock> in rd_lock() and wr_lock() operations.
rw_lock::read_trylock(): Add template<bool prioritize_updater=false>
which is used by the srw_lock_low::read_lock() loop. As long as
an UPDATE lock has already been granted to some thread, we will grant
subsequent READ lock requests even if a waiting WRITE lock request
exists. This will be necessary to remain compatible with the existing usage
pattern of InnoDB rw_lock_t, where the holder of an SX-latch (which we
will rename to UPDATE latch) may acquire an additional S-latch
on the same object. For normal read-write locks without update operations
this should make no difference at all, because the rw_lock::UPDATER
flag would never be set.
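A hedged sketch of the prioritize_updater logic (flag and member names
follow the description above; this is not the exact rw_lock source):

  template<bool prioritize_updater= false>
  bool read_trylock(uint32_t &l)
  {
    l= value.load(std::memory_order_relaxed);
    for (;;)
    {
      // A new S lock is refused if a writer holds the lock, or if a writer
      // is waiting and we may not bypass it; bypassing is only allowed when
      // prioritize_updater is set and an UPDATE lock has been granted.
      if ((l & WRITER) ||
          ((l & WRITER_WAITING) && !(prioritize_updater && (l & UPDATER))))
        return false;
      if (value.compare_exchange_weak(l, l + 1, std::memory_order_acquire,
                                      std::memory_order_relaxed))
        return true;
    }
  }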
Let us try to avoid code bloat for the common case that
performance_schema is disabled at runtime, and use
ATTRIBUTE_NOINLINE member functions for instrumented latch acquisition.
Also, let us distinguish lock waits from non-contended lock requests
by using write_lock,read_lock for the requests that lead to waits,
and try_write_lock,try_read_lock for the wait-free lock acquisitions.
Actual 'try' operations are not being instrumented at all.
In commit 1fdc161d8f we introduced
a mutex-and-condition-variable based fallback implementation
for platforms that lack a futex system call. That implementation
is prone to hangs.
Let us use separate condition variables for shared and exclusive requests.
Let us always base srw_lock on our own std::atomic<uint32_t>
based rw_lock. In this way, we can extend the locks in a portable
way across all platforms.
We will use futex system calls where available:
Linux, OpenBSD, and Microsoft Windows.
Elsewhere, we will emulate futex with a mutex and a condition variable.
Thanks to Daniel Black for testing this on OpenBSD.
Many InnoDB rw-locks unnecessarily depend on the complex
InnoDB rw_lock_t implementation that supports the SX lock mode
as well as recursive acquisition of X or SX locks.
One of them is the bunch of adaptive hash index search latches,
instrumented as btr_search_latch in PERFORMANCE_SCHEMA.
Let us introduce a simpler lock for those in order to
reduce overhead.
srw_lock: A simple read-write lock that does not support recursion.
On Microsoft Windows, this wraps SRWLOCK, only adding
runtime overhead if PERFORMANCE_SCHEMA is enabled.
On Linux (all architectures), this is implemented with
std::atomic<uint32_t> and the futex system call.
On other platforms, we will wrap mysql_rwlock_t with
zero runtime overhead.
The PERFORMANCE_SCHEMA instrumentation differs
from InnoDB rw_lock_t in that we will only invoke
PSI_RWLOCK_CALL(start_rwlock_wrwait) or
PSI_RWLOCK_CALL(start_rwlock_rdwait)
if there is an actual conflict.
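A hedged sketch of that conflict-only instrumentation (simplified; the
real wrapper also passes the PSI handle and source location):

  void wr_lock()
  {
    if (lock.wr_lock_try())
      return;             // uncontended: no PERFORMANCE_SCHEMA call at all
    // Only a real conflict reaches this point: register the wait via
    // PSI_RWLOCK_CALL(start_rwlock_wrwait), block in the underlying lock,
    // and finally report completion via end_rwlock_wrwait.
    lock.wr_lock();
  }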