Commit graph

429 commits

Author SHA1 Message Date
Marko Mäkelä
35f59bc4e1 MDEV-26467: More cache friendliness
srw_mutex_impl<bool>::wait_and_lock(): In
commit a73eedbf3f we introduced
an std::atomic::fetch_or() in a loop. Alas, on IA-32 and AMD64,
that was being translated into a loop around LOCK CMPXCHG.
To avoid a nested loop, it is better to explicitly invoke
std::atomic::compare_exchange_weak() in the loop, but only if
the attempt has a chance to succeed (the HOLDER flag is not set).

It is even more efficient to use LOCK BTS, but contemporary compilers
fail to translate std::atomic::fetch_or(x) & x into that when x is
a single-bit constant. On GCC-compatible compilers, we will use
inline assembler to achieve that.

On ISAs other than IA-32 and AMD64, we will continue to use
std::atomic::fetch_or().
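
As a rough sketch of the resulting fast path (simplified: the real
srw_mutex_impl also counts waiters and falls back to a futex wait;
HOLDER is assumed here to be the most significant bit of the lock word):

  #include <atomic>

  static std::atomic<uint32_t> word;
  static constexpr uint32_t HOLDER= 1U << 31;

  static void wait_and_lock_sketch()
  {
    for (uint32_t lk= word.load(std::memory_order_relaxed);;)
    {
      if (lk & HOLDER)
        /* no chance to succeed: keep polling (or futex-wait) */
        lk= word.load(std::memory_order_relaxed);
      else
      {
  #if defined __GNUC__ && (defined __i386__ || defined __x86_64__)
        /* a single LOCK BTS instead of a loop around LOCK CMPXCHG */
        __asm__ goto("lock btsl $31, %0\n\t"
                     "jnc %l1" : : "m" (word) : "cc", "memory" : acquired);
        lk= word.load(std::memory_order_relaxed);
  #else
        if (word.compare_exchange_weak(lk, lk | HOLDER,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed))
          return;
  #endif
      }
    }
  #if defined __GNUC__ && (defined __i386__ || defined __x86_64__)
  acquired:
    return;
  #endif
  }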

ssux_lock_impl<spinloop>::rd_wait(): Use rd_lock_try().
A loop around std::atomic::compare_exchange_weak() should be
cheaper than fetch_add(), fetch_sub() and a wakeup system call.

These deficiencies were pointed out and the use of LOCK BTS was
suggested by Thiago Macieira.
2021-09-28 17:17:59 +03:00
Marko Mäkelä
37a074f6c3 MDEV-26467 fixup: Fix cmake -DWITH_UNIT_TESTS=ON for SUX_LOCK_GENERIC 2021-09-24 09:18:07 +03:00
Marko Mäkelä
277ba134ad MDEV-26467: Avoid futile spin loops
Typically, index_lock and fil_space_t::latch will be held for a longer
time than the spin loop in latch acquisition would be waiting for.
Let us avoid spin loops for those as well as dict_sys.latch, which
could be held in exclusive mode for a longer time (while loading
metadata into the buffer pool and the dictionary cache).

Performance testing on a dual Intel Xeon E5-2630 v4 (2 NUMA nodes)
suggests that the buffer pool page latch (block_lock) benefits from a
spin loop in both read-only and read-write workloads where the working
set is slightly larger than the buffer pool. Presumably, most contention
would occur on leaf page latches. Contention on upper level pages in
the buffer pool should intuitively last longer.

We introduce srw_spin_lock and srw_spin_mutex to allow users of
srw_lock or srw_mutex to opt in for the spin loop.
On Microsoft Windows, no spin loop variant has been or will be made
available; srw_mutex and srw_lock will simply wrap SRWLOCK.
That is, on Microsoft Windows, the parameters innodb_sync_spin_loops
and innodb_spin_wait_delay will only affect block_lock.
2021-09-06 12:32:24 +03:00
Marko Mäkelä
0f0b7e47bc MDEV-26467: Avoid re-reading srv_spin_wait_delay inside a loop
Invoking ut_delay(srv_spin_wait_delay) inside a spin loop would
cause a read of 2 global variables as well as a multiplication.
Let us loop around MY_RELAX_CPU() using a precomputed loop count,
to keep the loops simpler and to help them scale better.

We also tried precomputing the delay into a global variable,
but that appeared to result in slightly worse throughput.
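
A sketch of the intended shape (SPIN_MULTIPLIER and lock_word_is_free()
are illustrative placeholders, not the actual identifiers):

  /* Read the globals and multiply only once, before the spin loop. */
  const unsigned delay= srv_spin_wait_delay * SPIN_MULTIPLIER;

  for (auto rounds= srv_n_spin_wait_rounds; rounds--; )
  {
    if (lock_word_is_free())
      break;
    for (unsigned i= delay; i--; )
      MY_RELAX_CPU();             /* e.g., the PAUSE instruction on x86 */
  }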
2021-09-06 12:22:33 +03:00
Marko Mäkelä
a73eedbf3f MDEV-26467 Unnecessary compare-and-swap loop in srw_mutex
srw_mutex::wait_and_lock(): In the spin loop, we will try to poll
for non-conflicting lock word state by reads, avoiding any writes.
We invoke explicit std::atomic_thread_fence(std::memory_order_acquire)
before returning. The individual operations on the lock word
can use memory_order_relaxed.
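
A sketch of that polling pattern (simplified):

  /* Spin using plain loads only; failed read-modify-write attempts
     would needlessly dirty the cache line. */
  uint32_t lk;
  for (auto rounds= srv_n_spin_wait_rounds; rounds--; )
  {
    lk= word.load(std::memory_order_relaxed);
    if (!(lk & HOLDER))
      break;                      /* now worth attempting the acquisition */
    ut_delay(srv_spin_wait_delay);
  }
  /* ... acquire using memory_order_relaxed operations, then publish: */
  std::atomic_thread_fence(std::memory_order_acquire);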

srw_mutex::lock: Document that the value for a single writer is
HOLDER+1 instead of HOLDER.

srw_mutex::wr_lock_try(), srw_mutex::wr_unlock(): Adjust the value
of the lock word of a single writer from HOLDER to HOLDER+1.
2021-09-06 12:16:26 +03:00
Marko Mäkelä
b81382887c MDEV-25512 Deadlock between sux_lock::u_x_upgrade() and sux_lock::u_lock()
In the SUX_LOCK_GENERIC implementation, we can remember at most
one pending exclusive lock request. If multiple exclusive lock
requests are pending, the WRITER_WAITING flag will be cleared when
the first waiting writer acquires the exclusive lock.

ssux_lock_low::update_lock(): If WRITER_WAITING is set, wake up
the writer even if the UPDATER flag is set, because the waiting
writer may be in the process of upgrading its U lock to X.

rw_lock::read_unlock(): Also indicate that an X lock waiter must
be woken up if an U lock exists.
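
A sketch of the adjusted wake-up predicate (the flag values and the
layout of the lock word are illustrative):

  /* Return whether a waiting X request must be woken up. Previously,
     only the 'last reader leaves' case (WRITER_WAITING alone remains)
     reported this; now the case of a remaining U holder does, too. */
  bool read_unlock()
  {
    const uint32_t lk= word.fetch_sub(1, std::memory_order_release) - 1;
    return lk == WRITER_WAITING || lk == (WRITER_WAITING | UPDATER);
  }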

This fix may cause unnecessary wake-ups and system calls, but this
is the best that we can do. Ideally we would use the MDEV-25404
idea of a separate 'writer' mutex, but there is no portable way to
request that a non-recursive mutex be created, and InnoDB requires
the ability to transfer buf_block_t::lock ownership to an I/O thread.

To allow problems like this to be caught more reliably in the future,
we add a unit test for srw_mutex, srw_lock, ssux_lock, sux_lock.
2021-04-25 12:58:16 +03:00
Marko Mäkelä
8751aa7397 MDEV-25404: ssux_lock_low: Introduce a separate writer mutex
Having both readers and writers use a single lock word in
futex system calls caused a performance regression compared to
SRW_LOCK_DUMMY (mutex and 2 condition variables).
A contributing factor is that we did not accurately keep
track of the number of waiting threads and thus had to invoke
system calls to wake up any waiting threads.

SUX_LOCK_GENERIC: Renamed from SRW_LOCK_DUMMY. This is the
original implementation, with rw_lock (std::atomic<uint32_t>),
a mutex and two condition variables. Using a separate writer
mutex (as described below) is not possible, because the mutex ownership
in a buf_block_t::lock must be able to transfer from a write submitter
thread to an I/O completion thread, and pthread_mutex_lock() may assume
that the submitter thread is recursively acquiring the mutex that it
already holds, while in reality the I/O completion thread is the real
owner. POSIX does not define an interface for requesting a mutex to
be non-recursive.

On Microsoft Windows, srw_lock_low will remain a simple wrapper of
SRWLOCK. On 32-bit Microsoft Windows, sizeof(SRWLOCK)=4 while
sizeof(srw_lock_low)=8.

On other platforms, srw_lock_low is an alias of ssux_lock_low,
the Simple (non-recursive) Shared/Update/eXclusive lock.

In the futex-based implementation of ssux_lock_low (Linux, OpenBSD,
Microsoft Windows), we shall use a dedicated mutex for exclusive
requests (writer), and have a WRITER flag in the 'readers' lock word
to inform that a writer is holding the lock or waiting for the lock to
be granted. When the WRITER flag is set, all lock requests must acquire
the writer mutex. Normally, shared (S) lock requests simply perform a
compare-and-swap on the 'readers' word.
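
A sketch of that fast path (assuming WRITER is the most significant
bit of the 'readers' word):

  bool rd_lock_try()
  {
    uint32_t lk= readers.load(std::memory_order_relaxed);
    while (!(lk & WRITER))
      if (readers.compare_exchange_weak(lk, lk + 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed))
        return true;      /* shared lock acquired */
    return false;         /* slow path: go through the writer mutex */
  }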

Update locks are implemented as a combination of writer mutex
and a normal counter in the 'readers' lock word. The conflict between
U and X locks is guaranteed by the writer mutex.
Unlike SUX_LOCK_GENERIC, wr_u_downgrade() will not wake up any pending
rd_lock() waits. They will wait until u_unlock() releases the writer mutex.

The ssux_lock_low is always wrapped by sux_lock (with a recursion count
of U and X locks), used for dict_index_t::lock and buf_block_t::lock.
Their memory footprint for the futex-based implementation will increase
by sizeof(srw_mutex), or 4 bytes.

This change addresses a performance regression in read-only benchmarks,
such as sysbench oltp_read_only. Write performance was also improved.

On 32-bit Linux and OpenBSD, lock_sys_t::hash_table will allocate
two hash table elements for each srw_lock (14 instead of 15 hash
table cells per 64-byte cache line on IA-32). On Microsoft Windows,
sizeof(SRWLOCK)==sizeof(void*) and there is no change.

Reviewed by: Vladislav Vaintroub
Tested by: Axel Schwenke and Vladislav Vaintroub
2021-04-19 18:15:49 +03:00
Marko Mäkelä
040c16ab8b MDEV-25404: Optimize srw_mutex on Linux, OpenBSD, Windows
On Linux, OpenBSD and Microsoft Windows, srw_mutex was an alias for a
rw-lock while we only need mutex functionality. Let us implement a
futex-based mutex with one bit for HOLDER and 31 bits for counting
waiting requests.

srw_lock::wr_unlock() can avoid waking up a waiter when no waiting
requests exist. (Previously, we only had the 1-bit rw_lock::WRITER_WAITING
flag, which could be wrongly cleared if multiple waiting wr_lock() calls
existed. Now we have no problem with up to 2,147,483,648 conflicting threads.)
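
A sketch of the release path under this layout (assuming HOLDER is the
most significant bit and the low 31 bits count waiting requests;
wake_one() stands in for a futex wake-up):

  void wr_unlock()
  {
    /* Clear HOLDER; the previous value tells whether waiters exist. */
    const uint32_t lk= word.fetch_and(~HOLDER, std::memory_order_release);
    if (lk != HOLDER)
      wake_one();         /* skip the system call when uncontended */
  }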

On 64-bit Microsoft Windows, the advantage is that
sizeof(srw_mutex) is 4, while sizeof(SRWLOCK) would be 8.

Reviewed by: Vladislav Vaintroub
2021-04-19 18:03:17 +03:00
Marko Mäkelä
272a1289ad MDEV-24884 Hang in ssux_lock_low::write_lock()
ssux_lock_low::write_lock(): Before invoking writer_wait(), keep
attempting write_lock_wait_try() as long as no conflict exists.

rw_lock::upgrade_trylock(): Relax a bogus assertion and correct
the acquisition operation. Another thread may be executing in
ssux_lock_low::write_lock() on the same latch. Because we are the
only thread that can make progress on that latch, we must become
the writer. Any waiting thread will be eventually woken up by
ssux_lock_low::u_unlock() or ssux_lock_low::wr_unlock(), but not
by wr_u_downgrade() because the upgrade is a very rare operation.
2021-02-17 12:34:06 +02:00
Marko Mäkelä
07e4b6b276 MDEV-24167 fixup: Wake up all update_lock() in u_unlock()
It turns out that the hang that was fixed in
commit 43d3dad114
for the SRW_LOCK_DUMMY implementation is also possible in the futex
implementation. We have observed hangs of ssux_lock_low::u_unlock()
on Windows where the undesirable value is rw_lock::UPDATER, in the
test mariabackup.xb_compressed_encrypted.

The exact sequence of events leading to the hang is not known, but
it seems that u_unlock() had better always wake up one thread.
Possibly, the case involves multiple blocked u_unlock().

On a busy server, the hang might be 'rescued' by a subsequent
lock acquisition and release that is executed by another thread.

rw_lock::update_unlock(): Change the return type to void.

ssux_lock_low::u_unlock(): Always invoke readers_wake() [sic],
to wake up any pending update_lock() or write_lock().
On futex implementation, this will wake up all waiters.
On SRW_LOCK_DUMMY, writer_wake() and readers_wake() do the same
thing: wake up one write_lock(), or all update_lock() waiters.
2020-12-16 17:45:01 +02:00
Marko Mäkelä
ff5d306e29 MDEV-21452: Replace ib_mutex_t with mysql_mutex_t
SHOW ENGINE INNODB MUTEX functionality is completely removed,
as are the InnoDB latching order checks.

We will enforce innodb_fatal_semaphore_wait_threshold
only for dict_sys.mutex and lock_sys.mutex.

dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex.

lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex.

FIXME: srv_sys should be removed altogether; it is duplicating tpool
functionality.

fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must
not hold fil_system.mutex.

fil_close_all_files(): To prevent SAFE_MUTEX warnings for
fil_space_destroy_crypt_data(), we must not hold fil_system.mutex
while invoking fil_space_free_low() on a detached tablespace.
2020-12-15 17:56:18 +02:00
Marko Mäkelä
db006a9a43 MDEV-21452: Remove os_event_t, MUTEX_EVENT, TTASEventMutex, sync_array
We will default to MUTEXTYPE=sys (using OSTrackMutex) for those
ib_mutex_t that have not been replaced yet.

The view INFORMATION_SCHEMA.INNODB_SYS_SEMAPHORE_WAITS is removed.

The parameter innodb_sync_array_size is removed.

FIXME: innodb_fatal_semaphore_wait_threshold will no longer be enforced.
We should enforce it for lock_sys.mutex and dict_sys.mutex somehow!

innodb_sync_debug=ON might still cover ib_mutex_t.
2020-12-15 17:56:17 +02:00
Marko Mäkelä
38fd7b7d91 MDEV-21452: Replace all direct use of os_event_t
Let us replace os_event_t with mysql_cond_t, and replace the
necessary ib_mutex_t with mysql_mutex_t so that they can be
used with condition variables.

Also, let us replace polling (os_thread_sleep() or timed waits)
with plain mysql_cond_wait() wherever possible.

Furthermore, we will use the lightweight srw_mutex for trx_t::mutex,
to hopefully reduce contention on lock_sys.mutex.

FIXME: Add test coverage of
mariabackup --backup --kill-long-queries-timeout
2020-12-15 17:56:17 +02:00
Marko Mäkelä
43d3dad114 MDEV-24142/MDEV-24167 fixup: Split ssux_lock and srw_lock
This conceptually reverts commit 1fdc161d8f
and reintroduces an option for srw_lock to wrap a native implementation.

The srw_lock and srw_lock_low differ from ssux_lock and ssux_lock_low
in that Slim SUX locks support three modes (Shared, Update, eXclusive)
while Slim RW locks support only two (Read, Write).

On Microsoft Windows, the srw_lock will be implemented by SRWLOCK.
On Linux and OpenBSD, it will be implemented by rw_lock and the
futex system call, just like earlier.
On other systems, or if SRW_LOCK_DUMMY is defined on anything other
than Microsoft Windows, rw_lock_t will be used.

ssux_lock_low::read_lock(), ssux_lock_low::update_lock(): Correct
the SRW_LOCK_DUMMY implementation to prevent hangs. The intention of
commit 1fdc161d8f seems to have been
to use do ... while loops, but the 'do' keyword was missing. This total
breakage was missed in commit 260161fc9f,
which did reduce the probability of the hangs.

ssux_lock_low::u_unlock(): In the SRW_LOCK_DUMMY implementation
(based on a mutex and two condition variables), always invoke
writer_wake() in order to ensure that a waiting update_lock()
will be woken up.

ssux_lock_low::writer_wait(), ssux_lock_low::readers_wait():
In the SRW_LOCK_DUMMY implementation, keep waiting for the signal
until the lock word has changed. The "while" had been wrongly changed
to "if"; the wait must remain a loop in order to avoid hangs.
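
For reference, the classic shape of such a wait: the predicate must be
re-checked in a loop, because a wake-up may be spurious or may race
with another state change.

  pthread_mutex_lock(&mutex);
  while (lock_word == old_value)        /* an "if" here can hang */
    pthread_cond_wait(&cond, &mutex);
  pthread_mutex_unlock(&mutex);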
2020-12-15 14:29:40 +02:00
Marko Mäkelä
ba2d45dc54 MDEV-24142: Remove INFORMATION_SCHEMA.INNODB_MUTEXES
Let us remove sux_lock::waits and the associated bookkeeping.
Starting with commit 1669c8890c,
the PERFORMANCE_SCHEMA instrumentation interface keeps
track of lock waits.

The view INFORMATION_SCHEMA.INNODB_MUTEXES only exported counts
of rw-lock waits.

Also, SHOW ENGINE INNODB MUTEX will no longer export any information
about rw-locks.
2020-12-03 15:28:53 +02:00
Marko Mäkelä
ac028ec5d8 MDEV-24142: Remove the LatchDebug interface to rw-locks
The latching order checks for rw-locks have not caught many bugs
in the past few years and they are greatly complicating the code.

The last time the debug checks were useful was in
commit 59caf2c3c1 (MDEV-13485).

The B-tree hang MDEV-14637 was not caught by LatchDebug,
because the granularity of the checks is not sufficient
to distinguish the levels of non-leaf B-tree pages.

The interface was already made dead code by the grandparent
commit 03ca6495df.
2020-12-03 15:27:50 +02:00
Marko Mäkelä
03ca6495df MDEV-24142: Replace InnoDB rw_lock_t with sux_lock
InnoDB buffer pool block and index tree latches depend on a
special kind of read-update-write lock that allows reentrant
(recursive) acquisition of the 'update' and 'write' locks
as well as an upgrade from 'update' lock to 'write' lock.
The 'update' lock allows any number of reader locks from
other threads, but no concurrent 'update' or 'write' lock.

If there were no requirement to support an upgrade from 'update'
to 'write', we could compose the lock out of two srw_lock
(implemented as any type of native rw-lock, such as SRWLOCK on
Microsoft Windows). Removing this requirement is very difficult,
so in commit f7e7f487d4b06695f91f6fbeb0396b9d87fc7bbf we
implemented an 'update' mode to our srw_lock.

Re-entrant or recursive locking is mostly needed when writing or
freeing BLOB pages, but also in crash recovery or when merging
buffered changes to an index page. The re-entrancy allows us to
attach a previously acquired page to a sub-mini-transaction that
will be committed before whatever else is holding the page latch.

The SUX lock supports Shared ('read'), Update, and eXclusive ('write')
locking modes. The S latches are not re-entrant, but a single S latch
may be acquired even if the thread already holds an U latch.

The idea of the U latch is to allow a write of something that concurrent
readers do not care about (such as the contents of BTR_SEG_LEAF,
BTR_SEG_TOP and other page allocation metadata structures, or
the MDEV-6076 PAGE_ROOT_AUTO_INC). (The PAGE_ROOT_AUTO_INC field
is only updated when a dict_table_t for the table exists, and only
read when a dict_table_t for the table is being added to dict_sys.)

block_lock::u_lock_try(bool for_io=true) is used in buf_flush_page()
to allow concurrent readers but no concurrent modifications while the
page is being written to the data file. That latch will be released
by buf_page_write_complete() in a different thread. Hence, we use
the special lock owner value FOR_IO.

The index_lock::u_lock() improves concurrency on operations that
involve non-leaf index pages.

The interface has been cleaned up a little. We will use
x_lock_recursive() instead of x_lock() when we know that a
lock is already held by the current thread. Similarly,
a lock upgrade from U to X is only allowed via u_x_upgrade()
or x_lock_upgraded() but not via x_lock().
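
For illustration, a typical pattern under the cleaned-up interface
might look like this (simplified):

  block->lock.u_lock();        /* concurrent S latches remain possible */
  /* ... modify fields that concurrent readers do not care about ... */
  block->lock.u_x_upgrade();   /* now exclude readers as well */
  /* ... modify fields that readers do care about ... */
  block->lock.x_unlock();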

We will disable the LatchDebug and sync_array interfaces to
InnoDB rw-locks.

The SEMAPHORES section of SHOW ENGINE INNODB STATUS output
will no longer include any information about InnoDB rw-locks,
only TTASEventMutex (cmake -DMUTEXTYPE=event) waits.
This will make a part of the 'innotop' script dead code.

The block_lock buf_block_t::lock will not be covered by any
PERFORMANCE_SCHEMA instrumentation.

SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES
will no longer output source code file names or line numbers.
The dict_index_t::lock will be identified by index and table names,
which should be much more useful. PERFORMANCE_SCHEMA is lumping
information about all dict_index_t::lock together as
event_name='wait/synch/sxlock/innodb/index_tree_rw_lock'.

buf_page_free(): Remove the file,line parameters. The sux_lock will
not store such diagnostic information.

buf_block_dbg_add_level(): Define as empty macro, to be removed
in a subsequent commit.

Unless the build was configured with cmake -DPLUGIN_PERFSCHEMA=NO
the index_lock dict_index_t::lock will be instrumented via
PERFORMANCE_SCHEMA. Similar to
commit 1669c8890c
we will distinguish lock waits by registering shared_lock,exclusive_lock
events instead of try_shared_lock,try_exclusive_lock.
Actual 'try' operations will not be instrumented at all.

rw_lock_list: Remove. After MDEV-24167, this only covered
buf_block_t::lock and dict_index_t::lock. We will output their
information by traversing buf_pool or dict_sys.
2020-12-03 15:19:49 +02:00
Marko Mäkelä
d46b42489a MDEV-24142 preparation: Add srw_mutex and srw_lock::u_lock()
The PERFORMANCE_SCHEMA insists on distinguishing read-update-write
locks from read-write locks, so we must add
template<bool support_u_lock> in rd_lock() and wr_lock() operations.

rw_lock::read_trylock(): Add template<bool prioritize_updater=false>
which is used by the srw_lock_low::read_lock() loop. As long as
an UPDATE lock has already been granted to some thread, we will grant
subsequent READ lock requests even if a waiting WRITE lock request
exists. This will be necessary to be compatible with existing usage
pattern of InnoDB rw_lock_t where the holder of SX-latch (which we
will rename to UPDATE latch) may acquire an additional S-latch
on the same object. For normal read-write locks without update operations
this should make no difference at all, because the rw_lock::UPDATER
flag would never be set.
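
A sketch of those semantics (the layout is simplified; the reader
count is assumed to occupy the low bits of the lock word):

  template<bool prioritize_updater= false>
  bool read_trylock(uint32_t &lk)
  {
    lk= word.load(std::memory_order_relaxed);
    do
    {
      if (lk & WRITER)
        return false;
      /* Defer to a waiting writer, unless an UPDATE lock is already
         granted and its holder may be requesting an additional S latch. */
      if ((lk & WRITER_WAITING) && !(prioritize_updater && (lk & UPDATER)))
        return false;
    }
    while (!word.compare_exchange_weak(lk, lk + 1,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed));
    return true;
  }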
2020-12-03 15:17:16 +02:00
Marko Mäkelä
1669c8890c MDEV-24167 fixup: Improve the PERFORMANCE_SCHEMA instrumentation
Let us try to avoid code bloat for the common case that
performance_schema is disabled at runtime, and use
ATTRIBUTE_NOINLINE member functions for instrumented latch acquisition.

Also, let us distinguish lock waits from non-contended lock requests
by using write_lock,read_lock for the requests that lead to waits,
and try_write_lock,try_read_lock for the wait-free lock acquisitions.
Actual 'try' operations are not being instrumented at all.
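
A sketch of such an instrumented acquisition (following the server's
PSI rwlock interface; pfs_psi, rd_lock_try() and rd_wait() are assumed
members of the surrounding class):

  ATTRIBUTE_NOINLINE
  void rd_lock(const char *file, unsigned line)
  {
    if (rd_lock_try())
      return;                  /* wait-free: no wait event is recorded */
    /* Only an actual conflict reports a wait to PERFORMANCE_SCHEMA. */
    PSI_rwlock_locker_state state;
    PSI_rwlock_locker *locker=
      PSI_RWLOCK_CALL(start_rwlock_rdwait)(&state, pfs_psi,
                                           PSI_RWLOCK_READLOCK, file, line);
    rd_wait();
    if (locker)
      PSI_RWLOCK_CALL(end_rwlock_rdwait)(locker, 0);
  }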
2020-12-03 09:55:53 +02:00
Marko Mäkelä
260161fc9f MDEV-24167 fixup: Avoid hangs in SRW_LOCK_DUMMY
In commit 1fdc161d8f we introduced
a mutex-and-condition-variable based fallback implementation
for platforms that lack a futex system call. That implementation
is prone to hangs.

Let us use separate condition variables for shared and exclusive requests.
2020-12-03 09:11:31 +02:00
Marko Mäkelä
1fdc161d8f MDEV-24167 fixup: Always derive srw_lock from rw_lock
Let us always base srw_lock on our own std::atomic<uint32_t>
based rw_lock. In this way, we can extend the locks in a portable
way across all platforms.

We will use futex system calls where available:
Linux, OpenBSD, and Microsoft Windows.

Elsewhere, we will emulate futex with a mutex and a condition variable.

Thanks to Daniel Black for testing this on OpenBSD.
2020-11-30 11:47:09 +02:00
Marko Mäkelä
565b0dd17d Merge 10.5 into 10.6 2020-11-30 11:30:26 +02:00
Marko Mäkelä
8fa6e36375 MDEV-24308: Remove some os_thread_ functions
os_thread_pf(): Remove.

os_thread_eq(), os_thread_yield(), os_thread_get_curr_id():
Define as macros.

ut_print_timestamp(), ut_sprintf_timestamp(): Simplify.
2020-11-30 11:15:31 +02:00
Marko Mäkelä
8b8969929d Merge 10.5 into 10.6 2020-11-26 07:36:53 +02:00
Marko Mäkelä
657fcdf430 MDEV-24280 InnoDB triggers too many independent periodic tasks
A side effect of MDEV-16264 is that a large number of threads will
be created at server startup, to be destroyed after a minute or two.

One source of such thread creation is srv_start_periodic_timer().
InnoDB is creating 3 periodic tasks: srv_master_callback (1Hz),
srv_error_monitor_task (1Hz), and srv_monitor_task (0.2Hz).

It appears that we can merge srv_error_monitor_task and srv_monitor_task
and have them invoked 4 times per minute (every 15 seconds). This will
affect our ability to enforce innodb_fatal_semaphore_wait_threshold and
some computations around BUF_LRU_STAT_N_INTERVAL.

We could remove srv_master_callback along with the DROP TABLE queue
at some point of time in the future. We must keep it independent
of the innodb_fatal_semaphore_wait_threshold detection, because
the background DROP TABLE queue could get stuck due to dict_sys
being locked by another thread. For now, srv_master_callback
must be invoked once per second, so that
innodb_flush_log_at_timeout=1 can work.

BUF_LRU_STAT_N_INTERVAL: Reduce the precision and extend the time
from 50*1 second to 4*15 seconds.

srv_error_monitor_timer: Remove.

MAX_MUTEX_NOWAIT: Increase from 20*1 second to 2*15 seconds.

srv_refresh_innodb_monitor_stats(): Avoid a repeated call to time(NULL).
Change the interval to less than 60 seconds.

srv_monitor(): Renamed from srv_monitor_task.

srv_monitor_task(): Renamed from srv_error_monitor_task().
Invoked only once every 15 seconds. Also invoke srv_monitor().
Increase the fatal_cnt threshold from 10*1 second to 1*15 seconds.

sync_array_print_long_waits_low(): Invoke time(NULL) only once.
Remove a bogus message about printouts for 30 seconds. Those
printouts were effectively already disabled in MDEV-16264
(commit 5e62b6a5e0).
2020-11-25 16:54:00 +02:00
Marko Mäkelä
edbde4a11f MDEV-24167: Replace fil_space::latch
We must avoid acquiring a latch while we are already holding one.
The tablespace latch was being acquired recursively in some
operations that allocate or free pages.
2020-11-24 15:43:12 +02:00
Marko Mäkelä
bdd88cfa34 MDEV-24167: Replace fts_cache_rw_lock, fts_cache_init_rw_lock with mutex
fts_cache_t::init_lock: Replace with mutex. This was only acquired
in exclusive mode.

fts_cache_t::lock: Replace with mutex. The only read-lock user was
i_s_fts_index_cache_fill() for producing content for the view
INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE.
2020-11-24 15:43:12 +02:00
Marko Mäkelä
1a1b7a6f16 MDEV-24167: Replace dict_operation_lock (dict_sys.latch) 2020-11-24 15:43:11 +02:00
Marko Mäkelä
06ef5509d0 MDEV-24167: Replace trx_purge_latch 2020-11-24 15:43:10 +02:00
Marko Mäkelä
63dd2a97e4 MDEV-24167: Replace trx_i_s_cache_lock 2020-11-24 15:43:09 +02:00
Marko Mäkelä
c561f9e6e8 MDEV-24167: Use lightweight srw_lock for btr_search_latch
Many InnoDB rw-locks unnecessarily depend on the complex
InnoDB rw_lock_t implementation that supports the SX lock mode
as well as recursive acquisition of X or SX locks.
One of them is the bunch of adaptive hash index search latches,
instrumented as btr_search_latch in PERFORMANCE_SCHEMA.
Let us introduce a simpler lock for those in order to
reduce overhead.

srw_lock: A simple read-write lock that does not support recursion.
On Microsoft Windows, this wraps SRWLOCK, only adding
runtime overhead if PERFORMANCE_SCHEMA is enabled.
On Linux (all architectures), this is implemented with
std::atomic<uint32_t> and the futex system call.
On other platforms, we will wrap mysql_rwlock_t with
zero runtime overhead.

The PERFORMANCE_SCHEMA instrumentation differs
from InnoDB rw_lock_t in that we will only invoke
PSI_RWLOCK_CALL(start_rwlock_wrwait) or
PSI_RWLOCK_CALL(start_rwlock_rdwait)
if there is an actual conflict.
2020-11-24 15:41:03 +02:00
Marko Mäkelä
1e5d989d2a MDEV-24167: Remove PFS instrumentation of buf_block_t
We always defined PFS_SKIP_BUFFER_MUTEX_RWLOCK, that is,
the latches of the buffer pool blocks were never instrumented
in PERFORMANCE_SCHEMA.

For some reason, the debug_latch (which enforces proper usage of
buffer-fixing in debug builds) was instrumented.
2020-11-20 08:55:41 +02:00
Marko Mäkelä
e8f8992801 MDEV-24188: Merge 10.4 into 10.5 2020-11-13 22:06:50 +02:00
Marko Mäkelä
749ecedfec MDEV-24188: Merge 10.3 into 10.4 2020-11-13 20:45:28 +02:00
Marko Mäkelä
f9f2f37495 MDEV-24188: Merge 10.2 into 10.3 2020-11-13 20:41:48 +02:00
Marko Mäkelä
bb328a2a27 MDEV-24188 Hang in buf_page_create() after reusing a previously freed page
The fix of MDEV-23456 (commit b1009ae5c1)
introduced a livelock between page flushing and a thread that is
executing buf_page_create().

buf_page_create(): If the current mini-transaction is holding
an exclusive latch on the page, do not attempt to acquire another
one, and do not care about any I/O fix.

mtr_t::have_x_latch(): Replaces mtr_t::get_fix_count().

dyn_buf_t::for_each_block(const Functor&) const: A new variant.

rw_lock_own(): Add a const qualifier.

Reviewed by: Thirunarayanan Balathandayuthapani
2020-11-13 20:16:39 +02:00
Marko Mäkelä
898521e2dd Merge 10.4 into 10.5 2020-10-30 11:15:30 +02:00
Marko Mäkelä
7b2bb67113 Merge 10.3 into 10.4 2020-10-29 13:38:38 +02:00
Marko Mäkelä
2b6f804490 Merge 10.2 into 10.3 2020-10-28 10:44:40 +02:00
Eugene Kosov
afc9d00c66 MDEV-23991 dict_table_stats_lock() has unnecessarily long scope
The patch removes dict_index_t::stats_latch. Table/index statistics are
now protected by dict_sys->mutex. That way, statistics computation can
happen in parallel in several threads, and dict_sys->mutex will be locked
only for a short period of time.

This patch is a joint work with Marko Mäkelä

dict_index_t::lock: make mutable, which allows passing a const pointer
when only the lock is touched in an object

btr_height_get()
btr_get_size(): make index argument const for better type safety

btr_estimate_number_of_different_key_vals(): now returns computed values
instead of setting fields in dict_index_t directly

remove everything related to dict_index_t::stats_latch

dict_stats_index_set_n_diff(): now returns computed values instead
of setting fields in dict_index_t directly

dict_stats_analyze_index():  now returns computed values instead
of setting fields in dict_index_t directly

Reviewed by: Marko Mäkelä
2020-10-27 19:09:20 +03:00
Marko Mäkelä
c27e53f459 MDEV-23855: Use normal mutex for log_sys.mutex, log_sys.flush_order_mutex
With an unreasonably small innodb_log_file_size, the page cleaner
thread would frequently acquire log_sys.flush_order_mutex and spend
a significant portion of CPU time spinning on that mutex when
determining the checkpoint LSN.
2020-10-26 17:53:55 +02:00
Marko Mäkelä
1657b7a583 Merge 10.4 to 10.5 2020-10-22 17:08:49 +03:00
Marko Mäkelä
46957a6a77 Merge 10.3 into 10.4 2020-10-22 13:27:18 +03:00
Marko Mäkelä
e3d692aa09 Merge 10.2 into 10.3 2020-10-22 08:26:28 +03:00
Marko Mäkelä
7cffb5f6e8 MDEV-23399: Performance regression with write workloads
The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.

The configuration parameters will be changed as follows:

innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)

Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
The parameter innodb_lru_scan_depth (srv_LRU_scan_depth) is used as follows:
 * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
 * As a buf_pool.free limit in buf_LRU_list_batch() for terminating
   the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().

The status variables will be changed as follows:

innodb_buffer_pool_pages_flushed: This also includes the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably; it is
updated one by one in buf_flush_page() to give more real-time
statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call

When moving to a single buffer pool instance in MDEV-15058, we missed
some opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.

Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.

The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.

The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a unless log_warnings>2.

Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are currently exclusively locked.
The MDEV-13670 "page_cleaner: 1000ms intended loop took" message
will be removed because, by design, buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.

We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.

To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.

For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.

mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicates, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().

buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.

buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.

buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.

buf_pool_t::done_flush_list: Condition variable for !n_flush_list.

buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.

buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.

buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().

flush_counters_t::unzip_LRU_evicted: Remove.

IORequest: Make more members const. FIXME: m_fil_node should be removed.

buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).

page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.

pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().

recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().

recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.

srv_started_redo: Replaces srv_start_state.

SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().

buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().

buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify an LRU batch by lsn=0. Merge the functions
buf_flush_start(), buf_flush_batch(), and buf_flush_end() directly into this
function, which was their only caller, and remove two unnecessary
buf_pool.mutex release/re-acquisition cycles that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.

buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.

buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.

buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().

buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.

buf_flush_check_neighbor(): Take id.fold() as a parameter.

buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.

buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.

buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_do_LRU_batch(): Return the number of pages flushed.

buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.

buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.

buf_LRU_check_size_of_non_data_objects(): Simplify the code.

buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.

buf_page_create(): Take a pre-allocated page as a parameter.

buf_pool_t::free_block(): Free a pre-allocated block.

recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.

BtrBulk::logFreeCheck(): Skip a redundant condition.

row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.

sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.

Reviewed by: Vladislav Vaintroub
2020-10-15 17:04:56 +03:00
Marko Mäkelä
b535a79044 MDEV-23399: Remove recv_writer_thread
Recovery works just fine without a separate thread whose only
task is to tell the page cleaner thread to do its job.

recv_sys_t::apply(): Flush the buffer pool at the end of each batch.

Reviewed by: Vladislav Vaintroub
2020-10-15 10:33:20 +03:00
Marko Mäkelä
861cd4ce28 MDEV-22871 fixup: Remove SYNC_BUF_PAGE_HASH
This was missed in commit 5155a300fa.
2020-10-05 13:05:11 +03:00
Marko Mäkelä
199bc67144 Cleanup: Remove unused SYNC_REC_LOCK
SYNC_REC_LOCK was never used in the public history of InnoDB,
starting with commit 132e667b0b.
2020-10-05 09:12:12 +03:00
Marko Mäkelä
46890349bf Cleanup: Remove fts_t::bg_threads_mutex, fts_t::bg_threads
The unused fts_t::bg_threads was added in
mysql/mysql-server@4b1049625c.

Any usage of fts_t::bg_threads_mutex was removed in
mysql/mysql-server@33c2404b39.
2020-10-02 08:36:50 +03:00
Marko Mäkelä
91d39f630d Cleanup: Remove unused mutex keys 2020-10-02 07:10:16 +03:00