Commit graph

429 commits

Author SHA1 Message Date
Marko Mäkelä
35f59bc4e1 MDEV-26467: More cache friendliness
srw_mutex_impl<bool>::wait_and_lock(): In
commit a73eedbf3f we introduced
an std::atomic::fetch_or() in a loop. Alas, on IA-32 and AMD64,
that was being translated into a loop around LOCK CMPXCHG.
To avoid a nested loop, it is better to explicitly invoke
std::atomic::compare_exchange_weak() in the loop, but only if
the attempt has a chance to succeed (the HOLDER flag is not set).

It is even more efficient to use LOCK BTS, but contemporary compilers
fail to translate std::atomic::fetch_or(x) & x into that when x is
a single-bit constant. On GCC-compatible compilers, we will use
inline assembler to achieve that.

On ISAs other than IA-32 and AMD64, we will continue to use
std::atomic::fetch_or().
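
As a rough sketch of the resulting fast path (simplified: the real
srw_mutex_impl also counts waiters and falls back to a futex wait;
HOLDER is assumed here to be the most significant bit of the lock word):

  #include <atomic>

  static std::atomic<uint32_t> word;
  static constexpr uint32_t HOLDER= 1U << 31;

  static void wait_and_lock_sketch()
  {
    for (uint32_t lk= word.load(std::memory_order_relaxed);;)
    {
      if (lk & HOLDER)
        /* no chance to succeed: keep polling (or futex-wait) */
        lk= word.load(std::memory_order_relaxed);
      else
      {
  #if defined __GNUC__ && (defined __i386__ || defined __x86_64__)
        /* a single LOCK BTS instead of a loop around LOCK CMPXCHG */
        __asm__ goto("lock btsl $31, %0\n\t"
                     "jnc %l1" : : "m" (word) : "cc", "memory" : acquired);
        lk= word.load(std::memory_order_relaxed);
  #else
        if (word.compare_exchange_weak(lk, lk | HOLDER,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed))
          return;
  #endif
      }
    }
  #if defined __GNUC__ && (defined __i386__ || defined __x86_64__)
  acquired:
    return;
  #endif
  }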

ssux_lock_impl<spinloop>::rd_wait(): Use rd_lock_try().
A loop around std::atomic::compare_exchange_weak() should be
cheaper than fetch_add(), fetch_sub() and a wakeup system call.

These deficiencies were pointed out and the use of LOCK BTS was
suggested by Thiago Macieira.
2021-09-28 17:17:59 +03:00
Marko Mäkelä
37a074f6c3 MDEV-26467 fixup: Fix cmake -DWITH_UNIT_TESTS=ON for SUX_LOCK_GENERIC 2021-09-24 09:18:07 +03:00
Marko Mäkelä
277ba134ad MDEV-26467: Avoid futile spin loops
Typically, index_lock and fil_space_t::latch will be held for a longer
time than the spin loop in latch acquisition would be waiting for.
Let us avoid spin loops for those as well as dict_sys.latch, which
could be held in exclusive mode for a longer time (while loading
metadata into the buffer pool and the dictionary cache).

Performance testing on a dual Intel Xeon E5-2630 v4 (2 NUMA nodes)
suggests that the buffer pool page latch (block_lock) benefits from a
spin loop in both read-only and read-write workloads where the working
set is slightly larger than the buffer pool. Presumably, most contention
would occur on leaf page latches. Contention on upper level pages in
the buffer pool should intuitively last longer.

We introduce srw_spin_lock and srw_spin_mutex to allow users of
srw_lock or srw_mutex to opt in for the spin loop.
On Microsoft Windows, no spin loop variant has been or will be made
available; srw_mutex and srw_lock will simply wrap SRWLOCK.
That is, on Microsoft Windows, the parameters innodb_sync_spin_loops
and innodb_spin_wait_delay will only affect block_lock.
2021-09-06 12:32:24 +03:00
Marko Mäkelä
0f0b7e47bc MDEV-26467: Avoid re-reading srv_spin_wait_delay inside a loop
Invoking ut_delay(srv_spin_wait_delay) inside a spin loop would
cause a read of 2 global variables as well as a multiplication.
Let us loop around MY_RELAX_CPU() using a precomputed loop count,
to keep the loops simpler and to help them scale better.

We also tried precomputing the delay into a global variable,
but that appeared to result in slightly worse throughput.
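
A sketch of the intended shape (SPIN_MULTIPLIER and lock_word_is_free()
are illustrative placeholders, not the actual identifiers):

  /* Read the globals and multiply only once, before the spin loop. */
  const unsigned delay= srv_spin_wait_delay * SPIN_MULTIPLIER;

  for (auto rounds= srv_n_spin_wait_rounds; rounds--; )
  {
    if (lock_word_is_free())
      break;
    for (unsigned i= delay; i--; )
      MY_RELAX_CPU();             /* e.g., the PAUSE instruction on x86 */
  }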
2021-09-06 12:22:33 +03:00
Marko Mäkelä
a73eedbf3f MDEV-26467 Unnecessary compare-and-swap loop in srw_mutex
srw_mutex::wait_and_lock(): In the spin loop, we will try to poll
for non-conflicting lock word state by reads, avoiding any writes.
We invoke explicit std::atomic_thread_fence(std::memory_order_acquire)
before returning. The individual operations on the lock word
can use memory_order_relaxed.
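
A sketch of that polling pattern (simplified):

  /* Spin using plain loads only; failed read-modify-write attempts
     would needlessly dirty the cache line. */
  uint32_t lk;
  for (auto rounds= srv_n_spin_wait_rounds; rounds--; )
  {
    lk= word.load(std::memory_order_relaxed);
    if (!(lk & HOLDER))
      break;                      /* now worth attempting the acquisition */
    ut_delay(srv_spin_wait_delay);
  }
  /* ... acquire using memory_order_relaxed operations, then publish: */
  std::atomic_thread_fence(std::memory_order_acquire);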

srw_mutex::lock: Document that the value for a single writer is
HOLDER+1 instead of HOLDER.

srw_mutex::wr_lock_try(), srw_mutex::wr_unlock(): Adjust the value
of the lock word of a single writer from HOLDER to HOLDER+1.
2021-09-06 12:16:26 +03:00
Marko Mäkelä
b81382887c MDEV-25512 Deadlock between sux_lock::u_x_upgrade() and sux_lock::u_lock()
In the SUX_LOCK_GENERIC implementation, we can remember at most
one pending exclusive lock request. If multiple exclusive lock
requests are pending, the WRITER_WAITING flag will be cleared when
the first waiting writer acquires the exclusive lock.

ssux_lock_low::update_lock(): If WRITER_WAITING is set, wake up
the writer even if the UPDATER flag is set, because the waiting
writer may be in the process of upgrading its U lock to X.

rw_lock::read_unlock(): Also indicate that an X lock waiter must
be woken up if an U lock exists.
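
A sketch of the adjusted wake-up predicate (the flag values and the
layout of the lock word are illustrative):

  /* Return whether a waiting X request must be woken up. Previously,
     only the 'last reader leaves' case (WRITER_WAITING alone remains)
     reported this; now the case of a remaining U holder does, too. */
  bool read_unlock()
  {
    const uint32_t lk= word.fetch_sub(1, std::memory_order_release) - 1;
    return lk == WRITER_WAITING || lk == (WRITER_WAITING | UPDATER);
  }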

This fix may cause unnecessary wake-ups and system calls, but this
is the best that we can do. Ideally we would use the MDEV-25404
idea of a separate 'writer' mutex, but there is no portable way to
request that a non-recursive mutex be created, and InnoDB requires
the ability to transfer buf_block_t::lock ownership to an I/O thread.

To allow problems like this to be caught more reliably in the future,
we add a unit test for srw_mutex, srw_lock, ssux_lock, sux_lock.
2021-04-25 12:58:16 +03:00
Marko Mäkelä
8751aa7397 MDEV-25404: ssux_lock_low: Introduce a separate writer mutex
Having both readers and writers use a single lock word in
futex system calls caused a performance regression compared to
SRW_LOCK_DUMMY (mutex and 2 condition variables).
A contributing factor is that we did not accurately keep
track of the number of waiting threads and thus had to invoke
system calls to wake up any waiting threads.

SUX_LOCK_GENERIC: Renamed from SRW_LOCK_DUMMY. This is the
original implementation, with rw_lock (std::atomic<uint32_t>),
a mutex and two condition variables. Using a separate writer
mutex (as described below) is not possible, because the mutex ownership
in a buf_block_t::lock must be able to transfer from a write submitter
thread to an I/O completion thread, and pthread_mutex_lock() may assume
that the submitter thread is recursively acquiring the mutex that it
already holds, while in reality the I/O completion thread is the real
owner. POSIX does not define an interface for requesting a mutex to
be non-recursive.

On Microsoft Windows, srw_lock_low will remain a simple wrapper of
SRWLOCK. On 32-bit Microsoft Windows, sizeof(SRWLOCK)=4 while
sizeof(srw_lock_low)=8.

On other platforms, srw_lock_low is an alias of ssux_lock_low,
the Simple (non-recursive) Shared/Update/eXclusive lock.

In the futex-based implementation of ssux_lock_low (Linux, OpenBSD,
Microsoft Windows), we shall use a dedicated mutex for exclusive
requests (writer), and have a WRITER flag in the 'readers' lock word
to inform that a writer is holding the lock or waiting for the lock to
be granted. When the WRITER flag is set, all lock requests must acquire
the writer mutex. Normally, shared (S) lock requests simply perform a
compare-and-swap on the 'readers' word.
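
A sketch of that fast path (assuming WRITER is the most significant
bit of the 'readers' word):

  bool rd_lock_try()
  {
    uint32_t lk= readers.load(std::memory_order_relaxed);
    while (!(lk & WRITER))
      if (readers.compare_exchange_weak(lk, lk + 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed))
        return true;      /* shared lock acquired */
    return false;         /* slow path: go through the writer mutex */
  }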

Update locks are implemented as a combination of writer mutex
and a normal counter in the 'readers' lock word. The conflict between
U and X locks is guaranteed by the writer mutex.
Unlike SUX_LOCK_GENERIC, wr_u_downgrade() will not wake up any pending
rd_lock() waits. They will wait until u_unlock() releases the writer mutex.

The ssux_lock_low is always wrapped by sux_lock (with a recursion count
of U and X locks), used for dict_index_t::lock and buf_block_t::lock.
Their memory footprint for the futex-based implementation will increase
by sizeof(srw_mutex), or 4 bytes.

This change addresses a performance regression in read-only benchmarks,
such as sysbench oltp_read_only. Write performance was also improved.

On 32-bit Linux and OpenBSD, lock_sys_t::hash_table will allocate
two hash table elements for each srw_lock (14 instead of 15 hash
table cells per 64-byte cache line on IA-32). On Microsoft Windows,
sizeof(SRWLOCK)==sizeof(void*) and there is no change.

Reviewed by: Vladislav Vaintroub
Tested by: Axel Schwenke and Vladislav Vaintroub
2021-04-19 18:15:49 +03:00
Marko Mäkelä
040c16ab8b MDEV-25404: Optimize srw_mutex on Linux, OpenBSD, Windows
On Linux, OpenBSD and Microsoft Windows, srw_mutex was an alias for a
rw-lock while we only need mutex functionality. Let us implement a
futex-based mutex with one bit for HOLDER and 31 bits for counting
waiting requests.

srw_lock::wr_unlock() can avoid waking up a waiter when no waiting
requests exist. (Previously, we only had the 1-bit rw_lock::WRITER_WAITING
flag, which could be wrongly cleared if multiple waiting wr_lock() calls
existed. Now we have no problem with up to 2,147,483,648 conflicting threads.)
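
A sketch of the release path under this layout (assuming HOLDER is the
most significant bit and the low 31 bits count waiting requests;
wake_one() stands in for a futex wake-up):

  void wr_unlock()
  {
    /* Clear HOLDER; the previous value tells whether waiters exist. */
    const uint32_t lk= word.fetch_and(~HOLDER, std::memory_order_release);
    if (lk != HOLDER)
      wake_one();         /* skip the system call when uncontended */
  }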

On 64-bit Microsoft Windows, the advantage is that
sizeof(srw_mutex) is 4, while sizeof(SRWLOCK) would be 8.

Reviewed by: Vladislav Vaintroub
2021-04-19 18:03:17 +03:00
Marko Mäkelä
272a1289ad MDEV-24884 Hang in ssux_lock_low::write_lock()
ssux_lock_low::write_lock(): Before invoking writer_wait(), keep
attempting write_lock_wait_try() as long as no conflict exists.

rw_lock::upgrade_trylock(): Relax a bogus assertion and correct
the acquisition operation. Another thread may be executing in
ssux_lock_low::write_lock() on the same latch. Because we are the
only thread that can make progress on that latch, we must become
the writer. Any waiting thread will be eventually woken up by
ssux_lock_low::u_unlock() or ssux_lock_low::wr_unlock(), but not
by wr_u_downgrade() because the upgrade is a very rare operation.
2021-02-17 12:34:06 +02:00
Marko Mäkelä
07e4b6b276 MDEV-24167 fixup: Wake up all update_lock() in u_unlock()
It turns out that the hang that was fixed in
commit 43d3dad114
for the SRW_LOCK_DUMMY implementation is also possible in the futex
implementation. We have observed hangs of ssux_lock_low::u_unlock()
on Windows where the undesirable value is rw_lock::UPDATER, in the
test mariabackup.xb_compressed_encrypted.

The exact sequence of events leading to the hang is not known, but
it seems that u_unlock() had better always wake up one thread.
Possibly, the case involves multiple blocked u_unlock().

On a busy server, the hang might be 'rescued' by a subsequent
lock acquisition and release that is executed by another thread.

rw_lock::update_unlock(): Change the return type to void.

ssux_lock_low::u_unlock(): Always invoke readers_wake() [sic],
to wake up any pending update_lock() or write_lock().
On futex implementation, this will wake up all waiters.
On SRW_LOCK_DUMMY, writer_wake() and readers_wake() do the same
thing: wake up one write_lock(), or all update_lock() waiters.
2020-12-16 17:45:01 +02:00
Marko Mäkelä
ff5d306e29 MDEV-21452: Replace ib_mutex_t with mysql_mutex_t
SHOW ENGINE INNODB MUTEX functionality is completely removed,
as are the InnoDB latching order checks.

We will enforce innodb_fatal_semaphore_wait_threshold
only for dict_sys.mutex and lock_sys.mutex.

dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex.

lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex.

FIXME: srv_sys should be removed altogether; it is duplicating tpool
functionality.

fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must
not hold fil_system.mutex.

fil_close_all_files(): To prevent SAFE_MUTEX warnings for
fil_space_destroy_crypt_data(), we must not hold fil_system.mutex
while invoking fil_space_free_low() on a detached tablespace.
2020-12-15 17:56:18 +02:00
Marko Mäkelä
db006a9a43 MDEV-21452: Remove os_event_t, MUTEX_EVENT, TTASEventMutex, sync_array
We will default to MUTEXTYPE=sys (using OSTrackMutex) for those
ib_mutex_t that have not been replaced yet.

The view INFORMATION_SCHEMA.INNODB_SYS_SEMAPHORE_WAITS is removed.

The parameter innodb_sync_array_size is removed.

FIXME: innodb_fatal_semaphore_wait_threshold will no longer be enforced.
We should enforce it for lock_sys.mutex and dict_sys.mutex somehow!

innodb_sync_debug=ON might still cover ib_mutex_t.
2020-12-15 17:56:17 +02:00
Marko Mäkelä
38fd7b7d91 MDEV-21452: Replace all direct use of os_event_t
Let us replace os_event_t with mysql_cond_t, and replace the
necessary ib_mutex_t with mysql_mutex_t so that they can be
used with condition variables.

Also, let us replace polling (os_thread_sleep() or timed waits)
with plain mysql_cond_wait() wherever possible.

Furthermore, we will use the lightweight srw_mutex for trx_t::mutex,
to hopefully reduce contention on lock_sys.mutex.

FIXME: Add test coverage of
mariabackup --backup --kill-long-queries-timeout
2020-12-15 17:56:17 +02:00
Marko Mäkelä
43d3dad114 MDEV-24142/MDEV-24167 fixup: Split ssux_lock and srw_lock
This conceptually reverts commit 1fdc161d8f
and reintroduces an option for srw_lock to wrap a native implementation.

The srw_lock and srw_lock_low differ from ssux_lock and ssux_lock_low
in that Slim SUX locks support three modes (Shared, Update, eXclusive)
while Slim RW locks support only two (Read, Write).

On Microsoft Windows, the srw_lock will be implemented by SRWLOCK.
On Linux and OpenBSD, it will be implemented by rw_lock and the
futex system call, just like earlier.
On other systems, or if SRW_LOCK_DUMMY is defined on anything other
than Microsoft Windows, rw_lock_t will be used.

ssux_lock_low::read_lock(), ssux_lock_low::update_lock(): Correct
the SRW_LOCK_DUMMY implementation to prevent hangs. The intention of
commit 1fdc161d8f seems to have been
to use do ... while loops, but the 'do' keyword was missing. This total
breakage was missed in commit 260161fc9f,
which did reduce the probability of the hangs.

ssux_lock_low::u_unlock(): In the SRW_LOCK_DUMMY implementation
(based on a mutex and two condition variables), always invoke
writer_wake() in order to ensure that a waiting update_lock()
will be woken up.

ssux_lock_low::writer_wait(), ssux_lock_low::readers_wait():
In the SRW_LOCK_DUMMY implementation, keep waiting for the signal
until the lock word has changed. The "while" had been wrongly changed
to "if"; the wait must remain a loop in order to avoid hangs.
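
For reference, the classic shape of such a wait: the predicate must be
re-checked in a loop, because a wake-up may be spurious or may race
with another state change.

  pthread_mutex_lock(&mutex);
  while (lock_word == old_value)        /* an "if" here can hang */
    pthread_cond_wait(&cond, &mutex);
  pthread_mutex_unlock(&mutex);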
2020-12-15 14:29:40 +02:00
Marko Mäkelä
ba2d45dc54 MDEV-24142: Remove INFORMATION_SCHEMA.INNODB_MUTEXES
Let us remove sux_lock::waits and the associated bookkeeping.
Starting with commit 1669c8890c,
the PERFORMANCE_SCHEMA instrumentation interface keeps
track of lock waits.

The view INFORMATION_SCHEMA.INNODB_MUTEXES only exported counts
of rw-lock waits.

Also, SHOW ENGINE INNODB MUTEX will no longer export any information
about rw-locks.
2020-12-03 15:28:53 +02:00
Marko Mäkelä
ac028ec5d8 MDEV-24142: Remove the LatchDebug interface to rw-locks
The latching order checks for rw-locks have not caught many bugs
in the past few years and they are greatly complicating the code.

The last time the debug checks were useful was in
commit 59caf2c3c1 (MDEV-13485).

The B-tree hang MDEV-14637 was not caught by LatchDebug,
because the granularity of the checks is not sufficient
to distinguish the levels of non-leaf B-tree pages.

The interface was already made dead code by the grandparent
commit 03ca6495df.
2020-12-03 15:27:50 +02:00
Marko Mäkelä
03ca6495df MDEV-24142: Replace InnoDB rw_lock_t with sux_lock
InnoDB buffer pool block and index tree latches depend on a
special kind of read-update-write lock that allows reentrant
(recursive) acquisition of the 'update' and 'write' locks
as well as an upgrade from 'update' lock to 'write' lock.
The 'update' lock allows any number of reader locks from
other threads, but no concurrent 'update' or 'write' lock.

If there were no requirement to support an upgrade from 'update'
to 'write', we could compose the lock out of two srw_lock
(implemented as any type of native rw-lock, such as SRWLOCK on
Microsoft Windows). Removing this requirement is very difficult,
so in commit f7e7f487d4b06695f91f6fbeb0396b9d87fc7bbf we
implemented an 'update' mode to our srw_lock.

Re-entrant or recursive locking is mostly needed when writing or
freeing BLOB pages, but also in crash recovery or when merging
buffered changes to an index page. The re-entrancy allows us to
attach a previously acquired page to a sub-mini-transaction that
will be committed before whatever else is holding the page latch.

The SUX lock supports Shared ('read'), Update, and eXclusive ('write')
locking modes. The S latches are not re-entrant, but a single S latch
may be acquired even if the thread already holds an U latch.

The idea of the U latch is to allow a write of something that concurrent
readers do not care about (such as the contents of BTR_SEG_LEAF,
BTR_SEG_TOP and other page allocation metadata structures, or
the MDEV-6076 PAGE_ROOT_AUTO_INC). (The PAGE_ROOT_AUTO_INC field
is only updated when a dict_table_t for the table exists, and only
read when a dict_table_t for the table is being added to dict_sys.)

block_lock::u_lock_try(bool for_io=true) is used in buf_flush_page()
to allow concurrent readers but no concurrent modifications while the
page is being written to the data file. That latch will be released
by buf_page_write_complete() in a different thread. Hence, we use
the special lock owner value FOR_IO.

The index_lock::u_lock() improves concurrency on operations that
involve non-leaf index pages.

The interface has been cleaned up a little. We will use
x_lock_recursive() instead of x_lock() when we know that a
lock is already held by the current thread. Similarly,
a lock upgrade from U to X is only allowed via u_x_upgrade()
or x_lock_upgraded() but not via x_lock().
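
For illustration, a typical pattern under the cleaned-up interface
might look like this (simplified):

  block->lock.u_lock();        /* concurrent S latches remain possible */
  /* ... modify fields that concurrent readers do not care about ... */
  block->lock.u_x_upgrade();   /* now exclude readers as well */
  /* ... modify fields that readers do care about ... */
  block->lock.x_unlock();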

We will disable the LatchDebug and sync_array interfaces to
InnoDB rw-locks.

The SEMAPHORES section of SHOW ENGINE INNODB STATUS output
will no longer include any information about InnoDB rw-locks,
only TTASEventMutex (cmake -DMUTEXTYPE=event) waits.
This will make a part of the 'innotop' script dead code.

The block_lock buf_block_t::lock will not be covered by any
PERFORMANCE_SCHEMA instrumentation.

SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES
will no longer output source code file names or line numbers.
The dict_index_t::lock will be identified by index and table names,
which should be much more useful. PERFORMANCE_SCHEMA is lumping
information about all dict_index_t::lock together as
event_name='wait/synch/sxlock/innodb/index_tree_rw_lock'.

buf_page_free(): Remove the file,line parameters. The sux_lock will
not store such diagnostic information.

buf_block_dbg_add_level(): Define as empty macro, to be removed
in a subsequent commit.

Unless the build was configured with cmake -DPLUGIN_PERFSCHEMA=NO
the index_lock dict_index_t::lock will be instrumented via
PERFORMANCE_SCHEMA. Similar to
commit 1669c8890c
we will distinguish lock waits by registering shared_lock,exclusive_lock
events instead of try_shared_lock,try_exclusive_lock.
Actual 'try' operations will not be instrumented at all.

rw_lock_list: Remove. After MDEV-24167, this only covered
buf_block_t::lock and dict_index_t::lock. We will output their
information by traversing buf_pool or dict_sys.
2020-12-03 15:19:49 +02:00
Marko Mäkelä
d46b42489a MDEV-24142 preparation: Add srw_mutex and srw_lock::u_lock()
The PERFORMANCE_SCHEMA insists on distinguishing read-update-write
locks from read-write locks, so we must add
template<bool support_u_lock> in rd_lock() and wr_lock() operations.

rw_lock::read_trylock(): Add template<bool prioritize_updater=false>
which is used by the srw_lock_low::read_lock() loop. As long as
an UPDATE lock has already been granted to some thread, we will grant
subsequent READ lock requests even if a waiting WRITE lock request
exists. This will be necessary to be compatible with existing usage
pattern of InnoDB rw_lock_t where the holder of SX-latch (which we
will rename to UPDATE latch) may acquire an additional S-latch
on the same object. For normal read-write locks without update operations
this should make no difference at all, because the rw_lock::UPDATER
flag would never be set.
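
A sketch of those semantics (the layout is simplified; the reader
count is assumed to occupy the low bits of the lock word):

  template<bool prioritize_updater= false>
  bool read_trylock(uint32_t &lk)
  {
    lk= word.load(std::memory_order_relaxed);
    do
    {
      if (lk & WRITER)
        return false;
      /* Defer to a waiting writer, unless an UPDATE lock is already
         granted and its holder may be requesting an additional S latch. */
      if ((lk & WRITER_WAITING) && !(prioritize_updater && (lk & UPDATER)))
        return false;
    }
    while (!word.compare_exchange_weak(lk, lk + 1,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed));
    return true;
  }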
2020-12-03 15:17:16 +02:00
Marko Mäkelä
1669c8890c MDEV-24167 fixup: Improve the PERFORMANCE_SCHEMA instrumentation
Let us try to avoid code bloat for the common case that
performance_schema is disabled at runtime, and use
ATTRIBUTE_NOINLINE member functions for instrumented latch acquisition.

Also, let us distinguish lock waits from non-contended lock requests
by using write_lock,read_lock for the requests that lead to waits,
and try_write_lock,try_read_lock for the wait-free lock acquisitions.
Actual 'try' operations are not being instrumented at all.
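
A sketch of such an instrumented acquisition (following the server's
PSI rwlock interface; pfs_psi, rd_lock_try() and rd_wait() are assumed
members of the surrounding class):

  ATTRIBUTE_NOINLINE
  void rd_lock(const char *file, unsigned line)
  {
    if (rd_lock_try())
      return;                  /* wait-free: no wait event is recorded */
    /* Only an actual conflict reports a wait to PERFORMANCE_SCHEMA. */
    PSI_rwlock_locker_state state;
    PSI_rwlock_locker *locker=
      PSI_RWLOCK_CALL(start_rwlock_rdwait)(&state, pfs_psi,
                                           PSI_RWLOCK_READLOCK, file, line);
    rd_wait();
    if (locker)
      PSI_RWLOCK_CALL(end_rwlock_rdwait)(locker, 0);
  }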
2020-12-03 09:55:53 +02:00
Marko Mäkelä
260161fc9f MDEV-24167 fixup: Avoid hangs in SRW_LOCK_DUMMY
In commit 1fdc161d8f we introduced
a mutex-and-condition-variable based fallback implementation
for platforms that lack a futex system call. That implementation
is prone to hangs.

Let us use separate condition variables for shared and exclusive requests.
2020-12-03 09:11:31 +02:00
Marko Mäkelä
1fdc161d8f MDEV-24167 fixup: Always derive srw_lock from rw_lock
Let us always base srw_lock on our own std::atomic<uint32_t>
based rw_lock. In this way, we can extend the locks in a portable
way across all platforms.

We will use futex system calls where available:
Linux, OpenBSD, and Microsoft Windows.

Elsewhere, we will emulate futex with a mutex and a condition variable.

Thanks to Daniel Black for testing this on OpenBSD.
2020-11-30 11:47:09 +02:00
Marko Mäkelä
565b0dd17d Merge 10.5 into 10.6 2020-11-30 11:30:26 +02:00
Marko Mäkelä
8fa6e36375 MDEV-24308: Remove some os_thread_ functions
os_thread_pf(): Remove.

os_thread_eq(), os_thread_yield(), os_thread_get_curr_id():
Define as macros.

ut_print_timestamp(), ut_sprintf_timestamp(): Simplify.
2020-11-30 11:15:31 +02:00
Marko Mäkelä
8b8969929d Merge 10.5 into 10.6 2020-11-26 07:36:53 +02:00
Marko Mäkelä
657fcdf430 MDEV-24280 InnoDB triggers too many independent periodic tasks
A side effect of MDEV-16264 is that a large number of threads will
be created at server startup, to be destroyed after a minute or two.

One source of such thread creation is srv_start_periodic_timer().
InnoDB is creating 3 periodic tasks: srv_master_callback (1Hz),
srv_error_monitor_task (1Hz), and srv_monitor_task (0.2Hz).

It appears that we can merge srv_error_monitor_task and srv_monitor_task
and have them invoked 4 times per minute (every 15 seconds). This will
affect our ability to enforce innodb_fatal_semaphore_wait_threshold and
some computations around BUF_LRU_STAT_N_INTERVAL.

We could remove srv_master_callback along with the DROP TABLE queue
at some point of time in the future. We must keep it independent
of the innodb_fatal_semaphore_wait_threshold detection, because
the background DROP TABLE queue could get stuck due to dict_sys
being locked by another thread. For now, srv_master_callback
must be invoked once per second, so that
innodb_flush_log_at_timeout=1 can work.

BUF_LRU_STAT_N_INTERVAL: Reduce the precision and extend the time
from 50*1 second to 4*15 seconds.

srv_error_monitor_timer: Remove.

MAX_MUTEX_NOWAIT: Increase from 20*1 second to 2*15 seconds.

srv_refresh_innodb_monitor_stats(): Avoid a repeated call to time(NULL).
Change the interval to less than 60 seconds.

srv_monitor(): Renamed from srv_monitor_task.

srv_monitor_task(): Renamed from srv_error_monitor_task().
Invoked only once every 15 seconds. Also invoke srv_monitor().
Increase the fatal_cnt threshold from 10*1 second to 1*15 seconds.

sync_array_print_long_waits_low(): Invoke time(NULL) only once.
Remove a bogus message about printouts for 30 seconds. Those
printouts were effectively already disabled in MDEV-16264
(commit 5e62b6a5e0).
2020-11-25 16:54:00 +02:00
Marko Mäkelä
edbde4a11f MDEV-24167: Replace fil_space::latch
We must avoid acquiring a latch while we are already holding one.
The tablespace latch was being acquired recursively in some
operations that allocate or free pages.
2020-11-24 15:43:12 +02:00
Marko Mäkelä
bdd88cfa34 MDEV-24167: Replace fts_cache_rw_lock, fts_cache_init_rw_lock with mutex
fts_cache_t::init_lock: Replace with mutex. This was only acquired
in exclusive mode.

fts_cache_t::lock: Replace with mutex. The only read-lock user was
i_s_fts_index_cache_fill() for producing content for the view
INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE.
2020-11-24 15:43:12 +02:00
Marko Mäkelä
1a1b7a6f16 MDEV-24167: Replace dict_operation_lock (dict_sys.latch) 2020-11-24 15:43:11 +02:00
Marko Mäkelä
06ef5509d0 MDEV-24167: Replace trx_purge_latch 2020-11-24 15:43:10 +02:00
Marko Mäkelä
63dd2a97e4 MDEV-24167: Replace trx_i_s_cache_lock 2020-11-24 15:43:09 +02:00
Marko Mäkelä
c561f9e6e8 MDEV-24167: Use lightweight srw_lock for btr_search_latch
Many InnoDB rw-locks unnecessarily depend on the complex
InnoDB rw_lock_t implementation that supports the SX lock mode
as well as recursive acquisition of X or SX locks.
One of them is the bunch of adaptive hash index search latches,
instrumented as btr_search_latch in PERFORMANCE_SCHEMA.
Let us introduce a simpler lock for those in order to
reduce overhead.

srw_lock: A simple read-write lock that does not support recursion.
On Microsoft Windows, this wraps SRWLOCK, only adding
runtime overhead if PERFORMANCE_SCHEMA is enabled.
On Linux (all architectures), this is implemented with
std::atomic<uint32_t> and the futex system call.
On other platforms, we will wrap mysql_rwlock_t with
zero runtime overhead.

The PERFORMANCE_SCHEMA instrumentation differs
from InnoDB rw_lock_t in that we will only invoke
PSI_RWLOCK_CALL(start_rwlock_wrwait) or
PSI_RWLOCK_CALL(start_rwlock_rdwait)
if there is an actual conflict.
2020-11-24 15:41:03 +02:00
Marko Mäkelä
1e5d989d2a MDEV-24167: Remove PFS instrumentation of buf_block_t
We always defined PFS_SKIP_BUFFER_MUTEX_RWLOCK, that is,
the latches of the buffer pool blocks were never instrumented
in PERFORMANCE_SCHEMA.

For some reason, the debug_latch (which enforces proper usage of
buffer-fixing in debug builds) was instrumented.
2020-11-20 08:55:41 +02:00
Marko Mäkelä
e8f8992801 MDEV-24188: Merge 10.4 into 10.5 2020-11-13 22:06:50 +02:00
Marko Mäkelä
749ecedfec MDEV-24188: Merge 10.3 into 10.4 2020-11-13 20:45:28 +02:00
Marko Mäkelä
f9f2f37495 MDEV-24188: Merge 10.2 into 10.3 2020-11-13 20:41:48 +02:00
Marko Mäkelä
bb328a2a27 MDEV-24188 Hang in buf_page_create() after reusing a previously freed page
The fix of MDEV-23456 (commit b1009ae5c1)
introduced a livelock between page flushing and a thread that is
executing buf_page_create().

buf_page_create(): If the current mini-transaction is holding
an exclusive latch on the page, do not attempt to acquire another
one, and do not care about any I/O fix.

mtr_t::have_x_latch(): Replaces mtr_t::get_fix_count().

dyn_buf_t::for_each_block(const Functor&) const: A new variant.

rw_lock_own(): Add a const qualifier.

Reviewed by: Thirunarayanan Balathandayuthapani
2020-11-13 20:16:39 +02:00
Marko Mäkelä
898521e2dd Merge 10.4 into 10.5 2020-10-30 11:15:30 +02:00
Marko Mäkelä
7b2bb67113 Merge 10.3 into 10.4 2020-10-29 13:38:38 +02:00
Marko Mäkelä
2b6f804490 Merge 10.2 into 10.3 2020-10-28 10:44:40 +02:00
Eugene Kosov
afc9d00c66 MDEV-23991 dict_table_stats_lock() has unnecessarily long scope
The patch removes dict_index_t::stats_latch. Table/index statistics are
now protected by dict_sys->mutex. That way, statistics computation can
happen in parallel in several threads, and dict_sys->mutex will be locked
only for a short period of time.

This patch is a joint work with Marko Mäkelä

dict_index_t::lock: make mutable, which allows passing a const pointer
when only the lock is touched in an object

btr_height_get()
btr_get_size(): make index argument const for better type safety

btr_estimate_number_of_different_key_vals(): now returns computed values
instead of setting fields in dict_index_t directly

remove everything related to dict_index_t::stats_latch

dict_stats_index_set_n_diff(): now returns computed values instead
of setting fields in dict_index_t directly

dict_stats_analyze_index():  now returns computed values instead
of setting fields in dict_index_t directly

Reviewed by: Marko Mäkelä
2020-10-27 19:09:20 +03:00
Marko Mäkelä
c27e53f459 MDEV-23855: Use normal mutex for log_sys.mutex, log_sys.flush_order_mutex
With an unreasonably small innodb_log_file_size, the page cleaner
thread would frequently acquire log_sys.flush_order_mutex and spend
a significant portion of CPU time spinning on that mutex when
determining the checkpoint LSN.
2020-10-26 17:53:55 +02:00
Marko Mäkelä
1657b7a583 Merge 10.4 to 10.5 2020-10-22 17:08:49 +03:00
Marko Mäkelä
46957a6a77 Merge 10.3 into 10.4 2020-10-22 13:27:18 +03:00
Marko Mäkelä
e3d692aa09 Merge 10.2 into 10.3 2020-10-22 08:26:28 +03:00
Marko Mäkelä
7cffb5f6e8 MDEV-23399: Performance regression with write workloads
The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.

The configuration parameters will be changed as follows:

innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)

Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
The parameter innodb_lru_scan_depth (srv_LRU_scan_depth) is used as follows:
 * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
 * As a buf_pool.free limit in buf_LRU_list_batch() for terminating
   the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().

The status variables will be changed as follows:

innodb_buffer_pool_pages_flushed: This also includes the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably; it is
updated one by one in buf_flush_page() to give more real-time
statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call

When moving to a single buffer pool instance in MDEV-15058, we missed
some opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.

Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.

The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.

The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a unless log_warnings>2.

Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are currently exclusively locked.
The MDEV-13670 "page_cleaner: 1000ms intended loop took" message
will be removed because, by design, buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.

We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.

To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.

For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.

mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicates, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().

buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.

buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.

buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.

buf_pool_t::done_flush_list: Condition variable for !n_flush_list.

buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.

buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.

buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().

flush_counters_t::unzip_LRU_evicted: Remove.

IORequest: Make more members const. FIXME: m_fil_node should be removed.

buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).

page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.

pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().

recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().

recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.

srv_started_redo: Replaces srv_start_state.

SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().

buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().

buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify an LRU batch by lsn=0. Merge the functions
buf_flush_start(), buf_flush_batch(), and buf_flush_end() directly into this
function, which was their only caller, and remove two unnecessary
buf_pool.mutex release/re-acquisition cycles that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.

buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.

buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.

buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().

buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.

buf_flush_check_neighbor(): Take id.fold() as a parameter.

buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.

buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.

buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_do_LRU_batch(): Return the number of pages flushed.

buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.

buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.

buf_LRU_check_size_of_non_data_objects(): Simplify the code.

buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.

buf_page_create(): Take a pre-allocated page as a parameter.

buf_pool_t::free_block(): Free a pre-allocated block.

recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.

BtrBulk::logFreeCheck(): Skip a redundant condition.

row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.

sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.

Reviewed by: Vladislav Vaintroub
2020-10-15 17:04:56 +03:00
Marko Mäkelä
b535a79044 MDEV-23399: Remove recv_writer_thread
Recovery works just fine without a separate thread whose only
task is to tell the page cleaner thread to do its job.

recv_sys_t::apply(): Flush the buffer pool at the end of each batch.

Reviewed by: Vladislav Vaintroub
2020-10-15 10:33:20 +03:00
Marko Mäkelä
861cd4ce28 MDEV-22871 fixup: Remove SYNC_BUF_PAGE_HASH
This was missed in commit 5155a300fa.
2020-10-05 13:05:11 +03:00
Marko Mäkelä
199bc67144 Cleanup: Remove unused SYNC_REC_LOCK
SYNC_REC_LOCK was never used in the public history of InnoDB,
starting with commit 132e667b0b.
2020-10-05 09:12:12 +03:00
Marko Mäkelä
46890349bf Cleanup: Remove fts_t::bg_threads_mutex, fts_t::bg_threads
The unused fts_t::bg_threads was added in
mysql/mysql-server@4b1049625c.

Any usage of fts_t::bg_threads_mutex was removed in
mysql/mysql-server@33c2404b39.
2020-10-02 08:36:50 +03:00
Marko Mäkelä
91d39f630d Cleanup: Remove unused mutex keys 2020-10-02 07:10:16 +03:00