This is a 10.5 backport from 10.6
commit 9593cccf28.
Adaptive flushing is enabled by setting innodb_max_dirty_pages_pct_lwm>0
(not default) and innodb_adaptive_flushing=ON (default).
There is also the parameter innodb_adaptive_flushing_lwm
(default: 10 per cent of the log capacity). It should enable some
adaptive flushing even when innodb_max_dirty_pages_pct_lwm=0.
That is not being changed here.
This idea was first presented by Inaam Rana several years ago,
and I discussed it with Jean-François Gagné at FOSDEM 2023.
buf_flush_page_cleaner(): When we are not near the log capacity limit
(neither buf_flush_async_lsn nor buf_flush_sync_lsn are set),
also try to move clean blocks from the buf_pool.LRU list to buf_pool.free
or initiate writes (but not the eviction) of dirty blocks, until
the remaining I/O capacity has been consumed.
buf_flush_LRU_list_batch(): Add the parameter bool evict, to specify
whether dirty least recently used pages (from buf_pool.LRU) should
be evicted immediately after they have been written out. Callers outside
buf_flush_page_cleaner() will pass evict=true, to retain the existing
behaviour.
buf_do_LRU_batch(): Add the parameter bool evict.
Return counts of evicted and flushed pages.
buf_flush_LRU(): Add the parameter bool evict.
Assume that the caller holds buf_pool.mutex and
will invoke buf_dblwr.flush_buffered_writes() afterwards.
buf_flush_list_holding_mutex(): A low-level variant of buf_flush_list()
whose caller must hold buf_pool.mutex and invoke
buf_dblwr.flush_buffered_writes() afterwards.
buf_flush_wait_batch_end_acquiring_mutex(): Remove. It is enough to have
buf_flush_wait_batch_end().
page_cleaner_flush_pages_recommendation(): Avoid some floating-point
arithmetics.
buf_flush_page(), buf_flush_check_neighbor(), buf_flush_check_neighbors(),
buf_flush_try_neighbors(): Rename the parameter "bool lru" to "bool evict".
buf_free_from_unzip_LRU_list_batch(): Remove the parameter.
Only actual page writes will contribute towards the limit.
buf_LRU_free_page(): Evict freed pages of temporary tables.
buf_pool.done_free: Broadcast whenever a block is freed
(and buf_pool.try_LRU_scan is set).
buf_pool_t::io_buf_t::reserve(): Retry indefinitely.
During the test encryption.innochecksum we easily run out of
these buffers for PAGE_COMPRESSED or ENCRYPTED pages.
Tested by Matthias Leich and Axel Schwenke
This patch is the result of running
run-clang-tidy -fix -header-filter=.* -checks='-*,modernize-use-equals-default' .
Code style changes have been done on top. The result of this change
leads to the following improvements:
1. Binary size reduction.
* For a -DBUILD_CONFIG=mysql_release build, the binary size is reduced by
~400kb.
* A raw -DCMAKE_BUILD_TYPE=Release reduces the binary size by ~1.4kb.
2. Compiler can better understand the intent of the code, thus it leads
to more optimization possibilities. Additionally it enabled detecting
unused variables that had an empty default constructor but not marked
so explicitly.
Particular change required following this patch in sql/opt_range.cc
result_keys, an unused template class Bitmap now correctly issues
unused variable warnings.
Setting Bitmap template class constructor to default allows the compiler
to identify that there are no side-effects when instantiating the class.
Previously the compiler could not issue the warning as it assumed Bitmap
class (being a template) would not be performing a NO-OP for its default
constructor. This prevented the "unused variable warning".
Since commit bd5a6403ca (MDEV-26033)
we can actually calculate the buf_pool.page_hash cell and latch
addresses while not holding buf_pool.mutex.
buf_page_alloc_descriptor(): Remove the MEM_UNDEFINED.
We now expect buf_page_t::hash to be zero-initialized.
buf_pool_t::hash_chain: Dedicated data type for buf_pool.page_hash.array.
buf_LRU_free_one_page(): Merged to the only caller
buf_pool_t::corrupted_evict().
A consistency check for fil_space_t::name is causing recovery failures
in MDEV-25180 (Atomic ALTER TABLE). So, we'd better remove that field
altogether.
fil_space_t::name was more or less a copy of dict_table_t::name
(except for some special cases), and it was not being used for
anything useful.
There used to be a name_hash, but it had been removed already in
commit a75dbfd718 (MDEV-12266).
We will also remove os_normalize_path(), OS_PATH_SEPARATOR,
OS_PATH_SEPATOR_ALT. On Microsoft Windows, we will treat \ and /
roughly in the same way. The intention is that for per-table
tablespaces, the filenames will always follow the pattern
prefix/databasename/tablename.ibd. (Any \ in the prefix must not
be converted.)
ut_basename_noext(): Remove (unused function).
read_link_file(): Replaces RemoteDatafile::read_link_file().
We will ensure that the last two path component separators are
forward slashes (converting up to 2 trailing backslashes on
Microsoft Windows), so that everywhere else we can
assume that data file names end in "/databasename/tablename.ibd".
Note: On Microsoft Windows, path names that start with \\?\ must
not contain / as path component separators. Previously, such paths
did work in the DATA DIRECTORY argument of InnoDB tables.
Reviewed by: Vladislav Vaintroub
std version has an advantage of a more convenient units implementation from
std::chrono. Now it's no need to multipy/divide to bring anything to
micro seconds.
SHOW ENGINE INNODB MUTEX functionality is completely removed,
as are the InnoDB latching order checks.
We will enforce innodb_fatal_semaphore_wait_threshold
only for dict_sys.mutex and lock_sys.mutex.
dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex.
lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex.
FIXME: srv_sys should be removed altogether; it is duplicating tpool
functionality.
fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must
not hold fil_system.mutex.
fil_close_all_files(): To prevent SAFE_MUTEX warnings for
fil_space_destroy_crypt_data(), we must not hold fil_system.mutex
while invoking fil_space_free_low() on a detached tablespace.
We will default to MUTEXTYPE=sys (using OSTrackMutex) for those
ib_mutex_t that have not been replaced yet.
The view INFORMATION_SCHEMA.INNODB_SYS_SEMAPHORE_WAITS is removed.
The parameter innodb_sync_array_size is removed.
FIXME: innodb_fatal_semaphore_wait_threshold will no longer be enforced.
We should enforce it for lock_sys.mutex and dict_sys.mutex somehow!
innodb_sync_debug=ON might still cover ib_mutex_t.
InnoDB buffer pool block and index tree latches depend on a
special kind of read-update-write lock that allows reentrant
(recursive) acquisition of the 'update' and 'write' locks
as well as an upgrade from 'update' lock to 'write' lock.
The 'update' lock allows any number of reader locks from
other threads, but no concurrent 'update' or 'write' lock.
If there were no requirement to support an upgrade from 'update'
to 'write', we could compose the lock out of two srw_lock
(implemented as any type of native rw-lock, such as SRWLOCK on
Microsoft Windows). Removing this requirement is very difficult,
so in commit f7e7f487d4b06695f91f6fbeb0396b9d87fc7bbf we
implemented an 'update' mode to our srw_lock.
Re-entrant or recursive locking is mostly needed when writing or
freeing BLOB pages, but also in crash recovery or when merging
buffered changes to an index page. The re-entrancy allows us to
attach a previously acquired page to a sub-mini-transaction that
will be committed before whatever else is holding the page latch.
The SUX lock supports Shared ('read'), Update, and eXclusive ('write')
locking modes. The S latches are not re-entrant, but a single S latch
may be acquired even if the thread already holds an U latch.
The idea of the U latch is to allow a write of something that concurrent
readers do not care about (such as the contents of BTR_SEG_LEAF,
BTR_SEG_TOP and other page allocation metadata structures, or
the MDEV-6076 PAGE_ROOT_AUTO_INC). (The PAGE_ROOT_AUTO_INC field
is only updated when a dict_table_t for the table exists, and only
read when a dict_table_t for the table is being added to dict_sys.)
block_lock::u_lock_try(bool for_io=true) is used in buf_flush_page()
to allow concurrent readers but no concurrent modifications while the
page is being written to the data file. That latch will be released
by buf_page_write_complete() in a different thread. Hence, we use
the special lock owner value FOR_IO.
The index_lock::u_lock() improves concurrency on operations that
involve non-leaf index pages.
The interface has been cleaned up a little. We will use
x_lock_recursive() instead of x_lock() when we know that a
lock is already held by the current thread. Similarly,
a lock upgrade from U to X is only allowed via u_x_upgrade()
or x_lock_upgraded() but not via x_lock().
We will disable the LatchDebug and sync_array interfaces to
InnoDB rw-locks.
The SEMAPHORES section of SHOW ENGINE INNODB STATUS output
will no longer include any information about InnoDB rw-locks,
only TTASEventMutex (cmake -DMUTEXTYPE=event) waits.
This will make a part of the 'innotop' script dead code.
The block_lock buf_block_t::lock will not be covered by any
PERFORMANCE_SCHEMA instrumentation.
SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES
will no longer output source code file names or line numbers.
The dict_index_t::lock will be identified by index and table names,
which should be much more useful. PERFORMANCE_SCHEMA is lumping
information about all dict_index_t::lock together as
event_name='wait/synch/sxlock/innodb/index_tree_rw_lock'.
buf_page_free(): Remove the file,line parameters. The sux_lock will
not store such diagnostic information.
buf_block_dbg_add_level(): Define as empty macro, to be removed
in a subsequent commit.
Unless the build was configured with cmake -DPLUGIN_PERFSCHEMA=NO
the index_lock dict_index_t::lock will be instrumented via
PERFORMANCE_SCHEMA. Similar to
commit 1669c8890c
we will distinguish lock waits by registering shared_lock,exclusive_lock
events instead of try_shared_lock,try_exclusive_lock.
Actual 'try' operations will not be instrumented at all.
rw_lock_list: Remove. After MDEV-24167, this only covered
buf_block_t::lock and dict_index_t::lock. We will output their
information by traversing buf_pool or dict_sys.
btr_search_sys::parts[]: A single structure for the partitions of
the adaptive hash index. Replaces the 3 separate arrays:
btr_search_latches[], btr_search_sys->hash_tables,
btr_search_sys->hash_tables[i]->heap.
hash_table_t::heap, hash_table_t::adaptive: Remove.
ha0ha.cc: Remove. Move all code to btr0sea.cc.
ut_filename_hash(): Add better casts to please the compiler:
warning C4307: '*': integral constant overflow
This regression was introduced in
commit dd77f072f9 (MDEV-22841).
Change how lookup for the "auto" PSI_memory_keys is done.
Lookup for filename hashes (integers), instead of C strings
Generate these hashes at the compile time with constexpr,
rather than at runtime.
This applies to large allocations.
This maps to the way Linux does it in MDEV-10814 except FreeBSD uses
different constants.
Adjust error string to match to implementation.
Tested on FreeBSD-12.0
In MDEV-10814, a missing argument caused a later optional argument
(bool true) to be treated as a size. The unmap of this memory occurs
during shutdown and resizing innodb buffer pool. As a result the
memory is lost but still allocated until shutdown is completed.
In 1dc78d35a0beb9620bae1f4841cc07389b425707 the arguments
to a deallocate_large(dontdump=true) was passed a wrong value.
To avoid accidential calling large memory function that have
DODUMP/DONTDUMP options and missing arguments, the functions
have been given distinct names.
Remove unused InnoDB function parameters and functions.
i_s_sys_virtual_fill_table(): Do not allocate heap memory.
mtr_is_block_fix(): Replace with mtr_memo_contains().
mtr_is_page_fix(): Replace with mtr_memo_contains_page().
Note: Linux only
Core dumps of large buffer pool pages take time and space
and pose potential data expose in scenarios where data-at-rest
encryption is deployed.
Here we use madvise(MADV_DONT_DUMP) on large memory allocations
used by the innodb buffer pool, log_sys and recv_sys. The effect
of this system call is that these memory areas will not appear in
a core dump. Data from these buffers is rarely useful in fault
diagnosis.
log_sys and recv_sys structures now use large memory allocations
for their large buffer.
Debug builds don't include the madvise syscall and as such will
include full core dumps.
A function, buf_madvise_do_dump, is added but never called. It
is there to be called from a debugger to re-enable the core
dumping of all of these pages if for some reason the entire
contents of these buffers are needed.
Idea thanks to Hartmut Holzgraefe
Take snapshot of registered read-write transaction identifiers directly
from rw_trx_hash. It immediately saves one trx_sys.mutex lock, reduces
size of another critical section protected by this mutex, and makes
further optimisations like removing trx_sys_t::serialisation_list
possible.
Downside of this approach is bigger overhead for view opening, because
iterating LF_HASH is more expensive compared to taking snapshot of an
array. However for low concurrency overhead difference is negligible,
while for high concurrency mutex is much bigger evil.
Currently we still take trx_sys.mutex to serialise ReadView creation.
This is required to keep serialisation_list ordered by trx->no as well
as not to let purge thread to create more recent snapshot while another
thread gets suspended during creation of older snapshot. This will
become completely mutex free along with serialisation_list removal.
Compared to previous implementation removing element from rw_trx_hash
and serialisation_list is not atomic. We disregard all possible bad
consequences (if there're any) since it will be solved along with
serialisation_list removal.
Contains also:
MDEV-10549 mysqld: sql/handler.cc:2692: int handler::ha_index_first(uchar*): Assertion `table_share->tmp_table != NO_TMP_TABLE || m_lock_type != 2' failed. (branch bb-10.2-jan)
Unlike MySQL, InnoDB still uses THR_LOCK in MariaDB
MDEV-10548 Some of the debug sync waits do not work with InnoDB 5.7 (branch bb-10.2-jan)
enable tests that were fixed in MDEV-10549
MDEV-10548 Some of the debug sync waits do not work with InnoDB 5.7 (branch bb-10.2-jan)
fix main.innodb_mysql_sync - re-enable online alter for partitioned innodb tables
Contains also
MDEV-10547: Test multi_update_innodb fails with InnoDB 5.7
The failure happened because 5.7 has changed the signature of
the bool handler::primary_key_is_clustered() const
virtual function ("const" was added). InnoDB was using the old
signature which caused the function not to be used.
MDEV-10550: Parallel replication lock waits/deadlock handling does not work with InnoDB 5.7
Fixed mutexing problem on lock_trx_handle_wait. Note that
rpl_parallel and rpl_optimistic_parallel tests still
fail.
MDEV-10156 : Group commit tests fail on 10.2 InnoDB (branch bb-10.2-jan)
Reason: incorrect merge
MDEV-10550: Parallel replication can't sync with master in InnoDB 5.7 (branch bb-10.2-jan)
Reason: incorrect merge