Historically, InnoDB supported a buggy page checksum algorithm that did not
compute a checksum over the full page. Later, well before MySQL 4.1
introduced .ibd files and the innodb_file_per_table option, the algorithm
was corrected and the first 4 bytes of each page were redefined to be
a checksum.
The original checksum was so slow that an option to disable page checksum
was introduced for benchmarketing purposes.
The Intel Nehalem microarchitecture introduced the SSE4.2 instruction set
extension, which includes instructions for faster computation of CRC-32C.
In MySQL 5.6 (and MariaDB 10.0), innodb_checksum_algorithm=crc32 was
implemented to make of that. As that option was changed to be the default
in MySQL 5.7, a bug was found on big-endian platforms and some work-around
code was added to weaken that checksum further. MariaDB disables that
work-around by default since MDEV-17958.
Later, SIMD-accelerated CRC-32C has been implemented in MariaDB for POWER
and ARM and also for IA-32/AMD64, making use of carry-less multiplication
where available.
Long story short, innodb_checksum_algorithm=crc32 is faster and more secure
than the pre-MySQL 5.6 checksum, called innodb_checksum_algorithm=innodb.
It should have removed any need to use innodb_checksum_algorithm=none.
The setting innodb_checksum_algorithm=crc32 is the default in
MySQL 5.7 and MariaDB Server 10.2, 10.3, 10.4. In MariaDB 10.5,
MDEV-19534 made innodb_checksum_algorithm=full_crc32 the default.
It is even faster and more secure.
The default settings in MariaDB do allow old data files to be read,
no matter if a worse checksum algorithm had been used.
(Unfortunately, before innodb_checksum_algorithm=full_crc32,
the data files did not identify which checksum algorithm is being used.)
The non-default settings innodb_checksum_algorithm=strict_crc32 or
innodb_checksum_algorithm=strict_full_crc32 would only allow CRC-32C
checksums. The incompatibility with old data files is why they are
not the default.
The newest server not to support innodb_checksum_algorithm=crc32
were MySQL 5.5 and MariaDB 5.5. Both have reached their end of life.
A valid reason for using innodb_checksum_algorithm=innodb could have
been the ability to downgrade. If it is really needed, data files
can be converted with an older version of the innochecksum utility.
Because there is no good reason to allow data files to be written
with insecure checksums, we will reject those option values:
innodb_checksum_algorithm=none
innodb_checksum_algorithm=innodb
innodb_checksum_algorithm=strict_none
innodb_checksum_algorithm=strict_innodb
Furthermore, the following innochecksum options will be removed,
because only strict crc32 will be supported:
innochecksum --strict-check=crc32
innochecksum -C crc32
innochecksum --write=crc32
innochecksum -w crc32
If a user wishes to convert a data file to use a different checksum
(so that it might be used with the no-longer-supported
MySQL 5.5 or MariaDB 5.5, which do not support IMPORT TABLESPACE
nor system tablespace format changes that were made in MariaDB 10.3),
then the innochecksum tool from MariaDB 10.2, 10.3, 10.4, 10.5 or
MySQL 5.7 can be used.
Reviewed by: Thirunarayanan Balathandayuthapani
The condition variables that were introduced in
commit 7cffb5f6e8 (MDEV-23399)
are never instrumented with PERFORMANCE_SCHEMA.
Let us avoid the storage overhead and dead code.
SHOW ENGINE INNODB MUTEX functionality is completely removed,
as are the InnoDB latching order checks.
We will enforce innodb_fatal_semaphore_wait_threshold
only for dict_sys.mutex and lock_sys.mutex.
dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex.
lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex.
FIXME: srv_sys should be removed altogether; it is duplicating tpool
functionality.
fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must
not hold fil_system.mutex.
fil_close_all_files(): To prevent SAFE_MUTEX warnings for
fil_space_destroy_crypt_data(), we must not hold fil_system.mutex
while invoking fil_space_free_low() on a detached tablespace.
We observed a race condition that involved two threads
executing fil_flush_file_spaces() and one thread
executing fil_delete_tablespace(). After one of the
fil_flush_file_spaces() observed that
space.needs_flush_not_stopping() is set and was
releasing the fil_system.mutex, the other fil_flush_file_spaces()
would complete the execution of fil_space_t::flush_low() on
the same tablespace. Then, fil_delete_tablespace() would
destroy the object, because the value of fil_space_t::n_pending
did not prevent that. Finally, the fil_flush_file_spaces() would
resume execution and invoke fil_space_t::flush_low() on the freed
object.
This race condition was introduced in
commit 118e258aaa of MDEV-23855.
fil_space_t::flush(): Add a template parameter that indicates
whether the caller is holding a reference to prevent the
tablespace from being freed.
buf_dblwr_t::flush_buffered_writes_completed(),
row_quiesce_table_start(): Acquire a reference for the duration
of the fil_space_t::flush_low() operation. It should be impossible
for the object to be freed in these code paths, but we want to
satisfy the debug assertions.
fil_space_t::flush_low(): Do not increment or decrement the
reference count, but instead assert that the caller is holding
a reference.
fil_space_extend_must_retry(), fil_flush_file_spaces():
Acquire a reference before releasing fil_system.mutex.
This is what will fix the race condition.
The counters in srv_stats use std::atomic and multiple cache lines per
counter. This is an overkill in a case where a critical section already
exists in the code. A regular variable will work just fine, with much
smaller memory bus impact.
The latching order checks for rw-locks have not caught many bugs
in the past few years and they are greatly complicating the code.
Last time the debug checks were useful was in
commit 59caf2c3c1 (MDEV-13485).
The B-tree hang MDEV-14637 was not caught by LatchDebug,
because the granularity of the checks is not sufficient
to distinguish the levels of non-leaf B-tree pages.
The interface was already made dead code by the grandparent
commit 03ca6495df.
InnoDB buffer pool block and index tree latches depend on a
special kind of read-update-write lock that allows reentrant
(recursive) acquisition of the 'update' and 'write' locks
as well as an upgrade from 'update' lock to 'write' lock.
The 'update' lock allows any number of reader locks from
other threads, but no concurrent 'update' or 'write' lock.
If there were no requirement to support an upgrade from 'update'
to 'write', we could compose the lock out of two srw_lock
(implemented as any type of native rw-lock, such as SRWLOCK on
Microsoft Windows). Removing this requirement is very difficult,
so in commit f7e7f487d4b06695f91f6fbeb0396b9d87fc7bbf we
implemented an 'update' mode to our srw_lock.
Re-entrant or recursive locking is mostly needed when writing or
freeing BLOB pages, but also in crash recovery or when merging
buffered changes to an index page. The re-entrancy allows us to
attach a previously acquired page to a sub-mini-transaction that
will be committed before whatever else is holding the page latch.
The SUX lock supports Shared ('read'), Update, and eXclusive ('write')
locking modes. The S latches are not re-entrant, but a single S latch
may be acquired even if the thread already holds an U latch.
The idea of the U latch is to allow a write of something that concurrent
readers do not care about (such as the contents of BTR_SEG_LEAF,
BTR_SEG_TOP and other page allocation metadata structures, or
the MDEV-6076 PAGE_ROOT_AUTO_INC). (The PAGE_ROOT_AUTO_INC field
is only updated when a dict_table_t for the table exists, and only
read when a dict_table_t for the table is being added to dict_sys.)
block_lock::u_lock_try(bool for_io=true) is used in buf_flush_page()
to allow concurrent readers but no concurrent modifications while the
page is being written to the data file. That latch will be released
by buf_page_write_complete() in a different thread. Hence, we use
the special lock owner value FOR_IO.
The index_lock::u_lock() improves concurrency on operations that
involve non-leaf index pages.
The interface has been cleaned up a little. We will use
x_lock_recursive() instead of x_lock() when we know that a
lock is already held by the current thread. Similarly,
a lock upgrade from U to X is only allowed via u_x_upgrade()
or x_lock_upgraded() but not via x_lock().
We will disable the LatchDebug and sync_array interfaces to
InnoDB rw-locks.
The SEMAPHORES section of SHOW ENGINE INNODB STATUS output
will no longer include any information about InnoDB rw-locks,
only TTASEventMutex (cmake -DMUTEXTYPE=event) waits.
This will make a part of the 'innotop' script dead code.
The block_lock buf_block_t::lock will not be covered by any
PERFORMANCE_SCHEMA instrumentation.
SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES
will no longer output source code file names or line numbers.
The dict_index_t::lock will be identified by index and table names,
which should be much more useful. PERFORMANCE_SCHEMA is lumping
information about all dict_index_t::lock together as
event_name='wait/synch/sxlock/innodb/index_tree_rw_lock'.
buf_page_free(): Remove the file,line parameters. The sux_lock will
not store such diagnostic information.
buf_block_dbg_add_level(): Define as empty macro, to be removed
in a subsequent commit.
Unless the build was configured with cmake -DPLUGIN_PERFSCHEMA=NO
the index_lock dict_index_t::lock will be instrumented via
PERFORMANCE_SCHEMA. Similar to
commit 1669c8890c
we will distinguish lock waits by registering shared_lock,exclusive_lock
events instead of try_shared_lock,try_exclusive_lock.
Actual 'try' operations will not be instrumented at all.
rw_lock_list: Remove. After MDEV-24167, this only covered
buf_block_t::lock and dict_index_t::lock. We will output their
information by traversing buf_pool or dict_sys.
Starting with commit ef3f71fa74
MemorySanitizer would complain that we are writing uninitialized
data via the doublewrite buffer.
buf_dblwr_t::add_to_batch(): Zero out any unused part of the
doublewrite buffer, for PAGE_COMPRESSED and ROW_FORMAT=COMPRESSED
tables.
Reviewed by: Eugene Kosov
Synchronous writes and calls to fdatasync(), fsync() or
FlushFileBuffers() would ruin performance. So, let us
submit asynchronous writes for the doublewrite buffer.
We submit a single request for the likely case that the
two doublewrite buffers are contiquous in the system tablespace.
buf_dblwr_t::flush_buffered_writes_completed(): The completion callback
of buf_dblwr_t::flush_buffered_writes().
os_aio_wait_until_no_pending_writes(): Also wait for doublewrite batches.
buf_dblwr_t::element::space: Remove. We can simply use
element::request.node->space instead.
Reviewed by: Vladislav Vaintroub
Merge n_pending_ios, n_pending_ops to std::atomic<uint32_t> n_pending.
Change some more fil_space_t members to uint32_t to reduce
the memory footprint.
fil_space_t::add(), fil_ibd_create(): Attach the already opened
handle to the tablespace, and enforce the fil_system.n_open limit.
dict_boot(): Initialize fil_system.max_assigned_id.
srv_boot(): Call srv_thread_pool_init() before anything else,
so that files should be opened in the correct mode on Windows.
fil_ibd_create(): Create the file in OS_FILE_AIO mode, just like
fil_node_open_file_low() does it.
dict_table_t::is_accessible(): Replaces fil_table_accessible().
Reviewed by: Vladislav Vaintroub
Also fixes MDEV-23929: innodb_flush_neighbors is not being ignored
for system tablespace on SSD
When the maximum configured number of file is exceeded, InnoDB will
close data files. We used to maintain a fil_system.LRU list and
a counter fil_node_t::n_pending to achieve this, at the huge cost
of multiple fil_system.mutex operations per I/O operation.
fil_node_open_file_low(): Implement a FIFO replacement policy:
The last opened file will be moved to the end of fil_system.space_list,
and files will be closed from the start of the list. However, we will
not move tablespaces in fil_system.space_list while
i_s_tablespaces_encryption_fill_table() is executing
(producing output for INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION)
because it may cause information of some tablespaces to go missing.
We also avoid this in mariabackup --backup because datafiles_iter_next()
assumes that the ordering is not changed.
IORequest: Fold more parameters to IORequest::type.
fil_space_t::io(): Replaces fil_io().
fil_space_t::flush(): Replaces fil_flush().
OS_AIO_IBUF: Remove. We will always issue synchronous reads of the
change buffer pages in buf_read_page_low().
We will always ignore some errors for background reads.
This should reduce fil_system.mutex contention a little.
fil_node_t::complete_write(): Replaces fil_node_t::complete_io().
On both read and write completion, fil_space_t::release_for_io()
will have to be called.
fil_space_t::io(): Do not acquire fil_system.mutex in the normal
code path.
xb_delta_open_matching_space(): Do not try to open the system tablespace
which was already opened. This fixes a file sharing violation in
mariabackup --prepare --incremental.
Reviewed by: Vladislav Vaintroub
InnoDB stores a 32-bit page number in page headers and in some
data structures, such as FIL_ADDR (consisting of a 32-bit page number
and a 16-bit byte offset within a page). For better compile-time
error detection and to reduce the memory footprint in some data
structures, let us use a uint32_t for the page number, instead
of ulint (size_t) which can be 64 bits.
The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.
The configuration parameters will be changed as follows:
innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)
Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows:
* When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
* As a buf_pool.free limit in buf_LRU_list_batch() for terminating
the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().
The status variables will be changed as follows:
innodb_buffer_pool_pages_flushed: This includes also the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably,
updated one by one in buf_flush_page() to give more real-time
statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.
The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call
When moving to a single buffer pool instance in MDEV-15058, we missed
some opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.
Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.
The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.
The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a unless log_warnings>2.
Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are are currently exclusively locked.
The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message
will be removed, because by design, the buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.
We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.
To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.
For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.
mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().
buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.
buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.
buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.
buf_pool_t::done_flush_list: Condition variable for !n_flush_list.
buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.
buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.
buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().
flush_counters_t::unzip_LRU_evicted: Remove.
IORequest: Make more members const. FIXME: m_fil_node should be removed.
buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).
page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.
pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().
recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().
recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.
srv_started_redo: Replaces srv_start_state.
SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().
buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().
buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
function, which was their only caller, and remove 2 unnecessary
buf_pool.mutex release/re-acquisition that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.
buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.
buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.
buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().
buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.
buf_flush_check_neighbor(): Take id.fold() as a parameter.
buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.
buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.
buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_do_LRU_batch(): Return the number of pages flushed.
buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.
buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.
buf_LRU_check_size_of_non_data_objects(): Simplify the code.
buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.
buf_page_create(): Take a pre-allocated page as a parameter.
buf_pool_t::free_block(): Free a pre-allocated block.
recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.
BtrBulk::logFreeCheck(): Skip a redundant condition.
row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.
sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.
Reviewed by: Vladislav Vaintroub
buf_page_create() is invoked when page is initialized. So that
previous contents of the page ignored. In few cases, it calls
buf_page_get_gen() is called to fetch the page from buffer pool.
It should take x-latch on the page. If other thread uses the block
or block io state is different from BUF_IO_NONE then release the
mutex and check the state and buffer fix count again. For compressed
page, use the existing free block from LRU list to create new page.
Retry to fetch the compressed page if it is in flush list
fseg_create(), fseg_create_general(): Introduce block as a parameter
where segment header is placed. It is used to avoid repetitive
x-latch on the same page
Change the assert to check whether the page has SX latch and
X latch in all callee function of buf_page_create()
mtr_t::get_fix_count(): Get the buffer fix count of the given
block added by the mtr
FindBlock is added to find the buffer fix count of the given
block acquired by the mini-transaction
The purpose of the InnoDB doublewrite buffer is to make InnoDB
tolerant against cases where the server was killed in the middle
of a page write. (In Linux, killing a process may interrupt a
write system call, typically on a 4096-byte boundary.)
There may exist multiple copies of a page number in the doublewrite
buffer. Recovery should choose the latest valid copy of the page.
By design, the FIL_PAGE_LSN must not precede the latest checkpoint LSN
nor be later than the end of the recovered log.
For page_compressed and encrypted pages, we were missing proper
consistency checks. In the 10.4 data set generated for in MDEV-23231,
the data file contained a valid page_compressed page, and an
identical copy of that page was also present in the doublewrite
buffer. But, recovery would incorrectly consider the page invalid
and restore an uncompressed copy of the same page that had been
written before the log checkpoint. (In fact, no redo log was to
be applied to that page.)
buf_dblwr_process(): Validate the FIL_PAGE_LSN in the doublewrite
buffer pages, and always skip page 0, because those pages should
have been recovered by Datafile::restore_from_doublewrite() if
necessary.
Datafile::restore_from_doublewrite(): Choose the latest applicable
page from the doublewrite buffer.
recv_dblwr_t::find_page(): Also validate encrypted or
page_compressed pages.
recv_dblwr_t::validate_page(): New function to validate a page,
either a copy in a data file or in the doublewrite buffer.
Also validate encrypted or page_compressed pages.
This is joint work with Thirunarayanan Balathandayuthapani.
In commit 0f90728bc0 (MDEV-16809)
we introduced the configuration option innodb_log_optimize_ddl
for controlling whether native index creation or table-rebuild
in InnoDB should avoid writing full redo log.
Fungo Wang reported that this option is causing occasional failures.
The reason is that pages may be written to data files in an
inconsistent state. Applying log records to such inconsistent pages
may fail.
The solution is to always invoke PageBulk::finish() before page latches
may be released, to ensure that the page contents is in a consistent
state.
Something similar was implemented in MySQL 8.0.13:
mysql/mysql-server@d1254b9473
buf_block_t::skip_flush_check: Remove. Suppressing consistency checks
is a bad idea.
PageBulk::needs_finish(): New predicate: Determine whether
PageBulk::finish() must fix up the page.
PageBulk::init(): Clear PAGE_DIRECTION to ensure that needs_finish()
will hold. We change the field from PAGE_NO_DIRECTION to 0
and back without writing redo log. This trick avoids the need
to introduce any new data member to PageBulk.
PageBulk::insert(): Replace some high-level accessors to bypass
debug assertions related to PAGE_HEAP_TOP that we will be violating
until finish() has been executed.
PageBulk::finish(): Tolerate m_rec_no==0. We must invoke this also
on an empty page, to ensure that PAGE_HEAP_TOP is initialized.
PageBulk::commit(): Always invoke finish().
PageBulk::release(), BtrBulk::pageSplit(), BtrBulk::storeExt(),
BtrBulk::finish(): Invoke PageBulk::finish().
MemorySanitizer (clang -fsanitize=memory) requires that all code
be compiled with instrumentation enabled. The only exception is the
C runtime library. Failure to use instrumented libraries will cause
bogus messages about memory being uninitialized.
In WITH_MSAN builds, we must avoid calling getservbyname(),
because even though it is a standard library function, it is
not instrumented, not even in clang 10.
Note: Before MariaDB Server 10.5, ./mtr will typically fail
due to the old PCRE library, which was updated in MDEV-14024.
The following cmake options were tested on 10.5
in commit 94d0bb4dbe:
cmake \
-DCMAKE_C_FLAGS='-march=native -O2' \
-DCMAKE_CXX_FLAGS='-stdlib=libc++ -march=native -O2' \
-DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=Debug \
-DWITH_INNODB_{BZIP2,LZ4,LZMA,LZO,SNAPPY}=OFF \
-DPLUGIN_{ARCHIVE,TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT,SPIDER}=NO \
-DWITH_SAFEMALLOC=OFF \
-DWITH_{ZLIB,SSL,PCRE}=bundled \
-DHAVE_LIBAIO_H=0 \
-DWITH_MSAN=ON
MEM_MAKE_DEFINED(): An alias for VALGRIND_MAKE_MEM_DEFINED()
and __msan_unpoison().
MEM_GET_VBITS(), MEM_SET_VBITS(): Aliases for
VALGRIND_GET_VBITS(), VALGRIND_SET_VBITS(), __msan_copy_shadow().
InnoDB: Replace the UNIV_MEM_ macros with corresponding MEM_ macros.
ut_crc32_8_hw(), ut_crc32_64_low_hw(): Use the compiler built-in
functions instead of inline assembler when building WITH_MSAN.
This will require at least -msse4.2 when building for IA-32 or AMD64.
The inline assembler would not be instrumented, and would thus cause
bogus failures.
User-visible changes: The INFORMATION_SCHEMA views INNODB_BUFFER_PAGE
and INNODB_BUFFER_PAGE_LRU will report a dummy value FLUSH_TYPE=0
and will no longer report the PAGE_STATE value READY_FOR_USE.
We will remove some fields from buf_page_t and move much code to
member functions of buf_pool_t and buf_page_t, so that the access
rules of data members can be enforced consistently.
Evicting or adding pages in buf_pool.LRU will remain covered by
buf_pool.mutex.
Evicting or adding pages in buf_pool.page_hash will remain
covered by both buf_pool.mutex and the buf_pool.page_hash X-latch.
After this fix, buf_pool.page_hash lookups can entirely
avoid acquiring buf_pool.mutex, only relying on
buf_pool.hash_lock_get() S-latch.
Similarly, buf_flush_check_neighbors() can will rely solely on
buf_pool.mutex, no buf_pool.page_hash latch at all.
The buf_pool.mutex is rather contended in I/O heavy benchmarks,
especially when the workload does not fit in the buffer pool.
The first attempt to alleviate the contention was the
buf_pool_t::mutex split in
commit 4ed7082eef
which introduced buf_block_t::mutex, which we are now removing.
Later, multiple instances of buf_pool_t were introduced
in commit c18084f71b
and recently removed by us in
commit 1a6f708ec5 (MDEV-15058).
UNIV_BUF_DEBUG: Remove. This option to enable some buffer pool
related debugging in otherwise non-debug builds has not been used
for years. Instead, we have been using UNIV_DEBUG, which is enabled
in CMAKE_BUILD_TYPE=Debug.
buf_block_t::mutex, buf_pool_t::zip_mutex: Remove. We can mainly rely on
std::atomic and the buf_pool.page_hash latches, and in some cases
depend on buf_pool.mutex or buf_pool.flush_list_mutex just like before.
We must always release buf_block_t::lock before invoking
unfix() or io_unfix(), to prevent a glitch where a block that was
added to the buf_pool.free list would apper X-latched. See
commit c5883debd6 how this glitch
was finally caught in a debug environment.
We move some buf_pool_t::page_hash specific code from the
ha and hash modules to buf_pool, for improved readability.
buf_pool_t::close(): Assert that all blocks are clean, except
on aborted startup or crash-like shutdown.
buf_pool_t::validate(): No longer attempt to validate
n_flush[] against the number of BUF_IO_WRITE fixed blocks,
because buf_page_t::flush_type no longer exists.
buf_pool_t::watch_set(): Replaces buf_pool_watch_set().
Reduce mutex contention by separating the buf_pool.watch[]
allocation and the insert into buf_pool.page_hash.
buf_pool_t::page_hash_lock<bool exclusive>(): Acquire a
buf_pool.page_hash latch.
Replaces and extends buf_page_hash_lock_s_confirm()
and buf_page_hash_lock_x_confirm().
buf_pool_t::READ_AHEAD_PAGES: Renamed from BUF_READ_AHEAD_PAGES.
buf_pool_t::curr_size, old_size, read_ahead_area, n_pend_reads:
Use Atomic_counter.
buf_pool_t::running_out(): Replaces buf_LRU_buf_pool_running_out().
buf_pool_t::LRU_remove(): Remove a block from the LRU list
and return its predecessor. Incorporates buf_LRU_adjust_hp(),
which was removed.
buf_page_get_gen(): Remove a redundant call of fsp_is_system_temporary(),
for mode == BUF_GET_IF_IN_POOL_OR_WATCH, which is only used by
BTR_DELETE_OP (purge), which is never invoked on temporary tables.
buf_free_from_unzip_LRU_list_batch(): Avoid redundant assignments.
buf_LRU_free_from_unzip_LRU_list(): Simplify the loop condition.
buf_LRU_free_page(): Clarify the function comment.
buf_flush_check_neighbor(), buf_flush_check_neighbors():
Rewrite the construction of the page hash range. We will hold
the buf_pool.mutex for up to buf_pool.read_ahead_area (at most 64)
consecutive lookups of buf_pool.page_hash.
buf_flush_page_and_try_neighbors(): Remove.
Merge to its only callers, and remove redundant operations in
buf_flush_LRU_list_batch().
buf_read_ahead_random(), buf_read_ahead_linear(): Rewrite.
Do not acquire buf_pool.mutex, and iterate directly with page_id_t.
ut_2_power_up(): Remove. my_round_up_to_next_power() is inlined
and avoids any loops.
fil_page_get_prev(), fil_page_get_next(), fil_addr_is_null(): Remove.
buf_flush_page(): Add a fil_space_t* parameter. Minimize the
buf_pool.mutex hold time. buf_pool.n_flush[] is no longer updated
atomically with the io_fix, and we will protect most buf_block_t
fields with buf_block_t::lock. The function
buf_flush_write_block_low() is removed and merged here.
buf_page_init_for_read(): Use static linkage. Initialize the newly
allocated block and acquire the exclusive buf_block_t::lock while not
holding any mutex.
IORequest::IORequest(): Remove the body. We only need to invoke
set_punch_hole() in buf_flush_page() and nowhere else.
buf_page_t::flush_type: Remove. Replaced by IORequest::flush_type.
This field is only used during a fil_io() call.
That function already takes IORequest as a parameter, so we had
better introduce for the rarely changing field.
buf_block_t::init(): Replaces buf_page_init().
buf_page_t::init(): Replaces buf_page_init_low().
buf_block_t::initialise(): Initialise many fields, but
keep the buf_page_t::state(). Both buf_pool_t::validate() and
buf_page_optimistic_get() requires that buf_page_t::in_file()
be protected atomically with buf_page_t::in_page_hash
and buf_page_t::in_LRU_list.
buf_page_optimistic_get(): Now that buf_block_t::mutex
no longer exists, we must check buf_page_t::io_fix()
after acquiring the buf_pool.page_hash lock, to detect
whether buf_page_init_for_read() has been initiated.
We will also check the io_fix() before acquiring hash_lock
in order to avoid unnecessary computation.
The field buf_block_t::modify_clock (protected by buf_block_t::lock)
allows buf_page_optimistic_get() to validate the block.
buf_page_t::real_size: Remove. It was only used while flushing
pages of page_compressed tables.
buf_page_encrypt(): Add an output parameter that allows us ot eliminate
buf_page_t::real_size. Replace a condition with debug assertion.
buf_page_should_punch_hole(): Remove.
buf_dblwr_t::add_to_batch(): Replaces buf_dblwr_add_to_batch().
Add the parameter size (to replace buf_page_t::real_size).
buf_dblwr_t::write_single_page(): Replaces buf_dblwr_write_single_page().
Add the parameter size (to replace buf_page_t::real_size).
fil_system_t::detach(): Replaces fil_space_detach().
Ensure that fil_validate() will not be violated even if
fil_system.mutex is released and reacquired.
fil_node_t::complete_io(): Renamed from fil_node_complete_io().
fil_node_t::close_to_free(): Replaces fil_node_close_to_free().
Avoid invoking fil_node_t::close() because fil_system.n_open
has already been decremented in fil_space_t::detach().
BUF_BLOCK_READY_FOR_USE: Remove. Directly use BUF_BLOCK_MEMORY.
BUF_BLOCK_ZIP_DIRTY: Remove. Directly use BUF_BLOCK_ZIP_PAGE,
and distinguish dirty pages by buf_page_t::oldest_modification().
BUF_BLOCK_POOL_WATCH: Remove. Use BUF_BLOCK_NOT_USED instead.
This state was only being used for buf_page_t that are in
buf_pool.watch.
buf_pool_t::watch[]: Remove pointer indirection.
buf_page_t::in_flush_list: Remove. It was set if and only if
buf_page_t::oldest_modification() is nonzero.
buf_page_decrypt_after_read(), buf_corrupt_page_release(),
buf_page_check_corrupt(): Change the const fil_space_t* parameter
to const fil_node_t& so that we can report the correct file name.
buf_page_monitor(): Declare as an ATTRIBUTE_COLD global function.
buf_page_io_complete(): Split to buf_page_read_complete() and
buf_page_write_complete().
buf_dblwr_t::in_use: Remove.
buf_dblwr_t::buf_block_array: Add IORequest::flush_t.
buf_dblwr_sync_datafiles(): Remove. It was a useless wrapper of
os_aio_wait_until_no_pending_writes().
buf_flush_write_complete(): Declare static, not global.
Add the parameter IORequest::flush_t.
buf_flush_freed_page(): Simplify the code.
recv_sys_t::flush_lru: Renamed from flush_type and changed to bool.
fil_read(), fil_write(): Replaced with direct use of fil_io().
fil_buffering_disabled(): Remove. Check srv_file_flush_method directly.
fil_mutex_enter_and_prepare_for_io(): Return the resolved
fil_space_t* to avoid a duplicated lookup in the caller.
fil_report_invalid_page_access(): Clean up the parameters.
fil_io(): Return fil_io_t, which comprises fil_node_t and error code.
Always invoke fil_space_t::acquire_for_io() and let either the
sync=true caller or fil_aio_callback() invoke
fil_space_t::release_for_io().
fil_aio_callback(): Rewrite to replace buf_page_io_complete().
fil_check_pending_operations(): Remove a parameter, and remove some
redundant lookups.
fil_node_close_to_free(): Wait for n_pending==0. Because we no longer
do an extra lookup of the tablespace between fil_io() and the
completion of the operation, we must give fil_node_t::complete_io() a
chance to decrement the counter.
fil_close_tablespace(): Remove unused parameter trx, and document
that this is only invoked during the error handling of IMPORT TABLESPACE.
row_import_discard_changes(): Merged with the only caller,
row_import_cleanup(). Do not lock up the data dictionary while
invoking fil_close_tablespace().
logs_empty_and_mark_files_at_shutdown(): Do not invoke
fil_close_all_files(), to avoid a !needs_flush assertion failure
on fil_node_t::close().
innodb_shutdown(): Invoke os_aio_free() before fil_close_all_files().
fil_close_all_files(): Invoke fil_flush_file_spaces()
to ensure proper durability.
thread_pool::unbind(): Fix a crash that would occur on Windows
after srv_thread_pool->disable_aio() and os_file_close().
This fix was submitted by Vladislav Vaintroub.
Thanks to Matthias Leich and Axel Schwenke for extensive testing,
Vladislav Vaintroub for helpful comments, and Eugene Kosov for a review.
Introduce a new ATTRIBUTE_NOINLINE to
ib::logger member functions, and add UNIV_UNLIKELY hints to callers.
Also, remove some crash reporting output. If needed, the
information will be available using debugging tools.
Furthermore, remove some fts_enable_diag_print output that included
indexed words in raw form. The code seemed to assume that words are
NUL-terminated byte strings. It is not clear whether a NUL terminator
is always guaranteed to be present. Also, UCS2 or UTF-16 strings would
typically contain many NUL bytes.
In my micro-benchmarks memcmp(4196) 3 times faster than old
implementation. Also, it's generally better to use as less
reinterpret_casts<> as possible.
buf_is_zeroes(): renamed from buf_page_is_zeroes() and
argument changed to span<> for convenience.
st_::span<T>::const_iterator: fixed
page_zip-verify_checksum(): make argument byte* instead of void*