Let us use implement a simple fixed-size allocator for the adaptive hash
index, insted of complicating mem_heap_t or mem_block_info_t.
MEM_HEAP_BTR_SEARCH: Remove.
mem_block_info_t::free_block(), mem_heap_free_block_free(): Remove.
mem_heap_free_top(), mem_heap_get_top(): Remove.
btr_sea::partition::spare: Replaces mem_block_info_t::free_block.
This keeps one spare block per adaptive hash index partition, to
process an insert.
We must not wait for buf_pool.mutex while holding
any btr_sea::partition::latch. That is why we cache one block for
future allocations. This is protected by a new
btr_sea::partition::blocks_mutex in order to relieve pressure on
btr_sea::partition::latch.
btr_sea::partition::prepare_insert(): Replaces
btr_search_check_free_space_in_heap().
btr_sea::partition::erase(): Replaces ha_search_and_delete_if_found().
btr_sea::partition::cleanup_after_erase(): Replaces the most part of
ha_delete_hash_node(). Unlike the previous implementation, we will
retain a spare block for prepare_insert().
This should reduce some contention on buf_pool.mutex.
btr_search.n_parts: Replaces btr_ahi_parts.
btr_search.enabled: Replaces btr_search_enabled. This must hold
whenever buf_block_t::index is set while a thread is holding a
btr_sea::partition::latch.
dict_index_t::search_info: Remove pointer indirection, and use
Atomic_relaxed or Atomic_counter for most fields.
btr_search_guess_on_hash(): Let the caller ensure that latch_mode is
BTR_MODIFY_LEAF or BTR_SEARCH_LEAF. Release btr_sea::partition::latch
before buffer-fixing the block. The page latch that we already acquired
is preventing buffer pool eviction. We must validate both
block->index and block->page.state while holding part.latch
in order to avoid race conditions with buffer page relocation
or buf_pool_t::resize().
btr_search_check_guess(): Remove the constant parameter
can_only_compare_to_cursor_rec=false.
ahi_node: Replaces ha_node_t.
This has been tested by running the regression test suite
with the adaptive hash index enabled:
./mtr --mysqld=--loose-innodb-adaptive-hash-index=ON
Reviewed by: Vladislav Lesin
- InnoDB fails to recover the full crc32 encrypted page from
doublewrite buffer. The reason is that buf_dblwr_t::recover()
fails to identify the space id from the page because the page has
been encrypted from FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION bytes.
Fix:
===
buf_dblwr_t::recover(): preserve any pages whose space_id
does not match a known tablespace. These could be encrypted pages
of tablespaces that had been created with
innodb_checksum_algorithm=full_crc32.
buf_page_t::read_complete(): If the page looks corrupted and the
tablespace is encrypted and in full_crc32 format, try to
restore the page from doublewrite buffer.
recv_dblwr_t::recover_encrypted_page(): Find the page which
has the same page number and try to decrypt the page using
space->crypt_data. After decryption, compare the space id.
Write the recovered page back to the file.
Note: Changes to the test innodb.stats_persistent
in commit e5c4c0842d (MDEV-35443)
are not merged, because the test scenario is impossible
due to commit e66928ab28 (MDEV-33462).
fil_space_t::create(): Instead of invoking the default fil_space_t
constructor on a zero-filled buffer, allocate an uninitialized buffer
and invoke an explicitly defined constructor on it. Also, specify
initializer expressions for all constant data members, so that all of them
will be initialized in the constructor.
fil_space_t::being_imported: Replaces part of fil_space_t::purpose.
fil_space_t::is_being_imported(), fil_space_t::is_temporary():
Replaces fil_space_t::purpose.
fil_space_t:🆔 Changed the type from ulint to uint32_t to reduce
incompatibility with later branches that include
commit ca501ffb04 (MDEV-26195).
fil_space_t::try_to_close(): Do not attempt to close files that are
in an I/O bound phase of ALTER TABLE…IMPORT TABLESPACE.
log_file_op, first_page_init: recv_spaces_t:
Use uint32_t for the tablespace id.
Reviewed by: Debarun Banerjee
The bug was that backup_file_op_fail() was using a static variable
initialized by arguments. This does not work as static variables
will only be initialized on the first call and not change for future
calls, even when the arguments changes.
Fixed by removing 'static'.
Partial commit of the greater MDEV-34348 scope.
MDEV-34348: MariaDB is violating clang-16 -Wcast-function-type-strict
Reviewed By:
============
Marko Mäkelä <marko.makela@mariadb.com>
Partial commit of the greater MDEV-34348 scope.
MDEV-34348: MariaDB is violating clang-16 -Wcast-function-type-strict
Change the type of my_hash_get_key to:
1) Return const
2) Change the context parameter to be const void*
Also fix casting in hash adjacent areas.
Reviewed By:
============
Marko Mäkelä <marko.makela@mariadb.com>
The HASH_ macros are unnecessarily obfuscating the logic,
so we had better replace them.
hash_cell_t::search(): Implement most of the HASH_DELETE logic,
for a subsequent insert or remove().
hash_cell_t::remove(): Remove an element.
hash_cell_t::find(): Implement the HASH_SEARCH logic.
xb_filter_hash_free(): Avoid any hash table lookup;
just traverse the hash bucket chains and free each element.
xb_register_filter_entry(): Search databases_hash only once.
rm_if_not_found(): Make use of find_filter_in_hashtable().
dict_sys_t::acquire_temporary_table(), dict_sys_t::find_table():
Define non-inline to avoid unnecessary code duplication.
dict_sys_t::add(dict_table_t *table), dict_table_rename_in_cache():
Look for duplicate while finding the insert position.
dict_table_change_id_in_cache(): Merged to the only caller
row_discard_tablespace().
hash_insert(): Helper function of dict_sys_t::resize().
fil_space_t::create(): Look for a duplicate (and crash if found)
when searching for the insert position.
lock_rec_discard(): Take the hash array cell as a parameter
to avoid a duplicated lookup.
lock_rec_free_all_from_discard_page(): Remove a parameter.
Reviewed by: Debarun Banerjee
In commit 1c55b845e0 (MDEV-32932) the
test mariabackup.innodb_ddl_on_intermediate_table was introduced but
disabled.
xb_load_single_table_tablespace(): Properly handle missing FTS_ tables.
backup_file_op_fail(): Properly handle FILE_DELETE records.
Removed 'purpose' parameter from os_file_create() and related functions.
Always use FILE_FLAG_OVERLAPPED when opening Windows files.
No performance regression was measured, nor there is any measurable
improvement.
The invariant of write-ahead logging is that before any change to a
page is written to the data file, the corresponding log record must
must first have been durably written.
In crash recovery, there were some sloppy checks for this. Let us
implement accurate checks and flag an inconsistency as a hard error,
so that we can avoid further corruption of a corrupted database.
For data extraction from the corrupted database, innodb_force_recovery
can be used.
Before recovery is reading any data pages or invoking
buf_dblwr_t::recover() to recover torn pages from the
doublewrite buffer, InnoDB will have parsed the log until the
final LSN and updated log_sys.lsn to that. So, we can rely on
log_sys.lsn at all times. The doublewrite buffer recovery has been
refactored in such a way that the recv_sys.dblwr.pages may be consulted
while discovering files and their page sizes, but nothing will be
written back to data files before buf_dblwr_t::recover() is invoked.
recv_max_page_lsn, recv_lsn_checks_on: Remove.
recv_sys_t::validate_checkpoint(): Validate the write-ahead-logging
condition at the end of the recovery.
recv_dblwr_t::validate_page(): Keep track of the maximum LSN
(if we are checking a non-doublewrite copy of a page) but
do not complain LSN being in the future. The doublewrite buffer
is a special case, because it will be read early during recovery.
Besides, starting with commit 762bcb81b5
the dblwr=true copies of pages may legitimately be "too new".
recv_dblwr_t::find_page(): Find a valid page with the smallest
FIL_PAGE_LSN that is in the valid range for recovery.
recv_dblwr_t::restore_first_page(): Replaced by find_page().
Only buf_dblwr_t::recover() will write to data files.
buf_dblwr_t::recover(): Simplify the message output. Do attempt
doublewrite recovery on user page read error. Ignore doublewrite
pages whose FIL_PAGE_LSN is outside the usable bounds. Previously,
we could wrongly recover a too new page from the doublewrite buffer.
It is unlikely that this could have lead to an actual error.
Write back all recovered pages from the doublewrite buffer here,
including for the first page of any tablespace.
buf_page_is_corrupted(): Distinguish the return values
CORRUPTED_FUTURE_LSN and CORRUPTED_OTHER.
buf_page_check_corrupt(): Return the error code DB_CORRUPTION
in case the LSN is in the future.
Datafile::read_first_page_flags(): Split from read_first_page().
Take a copy of the first page as a parameter.
recv_sys_t::free_corrupted_page(): Take the file as a parameter
and return whether a message was displayed. This avoids some duplicated
and incomplete error messages.
buf_page_t::read_complete(): Remove some redundant output and always
display the name of the corrupted file. Never return DB_FAIL;
use it only in internal error handling.
IORequest::read_complete(): Assume that buf_page_t::read_complete()
will have reported any error.
fil_space_t::set_corrupted(): Return whether this is the first time
the tablespace had been flagged as corrupted.
Datafile::validate_first_page(), fil_node_open_file_low(),
fil_node_open_file(), fil_space_t::read_page0(),
fil_node_t::read_page0(): Add a parameter for a copy of the
first page, and a parameter to indicate whether the FIL_PAGE_LSN
check should be suppressed. Before buf_dblwr_t::recover() is
invoked, we cannot validate the FIL_PAGE_LSN, but we can trust the
FSP_SPACE_FLAGS and the tablespace ID that may be present in a
potentially too new copy of a page.
Reviewed by: Debarun Banerjee
The invariant of write-ahead logging is that before any change to a
page is written to the data file, the corresponding log record must
must first have been durably written.
On crash recovery, there were some sloppy checks for this. Let us
implement accurate checks and flag an inconsistency as a hard error,
so that we can avoid further corruption of a corrupted database.
For data extraction from the corrupted database, innodb_force_recovery
can be used.
Before recovery is reading any data pages or invoking
buf_dblwr_t::recover() to recover torn pages from the
doublewrite buffer, InnoDB will have parsed the log until the
final LSN and updated log_sys.lsn to that. So, we can rely on
log_sys.lsn at all times. The doublewrite buffer recovery has been
refactored in such a way that the recv_sys.dblwr.pages may be consulted
while discovering files and their page sizes, but nothing will be
written back to data files before buf_dblwr_t::recover() is invoked.
A section of the test mariabackup.innodb_redo_overwrite
that is parsing some mariadb-backup --backup output has
been removed, because that output "redo log block is overwritten"
would often be missing in a Microsoft Windows environment
as a result of these changes.
recv_max_page_lsn, recv_lsn_checks_on: Remove.
recv_sys_t::validate_checkpoint(): Validate the write-ahead-logging
condition at the end of the recovery.
recv_dblwr_t::validate_page(): Keep track of the maximum LSN
(if we are checking a non-doublewrite copy of a page) but
do not complain LSN being in the future. The doublewrite buffer
is a special case, because it will be read early during recovery.
Besides, starting with commit 762bcb81b5
the dblwr=true copies of pages may legitimately be "too new".
recv_dblwr_t::find_page(): Find a valid page with the smallest
FIL_PAGE_LSN that is in the valid range for recovery.
recv_dblwr_t::restore_first_page(): Replaced by find_page().
Only buf_dblwr_t::recover() will write to data files.
buf_dblwr_t::recover(): Simplify the message output. Do attempt
doublewrite recovery on user page read error. Ignore doublewrite
pages whose FIL_PAGE_LSN is outside the usable bounds. Previously,
we could wrongly recover a too new page from the doublewrite buffer.
It is unlikely that this could have lead to an actual error.
Write back all recovered pages from the doublewrite buffer here,
including for the first page of any tablespace.
buf_page_is_corrupted(): Distinguish the return values
CORRUPTED_FUTURE_LSN and CORRUPTED_OTHER.
buf_page_check_corrupt(): Return the error code DB_CORRUPTION
in case the LSN is in the future.
Datafile::read_first_page(): Handle FSP_SPACE_FLAGS=0xffffffff
in the same way on both 32-bit and 64-bit architectures.
Datafile::read_first_page_flags(): Split from read_first_page().
Take a copy of the first page as a parameter.
recv_sys_t::free_corrupted_page(): Take the file as a parameter
and return whether a message was displayed. This avoids some duplicated
and incomplete error messages.
buf_page_t::read_complete(): Remove some redundant output and always
display the name of the corrupted file. Never return DB_FAIL;
use it only in internal error handling.
IORequest::read_complete(): Assume that buf_page_t::read_complete()
will have reported any error.
fil_space_t::set_corrupted(): Return whether this is the first time
the tablespace had been flagged as corrupted.
Datafile::validate_first_page(), fil_node_open_file_low(),
fil_node_open_file(), fil_space_t::read_page0(),
fil_node_t::read_page0(): Add a parameter for a copy of the
first page, and a parameter to indicate whether the FIL_PAGE_LSN
check should be suppressed. Before buf_dblwr_t::recover() is
invoked, we cannot validate the FIL_PAGE_LSN, but we can trust the
FSP_SPACE_FLAGS and the tablespace ID that may be present in a
potentially too new copy of a page.
Reviewed by: Debarun Banerjee
In mariadb-backup --backup, we only have to invoke the undo_space_trunc
and log_file_op callbacks as well as validate the mini-transaction
checksums. There is absolutely no need to access recv_sys.pages or
recv_spaces, or to allocate a decrypt_buf in case of innodb_encrypt_log=ON.
This is what the new mode recv_sys_t::store::BACKUP will do.
In the skip_the_rest: loop, the main thing is to process all FILE_ records
until the end of the log is reached. Additionally, we must process
INIT_PAGE and FREE_PAGE records in the same way as they would be
during storing == YES.
This was measured to reduce the CPU time between the messages
"InnoDB: Multi-batch recovery needed at LSN" and
"InnoDB: End of log at LSN"
by some 20%.
recv_sys_t::store: A ternary enumeration that specifies how records
should be stored: NO, BACKUP, or YES.
recv_sys_t::parse(), recv_sys_t::parse_mtr(), recv_sys_t::parse_pmem():
Replace template<bool store> with template<store storing>.
store_freed_or_init_rec(): Simplify some logic. We can look up also
the system tablespace.
Reviewed by: Debarun Banerjee
When using the default innodb_log_buffer_size=2m, mariadb-backup --backup
would spend a lot of time re-reading and re-parsing the log. For reads,
it would be beneficial to memory-map the entire ib_logfile0 to the
address space (typically 48 bits or 256 TiB) and read it from there,
both during --backup and --prepare.
We will introduce the Boolean read-only parameter innodb_log_file_mmap
that will be OFF by default on most platforms, to avoid aggressive
read-ahead of the entire ib_logfile0 in when only a tiny portion would be
accessed. On Linux and FreeBSD the default is innodb_log_file_mmap=ON,
because those platforms define a specific mmap(2) option for enabling
such read-ahead and therefore it can be assumed that the default would
be on-demand paging. This parameter will only have impact on the initial
InnoDB startup and recovery. Any writes to the log will use regular I/O,
except when the ib_logfile0 is stored in a specially configured file system
that is backed by persistent memory (Linux "mount -o dax").
We also experimented with allowing writes of the ib_logfile0 via a
memory mapping and decided against it. A fundamental problem would be
unnecessary read-before-write in case of a major page fault, that is,
when a new, not yet cached, virtual memory page in the circular
ib_logfile0 is being written to. There appears to be no way to tell
the operating system that we do not care about the previous contents of
the page, or that the page fault handler should just zero it out.
Many references to HAVE_PMEM have been replaced with references to
HAVE_INNODB_MMAP.
The predicate log_sys.is_pmem() has been replaced with
log_sys.is_mmap() && !log_sys.is_opened().
Memory-mapped regular files differ from MAP_SYNC (PMEM) mappings in the
way that an open file handle to ib_logfile0 will be retained. In both
code paths, log_sys.is_mmap() will hold. Holding a file handle open will
allow log_t::clear_mmap() to disable the interface with fewer operations.
It should be noted that ever since
commit 685d958e38 (MDEV-14425)
most 64-bit Linux platforms on our CI platforms
(s390x a.k.a. IBM System Z being a notable exception) read and write
/dev/shm/*/ib_logfile0 via a memory mapping, pretending that it is
persistent memory (mount -o dax). So, the memory mapping based log
parsing that this change is enabling by default on Linux and FreeBSD
has already been extensively tested on Linux.
::log_mmap(): If a log cannot be opened as PMEM and the desired access
is read-only, try to open a read-only memory mapping.
xtrabackup_copy_mmap_snippet(), xtrabackup_copy_mmap_logfile():
Copy the InnoDB log in mariadb-backup --backup from a memory
mapped file.
In mariadb-backup --backup there are multiple mechanisms for ensuring that
a sufficient amount of the InnoDB write-ahead log (ib_logfile0) is being
copied at the end of the backup. The backup needs to include the latest
committed transaction. While further transaction commits are blocked by
BACKUP STAGE BLOCK_COMMIT, ongoing transactions may modify the database
contents and write log records. We were unnecessarily copying such log,
which would also cause further effort of rolling back incomplete
transactions after the backup is restored.
backup_wait_for_lsn(): Declare as static, and refactor some code
to separate functions backup_wait_for_lsn_low() and
backup_wait_timeout().
backup_wait_for_commit_lsn(): A new function to determine the current
LSN (within BACKUP STAGE BLOCK_COMMIT) and to wait for the log to be
copied until that. Invoked by BackupStages::stage_block_commit().
xtrabackup_backup_func(): Remove a condition that had already been
checked by a caller of backup_wait_timeout().
server_lsn_after_lock: Declare as a local variable in
BackupStages::stage_block_ddl().
log_copying_thread(), io_watching_thread(): Use metadata_last_lsn
instead of metadata_to_lsn as the stop condition.
BackupStages::stage_block_commit(): Ensure that the log tables
(in particular, mysql.general_log) will have been copied before
the BACKUP STAGE BLOCK_COMMIT is being followed by any further
SQL statements.
Reviewed by: Debarun Banerjee
Tested by: Matthias Leich