The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.
We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.
This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.
We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.
Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.
buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.
btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.
btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().
FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.
trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
trx_sys_create_sys_pages(): Merged with trx_sysf_create().
dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.
dir_pathname(): Replaces os_file_make_new_pathname().
row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.
btr_decryption_failed(): Report decryption failures
dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.
dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.
dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.
dict_table_is_corrupted(): Remove.
dict_index_t::is_btree(): Determine if the index is a valid B-tree.
BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures, but error codes being returned.
buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().
fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.
btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.
opt_search_plan_for_table(): Simplify the code.
row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.
row_undo_mod_upd_exist_sec(): Simplify the code.
row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().
fut_get_ptr(): Replace with buf_page_get_gen() calls.
buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.
buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.
fseg_page_is_allocated(): Replaces fseg_page_is_free().
fts_drop_common_tables(): Return an error if the transaction
was rolled back.
fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.
fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.
Clean up mtr_t::page_lock()
buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.
buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.
recv_recover_page(): Return whether the operation succeeded.
recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.
This is based on commit 20ae4816bb
with some adjustments for MDEV-12353.
row_ins_sec_index_entry_low(): If a separate mini-transaction is
needed to adjust the minimum bounding rectangle (MBR) in the parent
page, we must disable redo logging if the table is a temporary table.
For temporary tables, no log is supposed to be written, because
the temporary tablespace will be reinitialized on server restart.
rtr_update_mbr_field(), rtr_merge_and_update_mbr(): Changed the return
type to void and removed unreachable code. In older versions, these
used to return a different value for temporary tables.
page_id_t: Add constexpr to most member functions.
mtr_t::log_write(): Catch log writes to invalid tablespaces
so that the test case would crash without the fix to
row_ins_sec_index_entry_low().
btr_insert_into_right_sibling(): Inherit any gap lock from the
left sibling to the right sibling before inserting the record
to the right sibling and updating the node pointer(s).
lock_update_node_pointer(): Update locks in case a node pointer
will move.
Based on mysql/mysql-server@c7d93c274f
This follows up the previous fix in
commit c3c53926c4 (MDEV-26554).
ha_innobase::delete_table(): Work around the insufficient
metadata locking (MDL) during DML operations by acquiring exclusive
InnoDB table locks on all child tables. Previously, this was only
done on TRUNCATE and ALTER.
ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke
lock_update_delete() during change buffer operations.
The revised trx_t::commit(std::vector<pfs_os_file_t>&) will
hold exclusive lock_sys.latch while invoking fil_delete_tablespace(),
which in turn may invoke ibuf_delete_rec().
dict_index_t::has_locking(): A new predicate, replacing the dummy
!dict_table_is_locking_disabled(index->table). Used for skipping lock
operations during ibuf_delete_rec().
trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks
and remove the table from the cache while holding exclusive
lock_sys.latch.
trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds.
trx_t::commit(): Reset dict_operation before invoking commit_in_memory()
via commit_persist().
lock_release_on_drop(): Release locks while lock_sys.latch is
exclusively locked.
lock_table(): Add a parameter for a pointer to the table.
We must not dereference the table before a lock_sys.latch has
been acquired. If the pointer to the table does not match the table
at that point, the table is invalid and DB_DEADLOCK will be returned.
row_ins_foreign_check_on_constraint(): Improve the checks.
Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed
before commit c5fd9aa562 (MDEV-25919).
row_upd_check_references_constraints(),
wsrep_row_upd_check_foreign_constraints(): Simplify checks.
This also fixes MDEV-20198: Instant ALTER TABLE is not crash safe
InnoDB dictionary recovery wrongly used the READ UNCOMMITTED isolation
level, causing some mismatch. For example, if a table was renamed or
replaced in a transaction, according to READ UNCOMMITTED the table might
not exist at all.
We implement READ COMMITTED isolation level for accessing the dictionary
tables SYS_TABLES, SYS_COLUMNS, SYS_INDEXES, SYS_FIELDS, SYS_VIRTUAL,
SYS_FOREIGN, SYS_FOREIGN_COLS. For most of these tables, no secondary
index exists. For the secondary indexes (on SYS_TABLES.ID,
SYS_FOREIGN.FOR_NAME, SYS_FOREIGN.REF_NAME), we will always look up
the primary key in the clustered index and check if the record actually
is a committed version.
dict_check_sys_tables(): Recover tablespaces also from delete-marked
committed records, so that if a matching .ibd file exists, it will
be removed by fil_delete_tablespace() when the committed delete-marked
SYS_INDEXES record of the clustered index is purged
in row_purge_remove_clust_if_poss_low().
fil_ibd_open(): Change the Boolean parameter "validate" to a ternary
one, to suppress error messages when the file might not exist.
It is possible that a .ibd file was deleted and the server shut down
before the SYS_INDEXES and SYS_TABLES records were purged. Hence, if
dict_check_sys_tables() finds a committed delete-marked record,
we must not complain if the tablespace file is not found.
On Windows, we msut treat ERROR_PATH_NOT_FOUND (directory not found)
in the same way as ERROR_FILE_NOT_FOUND. This fixes a few failures where
a previous test successfully executed DROP DATABASE (and deleted all
files and the directory), but a committed delete-marked SYS_TABLES
record had not been purged before server restart.
dict_getnext_system_low(): Do not filter out delete-marked records.
dict_startscan_system(), dict_getnext_system(): Do filter out
delete-marked records, for accessing the INFORMATION_SCHEMA tables.
dict_sys_tables_rec_read(): Return the DB_TRX_ID of the committed
version of the record. This is needed in dict_load_table_low().
dict_load_foreign_cols(), dict_load_foreign(): Add a parameter for
the current transaction identifier. In some DDL operations, the
FOREIGN KEY constraints are being loaded from the data dictionary
before the DDL transaction has been committed. For SYS_FOREIGN
and SYS_FOREIGN_COLS, we must implement the special case of
READ COMMITTED that the changes of the uncommitted current transaction
are visible.
dict_load_foreign(): Validate the table name. We could find a
SYS_FOREIGN.ID via a committed delete-marked secondary index record
that does not match the REF_NAME or FOR_NAME of the secondary index record.
dict_load_index_low(): Optionally take the table as a parameter,
so that table->def_trx_id can be updated in case of a
committed delete-marked SYS_INDEXES record corresponding
to DROP INDEX, but not corresponding to an index stub of ADD INDEX.
dict_load_indexes(): Do not update table->def_trx_id
in case of delete-marked records.
rec_is_metadata(), rec_offs_make_valid(), rec_get_offsets_func(),
row_build_low(): Relax some assertions. We may now have
!index->is_instant() even if a metadata record is present in the index.
Previously, the recovery of instant ADD/DROP COLUMN assumed
that READ UNCOMMITTED of the data dictionary will be performed.
Now, we will have a READ COMMITTED copy of the data dictionary
cache, and a READ UNCOMMITTED copy of the metadata record.
btr_page_reorganize_low(): Correctly update the FIL_PAGE_TYPE
when rolling back an instant ADD/DROP COLUMN operation.
row_rec_to_index_entry_impl(): Relax some assertions,
and disallow accessing "extra" fields. This fixes the recovery
of a crash during an instant ADD COLUMN after a successful
instant DROP COLUMN, in the test innodb.instant_alter_crash.
Tested by: Matthias Leich
buf_page_t::frame: Moved from buf_block_t::frame.
All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED
pages will have frame=nullptr, while all 'fat' buf_block_t
will have a non-null frame pointing to aligned innodb_page_size bytes.
This eliminates the need for separate states for
BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE.
buf_page_t:🔒 Moved from buf_block_t::lock. That is, all block
descriptors will have a page latch. The IO_PIN state that was used
for discarding or creating the uncompressed page frame of a
ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix
and page X-latch.
page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status
of buf_page_t with a single std::atomic<uint32_t>. All modifications
will use store(), fetch_add(), fetch_sub(). This space was previously
wasted to alignment on 64-bit systems. We will use the following encoding
that combines a state (partly read-fix or write-fix) and a buffer-fix
count:
buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED)
buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY)
buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH)
buf_page_t::FREED=3 + fix: pages marked as freed in the file
buf_page_t::UNFIXED=1U<<29 + fix: normal pages
buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge
buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite)
buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched)
buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched)
buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf
buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite)
buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to
UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch.
buf_page_t::read_complete(): Renamed from buf_page_read_complete().
Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch.
buf_page_t::can_relocate(): If the page latch is being held or waited for,
or the block is buffer-fixed or io-fixed, return false. (The condition
on the page latch is new.)
Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we
will acquire the page latch before fix(), and unfix() before unlocking.
buf_page_t::flush(): Replaces buf_flush_page(). Optimize the
handling of FREED pages.
buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held
by the caller.
buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates.
buf_page_get_low(): Ignore guesses that are read-fixed because they
may not yet be registered in buf_pool.page_hash and buf_pool.LRU.
buf_page_optimistic_get(): Acquire latch before buffer-fixing.
buf_page_make_young(): Leave read-fixed blocks alone, because they
might not be registered in buf_pool.LRU yet.
recv_sys_t::recover_deferred(), recv_sys_t::recover_low():
Possibly fix MDEV-26326, by holding a page X-latch instead of
only buffer-fixing the page.
This implements memory transaction support for:
* Intel Restricted Transactional Memory (RTM), also known as TSX-NI
(Transactional Synchronization Extensions New Instructions)
* POWER v2.09 Hardware Trace Monitor (HTM) on GNU/Linux
transactional_lock_guard, transactional_shared_lock_guard:
RAII lock guards that try to elide the lock acquisition
when transactional memory is available.
buf_pool.page_hash: Try to elide latches whenever feasible.
Related to the InnoDB change buffer and ROW_FORMAT=COMPRESSED
tables, this is not always possible.
In buf_page_get_low(), memory transactions only work reasonably
well for validating a guessed block address.
TMLockGuard, TMLockTrxGuard, TMLockMutexGuard: RAII lock guards
that try to elide lock_sys.latch and related latches.
When computing statistics, let us play it safe and check whether
an insert into an empty table is in progress, once we have acquired
the root page latch. If yes, let us pretend that the table is empty,
just like MVCC reads would do.
It is unlikely that this race condition could lead into any crashes
before MDEV-24621, because the index tree structure should be protected
by the page latches. But, at least we can avoid some busy work and
return earlier.
As part of this, some code that is only used for statistics calculation
is being moved into static functions in that compilation unit.
btr_root_block_get(): Check for index->page == FIL_NULL.
btr_root_get(): Declare static. Other callers can invoke
btr_root_block_get() directly.
btr_get_size(): Remove conditions that are checked in
btr_root_block_get().
This is a complete rewrite of DROP TABLE, also as part of other DDL,
such as ALTER TABLE, CREATE TABLE...SELECT, TRUNCATE TABLE.
The background DROP TABLE queue hack is removed.
If a transaction needs to drop and create a table by the same name
(like TRUNCATE TABLE does), it must first rename the table to an
internal #sql-ib name. No committed version of the data dictionary
will include any #sql-ib tables, because whenever a transaction
renames a table to a #sql-ib name, it will also drop that table.
Either the rename will be rolled back, or the drop will be committed.
Data files will be unlinked after the transaction has been committed
and a FILE_RENAME record has been durably written. The file will
actually be deleted when the detached file handle returned by
fil_delete_tablespace() will be closed, after the latches have been
released. It is possible that a purge of the delete of the SYS_INDEXES
record for the clustered index will execute fil_delete_tablespace()
concurrently with the DDL transaction. In that case, the thread that
arrives later will wait for the other thread to finish.
HTON_TRUNCATE_REQUIRES_EXCLUSIVE_USE: A new handler flag.
ha_innobase::truncate() now requires that all other references to
the table be released in advance. This was implemented by Monty.
ha_innobase::delete_table(): If CREATE TABLE..SELECT is detected,
we will "hijack" the current transaction, drop the table in
the current transaction and commit the current transaction.
This essentially fixes MDEV-21602. There is a FIXME comment about
making the check less failure-prone.
ha_innobase::truncate(), ha_innobase::delete_table():
Implement a fast path for temporary tables. We will no longer allow
temporary tables to use the adaptive hash index.
dict_table_t::mdl_name: The original table name for the purpose of
acquiring MDL in purge, to prevent a race condition between a
DDL transaction that is dropping a table, and purge processing
undo log records of DML that had executed before the DDL operation.
For #sql-backup- tables during ALTER TABLE...ALGORITHM=COPY, the
dict_table_t::mdl_name will differ from dict_table_t::name.
dict_table_t::parse_name(): Use mdl_name instead of name.
dict_table_rename_in_cache(): Update mdl_name.
For the internal FTS_ tables of FULLTEXT INDEX, purge would
acquire MDL on the FTS_ table name, but not on the main table,
and therefore it would be able to run concurrently with a
DDL transaction that is dropping the table. Previously, the
DROP TABLE queue hack prevented a race between purge and DDL.
For now, we introduce purge_sys.stop_FTS() to prevent purge from
opening any table, while a DDL transaction that may drop FTS_
tables is in progress. The function fts_lock_table(), which will
be invoked before the dictionary is locked, will wait for
purge to release any table handles.
trx_t::drop_table_statistics(): Drop statistics for the table.
This replaces dict_stats_drop_index(). We will drop or rename
persistent statistics atomically as part of DDL transactions.
On lock conflict for dropping statistics, we will fail instantly
with DB_LOCK_WAIT_TIMEOUT, because we will be holding the
exclusive data dictionary latch.
trx_t::commit_cleanup(): Separated from trx_t::commit_in_memory().
Relax an assertion around fts_commit() and allow DB_LOCK_WAIT_TIMEOUT
in addition to DB_DUPLICATE_KEY. The call to fts_commit() is
entirely misplaced here and may obviously break the consistency
of transactions that affect FULLTEXT INDEX. It needs to be fixed
separately.
dict_table_t::n_foreign_key_checks_running: Remove (MDEV-21175).
The counter was a work-around for missing meta-data locking (MDL)
on the SQL layer, and not really needed in MariaDB.
ER_TABLE_IN_FK_CHECK: Replaced with ER_UNUSED_28.
HA_ERR_TABLE_IN_FK_CHECK: Remove.
row_ins_check_foreign_constraints(): Do not acquire
dict_sys.latch either. The SQL-layer MDL will protect us.
This was reviewed by Thirunarayanan Balathandayuthapani
and tested by Matthias Leich.
Back in 2006 or 2007, when MySQL AB and Innobase Oy existed as
separately controlled entities (Innobase had been acquired by
Oracle Corporation), MySQL 5.1 introduced a storage engine plugin
interface and Oracle made use of it by distributing a separate
InnoDB Plugin, which would contain some more bug fixes and
improvements, compared to the version of InnoDB that was statically
linked with the mysqld server that was distributed by MySQL AB.
The built-in InnoDB would export global symbols, which would clash
with the symbols of the dynamic InnoDB Plugin (which was supposed
to override the built-in one when present).
The solution to this problem was to declare all global symbols with
UNIV_INTERN, so that they would get the GCC function attribute that
specifies hidden visibility.
Later, in MariaDB Server, something based on Percona XtraDB (a fork of
MySQL InnoDB) became the statically linked implementation, and something
closer to MySQL InnoDB was available as a dynamic plugin. Starting with
version 10.2, MariaDB Server includes only one InnoDB implementation,
and hence any reason to have the UNIV_INTERN definition was lost.
btr_get_size_and_reserved(): Move to the same compilation unit with
the only caller.
innodb_set_buf_pool_size(): Remove. Modify innobase_buffer_pool_size
directly.
fil_crypt_calculate_checksum(): Merge to the only caller.
ha_innobase::innobase_reset_autoinc(): Merge to the only caller.
thd_query_start_micro(): Remove. Call thd_start_utime() directly.
dict_drop_index_tree(): Even if SYS_INDEXES.PAGE contains the
special value FIL_NULL, the tablespace identified by SYS_INDEXES.SPACE
may exist and may need to be dropped. This would definitely be the case
if the server had been killed right after a FILE_CREATE record was
persistently written during CREATE TABLE, but before the transaction
was committed.
btr_free_if_exists(): Simplify the interface, to avoid repeated
tablespace lookup.
One more scenario is known to be broken: If the server is killed
during DROP TABLE (or table-rebuilding ALTER TABLE) right after a
FILE_DELETE record has been persistently written but before the
file was deleted, then we could end up recovering no tablespace
at all, and failing to delete the file, in either of fil_name_process()
or dict_drop_index_tree().
Thanks to Elena Stepanova for providing "rr replay" and data directories
of these scenarios.
btr_free_if_exists(): Always use the BUF_GET_POSSIBLY_FREED mode
when accessing pages, because due to MDEV-24589 the function
fil_space_t::set_stopping(true) can be called at any time during
the execution of this function.
mtr_t::m_freeing_tree: New data member for debugging purposes.
buf_page_get_low(): Assert that the BUF_GET mode is not being used
anywhere during the execution of btr_free_if_exists().
In all code related to freeing or allocating pages, we will add some
robustness, by making more use of BUF_GET_POSSIBLY_FREED and by
reporting an error instead of crashing in some cases of corruption.
Between btr_pcur_store_position() and btr_pcur_restore_position()
it is possible that purge empties a table and enlarges
index->n_core_fields and index->n_core_null_bytes.
Therefore, we must cache index->n_core_fields in
btr_pcur_t::old_n_core_fields so that btr_pcur_t::old_rec can be
parsed correctly.
Unfortunately, this is a huge change, because we will replace
"bool leaf" parameters with "ulint n_core"
(passing index->n_core_fields, or 0 for non-leaf pages).
For special cases where we know that index->is_instant() cannot hold,
we may also pass index->n_fields.
- This is caused by merge commit a26e7a3726.
InnoDB fails to fetch the next index field when there is a externally
stored column length check involved.
In btr_index_rec_validate(), externally stored column
check is missing while matching the length of the field
with the length of the field data stored in record.
Fetch the length of the externally stored part and compare it
with the fixed field length.
We replace the old lock_sys.mutex (which was renamed to lock_sys.latch)
with a combination of a global lock_sys.latch and table or page hash lock
mutexes.
The global lock_sys.latch can be acquired in exclusive mode, or
it can be acquired in shared mode and another mutex will be acquired
to protect the locks for a particular page or a table.
This is inspired by
mysql/mysql-server@1d259b87a6
but the optimization of lock_release() will be done in the next commit.
Also, we will interleave mutexes with the hash table elements, similar
to how buf_pool.page_hash was optimized
in commit 5155a300fa (MDEV-22871).
dict_table_t::autoinc_trx: Use Atomic_relaxed.
dict_table_t::autoinc_mutex: Use srw_mutex in order to reduce the
memory footprint. On 64-bit Linux or OpenBSD, both this and the new
dict_table_t::lock_mutex should be 32 bits and be stored in the same
64-bit word. On Microsoft Windows, the underlying SRWLOCK is 32 or 64
bits, and on other systems, sizeof(pthread_mutex_t) can be much larger.
ib_lock_t::trx_locks, trx_lock_t::trx_locks: Document the new rules.
Writers must assert lock_sys.is_writer() || trx->mutex_is_owner().
LockGuard: A RAII wrapper for acquiring a page hash table lock.
LockGGuard: Like LockGuard, but when Galera Write-Set Replication
is enabled, we must acquire all shards, for updating arbitrary trx_locks.
LockMultiGuard: A RAII wrapper for acquiring two page hash table locks.
lock_rec_create_wsrep(), lock_table_create_wsrep(): Special
Galera conflict resolution in non-inlined functions in order
to keep the common code paths shorter.
lock_sys_t::prdt_page_free_from_discard(): Refactored from
lock_prdt_page_free_from_discard() and
lock_rec_free_all_from_discard_page().
trx_t::commit_tables(): Replaces trx_update_mod_tables_timestamp().
lock_release(): Let trx_t::commit_tables() invalidate the query cache
for those tables that were actually modified by the transaction.
Merge lock_check_dict_lock() to lock_release().
We must never release lock_sys.latch while holding any
lock_sys_t::hash_latch. Failure to do that could lead to
memory corruption if the buffer pool is resized between
the time lock_sys.latch is released and the hash_latch is released.
For now, we will acquire the lock_sys.latch only in exclusive mode,
that is, use it as a mutex.
This is preparation for the next commit where we will introduce
a less intrusive alternative, combining a shared lock_sys.latch
with dict_table_t::lock_mutex or a mutex embedded in
lock_sys.rec_hash, lock_sys.prdt_hash, or lock_sys.prdt_page_hash.
This failure is caused by commit 43ca6059ca
(MDEV-24720). InnoDB fails to remove the ahi entries
during rollback of bulk insert operation. InnoDB should
remove the AHI entries of root page before reinitialising it.
Reviewed-by: Marko Mäkelä
InnoDB fails to remove the ahi entries during rollback
of bulk insert operation. InnoDB throws the error when
validates the ahi hash tables. InnoDB should remove
the ahi entries while freeing the segment only during
bulk index rollback operation.
Reviewed-by: Marko Mäkelä
We implement an idea that was suggested by Michael 'Monty' Widenius
in October 2017: When InnoDB is inserting into an empty table or partition,
we can write a single undo log record TRX_UNDO_EMPTY, which will cause
ROLLBACK to clear the table.
For this to work, the insert into an empty table or partition must be
covered by an exclusive table lock that will be held until the transaction
has been committed or rolled back, or the INSERT operation has been
rolled back (and the table is empty again), in lock_table_x_unlock().
Clustered index records that are covered by the TRX_UNDO_EMPTY record
will carry DB_TRX_ID=0 and DB_ROLL_PTR=1<<55, and thus they cannot
be distinguished from what MDEV-12288 leaves behind after purging the
history of row-logged operations.
Concurrent non-locking reads must be adjusted: If the read view was
created before the INSERT into an empty table, then we must continue
to imagine that the table is empty, and not try to read any records.
If the read view was created after the INSERT was committed, then
all records must be visible normally. To implement this, we introduce
the field dict_table_t::bulk_trx_id.
This special handling only applies to the very first INSERT statement
of a transaction for the empty table or partition. If a subsequent
statement in the transaction is modifying the initially empty table again,
we must enable row-level undo logging, so that we will be able to
roll back to the start of the statement in case of an error (such as
duplicate key).
INSERT IGNORE will continue to use row-level logging and locking, because
implementing it would require the ability to roll back the latest row.
Since the undo log that we write only allows us to roll back the entire
statement, we cannot support INSERT IGNORE. We will introduce a
handler::extra() parameter HA_EXTRA_IGNORE_INSERT to indicate to storage
engines that INSERT IGNORE is being executed.
In many test cases, we add an extra record to the table, so that during
the 'interesting' part of the test, row-level locking and logging will
be used.
Replicas will continue to use row-level logging and locking until
MDEV-24622 has been addressed. Likewise, this optimization will be
disabled in Galera cluster until MDEV-24623 enables it.
dict_table_t::bulk_trx_id: The latest active or committed transaction
that initiated an insert into an empty table or partition.
Protected by exclusive table lock and a clustered index leaf page latch.
ins_node_t::bulk_insert: Whether bulk insert was initiated.
trx_t::mod_tables: Use C++11 style accessors (emplace instead of insert).
Unlike earlier, this collection will cover also temporary tables.
trx_mod_table_time_t: Add start_bulk_insert(), end_bulk_insert(),
is_bulk_insert(), was_bulk_insert().
trx_undo_report_row_operation(): Before accessing any undo log pages,
invoke trx->mod_tables.emplace() in order to determine whether undo
logging was disabled, or whether this is the first INSERT and we are
supposed to write a TRX_UNDO_EMPTY record.
row_ins_clust_index_entry_low(): If we are inserting into an empty
clustered index leaf page, set the ins_node_t::bulk_insert flag for
the subsequent trx_undo_report_row_operation() call.
lock_rec_insert_check_and_lock(), lock_prdt_insert_check_and_lock():
Remove the redundant parameter 'flags' that can be checked in the caller.
btr_cur_ins_lock_and_undo(): Simplify the logic. Correctly write
DB_TRX_ID,DB_ROLL_PTR after invoking trx_undo_report_row_operation().
trx_mark_sql_stat_end(), ha_innobase::extra(HA_EXTRA_IGNORE_INSERT),
ha_innobase::external_lock(): Invoke trx_t::end_bulk_insert() so that
the next statement will not be covered by table-level undo logging.
ReadView::changes_visible(trx_id_t) const: New accessor for the case
where the trx_id_t is not read from a potentially corrupted index page
but directly from the memory. In this case, we can skip a sanity check.
row_sel(), row_sel_try_search_shortcut(), row_search_mvcc():
row_sel_try_search_shortcut_for_mysql(),
row_merge_read_clustered_index(): Check dict_table_t::bulk_trx_id.
row_sel_clust_sees(): Replaces lock_clust_rec_cons_read_sees().
lock_sec_rec_cons_read_sees(): Replaced with lower-level code.
btr_root_page_init(): Refactored from btr_create().
dict_index_t::clear(), dict_table_t::clear(): Empty an index or table,
for the ROLLBACK of an INSERT operation.
ROW_T_EMPTY, ROW_OP_EMPTY: Note a concurrent ROLLBACK of an INSERT
into an empty table.
This is joint work with Thirunarayanan Balathandayuthapani,
who created a working prototype.
Thanks to Matthias Leich for extensive testing.
btr_discard_only_page_on_level(): Attempt to read the MDEV-15562 metadata
record from the leaf page, not the root page. In the root, the leftmost
(in this case, the only) node pointer would look like a metadata record.
This corruption bug was introduced in
commit 0e5a4ac253 (MDEV-15562).
The scenario is rare: a column was dropped instantly or the order of
columns was changed instantly, and then the table became empty in such
a way that in the last step, the root page had one child page.
Normally, a non-leaf B-tree page would always contain at least 2 children.
In the rewrite of MDEV-8139 (based on MDEV-15528), we introduced a
wrong assumption that any persistent tablespace that is not an .ibd
file is the system tablespace. This assumption is broken when
innodb_undo_tablespaces (files undo001, undo002, ...) are being used.
By default, we have innodb_undo_tablespaces=0 (the persistent undo
log is being stored in the system tablespace).
In MDEV-15528 and MDEV-8139 we rewrote the page scrubbing logic
so that it will follow the tried-and-true write-ahead logging
protocol, first writing FREE_PAGE records and then in the page
flushing, zerofilling or hole-punching freed pages.
Unfortunately, the implementation included a wrong assumption that
that anything that is not in an .ibd file must be the system tablespace.
This wrong assumption would cause overwrites of valid data pages in
the system tablespace.
mtr_t::m_freed_in_system_tablespace: Remove.
mtr_t::m_freed_space: The tablespace associated with m_freed_pages.
buf_page_free(): Take the tablespace and page number as a parameter,
instead of taking a page identifier.
SHOW ENGINE INNODB MUTEX functionality is completely removed,
as are the InnoDB latching order checks.
We will enforce innodb_fatal_semaphore_wait_threshold
only for dict_sys.mutex and lock_sys.mutex.
dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex.
lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex.
FIXME: srv_sys should be removed altogether; it is duplicating tpool
functionality.
fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must
not hold fil_system.mutex.
fil_close_all_files(): To prevent SAFE_MUTEX warnings for
fil_space_destroy_crypt_data(), we must not hold fil_system.mutex
while invoking fil_space_free_low() on a detached tablespace.
Let us replace os_event_t with mysql_cond_t, and replace the
necessary ib_mutex_t with mysql_mutex_t so that they can be
used with condition variables.
Also, let us replace polling (os_thread_sleep() or timed waits)
with plain mysql_cond_wait() wherever possible.
Furthermore, we will use the lightweight srw_mutex for trx_t::mutex,
to hopefully reduce contention on lock_sys.mutex.
FIXME: Add test coverage of
mariabackup --backup --kill-long-queries-timeout