ibuf_remove_free_page(): Correct the calculation of root_savepoint().
The first entry acquired by ibuf_tree_root_get() will be ibuf.index.lock
and not the change buffer root page.
Thanks to Matthias Leich for finding this bug in RQG.
Unfortunately, this code is very difficult to cover
in our regression test suite.
When the change buffer records for a page span across multiple change
buffer leaf pages or the starting record is at the beginning of a page
with a left sibling, ibuf_delete_recs deletes only the records in first
page and fails to move to subsequent pages.
Subsequently a slow shutdown hangs trying to delete those left over
records.
Fix-A: Position the cursor to an user record in B-tree and exit only
when all records are exhausted.
Fix-B: Make sure we call ibuf_delete_recs during slow shutdown for
pages with IBUF entries to cleanup any previously left over records.
Issue:
------
The actual order of acquisition of the IBUF pessimistic insert mutex
(SYNC_IBUF_PESS_INSERT_MUTEX) and IBUF header page latch
(SYNC_IBUF_HEADER) w.r.t space latch (SYNC_FSP) differs from the order
defined in sync0types.h. It was not discovered earlier as the path to
ibuf_remove_free_page was not covered by the mtr test. Ideal order and
one defined in sync0types.h is as follows.
SYNC_IBUF_HEADER -> SYNC_IBUF_PESS_INSERT_MUTEX -> SYNC_FSP
In ibuf_remove_free_page, we acquire space latch earlier and we have
the order as follows resulting in the assert with innodb_sync_debug=on.
SYNC_FSP -> SYNC_IBUF_HEADER -> SYNC_IBUF_PESS_INSERT_MUTEX
Fix:
---
We do maintain this order in other places and there doesn't seem to be
any real issue here. To reduce impact in GA versions, we avoid doing
extensive changes in mutex ordering to match the current
SYNC_IBUF_PESS_INSERT_MUTEX order. Instead we relax the ordering check
for IBUF pessimistic insert mutex using SYNC_NO_ORDER_CHECK.
flst_read_addr(): Remove assertions. Instead, we will check these
conditions in the callers and avoid a crash in case of corruption.
We will check the conditions more carefully, because the callers
know more exact bounds for the page numbers and the byte offsets
withing pages.
flst_remove(), flst_add_first(), flst_add_last(): Add a parameter
for passing fil_space_t::free_limit. None of the lists may point to
pages that are beyond the current initialized length of the
tablespace.
trx_rseg_mem_restore(): Access the first page of the tablespace,
so that we will correctly recover rseg->space->free_limit
in case some log based recovery is pending.
ibuf_remove_free_page(): Only look up the root page once, and
validate the last page number.
Reviewed by: Debarun Banerjee
The linear read-ahead (enabled by nonzero innodb_read_ahead_threshold)
works best if index leaf pages or undo log pages have been allocated
on adjacent page numbers. The read-ahead is assumed not to be helpful
in other types of page accesses, such as non-leaf index pages.
buf_page_get_low(): Do not invoke buf_page_t::set_accessed(),
buf_page_make_young_if_needed(), or buf_read_ahead_linear().
We will invoke them in those callers of buf_page_get_gen() or
buf_page_get() where it makes sense: the access is not
one-time-on-startup and the page and not going to be freed soon.
btr_copy_blob_prefix(), btr_pcur_move_to_next_page(),
trx_undo_get_prev_rec_from_prev_page(),
trx_undo_get_first_rec(), btr_cur_t::search_leaf(),
btr_cur_t::open_leaf(): Invoke buf_read_ahead_linear().
We will not invoke linear read-ahead in functions that would
essentially allocate or free pages, because pages that are
freshly allocated are expected to be initialized by buf_page_create()
and not read from the data file. Likewise, freeing pages should
not involve accessing any sibling pages, except for freeing
singly-linked lists of BLOB pages.
We will not invoke read-ahead in btr_cur_t::pessimistic_search_leaf()
or in a pessimistic operation of btr_cur_t::open_leaf(), because
it is assumed that pessimistic operations should be preceded by
optimistic operations, which should already have invoked read-ahead.
buf_page_make_young_if_needed(): Invoke also buf_page_t::set_accessed()
and return the result.
btr_cur_nonleaf_make_young(): Like buf_page_make_young_if_needed(),
but do not invoke buf_page_t::set_accessed().
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
When InnoDB attempts to buffer a change operation of a secondary index
leaf page (to insert, delete-mark or remove a record) and the
change buffer is too large, InnoDB used to trigger a change buffer merge
that could affect any tables. This could lead to huge variance in
system throughput and potentially unpredictable crashes, in case the
change buffer was corrupted and a crash occurred while attempting to
merge changes to a table that is not being accessed by the current
SQL statement.
ibuf_insert_low(): Simply return DB_STRONG_FAIL when the maximum size
of the change buffer is exceeded.
ibuf_contract_after_insert(): Remove.
ibuf_get_merge_page_nos_func(): Remove a constant parameter.
The function ibuf_contract() will be our only caller, during
shutdown with innodb_fast_shutdown=0.
A 3-thread deadlock has been frequently observed when using
innodb_change_buffering!=none and innodb_file_per_table=0:
(1) ibuf_merge_or_delete_for_page() holding an exclusive latch on the block
and waiting for an exclusive tablespace latch in fseg_page_is_allocated()
(2) btr_free_but_not_root() in fseg_free_step() waiting for an
exclusive tablespace latch
(3) fsp_alloc_free_page() holding the exclusive tablespace latch and waiting
for a latch on the block, which it is reallocating for something else
While this was reproduced using innodb_file_per_table=0, this hang should
be theoretically possible in .ibd files as well, when the recovery or
cleanup of a failed DROP INDEX or ADD INDEX is executing concurrently
with something that involves page allocation.
ibuf_merge_or_delete_for_page(): Avoid invoking fseg_page_is_allocated()
when block==nullptr. The call was redundant in this case, and it could
cause deadlocks due to latching order violation.
ibuf_read_merge_pages(): Acquire an exclusive tablespace latch
before invoking buf_page_get_gen(), which may cause
fseg_page_is_allocated() to be invoked in ibuf_merge_or_delete_for_page().
Note: This will not fix all latching order violations in this area!
Deadlocks involving ibuf_merge_or_delete_for_page(block!=nullptr) are
still possible if the caller is not acquiring an exclusive tablespace latch
upfront. This would be the case in any read operation that involves a
change buffer merge, such as SELECT, CHECK TABLE, or any DML operation that
cannot be buffered in the change buffer.
While downgrades are not supported and misguided attempts at it could
cause serious corruption especially after
commit b07920b634
it might be useful if InnoDB would start up even after an upgrade to
MariaDB Server 11.0 or later had removed the change buffer.
innodb_change_buffering_update(): Disallow anything else than
innodb_change_buffering=none when the change buffer is corrupted.
ibuf_init_at_db_start(): Mention a possible downgrade in the corruption
error message. If innodb_change_buffering=none, ignore the error but do
not initialize ibuf.index.
ibuf_free_excess_pages(), ibuf_contract(), ibuf_merge_space(),
ibuf_update_max_tablespace_id(), ibuf_delete_for_discarded_space(),
ibuf_print(): Check for !ibuf.index.
ibuf_check_bitmap_on_import(): Remove some unnecessary code.
This function is only accessing change buffer bitmap pages in a
data file that is not attached to the rest of the database.
It is not accessing the change buffer tree itself, hence it does
not need any additional mutex protection.
This has been tested both by starting up MariaDB Server 10.8 on
a 11.0 data directory, and by running ./mtr --big-test while
ibuf_init_at_db_start() was tweaked to always fail.
This also fixes part of MDEV-29835 Partial server freeze
which is caused by violations of the latching order that was
defined in https://dev.mysql.com/worklog/task/?id=6326
(WL#6326: InnoDB: fix index->lock contention). Unless the
current thread is holding an exclusive dict_index_t::lock,
it must acquire page latches in a strict parent-to-child,
left-to-right order. Not all cases of MDEV-29835 are fixed yet.
Failure to follow the correct latching order will cause deadlocks
of threads due to lock order inversion.
As part of these changes, the BTR_MODIFY_TREE mode is modified
so that an Update latch (U a.k.a. SX) will be acquired on the
root page, and eXclusive latches (X) will be acquired on all pages
leading to the leaf page, as well as any left and right siblings
of the pages along the path. The DEBUG_SYNC test innodb.innodb_wl6326
will be removed, because at the time the DEBUG_SYNC point is hit,
the thread is actually holding several page latches that will be
blocking a concurrent SELECT statement.
We also remove double bookkeeping that was caused due to excessive
information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo
store information of latched pages, and ensure that
mtr_memo_slot_t::object is never a null pointer.
The tree_blocks[] and tree_savepoints[] were redundant.
buf_page_get_low(): If innodb_change_buffering_debug=1, to avoid
a hang, do not try to evict blocks if we are holding a latch on
a modified page. The test innodb.innodb-change-buffer-recovery
will be removed, because change buffering may no longer be forced
by debug injection when the change buffer comprises multiple pages.
Remove a debug assertion that could fail when
innodb_change_buffering_debug=1 fails to evict a page.
For other cases, the assertion is redundant, because we already
checked that right after the got_block: label. The test
innodb.innodb-change-buffering-recovery will be removed, because
due to this change, we will be unable to evict the desired page.
mtr_t::lock_register(): Register a change of a page latch
on an unmodified buffer-fixed block.
mtr_t::x_latch_at_savepoint(), mtr_t::sx_latch_at_savepoint():
Replaced by the use of mtr_t::upgrade_buffer_fix(), which now
also handles RW_S_LATCH.
mtr_t::set_modified(): For temporary tables, invoke
buf_page_t::set_modified() here and not in mtr_t::commit().
We will never set the MTR_MEMO_MODIFY flag on other than
persistent data pages, nor set mtr_t::m_modifications when
temporary data pages are modified.
mtr_t::commit(): Only invoke the buf_flush_note_modification() loop
if persistent data pages were modified.
mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo.
This avoids many redundant entries in mtr_t::m_memo, as well as
redundant calls to buf_page_get_gen() for blocks that had already
been looked up in a mini-transaction.
btr_get_latched_root(): Return a pointer to an already latched root page.
This replaces btr_root_block_get() in cases where the mini-transaction
has already latched the root page.
btr_page_get_parent(): Fetch a parent page that was already latched
in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched().
If needed, upgrade the root page U latch to X.
This avoids bloating mtr_t::m_memo as well as performing redundant
buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for
B-tree defragmentation, we will invoke btr_cur_search_to_nth_level().
btr_cur_search_to_nth_level(): This will only be used for non-leaf
(level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE
or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be
removed altogether, or retained for the case of
CHECK TABLE without QUICK.
btr_cur_t::left_block: Remove. btr_pcur_move_backward_from_page()
can retrieve the left sibling from the end of mtr_t::m_memo.
btr_cur_t::open_leaf(): Some clean-up.
btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level()
for searches to level=0 (the leaf level). We will never release
parent page latches before acquiring leaf page latches. If we need to
temporarily release the level=1 page latch in the BTR_SEARCH_PREV or
BTR_MODIFY_PREV latch_mode, we will reposition the cursor on the
child node pointer so that we will land on the correct leaf page.
btr_cur_t::pessimistic_search_leaf(): Implement new BTR_MODIFY_TREE
latching logic in the case that page splits or merges will be needed.
The parent pages (and their siblings) should already be latched on
the first dive to the leaf and be present in mtr_t::m_memo; there
should be no need for BTR_CONT_MODIFY_TREE. This pre-latching almost
suffices; it must be revised in MDEV-29835 and work-arounds removed
for cases where mtr_t::get_already_latched() fails to find a block.
rtr_search_to_nth_level(): A SPATIAL INDEX version of
btr_search_to_nth_level() that can search to any level
(including the leaf level).
rtr_search_leaf(), rtr_insert_leaf(): Wrappers for
rtr_search_to_nth_level().
rtr_search(): Replaces rtr_pcur_open().
rtr_latch_leaves(): Replaces btr_cur_latch_leaves(). Note that unlike
in the B-tree code, there is no error handling in case the sibling
pages are corrupted.
rtr_cur_restore_position(): Remove an unused constant parameter.
btr_pcur_open_on_user_rec(): Remove the constant parameter
mode=PAGE_CUR_GE.
row_ins_clust_index_entry_low(): Use a new
mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page
when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC.
BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove.
BTR_CONT_MODIFY_TREE: Note that this is only used by
rtr_search_to_nth_level().
btr_pcur_optimistic_latch_leaves(): Replaces
btr_cur_optimistic_latch_leaves().
ibuf_delete_rec(): Acquire exclusive ibuf.index->lock in order
to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV).
btr_blob_log_check_t(): Acquire a U latch on the root page,
so that btr_page_alloc() in btr_store_big_rec_extern_fields()
will avoid a deadlock.
btr_store_big_rec_extern_fields(): Assert that the root page latch
is being held.
Tested by: Matthias Leich
Reviewed by: Vladislav Lesin
This also fixes part of MDEV-29835 Partial server freeze
which is caused by violations of the latching order that was
defined in https://dev.mysql.com/worklog/task/?id=6326
(WL#6326: InnoDB: fix index->lock contention). Unless the
current thread is holding an exclusive dict_index_t::lock,
it must acquire page latches in a strict parent-to-child,
left-to-right order. Not all cases are fixed yet. Failure to
follow the correct latching order will cause deadlocks of threads
due to lock order inversion.
As part of these changes, the BTR_MODIFY_TREE mode is modified
so that an Update latch (U a.k.a. SX) will be acquired on the
root page, and eXclusive latches (X) will be acquired on all pages
leading to the leaf page, as well as any left and right siblings
of the pages along the path. The test innodb.innodb_wl6326
will be removed, because at the time the DEBUG_SYNC point is hit,
the thread is actually holding several page latches that will be
blocking a concurrent SELECT statement.
We also remove double bookkeeping that was caused due to excessive
information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo
store information of latched pages, and ensure that
mtr_memo_slot_t::object is never a null pointer.
The tree_blocks[] and tree_savepoints[] were redundant.
mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo.
This avoids many redundant entries in mtr_t::m_memo, as well as
redundant calls to buf_page_get_gen() for blocks that had already
been looked up in a mini-transaction.
btr_get_latched_root(): Return a pointer to an already latched root page.
This replaces btr_root_block_get() in cases where the mini-transaction
has already latched the root page.
btr_page_get_parent(): Fetch a parent page that was already latched
in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched().
If needed, upgrade the root page U latch to X.
This avoids bloating mtr_t::m_memo as well as redundant
buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for
B-tree defragmentation, we will invoke btr_cur_search_to_nth_level().
btr_cur_search_to_nth_level(): This will only be used for non-leaf
(level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE
or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be
removed altogether, or retained for the case of
CHECK TABLE without QUICK.
btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level()
for searches to level=0 (the leaf level).
btr_cur_t::pessimistic_search_leaf(): Implement the new
BTR_MODIFY_TREE latching logic in the case that page splits
or merges will be needed. The parent pages (and their siblings)
should already be latched on the first dive to the leaf and be
present in mtr_t::m_memo; there should be no need for
BTR_CONT_MODIFY_TREE. This pre-latching almost suffices;
MDEV-29835 will have to revise it and remove work-arounds where
mtr_t::get_already_latched() fails to find a block.
rtr_search_to_nth_level(): A SPATIAL INDEX version of
btr_search_to_nth_level() that can search to any level
(including the leaf level).
rtr_search_leaf(), rtr_insert_leaf(): Wrappers for
rtr_search_to_nth_level().
rtr_search(): Replaces rtr_pcur_open().
rtr_cur_restore_position(): Remove an unused constant parameter.
btr_pcur_open_on_user_rec(): Remove the constant parameter
mode=PAGE_CUR_GE.
btr_cur_latch_leaves(): Update a pre-existing mtr_t::m_memo entry
for the current leaf page.
row_ins_clust_index_entry_low(): Use a new
mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page
when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC.
btr_cur_t::open_leaf(): Some clean-up.
mtr_t::lock_register(): Register a page latch on a buffer-fixed block.
BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove.
BTR_CONT_MODIFY_TREE: Note that this is only used by
rtr_search_to_nth_level().
btr_pcur_optimistic_latch_leaves(): Replaces
btr_cur_optimistic_latch_leaves().
ibuf_delete_rec(): Acquire ibuf.index->lock.u_lock() in order
to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV).
Tested by: Matthias Leich
btr_cur_t::open_random_leaf(): Replaces btr_cur_open_at_rnd_pos().
Acquire a shared latch on each page, and finally release all
latches except the one on the leaf page.
This fixes a race condition between the purge of history and
btr_estimate_number_of_different_key_vals(), which turned out
to only hold a buffer-fix on the randomly chosen leaf page.
Typically, an assertion would fail in page_rec_is_supremum().
ibuf_contract(): Start from the beginning of the change buffer,
to simplify the logic. Starting with
commit b42294bc64
it does not matter much where the change buffer merge is being initiated.
The race condition may have been introduced as early as
mysql/mysql-server@ac74632293
from where it was copied to
commit 2e814d4702.
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
ibuf_init_at_db_start(): Validate the change buffer root page.
A later version may stop creating a change buffer, and this
validation check will prevent a downgrade from such later versions.
ibuf_max_size_update(): If the change buffer was not loaded, do nothing.
dict_boot(): Merge the local variable "error" to "err". Ignore
failures of ibuf_init_at_db_start() if innodb_force_recovery>=4.
The InnoDB change buffer (ibuf.index, stored in the system tablespace)
and the change buffer bitmaps in persistent tablespaces could get out
of sync with each other: According to the bitmap, no changes exist for
a page, while there actually exist buffered entries in ibuf.index.
InnoDB performs lazy deletion of buffered changes. When a secondary
index leaf page is freed (possibly as part of DROP INDEX), any
buffered changes will not be deleted. Instead, they would be deleted
on a subsequent buf_page_create_low().
One scenario where InnoDB failed to delete buffered changes is
as follows:
1. Some changes were buffered for a secondary index leaf page.
2. The index page had been freed.
3. ibuf_read_merge_pages() invoked ibuf_merge_or_delete_for_page(),
which noticed that the page had been freed, and reset the change buffer
bits, but did not delete the records from ibuf.index.
4. The index page was reallocated for something else.
5. The index page was removed from the buffer pool.
6. Some changes were buffered for the newly created page.
7. Finally, the buffered changes from both 1. and 6. were merged.
8. The index is corrupted.
An alternative outcome is:
4. Shutdown with innodb_fast_shutdown=0 gets into an infinite loop.
An alternative scenario is:
3. ibuf_set_bitmap_for_bulk_load() reset the IBUF_BITMAP_BUFFERED bit
but did not delete the ibuf.index records for that page number.
The shutdown hang was already once fixed in
commit d7a2401750, refactored for
10.5 in commit 77e8a311e1 and
disabled in commit 310dff5d84
due to corruption.
We will fix this as follows:
ibuf_delete_recs(): Delete all ibuf.index entries for the specified page.
ibuf_merge_or_delete_for_page(): When the change buffer bitmap bits
were set and the page had been freed, and the page does not belong
to ibuf.index itself, invoke ibuf_delete_recs(). This prevents the
corruption from occurring when a DML operation is allocating a
previously freed page for which changes had been buffered.
ibuf_set_bitmap_for_bulk_load(): When the change buffer bitmap bits
were set, invoke ibuf_delete_recs(). This prevents the corruption
from occurring when CREATE INDEX is reusing a previously freed page.
ibuf_read_merge_pages(): On slow shutdown, remove the orphan records
by invoking ibuf_delete_recs(). This fixes the hang when the change
buffer had become corrupted. We also remove the dops[] accounting,
because nothing can monitor it during shutdown. We invoke
ibuf_delete_recs() if:
(a) buf_page_get_gen() failed to load the page or merge changes
(b) the page is not a valid index leaf page
(c) the page number is out of tablespace bounds
srv_shutdown(): Invoke ibuf_max_size_update(0) to ensure that
the race condition that motivated us to disable the code in
ibuf_read_merge_pages() in commit 310dff5d84
is no longer possible. That is, during slow shutdown, both the
rollback of transactions and the purge of history will return
early from ibuf_insert_low().
ibuf_merge_space(), ibuf_delete_for_discarded_space(): Cleanup:
Do not allocate a memory heap.
This was implemented by Thirunarayanan Balathandayuthapani
and tested with innodb_change_buffering_debug=1 by Matthias Leich.
btr_cur_t: Zero-initialize all fields in the default constructor.
btr_cur_t::index: Remove; it duplicated page_cur.index.
Many functions: Remove arguments that were duplicating
page_cur_t::index and page_cur_t::block.
page_cur_open_level(), btr_pcur_open_level(): Replaces
btr_cur_open_at_index_side() for dict_stats_analyze_index().
At the end, release all latches except the dict_index_t::lock
and the buf_page_t::lock on the requested page.
dict_stats_analyze_index(): Rely on mtr_t::rollback_to_savepoint()
to release all uninteresting page latches.
btr_search_guess_on_hash(): Simplify the logic, and invoke
mtr_t::rollback_to_savepoint().
We will use plain C++ std::vector<mtr_memo_slot_t> for mtr_t::m_memo.
In this way, we can avoid setting mtr_memo_slot_t::object to nullptr
and instead just remove garbage from m_memo.
mtr_t::rollback_to_savepoint(): Shrink the vector. We will be needing this
in dict_stats_analyze_index(), where we will release page latches and
only retain the index->lock in mtr_t::m_memo.
mtr_t::release_last_page(): Release the last acquired page latch.
Replaces btr_leaf_page_release().
mtr_t::release(const buf_block_t&): Release a single page latch.
Used in btr_pcur_move_backward_from_page().
mtr_t::memo_release(): Replaced with mtr_t::release().
mtr_t::upgrade_buffer_fix(): Acquire a latch for a buffer-fixed page.
This replaces the double bookkeeping in btr_cur_t::open_leaf().
Reviewed by: Vladislav Lesin
btr_cur_t::open_leaf(): Replaces btr_cur_open_at_index_side() for
most calls, except dict_stats_analyze_index(), which is the only
place where we need to open a page at the non-leaf level.
Use btr_block_get() for better error handling.
Also, use the enumeration type btr_latch_mode wherever possible.
Reviewed by: Vladislav Lesin
ibuf.size, ibuf.max_size: Changed the type to Atomic_relaxed<ulint>
in order to fix some (not all) race conditions.
ibuf_contract(): Renamed from ibuf_merge_pages(ulint*).
ibuf_merge(), ibuf_merge_all(): Removed.
srv_shutdown(): Invoke log_free_check() and ibuf_contract(). Even though
ibuf_contract() is not writing anything, it will trigger calls of
ibuf_merge_or_delete_for_page(), which will write something. Because
we cannot invoke log_free_check() at that low level, we must invoke
it at the high level.
srv_shutdown_print(): Replaces srv_shutdown_print_master_pending().
Report progress and remaining work every 15 seconds. For the
change buffer merge, the remaining work is indicated by ibuf.size.
Every operation that is going to write redo log is supposed to
invoke log_free_check() before acquiring any latches. If there
is a risk of log buffer overrun, a log checkpoint would be
triggered by that call.
ibuf_merge_space(), ibuf_merge_in_background(),
ibuf_delete_for_discarded_space(): Invoke log_free_check()
when the current thread is not holding any page latches.
Unfortunately, in lower-level code called from ibuf_insert()
or ibuf_merge_or_delete_for_page(), some page latches may be
held and a call to log_free_check() could hang.
ibuf_set_bitmap_for_bulk_load(): Use the caller's mini-transaction.
The caller should have invoked log_free_check() while not holding
any page latches.
The function rec_get_offsets_func() used to hit ut_error
due to an invalid rec_get_status() value of a
ROW_FORMAT!=REDUNDANT record. This fix is twofold:
We will not only avoid a crash on corruption in this case,
but we will also make more effort to validate each record
every time we are iterating over index page records.
rec_get_offsets_func(): Do not crash on a corrupted record.
page_rec_get_nth(): Return nullptr on error.
page_dir_slot_get_rec_validate(): Like page_dir_slot_get_rec(),
but validate the pointer and return nullptr on error.
page_cur_search_with_match(), page_cur_search_with_match_bytes(),
page_dir_split_slot(), page_cur_move_to_next():
Indicate failure in a return value.
page_cur_search(): Replaced with page_cur_search_with_match().
rec_get_next_ptr_const(), rec_get_next_ptr(): Replaced with
page_rec_get_next_low().
TODO: rtr_page_split_initialize_nodes(), rtr_update_mbr_field(),
and possibly other SPATIAL INDEX functions fail to properly handle
errors.
Reviewed by: Thirunarayanan Balathandayuthapani
Tested by: Matthias Leich
Performance tested by: Axel Schwenke
A prominent remaining source of crashes on corrupted index pages
is page directory corruption.
A frequent caller of page_dir_find_owner_slot() is page_rec_get_prev().
Some of those calls can be replaced with simpler logic that is less
prone to fail.
page_dir_find_owner_slot(),
page_rec_get_prev(), page_rec_get_prev_const(),
btr_pcur_move_to_prev(), btr_pcur_move_to_prev_on_page(),
btr_cur_upd_rec_sys(),
page_delete_rec_list_end(),
rtr_page_copy_rec_list_end_no_locks(),
rtr_page_copy_rec_list_start_no_locks(): Return an error code on failure.
fil_space_t::io(), buf_page_get_low(): Use DB_CORRUPTION for
out-of-bounds page reads.
PageBulk::getSplitRec(), PageBulk::copyOut(): Simplify the code.
btr_validate_level(): Prevent some more CHECK TABLE crashes on
corrupted pages.
btr_block_get(), btr_pcur_move_to_next_page(): Implement some checks that
were previously only part of IndexPurge::next().
IndexPurge::next(): Use btr_pcur_move_to_next_page().
The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.
We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.
This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.
We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.
Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.
buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.
btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.
btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().
FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.
trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
trx_sys_create_sys_pages(): Merged with trx_sysf_create().
dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.
dir_pathname(): Replaces os_file_make_new_pathname().
row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.
btr_decryption_failed(): Report decryption failures
dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.
dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.
dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.
dict_table_is_corrupted(): Remove.
dict_index_t::is_btree(): Determine if the index is a valid B-tree.
BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures, but error codes being returned.
buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().
fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.
btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.
opt_search_plan_for_table(): Simplify the code.
row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.
row_undo_mod_upd_exist_sec(): Simplify the code.
row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().
fut_get_ptr(): Replace with buf_page_get_gen() calls.
buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.
buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.
fseg_page_is_allocated(): Replaces fseg_page_is_free().
fts_drop_common_tables(): Return an error if the transaction
was rolled back.
fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.
fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.
Clean up mtr_t::page_lock()
buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.
buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.
recv_recover_page(): Return whether the operation succeeded.
recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.
The function btr_pcur_close() is being invoked on local variables
even when no cleanup needs to be done. In particular, for B-tree
indexes (not SPATIAL INDEX), unless btr_pcur_store_position()
was invoked in the past, there is no need to invoke btr_pcur_close().
On purge and rollback, we will retain btr_pcur_close(&pcur)
because otherwise some ./mtr --suite=innodb_gis tests would leak memory.
Since commit 2ca1123464 (MDEV-26217)
the function trx_t::commit(std::vector<pfs_os_file_t>&)
holds exclusive lock_sys.latch while invoking fil_delete_tablespace(),
which in turn may wait for change buffer tree latches in
ibuf_delete_for_discarded_space().
ibuf_insert_low(): If a shared lock_sys.latch cannot be acquired
without waiting, refuse to buffer the insert. In this way, a
deadlock with a DDL operation will be avoided.
ibuf_insert_to_index_page(), ibuf_delete(): Remove redundant calls to
record locking. In ibuf_insert_low() we already ensured that no record
locks existed on the page. No locks can be added before the buffered
changes have been merged.
This follows up the previous fix in
commit c3c53926c4 (MDEV-26554).
ha_innobase::delete_table(): Work around the insufficient
metadata locking (MDL) during DML operations by acquiring exclusive
InnoDB table locks on all child tables. Previously, this was only
done on TRUNCATE and ALTER.
ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke
lock_update_delete() during change buffer operations.
The revised trx_t::commit(std::vector<pfs_os_file_t>&) will
hold exclusive lock_sys.latch while invoking fil_delete_tablespace(),
which in turn may invoke ibuf_delete_rec().
dict_index_t::has_locking(): A new predicate, replacing the dummy
!dict_table_is_locking_disabled(index->table). Used for skipping lock
operations during ibuf_delete_rec().
trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks
and remove the table from the cache while holding exclusive
lock_sys.latch.
trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds.
trx_t::commit(): Reset dict_operation before invoking commit_in_memory()
via commit_persist().
lock_release_on_drop(): Release locks while lock_sys.latch is
exclusively locked.
lock_table(): Add a parameter for a pointer to the table.
We must not dereference the table before a lock_sys.latch has
been acquired. If the pointer to the table does not match the table
at that point, the table is invalid and DB_DEADLOCK will be returned.
row_ins_foreign_check_on_constraint(): Improve the checks.
Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed
before commit c5fd9aa562 (MDEV-25919).
row_upd_check_references_constraints(),
wsrep_row_upd_check_foreign_constraints(): Simplify checks.
The only purpose of ibuf_bitmap_mutex is to prevent a deadlock between
two concurrent invocations of ibuf_update_free_bits_for_two_pages_low()
on the same pair of bitmap pages, but in opposite order.
The mutex is unnecessarily serializing the execution of the function
even when it is being invoked on totally different tablespaces.
To avoid deadlocks, it suffices to ensure that the two page latches
are being acquired in a deterministic (sorted) order.
This failure was caused by MDEV-25975, which removed the parameter
innodb_disallow_writes.
Added a check for wsrep_sst_disable_writes to the function
ibuf_merge_in_background().
Backported from 10.5 20e9e804c1 and
5948d7602e.
sel_restore_position_for_mysql() moves forward persistent cursor
position after btr_pcur_restore_position() call if cursor relative position
is BTR_PCUR_ON and the cursor points to the record with NOT the same field
values as in a stored record(and some other not important for this case
conditions).
It was done because btr_pcur_restore_position() sets
page_cur_mode_t mode to PAGE_CUR_LE for cursor->rel_pos == BTR_PCUR_ON
before opening cursor. So we are searching for the record less or equal
to stored one. And if the found record is not equal to stored one, then
it is less and we need to move cursor forward.
But there can be a situation when the stored record was purged, but the
new one with the same key but different value was inserted while
row_search_mvcc() was suspended. In this case, when the thread is
awaken, it will invoke sel_restore_position_for_mysql(), which, in turns,
invoke btr_pcur_restore_position(), which will return false because found
record don't match stored record, and
sel_restore_position_for_mysql() will move forward cursor position.
The above can lead to the case when awaken row_search_mvcc() do not see
records inserted by other transactions while it slept. The mtr test case
shows the example how it can be.
The fix is to return special value from persistent cursor restoring
function which would notify its caller that uniq fields of restored
record and stored record are the same, and in this case
sel_restore_position_for_mysql() don't move cursor forward.
Delete-marked records are correctly processed in row_search_mvcc().
Non-unique secondary indexes are "uniquified" by adding the PK, the
index->n_uniq should then be index->n_fields. So there is no need in
additional checks in the fix.
If transaction's readview can't see the changes made in secondary index
record, it requests clustered index record in row_search_mvcc() to check
its transaction id and get the correspondent record version. After this
row_search_mvcc() commits mtr to preserve clustered index latching
order, and starts mtr. Between those mtr commit and start secondary
index pages are unlatched, and purge has the ability to remove stored in
the cursor record, what causes rows duplication in result set for
non-locking reads, as cursor position is restored to the previously
visited record.
To solve this the changes are just switched off for non-locking reads,
it's quite simple solution, besides the changes don't make sense for
non-locking reads.
The more complex and effective from performance perspective solution is
to create mtr savepoint before clustered record requesting and rolling
back to that savepoint after that. See MDEV-27557.
One more solution is to have per-record transaction id for secondary
indexes. See MDEV-17598.
If any of those is implemented, just remove select_lock_type argument in
sel_restore_position_for_mysql().