We will remove the InnoDB background operation of merging buffered
changes to secondary index leaf pages. Changes will only be merged as a
result of an operation that accesses a secondary index leaf page,
such as a SQL statement that performs a lookup via that index,
or is modifying the index. Also ROLLBACK and some background operations,
such as purging the history of committed transactions, or computing
index cardinality statistics, can cause change buffer merge.
Encryption key rotation will not perform change buffer merge.
The motivation of this change is to simplify the I/O logic and to
allow crash recovery to happen in the background (MDEV-14481).
We also hope that this will reduce the number of "mystery" crashes
due to corrupted data. Because change buffer merge will typically
take place as a result of executing SQL statements, there should be
a clearer connection between the crash and the SQL statements that
were executed when the server crashed.
In many cases, a slight performance improvement was observed.
This is joint work with Thirunarayanan Balathandayuthapani
and was tested by Axel Schwenke and Matthias Leich.
The InnoDB monitor counter innodb_ibuf_merge_usec will be removed.
On slow shutdown (innodb_fast_shutdown=0), we will continue to
merge all buffered changes (and purge all undo log history).
Two InnoDB configuration parameters will be changed as follows:
innodb_disable_background_merge: Removed.
This parameter existed only in debug builds.
All change buffer merges will use synchronous reads.
innodb_force_recovery will be changed as follows:
* innodb_force_recovery=4 will be the same as innodb_force_recovery=3
(the change buffer merge cannot be disabled; it can only happen as
a result of an operation that accesses a secondary index leaf page).
The option used to be capable of corrupting secondary index leaf pages.
Now that capability is removed, and innodb_force_recovery=4 becomes 'safe'.
* innodb_force_recovery=5 (which essentially hard-wires
SET GLOBAL TRANSACTION ISOLATION LEVEL READ UNCOMMITTED)
becomes safe to use. Bogus data can be returned to SQL, but
persistent InnoDB data files will not be corrupted further.
* innodb_force_recovery=6 (ignore the redo log files)
will be the only option that can potentially cause
persistent corruption of InnoDB data files.
Code changes:
buf_page_t::ibuf_exist: New flag, to indicate whether buffered
changes exist for a buffer pool page. Pages with pending changes
can be returned by buf_page_get_gen(). Previously, the changes
were always merged inside buf_page_get_gen() if needed.
ibuf_page_exists(const buf_page_t&): Check if a buffered changes
exist for an X-latched or read-fixed page.
buf_page_get_gen(): Add the parameter allow_ibuf_merge=false.
All callers that know that they may be accessing a secondary index
leaf page must pass this parameter as allow_ibuf_merge=true,
unless it does not matter for that caller whether all buffered
changes have been applied. Assert that whenever allow_ibuf_merge
holds, the page actually is a leaf page. Attempt change buffer
merge only to secondary B-tree index leaf pages.
btr_block_get(): Add parameter 'bool merge'.
All callers of btr_block_get() should know whether the page could be
a secondary index leaf page. If it is not, we should avoid consulting
the change buffer bitmap to even consider a merge. This is the main
interface to requesting index pages from the buffer pool.
ibuf_merge_or_delete_for_page(), recv_recover_page(): Replace
buf_page_get_known_nowait() with much simpler logic, because
it is now guaranteed that that the block is x-latched or read-fixed.
mlog_init_t::mark_ibuf_exist(): Renamed from mlog_init_t::ibuf_merge().
On crash recovery, we will no longer merge any buffered changes
for the pages that we read into the buffer pool during the last batch
of applying log records.
buf_page_get_gen_known_nowait(), BUF_MAKE_YOUNG, BUF_KEEP_OLD: Remove.
btr_search_guess_on_hash(): Merge buf_page_get_gen_known_nowait()
to its only remaining caller.
buf_page_make_young_if_needed(): Define as an inline function.
Add the parameter buf_pool.
buf_page_peek_if_young(), buf_page_peek_if_too_old(): Add the
parameter buf_pool.
fil_space_validate_for_mtr_commit(): Remove a bogus comment
about background merge of the change buffer.
btr_cur_open_at_rnd_pos_func(), btr_cur_search_to_nth_level_func(),
btr_cur_open_at_index_side_func(): Use narrower data types and scopes.
ibuf_read_merge_pages(): Replaces buf_read_ibuf_merge_pages().
Merge the change buffer by invoking buf_page_get_gen().
btr_page_get_split_rec_to_left(): Assert that in the leftmost leaf page,
if the metadata record exists, index->is_instant() must hold.
The assertion of commit 01f45becd1
could fail during innobase_instant_try().
btr_page_get_split_rec_to_left(): Assert that in the leftmost leaf page,
the metadata record exists if and only if index->is_instant().
page_validate(): Correct the wording of a message.
rec_init_offsets(): Assert that whenever a record is in "instant ALTER"
format, index->is_instant() must hold.
In MDEV-11369 (instant ADD COLUMN) in MariaDB Server 10.3,
we introduced the hidden metadata record that must be the
first record in the clustered index if and only if
index->is_instant() holds.
To catch MDEV-19783, in
commit ed0793e096 and
commit 99dc40d6ac
we added some assertions to find cases where
the metadata record is missing while it should not be, or a
record exists when it should not. Those assertions were invalid
when traversing the PAGE_FREE list. That list can contain anything;
we must only be able to determine the successor and the size of
each garbage record in it.
page_validate(), page_simple_validate_old(), page_simple_validate_new():
Do not invoke page_rec_get_next_const() for traversing the PAGE_FREE
list, but instead use a lower-level accessor that does not attempt to
validate the REC_INFO_MIN_REC_FLAG.
page_copy_rec_list_end_no_locks(),
page_copy_rec_list_start(), page_delete_rec_list_start():
Add assertions.
btr_page_get_split_rec_to_left(): Remove a redundant return value,
and make the output parameter the return value.
btr_page_get_split_rec_to_right(), btr_page_split_and_insert(): Clean up.
When MDEV-12026 introduced innodb_checksum_algorithm=full_crc32 in
MariaDB 10.4, it accidentally added a dependency on buf_page_t::encrypted.
Now that the flag has been removed, we must adjust the page-read routine.
buf_page_io_complete(): When the full_crc32 page checksum matches but the
tablespace ID in the page does not match after decrypting, we should
declare it a decryption failure and suppress the page dump output and
any attempts to re-read the page.
Constraint check is done on secondary index update.
F.ex. DELETE does row_upd_sec_index_entry() and checks constraints in
row_upd_check_references_constraints(). UPDATE is optimized for the
case when order is not changed (node->cmpl_info & UPD_NODE_NO_ORD_CHANGE)
and doesn't do row_upd_sec_index_entry(), so it doesn't check constraints.
Since for versioned DELETE we do UPDATE actually, but expect behaviour
of DELETE in terms of constraints, we should deny this optimization to
get constraints checked.
Fix wrong referenced table check when versioned DELETE inserts history in parent
table. Set check_ref to false in this case.
Removed unused dup_chk_only argument for row_ins_sec_index_entry() and
added check_ref argument.
MDEV-18057 fix was superseded by this fix and reverted.
foreign.test:
All key_type combinations: pk, unique, sec(ondary).
btr_cur_pessimistic_delete(): code changed in a way that allows
to put more REC_INFO_MIN_REC_FLAG assertions inside btr_set_min_rec_mark().
Without that change tests innodb.innodb-table-online,
innodb.temp_table_savepoint and innodb_zip.prefix_index_liftedlimit fail.
Removed basically duplicated page_zip_validate() calls
which fails because of temporary(!) invariant violation.
That fixed innodb_zip.wl5522_debug_zip and
innodb_zip.prefix_index_liftedlimit
The bug affects MariaDB Server 10.3 or later, but it makes sense
to improve CHECK TABLE in earlier versions already.
page_validate(): Check REC_INFO_MIN_REC_FLAG in the records.
This allows CHECK TABLE to catch more bugs.
Until now, InnoDB inefficiently compared the aligned fields
FIL_PAGE_PREV, FIL_PAGE_NEXT to the byte-order-agnostic value FIL_NULL.
This is a backport of 32170f8c6d
from MariaDB Server 10.3.
fsp_free_seg_inode(): Remove the constant parameter log=true.
Only fsp_free_page() could also be called with log=false.
The dead code was introduced in
commit 304ae942f7
by me, and spotted by Thirunarayanan Balathandayuthapani.
This applies to large allocations.
This maps to the way Linux does it in MDEV-10814 except FreeBSD uses
different constants.
Adjust error string to match to implementation.
Tested on FreeBSD-12.0
For release builds, do not declare unused variables.
unpack_row(): Omit a debug-only variable from WSREP diagnostic message.
create_wsrep_THD(): Fix -Wmaybe-uninitialized for the PSI_thread_key.
Thanks to Eugene Kosov for noting that the fix is incomplete.
It turns out that on instant DROP/reorder column (MDEV-15562),
we must always write the metadata record, even though the table
was empty. Alternatively, we should guarantee that all undo
log records for the table have been purged. (Attempting to do
that by updating table_id leads to other problems; see
commit 1b31d8852c00b4bab6e6fe179b97db45ccb8d535.)
It would be tempting to remove dict_index_t::clear_instant_alter()
altogether, but it turns that we need that when the instant ALTER TABLE
operation of a first-time DROP COLUMN is being rolled back.
innobase_instant_try(): Clarify a comment. Purge never calls
dict_index_t::clear_instant_alter(), but it may invoke
dict_index_t::clear_instant_add(). On first-time instant DROP/reorder,
always write a metadata record, even if the table is empty.
The test encryption.innodb-redo-badkey was accidentally disabled
until commit 23657a2101 enabled
it recently. Once it was enabled, it started failing randomly.
recv_recover_corrupt_page(): Do not assume that any redo log exists
for the page. A page may be unnecessarily read by read-ahead.
When noting the corruption, reset recv_addr->state to RECV_PROCESSED,
so that even if the same page is re-read again, we will only
decrement recv_sys->n_addrs once.
The test innodb_fts.fulltext_table_evict was only creating 1000 tables
with fulltext indexes, only to check that no tables with fulltext
indexes are being evicted.
The reason why tables containing fulltext indexes cannot be evicted is
that fts_optimize_init() invokes dict_table_prevent_eviction().
For CMAKE_BUILD_TYPE=Debug, the default MYSQL_MAINTAINER_MODE=AUTO
implies -Werror along with other flags in cmake/maintainer.cmake,
which would break the debug builds when CMAKE_CXX_FLAGS include -O2.
This fix includes a backport of 6dd3f24090
from MariaDB 10.3.
The crash scenario is as follows:
(1) A non-empty table exists.
(2) MDEV-15562 instant ADD/DROP/reorder has been invoked.
(3) Some purgeable undo log exists for the table.
(4) The table becomes empty, containing not even any delete-marked records,
only containing the hidden metadata record that was added in (2).
(5) An instant ADD/DROP/reorder column is executed, and the table
is emptied and the (2) metadata removed.
(6) Purge processes an undo log record from (3), which will refer to
a non-existent clustered index field, because the metadata that
was created in (2) was remoeved in (5).
We fix this by adjusting step (5) so that we will never remove the
MDEV-15562-style metadata record. Removing the MDEV-11369 metadata
record (instant ADD COLUMN to the end of the table) is completely
fine at any time when the table becomes empty, because
dict_index_t::n_fields will remain unchanged.
innobase_instant_try(): Never remove the MDEV-15562 metadata record.
page_cur_delete_rec(): Do not reset FIL_PAGE_TYPE when the
MDEV-15562 metadata record is being removed as part of
btr_cur_pessimistic_update() invoked by innobase_instant_try().
lock_print_info::operator(): Do not dereference purge_sys.query in case
it is NULL. We would not initialize purge_sys if innodb_force_recovery
is set to 5 or 6.
The test case will be added by merge from 10.2.
In MariaDB 10.4.0, commit 09af00cbde
removed the crash-upgrade logic for the MariaDB 10.2
innodb_safe_truncate=OFF TRUNCATE TABLE (which was the only option
between MariaDB 10.2.2 and 10.2.18), but failed to adjust some
comments and code.
buf_page_io_complete(): Remove a bogus comment about TRUNCATE.
dict_recreate_index_tree(): Unused function; remove.
fil_space_t::stop_new_ops: Clarify the comment.
fil_space_acquire_low(): Remove a bogus comment about TRUNCATE.
fil_check_pending_ops(), fil_check_pending_io(): Adjust a warning message.
This code is only invoked as part of DISCARD TABLESPACE or DROP TABLE.
DROP TABLE is internally used as part of ALTER TABLE, OPTIMIZE TABLE,
or TRUNCATE TABLE.
RemoteDatafile::create_link_file(): Clarify a comment.
ibuf_delete_for_discarded_space(): Clarify the function comment.
dict_table_x_lock_indexes(), dict_table_x_unlock_indexes():
Merge with the only remaining caller, row_quiesce_set_state().
page_create_zip(): Remove a bogus comment about TRUNCATE.