page_zip_reorganize(): Restore the page on failure.
In callers, omit now-redundant calls to page_zip_decompress().
btr_page_reorganize_low(): Define in static scope only, and
remove the z_level parameter. Assert that ROW_FORMAT is not COMPRESSED.
btr_page_reorganize_block(), btr_page_reorganize(): Invoke
page_zip_reorganize() for ROW_FORMAT=COMPRESSED.
page_zip_compress_write_log_no_data(): Remove.
We no longer write the MLOG_ZIP_PAGE_COMPRESS_NO_DATA record.
Instead, we will write MLOG_ZIP_PAGE_COMPRESS records.
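A minimal caller sketch of the new contract; the parameter list shown for page_zip_reorganize() is an assumption for illustration, only the restore-on-failure behaviour comes from the description above:
    /* Sketch only: the signature of page_zip_reorganize() is assumed. */
    if (!page_zip_reorganize(block, index, mtr)) {
        /* Previously, the caller had to invoke page_zip_decompress()
        here to restore the uncompressed copy of the page.
        page_zip_reorganize() now restores the page itself on failure,
        so the caller merely reports the error. */
        return false;
    }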
Our benchmarking efforts indicate that the reasons for splitting the
buf_pool in commit c18084f71b
have mostly gone away, possibly as a result of
mysql/mysql-server@ce6109ebfd
or similar work.
Only in one write-heavy benchmark, where the working set size is
ten times the buffer pool size, was buf_pool->mutex less contended
in buf_page_io_complete() with 4 buffer pool instances than with
1 instance. That contention could be alleviated
further by making more use of std::atomic and by splitting
buf_pool_t::mutex further (MDEV-15053).
We will deprecate and ignore the following parameters:
innodb_buffer_pool_instances
innodb_page_cleaners
There will be only one buffer pool and one page cleaner task.
In a number of INFORMATION_SCHEMA views, columns that indicated
the buffer pool instance will be removed:
information_schema.innodb_buffer_page.pool_id
information_schema.innodb_buffer_page_lru.pool_id
information_schema.innodb_buffer_pool_stats.pool_id
information_schema.innodb_cmpmem.buffer_pool_instance
information_schema.innodb_cmpmem_reset.buffer_pool_instance
During native table rebuild or index creation, InnoDB used to skip
redo logging and write MLOG_INDEX_LOAD records to inform crash recovery
and Mariabackup of the gaps in redo log. This is fragile and prohibits
some optimizations, such as skipping the doublewrite buffer for
newly (re)initialized pages (MDEV-19738).
row_merge_write_redo(): Remove. We do not write MLOG_INDEX_LOAD
records any more. Instead, we write full redo log.
FlushObserver: Remove.
fseg_free_page_func(): Remove the parameter log. Redo logging
cannot be disabled.
fil_space_t::redo_skipped_count: Remove.
We cannot remove buf_block_t::skip_flush_check, because PageBulk
will temporarily generate invalid B-tree pages in the buffer pool.
I found that memcpy_aligned was used incorrectly in the redo log code and
decided to put assertions into the aligned functions, which uncovered even
more incorrect cases. Given the number of bugs discovered, the assertions
are left in place to prevent future bugs.
my_assume_aligned(): a function template that replaces the MY_ASSUME_ALIGNED macro.
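A minimal sketch of the idea, assuming GCC/clang __builtin_assume_aligned() is available; the function body shown is illustrative, not the exact implementation:
    template<unsigned Alignment, typename T>
    inline T *my_assume_aligned(T *ptr)
    {
        /* catch callers that pass an insufficiently aligned pointer */
        assert(reinterpret_cast<uintptr_t>(ptr) % Alignment == 0);
        return static_cast<T*>(__builtin_assume_aligned(ptr, Alignment));
    }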
Pass buf_block_t* to more functions that write redo log.
page_zip_write_node_ptr(), page_zip_write_blob_ptr(),
page_zip_compress_write_log_no_data():
Take buf_block_t* as parameter, and do not tolerate mtr=NULL.
page_zip_compress(): Do not tolerate mtr=NULL.
page_zip_dir_insert(): Take page_cur_t* as parameter.
mlog_write_initial_log_record(): Remove. This function was unused.
RecIterator::remove(): Remove the redundant page_zip parameter.
PageConverter::m_page_zip_ptr: Remove.
offset_t: a type that represents one record offset.
It is an unsigned short int.
a lot of functions: replace ulint with offset_t
btr_pcur_restore_position_func(),
page_validate(),
row_ins_scan_sec_index_for_duplicate(),
row_upd_clust_rec_by_insert_inherit_func(),
row_vers_impl_x_locked_low(),
trx_undo_prev_version_build():
allocate record offsets on the stack instead of waiting for rec_get_offsets()
to allocate them from mem_heap_t, thus reducing memory allocations.
RECORD_OFFSET, INDEX_OFFSET:
it is now less convenient to store pointers in an offset_t*
array, because one pointer now occupies several offset_t elements. These
constants are the start indexes into the array where the pointer values are stored.
REC_OFFS_HEADER_SIZE: adjusted for the new reality
REC_OFFS_NORMAL_SIZE:
increase the size from 100 to 300, which means fewer heap allocations.
sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) is now 600 bytes, which
is smaller than the previous 800 bytes.
REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality
rem0rec.h, rem0rec.ic, rem0rec.cc:
the types of various arguments, return values and local variables were changed
to fix numerous integer conversion issues.
enum field_type_t:
introduces the concept of offset types, which replaces the old offset flags.
As in earlier versions, the 2 upper bits are used to store the offset type,
and this enum represents those types.
REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed
get_type(), set_type(), get_value(), combine():
convenience functions for working with offsets and their types
(see the sketch after this list)
rec_offs_base()[0]:
still uses an old scheme with flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL
rec_offs_base()[i]:
these now have type offset_t; the two upper bits contain the type.
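A self-contained sketch of the scheme described above; the enumerator names and values are illustrative assumptions, only the "2 upper bits hold the type" layout is taken from the description:
    #include <cstdint>

    typedef uint16_t offset_t;      /* one record offset */

    /* illustrative: the type tag kept in the two most significant bits */
    enum field_type_t {
        STORED_IN_RECORD = 0 << 14,
        STORED_OFFPAGE   = 1 << 14,
        SQL_NULL         = 2 << 14,
        DEFAULT          = 3 << 14
    };

    static const offset_t VALUE_MASK = 0x3fff;

    inline field_type_t get_type(offset_t o)
    { return static_cast<field_type_t>(o & ~VALUE_MASK); }

    inline offset_t get_value(offset_t o)
    { return static_cast<offset_t>(o & VALUE_MASK); }

    inline offset_t combine(offset_t value, field_type_t type)
    { return static_cast<offset_t>(get_value(value) | type); }

    inline void set_type(offset_t &o, field_type_t type)
    { o = combine(o, type); }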
In commit af5947f433
the function btr_discard_page() is invoking btr_set_min_rec_mark()
with the wrong buf_block_t* object. node_ptr is on merge_block,
not block.
btr_discard_page(): Remove the variables merge_page, page, and
always refer to block->frame or merge_block->frame instead.
Also, limit the scope of node_ptr and avoid duplicated conditions.
btr_set_min_rec_mark(): Add a template parameter, so that the
caller can specify whether the page is supposed to have a left sibling.
Otherwise, the assertion (which was introduced in the same commit)
would fail in btr_discard_page().
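A hedged sketch of the template-parameter idea; the parameter name, the signature, and the page_has_prev() check are assumptions for illustration:
    /* Sketch: 'has_prev' states whether the caller expects the page to
    have a left sibling, so btr_discard_page() can pass the appropriate
    value and the assertion still holds. */
    template<bool has_prev>
    void btr_set_min_rec_mark(rec_t *rec, const buf_block_t &block, mtr_t *mtr)
    {
        ut_ad(block.frame == page_align(rec));
        ut_ad(has_prev == page_has_prev(block.frame));
        /* ... set REC_INFO_MIN_REC_FLAG and write the redo log ... */
    }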
mtr_t::memcpy(): Replaces mlog_write_string(), mlog_log_string().
The buf_block_t is passed as a parameter, so that
mlog_write_initial_log_record_low() can be used instead of
mlog_write_initial_log_record_fast().
fil_space_crypt_t::write_page0(): Remove the fil_space_t* parameter.
Passing buf_block_t helps us avoid calling
mlog_write_initial_log_record_fast() and page_get_page_no(),
and allows us to implement more debug checks, such as
that on ROW_FORMAT=COMPRESSED index pages, only the page header
may be modified by MLOG_MEMSET records.
fseg_n_reserved_pages(): Add a buf_block_t parameter.
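An illustrative sketch of the replacement, assuming a member signature along the lines of mtr_t::memcpy(const buf_block_t&, ulint offset, const void*, ulint len); the exact prototype is an assumption:
    /* before (sketch): the page identity had to be derived from the
    frame pointer for mlog_write_initial_log_record_fast() */
    // mlog_write_string(block->frame + offset, data, len, mtr);

    /* after (sketch): the block descriptor carries the page identity,
    so mlog_write_initial_log_record_low() can be used internally */
    mtr->memcpy(*block, offset, data, len);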
mtr_t::write(): Replaces mlog_write_ulint(), mlog_write_ull().
Optimize away writes if the page contents do not change,
except when a dummy write has been explicitly requested.
Because the member function template takes a block descriptor as a
parameter, it is possible to introduce better consistency checks.
Due to this, the code for handling file-based lists, undo logs
and user transactions was refactored to pass around buf_block_t.
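An illustrative usage sketch; the template form mtr->write<N>(block, ptr, value) follows the description above, and the "dummy write" behaviour mentioned in the comment refers to the explicit request described earlier:
    /* before (sketch) */
    // mlog_write_ulint(page + FIL_PAGE_TYPE, FIL_PAGE_INDEX,
    //                  MLOG_2BYTES, mtr);

    /* after (sketch): the write is optimized away when the two bytes
    already contain FIL_PAGE_INDEX, unless a dummy write has been
    explicitly requested */
    mtr->write<2>(*block, FIL_PAGE_TYPE + block->frame, FIL_PAGE_INDEX);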
btr_set_min_rec_mark(): Write MLOG_1BYTE instead of
MLOG_REC_MIN_MARK or MLOG_COMP_REC_MIN_MARK.
On ROW_FORMAT=COMPRESSED pages, the minimum record flag is not stored
at all. The flag is computed for the uncompressed page by
page_zip_decompress(). Hence, nothing needs to be logged for
ROW_FORMAT=COMPRESSED tables for this operation.
To facilitate crash-upgrade and hot backup from older versions,
we will retain the code to parse and apply the old log record types
MLOG_REC_MIN_MARK and MLOG_COMP_REC_MIN_MARK.
Introduce memcpy_aligned<N>(), memcmp_aligned<N>(), memset_aligned<N>()
and use them for accessing InnoDB page header fields that are known
to be aligned.
MY_ASSUME_ALIGNED(): Wrapper for the GCC/clang __builtin_assume_aligned().
Nothing similar seems to exist in Microsoft Visual Studio, and the
C++20 std::assume_aligned is not available to us yet.
Explicitly specified alignment guarantees allow compilers to generate
faster code on platforms with strict alignment rules, instead of
emitting calls to potentially unaligned memcpy(), memcmp(), or memset().
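For illustration, a hedged example of using such a wrapper on a page header field that is known to be aligned (FIL_PAGE_LSN is at byte offset 16 of a page frame that is aligned to the page size); the source buffer lsn_field is assumed:
    /* sketch: copy the 8-byte FIL_PAGE_LSN field with a known-aligned
    memcpy instead of a plain memcpy() that the compiler must treat as
    potentially unaligned */
    memcpy_aligned<8>(FIL_PAGE_LSN + page, lsn_field, 8);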
Apart from page latches (buf_block_t::lock), mini-transactions
are keeping track of at most one dict_index_t::lock and
fil_space_t::latch at a time, and in a rare case, purge_sys.latch.
Let us introduce interfaces for acquiring an index latch
or a tablespace latch.
In a later version, we may want to introduce mtr_t members
for holding a latched dict_index_t* and fil_space_t*,
and replace the remaining use of mtr_t::m_memo
with std::set<buf_block_t*> or with a map<buf_block_t*,byte*>
pointing to log records.
btr_create(), btr_root_raise_and_insert(): Write a MLOG_MEMSET record
to set FIL_PAGE_PREV,FIL_PAGE_NEXT to FIL_NULL, instead of writing
two MLOG_4BYTES records.
For ROW_FORMAT=COMPRESSED pages, we will not use MLOG_MEMSET
because we want the crash-downgrade to earlier 10.4 releases to succeed.
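A hedged sketch of the change; the helper mtr_t::memset() and its parameter order are assumptions, only the "one 8-byte memset of 0xff instead of two 4-byte writes" idea comes from the description:
    /* before (sketch): two MLOG_4BYTES records */
    // mtr->write<4>(*block, FIL_PAGE_PREV + block->frame, FIL_NULL);
    // mtr->write<4>(*block, FIL_PAGE_NEXT + block->frame, FIL_NULL);

    /* after (sketch): one MLOG_MEMSET record covering both adjacent
    4-byte fields, starting at offset FIL_PAGE_PREV */
    mtr->memset(block, FIL_PAGE_PREV, 8, 0xff);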
mlog_parse_nbytes(): Relax the too strict assertion. There is no problem
with MLOG_MEMSET records that affect the uncompressed header of
ROW_FORMAT=COMPRESSED index pages.
page_rec_write_field(): Remove.
dict_create_index_tree_step(): If the SYS_INDEXES.PAGE does not change,
do not update it in the data dictionary. Typically, all index page numbers
would be unchanged before and after IMPORT TABLESPACE, except if some
secondary indexes were created after loading some data.
btr_root_fseg_adjust_on_import(): Remove the redundant mtr_t* parameter.
Redo logging is disabled during the page adjustments that IMPORT TABLESPACE
is performing.
Except for fil_name_process(), which invokes os_normalize_path(),
the redo log record parser will not modify the redo log records.
Add const qualifiers accordingly.
The assertion that was added in
commit c0c003beb4
to augment the fix of MDEV-20805 turns out to be invalid when
innodb_immediate_scrub_data_uncompressed is enabled.
In this mode, fsp_init_file_page() will be invoked on data pages
that have been freed, causing writes of almost-all-zero pages.
btr_page_free(): Adjust the comment.
buf_flush_init_for_writing(): Disable the assertion with a note
that it should be re-enabled in MDEV-15528.
We will remove the InnoDB background operation of merging buffered
changes to secondary index leaf pages. Changes will only be merged as a
result of an operation that accesses a secondary index leaf page,
such as a SQL statement that performs a lookup via that index,
or is modifying the index. Also ROLLBACK and some background operations,
such as purging the history of committed transactions, or computing
index cardinality statistics, can cause change buffer merge.
Encryption key rotation will not perform change buffer merge.
The motivation of this change is to simplify the I/O logic and to
allow crash recovery to happen in the background (MDEV-14481).
We also hope that this will reduce the number of "mystery" crashes
due to corrupted data. Because change buffer merge will typically
take place as a result of executing SQL statements, there should be
a clearer connection between the crash and the SQL statements that
were executed when the server crashed.
In many cases, a slight performance improvement was observed.
This is joint work with Thirunarayanan Balathandayuthapani
and was tested by Axel Schwenke and Matthias Leich.
The InnoDB monitor counter innodb_ibuf_merge_usec will be removed.
On slow shutdown (innodb_fast_shutdown=0), we will continue to
merge all buffered changes (and purge all undo log history).
Two InnoDB configuration parameters will be changed as follows:
innodb_disable_background_merge: Removed.
This parameter existed only in debug builds.
All change buffer merges will use synchronous reads.
innodb_force_recovery will be changed as follows:
* innodb_force_recovery=4 will be the same as innodb_force_recovery=3
(the change buffer merge cannot be disabled; it can only happen as
a result of an operation that accesses a secondary index leaf page).
The option used to be capable of corrupting secondary index leaf pages.
Now that capability is removed, and innodb_force_recovery=4 becomes 'safe'.
* innodb_force_recovery=5 (which essentially hard-wires
SET GLOBAL TRANSACTION ISOLATION LEVEL READ UNCOMMITTED)
becomes safe to use. Bogus data can be returned to SQL, but
persistent InnoDB data files will not be corrupted further.
* innodb_force_recovery=6 (ignore the redo log files)
will be the only option that can potentially cause
persistent corruption of InnoDB data files.
Code changes:
buf_page_t::ibuf_exist: New flag, to indicate whether buffered
changes exist for a buffer pool page. Pages with pending changes
can be returned by buf_page_get_gen(). Previously, the changes
were always merged inside buf_page_get_gen() if needed.
ibuf_page_exists(const buf_page_t&): Check if buffered changes
exist for an X-latched or read-fixed page.
buf_page_get_gen(): Add the parameter allow_ibuf_merge=false.
All callers that know that they may be accessing a secondary index
leaf page must pass this parameter as allow_ibuf_merge=true,
unless it does not matter for that caller whether all buffered
changes have been applied. Assert that whenever allow_ibuf_merge
holds, the page actually is a leaf page. Attempt change buffer
merge only to secondary B-tree index leaf pages.
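An illustrative caller sketch; the full parameter list of buf_page_get_gen() is abbreviated here and the exact parameter positions are assumptions:
    /* sketch: a caller that may be fetching a secondary index leaf
    page requests a change buffer merge; other parameters elided */
    buf_block_t *block = buf_page_get_gen(
        page_id, zip_size, RW_X_LATCH,
        /* guess= */ nullptr, BUF_GET, __FILE__, __LINE__, mtr, &err,
        /* allow_ibuf_merge= */ true);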
btr_block_get(): Add parameter 'bool merge'.
All callers of btr_block_get() should know whether the page could be
a secondary index leaf page. If it is not, we should avoid consulting
the change buffer bitmap to even consider a merge. This is the main
interface to requesting index pages from the buffer pool.
ibuf_merge_or_delete_for_page(), recv_recover_page(): Replace
buf_page_get_known_nowait() with much simpler logic, because
it is now guaranteed that the block is x-latched or read-fixed.
mlog_init_t::mark_ibuf_exist(): Renamed from mlog_init_t::ibuf_merge().
On crash recovery, we will no longer merge any buffered changes
for the pages that we read into the buffer pool during the last batch
of applying log records.
buf_page_get_gen_known_nowait(), BUF_MAKE_YOUNG, BUF_KEEP_OLD: Remove.
btr_search_guess_on_hash(): Merge buf_page_get_gen_known_nowait()
to its only remaining caller.
buf_page_make_young_if_needed(): Define as an inline function.
Add the parameter buf_pool.
buf_page_peek_if_young(), buf_page_peek_if_too_old(): Add the
parameter buf_pool.
fil_space_validate_for_mtr_commit(): Remove a bogus comment
about background merge of the change buffer.
btr_cur_open_at_rnd_pos_func(), btr_cur_search_to_nth_level_func(),
btr_cur_open_at_index_side_func(): Use narrower data types and scopes.
ibuf_read_merge_pages(): Replaces buf_read_ibuf_merge_pages().
Merge the change buffer by invoking buf_page_get_gen().
btr_page_get_split_rec_to_left(): Assert that in the leftmost leaf page,
if the metadata record exists, index->is_instant() must hold.
The assertion of commit 01f45becd1
could fail during innobase_instant_try().
btr_page_get_split_rec_to_left(): Assert that in the leftmost leaf page,
the metadata record exists if and only if index->is_instant().
page_validate(): Correct the wording of a message.
rec_init_offsets(): Assert that whenever a record is in "instant ALTER"
format, index->is_instant() must hold.
In MDEV-11369 (instant ADD COLUMN) in MariaDB Server 10.3,
we introduced the hidden metadata record that must be the
first record in the clustered index if and only if
index->is_instant() holds.
To catch MDEV-19783, in
commit ed0793e096 and
commit 99dc40d6ac
we added some assertions to find cases where
the metadata record is missing while it should not be, or a
record exists when it should not. Those assertions were invalid
when traversing the PAGE_FREE list. That list can contain anything;
we must only be able to determine the successor and the size of
each garbage record in it.
page_validate(), page_simple_validate_old(), page_simple_validate_new():
Do not invoke page_rec_get_next_const() for traversing the PAGE_FREE
list, but instead use a lower-level accessor that does not attempt to
validate the REC_INFO_MIN_REC_FLAG.
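A conceptual sketch of the traversal; whether page_rec_get_next_low() is the exact accessor used, and the loop termination shown, are assumptions:
    /* sketch: walk the PAGE_FREE list with a low-level 'next' accessor
    that does not validate REC_INFO_MIN_REC_FLAG; only the successor
    pointer and the garbage record size are trusted here */
    const ulint comp = page_is_comp(page);
    for (const rec_t *rec = page_header_get_ptr(page, PAGE_FREE);
         rec != NULL; rec = page_rec_get_next_low(rec, comp)) {
        /* validate only the offset bounds and the record size */
    }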
page_copy_rec_list_end_no_locks(),
page_copy_rec_list_start(), page_delete_rec_list_start():
Add assertions.
btr_page_get_split_rec_to_left(): Remove a redundant return value,
and make the output parameter the return value.
btr_page_get_split_rec_to_right(), btr_page_split_and_insert(): Clean up.
btr_cur_pessimistic_delete(): the code was changed in a way that allows
more REC_INFO_MIN_REC_FLAG assertions to be placed inside btr_set_min_rec_mark().
Without that change, the tests innodb.innodb-table-online,
innodb.temp_table_savepoint and innodb_zip.prefix_index_liftedlimit would fail.
Removed essentially duplicated page_zip_validate() calls,
which failed because of a temporary(!) invariant violation.
That fixed innodb_zip.wl5522_debug_zip and
innodb_zip.prefix_index_liftedlimit.
Until now, InnoDB inefficiently compared the aligned fields
FIL_PAGE_PREV, FIL_PAGE_NEXT to the byte-order-agnostic value FIL_NULL.
This is a backport of 32170f8c6d
from MariaDB Server 10.3.
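For illustration, a sketch of the byte-order-agnostic check, relying on FIL_PAGE_PREV and FIL_PAGE_NEXT being adjacent 4-byte fields at an 8-byte-aligned offset and on FIL_NULL being 0xFFFFFFFF (whose stored form is the same in any byte order); the helper name is illustrative:
    /* sketch: true if either sibling pointer differs from FIL_NULL */
    inline bool page_has_siblings(const page_t *page)
    {
        static_assert(FIL_PAGE_PREV % 8 == 0, "alignment");
        static_assert(FIL_PAGE_NEXT == FIL_PAGE_PREV + 4, "adjacency");
        /* both fields equal FIL_NULL exactly when the combined
        8 bytes are all ones, regardless of byte order */
        return *reinterpret_cast<const uint64_t*>(page + FIL_PAGE_PREV)
            != ~uint64_t(0);
    }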