page_mem_alloc_free(), page_dir_set_n_heap(), page_ptr_set_direction():
Merge with the callers.
page_direction_reset(), page_direction_increment(),
page_zip_dir_insert(), page_zip_write_rec_ext(), page_zip_write_rec():
Add the parameter mtr, and write log.
PageBulk::insert(), PageBulk::finish(): Write log for all changes.
page_cur_rec_insert(), page_cur_insert_rec_write_log(),
page_cur_insert_rec_write_log(): Remove.
page_rec_set_next(), page_header_set_field(), page_header_set_ptr():
Remove. Use lower-level operations with or without logging.
page_zip_dir_add_slot(): Move to the same compilation unit with
its only caller, page_cur_insert_rec_zip().
page_cur_insert_rec_zip(): Mark pieces of code that must be skipped
once this task is completed.
btr_defragment_chunk(): Before starting a mini-transaction that
is writing (a lot), invoke log_free_check(). This should allow
the test innodb.innodb_defrag_concurrent to pass with the
mtr default_mysqld.cnf setting of innodb_log_file_size=10M.
MLOG_BUF_MARGIN: Remove.
No longer write the following redo log records:
MLOG_COMP_LIST_END_DELETE, MLOG_LIST_END_DELETE,
MLOG_COMP_LIST_START_DELETE, MLOG_LIST_START_DELETE,
MLOG_REC_DELETE,MLOG_COMP_REC_DELETE.
Each individual deleted record will be logged separately
using physical log records.
page_dir_slot_set_n_owned(),
page_zip_rec_set_owned(), page_zip_dir_delete(), page_zip_clear_rec():
Add the parameter mtr, and write redo log.
page_dir_slot_set_rec(): Remove. Replaced with lower-level operations
that write redo log when necessary.
page_rec_set_n_owned(): Replaces rec_set_n_owned_old(),
rec_set_n_owned_new().
rec_set_heap_no(): Replaces rec_set_heap_no_old(), rec_set_heap_no_new().
page_mem_free(), page_dir_split_slot(), page_dir_balance_slot():
Add the parameter mtr.
page_dir_set_n_slots(): Merge with the caller page_dir_split_slot().
page_dir_slot_set_rec(): Merge with the callers page_dir_split_slot()
and page_dir_balance_slot().
page_cur_insert_rec_low(), page_cur_insert_rec_zip():
Suppress the logging of lower-level operations.
page_cur_delete_rec_write_log(): Remove.
page_cur_delete_rec(): Do not tolerate mtr=NULL.
rec_convert_dtuple_to_rec_old(), rec_convert_dtuple_to_rec_comp():
Replace rec_set_heap_no_old() and rec_set_heap_no_new() with direct
access that does not involve redo logging.
mtr_t::memcpy(): Do allow non-redo-logged writes to uncompressed pages
of ROW_FORMAT=COMPRESSED pages.
buf_page_io_complete(): Evict the uncompressed page of
a ROW_FORMAT=COMPRESSED page after recovery. Because we no longer
write logical log records for deleting index records, but instead
write physical records that may refer directly to the compressed
page frame of a ROW_FORMAT=COMPRESSED page, and because on recovery
we will only apply the changes to the ROW_FORMAT=COMPRESSED page,
the uncompressed page frame can be stale until page_zip_decompress()
is executed.
recv_parse_or_apply_log_rec_body(): After applying MLOG_ZIP_WRITE_STRING,
ensure that the FIL_PAGE_TYPE of the uncompressed page matches the
compressed page, because buf_flush_init_for_writing() assumes that
field to be valid.
mlog_init_t::mark_ibuf_exist(): Invoke page_zip_decompress(), because
the uncompressed page after buf_page_create() is not necessarily
up to date.
buf_LRU_block_remove_hashed(): Bypass a page_zip_validate() check
during redo log apply.
recv_apply_hashed_log_recs(): Invoke mlog_init.mark_ibuf_exist()
also for the last batch, to ensure that page_zip_decompress() will
be called for freshly initialized pages.
page_create(): Create normal B-tree pages. Callers that create
R-tree pages will set FIL_PAGE_TYPE and reset the split
sequence number afterwards.
The creation of ROW_FORMAT=COMPRESSED pages is unaffected;
they will be logged as compressed page images.
page_create_low(): Take const buf_block_t* as a parameter.
Let the callers invoke buf_block_modify_clock_inc().
btr_cur_upd_rec_sys(): Replaces row_upd_rec_sys_fields() and
implements redo logging.
row_upd_rec_sys_fields_in_recovery(): Remove, and merge to the
only remaining caller btr_cur_parse_update_in_place().
btr_cur_del_mark_set_clust_rec_log(),
btr_cur_del_mark_set_sec_rec_log(),
btr_cur_set_deleted_flag_for_ibuf():
Remove, and replace with btr_rec_set_deleted<bool>().
page_zip_rec_set_deleted(): Add the parameter mtr, and write a
MLOG_ZIP_WRITE_STRING record to the log.
Log the low-level operations for ROW_FORMAT=COMPRESSED index pages
using a new record, MLOG_ZIP_WRITE_STRING. We will still use
MLOG_1BYTE,..., MLOG_8BYTES or MLOG_WRITE_STRING for operations
on other than index pages (such as the page allocation bitmap pages).
We will stop writing the record MLOG_ZIP_PAGE_COMPRESS later, after
replacing all MLOG_REC_ and MLOG_COMP_REC_ that update index pages.
Log page reorganize as a series of insert operations.
This will make the redo log volume proportional to the page payload size.
btr_page_reorganize_low(): Add template <bool recovery=false>
btr_page_reorganize_block(): Remove the parameter 'bool recovery'
Instead of writing the high-level redo log records
MLOG_LIST_END_COPY_CREATED, MLOG_COMP_LIST_END_COPY_CREATED
write log for each individual insert of a record.
page_copy_rec_list_end_to_created_page(): Remove.
This will improve the fill factor of some pages.
Adjust some tests accordingly.
PageBulk::init(), PageBulk::finish(): Avoid setting bogus limits
to PAGE_HEAP_TOP and PAGE_N_DIR_SLOTS. Avoid accessor functions
that would enforce these limits before the correct ones are set
at the end of PageBulk::finish().
page_zip_reorganize(): Restore the page on failure.
In callers, omit now-redundant calls to page_zip_decompress().
btr_page_reorganize_low(): Define in static scope only, and
remove the z_level parameter. Assert that ROW_FORMAT is not COMPRESSED.
btr_page_reorganize_block(), btr_page_reorganize(): Invoke
page_zip_reorganize() for ROW_FORMAT=COMPRESSED.
page_zip_compress_write_log_no_data(): Remove.
We no longer write the MLOG_ZIP_PAGE_COMPRESS_NO_DATA record.
Instead, we will write MLOG_ZIP_PAGE_COMPRESS records.
Our benchmarking efforts indicate that the reasons for splitting the
buf_pool in commit c18084f71b
have mostly gone away, possibly as a result of
mysql/mysql-server@ce6109ebfd
or similar work.
Only in one write-heavy benchmark where the working set size is
ten times the buffer pool size, the buf_pool->mutex would be
less contended with 4 buffer pool instances than with 1 instance,
in buf_page_io_complete(). That contention could be alleviated
further by making more use of std::atomic and by splitting
buf_pool_t::mutex further (MDEV-15053).
We will deprecate and ignore the following parameters:
innodb_buffer_pool_instances
innodb_page_cleaners
There will be only one buffer pool and one page cleaner task.
In a number of INFORMATION_SCHEMA views, columns that indicated
the buffer pool instance will be removed:
information_schema.innodb_buffer_page.pool_id
information_schema.innodb_buffer_page_lru.pool_id
information_schema.innodb_buffer_pool_stats.pool_id
information_schema.innodb_cmpmem.buffer_pool_instance
information_schema.innodb_cmpmem_reset.buffer_pool_instance
During native table rebuild or index creation, InnoDB used to skip
redo logging and write MLOG_INDEX_LOAD records to inform crash recovery
and Mariabackup of the gaps in redo log. This is fragile and prohibits
some optimizations, such as skipping the doublewrite buffer for
newly (re)initialized pages (MDEV-19738).
row_merge_write_redo(): Remove. We do not write MLOG_INDEX_LOAD
records any more. Instead, we write full redo log.
FlushObserver: Remove.
fseg_free_page_func(): Remove the parameter log. Redo logging
cannot be disabled.
fil_space_t::redo_skipped_count: Remove.
We cannot remove buf_block_t::skip_flush_check, because PageBulk
will temporarily generate invalid B-tree pages in the buffer pool.
Let us define page_id_t as a thin wrapper of uint64_t so that
the comparison operators can be simplified. This is a follow-up
to the original commit 14be814380.
The comparison operator for recv_sys.pages.emplace() turned out to be
a busy spot in a recovery benchmark. That data structure was introduced
in MDEV-19586 in commit 177a571e01.
ut_align_down(): Preserve the const qualifier. Use C++ casts.
ha_delete_hash_node(): Correct an assertion expression.
fil_page_get_type(): Perform an assumed-aligned read.
page_align(): Preserve the const qualifier. Assume (some) alignment.
page_get_max_trx_id(): Check the index page type.
page_header_get_field(): Perform an assumed-aligned read.
page_get_autoinc(): Perform an assumed-aligned read.
page_dir_get_nth_slot(): Perform an assumed-aligned read.
Preserve the const qualifier.
- n_ext value may be less than dtuple_get_n_ext(dtuple) when PK is being
updated and new record inherits the externally stored fields from
delete mark old record.
Use bit-fields for some mtr_t members to improve locality of reference.
Because mtr_t is never shared between threads, there are no considerations
regarding concurrent access.
Since commit 5e62b6a5e0 (MDEV-16264),
purge_sys_t::stop() no longer waited for all purge activity to stop.
This caused problems on FLUSH TABLES...FOR EXPORT because of
purge running concurrently with the buffer pool flush.
The assertion at the end of buf_flush_dirty_pages() could fail.
The, implemented by Vladislav Vaintroub, aims to eliminate race
conditions when stopping or resuming purge:
waitable_task::disable(): Wait for the task to complete, then replace
the task callback function with noop.
waitable_task::enable(): Restore the original task callback function
after disable().
purge_sys_t::stop(): Invoke purge_coordinator_task.disable().
purge_sys_t::resume(): Invoke purge_coordinator_task.enable().
purge_sys_t::running(): Add const qualifier, and clarify the comment.
The purge coordinator task will remain active as long as any purge
worker task is active.
purge_worker_callback(): Assert purge_sys.running().
srv_purge_wakeup(): Merge with the only caller purge_sys_t::resume().
purge_coordinator_task: Use static linkage.
Release memory as soon as redo log records are processed.
Because the memory allocation and deallocation of parsed redo log
records must be protected by recv_sys.mutex, it is better to avoid
using a std::atomic field for bookkeeping.
buf_page_t::access_time: Keep track of the recv_sys.pages record
allocations. The most significant 16 bits will count allocated
blocks (which were previously counted by buf_page_t::buf_fix_count
in the debug version), and the least significant 16 bits indicate
the number of allocated bytes in the block (which was previously
managed in buf_block_t::modify_clock), which must be a positive
number, up to innodb_page_size. The byte offset 65536 is represented
as the value 0.
recv_recover_page(): Let the caller erase the log.
recv_validate_tablespace(): Acquire recv_sys_t::mutex.
InnoDB crash recovery used a special type of mem_heap_t that
allocates backing store from the buffer pool. That incurred
a significant overhead, leading to underutilization of memory,
and limiting the maximum contiguous allocated size of a log record.
recv_sys_t::blocks: A linked list of buf_block_t that are allocated
by buf_block_alloc() for redo log records. Replaces recv_sys_t::heap.
We repurpose buf_block_t::unzip_LRU for linking the elements.
recv_sys_t::max_log_blocks: Renamed from recv_n_pool_free_frames.
recv_sys_t::max_blocks(): Accessor for max_log_blocks.
recv_sys_t::alloc(): Allocate memory from the current recv_sys_t::blocks
element, or allocate another block. In debug builds, various free()
member functions must be invoked, because we repurpose
buf_page_t::buf_fix_count for tracking allocations.
recv_sys_t::free_corrupted_page(): Renamed from recv_recover_corrupt_page()
recv_sys_t::is_memory_exhausted(): Renamed from recv_sys_heap_check()
recv_sys_t::pages and its elements are allocated directly by the
system memory allocator.
recv_parse_log_recs(): Remove the parameter available_memory.
We rename some variables 'store_to_hash' to 'store', because
recv_sys.pages is not actually a hash table.
This is joint work with Thirunarayanan Balathandayuthapani.
class log_file_t: more or less sane RAII wrapper around redo log file
descriptor and its path.
This change is motivated by the need of using that log_file_t somewhere else.
os_file_flush_data_func(): fix builds on POSIX OSs where fdatasync()
is not avaiable
log_t::files::flush_data_only(): rename from fdatasync()
log_t::files::fsync(): removed and replaced with flush_data_only().
It will flush everything we need for using redo log files.
I found that memcpy_aligned was used incorrectly at redo log and decided to put
assertions in aligned functions. And found even more incorrect cases.
Given the amount discovered of bugs, I left assertions to prevent future bugs.
my_assume_aligned(): instead of MY_ASSUME_ALIGNED macro
MySQL 5.7.29 includes the following fix:
Bug #30287668 INNODB: A LONG SEMAPHORE WAIT
mysql/mysql-server@5cdbb22b51
There is no test case. It seems that the problem could occur when
a spatial index is large and peculiar enough so that multiple R-tree
leaf pages will have the exactly same maximum bounding rectangle (MBR).
The commit message suggests that the hang can occur when R-tree
non-leaf pages are being merged, which should only be possible
during transaction rollback or the purge of transaction history,
when the R-tree index is at least 2 levels high and very many records
are being deleted. The message says that a comparison result that two
spatial index node pointer records are equal will cause an infinite loop
in rtr_page_copy_rec_list_end_no_locks(). Hence, we must include the
child page number in the comparison to be consistent with
mysql/mysql-server@2e11fe0e15.
We fix this bug in a simpler way, involving fewer code changes.
cmp_rec_rec(): Renamed from cmp_rec_rec_with_match().
Assert that rec2 always resides in an index page.
Treat non-leaf spatial index pages specially.
Now that we will be invoking dtuple_get_n_ext() instead of
letting btr_push_update_extern_fields() update an already
calculated value, it is unnecessary to calculate the n_ext
upfront.
row_rec_to_index_entry(), row_rec_to_index_entry_low():
Remove the output parameter n_ext.
During update, rollback, or MVCC read, we may miscalculate
the number of off-page columns, and thus the size of the
clustered index record. The function btr_push_update_extern_fields()
is mostly redundant, because the off-page columns would also be
moved by row_upd_index_replace_new_col_val(), which is invoked
via row_upd_index_replace_new_col_vals().
btr_push_update_extern_fields(): Remove.
This is based on
mysql/mysql-server@1fa475b85d
which refines a fix for a recovery bug fix
mysql/mysql-server@ce0a1e85e2
in MySQL 5.7.5.
No test case was provided by Oracle.
Some of the changed code is being covered by the existing test
innodb.blob-crash.