This is based on commit 20ae4816bb
with some adjustments for MDEV-12353.
row_ins_sec_index_entry_low(): If a separate mini-transaction is
needed to adjust the minimum bounding rectangle (MBR) in the parent
page, we must disable redo logging if the table is a temporary table.
For temporary tables, no log is supposed to be written, because
the temporary tablespace will be reinitialized on server restart.
rtr_update_mbr_field(), rtr_merge_and_update_mbr(): Changed the return
type to void and removed unreachable code. In older versions, these
used to return a different value for temporary tables.
page_id_t: Add constexpr to most member functions.
mtr_t::log_write(): Catch log writes to invalid tablespaces
so that the test case would crash without the fix to
row_ins_sec_index_entry_low().
In the rewrite of MDEV-8139 (based on MDEV-15528), we introduced a
wrong assumption that any persistent tablespace that is not an .ibd
file is the system tablespace. This assumption is broken when
innodb_undo_tablespaces (files undo001, undo002, ...) are being used.
By default, we have innodb_undo_tablespaces=0 (the persistent undo
log is being stored in the system tablespace).
In MDEV-15528 and MDEV-8139 we rewrote the page scrubbing logic
so that it will follow the tried-and-true write-ahead logging
protocol, first writing FREE_PAGE records and then in the page
flushing, zerofilling or hole-punching freed pages.
Unfortunately, the implementation included a wrong assumption that
that anything that is not in an .ibd file must be the system tablespace.
This wrong assumption would cause overwrites of valid data pages in
the system tablespace.
mtr_t::m_freed_in_system_tablespace: Remove.
mtr_t::m_freed_space: The tablespace associated with m_freed_pages.
buf_page_free(): Take the tablespace and page number as a parameter,
instead of taking a page identifier.
mtr_t::init(): Correct the debug assertion that was added in
commit c89366866b (MDEV-22970).
When the original failure scenario involves the system tablespace,
we could have m_user_space==nullptr. Let us compare m_user_space_id
instead.
when scrubbing is enabled
buf_read_recv_pages(): Ignore the page to read if it is already
present in the freed ranges.
store_freed_or_init_rec(): Store the ranges only if scrubbing
is enabled or page compressed tablespace.
recv_init_crash_recovery_space(): Add the freed range only when
scrubbing or page compressed tablespace.
range_set::contains(): Search the value is present in ranges.
range_set::remove_if_exists(): Remove the value if exist in ranges.
mtr_t::init(): Handles the scenario that mini-transaction may allocate
a page that had just been freed.
recv_sys_t::parse(): Note down the FREE and INIT redo log irrespective
of STORE value.
Removed innodb_tablespaces_scrubbing from test case
mtr_t::m_freed_pages: Renamed from m_freed_ranges and made it as
pointer indirection.
mtr_t::add_freed_offset(): Allocates m_freed_pages.
mtr_t:clear_freed_ranges(): Removed.
mtr_t::init(): Added debug assertion to check whether m_freed_pages
is not yet initialized.
btr_page_alloc_low(): Remove #ifdef UNIV_DEBUG_SCRUBBING.
mtr_t::commit(): Delete m_freed_pages, reset m_trim_pages and
m_freed_in_system_tablespace.
fil_space_t::clear_freed_ranges(): Added a comment to explain how
undo log tablespaces uses it.
fil_space_t::freed_ranges: Store ranges of freed page numbers.
fil_space_t::last_freed_lsn: Store the most recent LSN of
freeing a page.
fil_space_t::freed_mutex: Protects freed_ranges, last_freed_lsn.
fil_space_create(): Initialize the freed_range mutex.
fil_space_free_low(): Frees the freed_range mutex.
range_set: Ranges of page numbers.
buf_page_create(): Removes the page from freed_ranges when page
is being reused.
btr_free_root(): Remove the PAGE_INDEX_ID invalidation. Because
btr_free_root() and dict_drop_index_tree() are executed in
the same atomic mini-transaction, there is no need to
invalidate the root page.
buf_release_freed_page(): Split from buf_flush_freed_page().
Skip any I/O
buf_flush_freed_pages(): Get the freed ranges from tablespace and
Write punch-hole or zeroes of the freed ranges.
buf_flush_try_neighbors(): Handles the flushing of freed ranges.
mtr_t::freed_pages: Variable to store the list of freed pages.
mtr_t::add_freed_pages(): To add freed pages.
mtr_t::clear_freed_pages(): To clear the freed pages.
mtr_t::m_freed_in_system_tablespace: Variable to indicate whether page has
been freed in system tablespace.
mtr_t::m_trim_pages: Variable to indicate whether the space has been trimmed.
mtr_t::commit(): Add the freed page and update the last freed lsn
in the tablespace and clear the tablespace freed range if space is
trimmed.
file_name_t::freed_pages: Store the freed pages during recovery.
file_name_t::add_freed_page(), file_name_t::remove_freed_page(): To
add and remove freed page during recovery.
store_freed_or_init_rec(): Store or remove the freed pages while
encountering FREE_PAGE or INIT_PAGE redo log record.
recv_init_crash_recovery_spaces(): Add the freed page encountered
during recovery to respective tablespace.
User-visible changes: The INFORMATION_SCHEMA views INNODB_BUFFER_PAGE
and INNODB_BUFFER_PAGE_LRU will report a dummy value FLUSH_TYPE=0
and will no longer report the PAGE_STATE value READY_FOR_USE.
We will remove some fields from buf_page_t and move much code to
member functions of buf_pool_t and buf_page_t, so that the access
rules of data members can be enforced consistently.
Evicting or adding pages in buf_pool.LRU will remain covered by
buf_pool.mutex.
Evicting or adding pages in buf_pool.page_hash will remain
covered by both buf_pool.mutex and the buf_pool.page_hash X-latch.
After this fix, buf_pool.page_hash lookups can entirely
avoid acquiring buf_pool.mutex, only relying on
buf_pool.hash_lock_get() S-latch.
Similarly, buf_flush_check_neighbors() can will rely solely on
buf_pool.mutex, no buf_pool.page_hash latch at all.
The buf_pool.mutex is rather contended in I/O heavy benchmarks,
especially when the workload does not fit in the buffer pool.
The first attempt to alleviate the contention was the
buf_pool_t::mutex split in
commit 4ed7082eef
which introduced buf_block_t::mutex, which we are now removing.
Later, multiple instances of buf_pool_t were introduced
in commit c18084f71b
and recently removed by us in
commit 1a6f708ec5 (MDEV-15058).
UNIV_BUF_DEBUG: Remove. This option to enable some buffer pool
related debugging in otherwise non-debug builds has not been used
for years. Instead, we have been using UNIV_DEBUG, which is enabled
in CMAKE_BUILD_TYPE=Debug.
buf_block_t::mutex, buf_pool_t::zip_mutex: Remove. We can mainly rely on
std::atomic and the buf_pool.page_hash latches, and in some cases
depend on buf_pool.mutex or buf_pool.flush_list_mutex just like before.
We must always release buf_block_t::lock before invoking
unfix() or io_unfix(), to prevent a glitch where a block that was
added to the buf_pool.free list would apper X-latched. See
commit c5883debd6 how this glitch
was finally caught in a debug environment.
We move some buf_pool_t::page_hash specific code from the
ha and hash modules to buf_pool, for improved readability.
buf_pool_t::close(): Assert that all blocks are clean, except
on aborted startup or crash-like shutdown.
buf_pool_t::validate(): No longer attempt to validate
n_flush[] against the number of BUF_IO_WRITE fixed blocks,
because buf_page_t::flush_type no longer exists.
buf_pool_t::watch_set(): Replaces buf_pool_watch_set().
Reduce mutex contention by separating the buf_pool.watch[]
allocation and the insert into buf_pool.page_hash.
buf_pool_t::page_hash_lock<bool exclusive>(): Acquire a
buf_pool.page_hash latch.
Replaces and extends buf_page_hash_lock_s_confirm()
and buf_page_hash_lock_x_confirm().
buf_pool_t::READ_AHEAD_PAGES: Renamed from BUF_READ_AHEAD_PAGES.
buf_pool_t::curr_size, old_size, read_ahead_area, n_pend_reads:
Use Atomic_counter.
buf_pool_t::running_out(): Replaces buf_LRU_buf_pool_running_out().
buf_pool_t::LRU_remove(): Remove a block from the LRU list
and return its predecessor. Incorporates buf_LRU_adjust_hp(),
which was removed.
buf_page_get_gen(): Remove a redundant call of fsp_is_system_temporary(),
for mode == BUF_GET_IF_IN_POOL_OR_WATCH, which is only used by
BTR_DELETE_OP (purge), which is never invoked on temporary tables.
buf_free_from_unzip_LRU_list_batch(): Avoid redundant assignments.
buf_LRU_free_from_unzip_LRU_list(): Simplify the loop condition.
buf_LRU_free_page(): Clarify the function comment.
buf_flush_check_neighbor(), buf_flush_check_neighbors():
Rewrite the construction of the page hash range. We will hold
the buf_pool.mutex for up to buf_pool.read_ahead_area (at most 64)
consecutive lookups of buf_pool.page_hash.
buf_flush_page_and_try_neighbors(): Remove.
Merge to its only callers, and remove redundant operations in
buf_flush_LRU_list_batch().
buf_read_ahead_random(), buf_read_ahead_linear(): Rewrite.
Do not acquire buf_pool.mutex, and iterate directly with page_id_t.
ut_2_power_up(): Remove. my_round_up_to_next_power() is inlined
and avoids any loops.
fil_page_get_prev(), fil_page_get_next(), fil_addr_is_null(): Remove.
buf_flush_page(): Add a fil_space_t* parameter. Minimize the
buf_pool.mutex hold time. buf_pool.n_flush[] is no longer updated
atomically with the io_fix, and we will protect most buf_block_t
fields with buf_block_t::lock. The function
buf_flush_write_block_low() is removed and merged here.
buf_page_init_for_read(): Use static linkage. Initialize the newly
allocated block and acquire the exclusive buf_block_t::lock while not
holding any mutex.
IORequest::IORequest(): Remove the body. We only need to invoke
set_punch_hole() in buf_flush_page() and nowhere else.
buf_page_t::flush_type: Remove. Replaced by IORequest::flush_type.
This field is only used during a fil_io() call.
That function already takes IORequest as a parameter, so we had
better introduce for the rarely changing field.
buf_block_t::init(): Replaces buf_page_init().
buf_page_t::init(): Replaces buf_page_init_low().
buf_block_t::initialise(): Initialise many fields, but
keep the buf_page_t::state(). Both buf_pool_t::validate() and
buf_page_optimistic_get() requires that buf_page_t::in_file()
be protected atomically with buf_page_t::in_page_hash
and buf_page_t::in_LRU_list.
buf_page_optimistic_get(): Now that buf_block_t::mutex
no longer exists, we must check buf_page_t::io_fix()
after acquiring the buf_pool.page_hash lock, to detect
whether buf_page_init_for_read() has been initiated.
We will also check the io_fix() before acquiring hash_lock
in order to avoid unnecessary computation.
The field buf_block_t::modify_clock (protected by buf_block_t::lock)
allows buf_page_optimistic_get() to validate the block.
buf_page_t::real_size: Remove. It was only used while flushing
pages of page_compressed tables.
buf_page_encrypt(): Add an output parameter that allows us ot eliminate
buf_page_t::real_size. Replace a condition with debug assertion.
buf_page_should_punch_hole(): Remove.
buf_dblwr_t::add_to_batch(): Replaces buf_dblwr_add_to_batch().
Add the parameter size (to replace buf_page_t::real_size).
buf_dblwr_t::write_single_page(): Replaces buf_dblwr_write_single_page().
Add the parameter size (to replace buf_page_t::real_size).
fil_system_t::detach(): Replaces fil_space_detach().
Ensure that fil_validate() will not be violated even if
fil_system.mutex is released and reacquired.
fil_node_t::complete_io(): Renamed from fil_node_complete_io().
fil_node_t::close_to_free(): Replaces fil_node_close_to_free().
Avoid invoking fil_node_t::close() because fil_system.n_open
has already been decremented in fil_space_t::detach().
BUF_BLOCK_READY_FOR_USE: Remove. Directly use BUF_BLOCK_MEMORY.
BUF_BLOCK_ZIP_DIRTY: Remove. Directly use BUF_BLOCK_ZIP_PAGE,
and distinguish dirty pages by buf_page_t::oldest_modification().
BUF_BLOCK_POOL_WATCH: Remove. Use BUF_BLOCK_NOT_USED instead.
This state was only being used for buf_page_t that are in
buf_pool.watch.
buf_pool_t::watch[]: Remove pointer indirection.
buf_page_t::in_flush_list: Remove. It was set if and only if
buf_page_t::oldest_modification() is nonzero.
buf_page_decrypt_after_read(), buf_corrupt_page_release(),
buf_page_check_corrupt(): Change the const fil_space_t* parameter
to const fil_node_t& so that we can report the correct file name.
buf_page_monitor(): Declare as an ATTRIBUTE_COLD global function.
buf_page_io_complete(): Split to buf_page_read_complete() and
buf_page_write_complete().
buf_dblwr_t::in_use: Remove.
buf_dblwr_t::buf_block_array: Add IORequest::flush_t.
buf_dblwr_sync_datafiles(): Remove. It was a useless wrapper of
os_aio_wait_until_no_pending_writes().
buf_flush_write_complete(): Declare static, not global.
Add the parameter IORequest::flush_t.
buf_flush_freed_page(): Simplify the code.
recv_sys_t::flush_lru: Renamed from flush_type and changed to bool.
fil_read(), fil_write(): Replaced with direct use of fil_io().
fil_buffering_disabled(): Remove. Check srv_file_flush_method directly.
fil_mutex_enter_and_prepare_for_io(): Return the resolved
fil_space_t* to avoid a duplicated lookup in the caller.
fil_report_invalid_page_access(): Clean up the parameters.
fil_io(): Return fil_io_t, which comprises fil_node_t and error code.
Always invoke fil_space_t::acquire_for_io() and let either the
sync=true caller or fil_aio_callback() invoke
fil_space_t::release_for_io().
fil_aio_callback(): Rewrite to replace buf_page_io_complete().
fil_check_pending_operations(): Remove a parameter, and remove some
redundant lookups.
fil_node_close_to_free(): Wait for n_pending==0. Because we no longer
do an extra lookup of the tablespace between fil_io() and the
completion of the operation, we must give fil_node_t::complete_io() a
chance to decrement the counter.
fil_close_tablespace(): Remove unused parameter trx, and document
that this is only invoked during the error handling of IMPORT TABLESPACE.
row_import_discard_changes(): Merged with the only caller,
row_import_cleanup(). Do not lock up the data dictionary while
invoking fil_close_tablespace().
logs_empty_and_mark_files_at_shutdown(): Do not invoke
fil_close_all_files(), to avoid a !needs_flush assertion failure
on fil_node_t::close().
innodb_shutdown(): Invoke os_aio_free() before fil_close_all_files().
fil_close_all_files(): Invoke fil_flush_file_spaces()
to ensure proper durability.
thread_pool::unbind(): Fix a crash that would occur on Windows
after srv_thread_pool->disable_aio() and os_file_close().
This fix was submitted by Vladislav Vaintroub.
Thanks to Matthias Leich and Axel Schwenke for extensive testing,
Vladislav Vaintroub for helpful comments, and Eugene Kosov for a review.
The template parameter mtr_t::OPT refers to optional, not optimized.
Also the default parameter mtr_t::NORMAL refers to optimized writes.
The name MAYBE_NOP would be more descriptive, conveying the idea
that a write to a durable page might not actually have any effect.
In MDEV-12353, the calls to mtr_t::memo_modify_page()
were accidentally removed along with
mlog_open_and_write_index() and its callers.
Let us resurrect the function to enable better debug checks.
mtr_t::flag_modified(): Renamed from mtr_t::set_modified()
and made private.
mtr_t::set_modified(): Take const buf_block_t& as a parameter.
In several mtr_t member functions, replace const buf_page_t&
parameters with const buf_block_t&, so that we can pass the
parameter to set_modified().
mtr_t::modify(): Add a MTR_MEMO_MODIFY entry for a block that
is guaranteed to be modified in the mini-transaction.
The -Wconversion in GCC seems to be stricter than in clang.
GCC at least since version 4.4.7 issues truncation warnings for
assignments to bitfields, while clang 10 appears to only issue
warnings when the sizes in bytes rounded to the nearest integer
powers of 2 are different.
Before GCC 10.0.0, -Wconversion required more casts and would not
allow some operations, such as x<<=1 or x+=1 on a data type that
is narrower than int.
GCC 5 (but not GCC 4, GCC 6, or any later version) is complaining
about x|=y even when x and y are compatible types that are narrower
than int. Hence, we must rewrite some x|=y as
x=static_cast<byte>(x|y) or similar, or we must disable -Wconversion.
In GCC 6 and later, the warning for assigning wider to bitfields
that are narrower than 8, 16, or 32 bits can be suppressed by
applying a bitwise & with the exact bitmask of the bitfield.
For older GCC, we must disable -Wconversion for GCC 4 or 5 in such
cases.
The bitwise negation operator appears to promote short integers
to a wider type, and hence we must add explicit truncation casts
around them. Microsoft Visual C does not allow a static_cast to
truncate a constant, such as static_cast<byte>(1) truncating int.
Hence, we will use the constructor-style cast byte(~1) for such cases.
This has been tested at least with GCC 4.8.5, 5.4.0, 7.4.0, 9.2.1, 10.0.0,
clang 9.0.1, 10.0.0, and MSVC 14.22.27905 (Microsoft Visual Studio 2019)
on 64-bit and 32-bit targets (IA-32, AMD64, POWER 8, POWER 9, ARMv8).
When a InnoDB data file page is freed, its contents becomes garbage,
and any storage allocated in the data file is wasted. During flushing,
InnoDB initializes the page with zeros if scrubbing is enabled. If the
tablespace is compressed then InnoDB should punch a hole else ignore the
flushing of the freed page.
buf_page_t:
- Replaced the variable file_page_was_freed, init_on_flush in buf_page_t
with status enum variable.
- Changed all debug assert of file_page_was_freed to DBUG_ASSERT
of buf_page_t::status
Removed buf_page_set_file_page_was_freed(),
buf_page_reset_file_page_was_freed().
buf_page_free(): Newly added function which takes X-lock on the page
before marking the status as FREED. So that InnoDB flush handler can
avoid concurrent flush of the freed page. Also while flushing the page,
InnoDB make sure that redo log which does freeing of the page also written
to the disk. Currently, this function only marks the page as FREED if
it is in buffer pool
buf_flush_freed_page(): Newly added function which initializes zeros
asynchorously if innodb_immediate_scrub_data_uncompressed is enabled.
Punch a hole to the file synchorously if page_compressed is enabled.
Reset the io_fix to NORMAL. Release the block from flush list and
associated mutex before writing zeros or punch a hole to the file.
buf_flush_page(): Removed the unnecessary usage of temporary
variable "flush"
fil_io(): Introduce new parameter called punch_hole. It allows fil_io()
to punch the hole to the file for the given offset.
buf_page_create(): Let the callers assign buf_page_t::status.
Every caller should eventually invoke mtr_t::init().
fsp_page_create(): Remove the unused mtr_t parameter.
In all other callers of buf_page_create() except fsp_page_create(),
before invoking mtr_t::init(), invoke
mtr_t::sx_latch_at_savepoint() or mtr_t::x_latch_at_savepoint().
mtr_t::init(): Initialize buf_page_t::status also for the temporary
tablespace (when redo logging is disabled), to avoid assertion failures.
For undo log truncation, commit 055a3334ad
repurposed the MLOG_FILE_CREATE2 record with a nonzero page size
to indicate that an undo tablespace will be shrunk in size.
In commit 7ae21b18a6 the
MLOG_FILE_CREATE2 record was replaced by a FILE_CREATE record.
Now that the redo log encoding was changed, there is no actual need
to write a file name in the log record; it suffices to write the
page identifier of the first page that is not part of the file.
This TRIM_PAGES record could allow us to shrink any data files in the
future. For now, it will be limited to undo tablespaces.
mtr_t::log_file_op(): Remove the parameter first_page_no, because
it would always be 0 for file operations.
mtr_t::trim_pages(): Replaces fil_truncate_log().
mtr_t::log_write(): Avoid same_page encoding if !bpage&&!m_last.
fil_op_replay_rename(): Remove the constant parameter first_page_no=0.
This is a follow-up to commit 84e3f9ce84
that introduced the EXTENDED log record of UNDO_APPEND subtype.
mtr_t::undo_append(): Accurately enforce the mtr_buf_t::MAX_DATA_SIZE
limit. Also, replace mtr_buf_t::push() with simpler code, to append 1 byte
to the log.
log_phys_t::undo_append(): Return whether the page was found to
be in an inconsistent state.
log_phys_t::apply(): If corruption was noticed, stop applying log
unless innodb_force_recovery is set.
mrec_ext_t: Introduce DELETE_ROW_FORMAT_REDUNDANT,
DELETE_ROW_FORMAT_DYNAMIC.
mtr_t::page_delete(): Write DELETE_ROW_FORMAT_REDUNDANT or
DELETE_ROW_FORMAT_DYNAMIC log records. We log the byte offset
of the preceding record, so that on recovery we can easily
find everything to update. For DELETE_ROW_FORMAT_DYNAMIC,
we must also write the header and data size of the record.
We will retain the physical logging for ROW_FORMAT=COMPRESSED pages.
page_zip_dir_balance_slot(): Renamed from page_dir_balance_slot(),
and specialized for ROW_FORMAT=COMPRESSED only.
page_rec_set_n_owned(), page_dir_slot_set_n_owned(),
page_dir_balance_slot(): New variants that do not write any log.
page_mem_free(): Take data_size, extra_size as parameters.
Always zerofill the record payload.
page_cur_delete_rec(): For other than ROW_FORMAT=COMPRESSED,
only write log by mtr_t::page_delete().
We introduce an EXTENDED log record for appending an undo log record
to an undo log page. This is equivalent to the MLOG_UNDO_INSERT record
that was removed in commit f802c989ec,
only using more compact encoding.
mtr_t::log_write(): Fix a bug that affects longer log
record writes in the !same_page && !have_offset case.
Similar code is already implemented for the have_offset code path.
The bug was unobservable before we started to write longer
EXTENDED records. All !have_offset records (FREE_PAGE, INIT_PAGE,
EXTENDED) that were written so far are short, and we never write
RESERVED or OPTION records.
mtr_t::undo_append(): Write an UNDO_APPEND record.
log_phys_t::undo_append(): Apply an UNDO_APPEND record.
trx_undo_page_set_next_prev_and_add(),
trx_undo_page_report_modify(),
trx_undo_page_report_rename():
Invoke mtr_t::undo_append() instead of emitting WRITE records.
We introduce an EXTENDED log record for initializing an undo log page.
The size of the record will be 2 bytes plus the optional page identifier.
The entire undo page will be initialized, except the space that is
already reserved for TRX_UNDO_SEG_HDR in trx_undo_seg_create().
mtr_t::undo_create(): Write the UNDO_INIT record.
trx_undo_page_init(): Initialize the undo page corresponding to the
UNDO_INIT record. Unlike the former MLOG_UNDO_INIT record, we will
initialize almost the entire page, including initializing the
TRX_UNDO_PAGE_NODE to an empty list node, so that the subsequent call
to flst_init() will avoid writing log for the undo page.
We plan use the redo log record main type code 0x20 for
InnoDB specific index page operations.
mrec_type_t: Rename INIT_INDEX_PAGE to EXTENDED.
mrec_ext_t: The EXTENDED subtypes.
This is a non-functional change: the redo log record encoding
that was introduced in commit 7ae21b18a6
is not affected.
log_t::FORMAT_10_5: physical redo log format tag
log_phys_t: Buffered records in the physical format.
The log record bytes will follow the last data field,
making use of alignment padding that would otherwise be wasted.
If there are multiple records for the same page, also those
may be appended to an existing log_phys_t object if the memory
is available.
In the physical format, the first byte of a record identifies the
record and its length (up to 15 bytes). For longer records, the
immediately following bytes will encode the remaining length
in a variable-length encoding. Usually, a variable-length-encoded
page identifier will follow, followed by optional payload, whose
length is included in the initially encoded total record length.
When a mini-transaction is updating multiple fields in a page,
it can avoid repeating the tablespace identifier and page number
by setting the same_page flag (most significant bit) in the first
byte of the log record. The byte offset of the record will be
relative to where the previous record for that page ended.
Until MDEV-14425 introduces a separate file-level log for
redo log checkpoints and file operations, we will write the
file-level records in the page-level redo log file.
The record FILE_CHECKPOINT (which replaces MLOG_CHECKPOINT)
will be removed in MDEV-14425, and one sequential scan of the
page recovery log will suffice.
Compared to MLOG_FILE_CREATE2, FILE_CREATE will not include any flags.
If the information is needed, it can be parsed from WRITE records that
modify FSP_SPACE_FLAGS.
MLOG_ZIP_WRITE_STRING: Remove. The record was only introduced temporarily
as part of this work, before being replaced with WRITE (along with
MLOG_WRITE_STRING, MLOG_1BYTE, MLOG_nBYTES).
mtr_buf_t::empty(): Check if the buffer is empty.
mtr_t::m_n_log_recs: Remove. It suffices to check if m_log is empty.
mtr_t::m_last, mtr_t::m_last_offset: End of the latest m_log record,
for the same_page encoding.
page_recv_t::last_offset: Reflects mtr_t::m_last_offset.
Valid values for last_offset during recovery should be 0 or above 8.
(The first 8 bytes of a page are the checksum and the page number,
and neither are ever updated directly by log records.)
Internally, the special value 1 indicates that the same_page form
will not be allowed for the subsequent record.
mtr_t::page_create(): Take the block descriptor as parameter,
so that it can be compared to mtr_t::m_last. The INIT_INDEX_PAGE
record will always followed by a subtype byte, because same_page
records must be longer than 1 byte.
trx_undo_page_init(): Combine the writes in WRITE record.
trx_undo_header_create(): Write 4 bytes using a special MEMSET
record that includes 1 bytes of length and 2 bytes of payload.
flst_write_addr(): Define as a static function. Combine the writes.
flst_zero_both(): Replaces two flst_zero_addr() calls.
flst_init(): Do not inline the function.
fsp_free_seg_inode(): Zerofill the whole inode.
fsp_apply_init_file_page(): Initialize FIL_PAGE_PREV,FIL_PAGE_NEXT
to FIL_NULL when using the physical format.
btr_create(): Assert !page_has_siblings() because fsp_apply_init_file_page()
must have been invoked.
fil_ibd_create(): Do not write FILE_MODIFY after FILE_CREATE.
fil_names_dirty_and_write(): Remove the parameter mtr.
Write the records using a separate mini-transaction object,
because any FILE_ records must be at the start of a mini-transaction log.
recv_recover_page(): Add a fil_space_t* parameter.
After applying log to the a ROW_FORMAT=COMPRESSED page,
invoke buf_zip_decompress() to restore the uncompressed page.
buf_page_io_complete(): Remove the temporary hack to discard the
uncompressed page of a ROW_FORMAT=COMPRESSED page.
page_zip_write_header(): Remove. Use mtr_t::write() or
mtr_t::memset() instead, and update the compressed page frame
separately.
trx_undo_header_add_space_for_xid(): Remove.
trx_undo_seg_create(): Perform the changes that were previously
made by trx_undo_header_add_space_for_xid().
btr_reset_instant(): New function: Reset the table to MariaDB 10.2
or 10.3 format when rolling back an instant ALTER TABLE operation.
page_rec_find_owner_rec(): Merge with the only callers.
page_cur_insert_rec_low(): Combine writes by using a local buffer.
MEMMOVE data from the preceding record whenever feasible
(copying at least 3 bytes).
page_cur_insert_rec_zip(): Combine writes to page header fields.
PageBulk::insertPage(): Issue MEMMOVE records to copy a matching
part from the preceding record.
PageBulk::finishPage(): Combine the writes to the page header
and to the sparse page directory slots.
mtr_t::write(): Only log the least significant (last) bytes
of multi-byte fields that actually differ.
For updating FSP_SIZE, we must always write all 4 bytes to the
redo log, so that the fil_space_set_recv_size() logic in
recv_sys_t::parse() will work.
mtr_t::memcpy(), mtr_t::zmemcpy(): Take a pointer argument
instead of a numeric offset to the page frame. Only log the
last bytes of multi-byte fields that actually differ.
In fil_space_crypt_t::write_page0(), we must log also any
unchanged bytes, so that recovery will recognize the record
and invoke fil_crypt_parse().
Future work:
MDEV-21724 Optimize page_cur_insert_rec_low() redo logging
MDEV-21725 Optimize btr_page_reorganize_low() redo logging
MDEV-21727 Optimize redo logging for ROW_FORMAT=COMPRESSED
mtr_t::log_write_low(): Replaces mlog_write_initial_log_record_low().
mtr_t::log_file_op(): Replaces fil_op_write_log().
mtr_t::free(): Write MLOG_INIT_FREE_PAGE.
mtr_t::init(): Write MLOG_INIT_FILE_PAGE2.
mtr_t::page_create(): Write record about the partial initialization
of an index page.
mlog_catenate_ulint(), mlog_catenate_string(),
mlog_open(), mlog_close(): Remove.
Pass buf_block_t* to all functions that write redo log.
Specifically, replace the parameters page,page_zip
with buf_block_t* block in page_zip_ functions.
page_mem_alloc_free(), page_dir_set_n_heap(), page_ptr_set_direction():
Merge with the callers.
page_direction_reset(), page_direction_increment(),
page_zip_dir_insert(), page_zip_write_rec_ext(), page_zip_write_rec():
Add the parameter mtr, and write log.
PageBulk::insert(), PageBulk::finish(): Write log for all changes.
page_cur_rec_insert(), page_cur_insert_rec_write_log(),
page_cur_insert_rec_write_log(): Remove.
page_rec_set_next(), page_header_set_field(), page_header_set_ptr():
Remove. Use lower-level operations with or without logging.
page_zip_dir_add_slot(): Move to the same compilation unit with
its only caller, page_cur_insert_rec_zip().
page_cur_insert_rec_zip(): Mark pieces of code that must be skipped
once this task is completed.
btr_defragment_chunk(): Before starting a mini-transaction that
is writing (a lot), invoke log_free_check(). This should allow
the test innodb.innodb_defrag_concurrent to pass with the
mtr default_mysqld.cnf setting of innodb_log_file_size=10M.
MLOG_BUF_MARGIN: Remove.
mtr_t::memcpy(): Replaces mlog_write_string(), mlog_log_string().
The buf_block_t is passed a parameter, so that
mlog_write_initial_log_record_low() can be used instead of
mlog_write_initial_log_record_fast().
fil_space_crypt_t::write_page0(): Remove the fil_space_t* parameter.
Passing buf_block_t helps us avoid calling
mlog_write_initial_log_record_fast() and page_get_page_no(),
and allows us to implement more debug checks, such as
that on ROW_FORMAT=COMPRESSED index pages, only the page header
may be modified by MLOG_MEMSET records.
fseg_n_reserved_pages(): Add a buf_block_t parameter.
mtr_t::write(): Replaces mlog_write_ulint(), mlog_write_ull().
Optimize away writes if the page contents does not change,
except when a dummy write has been explicitly requested.
Because the member function template takes a block descriptor as a
parameter, it is possible to introduce better consistency checks.
Due to this, the code for handling file-based lists, undo logs
and user transactions was refactored to pass around buf_block_t.
page_create_write_log(), mlog_write_initial_log_record():
Merge to the only caller, and use
mlog_write_initial_log_record_low() for writing the log record.
Except for fil_name_process(), which invokes os_normalize_path(),
the redo log record parser will not modify the redo log records.
Add const qualifiers accordingly.
Implement a 10.4 redo log format, which extends the 10.3 format
by introducing the MLOG_MEMSET record.
MLOG_MEMSET: A new redo log record type for filling an area with a byte.
mlog_memset(): Write the MLOG_MEMSET record.
mlog_parse_nbytes(): Handle MLOG_MEMSET as well.
trx_rseg_header_create(): Reduce the redo log volume by making use of
mlog_memset() and the zero-initialization that happens inside page
allocation.
fil_addr_null: Remove.
flst_init(): Create a variant that takes a zero-initialized
buf_block_t* as a parameter, and only writes the FIL_NULL using
mlog_memset().
flst_zero_addr(): A variant of flst_write_addr() that writes
a null address using mlog_memset() for the FIL_NULL.
The following fixes are replacing some use of MLOG_WRITE_STRING
with the more compact MLOG_MEMSET record, or eliminating
redundant redo log writes:
btr_store_big_rec_extern_fields(): Invoke mlog_memset() for
zero-initializing the tail of the ROW_FORMAT=COMPRESSED BLOB page.
trx_sysf_create(), trx_rseg_format_upgrade(): Invoke mlog_memset()
for zero-initializing the page trailer.
fsp_header_init(), trx_rseg_header_create():
Remove redundant zero-initializations.
Also, remove empty .ic files that were not removed by my MySQL commit.
Problem:
InnoDB used to support a compilation mode that allowed to choose
whether the function definitions in .ic files are to be inlined or not.
This stopped making sense when InnoDB moved to C++ in MySQL 5.6
(and ha_innodb.cc started to #include .ic files), and more so in
MySQL 5.7 when inline methods and functions were introduced
in .h files.
Solution:
Remove all references to UNIV_NONINL and UNIV_MUST_NOT_INLINE from
all files, assuming that the symbols are never defined.
Remove the files fut0fut.cc and ut0byte.cc which only mattered when
UNIV_NONINL was defined.
The InnoDB source code contains quite a few references to a closed-source
hot backup tool which was originally called InnoDB Hot Backup (ibbackup)
and later incorporated in MySQL Enterprise Backup.
The open source backup tool XtraBackup uses the full database for recovery.
So, the references to UNIV_HOTBACKUP are only cluttering the source code.
Contains also
MDEV-10547: Test multi_update_innodb fails with InnoDB 5.7
The failure happened because 5.7 has changed the signature of
the bool handler::primary_key_is_clustered() const
virtual function ("const" was added). InnoDB was using the old
signature which caused the function not to be used.
MDEV-10550: Parallel replication lock waits/deadlock handling does not work with InnoDB 5.7
Fixed mutexing problem on lock_trx_handle_wait. Note that
rpl_parallel and rpl_optimistic_parallel tests still
fail.
MDEV-10156 : Group commit tests fail on 10.2 InnoDB (branch bb-10.2-jan)
Reason: incorrect merge
MDEV-10550: Parallel replication can't sync with master in InnoDB 5.7 (branch bb-10.2-jan)
Reason: incorrect merge
layout as we always had in trees containing only the builtin
2) win\configure.js WITH_INNOBASE_STORAGE_ENGINE still works.
storage/innobase/CMakeLists.txt:
fix to new directory name (and like 5.1)
storage/innobase/Makefile.am:
fix to new directory name (and like 5.1)
storage/innobase/handler/ha_innodb.cc:
fix to new directory name (and like 5.1)
storage/innobase/plug.in:
fix to new directory name (and like 5.1)
bzr branch mysql-5.1-performance-version mysql-trunk # Summit
cd mysql-trunk
bzr merge mysql-5.1-innodb_plugin # which is 5.1 + Innodb plugin
bzr rm innobase # remove the builtin
Next step: build, test fixes.