When srv_page_size and innodb_page_size were introduced,
the functions page_align() and page_offset() got more expensive.
Let us try to replace such calls with simpler pointer arithmetics
with respect to the buffer page frame.
page_rec_get_next_non_del_marked(): Add a page frame as a parameter,
and template<bool comp>.
page_rec_next_get(): A more efficient variant of page_rec_get_next(),
with template<bool comp> and const page_t* parameters.
lock_get_heap_no(): Replaces page_rec_get_heap_no() outside debug checks.
fseg_free_step(), fseg_free_step_not_header(): Take the header block
as a parameter.
Reviewed by: Vladislav Lesin
Also, default to innodb_purge_batch_size=1000,
replacing the old default value of processing 300 undo log pages
in a batch. Axel Schwenke found this value to help reduce purge lag
without having a significant impact on workload throughput.
In purge, we can simply acquire a shared latch on the undo log page
(to avoid a race condition like the one that was fixed in
commit b102872ad5) and retain a buffer-fix
after releasing the latch. The buffer-fix will prevent the undo log
page from being evicted from the buffer pool. Concurrent modification
is prevented by design. Only the purge_coordinator_task
(or its accomplice purge_truncation_task) may free the undo log pages,
after any purge_worker_task have completed execution. Hence, we do not
have to worry about any overwriting or reuse of the undo log records.
trx_undo_rec_copy(): Remove. The only remaining caller would have been
trx_undo_get_undo_rec_low(), which is where the logic was merged.
purge_sys_t::m_initialized: Replaces heap.
purge_sys_t::pages: A cache of buffer-fixed pages that have been
looked up from buf_pool.page_hash.
purge_sys_t::get_page(): Return a buffer-fixed undo page, using the
pages cache.
trx_purge_t::batch_cleanup(): Renamed from clone_end_view().
Clear the pages cache and clone the end_view at the end of a batch.
purge_sys_t::n_pages_handled(): Return pages.size(). This determines
if innodb_purge_batch_size was exceeded.
purge_sys_t::rseg_get_next_history_log(): Replaces
trx_purge_rseg_get_next_history_log().
purge_sys_t::choose_next_log(): Replaces trx_purge_choose_next_log()
and trx_purge_read_undo_rec().
purge_sys_t::get_next_rec(): Replaces trx_purge_get_next_rec()
and trx_undo_get_next_rec().
purge_sys_t::fetch_next_rec(): Replaces trx_purge_fetch_next_rec()
and some use of trx_undo_get_first_rec().
trx_purge_attach_undo_recs(): Do not allow purge_sys.n_pages_handled()
exceed the innodb_purge_batch_size or ¾ of the buffer pool, whichever
is smaller.
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich and Axel Schwenke
UndorecApplier::assign_rec(): Remove. We will pass the undo record to
UndorecApplier::apply_undo_rec(). There is no need to copy the
undo record, because nothing else can write to the undo log pages
that belong to an active or incomplete transaction.
trx_t::apply_log(): Buffer-fix the undo page across mini-transaction
boundary in order to avoid repeated page lookups.
Reviewed by: Vladislav Lesin
purge_node_t, undo_node_t: Change the type of rec_type and cmpl_info
to byte, because this data is being extracted from a single byte.
UndoRecApplier: Change type and cmpl_info to be of type byte, and
move them next to the 16-bit offset field to minimize alignment bloat.
row_purge_parse_undo_rec(): Remove some redundant code. Purge will
be started by innodb_ddl_recovery_done(), at which point all
necessary subsystems will have been initialized.
trx_purge_rec_t::undo_rec: Point to const.
Reviewed by: Vladislav Lesin
trx_commit_cleanup(): Clean up any temporary undo log.
Replaces trx_undo_commit_cleanup() and trx_undo_seg_free().
trx_write_serialisation_history(): Commit the mini-transaction.
Do not touch temporary undo logs. Assume that a persistent rollback
segment has been assigned.
trx_serialise(): Merged into trx_write_serialisation_history().
trx_t::commit_low(): Correct some comments and assertions.
trx_t::commit_persist(): Only invoke commit_low() on a mini-transaction
if the persistent state needs to change.
trx_undo_set_state_at_finish(): Merge to its only caller,
trx_purge_add_undo_to_history().
trx_purge_add_undo_to_history(): Evaluate the condition related to
TRX_UNDO_STATE only once.
Tested by: Matthias Leich
The motivation of this change is to allow undo pages for temporary tables
to be marked free as often as possible, so that we can avoid buf_pool.LRU
eviction (and writes) of undo pages that contain data that is
no longer needed. For temporary tables, no MVCC or purge of history
is needed, and reusing cached undo log pages might not help that much.
It is possible that this may cause some performance regression due to
more frequent allocation and freeing of undo log pages, but I only
measured a performance improvement.
trx_write_serialisation_history(): Never cache temporary undo log pages.
trx_undo_reuse_cached(): Assert that the rollback segment is persistent.
trx_undo_assign_low(): Add template<bool is_temp>. Never invoke
trx_undo_reuse_cached() for temporary tables.
Tested by: Matthias Leich
It is not safe to invoke trx_purge_free_segment() or execute
innodb_undo_log_truncate=ON before all undo log records in
the rollback segment has been processed.
A prominent failure that would occur due to premature freeing of
undo log pages is that trx_undo_get_undo_rec() would crash when
trying to copy an undo log record to fetch the previous version
of a record.
If trx_undo_get_undo_rec() was not invoked in the unlucky time frame,
then the symptom would be that some committed transaction history is
never removed. This would be detected by CHECK TABLE...EXTENDED that
was impleented in commit ab0190101b.
Such a garbage collection leak should be possible even when using
innodb_undo_log_truncate=OFF, just involving trx_purge_free_segment().
trx_rseg_t::needs_purge: Change the type from Boolean to a transaction
identifier, noting the most recent non-purged transaction, or 0 if
everything has been purged. On transaction start, we initialize this
to 1 more than the transaction start ID. On recovery, the field may be
adjusted to the transaction end ID (TRX_UNDO_TRX_NO) if it is larger.
The field TRX_UNDO_NEEDS_PURGE becomes write-only; only some debug
assertions that would validate the value. The field reflects the old
inaccurate Boolean field trx_rseg_t::needs_purge.
trx_undo_mem_create_at_db_start(), trx_undo_lists_init(),
trx_rseg_mem_restore(): Remove the parameter max_trx_id.
Instead, store the maximum in trx_rseg_t::needs_purge,
where trx_rseg_array_init() will find it.
trx_purge_free_segment(): Contiguously hold a lock on
trx_rseg_t to prevent any concurrent allocation of undo log.
trx_purge_truncate_rseg_history(): Only invoke trx_purge_free_segment()
if the rollback segment is empty and there are no pending transactions
associated with it.
trx_purge_truncate_history(): Only proceed with innodb_undo_log_truncate=ON
if trx_rseg_t::needs_purge indicates that all history has been purged.
Tested by: Matthias Leich
trx_undo_rec_copy(): Return nullptr if the undo record is corrupted.
trx_undo_rec_get_undo_no(): Define inline with the declaration.
trx_purge_dummy_rec: Replaced with a -1 pointer.
row_undo_rec_get(), UndorecApplier::apply_undo_rec(): Check
if trx_undo_rec_copy() returned nullptr.
trx_purge_get_next_rec(): Return nullptr upon encountering any
corruption, to signal the end of purge.
The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.
We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.
This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.
We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.
Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.
buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.
btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.
btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().
FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.
trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
trx_sys_create_sys_pages(): Merged with trx_sysf_create().
dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.
dir_pathname(): Replaces os_file_make_new_pathname().
row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.
btr_decryption_failed(): Report decryption failures
dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.
dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.
dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.
dict_table_is_corrupted(): Remove.
dict_index_t::is_btree(): Determine if the index is a valid B-tree.
BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures, but error codes being returned.
buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().
fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.
btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.
opt_search_plan_for_table(): Simplify the code.
row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.
row_undo_mod_upd_exist_sec(): Simplify the code.
row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().
fut_get_ptr(): Replace with buf_page_get_gen() calls.
buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.
buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.
fseg_page_is_allocated(): Replaces fseg_page_is_free().
fts_drop_common_tables(): Return an error if the transaction
was rolled back.
fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.
fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.
Clean up mtr_t::page_lock()
buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.
buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.
recv_recover_page(): Return whether the operation succeeded.
recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.
Problem:
=======
Index page latches must be acquired before undo page latches.
In trx_t::apply_log(), InnoDB acquired undo log page latch
before an index page latch.
Solution:
========
In trx_t::apply_log(), InnoDB should copy the undo log
record and release the undo log block before applying
it on online indexes.
- InnoDB DDL results in `Duplicate entry' if concurrent DML throws
duplicate key error. The following scenario explains the problem
connection con1:
ALTER TABLE t1 FORCE;
connection con2:
INSERT INTO t1(pk, uk) VALUES (2, 2), (3, 2);
In connection con2, InnoDB throws the 'DUPLICATE KEY' error because
of unique index. Alter operation will throw the error when applying
the concurrent DML log.
- Inserting the duplicate key for unique index logs the insert
operation for online ALTER TABLE. When insertion fails,
transaction does rollback and it leads to logging of
delete operation for online ALTER TABLE.
While applying the insert log entries, alter operation
encounters 'DUPLICATE KEY' error.
- To avoid the above fake duplicate scenario, InnoDB should
not write any log for online ALTER TABLE before DML transaction
commit.
- User thread which does DML can apply the online log if
InnoDB ran out of online log and index is marked as completed.
Set online log error if apply phase encountered any error.
It can also clear all other indexes log, marks the newly
added indexes as corrupted.
- Removed the old online code which was a part of DML operations
commit_inplace_alter_table() : Does apply the online log
for the last batch of secondary index log and does frees
the log for the completed index.
trx_t::apply_online_log: Set to true while writing the undo
log if the modified table has active DDL
trx_t::apply_log(): Apply the DML changes to online DDL tables
dict_table_t::is_active_ddl(): Returns true if the table
has an active DDL
dict_index_t::online_log_make_dummy(): Assign dummy value
for clustered index online log to indicate the secondary
indexes are being rebuild.
dict_index_t::online_log_is_dummy(): Check whether the online
log has dummy value
ha_innobase_inplace_ctx::log_failure(): Handle the apply log
failure for online DDL transaction
row_log_mark_other_online_index_abort(): Clear out all other
online index log after encountering the error during
row_log_apply()
row_log_get_error(): Get the error happened during row_log_apply()
row_log_online_op(): Does apply the online log if index is
completed and ran out of memory. Returns false if apply log fails
UndorecApplier: Introduced a class to maintain the undo log
record, latched undo buffer page, parse the undo log record,
maintain the undo record type, info bits and update vector
UndorecApplier::get_old_rec(): Get the correct version of the
clustered index record that was modified by the current undo
log record
UndorecApplier::clear_undo_rec(): Clear the undo log related
information after applying the undo log record
UndorecApplier::log_update(): Handle the update, delete undo
log and apply it on online indexes
UndorecApplier::log_insert(): Handle the insert undo log
and apply it on online indexes
UndorecApplier::is_same(): Check whether the given roll pointer
is generated by the current undo log record information
trx_t::rollback_low(): Set apply_online_log for the transaction
after partially rollbacked transaction has any active DDL
prepare_inplace_alter_table_dict(): After allocating the online
log, InnoDB does create fulltext common tables. Fulltext index
doesn't allow the index to be online. So removed the dead
code of online log removal
Thanks to Marko Mäkelä for providing the initial prototype and
Matthias Leich for testing the issue patiently.
Let us simply refuse an upgrade from earlier versions if the
upgrade procedure was not followed. This simplifies the purge,
commit, and rollback of transactions.
Before upgrading to MariaDB 10.3 or later, a clean shutdown
of the server (with innodb_fast_shutdown=1 or 0) is necessary,
to ensure that any incomplete transactions are rolled back.
The undo log format was changed in MDEV-12288. There is only
one persistent undo log for each transaction.
InnoDB used to support at most one CREATE TABLE or DROP TABLE
per transaction. This caused complications for DDL operations on
partitioned tables (where each partition is treated as a separate
table by InnoDB) and FULLTEXT INDEX (where each index is maintained
in a number of internal InnoDB tables).
dict_drop_index_tree(): Extend the MDEV-24589 logic and treat
the purge or rollback of SYS_INDEXES records of clustered indexes
specially: by dropping the tablespace if it exists. This is the only
form of recovery that we will need.
trx_undo_ddl_type: Document the DDL undo log record types better.
trx_t::dict_operation: Change the type to bool.
trx_t::ddl: Remove.
trx_t::table_id, trx_undo_t::table_id: Remove.
dict_build_table_def_step(): Remove trx_t::table_id logging.
dict_table_close_and_drop(), row_merge_drop_table(): Remove.
row_merge_lock_table(): Merged to the only callers, which can
call lock_table_for_trx() directly.
fts_aux_table_t, fts_aux_id, fts_space_set_t: Remove.
fts_drop_orphaned_tables(): Remove.
row_merge_rename_index_to_drop(): Remove. Thanks to MDEV-24589,
we can simply delete the to-be-dropped indexes from SYS_INDEXES,
while still being able to roll back the operation.
ha_innobase_inplace_ctx: Make a few data members const.
Preallocate trx.
prepare_inplace_alter_table_dict(): Simplify the logic. Let the
normal rollback take care of some cleanup.
row_undo_ins_remove_clust_rec(): Simplify the parsing of SYS_COLUMNS.
trx_rollback_active(): Remove the special DROP TABLE logic.
trx_undo_mem_create_at_db_start(), trx_undo_reuse_cached():
Always write TRX_UNDO_TABLE_ID as 0.
InnoDB frees the block lock during buffer pool shrinking when other
thread is yet to release the block lock. While shrinking the
buffer pool, InnoDB allows the page to be freed unless it is buffer
fixed. In some cases, InnoDB releases the latch after unfixing the
block.
Fix:
====
- InnoDB should unfix the block after releases the latch.
- Add more assertion to check buffer fix while accessing the page.
- Introduced block_hint structure to store buf_block_t pointer
and allow accessing the buf_block_t pointer only by passing a
functor. It returns original buf_block_t* pointer if it is valid
or nullptr if the pointer become stale.
- Replace buf_block_is_uncompressed() with
buf_pool_t::is_block_pointer()
This change is motivated by a change in mysql-5.7.32:
mysql/mysql-server@46e60de444
Bug #31036301 ASSERTION FAILURE: SYNC0RW.IC:429:LOCK->LOCK_WORD
We introduce an EXTENDED log record for initializing an undo log page.
The size of the record will be 2 bytes plus the optional page identifier.
The entire undo page will be initialized, except the space that is
already reserved for TRX_UNDO_SEG_HDR in trx_undo_seg_create().
mtr_t::undo_create(): Write the UNDO_INIT record.
trx_undo_page_init(): Initialize the undo page corresponding to the
UNDO_INIT record. Unlike the former MLOG_UNDO_INIT record, we will
initialize almost the entire page, including initializing the
TRX_UNDO_PAGE_NODE to an empty list node, so that the subsequent call
to flst_init() will avoid writing log for the undo page.
mtr_t::write(): Replaces mlog_write_ulint(), mlog_write_ull().
Optimize away writes if the page contents does not change,
except when a dummy write has been explicitly requested.
Because the member function template takes a block descriptor as a
parameter, it is possible to introduce better consistency checks.
Due to this, the code for handling file-based lists, undo logs
and user transactions was refactored to pass around buf_block_t.
Some places didn't match the previous rules, making the Floor
address wrong.
Additional sed rules:
sed -i -e 's/Place.*Suite .*, Boston/Street, Fifth Floor, Boston/g'
sed -i -e 's/Suite .*, Boston/Fifth Floor, Boston/g'
InnoDB in MySQL 5.7 introduced two new parameters to the function
dict_hdr_get_new_id(), to allow redo logging to be disabled when
assigning identifiers to temporary tables or during the
backup-unfriendly TRUNCATE TABLE that was replaced in MariaDB
by MDEV-13564.
Now that MariaDB 10.4.0 removed the crash recovery code for the
backup-unfriendly TRUNCATE, we can revert dict_hdr_get_new_id()
to be used only for persistent data structures.
dict_table_assign_new_id(): Remove. This was a simple 2-line function
that was called from few places.
dict_table_open_on_id_low(): Declare in the only file where it
is called.
dict_sys_t::temp_id_hash: A separate lookup table for temporary tables.
Table names will be in the common dict_sys_t::table_hash.
dict_sys_t::get_temporary_table_id(): Assign a temporary table ID.
dict_sys_t::get_table(): Look up a persistent table.
dict_sys_t::get_temporary_table(): Look up a temporary table.
dict_sys_t::temp_table_id: The sequence of temporary table identifiers.
Starts from DICT_HDR_FIRST_ID, so that we can continue to simply compare
dict_table_t::id to a few constants for the persistent hard-coded
data dictionary tables.
undo_node_t::state: Distinguish temporary and persistent tables.
lock_check_dict_lock(), lock_get_table_id(): Assert that there cannot
be locks on temporary tables.
row_rec_to_index_entry_impl(): Assert that there cannot be metadata
records on temporary tables.
row_undo_ins_parse_undo_rec(): Distinguish temporary and persistent tables.
Move some assertions from the only caller. Return whether the table was
found.
row_undo_ins(): Add some assertions.
row_undo_mod_clust(), row_undo_mod(): Do not assign node->state.
Let row_undo() do that.
row_undo_mod_parse_undo_rec(): Distinguish temporary and persistent tables.
Move some assertions from the only caller. Return whether the table was
found.
row_undo_try_truncate(): Renamed and simplified from trx_roll_try_truncate().
row_undo_rec_get(): Replaces trx_roll_pop_top_rec_of_trx() and
trx_roll_pop_top_rec(). Fetch an undo log record, and assign undo->state
accordingly.
trx_undo_truncate_end(): Acquire the rseg->mutex only for the minimum
required duration, and release it between mini-transactions.
dict_index_t::db_trx_id(): Return the position of DB_TRX_ID.
Only valid for the clustered index.
dict_index_t::db_roll_ptr(): Return the position of DB_ROLL_PTR.
Only valid for the clustered index.
dict_index_get_sys_col_pos(): Remove. This was performing unnecessarily
complex computations, which only made sense for DB_ROW_ID, which would
exist either as the first field in the clustered index or as the last
field in a secondary index (only when a DB_ROW_ID column is materialised).
row_sel_store_row_id_to_prebuilt(): Remove, and replace with simpler code.
row_upd_index_entry_sys_field(): Remove.
btr_cur_log_sys(): Replaces row_upd_write_sys_vals_to_log().
btr_cur_write_sys(): Write DB_TRX_ID,DB_ROLL_PTR to a data tuple.
main.derived_cond_pushdown: Move all 10.3 tests to the end,
trim trailing white space, and add an "End of 10.3 tests" marker.
Add --sorted_result to tests where the ordering is not deterministic.
main.win_percentile: Add --sorted_result to tests where the
ordering is no longer deterministic.
It turned out that ha_innobase::truncate() would prematurely
commit the transaction already before the completion of the
ha_innobase::create(). All of this must be atomic.
innodb.truncate_crash: Use the correct DEBUG_SYNC point, and
tolerate non-truncation of the table, because the redo log
for the TRUNCATE transaction commit might be flushed due to
some InnoDB background activity.
dict_build_tablespace_for_table(): Merge to the function
dict_build_table_def_step().
dict_build_table_def_step(): If a table is being created during
an already started data dictionary transaction (such as TRUNCATE),
persistently write the table_id to the undo log header before
creating any file. In this way, the recovery of TRUNCATE will be
able to delete the new file before rolling back the rename of
the original table.
dict_table_rename_in_cache(): Add the parameter replace_new_file,
used as part of rolling back a TRUNCATE operation.
fil_rename_tablespace_check(): Add the parameter replace_new.
If the parameter is set and a file identified by new_path exists,
remove a possible tablespace and also the file.
create_table_info_t::create_table_def(): Remove some debug assertions
that no longer hold. During TRUNCATE, the transaction will already
have been started (and performed a rename operation) before the
table is created. Also, remove a call to dict_build_tablespace_for_table().
create_table_info_t::create_table(): Add the parameter create_fk=true.
During TRUNCATE TABLE, do not add FOREIGN KEY constraints to the
InnoDB data dictionary, because they will also not be removed.
row_table_add_foreign_constraints(): If trx=NULL, do not modify
the InnoDB data dictionary, but only load the FOREIGN KEY constraints
from the data dictionary.
ha_innobase::create(): Lock the InnoDB data dictionary cache only
if no transaction was passed by the caller. Unlock it in any case.
innobase_rename_table(): Add the parameter commit = true.
If !commit, do not lock or unlock the data dictionary cache.
ha_innobase::truncate(): Lock the data dictionary before invoking
rename or create, and let ha_innobase::create() unlock it and
also commit or roll back the transaction.
trx_undo_mark_as_dict(): Renamed from trx_undo_mark_as_dict_operation()
and declared global instead of static.
row_undo_ins_parse_undo_rec(): If table_id is set, this must
be rolling back the rename operation in TRUNCATE TABLE, and
therefore replace_new_file=true.
This is a merge from 10.2, but the 10.2 version of this will not
be pushed into 10.2 yet, because the 10.2 version would include
backports of MDEV-14717 and MDEV-14585, which would introduce
a crash recovery regression: Tables could be lost on
table-rebuilding DDL operations, such as ALTER TABLE,
OPTIMIZE TABLE or this new backup-friendly TRUNCATE TABLE.
The test innodb.truncate_crash occasionally loses the table due to
the following bug:
MDEV-17158 log_write_up_to() sometimes fails
Implement undo tablespace truncation via normal redo logging.
Implement TRUNCATE TABLE as a combination of RENAME to #sql-ib name,
CREATE, and DROP.
Note: Orphan #sql-ib*.ibd may be left behind if MariaDB Server 10.2
is killed before the DROP operation is committed. If MariaDB Server 10.2
is killed during TRUNCATE, it is also possible that the old table
was renamed to #sql-ib*.ibd but the data dictionary will refer to the
table using the original name.
In MariaDB Server 10.3, RENAME inside InnoDB is transactional,
and #sql-* tables will be dropped on startup. So, this new TRUNCATE
will be fully crash-safe in 10.3.
ha_mroonga::wrapper_truncate(): Pass table options to the underlying
storage engine, now that ha_innobase::truncate() will need them.
rpl_slave_state::truncate_state_table(): Before truncating
mysql.gtid_slave_pos, evict any cached table handles from
the table definition cache, so that there will be no stale
references to the old table after truncating.
== TRUNCATE TABLE ==
WL#6501 in MySQL 5.7 introduced separate log files for implementing
atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB
undo and redo log. Some convoluted logic was added to the InnoDB
crash recovery, and some extra synchronization (including a redo log
checkpoint) was introduced to make this work. This synchronization
has caused performance problems and race conditions, and the extra
log files cannot be copied or applied by external backup programs.
In order to support crash-upgrade from MariaDB 10.2, we will keep
the logic for parsing and applying the extra log files, but we will
no longer generate those files in TRUNCATE TABLE.
A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE
(with full redo and undo logging and proper rollback). This will
be implemented in MDEV-14717.
ha_innobase::truncate(): Invoke RENAME, create(), delete_table().
Because RENAME cannot be fully rolled back before MariaDB 10.3
due to missing undo logging, add some explicit rename-back in
case the operation fails.
ha_innobase::delete(): Introduce a variant that takes sqlcom as
a parameter. In TRUNCATE TABLE, we do not want to touch any
FOREIGN KEY constraints.
ha_innobase::create(): Add the parameters file_per_table, trx.
In TRUNCATE, the new table must be created in the same transaction
that renames the old table.
create_table_info_t::create_table_info_t(): Add the parameters
file_per_table, trx.
row_drop_table_for_mysql(): Replace a bool parameter with sqlcom.
row_drop_table_after_create_fail(): New function, wrapping
row_drop_table_for_mysql().
dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(),
fil_prepare_for_truncate(), fil_reinit_space_header_for_table(),
row_truncate_table_for_mysql(), TruncateLogger,
row_truncate_prepare(), row_truncate_rollback(),
row_truncate_complete(), row_truncate_fts(),
row_truncate_update_system_tables(),
row_truncate_foreign_key_checks(), row_truncate_sanity_checks():
Remove.
row_upd_check_references_constraints(): Remove a check for
TRUNCATE, now that the table is no longer truncated in place.
The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some
race-condition like scenarios. The test innodb-innodb.truncate does
not use any synchronization.
We add a redo log subformat to indicate backup-friendly format.
MariaDB 10.4 will remove support for the old TRUNCATE logging,
so crash-upgrade from old 10.2 or 10.3 to 10.4 will involve
limitations.
== Undo tablespace truncation ==
MySQL 5.7 implements undo tablespace truncation. It is only
possible when innodb_undo_tablespaces is set to at least 2.
The logging is implemented similar to the WL#6501 TRUNCATE,
that is, using separate log files and a redo log checkpoint.
We can simply implement undo tablespace truncation within
a single mini-transaction that reinitializes the undo log
tablespace file. Unfortunately, due to the redo log format
of some operations, currently, the total redo log written by
undo tablespace truncation will be more than the combined size
of the truncated undo tablespace. It should be acceptable
to have a little more than 1 megabyte of log in a single
mini-transaction. This will be fixed in MDEV-17138 in
MariaDB Server 10.4.
recv_sys_t: Add truncated_undo_spaces[] to remember for which undo
tablespaces a MLOG_FILE_CREATE2 record was seen.
namespace undo: Remove some unnecessary declarations.
fil_space_t::is_being_truncated: Document that this flag now
only applies to undo tablespaces. Remove some references.
fil_space_t::is_stopping(): Do not refer to is_being_truncated.
This check is for tablespaces of tables. Potentially used
tablespaces are never truncated any more.
buf_dblwr_process(): Suppress the out-of-bounds warning
for undo tablespaces.
fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero
page number (new size of the tablespace in pages) to inform
crash recovery that the undo tablespace size has been reduced.
fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2
can be written for undo tablespaces (without .ibd file suffix)
for a nonzero page number.
os_file_truncate(): Add the parameter allow_shrink=false
so that undo tablespaces can actually be shrunk using this function.
fil_name_parse(): For undo tablespace truncation,
buffer MLOG_FILE_CREATE2 in truncated_undo_spaces[].
recv_read_in_area(): Avoid reading pages for which no redo log
records remain buffered, after recv_addr_trim() removed them.
trx_rseg_header_create(): Add a FIXME comment that we could write
much less redo log.
trx_undo_truncate_tablespace(): Reinitialize the undo tablespace
in a single mini-transaction, which will be flushed to the redo log
before the file size is trimmed.
recv_addr_trim(): Discard any redo logs for pages that were
logged after the new end of a file, before the truncation LSN.
If the rec_list becomes empty, reduce n_addrs. After removing
any affected records, actually truncate the file.
recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before
applying any log records. The undo tablespace files must be open
at this point.
buf_flush_or_remove_pages(), buf_flush_dirty_pages(),
buf_LRU_flush_or_remove_pages(): Add a parameter for specifying
the number of the first page to flush or remove (default 0).
trx_purge_initiate_truncate(): Remove the log checkpoints, the
extra logging, and some unnecessary crash points. Merge the code
from trx_undo_truncate_tablespace(). First, flush all to-be-discarded
pages (beyond the new end of the file), then trim the space->size
to make the page allocation deterministic. At the only remaining
crash injection point, flush the redo log, so that the recovery
can be tested.
While the record type MLOG_UNDO_INIT feels redundant, it does save
redo log volume, and subtle changes to redo log volume appear to
make a big difference for the log_sys.mutex contention.
trx_undo_parse_page_init(): Allow type==0, which is what we write
since MDEV-12288. Parse type in a simpler way; it always was written
as a single byte.
trx_undo_page_init(): Write a MLOG_UNDO_INIT record.
Remove unused InnoDB function parameters and functions.
i_s_sys_virtual_fill_table(): Do not allocate heap memory.
mtr_is_block_fix(): Replace with mtr_memo_contains().
mtr_is_page_fix(): Replace with mtr_memo_contains_page().