crash recovery
Summary
=======
When doing server recovery, the active transactions will be rolled
back by InnoDB background rollback thread automatically. The
prepared transactions will be committed or rolled back accordingly
by binlog recovery. Binlog recovery is done in main thread before
the server can provide service to users. If there is a big
transaction to rollback, the server will not available for a long
time.
This patch provides a way to rollback the prepared transactions
asynchronously. Thus the rollback will not block server startup.
Design
======
- Handler::recover_rollback_by_xid()
This patch provides a new handler interface to rollback transactions
in recover phase. InnoDB just set the transaction's state to active.
Then the transaction will be rolled back by the background rollback
thread.
- Handler::signal_tc_log_recover_done()
This function is called after tc log is opened(typically binlog opened)
has done. When this function is called, all transactions will be rolled
back have been reverted to ACTIVE state. Thus it starts rollback thread
to rollback the transactions.
- Background rollback thread
With this patch, background rollback thread is defered to run until binlog
recovery is finished. It is started by innobase_tc_log_recovery_done().
The recent commit 4ca355d863 (MDEV-33894)
caused a serious regression for online InnoDB ib_logfile0 resizing,
breaking crash-safety unless the memory-mapped log file interface is
being used. However, the log resizing was broken also before this.
To prevent such regressions in the future, we extend the test
innodb.log_file_size_online with a kill and restart of the server
and with some writes running concurrently with the log size change.
When run enough many times, this test revealed all the bugs that
are being fixed by the code changes.
log_t::resize_start(): Do not allow the resized log to start before
the current log sequence number. In this way, there is no need to
copy anything to the first block of resize_buf. The previous logic
regarding that was incorrect in two ways. First, we would have to
copy from the last written buffer (buf or flush_buf). Second, we failed
to ensure that the mini-transaction end marker bytes would be 1
in the buffer. If the source ib_logfile0 had wrapped around an odd number
of times, the end marker would be 0. This was occasionally observed
when running the test innodb.log_file_size_online.
log_t::resize_write_buf(): To adjust for the resize_start() change,
do not write anything that would be before the resize_lsn.
Take the buffer (resize_buf or resize_flush_buf) as a parameter.
Starting with commit 4ca355d863
we no longer swap buffers when rewriting the last log block.
log_t::append(): Define as a static function; only some debug
assertions need to refer to the log_sys object.
innodb_log_file_size_update(): Wake up the buf_flush_page_cleaner()
if needed, and wait for it to complete a batch while waiting for
the log resizing to be completed. If the current LSN is behind the
resize target LSN, we will write redundant FILE_CHECKPOINT records to
ensure that the log resizing completes. If the buf_pool.flush_list is
empty or the buf_flush_page_cleaner() is stuck for some reason, our wait
will time out in 5 seconds, so that we can periodically check if the
execution of SET GLOBAL innodb_log_file_size was aborted. Previously,
we could get into a busy loop here while the buf_flush_page_cleaner()
would remain idle.
recv_recovery_from_checkpoint_start(): Abort startup due to log
corruption if we were unable to parse the entire log between
the latest log checkpoint and the corresponding FILE_CHECKPOINT record.
Also, reduce some code bloat related to log output and log_sys.mutex.
Reviewed by: Debarun Banerjee
In commit fa8a46eb68 (MDEV-33613)
the parameter innodb_lru_flush_size ceased to have any effect.
Let us declare the parameter as deprecated and additionally as
MARIADB_REMOVED_OPTION, so that there will be a warning written
to the error log in case the option is specified in the command line.
Let us also do the same for the parameter
innodb_purge_rseg_truncate_frequency
that was deprecated&ignored earlier in MDEV-32050.
Reviewed by: Debarun Banerjee
purge_sys_t::stop_FTS(): Fix an incorrect debug assertion that
commit d58734d781 added.
The assertion would fail if there had been prior invocations of
purge_sys.stop_SYS() without purge_sys.resume_SYS().
The intention of the assertion is to check that number of pending
stop_FTS() stays below 65536.
Before this patch, the InnoDB purge coordinator task submitted
innodb_purge_threads-1 tasks even if there was not sufficient amount
of work for all of them. For example, if there are undo log records
only for 1 table, only 1 task can be employed, and that task had better
be the purge coordinator.
srv_purge_worker_task_low(): Split from purge_worker_callback().
trx_purge_attach_undo_recs(): Remove the parameter n_purge_threads,
and add the parameter n_work_items, to keep track of the amount of
work.
trx_purge(): Launch purge worker tasks only if necessary. The work of
one thread will be executed by this purge coordinator thread.
que_fork_scheduler_round_robin(): Merged to trx_purge().
Thanks to Vladislav Vaintroub for supplying a prototype of this.
Reviewed by: Debarun Banerjee
In a Sysbench oltp_update_index workload that involves 1 table,
a serious contention between the workload and the purge of history
was observed. This was the worst when the table contained only 1 record.
This turned out to be fixed by setting innodb_purge_batch_size=128,
which corresponds to the number of usable persistent rollback segments.
When we go above that, there would be contention between row_purge_poss_sec()
and the workload, typically on the clustered index page latch, sometimes
also on a secondary index page latch. It might be that with smaller
batches, trx_sys.history_size() will end up pausing all concurrent
transaction start/commit frequently enough so that purge will be able
to make some progress, so that there would be less contention on the
index page latches between purge and SQL execution.
In commit aa719b5010 (part of MDEV-32050)
the interpretation of the parameter innodb_purge_batch_size was slightly
changed. It would correspond to the maximum desired size of the
purge_sys.pages cache. Before that change, the parameter was referring to
a number of undo log pages, but the accounting might have been inaccurate.
To avoid a regression, we will reduce the default value to
innodb_purge_batch_size=127, which will also be compatible with
innodb_undo_tablespaces>1 (which will disable rollback segment 0).
Additionally, some logic in the purge and MVCC checks is simplified.
The purge tasks will make use of purge_sys.pages when accessing undo
log pages to find out if a secondary index record can be removed.
If an undo page needs to be looked up in buf_pool.page_hash, we will
merely buffer-fix it. This is correct, because the undo pages are
append-only in nature. Holding purge_sys.latch or purge_sys.end_latch
or the fact that the current thread is executing as a part of an
in-progress purge batch will prevent the contents of the undo page from
being freed and subsequently reused. The buffer-fix will prevent the
page from being evicted form the buffer pool. Thanks to this logic,
we can refer to the undo log record directly in the buffer pool page
and avoid copying the record.
buf_pool_t::page_fix(): Look up and buffer-fix a page. This is useful
for accessing undo log pages, which are append-only by nature.
There will be no need to deal with change buffer or ROW_FORMAT=COMPRESSED
in that case.
purge_sys_t::view_guard::view_guard(): Allow the type of guard to be
acquired: end_latch, latch, or no latch (in case we are a purge thread).
purge_sys_t::view_guard::get(): Read-only accessor to purge_sys.pages.
purge_sys_t::get_page(): Invoke buf_pool_t::page_fix().
row_vers_old_has_index_entry(): Replaced with row_purge_is_unsafe()
and row_undo_mod_sec_unsafe().
trx_undo_get_undo_rec(): Merged to trx_undo_prev_version_build().
row_purge_poss_sec(): Add the parameter mtr and remove redundant
or unused parameters sec_pcur, sec_mtr, is_tree. We will use the
caller's mtr object but release any acquired page latches before
returning.
btr_cur_get_page(), page_cur_get_page(): Do not invoke page_align().
row_purge_remove_sec_if_poss_leaf(): Return the value of PAGE_MAX_TRX_ID
to be checked against the page in row_purge_remove_sec_if_poss_tree().
If the secondary index page was not changed meanwhile, it will be
unnecessary to invoke row_purge_poss_sec() again.
trx_undo_prev_version_build(): Access any undo log pages using
the caller's mini-transaction object.
row_purge_vc_matches_cluster(): Moved to the only compilation unit that
needs it.
Reviewed by: Debarun Banerjee
There were two separate Atomic_counter<uint32_t>, purge_sys.m_SYS_paused
and purge_sys.m_FTS_paused. In purge_sys.wait_FTS() we have to read both
atomically. We used to use an overkill solution for this, acquiring
purge_sys.latch and waiting 10 milliseconds between samples. To make
matters worse, the 10-millisecond wait was unconditional, which would
unnecessarily suspend the purge_coordinator_task every now and then.
It turns out that we can fold both "reference counts" into a single
Atomic_relaxed<uint32_t> and avoid the purge_sys.latch.
To assess whether std::memory_order_relaxed is acceptable, we should
consider the operations that read these "reference counts", that is,
purge_sys_t::wait_FTS(bool) and purge_sys_t::must_wait_FTS().
Outside debug assertions, purge_sys.must_wait_FTS() is only invoked in
trx_purge_table_acquire(), which is covered by a shared dict_sys.latch.
We would increment the counter as part of a DDL operation, but before
acquiring an exclusive dict_sys.latch. So, a
purge_sys_t::close_and_reopen() loop could be triggered slightly
prematurely, before a problematic DDL operation is actually executed.
Decrementing the counter is less of an issue; purge_sys.resume_FTS()
or purge_sys.resume_SYS() would mostly be invoked while holding an
exclusive dict_sys.latch; ha_innobase::delete_table() does it outside
that critical section. Still, this would only cause some extra wait in
the purge_coordinator_task, just like at the start of a DDL operation.
There are two calls to purge_sys_t::wait_FTS(bool): in the above mentioned
purge_sys_t::close_and_reopen() and in purge_sys_t::clone_oldest_view(),
both invoked by the purge_coordinator_task. There is also a
purge_sys.clone_oldest_view<true>() call at startup when no DDL operation
can be in progress.
purge_sys_t::m_SYS_paused: Merged into m_FTS_paused, using a new
multiplier PAUSED_SYS = 65536.
purge_sys_t::wait_FTS(): Remove an unnecessary sleep as well as the
access to purge_sys.latch. It suffices to poll purge_sys.m_FTS_paused.
purge_sys_t::stop_FTS(): Do not acquire purge_sys.latch.
Reviewed by: Debarun Banerjee
buf_page_ibuf_merge_try(): A new, separate function for invoking
ibuf_merge_or_delete_for_page() when needed. Use the already requested
page latch for determining if the call is necessary. If it is and
if we are currently holding rw_latch==RW_S_LATCH, upgrading to an exclusive
latch may involve waiting that another thread acquires and releases
a U or X latch on the page. If we have to wait, we must recheck if the
call to ibuf_merge_or_delete_for_page() is still needed. If the page
turns out to be corrupted, we will release and fail the operation.
Finally, the exclusive page latch will be downgraded to the originally
requested latch.
ssux_lock_impl::rd_u_upgrade_try(): Attempt to upgrade a shared lock to
an update lock.
sux_lock::s_x_upgrade_try(): Attempt to upgrade a shared lock to
exclusive.
sux_lock::s_x_upgrade(): Upgrade a shared lock to exclusive.
Return whether a wait was elided.
ssux_lock_impl::u_rd_downgrade(), sux_lock::u_s_downgrade():
Downgrade an update lock to shared.
One cause of the slowdown is because the ftruncate call can be much
slower on some systems. ftruncate() is called by Aria for internal
temporary tables, tables created by the optimizer, when the upper level
asks Aria to delete the previous result set. This is needed when some
content from previous tables changes.
I have now changed Aria so that for internal temporary tables we don't
call ftruncate() anymore for maria_delete_all_rows().
I also had to update the Aria repair code to use the logical datafile
size and not the on-disk datafile size, which may contain data from a
previous result set. The repair code is called to create indexes for
the internal temporary table after it is filled.
I also replaced a call to mysql_file_size() with a pwrite() in
_ma_bitmap_create_first().
Reviewer: Sergei Petrunia <sergey@mariadb.com>
Tester: Dave Gosselin <dave.gosselin@mariadb.com>
When SUX_LOCK_GENERIC is defined, the srw_mutex, srw_lock, sux_lock are
implemented based on pthread_mutex_t and pthread_cond_t. This is the
only option for systems that lack a futex-like system call.
In the SUX_LOCK_GENERIC mode, if pthread_mutex_init() is allocating
some resources that need to be freed by pthread_mutex_destroy(),
a memory leak could occur when we are repeatedly invoking
pthread_mutex_init() without a pthread_mutex_destroy() in between.
pthread_mutex_wrapper::initialized: A debug field to track whether
pthread_mutex_init() has been invoked. This also helps find bugs
like the one that was fixed by
commit 1c8af2ae53 (MDEV-34422);
one simply needs to add -DSUX_LOCK_GENERIC to the CMAKE_CXX_FLAGS
to catch that particular bug on the initial server bootstrap.
buf_block_init(), buf_page_init_for_read(): Invoke block_lock::init()
because buf_page_t::init() will no longer do that.
buf_page_t::init(): Instead of invoking lock.init(), assert that it
has already been invoked (the lock is vacant).
add_fts_index(), build_fts_hidden_table(): Explicitly invoke
index_lock::init() in order to avoid a pthread_mutex_destroy()
invocation on an uninitialized object.
srw_lock_debug::destroy(): Invoke readers_lock.destroy().
trx_sys_t::create(): Invoke trx_rseg_t::init() on all rollback segments
in order to guarantee a deterministic state for shutdown, even if
InnoDB fails to start up.
trx_rseg_array_init(), trx_temp_rseg_create(), trx_rseg_create():
Invoke trx_rseg_t::destroy() before trx_rseg_t::init() in order to
balance pthread_mutex_init() and pthread_mutex_destroy() calls.
buf_page_get_gen(): Relax the assertion once more.
The LSN may grow by invoking ibuf_upgrade(), that is,
when upgrading files where innodb_change_buffering!=none was used.
The LSN may also have been recovered from a log that needs
to be upgraded to the current format.
int wsrep_thd_append_key(THD*, const wsrep_key*, int, Wsrep_service_key_type)
CREATE TABLE [SELECT|REPLACE SELECT] is CTAS and idea was that
we force ROW format. However, it was not correctly enforced
and keys were appended before wsrep transaction was started.
At THD::decide_logging_format we should force used stmt binlog
format to ROW in CTAS case and produce a warning if used
binlog format was not ROW.
At ha_innodb::update_row we should not append keys similarly
as in ha_innodb::write_row if sql_command is SQLCOM_CREATE_TABLE.
Improved error logging on ::write_row, ::update_row and ::delete_row
if wsrep key append fails.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Assertion `table->field[0]->ptr >= table->record[0] &&
table->field[0]->ptr <= table->record[0] + table->s->reclength' failed in
handler::assert_icp_limitations.
table->move_fields has some limitations:
1. It cannot be used in cascade
2. It should always have a restoring pair.
Rule 1 is covered by assertions in handler::assert_icp_limitations
and handler::ptr_in_record (commit 30894fe9a9).
Rule 2 should be manually maintained with care. Hopefully, the rule 1 assertions
may sometimes help as well.
In ha_myisam::repair, both rules are broken. table->move_fields is used
asymmetrically there: it is set on every param->fix_record call
(i.e. in compute_vcols) but is restored only once, in the end of repair.
The reason to updating field ptr's for every call is that compute_vcols can
(supposedly) be called in parallel, that is, with the same table, but different
records.
The condition to "unmove" the pointers in ha_myisam::restore_vcos_after_repair
is incorrect, when stored vcols are available, and myisam stores a VIRTUAL field
if it's the only field in the table (the record cannot be of zero length).
This patch solves the problem by "unmoving" the pointers symmetrically, in
compute_vcols. That is, both rules will be preserved maintained.
Issue: During mtr_t:commit, if there is not enough space available in
redo log buffer, we flush the buffer. During flush, the LSN lock is
released allowing other concurrent mtr to commit. After flush we
reacquire the lock but use the old LSN obtained before check. It could
lead to redo log corruption. As the LSN moves backwards with the
possibility of data loss and unrecoverable server if the server aborts
for any reason or if server is shutdown with innodb_fast_shutdown=2.
With normal shutdown, recovery fails to map the checkpoint LSN to
correct offset.
In debug mode it hits log0log.cc:863: lsn_t log_t::write_buf()
Assertion `new_buf_free == ((lsn - first_lsn) & write_size_1)' failed.
In release mode, after normal shutdown, restart fails.
[ERROR] InnoDB: Missing FILE_CHECKPOINT(8416546) at 8416546
[ERROR] InnoDB: Log scan aborted at LSN 8416546
Backup fails reading the corrupt redo log.
[00] 2024-07-31 20:59:10 Retrying read of log at LSN=7334851
[00] FATAL ERROR: 2024-07-31 20:59:11 Was only able to copy log from
7334851 to 7334851, not 8416446; try increasing innodb_log_file_size
Unless a backup is tried or the server is shutdown or killed
immediately, the corrupt redo part is eventually truncated and there
may not be any visible issues seen in release mode.
This issue was introduced by the following commit.
commit a635c40648
MDEV-27774 Reduce scalability bottlenecks in mtr_t::commit()
Fix: If we need to release latch and flush redo before writing mtr
logs, make sure to get the latest system LSN after reacquiring the
redo system latch.
Reason:
======
- InnoDB fails to load the instant alter table metadata from
clustered index while loading the table definition.
The reason is that InnoDB metadata blob has the column length
exceeds maximum fixed length column size.
Fix:
===
- InnoDB should treat the long fixed length column as variable
length fields that needs external storage while initializing
the field map for instant alter operation
Problem:
========
- After the commit ada1074bb1 (MDEV-14398)
fil_crypt_set_encrypt_tables() iterates through all tablespaces to
fill the default_encrypt tables list. This was a trigger to
encrypt or decrypt when key rotation age is set to 0. But import
tablespace does call fil_crypt_set_encrypt_tables() unnecessarily.
The motivation for the call is to signal the encryption threads.
Fix:
====
ha_innobase::discard_or_import_tablespace: Remove the
fil_crypt_set_encrypt_tables() and add the import tablespace
to the default encrypt list if necessary