There are many filesystem related errors that can occur with
MariaBackup. These already outputed to stderr with a good description of
the error. Many of these are permission or resource (file descriptor)
limits where the assertion and resulting core crash doesn't offer
developers anything more than the log message. To the user, assertions
and core crashes come across as poor error handling.
As such we return an error and handle this all the way up the stack.
recv_sys_t::parse(): For undo tablespace truncation mini-transactions,
remember the start_lsn instead of the end LSN. This is what we expect
after commit 461402a564 (MDEV-30479).
recv_sys_t::apply(): When applying an undo log truncation operation,
invoke os_file_truncate() on space->recv_size, which must not be
less than the original truncated file size.
Alternatively, as pointed out by Thirunarayanan Balathandayuthapani,
we could assign space->size = t.pages, so that
fil_system_t::extend_to_recv_size() would extend the file back
to space->recv_size.
The solution is to suppress error messages for missing tablespaces if
mariabackup is launched with "--prepare --export" options.
"mariabackup --prepare --export" invokes itself with --mysqld parameter.
If the parameter is set, then it starts server to feed "FLUSH TABLES ...
FOR EXPORT;" queries for exported tablespaces. This is "normal" server
start, that's why new srv_operation value is introduced.
Reviewed by Marko Makela.
This almost completely reverts
commit acd23da4c2 and
retains a safe optimization:
recv_sys_t::parse(): Remove any old redo log records for the
truncated tablespace, to free up memory earlier.
If recovery consists of multiple batches, then recv_sys_t::apply()
will must invoke recv_sys_t::trim() again to avoid wrongly
applying old log records to an already truncated undo tablespace.
- During non-last batch of multi-batch recovery, InnoDB holds
log_sys.mutex and preallocates the block which may intiate
page flush, which may initiate log flush, which requires
log_sys.mutex to acquire again. This leads to assert failure.
So InnoDB recovery should release log_sys.mutex before
preallocating the block.
recv_sys_t::parse(): Discard old page-level redo log when parsing
a TRIM_PAGES record.
recv_sys_t::apply(): trim() was invoked in parse() already.
recv_sys_t::truncated_undo_spaces[]: Only store the size, no LSN.
page_recv_t::trim(): Do remove log records for mini-transactions
that end right at the threshold LSN. This will avoid an inconsistency
where a dirty page had been evicted from the buffer pool during
undo tablespace truncation, and recovery would attempt to apply
log records for which the last available copy in the data file is
too new. These changes would be discarded anyway.
to copy datafile
- Mariabackup fails to copy the undo log tablespace when it undergoes
truncation. So Mariabackup should detect the redo log which does
undo tablespace truncation and also backup should read the minimum
file size of the tablespace and ignore the error while reading.
- Throw error when innodb undo tablespace read failed, but backup
doesn't find the redo log for undo tablespace truncation
recv_log_recover_10_4(): Widen the operand of bitwise and to 64 bits,
so that the upgrade check will work when the redo log record is located
more than 4 gigabytes from the start of the first file.
If a log checkpoint occurs at the end LSN of mtr.commit_shrink(space)
in trx_purge_truncate_history(), then recovery may fail because
it could try to apply too old log records to too old copies of
undo log pages. This was repeated with the following test:
./mtr innodb.undo_log_truncate,4k,strict_full_crc32
recv_sys_t::trim(): Move some code to the caller.
recv_sys_t::apply(): For undo tablespace truncation, discard
all old redo log for the undo tablespace, and then truncate
the file to the desired size.
Tested by: Matthias Leich
The InnoDB write-ahead log ib_logfile0 is of fixed size,
specified by innodb_log_file_size. If the tail of the log
manages to overwrite the head (latest checkpoint) of the log,
crash recovery will be broken.
Let us clarify the messages about this, including adding
a message on the completion of a log checkpoint that notes
that the dangerous situation is over.
To reproduce the dangerous scenario, we will introduce the
debug injection label ib_log_checkpoint_avoid_hard, which will
avoid log checkpoints even harder than the previous
ib_log_checkpoint_avoid.
log_t::overwrite_warned: The first known dangerous log sequence number.
Set in log_close() and cleared in log_write_checkpoint_info(),
which will output a "Crash recovery was broken" message.
srv_shutdown(): Do not call log_free_check(), because it will now
be repeatedly called by ibuf_merge_all(). Do not call
srv_sync_log_buffer_in_background(), because we do not actually care
about durability during shutdown. Log writes will already be triggered
by buf_flush_page_cleaner() for writing back modified pages, possibly by
log_free_check().
logs_empty_and_mark_files_at_shutdown(): Clean up a condition.
This function is the caller of srv_shutdown(), and it will ensure that
the log and the buffer pool will be in clean state before shutdown.
log_phys_t::apply(): When parsing an INSERT_HEAP_DYNAMIC record,
allow ll==rlen to hold for the last part. A secondary index record
may inherit all preceding bytes from the infimum pseudo-record.
For INSERT_HEAP_REDUNDANT, some header bytes will always be present
because the header will never be copied from the page infimum.
We will tolerate ll==rlen also in that case to be consistent with
the parsing of INSERT_HEAP_DYNAMIC.
This bug was found in MariaDB Server 10.6 thanks to the
OPT_PAGE_CHECKSUM record that was implemented
in commit 4179f93d28 for catching
this type of recovery failures.
page_cur_insert_rec_low(): If the previous record is the page infimum,
correctly limit the end of the record. We do not want to copy data from
the header of the page supremum. This omission caused the incorrect
recovery of DB_TRX_ID in an instant ALTER TABLE metadata record, because
part of the DB_TRX_ID was incorrectly copied from the n_owned of the
page supremum, which in recovery would be updated after the copying,
but in normal operation would already have been updated at the time the
common prefix was being determined.
log_phys_t::apply(): If a data page is found to be corrupted, do not
flag the log corrupted but instead return a new status APPLIED_CORRUPTED
so that the caller may discard all log for this page. We do not want
the recovery of unrelated pages to fail in recv_recover_page().
No test case is included, because the known test case would only work
in 10.6, and even after this fix, it would trigger another bug in
instant ALTER TABLE crash recovery.
We will remove the parameter innodb_disallow_writes because it is badly
designed and implemented. The parameter was never allowed at startup.
It was only internally used by Galera snapshot transfer.
If a user executed
SET GLOBAL innodb_disallow_writes=ON;
the server could hang even on subsequent read operations.
During Galera snapshot transfer, we will block writes
to implement an rsync friendly snapshot, as follows:
sst_flush_tables() will acquire a global lock by executing
FLUSH TABLES WITH READ LOCK, which will block any writes
at the high level.
sst_disable_innodb_writes(), invoked via ha_disable_internal_writes(true),
will suspend or disable InnoDB background tasks or threads that could
initiate writes. As part of this, log_make_checkpoint() will be invoked
to ensure that anything in the InnoDB buf_pool.flush_list will be written
to the data files. This has the nice side effect that the Galera joiner
will avoid crash recovery.
The changes to sql/wsrep.cc and to the tests are based on a prototype
that was developed by Jan Lindström.
Reviewed by: Jan Lindström
Only one checkpoint may be in progress at a time.
The counter log_sys.n_pending_checkpoint_writes
was being protected by log_sys.mutex.
Let us replace it with the Boolean log_sys.checkpoint_pending.
If innodb_flush_method=O_DSYNC, log_sys.flushed_to_disk_lsn is changed
without 'flush_lock' protection inside log_write().
This leads to a race condition, if there are 2 threads running in parallel,
doing log_write_up_to() with different values for 'flush_to_disk'
In this case, log_write() and log_write_flush_to_disk_low() can execute at
the same time, and both would change flushed_lsn.
The fix is to remove special treatment of durable writes from log_write().
There is no apparent reason for this special treatment, log_write_flush_to_disk_low()
is already optimized for durable writes.
Nor there is an apparent reason to call log_flush_notify() more often in
for O_DSYNC.
In recv_sys_t::apply(), we were unnecessarily looking up pages
in buf_pool.page_hash and potentially waiting for exclusive page latches.
Before buf_page_get_low() would return an x-latched page,
that page will have to be read and buf_page_read_complete() would
have invoked recv_recover_page() to apply the log to the page.
Therefore, it suffices to invoke recv_read_in_area() to trigger
a transition from RECV_NOT_PROCESSED.
recv_read_in_area(): Take the iterator as a parameter, and remove
page_id lookups. Should the page already be in buf_pool.page_hash,
buf_page_init_for_read() will return nullptr to buf_read_page_low()
and buf_read_page_background().
recv_sys_t::apply(): Replace goto, remove dead code, and add assertions
to guarantee that the iteration will make progress.
Reviewed by: Vladislav Lesin
In MDEV-14425, an early plan was to introduce a separate log file
for file-level records and checkpoint information. The reasoning was
that fil_system.mutex contention would be reduced by not having to
maintain fil_system.named_spaces. The mutex contention was actually
fixed in MDEV-23855 by making some data fields in fil_space_t and
fil_node_t use std::atomic.
Using a single circular log file simplifies recovery and backup.
The problem was introduced by the removal of buf_pool.flush_rbt
in commit 46b1f50098 (MDEV-23399)
recv_sys_t::apply(): don't write to disc and fsync() the last batch.
Insead, sort it by oldest_modification for MariaDB server and some
mariabackup operations.
log_sort_flush_list(): a thread-safe function which sorts buf_pool::flush_list
InnoDB could sometimes hang when triggering a log checkpoint. This is
due to commit 7b1252c03d (MDEV-24278),
which introduced an untimed wait to buf_flush_page_cleaner().
The hang was noticed by occasional failures of IMPORT TABLESPACE tests,
such as innodb.innodb-wl5522, which would (unnecessarily) invoke
log_make_checkpoint() from row_import_cleanup().
The reason of the hang was that buf_flush_page_cleaner() would enter
untimed sleep despite buf_flush_sync_lsn being set. The exact failure
scenario is unclear, because buf_flush_sync_lsn should actually be
protected by buf_pool.flush_list_mutex. We prevent the hang by
invoking buf_pool.page_cleaner_set_idle(false) whenever we are
setting buf_flush_sync_lsn and signaling buf_pool.do_flush_list.
The bulk of these changes was originally developed as a preparation
for MDEV-26827, to invoke buf_flush_list() from fewer threads,
and tested on 10.6 by Matthias Leich.
This fix was tested by running 100 repetitions of 100 concurrent instances
of the test innodb.innodb-wl5522 on a RelWithDebInfo build, using ext4fs
and innodb_flush_method=O_DIRECT on a SATA SSD with 4096-byte block size.
During the test, the call to log_make_checkpoint() in row_import_cleanup()
was present.
buf_flush_list(): Make static.
buf_flush_wait(): Wait for buf_pool.get_oldest_modification()
to reach a target, by work done in the buf_flush_page_cleaner.
If buf_flush_sync_lsn is going to be set, we will invoke
buf_pool.page_cleaner_set_idle(false).
buf_flush_ahead(): If buf_flush_sync_lsn or buf_flush_async_lsn
is going to be set and the page cleaner woken up, we will invoke
buf_pool.page_cleaner_set_idle(false).
buf_flush_wait_flushed(): Invoke buf_flush_wait().
buf_flush_sync(): Invoke recv_sys.apply() at the start in case
crash recovery is active. Invoke buf_flush_wait().
buf_flush_sync_batch(): A lower-level variant of buf_flush_sync()
that is only called by recv_sys_t::apply().
buf_flush_sync_for_checkpoint(): Do not trigger log apply
or checkpoint during recovery.
buf_dblwr_t::create(): Only initiate a buffer pool flush, not
a checkpoint.
row_import_cleanup(): Do not unnecessarily invoke log_make_checkpoint().
Invoking buf_flush_list_space() before starting to generate redo log
for the imported tablespace should suffice.
srv_prepare_to_delete_redo_log_file():
Set recv_sys.recovery_on in order to prevent
buf_flush_sync_for_checkpoint() from initiating a checkpoint
while the log is inaccessible. Remove a wait loop that is already
part of buf_flush_sync().
Do not invoke fil_names_clear() if the log is being upgraded,
because the FILE_MODIFY record is specific to the latest format.
create_log_file(): Clear recv_sys.recovery_on only after calling
log_make_checkpoint(), to prevent buf_flush_page_cleaner from
invoking a checkpoint.
innodb_shutdown(): Simplify the logic in mariadb-backup --prepare.
os_aio_wait_until_no_pending_writes(): Update the function comment.
Apart from row_quiesce_table_start() during FLUSH TABLES...FOR EXPORT,
this is being called by buf_flush_list_space(), which is invoked
by ALTER TABLE...IMPORT TABLESPACE as well as some encryption operations.
During startup, InnoDB must write a FILE_CHECKPOINT record.
However, before MDEV-12353 (in MariaDB Server 10.2, 10.3, 10.4)
the corresponding record MLOG_CHECKPOINT was encoded in a different way.
When we are upgrading from a logically empty 10.2, 10.3, or 10.4 redo log,
we must not write anything to the old log file, because if the server were
killed during the upgrade, we would end up with a corrupted log file, and
both the old and the new server would refuse to start up.
On upgrade, we must simply create a new logically empty log file
and replace the old ib_logfile0 with that.
This is a low hanging fruit. Before this patch std::map::emplace() was
a ~50% of the whole recv_sys_t::parse() operation in by test.
After the fix it's only ~20%.
recv_sys_t::parse() recv_sys_t::pages is a collection of all pages
to recovery. Often, there are multiple changes for a single page.
Often, they go in a row and for such cases let's avoid
lookup in a std::map. cached_pages_it serves as a cache
of size 1.
recv_sys_t::add(): replace page_id argument with a std::map::iterator
In commit 1cb218c37c (MDEV-26450)
we introduced the function log_write_and_flush(), which may
compete with log_checkpoint() invoking log_write_flush_to_disk_low()
from another thread.
The assertion n_pending_flushes==1 is too strict.
There is no possibility of a race condition here, because
fil_flush() is protected by fil_system->mutex and the
rest will be protected by log_sys->mutex.
log_write_flush_to_disk_low(), log_write_and_flush():
Relax the assertions to test for a nonzero count.
At least since commit 055a3334ad
(MDEV-13564) the undo log truncation in InnoDB did not work correctly.
The main issue is that during the execution of
trx_purge_truncate_history() some pages of the newly truncated
undo tablespace could be discarded.
This is improved from commit 1cb218c37c
which was applied to earlier-version branches.
fsp_try_extend_data_file(): Apply the peculiar rounding of
fil_space_t::size_in_header only to the system tablespace,
whose size can be expressed in megabytes in a configuration parameter.
Other files may freely grow by a number of pages.
fseg_alloc_free_page_low(): Do allow the extension of undo tablespaces,
and mention the file name in the error message.
mtr_t::commit_shrink(): Implement crash-safe shrinking of a tablespace:
(1) durably write the log
(2) release the page latches of the rebuilt tablespace
(3) release the mutexes
(4) truncate the file
(5) release the tablespace latch
This is refactored from trx_purge_truncate_history().
log_write_and_flush_prepare(), log_write_and_flush(): New functions
to durably write log during mtr_t::commit_shrink().
At least since commit 055a3334ad
(MDEV-13564) the undo log truncation in InnoDB did not work correctly.
The main issue is that during the execution of
trx_purge_truncate_history() some pages of the newly truncated
undo tablespace could be discarded.
fsp_try_extend_data_file(): Apply the peculiar rounding of
fil_space_t::size_in_header only to the system tablespace,
whose size can be expressed in megabytes in a configuration parameter.
Other files may freely grow by a number of pages.
fseg_alloc_free_page_low(): Do allow the extension of undo tablespaces,
and mention the file name in the error message.
mtr_t::commit_shrink(): Implement crash-safe shrinking of a tablespace
file. First, durably write the log, then shrink the file, and finally
release the page latches of the rebuilt tablespace. Refactored from
trx_purge_truncate_history().
log_write_and_flush_prepare(), log_write_and_flush(): New functions
to durably write log during mtr_t::commit_shrink().