mariadb-backup --prepare could leave behind an empty redo log
file. When --prepare is followed by --prepare --export, we exit early
from srv_start() without opening the ibdata1 tablespace. Later,
while trying to read a rollback segment header page, we hit a debug
assertion that expects the system tablespace to have been opened
already.
There are two cases in which the assertion fails.
Issue-1: The system tablespace object is not present in the fil space
hash, i.e. srv_sys_space.open_or_create() was not called.
Issue-2: The system tablespace data file ibdata1 is not open, i.e.
fil_system.sys_space->open() was not called.
Fix: When the redo log is empty during a restore operation, open the
system tablespace before returning.
The parameter innodb_undo_log_truncate=ON enables a multi-phased logic:
1. Any "producers" (new starting transactions) are prohibited
from using the rollback segments that reside in the undo tablespace.
2. Any transactions that use any of the rollback segments must be
committed or aborted.
3. The purge of committed transaction history must process all the
rollback segments.
4. The undo tablespace is truncated and rebuilt.
5. The rollback segments are re-enabled for new transactions.
There was one flaw in this logic: The first step was not being invoked
as often as it could be, and therefore innodb_undo_log_truncate=ON
would have no chance to work during a heavy write workload.
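As an illustration only, the five phases can be modeled as a small
state machine. The names Phase, UndoSpaceModel and step() are
hypothetical and do not correspond to InnoDB data structures; the
sketch merely shows that nothing can make progress until phase 1 has
been entered.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the phases listed above; not InnoDB code.
enum class Phase
{ BlockProducers, DrainTransactions, PurgeHistory, Truncate, Reenable, Done };

struct UndoSpaceModel
{
  std::size_t active_trx= 0;   // transactions still using the rollback segments
  std::size_t unpurged= 0;     // committed history that purge has not processed
  bool accepts_new_trx= true;  // may new transactions be assigned here?
  bool truncated= false;
  Phase phase= Phase::BlockProducers;
};

// Advance by at most one phase and return the phase after the call.
Phase step(UndoSpaceModel &s)
{
  switch (s.phase) {
  case Phase::BlockProducers:      // 1. stop assigning new transactions
    s.accepts_new_trx= false;
    s.phase= Phase::DrainTransactions;
    break;
  case Phase::DrainTransactions:   // 2. wait for commit or rollback
    if (s.active_trx == 0)
      s.phase= Phase::PurgeHistory;
    break;
  case Phase::PurgeHistory:        // 3. wait for purge to catch up
    if (s.unpurged == 0)
      s.phase= Phase::Truncate;
    break;
  case Phase::Truncate:            // 4. truncate and rebuild
    assert(!s.accepts_new_trx && s.active_trx == 0 && s.unpurged == 0);
    s.truncated= true;
    s.phase= Phase::Reenable;
    break;
  case Phase::Reenable:            // 5. allow new transactions again
    s.accepts_new_trx= true;
    s.phase= Phase::Done;
    break;
  case Phase::Done:
    break;
  }
  return s.phase;
}
```

In this model, a workload that keeps active_trx above zero stalls in
phase 2 only if phase 1 was entered first; if step() is rarely called,
the truncation never starts.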
Independent of innodb_undo_log_truncate, even after
commit 86767bcc0f
we are missing some chances to free processed undo log pages.
If we prohibited the creation of new transactions in one busy
rollback segment at a time, we would be eventually guaranteed
to be able to free such pages.
purge_sys_t::skipped_rseg: The current candidate rollback segment
for shrinking the history independent of innodb_undo_log_truncate.
purge_sys_t::iterator::free_history_rseg(): Renamed from
trx_purge_truncate_rseg_history(). Implement the logic
around purge_sys.m_skipped_rseg.
purge_sys_t::truncate_undo_space: Renamed from truncate.
purge_sys.truncate_undo_space.last: Changed the type to integer
to get rid of some pointer dereferencing and conditional branches.
purge_sys_t::truncating_tablespace(), purge_sys_t::undo_truncate_try():
Refactored from trx_purge_truncate_history().
Set purge_sys.truncate_undo_space.current if applicable,
or return an already set purge_sys.truncate_undo_space.current.
purge_coordinator_state::do_purge(): Invoke
purge_sys_t::truncating_tablespace() as part of the normal work loop,
to implement innodb_undo_log_truncate=ON as often as possible.
trx_purge_truncate_rseg_history(): Remove a redundant parameter.
trx_undo_truncate_start(): Replace dead code with a debug assertion.
Correctness tested by: Matthias Leich
Performance tested by: Axel Schwenke
Reviewed by: Debarun Banerjee
- InnoDB fails to find the space id in page0 of
the tablespace. In that case, InnoDB can use the
doublewrite buffer to recover page0 and write it
into the file.
- buf_dblwr_t::init_or_load_pages(): Load only the pages
that are valid (page LSN >= checkpoint LSN). To do that,
InnoDB has to open the redo log before the system
tablespace and read the latest checkpoint information.
recv_dblwr_t::find_first_page():
1) Iterate the doublewrite buffer pages and find the 0th page
2) Read the tablespace flags, space id from the 0th page.
3) Read the 1st, 2nd and 3rd page from tablespace file and
compare the space id with the space id which is stored
in doublewrite buffer.
4) If it matches then we can write into the file.
5) Return space which matches the pages from the file.
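The steps above can be sketched on a simplified in-memory model.
DblwrPage and FilePages are hypothetical types, and the sketch assumes
that all three pages read from the file must carry the same space id as
the doublewrite copy of page 0; it omits the checksum and page size
handling that a real implementation would need.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical summary of one page stored in the doublewrite buffer.
struct DblwrPage
{
  uint32_t space_id;
  uint32_t page_no;
  uint32_t flags;     // tablespace flags, meaningful for page 0 only
};

// Space ids read from pages 1, 2 and 3 of the data file (assumed input).
using FilePages= std::array<uint32_t, 3>;

// Return the doublewrite copy of page 0 whose space id agrees with the
// pages read from the file, or nullptr if there is no such copy.
const DblwrPage *find_first_page(const std::vector<DblwrPage> &dblwr,
                                 const FilePages &file_pages)
{
  for (const DblwrPage &p : dblwr)
  {
    if (p.page_no != 0)
      continue;                      // step 1: only page 0 is of interest
    bool match= true;
    for (uint32_t id : file_pages)   // step 3: compare with pages 1..3
      if (id != p.space_id)
      {
        match= false;
        break;
      }
    if (match)
      return &p;                     // steps 4 and 5: safe to restore
  }
  return nullptr;
}
```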
SysTablespace::read_lsn_and_check_flags(): Remove the
retry logic for validating the first page. After
restoring the first page from doublewrite buffer,
assign tablespace flags by reading the first page.
recv_recovery_read_max_checkpoint(): Read the maximum
checkpoint information from the log file.
recv_recovery_from_checkpoint_start(): Avoid reading
the checkpoint header information from the log file.
Datafile::validate_first_page(): Throw an error if
first page validation fails.
srv_start(): Move a read only mode startup tweak from
innodb_init_params() to the correct location. Also if
innodb_force_recovery=6 we will disable the doublewrite buffer,
because InnoDB must run in read-only mode to prevent further corruption.
This change only affects debug checks. Whenever srv_read_only_mode holds,
the buf_pool.flush_list will be empty, that is, there will be no writes
of persistent InnoDB data pages.
Reviewed by: Thirunarayanan Balathandayuthapani
dict_find_max_space_id(): Return SELECT MAX(SPACE) FROM SYS_TABLES.
dict_check_tablespaces_and_store_max_id(): In the normal case
(no encryption plugin has been loaded and the change buffer is empty),
invoke dict_find_max_space_id() and do not open any .ibd files.
If a std::set<uint32_t> has been specified, open the files whose
tablespace ID is mentioned. Else, open all data files that are identified
by SYS_TABLES records.
fil_ibd_open(): Remove a call to os_file_get_last_error() that can
report a misleading error, such as EINVAL inside my_realpath() that is
not an actual error. This could be invoked when a data file is found
but the FSP_SPACE_FLAGS are incorrect, such as is the case for
table test.td in
./mtr --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 innodb.table_flags
buf_load(): If any tablespaces could not be found, invoke
dict_check_tablespaces_and_store_max_id() on the missing tablespaces.
dict_load_tablespace(): Try to load the tablespace unless it was found
to be futile. This fixes failures related to FTS_*.ibd files for
FULLTEXT INDEX.
btr_cur_t::search_leaf(): Prevent a crash when the tablespace
does not exist. This was caught by the test innodb_fts.fts_concurrent_insert
when the change to dict_load_tablespace() was not present.
We modify a few tests to ensure that tables will not be loaded at startup.
For some fault injection tests this means that the corrupted tables
will not be loaded, because dict_load_tablespace() would perform stricter
checks than dict_check_tablespaces_and_store_max_id().
Tested by: Matthias Leich
Reviewed by: Thirunarayanan Balathandayuthapani
innodb_preshutdown(): Only wait for active transactions to be terminated
if InnoDB was started and innodb_force_recovery=3 or larger does not
prevent a rollback.
This fixes the following:
./mtr --parallel=auto --mysqld=--innodb-fast-shutdown=0 \
innodb.log_file_size innodb.innodb_force_recovery \
innodb.read_only_recovery innodb.read_only_recover_committed \
mariabackup.apply-log-only-incr
A slow shutdown using the previous default innodb_purge_batch_size=300
could be extremely slow, employing at most a few CPU cores on average.
Let us use the maximum batch size in order to increase throughput.
Reviewed by: Vladislav Lesin
The InnoDB table lookup in purge worker threads is a bottleneck that can
degrade a slow shutdown to utilize less than 2 threads. Let us fix that
bottleneck by constructing a local lookup table that does not require any
synchronization while the undo log records of the current batch
are being processed.
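A minimal sketch of the idea, assuming a hypothetical Table type and a
SlowLookup callback that stands in for the synchronized dictionary
lookup. Each distinct table ID in the batch is resolved once, and the
resulting map is then read without any locking.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

struct Table { uint64_t id; };   // hypothetical stand-in for a table object

// Hypothetical stand-in for the expensive, synchronized lookup.
using SlowLookup= std::function<Table*(uint64_t)>;

// Build a batch-local map from table ID to table. The initial bucket
// count plays the role of a tuning constant that avoids early rehashing.
std::unordered_map<uint64_t, Table*>
build_batch_tables(const std::vector<uint64_t> &undo_table_ids,
                   const SlowLookup &lookup,
                   std::size_t initial_buckets= 16)
{
  std::unordered_map<uint64_t, Table*> tables(initial_buckets);
  for (uint64_t id : undo_table_ids)
    if (tables.find(id) == tables.end())
      tables.emplace(id, lookup(id));   // one slow lookup per distinct ID
  return tables;
}
```

A nullptr entry records that the table could not be found, so the
lookup is not retried for later records of the same batch.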
TRX_PURGE_TABLE_BUCKETS: The initial number of std::unordered_map
hash buckets used during a purge batch. This could avoid some
resizing and rehashing in trx_purge_attach_undo_recs().
purge_node_t::tables: A lookup table from table ID to an already
looked up and locked table. Replaces many fields.
trx_purge_attach_undo_recs(): Look up each table in the purge batch
only once.
trx_purge(): Close all tables and release MDL at the end of the batch.
trx_purge_table_open(), trx_purge_table_acquire(): Open a table in purge
and acquire a metadata lock on it. This replaces
dict_table_open_on_id<true>() and dict_acquire_mdl_shared().
purge_sys_t::close_and_reopen(): In case of an MDL conflict, close and
reopen all tables that are covered by the current purge batch.
It may be that some of the tables have been dropped meanwhile and can
be ignored. This replaces wait_SYS() and wait_FTS().
row_purge_parse_undo_rec(): Make purge_coordinator_task issue a
MDL warrant to any purge_worker_task which might need it
when innodb_purge_threads>1.
purge_node_t::end(): Clear the MDL warrant.
Reviewed by: Vladislav Lesin and Vladislav Vaintroub
purge_coordinator_state::do_purge(): Simply use all innodb_purge_threads,
no matter what the LSN age is. During shutdown with innodb_fast_shutdown=0
this code could degrade to using only 1 thread.
Also, restore periodic "InnoDB: to purge" messages that were
accidentally disabled in commit 80585c9d6f.
Reviewed by: Vladislav Lesin and Vladislav Vaintroub
purge_sys_t::wake_if_not_active(): Replaces
srv_wake_purge_thread_if_not_active().
innodb_ddl_recovery_done(): Move the wakeup call to
srv_init_purge_tasks().
purge_coordinator_timer: Remove. The srv_master_callback() already
invokes purge_sys.wake_if_not_active() once per second.
Reviewed by: Vladislav Lesin and Vladislav Vaintroub
The motivation for introducing the parameter
innodb_purge_rseg_truncate_frequency in
mysql/mysql-server@28bbd66ea5 and
mysql/mysql-server@8fc2120fed
seems to have been to avoid stalls due to freeing undo log pages
or truncating undo log tablespaces. In MariaDB Server,
innodb_undo_log_truncate=ON should be a much lighter operation
than in MySQL, because it will not involve any log checkpoint.
Another source of performance stalls should be
trx_purge_truncate_rseg_history(), which is shrinking the history list
by freeing the undo log pages whose undo records have been purged.
To alleviate that, we will introduce a purge_truncation_task that will
offload this from the purge_coordinator_task. In that way, the next
innodb_purge_batch_size pages may be parsed and purged while the pages
from the previous batch are being freed and the history list being shrunk.
The processing of innodb_undo_log_truncate=ON will still remain the
responsibility of the purge_coordinator_task.
purge_coordinator_state::count: Remove. We will ignore
innodb_purge_rseg_truncate_frequency, and act as if it had been
set to 1 (the maximum shrinking frequency).
purge_coordinator_state::do_purge(): Invoke an asynchronous task
purge_truncation_callback() to free the undo log pages.
purge_sys_t::iterator::free_history(): Free those undo log pages
that have been processed. This used to be a part of
trx_purge_truncate_history().
purge_sys_t::clone_end_view(): Take a new value of purge_sys.head
as a parameter, so that it will be updated while holding exclusive
purge_sys.latch. This is needed for race-free access to the field
in purge_truncation_callback().
Reviewed by: Vladislav Lesin
srv_all_undo_tablespaces_open(): While opening the extra unused
undo tablespaces, InnoDB should use ULINT_UNDEFINED instead of
SRV_SPACE_ID_UPPER_BOUND.
In MemorySanitizer builds of 10.10 and 10.11, we would rather often
have the assertion fail in innodb_init() during mariadb-backup --prepare.
The assertion could also fail during InnoDB startup, but less often.
Before commit 685d958e38 in 10.8 the
log file cleanup after a successfully applied backup is different,
and the os_aio_pending_writes() assertion is in srv0start.cc.
IORequest::write_complete(): Invoke node->complete_write() before
releasing the page latch, so that a log checkpoint that is about to
execute concurrently will not miss a fdatasync() or fsync() on the
file, in case this was the first write since the last such call.
create_log_file(), srv_start(): Replace the debug assertion with
a debug check. For all intents and purposes, all writes could have
been completed but some write_io_callback() may not have invoked
io_slots::release() yet.
Problem:
========
- InnoDB fails to open an undo tablespace when page0 is corrupted,
and it fails to report an error.
Solution:
=========
- InnoDB throws a DB_CORRUPTION error when it encounters
page0 corruption of an undo tablespace.
- InnoDB restores page0 of the undo tablespace from the
doublewrite buffer if it encounters page corruption.
- Moved Datafile::restore_from_doublewrite() to
recv_dblwr_t::restore_first_page(), so that the undo
tablespace and system tablespace code can share this function
instead of duplicating it.
srv_undo_tablespace_open(): Return 0 if the file does not exist,
or ULINT_UNDEFINED if page0 is corrupted.
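A caller could distinguish the outcomes roughly as follows. This is a
sketch under assumptions: UNDEFINED_ID stands in for ULINT_UNDEFINED
and is assumed to be the all-ones value, and any other nonzero return
value is assumed to indicate a successfully opened tablespace. The
names UndoOpen and classify_undo_open() are hypothetical.

```cpp
#include <cstddef>
#include <limits>

// Stand-in for ULINT_UNDEFINED (assumed to be the all-ones value).
constexpr std::size_t UNDEFINED_ID= std::numeric_limits<std::size_t>::max();

enum class UndoOpen { Missing, Corrupted, Opened };

// Map the return value convention described above to an explicit outcome.
constexpr UndoOpen classify_undo_open(std::size_t ret)
{
  return ret == 0 ? UndoOpen::Missing
    : ret == UNDEFINED_ID ? UndoOpen::Corrupted
    : UndoOpen::Opened;   // assumption: any other value means success
}
```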
The MONITOR_OVLD_ROW_LOCK_CURRENT_WAIT monitor should have the
MONITOR_DISPLAY_CURRENT flag set in its definition, because it shows the
current state and does not accumulate anything.
Reviewed by: Marko Mäkelä
Add threadpool functionality to restrict concurrency during "batch"
periods (where tasks are added in rapid succession).
This will throttle thread creation more aggressively than usual, while
keeping performance at least on par.
One of these cases is buffer pool load, where async read IOs are executed
without any throttling. There can be as many as 650K read IOs for
loading a 10GB buffer pool.
Another one is recovery, where "fake read" IOs are executed.
Why are there more threads than we expect?
Worker threads are not recognized as idle until they return to the
standby list, and to return to that list, they need to acquire the
mutex that is currently held in submit_task(). In those cases, submit_task()
has no worker to wake, and would create threads until the default concurrency
level (2*ncpus) is satisfied. Only after that would throttling happen.
innodb_max_purge_lag_wait_update(): Return immediately if we are
in high_level_read_only mode.
srv_wake_purge_thread_if_not_active(): Relax a debug assertion.
If srv_read_only_mode holds, purge_sys.enabled() will not hold
and this function will do nothing.
trx_t::commit_in_memory(): Remove a redundant condition before
invoking srv_wake_purge_thread_if_not_active().
In commit 03ca6495df and
commit ff5d306e29
we forgot to remove some Google copyright notices related to
a contribution of using atomic memory access in the old InnoDB
mutex_t and rw_lock_t implementation.
The copyright notices had been mostly added in
commit c6232c06fa
due to commit a1bb700fd2.
The following Google contributions remain:
* some logic related to the parameter innodb_io_capacity
* innodb_encrypt_tables, added in MariaDB Server 10.1
Before MDEV-24671, the wait time was derived from my_interval_timer() /
1000 (nanoseconds converted to microseconds, and not microseconds to
milliseconds like I must have assumed). The lock_sys.wait_time and
lock_sys.wait_time_max are already in milliseconds; we should not divide
them by 1000.
In MDEV-24738 the millisecond counts lock_sys.wait_time and
lock_sys.wait_time_max were changed to a 32-bit type. That would
overflow in 49.7 days. Keep using a 64-bit type for those millisecond
counters.
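The arithmetic behind the 49.7 days and the unit mix-up can be checked
with a few lines. The helper names below are hypothetical and exist
only for this illustration.

```cpp
#include <cstdint>
#include <limits>

// Number of days until an unsigned millisecond counter of type T wraps.
template<typename T>
constexpr double days_until_wrap()
{
  return (static_cast<double>(std::numeric_limits<T>::max()) + 1.0) /
    (1000.0 * 60 * 60 * 24);
}

// Conversions of a nanosecond timer value. Dividing by 1000 yields
// microseconds; milliseconds require dividing by 1000000.
constexpr uint64_t ns_to_us(uint64_t ns) { return ns / 1000; }
constexpr uint64_t ns_to_ms(uint64_t ns) { return ns / 1000000; }

static_assert(days_until_wrap<uint32_t>() > 49.7 &&
              days_until_wrap<uint32_t>() < 49.72,
              "a 32-bit millisecond counter wraps in about 49.7 days");
```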
Reviewed by: Marko Mäkelä
innodb_undo_log_truncate_update(): A callback function. If
SET GLOBAL innodb_undo_log_truncate=ON, invoke
srv_wake_purge_thread_if_not_active().
srv_wake_purge_thread_if_not_active(): If innodb_undo_log_truncate=ON,
always wake up the purge subsystem.
srv_do_purge(): If the history is empty, invoke
trx_purge_truncate_history() in order to free undo log pages.
trx_purge_truncate_history(): If head.trx_no==0, consider the
cached undo logs to be free.
trx_purge(): Remove the parameter "bool truncate" and let the
caller invoke trx_purge_truncate_history() directly.
Reviewed by: Vladislav Lesin
srv_export_innodb_status(): Update
export_vars.innodb_buffer_pool_read_requests as it was done
before commit a55b951e60 (MDEV-26827).
If innodb_status_variables[] pointed to a sharded variable, it would
only access the first shard.
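To illustrate the problem, here is a hypothetical sharded counter; it
is not InnoDB's own counter type. Reading the object through a plain
pointer to its start observes only the first shard, whereas the
exported status value needs the sum over all shards.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sharded counter with N independent slots.
template<std::size_t N>
struct ShardedCounter
{
  std::array<uint64_t, N> shard{};

  void add(std::size_t slot, uint64_t n) { shard[slot % N]+= n; }

  // What a reader sees if it treats the object as a single integer.
  uint64_t first_shard_only() const { return shard[0]; }

  // The value that should be exported: the sum over all shards.
  uint64_t total() const
  {
    uint64_t sum= 0;
    for (uint64_t s : shard)
      sum+= s;
    return sum;
  }
};
```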
trx_purge_truncate_history(): Only call trx_purge_truncate_rseg_history()
if the rollback segment is safe to process. This will avoid leaking undo
log pages that are not yet ready to be processed. This fixes a regression
that was introduced in
commit 0de3be8cfd (MDEV-30671).
trx_sys_t::any_active_transactions(): Separately count XA PREPARE
transactions.
srv_purge_should_exit(): Terminate slow shutdown if the history size
does not change and XA PREPARE transactions exist in the system.
This will avoid a hang of the test innodb.recovery_shutdown.
Tested by: Matthias Leich
The InnoDB buffer pool and locking were heavily refactored in
MariaDB Server 10.6. Among other things, dict_sys.mutex was removed,
and the contended lock_sys.mutex was replaced with a combination of
lock_sys.latch and distributed latches in hash tables. Also, the
default was changed to innodb_flush_method=O_DIRECT to improve
performance in write-heavy workloads.
One thing where an adjustment was missing is around the parameters
innodb_max_purge_lag (number of committed transactions waiting to
be purged), and innodb_max_purge_lag_delay
(maximum number of microseconds to delay a DML operation).
purge_coordinator_state::do_purge(): Pass the history_size to trx_purge()
and reset srv_dml_needed_delay if the history is empty.
Keep executing the loop non-stop as long as srv_dml_needed_delay is set.
trx_purge_dml_delay(): Made part of trx_purge().
Set srv_dml_needed_delay=0 when nothing can be purged (!n_pages_handled).
row_mysql_delay_if_needed(): Mimic the logic of
innodb_max_purge_lag_wait_update().
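For orientation, the interplay of the two parameters can be sketched
as below. The function name dml_delay_us() is hypothetical, and the
linear formula with its constants is an assumption made for
illustration; only the roles of the inputs (history size, lag
threshold, cap in microseconds) come from the description above.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative only: no delay while the history size is within
// max_purge_lag (or the feature is disabled with 0); beyond that the
// delay grows linearly with the ratio and is capped at max_delay_us.
// The constants 0.5 and 10000 are assumptions of this sketch.
uint64_t dml_delay_us(uint64_t history_size, uint64_t max_purge_lag,
                      uint64_t max_delay_us)
{
  if (max_purge_lag == 0 || history_size <= max_purge_lag)
    return 0;
  const double ratio= static_cast<double>(history_size) /
    static_cast<double>(max_purge_lag);
  const uint64_t delay= static_cast<uint64_t>((ratio - 0.5) * 10000);
  return std::min(delay, max_delay_us);
}
```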
Reviewed by: Thirunarayanan Balathandayuthapani
When commit a5a2ef079c
implemented asynchronous doublewrite, the writes via
the doublewrite buffer started to be counted incorrectly,
without multiplying them by innodb_page_size.
srv_export_innodb_status(): Correctly count the
Innodb_data_written.
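The accounting error amounts to adding a page count to a byte count.
A minimal sketch with hypothetical names, not the actual fields used
by srv_export_innodb_status():

```cpp
#include <cstdint>

// Bytes written = bytes written directly + doublewrite pages converted
// to bytes. Omitting the multiplication by page_size undercounts.
constexpr uint64_t data_written_bytes(uint64_t direct_bytes,
                                      uint64_t dblwr_pages,
                                      uint64_t page_size)
{
  return direct_bytes + dblwr_pages * page_size;
}

static_assert(data_written_bytes(0, 1, 16384) == 16384,
              "one doublewrite page counts as one full page of bytes");
```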
buf_dblwr_t: Remove submitted(), because it is close to written()
and only Innodb_data_written was interested in it. According to
its name, it should count completed and not submitted writes.
Tested by: Axel Schwenke
os_aio_wait_until_no_pending_reads(), os_aio_wait_until_no_pending_writes():
Add a Boolean parameter to indicate whether the wait should be declared
in the thread pool.
buf_flush_wait(): The callers have already declared a wait, so let us
avoid doing that again and just call os_aio_wait_until_no_pending_writes(false).
buf_flush_wait_flushed(): Do not declare a wait in the rare case that
the buf_flush_page_cleaner thread has been shut down already.
buf_flush_page_cleaner(), buf_flush_buffer_pool(): In the code that runs
during shutdown, do not declare waits.
buf_flush_buffer_pool(): Remove a debug assertion that might fail.
What really matters here is buf_pool.flush_list.count==0.
buf_read_recv_pages(), srv_prepare_to_delete_redo_log_file():
Do not declare waits during InnoDB startup.
Let us remove explicit updates of MONITOR_NUM_UNDO_SLOT_USED
and MONITOR_NUM_UNDO_SLOT_CACHED, and let us compute the rough values
from trx_sys.rseg_array[] on demand.
buf_LRU_get_free_block(): Always wake up the page cleaner if needed
before exiting the inner loop.
srv_prepare_to_delete_redo_log_file():
Replace a debug assertion with a wait in debug builds.
Starting with commit 7e31a8e7fa
the debug assertion ut_ad(!os_aio_pending_writes())
could occasionally fail, while it would hold in core dumps of crashes.
The failure can be reproduced more easily by adding a sleep to the
write completion callback function, right before releasing to
write_slots.
srv_start(): Remove a bogus debug assertion
ut_ad(!os_aio_pending_writes()) that could fail in
mariadb-backup --prepare. In an rr replay trace, we had
buf_pool.flush_list.count==0 but write_slots->m_cache.m_pos==1
and buf_page_t::write_complete() was executing u_unlock().
fil_space_t::create(), fil_space_t::add(): Expect the caller to
acquire and release fil_system.mutex. In this way, creating a tablespace
and adding the first (usually only) data file will be atomic.
recv_sys_t::recover_deferred(): Correctly protect some changes by
holding fil_system.mutex.
Tested by: Matthias Leich
The solution is to suppress error messages for missing tablespaces if
mariabackup is launched with the "--prepare --export" options.
"mariabackup --prepare --export" invokes itself with the --mysqld parameter.
If the parameter is set, it starts the server in order to feed "FLUSH TABLES ...
FOR EXPORT;" queries for the exported tablespaces. This is a "normal" server
start, which is why a new srv_operation value is introduced.
Reviewed by Marko Makela.