The directio(3C) function on Solaris is supported on NFS and UFS
while the majority of users should be on ZFS, which is a copy-on-write
file system that implements transparent compression and therefore
cannot support unbuffered I/O.
Let us remove the call to directio() and simply treat
innodb_flush_method=O_DIRECT in the same way as the previous
default value innodb_flush_method=fsync on Solaris. Also, let us
remove some dead code around calls to os_file_set_nocache() on
platforms where fcntl(2) is not usable with O_DIRECT.
On IBM AIX, O_DIRECT is not documented for fcntl(2), only for open(2).
os_file_set_size(): Let us invoke the Linux system call fallocate(2)
directly, because the GNU libc posix_fallocate() implements a fallback
that writes to the file 1 byte every 4096 or fewer bytes. In one
environment, invoking fallocate() directly would lead to 4 times the
file growth rate during ALTER TABLE. Presumably, what happened was
that the NFS server used a smaller allocation block size than 4096 bytes
and therefore created a heavily fragmented sparse file when
posix_fallocate() was used. For example, extending a file by 4 MiB
would create 1,024 file fragments. When the file is actually being
written to with data, it would be "unsparsed".
The built-in EOPNOTSUPP fallback in os_file_set_size() writes a buffer
of 1 MiB of NUL bytes. This was always used on musl libc and other
Linux implementations of posix_fallocate().
During mariadb-backup --backup, a table could be renamed, created and
dropped. We could have both oldname.ibd and oldname.new, and one of
the files would be deleted before the InnoDB recovery starts. The desired
end result would be that we will recover both oldname.ibd and newname.ibd.
During normal crash recovery, at most one file operation (create, rename,
delete) may require to be replayed from the write-ahead log before the
DDL recovery starts.
deferred_spaces.create(): In mariadb-backup --prepare, try to create the
file in case it does not exist.
fil_name_process(): Display a message about not found files not only
if innodb_force_recovery is set, but also in mariadb-backup --prepare.
If we are processing a FILE_RENAME for a tablespace whose recovery is
deferred, suppress the message and adjust the file name in case
fil_ibd_load() returns FIL_LOAD_NOT_FOUND or FIL_LOAD_DEFER.
fil_ibd_load(): Remove a redundant file name comparison.
The caller already compared that the file names are different.
We used to wrongly return FIL_LOAD_OK instead of FIL_LOAD_ID_CHANGED
if only the schema name differed, such as a/t1.ibd and b/t1.ibd.
Tested by: Matthias Leich
Reviewed by: Thirunarayanan Balathandayuthapani
is_file_on_ssd() is more expensive than it should be.
It caches the results by volume name, but still calls GetVolumePathName()
every time, which, as procmon shows, opens multiple directories in
filesystem hierarchy (db directory, datadir, and all ancestors)
The fix is to cache SSD status by volume serial ID, which is cheap to
retrieve with GetFileInformationByHandleEx()
buf_page_t::write_complete(), buf_page_write_complete(),
IORequest::write_complete(): Add a parameter for passing
an error code. If an error occurred, we will release the
io-fix, buffer-fix and page latch but not reset the
oldest_modification field. The block would remain in
buf_pool.LRU and possibly buf_pool.flush_list, to be written
again later, by buf_flush_page_cleaner(). If all page writes
start consistently failing, all write threads should eventually
hang in log_free_check() because the log checkpoint cannot
be advanced to make room in the circular write-ahead-log ib_logfile0.
IORequest::read_complete(): Add a parameter for passing
an error code. If a read operation fails, we report the error
and discard the page, just like we would do if the page checksum
was not validated or the page could not be decrypted.
This only affects asynchronous reads, due to linear or random read-ahead
or crash recovery. When buf_page_get_low() invokes buf_read_page(),
that will be a synchronous read, not involving this code.
This was tested by randomly injecting errors in
write_io_callback() and read_io_callback(), like this:
if (!ut_rnd_interval(100))
cb->m_err= 42;
This is a 10.6 port of commit 2f9e264781
from MariaDB Server 10.9 that is missing some optimization due to a
more complex redo log format and recovery logic
(which was simplified in commit 685d958e38).
The progress reporting of InnoDB crash recovery was rather intermittent.
Nothing was reported during the single-threaded log record parsing, which
could consume minutes when parsing a large log. During log application,
there only was progress reporting in background threads that would be
invoked on data page read completion.
The progress reporting here will be detailed like this:
InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
InnoDB: Read redo log up to LSN=1963895808
InnoDB: Multi-batch recovery needed at LSN 2534560930
InnoDB: Read redo log up to LSN=3312233472
InnoDB: Read redo log up to LSN=1599646720
InnoDB: Read redo log up to LSN=2160831488
InnoDB: To recover: LSN 2806789376/2806819840; 195082 pages
InnoDB: To recover: LSN 2806789376/2806819840; 63507 pages
InnoDB: Read redo log up to LSN=3195776000
InnoDB: Read redo log up to LSN=3687099392
InnoDB: Read redo log up to LSN=4165315584
InnoDB: To recover: LSN 4374395699/4374440960; 241454 pages
InnoDB: To recover: LSN 4374395699/4374440960; 123701 pages
InnoDB: Read redo log up to LSN=4508724224
InnoDB: Read redo log up to LSN=5094550528
InnoDB: To recover: 205230 pages
The previous messages "Starting a batch to recover" or
"Starting a final batch to recover" will be replaced by
"To recover: ... pages" messages.
If a batch lasts longer than 15 seconds, then there will be
progress reports every 15 seconds, showing the number of remaining pages.
For the non-final batch, the "To recover:" message includes two end LSN:
that of the batch, and of the recovered log. This is the primary measure
of progress. The batch will end once the number of pages to recover
reaches 0.
If recovery is possible in a single batch, the output will look like this,
with a shorter "To recover:" message that counts only the remaining pages:
InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
InnoDB: Read redo log up to LSN=1984539648
InnoDB: Read redo log up to LSN=2710875136
InnoDB: Read redo log up to LSN=3358895104
InnoDB: Read redo log up to LSN=3965299712
InnoDB: Read redo log up to LSN=4557417472
InnoDB: Read redo log up to LSN=5219527680
InnoDB: To recover: 450915 pages
We will also speed up recovery by improving the memory management and
implementing multi-threaded recovery of data pages that will not need
to be read into the buffer pool ("fake read"). Log application in the
"fake read" threads will be protected by an atomic being_recovered field
and exclusive buf_page_t::lock.
Recovery will reserve for data pages two thirds of the buffer pool,
or 256 pages, whichever is smaller. Previously, we could only use at most
one third of the buffer pool for buffered log records. This would typically
mean that with large buffer pools, recovery unnecessary consisted of
multiple batches.
If recovery runs out of memory, it will "roll back" or "rewind" the current
mini-transaction. The recv_sys.recovered_lsn and recv_sys.pages
will correspond to the "out of memory LSN", at the end of the previous
complete mini-transaction.
If recovery runs out of memory while executing the final recovery batch,
we can simply invoke recv_sys.apply(false) to make room, and resume
parsing.
If recovery runs out of memory before the final batch, we will
scan the redo log to the end and check for any missing or inconsistent
files. In this version of the patch, we will throw away any previously
buffered recv_sys.pages and rescan the log from the checkpoint onwards.
recv_sys_t::pages_it: A cached iterator to recv_sys.pages.
recv_sys_t::is_memory_exhausted(): Remove. We will have out-of-memory
handling deep inside recv_sys_t::parse().
recv_sys_t::rewind(), page_recv_t::recs_t::rewind():
Remove all log starting with a specific LSN.
IORequest::write_complete(), IORequest::read_complete():
Replaces fil_aio_callback().
read_io_callback(), write_io_callback(): Replaces io_callback().
IORequest::fake_read_complete(), fake_io_callback(), os_fake_read():
Process a "fake read" request for concurrent recovery.
recv_sys_t::apply_batch(): Choose a number of successive pages
for a recovery batch.
recv_sys_t::erase(recv_sys_t::map::iterator): Remove log records for a
page whose recovery is not in progress. Log application threads
will not invoke this; they will only set being_recovered=-1 to indicate
that the entry is no longer needed.
recv_sys_t::garbage_collect(): Remove all being_recovered=-1 entries.
recv_sys_t::wait_for_pool(): Wait for some space to become available
in the buffer pool.
mlog_init_t::mark_ibuf_exist(): Avoid calls to
recv_sys::recover_low() via ibuf_page_exists() and buf_page_get_low().
Such calls would lead to double locking of recv_sys.mutex, which
depending on implementation could cause a deadlock. We will use
lower-level calls to look up index pages.
buf_LRU_block_remove_hashed(): Disable consistency checks for freed
ROW_FORMAT=COMPRESSED pages. Their contents could be uninitialized garbage.
This fixes an occasional failure of the test
innodb.innodb_bulk_create_index_debug.
Tested by: Matthias Leich
os_aio_wait_until_no_pending_reads(), os_aio_wait_until_pending_writes():
Add a Boolean parameter to indicate whether the wait should be declared
in the thread pool.
buf_flush_wait(): The callers have already declared a wait, so let us
avoid doing that again, just call os_aio_wait_until_pending_writes(false).
buf_flush_wait_flushed(): Do not declare a wait in the rare case that
the buf_flush_page_cleaner thread has been shut down already.
buf_flush_page_cleaner(), buf_flush_buffer_pool(): In the code that runs
during shutdown, do not declare waits.
buf_flush_buffer_pool(): Remove a debug assertion that might fail.
What really matters here is buf_pool.flush_list.count==0.
buf_read_recv_pages(), srv_prepare_to_delete_redo_log_file():
Do not declare waits during InnoDB startup.
io_callback(): Process the request before releasing the write slot.
Before commit a091d6ac4e
when we had a duplicated counter for writes, either ordering was fine.
Now, correctness depends on os_aio_wait_until_no_pending_writes().
tpool::cache::m_mtx: Add PERFORMANCE_SCHEMA instrumentation
(wait/synch/mutex/innodb/tpool_cache_mutex). This covers the
InnoDB read_slots and write_slots for asynchronous data page I/O.
handle_slave_io(), handle_slave_sql(), os_thread_exit():
Remove a redundant pthread_exit(nullptr) call, because it
would cause SIGSEGV.
mysql_print_status(): Add MEM_MAKE_DEFINED() to work around
some missing instrumentation around mallinfo2().
que_graph_free_stat_list(): Invoke que_node_get_next(node) before
que_graph_free_recursive(node). That is the logical and
MSAN_OPTIONS=poison_in_dtor=1 compatible way of freeing memory.
ins_node_t::~ins_node_t(): Invoke mem_heap_free(entry_sys_heap).
que_graph_free_recursive(): Rely on ins_node_t::~ins_node_t().
fts_t::~fts_t(): Invoke mem_heap_free(fts_heap).
fts_free(): Replace with direct calls to fts_t::~fts_t().
The failures in free_root() due to MSAN_OPTIONS=poison_in_dtor=1
will be covered in MDEV-30942.
The solution is to suppress error messages for missing tablespaces if
mariabackup is launched with "--prepare --export" options.
"mariabackup --prepare --export" invokes itself with --mysqld parameter.
If the parameter is set, then it starts server to feed "FLUSH TABLES ...
FOR EXPORT;" queries for exported tablespaces. This is "normal" server
start, that's why new srv_operation value is introduced.
Reviewed by Marko Makela.
recv_sys_t::recover_deferred(): If the file has been determined
to be large enough, skip the call to os_file_set_size(), which
would use the current value of FSP_SIZE, which during a multi-batch
recovery can be smaller than the actual file size.
os_file_io(): Also display the file offset in the warning message about
partial I/O.
os_file_read(): Merged with os_file_read_no_error_handling().
Crashing on a partial page read is as unhelpful as crashing on a
corrupted page read (commit 0b47c126e3).
Report the file name if it is available via IORequest.
Let us use the normal platform-specific preprocessor symbols
__linux__, __sun__, _AIX instead of some homebrew ones.
The preprocessor symbol UNIV_HPUX must have lost its meaning
by f6deb00a56 (note: the symbol
UNIV_HPUX10 is being checked for, but only UNIV_HPUX is defined).
Removed use std::vector's ba push_back(), pop_back() to make it more
obvious that memory in the vectors won't be reallocated.
Also, "borrowed" elements can be debugged a little better now,
they are put into the start of the m_cache vector.
Starting with commit da094188f6 (MDEV-24393),
MariaDB will no longer acquire advisory file locks on InnoDB data
files by default, because it would create a large number of
entries in Linux /proc/locks.
The motivation for acquiring the file locks is to prevent accidental
concurrent startup of multiple server processes on the same data files.
Such mistake still turns out to be relatively common, based on
corruption bug reports from the community.
To prevent corruption due to concurrent startup attempts, the
Aria storage engine would unconditionally acquire an advisory lock
on one of its log files.
Solution: InnoDB will always lock its system tablespace files.
(Ever since commit 685d958e38
the InnoDB log file will not necessarily be open while the
server is running, because it can be accessed via memory-mapped I/O.)
If more protection is desired, then the option --external-locking
can be used.
The mandatory advisory lock also fixes intermittent failures of
some crash recovery tests. It turns out that when the mtr test harness
kills and restarts the server, it will not actually ensure that the
old process has terminated before starting the new one.
Step 1 in handling InnoDB AIO assertions better is to get
more detail of the cases of error.
This doesn't resolve MDEV-27593, but increases the level
of information in the assertion.
Revert "TSAN: data race on vptr (ctor/dtor vs virtual call)"
This reverts commit 78084fa747.
This commit was done to please TSAN, which falsely reported an error
where there was none.
Yet as consequence, it could cause a real error, a crash in os_aio_free on
shutdown
The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.
We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.
This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.
We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.
Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.
buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.
btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.
btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().
FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.
trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
trx_sys_create_sys_pages(): Merged with trx_sysf_create().
dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.
dir_pathname(): Replaces os_file_make_new_pathname().
row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.
btr_decryption_failed(): Report decryption failures
dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.
dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.
dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.
dict_table_is_corrupted(): Remove.
dict_index_t::is_btree(): Determine if the index is a valid B-tree.
BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures, but error codes being returned.
buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().
fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.
btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.
opt_search_plan_for_table(): Simplify the code.
row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.
row_undo_mod_upd_exist_sec(): Simplify the code.
row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().
fut_get_ptr(): Replace with buf_page_get_gen() calls.
buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.
buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.
fseg_page_is_allocated(): Replaces fseg_page_is_free().
fts_drop_common_tables(): Return an error if the transaction
was rolled back.
fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.
fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.
Clean up mtr_t::page_lock()
buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.
buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.
recv_recover_page(): Return whether the operation succeeded.
recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.
We will remove the parameter innodb_disallow_writes because it is badly
designed and implemented. The parameter was never allowed at startup.
It was only internally used by Galera snapshot transfer.
If a user executed
SET GLOBAL innodb_disallow_writes=ON;
the server could hang even on subsequent read operations.
During Galera snapshot transfer, we will block writes
to implement an rsync friendly snapshot, as follows:
sst_flush_tables() will acquire a global lock by executing
FLUSH TABLES WITH READ LOCK, which will block any writes
at the high level.
sst_disable_innodb_writes(), invoked via ha_disable_internal_writes(true),
will suspend or disable InnoDB background tasks or threads that could
initiate writes. As part of this, log_make_checkpoint() will be invoked
to ensure that anything in the InnoDB buf_pool.flush_list will be written
to the data files. This has the nice side effect that the Galera joiner
will avoid crash recovery.
The changes to sql/wsrep.cc and to the tests are based on a prototype
that was developed by Jan Lindström.
Reviewed by: Jan Lindström
The MemorySanitizer implementation in clang includes some built-in
instrumentation (interceptors) for GNU libc. In GNU libc 2.33, the
interface to the stat() family of functions was changed. Until the
MemorySanitizer interceptors are adjusted, any MSAN code builds
will act as if that the stat() family of functions failed to initialize
the struct stat.
A fix was applied in
https://reviews.llvm.org/rG4e1a6c07052b466a2a1cd0c3ff150e4e89a6d87a
but it fails to cover the 64-bit variants of the calls.
For now, let us work around the MemorySanitizer bug by defining
and using the macro MSAN_STAT_WORKAROUND().
As btrfs showed, a partial read of data in AIO /O_DIRECT circumstances can
really confuse MariaDB.
Filipe Manana (SuSE)[1] showed how database programmers can assume
O_DIRECT is all or nothing.
While a fix was done in the kernel side, we can do better in our code by
requesting that the rest of the block be read/written synchronously if
we do only get a partial read/write.
Per the APIs, a partial read/write can occur before an error, so
reattempting the request will leave the caller with a concrete error to
handle.
[1] https://lore.kernel.org/linux-btrfs/CABVffENfbsC6HjGbskRZGR2NvxbnQi17gAuW65eOM+QRzsr8Bg@mail.gmail.com/T/#mb2738e675e48e0e0778a2e8d1537dec5ec0d3d3a
Also spell synchronously correctly in other files.