In commit 685d958e38 (MDEV-14425)
we ended up not enabling O_DIRECT writes on the redo log
by default, because back then, it was slightly slower on
some systems.
With commit a635c40648 (MDEV-27774)
the situation changed. A new test on a NVMe device shows 9%
improvement in throughput and over 15% reduction of latency
when O_DIRECT writes are enabled.
With this change, all the following settings will use O_DIRECT
on InnoDB data and log files:
innodb_flush_method=O_DIRECT
innodb_flush_method=O_DIRECT_NO_FSYNC
innodb_flush_method=O_DSYNC
Before MDEV-14425, log writes were always buffered on Linux.
Between MDEV-14425 and this change, unbuffered log writes
were only enabled for innodb_flush_method=O_DSYNC.
The data member log_sys.block_size is only defined in environments
where we can determine the physical block size of a file
(currently, Linux and Microsoft Windows). The accessor function
get_block_size() is available everywhere.
Thanks to Dmitry Shulga for reporting the build failure.
buf_flush_freed_pages(): Assert that neither buf_pool.mutex
nor buf_pool.flush_list_mutex are held. Simplify the loops.
Return the tablespace and the number of pages written or punched.
buf_flush_LRU_list_batch(), buf_do_flush_list_batch():
Release buf_pool.mutex before invoking buf_flush_space().
buf_flush_list_space(): Acquire the mutexes only after invoking
buf_flush_freed_pages().
Reviewed by: Thirunarayanan Balathandayuthapani
In commit a635c40648 (MDEV-27774)
a race condition was introduced between mtr_t::commit() and
a log checkpoint.
Between the time of assigning the log sequence number and adding
the changed pages to buf_pool.flush_list, the log_sys.latch must
be continuously held by the current thread, or otherwise a
log checkpoint could get the wrong result from
buf_pool.get_oldest_modification().
buf_pool_t::insert_into_flush_list(): Add a debug assertion for
increasing the probability of cathing this type of problem.
mtr_t::m_latch_ex: A flag that indicates whether the mini-transaction
is holding log_sys.latch in exclusive mode.
mtr_t::do_write(), mtr_t::finish_write(): Remove the parameter
"bool ex" and refer to m_latch_ex instead.
mtr_t::commit(): Release log_sys.latch according to m_latch_ex.
mtr_t::commit_shrink(), mtr_t::commit_files(): Set m_latch_ex.
mtr_t::do_write(): Do not release an exclusive log_sys.latch,
but instead set m_latch_ex if needed.
fil_space_t::try_to_close(): Tolerate a tablespace that has no
data files attached. The function fil_ibd_create() initially
creates and attaches a tablespace with no files, and invokes
fil_space_t::add() later.
fil_node_open_file(): After releasing and reacquiring fil_system.mutex,
check if the file was already opened by another thread. This avoids
an assertion failure !node->is_open() in fil_node_open_file_low().
These failures were reproduced with the test
innodb.table_definition_cache_debug and the fix of MDEV-27985.
Commit 6c39eaeb1 made the crash recovery dependent on server_id.
The crash recovery could fail when restoring a new instance from
original crashed data directory USING A NEW SERVER ID.
The issue doesn't exist in previous major versions before 10.6.
Root cause is when generating the input XID to be searched in the hash,
server id is populated with the current server id.
So if the server id changed when recovering, the XID couldn't be found
in the hash due to server id doesn't match.
This fix is to use original server id when creating the input XID
object in function `xarecover_do_commit_or_rollback`.
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
- InnoDB fails to skip newly created column while checking for
change column when table is in redundant row format. This issue
is caused the MDEV-18035 (ccb1acbd3c)
For some reason, the tests of the MemorySanitizer build on 10.5 failed
with both clang 13 and clang 14 with SIGSEGV. On 10.6 where it worked
better, some more places to work around were identified.
The MemorySanitizer implementation in clang includes some built-in
instrumentation (interceptors) for GNU libc. In GNU libc 2.33, the
interface to the stat() family of functions was changed. Until the
MemorySanitizer interceptors are adjusted, any MSAN code builds
will act as if that the stat() family of functions failed to initialize
the struct stat.
A fix was applied in
https://reviews.llvm.org/rG4e1a6c07052b466a2a1cd0c3ff150e4e89a6d87a
but it fails to cover the 64-bit variants of the calls.
For now, let us work around the MemorySanitizer bug by defining
and using the macro MSAN_STAT_WORKAROUND().
In commit 83212632e4
the trx_rseg_latch was instrumented for performance_schema,
but some acqusitions of rd_lock() were not adjusted.
Thus, the build would fail on platforms where a futex-based
rw-lock is not available (SUX_LOCK_GENERIC) unless the code
was built with cmake -DPLUGIN_PERFSCHEMA=NO.
Some Spider table options introduces an unnecessary complication to
Spider settings. For example, the default value of the plugin variable
spider_auto_increment_mode is -1 (use table value) and the default
table option value is 0 (normal mode). Thus, the virtual default value
of the variable is 0. This kind of indirection is confusing.
In order to delete such confusing table options in a future release,
we first change the default values of some Spider plugin variables
from -1 (use table value) to the corresponding default table values.
The default table values are defined in spider_set_connect_info_default().
At the same time, we also deprecate the option value -1 (use table value).
Deprecate the plugin variable spider_use_handler and the corresponding
table parameters "uhd" and "use_handler".
Passing a Handler statement to data nodes, without converting it to
SQL sometimes, might improve the performance, while this introduces
some complication to the implementation.
In the first place, only a few people use Handler statements and the
performance gain seems not to be very significant. Further, setting
spider_use_handler > 0 disables the GROUP BY handler. So, we decided
to deprecate the variable.
not every index-using plan sets bits in table->quick_keys.
QUICK_ROR_INTERSECT_SELECT, for example, doesn't.
Use the fact that select->quick is set instead.
Also allow EXPLAIN to work.
As btrfs showed, a partial read of data in AIO /O_DIRECT circumstances can
really confuse MariaDB.
Filipe Manana (SuSE)[1] showed how database programmers can assume
O_DIRECT is all or nothing.
While a fix was done in the kernel side, we can do better in our code by
requesting that the rest of the block be read/written synchronously if
we do only get a partial read/write.
Per the APIs, a partial read/write can occur before an error, so
reattempting the request will leave the caller with a concrete error to
handle.
[1] https://lore.kernel.org/linux-btrfs/CABVffENfbsC6HjGbskRZGR2NvxbnQi17gAuW65eOM+QRzsr8Bg@mail.gmail.com/T/#mb2738e675e48e0e0778a2e8d1537dec5ec0d3d3a
Also spell synchronously correctly in other files.
Let us explicitly wait for purge before invoking a slow shutdown,
so that instrumented builds (such as ASAN or UBSAN) will not
exceed the 60-second timeout during shutdown.