Resources such as the io_uring ring in MariaDB are not intended for spawned
processes, so we restrict access to them using the io_uring_ring_dontfork()
liburing library call.
Removed the use of std::vector's push_back() and pop_back() to make it more
obvious that memory in the vectors won't be reallocated.
Also, "borrowed" elements can be debugged a little better now:
they are put at the start of the m_cache vector.
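A minimal sketch of the idea, not the actual tpool code: the class name
fixed_cache and its members other than m_cache are hypothetical. Both vectors
are sized once in the constructor, and get()/put() only move an index, so no
reallocation can happen afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity object cache (illustrative names only).
template <typename T>
class fixed_cache
{
  std::vector<T> m_base;    // element storage, sized once in the constructor
  std::vector<T*> m_cache;  // slots [m_pos, size) hold the free elements
  std::size_t m_pos= 0;     // number of elements currently borrowed
public:
  explicit fixed_cache(std::size_t count) : m_base(count), m_cache(count)
  {
    for (std::size_t i= 0; i < count; i++)
      m_cache[i]= &m_base[i];
  }

  // Borrow one element, or return nullptr if all of them are in use.
  T *get()
  {
    if (m_pos == m_cache.size())
      return nullptr;
    return m_cache[m_pos++];
  }

  // Return a previously borrowed element to the cache.
  void put(T *ptr)
  {
    assert(m_pos > 0);
    m_cache[--m_pos]= ptr;
  }

  std::size_t borrowed() const { return m_pos; }
  const T *base() const { return m_base.data(); }
};
```

Because neither vector ever grows or shrinks, pointers handed out by get()
stay valid for the lifetime of the cache.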
Fix a concurrency error: avoid accessing deleted memory when io_slots is
resized. The deleted memory in this case was the vftable pointer in
aiocb::m_internal_task.
The fix avoids calling the dummy release function, via a flag in task_group.
Fixed tpool timer implementation on POSIX.
Prior to this patch, under some specific rare circumstances (concurrency
related), timer callback execution might be skipped.
Table_cache_instance: Define the structure aligned at
the CPU cache line, and remove a pad[] data member.
Krunal Bauskar reported this to improve performance on ARMv8.
aligned_malloc(): Wrapper for the Microsoft _aligned_malloc()
and the ISO/IEC 9899:2011 <stdlib.h> aligned_alloc().
Note: The parameters are in the Microsoft order (size, alignment),
opposite of aligned_alloc(alignment, size).
Note: The standard defines that size must be an integer multiple
of alignment. It is enforced by AddressSanitizer but not by GNU libc
on Linux.
aligned_free(): Wrapper for the Microsoft _aligned_free() and
the standard free().
HAVE_ALIGNED_ALLOC: A new test. Unfortunately, support for
aligned_alloc() may still be missing on some platforms.
We will fall back to posix_memalign() for those cases.
HAVE_MEMALIGN: Remove, along with any use of the nonstandard memalign().
PFS_ALIGNEMENT (sic): Removed; we will use CPU_LEVEL1_DCACHE_LINESIZE.
PFS_ALIGNED: Defined using the C++11 keyword alignas.
buf_pool_t::page_hash_table::create(),
lock_sys_t::hash_table::create(),
lock_sys_t::hash_table::resize(): Pad the allocation size to an
integer multiple of the alignment.
Reviewed by: Vladislav Vaintroub
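The wrappers and the size padding described above can be sketched as follows.
This is an illustration under assumptions, not the committed code: the names
aligned_malloc_sketch, aligned_free_sketch, pad_to_alignment and counter_slot
are hypothetical, a 64-byte cache line is assumed, and the
posix_memalign() fallback is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#ifdef _WIN32
# include <malloc.h>
#endif

// Assumed cache line size for this sketch; the real value is per platform.
constexpr std::size_t CACHE_LINE= 64;

// Round size up to the next integer multiple of alignment (a power of 2).
inline std::size_t pad_to_alignment(std::size_t size, std::size_t alignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (size + alignment - 1) & ~(alignment - 1);
}

// Parameters are in the Microsoft order (size, alignment).
inline void *aligned_malloc_sketch(std::size_t size, std::size_t alignment)
{
#ifdef _WIN32
  return _aligned_malloc(size, alignment);
#else
  // The standard order is (alignment, size); size must be a multiple
  // of alignment.
  return std::aligned_alloc(alignment, size);
#endif
}

inline void aligned_free_sketch(void *ptr)
{
#ifdef _WIN32
  _aligned_free(ptr);
#else
  std::free(ptr);
#endif
}

// A structure aligned at the assumed cache line, with no pad[] member:
// alignas makes sizeof() a multiple of the alignment.
struct alignas(CACHE_LINE) counter_slot
{
  std::uint64_t value;
};
```

A caller would pad the requested size first, for example
aligned_malloc_sketch(pad_to_alignment(n, CACHE_LINE), CACHE_LINE).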
aio_uring::thread_routine(): Handle -EINTR from io_uring_wait_cqe()
in the same way as aio_linux::getevent_thread_routine() does it:
simply ignore it and invoke the system call again.
Reviewed by: Vladislav Vaintroub
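The retry can be sketched without liburing as below. The function name
wait_ignoring_eintr is hypothetical, and wait_fn is a stand-in for a call
such as io_uring_wait_cqe(), which returns 0 on success or a negative errno
value.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Invoke wait_fn until it returns something other than -EINTR.
// wait_fn is assumed to return 0 on success or a negative errno value.
inline int wait_ignoring_eintr(const std::function<int()> &wait_fn)
{
  assert(wait_fn != nullptr);
  for (;;)
  {
    int ret= wait_fn();
    if (ret != -EINTR)
      return ret;   // success, or an error that the caller must handle
    // Interrupted by a signal: simply invoke the call again.
  }
}
```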
As btrfs showed, a partial read of data in AIO/O_DIRECT circumstances can
really confuse MariaDB.
Filipe Manana (SuSE)[1] showed how database programmers can assume
O_DIRECT is all or nothing.
While a fix was made on the kernel side, we can do better in our code by
requesting that the rest of the block be read/written synchronously if
we get only a partial read/write.
Per the APIs, a partial read/write can occur before an error, so
reattempting the request will leave the caller with a concrete error to
handle.
[1] https://lore.kernel.org/linux-btrfs/CABVffENfbsC6HjGbskRZGR2NvxbnQi17gAuW65eOM+QRzsr8Bg@mail.gmail.com/T/#mb2738e675e48e0e0778a2e8d1537dec5ec0d3d3a
Also spell synchronously correctly in other files.
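A minimal POSIX sketch of the read side, assuming a hypothetical helper named
finish_partial_read() that is called after an asynchronous request completed
with fewer bytes than requested. It is not the tpool implementation.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <sys/types.h>
#include <unistd.h>

// 'done' bytes of a 'len'-byte request at 'offset' were already transferred.
// Read the remainder synchronously. Returns the total number of bytes in buf
// (less than len only at end of file), or -1 with errno set.
inline ssize_t finish_partial_read(int fd, char *buf, size_t len,
                                   off_t offset, size_t done)
{
  assert(done <= len);
  while (done < len)
  {
    ssize_t n= pread(fd, buf + done, len - done, offset + (off_t) done);
    if (n < 0)
    {
      if (errno == EINTR)
        continue;
      return -1;        // a concrete error for the caller to handle
    }
    if (n == 0)
      break;            // end of file
    done+= (size_t) n;
  }
  return (ssize_t) done;
}
```

The write side would look the same with pwrite() in place of pread().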
Fixed tpool::pread() and tpool::pwrite() to return SSIZE_T on Windows,
so that huge numbers are not converted to negatives.
Also, make sure never to attempt reading/writing more bytes than
a DWORD can accommodate (4G).
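The two points above can be illustrated portably as follows. The names
dword_t, clamp_io_size and as_io_result are hypothetical; std::uint32_t
stands in for DWORD and std::int64_t for SSIZE_T on 64-bit Windows.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// DWORD is an unsigned 32-bit integer; uint32_t stands in for it here.
using dword_t= std::uint32_t;

// Largest byte count that fits in a single DWORD-sized request.
constexpr std::uint64_t MAX_DWORD= std::numeric_limits<dword_t>::max();

// Never ask for more bytes than a DWORD can hold; the caller is expected
// to cope with the resulting short transfer.
inline dword_t clamp_io_size(std::uint64_t count)
{
  return count > MAX_DWORD ? (dword_t) MAX_DWORD : (dword_t) count;
}

// Widen the transferred byte count to a signed 64-bit result, so that
// counts above 2G remain positive.
inline std::int64_t as_io_result(dword_t transferred)
{
  return (std::int64_t) transferred;
}
```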
MDEV-23855 and MDEV-23399 already moved some transient data fields
from buffer pool page descriptors to IORequest, but the write buffer
of PAGE_COMPRESSED or ENCRYPTED tables was missed. Since it is only needed
during asynchronous page write requests, it belongs to IORequest.
Do not execute the user callback just after pwrite. Instead, submit the user
function as a task into the thread pool. This way, the IO thread will not hog
aiocb, which is a limited (in InnoDB) resource.
In commit 49e2c8f0a6 (MDEV-25743)
we made dict_sys_t::find() incompatible with the rest of the
table name hash table operations in case the table name contains
non-ASCII octets (using a compatibility mode that facilitates the
upgrade into the MySQL 5.0 filename-safe encoding) and the target
platform implements signed char.
ut_fold_string(): Remove; replace with my_crc32c(). This also makes
table name hash value calculations independent of whether char
is unsigned or signed.
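Why the signedness of char matters can be shown with a toy hash. The
functions toy_fold and toy_fold_bytes are hypothetical (an FNV-1a style
fold); they are neither ut_fold_string() nor my_crc32c().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy fold over characters of type CharT. When CharT is signed, octets
// above 0x7f are sign-extended before being mixed into the hash.
template <typename CharT>
std::uint32_t toy_fold(const CharT *s, std::size_t len)
{
  assert(s != nullptr || len == 0);
  std::uint32_t h= 2166136261u;
  for (std::size_t i= 0; i < len; i++)
  {
    h^= (std::uint32_t) s[i];
    h*= 16777619u;
  }
  return h;
}

// Treating the input as raw octets gives the same result on every platform.
inline std::uint32_t toy_fold_bytes(const void *data, std::size_t len)
{
  return toy_fold(static_cast<const unsigned char*>(data), len);
}
```

For a name that contains a non-ASCII octet, the signed and unsigned
instantiations of toy_fold disagree, while toy_fold_bytes does not depend on
how the caller declared its buffer.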
The server may still abort if there is not enough free space in the
ring buffer to resubmit the IO job, but the behavior is equal to
the failure of os_aio() -> submit_io().
This gives the user the size required and how to set
memlock limits for the process.
Thanks to Jens Axboe for providing this requested interface.
ref: https://github.com/axboe/liburing/issues/246
Also don't put \n on my_printf_error; it's implicit.
- Use FIND_PACKAGE(LIBAIO) to find libaio
- Use standard CMake conventions in Find{PMEM,URING}.cmake
- Drop the LIB from LIB{PMEM,URING}_{INCLUDE_DIR,LIBRARIES}
It is cleaner, and consistent with how other packages are handled in CMake.
E.g., a successful FIND_PACKAGE(PMEM) now sets PMEM_FOUND, PMEM_LIBRARIES,
PMEM_INCLUDE_DIR, not LIBPMEM_{FOUND,LIBRARIES,INCLUDE_DIR}.
- Decrease the output. Use FIND_PACKAGE with the QUIET argument.
- For Linux packages, either liburing or libaio is required.
If liburing is installed, libaio does not need to be present.
Use FIND_PACKAGE([LIBAIO|URING] REQUIRED) if either library is required.
The new default values WITH_URING:BOOL=OFF, WITH_PMEM:BOOL=OFF imply
that the dependencies are optional.
An explicit request WITH_URING=ON or WITH_PMEM=ON will cause the
build to fail if the requested dependencies are not available.
Last, to prevent a feature from being built in even though the build-time
dependencies are available, the following can be used:
cmake -DCMAKE_DISABLE_FIND_PACKAGE_URING=1
cmake -DCMAKE_DISABLE_FIND_PACKAGE_PMEM=1
This cleanup was suggested by Vladislav Vaintroub.
liburing is a new optional dependency (WITH_URING=auto|yes|no)
that replaces libaio when it is available.
aio_uring: a class that wraps the io_uring calls
aio_uring::bind()/unbind(): optional optimization
aio_uring::submit_io(): a mutex prevents a data race. liburing calls are
thread-unsafe. If you look into its implementation you'll see
atomic operations, but they are used for synchronization between kernel and
user-space only. That's why our own synchronization is still needed.
For systemd, we add LimitMEMLOCK=524288 (ulimit -l 524288)
because the io_uring_setup system call that is invoked
by io_uring_queue_init() requests locked memory. The value
was found empirically; with 262144, we would occasionally
fail to enable io_uring when using the maximum values of
innodb_read_io_threads=64 and innodb_write_io_threads=64.
aio_uring::thread_routine(): Tolerate -EINTR return from
io_uring_wait_cqe(), because it may occur on shutdown
on Ubuntu 20.10 (Groovy Gorilla).
This was mostly implemented by Eugene Kosov. Systemd integration
and improved startup/shutdown error handling by Marko Mäkelä.
In commit 5e62b6a5e0 (MDEV-16264)
the logic of os_aio_init() was changed so that it will never fail,
but instead automatically disable innodb_use_native_aio (which is
enabled by default) if the io_setup() system call would fail due
to resource limits being exceeded. This is questionable, especially
because falling back to simulated AIO may lead to significantly
reduced performance.
srv_n_file_io_threads, srv_n_read_io_threads, srv_n_write_io_threads:
Change the data type from ulong to uint.
os_aio_init(): Remove the parameters, and actually return an error code.
thread_pool::configure_aio(): Do not silently fall back to simulated AIO.
Reviewed by: Vladislav Vaintroub
- The intention of the my_getevents syscall wrapper is now better explained:
we use it to be able to interrupt the io_getevents syscall via
io_destroy().
- Fix comment for MAX_EVENTS in getevent_thread_routine.
MAX_EVENTS is a more or less arbitrary constant, chosen such that the events
array is big enough to get multiple simultaneous IO completions, but small
enough that it does not blow the thread's stack.
If the maintenance timer does not do much for a prolonged time, it will
wake up less frequently, once every 4 seconds instead of once every 0.4
seconds.
It will wake up more often if thread creation is throttled, to avoid stalls.
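The period selection can be sketched as below. Only the 0.4 second and
4 second periods come from the description above; the function name
next_timer_period, the IDLE_THRESHOLD value, and the choice to return to the
0.4 second period when thread creation is throttled are assumptions made for
illustration.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::milliseconds;

constexpr milliseconds ACTIVE_PERIOD{400};    // normal wakeup period
constexpr milliseconds IDLE_PERIOD{4000};     // period after prolonged idleness
// Hypothetical: how long the timer must have been idle to slow down.
constexpr milliseconds IDLE_THRESHOLD{20000};

// Choose the next wakeup period of the maintenance timer.
inline milliseconds next_timer_period(milliseconds idle_time,
                                      bool creation_throttled)
{
  assert(idle_time.count() >= 0);
  if (creation_throttled)
    return ACTIVE_PERIOD;   // assumption: wake up often to avoid stalls
  return idle_time >= IDLE_THRESHOLD ? IDLE_PERIOD : ACTIVE_PERIOD;
}
```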