Commit graph

700 commits

Author SHA1 Message Date
Daniel Black
807e4f320f Change my_umask{,_dir} to mode_t and remove os_innodb_umask
os_innodb_umask was of the incorrect type resulting in warnings
in clang-19. The correct type is mode_t.

As os_innodb_umask was set during innnodb_init from my_umask,
corrected the type there along with its companion my_umask_dir.
Because of this, the defaults mask values in innodb never
had an effect.

The resulting change allow found signed differences in
my_create{,_nosymlink}, open_nosymlinks:

mysys/my_create.c:47:20: error: operand of ?: changes signedness from ‘int’ to ‘mode_t’ {aka ‘unsigned int’} due to unsignedness of other operand [-Werror=sign-compare]
   47 |      CreateFlags ? CreateFlags : my_umask);

Ref: clang-19 warnings:

[55/123] Building CXX object storage/innobase/CMakeFiles/innobase.dir/os/os0file.cc.o
storage/innobase/os/os0file.cc:1075:46: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
 1075 |                 file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
      |                        ~~~~                                ^~~~~~~~~~~~~~~
storage/innobase/os/os0file.cc:1249:46: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
 1249 |                 file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
      |                        ~~~~                                ^~~~~~~~~~~~~~~
storage/innobase/os/os0file.cc:1381:45: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
 1381 |         file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
      |                ~~~~                                ^~~~~~~~~~~~~~~
2024-12-11 17:21:01 +11:00
Marko Mäkelä
decdd4bf49 MDEV-29015/MDEV-29260/MDEV-34938: os_file_get_size() WSL work-around
When MariaDB Server is run in a container under
Windows Subsystem for Linux, the fstat(2) system calls that InnoDB
invokes in os_file_set_size() or os_file_get_size() are causing a
failure in case the file had been renamed in the past while the file
handle was open. This affects at least ALTER TABLE and OPTIMIZE TABLE.

os_file_get_size(): Invoke lseek(2) instead of fstat(2). We do not mind
if the file pointer is moving to the end of the file, because InnoDB
exclusively invokes positioned reads and writes, or in some rare cases,
appends to an existing file.

os_file_set_size(): Invoke os_file_get_size() instead of fstat(2).
Define the POSIX and Windows versions separately. Formerly, the
Windows version was called os_file_change_size_win32().

fil_node_t::read_page0(): Use os_file_get_size() to determine the
size, and do not crash on error.

fil_node_t::read_metadata(): Remove the non-Windows stat* parameter
and always invoke fstat(2) outside Windows, but do tolerate errors.
Because fstat(2) is more likely to fail than lseek(2), and this is
not time critical code, we can afford the extra lseek(2) system call.

Reviewed by: Vladislav Vaintroub
2024-10-24 16:08:56 +03:00
Vladislav Vaintroub
c1fc59277a MDEV-34929 page-compressed tables do not work on Windows
Remove workaround for MDEV-13941, it served for 5 years,and all affected
pre-release 10.2 installation should have been already fixed in between.

Apparently Innodb is using is_sparse parameter in os_file_set_size()
inconsistently, and it passes is_sparse=false now during first file
extension. With MDEV-13941 workaround in place, it would unsparse
the file, which is makes compression not to work at all anymore.
2024-10-16 16:02:13 +02:00
Thirunarayanan Balathandayuthapani
3662f8ca89 MDEV-34543 Shutdown hangs while freeing the asynchronous i/o slots
Problem:
========
- During shutdown, InnoDB tries to free the asynchronous
I/O slots and hangs. The reason is that InnoDB disables
asynchronous I/O before waiting for pending
asynchronous I/O to finish.

buf_load(): InnoDB aborts the buffer pool load due to
user requested shutdown and doesn't wait for the asynchronous
read to get completed. This could lead to debug assertion
in buf_flush_buffer_pool() during shutdown

Fix:
===
os_aio_free(): Should wait all read_slots and write_slots
to finish before disabling the aio.

buf_load(): Should wait for pending read request to complete
even though it was aborted.
2024-07-12 17:42:14 +05:30
Brandon Nesterenko
d3a7e46bb4 MDEV-34365: UBSAN runtime error: call to function io_callback(tpool::aiocb*)
On an UBSAN clang-15 build, if running with UBSAN option
halt_on_error=1 (the issue doesn't show up without it),
MTR fails during mysqld --bootstrap with UBSAN error:

call to function io_callback(tpool::aiocb*) through pointer to incorrect function type 'void (*)(void *)'

This patch corrects the parameter type of io_callback
to match its expected type defined by callback_func,
i.e. (void*).

Reviewed By:
============
<TODO>
2024-06-12 08:39:41 +03:00
Thirunarayanan Balathandayuthapani
9b2bf09b95 MDEV-33980 mariadb-backup --backup is missing retry logic for undo tablespaces
- This is a merge of commit f378e76434
from 10.4 to 10.5.
2024-05-06 19:50:20 +05:30
Julius Goryavsky
b88c20ce1b Merge branch 10.4 into 10.5 2024-05-06 13:55:42 +02:00
Thirunarayanan Balathandayuthapani
f378e76434 MDEV-33980 mariadb-backup --backup is missing retry logic for undo tablespaces
Problem:
========
- Currently mariabackup have to reread the pages in case they are
modified by server concurrently. But while reading the undo
tablespace, mariabackup failed to do reread the page in case of
error.

Fix:
===
Mariabackup --backup functionality should have retry logic
while reading the undo tablespaces.
2024-04-30 16:15:26 +05:30
Sergei Golubchik
01f6abd1d4 Merge branch '10.4' into 10.5 2024-01-31 17:32:53 +01:00
Marko Mäkelä
ee1407f74d MDEV-32268: GNU libc posix_fallocate() may be extremely slow
os_file_set_size(): Let us invoke the Linux system call fallocate(2)
directly, because the GNU libc posix_fallocate() implements a fallback
that writes to the file 1 byte every 4096 or fewer bytes. In one
environment, invoking fallocate() directly would lead to 4 times the
file growth rate during ALTER TABLE. Presumably, what happened was
that the NFS server used a smaller allocation block size than 4096 bytes
and therefore created a heavily fragmented sparse file when
posix_fallocate() was used. For example, extending a file by 4 MiB
would create 1,024 file fragments. When the file is actually being
written to with data, it would be "unsparsed".

The built-in EOPNOTSUPP fallback in os_file_set_size() writes a buffer
of 1 MiB of NUL bytes. This was always used on musl libc and other
Linux implementations of posix_fallocate().
2024-01-18 11:00:27 +02:00
Yuchen Pei
6b343de8ef
Merge branch '10.4' into 10.5 2023-09-25 13:06:57 +10:00
Vladislav Vaintroub
1ee0d09a2b MDEV-32228 speedup opening tablespaces on Windows
is_file_on_ssd() is more expensive than it should be.
It caches the results by volume name, but still calls GetVolumePathName()
every time, which, as procmon shows, opens multiple directories in
filesystem hierarchy (db directory, datadir, and all ancestors)

The fix is to cache SSD status by volume serial ID, which is cheap to
retrieve with GetFileInformationByHandleEx()
2023-09-22 21:07:50 +02:00
Oleksandr Byelkin
ac5a534a4c Merge remote-tracking branch '10.4' into 10.5 2023-03-31 21:32:41 +02:00
Marko Mäkelä
dfa90257f6 MDEV-30936 clang 15.0.7 -fsanitize=memory fails massively
handle_slave_io(), handle_slave_sql(), os_thread_exit():
Remove a redundant pthread_exit(nullptr) call, because it
would cause SIGSEGV.

mysql_print_status(): Add MEM_MAKE_DEFINED() to work around
some missing instrumentation around mallinfo2().

que_graph_free_stat_list(): Invoke que_node_get_next(node) before
que_graph_free_recursive(node). That is the logical and
MSAN_OPTIONS=poison_in_dtor=1 compatible way of freeing memory.

ins_node_t::~ins_node_t(): Invoke mem_heap_free(entry_sys_heap).

que_graph_free_recursive(): Rely on ins_node_t::~ins_node_t().

fts_t::~fts_t(): Invoke mem_heap_free(fts_heap).

fts_free(): Replace with direct calls to fts_t::~fts_t().

The failures in free_root() due to MSAN_OPTIONS=poison_in_dtor=1
will be covered in MDEV-30942.
2023-03-28 11:44:24 +03:00
Vlad Lesin
4c226c1850 MDEV-29050 mariabackup issues error messages during InnoDB tablespaces export on partial backup preparing
The solution is to suppress error messages for missing tablespaces if
mariabackup is launched with "--prepare --export" options.

"mariabackup --prepare --export" invokes itself with --mysqld parameter.
If the parameter is set, then it starts server to feed "FLUSH TABLES ...
FOR EXPORT;" queries for exported tablespaces. This is "normal" server
start, that's why new srv_operation value is introduced.

Reviewed by Marko Makela.
2023-03-27 20:15:10 +03:00
Marko Mäkelä
0792aff161 Merge 10.4 into 10.5 2022-09-20 13:17:02 +03:00
Marko Mäkelä
0c0a569028 Merge 10.3 into 10.4 2022-09-20 12:38:25 +03:00
Marko Mäkelä
c22dff21a5 InnoDB cleanup: Replace UNIV_LINUX, UNIV_SOLARIS, UNIV_AIX
Let us use the normal platform-specific preprocessor symbols
__linux__, __sun__, _AIX instead of some homebrew ones.

The preprocessor symbol UNIV_HPUX must have lost its meaning
by f6deb00a56 (note: the symbol
UNIV_HPUX10 is being checked for, but only UNIV_HPUX is defined).
2022-09-19 12:20:53 +03:00
Marko Mäkelä
bbf81b51f2 Correct typos in a function comment
Thanks to Thirunarayanan Balathandayuthapani for spotting this.
2022-09-19 10:23:57 +03:00
Vladislav Vaintroub
a3fd9e6b06 MDEV-29367 Refactor tpool::cache
Removed use std::vector's ba push_back(), pop_back() to  make it more
obvious that memory in the vectors won't be reallocated.

Also, "borrowed" elements can be debugged a little better now,
they are put into the start of the m_cache vector.
2022-08-24 13:36:49 +02:00
Marko Mäkelä
098c0f2634 Merge 10.4 into 10.5 2022-07-27 17:17:24 +03:00
Marko Mäkelä
e5c4f4e590 Merge 10.3 into 10.4 2022-07-27 14:25:36 +03:00
Marko Mäkelä
0ee1082bd2 MDEV-28495 InnoDB corruption due to lack of file locking
Starting with commit da094188f6 (MDEV-24393),
MariaDB will no longer acquire advisory file locks on InnoDB data
files by default, because it would create a large number of
entries in Linux /proc/locks.

The motivation for acquiring the file locks is to prevent accidental
concurrent startup of multiple server processes on the same data files.
Such mistake still turns out to be relatively common, based on
corruption bug reports from the community.

To prevent corruption due to concurrent startup attempts, the
Aria storage engine would unconditionally acquire an advisory lock
on one of its log files.

Solution: InnoDB will always lock its system tablespace files.
(Ever since commit 685d958e38
the InnoDB log file will not necessarily be open while the
server is running, because it can be accessed via memory-mapped I/O.)

If more protection is desired, then the option --external-locking
can be used.

The mandatory advisory lock also fixes intermittent failures of
some crash recovery tests. It turns out that when the mtr test harness
kills and restarts the server, it will not actually ensure that the
old process has terminated before starting the new one.
2022-07-27 14:15:14 +03:00
Daniel Black
fc456bc97e MDEV-27593 InnoDB handle AIO errors - more detailed assertion
Step 1 in handling InnoDB AIO assertions better is to get
more detail of the cases of error.

This doesn't resolve MDEV-27593, but increases the level
of information in the assertion.
2022-07-09 12:34:02 +10:00
Marko Mäkelä
4b3c3e526e Merge 10.4 into 10.5 2022-06-02 16:51:13 +03:00
Marko Mäkelä
96f4b4a55b Merge 10.3 into 10.4 2022-06-02 16:34:17 +03:00
Marko Mäkelä
fde99e006d MDEV-28716: Portability: unlink() can return EPERM instead of EISDIR 2022-06-01 11:13:15 +03:00
Marko Mäkelä
5d8dcfd86c MDEV-25975: Merge 10.4 into 10.5 2022-04-06 10:30:49 +03:00
Marko Mäkelä
d172df9913 MDEV-25975: Merge 10.3 into 10.4 2022-04-06 09:18:38 +03:00
Marko Mäkelä
e9735a8185 MDEV-25975 innodb_disallow_writes causes shutdown to hang
We will remove the parameter innodb_disallow_writes because it is badly
designed and implemented. The parameter was never allowed at startup.
It was only internally used by Galera snapshot transfer.
If a user executed
SET GLOBAL innodb_disallow_writes=ON;
the server could hang even on subsequent read operations.

During Galera snapshot transfer, we will block writes
to implement an rsync friendly snapshot, as follows:

sst_flush_tables() will acquire a global lock by executing
FLUSH TABLES WITH READ LOCK, which will block any writes
at the high level.

sst_disable_innodb_writes(), invoked via ha_disable_internal_writes(true),
will suspend or disable InnoDB background tasks or threads that could
initiate writes. As part of this, log_make_checkpoint() will be invoked
to ensure that anything in the InnoDB buf_pool.flush_list will be written
to the data files. This has the nice side effect that the Galera joiner
will avoid crash recovery.

The changes to sql/wsrep.cc and to the tests are based on a prototype
that was developed by Jan Lindström.

Reviewed by: Jan Lindström
2022-04-06 08:06:49 +03:00
Marko Mäkelä
59359fb44a MDEV-24841 Build error with MSAN use-of-uninitialized-value in comp_err
The MemorySanitizer implementation in clang includes some built-in
instrumentation (interceptors) for GNU libc. In GNU libc 2.33, the
interface to the stat() family of functions was changed. Until the
MemorySanitizer interceptors are adjusted, any MSAN code builds
will act as if that the stat() family of functions failed to initialize
the struct stat.

A fix was applied in
https://reviews.llvm.org/rG4e1a6c07052b466a2a1cd0c3ff150e4e89a6d87a
but it fails to cover the 64-bit variants of the calls.

For now, let us work around the MemorySanitizer bug by defining
and using the macro MSAN_STAT_WORKAROUND().
2022-03-14 09:28:55 +02:00
Daniel Black
d78173828e MDEV-27900: aio handle partial reads/writes
As btrfs showed, a partial read of data in AIO /O_DIRECT circumstances can
really confuse MariaDB.

Filipe Manana (SuSE)[1] showed how database programmers can assume
O_DIRECT is all or nothing.

While a fix was done in the kernel side, we can do better in our code by
requesting that the rest of the block be read/written synchronously if
we do only get a partial read/write.

Per the APIs, a partial read/write can occur before an error, so
reattempting the request will leave the caller with a concrete error to
handle.

[1] https://lore.kernel.org/linux-btrfs/CABVffENfbsC6HjGbskRZGR2NvxbnQi17gAuW65eOM+QRzsr8Bg@mail.gmail.com/T/#mb2738e675e48e0e0778a2e8d1537dec5ec0d3d3a

Also spell synchronously correctly in other files.
2022-03-12 09:47:53 +11:00
Marko Mäkelä
b791b942e1 Merge 10.4 into 10.5 2022-02-25 13:27:41 +02:00
Marko Mäkelä
f5ff7d09c7 Merge 10.3 into 10.4 2022-02-25 13:00:48 +02:00
Marko Mäkelä
00b70bbb51 Merge 10.2 into 10.3 2022-02-25 10:43:38 +02:00
Vlad Lesin
a112a80b47 Merge 10.4 into 10.5 2022-02-22 10:35:16 +03:00
Vladislav Vaintroub
24ec144c63 MDEV-27901 Windows : expensive system calls used to calculate file system block size
The result is not used anywhere but in the output of Innodb information
schema, but this can take as much as 7%CPU (only) on a benchmark.

Fix to move fs blocksize calculate to where it is used.
2022-02-20 22:00:42 +01:00
Vladislav Vaintroub
fa557986ac MDEV-24175 Windows - fix detection of whether file is on SSD
Fix detection. SSD is when storage does *not* incur a seek penalty.
2022-02-17 22:55:08 +01:00
Marko Mäkelä
4c3ad24413 MDEV-27416 InnoDB hang in buf_flush_wait_flushed(), on log checkpoint
InnoDB could sometimes hang when triggering a log checkpoint. This is
due to commit 7b1252c03d (MDEV-24278),
which introduced an untimed wait to buf_flush_page_cleaner().

The hang was noticed by occasional failures of IMPORT TABLESPACE tests,
such as innodb.innodb-wl5522, which would (unnecessarily) invoke
log_make_checkpoint() from row_import_cleanup().

The reason of the hang was that buf_flush_page_cleaner() would enter
untimed sleep despite buf_flush_sync_lsn being set. The exact failure
scenario is unclear, because buf_flush_sync_lsn should actually be
protected by buf_pool.flush_list_mutex. We prevent the hang by
invoking buf_pool.page_cleaner_set_idle(false) whenever we are
setting buf_flush_sync_lsn and signaling buf_pool.do_flush_list.

The bulk of these changes was originally developed as a preparation
for MDEV-26827, to invoke buf_flush_list() from fewer threads,
and tested on 10.6 by Matthias Leich.

This fix was tested by running 100 repetitions of 100 concurrent instances
of the test innodb.innodb-wl5522 on a RelWithDebInfo build, using ext4fs
and innodb_flush_method=O_DIRECT on a SATA SSD with 4096-byte block size.
During the test, the call to log_make_checkpoint() in row_import_cleanup()
was present.

buf_flush_list(): Make static.

buf_flush_wait(): Wait for buf_pool.get_oldest_modification()
to reach a target, by work done in the buf_flush_page_cleaner.
If buf_flush_sync_lsn is going to be set, we will invoke
buf_pool.page_cleaner_set_idle(false).

buf_flush_ahead(): If buf_flush_sync_lsn or buf_flush_async_lsn
is going to be set and the page cleaner woken up, we will invoke
buf_pool.page_cleaner_set_idle(false).

buf_flush_wait_flushed(): Invoke buf_flush_wait().

buf_flush_sync(): Invoke recv_sys.apply() at the start in case
crash recovery is active. Invoke buf_flush_wait().

buf_flush_sync_batch(): A lower-level variant of buf_flush_sync()
that is only called by recv_sys_t::apply().

buf_flush_sync_for_checkpoint(): Do not trigger log apply
or checkpoint during recovery.

buf_dblwr_t::create(): Only initiate a buffer pool flush, not
a checkpoint.

row_import_cleanup(): Do not unnecessarily invoke log_make_checkpoint().
Invoking buf_flush_list_space() before starting to generate redo log
for the imported tablespace should suffice.

srv_prepare_to_delete_redo_log_file():
Set recv_sys.recovery_on in order to prevent
buf_flush_sync_for_checkpoint() from initiating a checkpoint
while the log is inaccessible. Remove a wait loop that is already
part of buf_flush_sync().
Do not invoke fil_names_clear() if the log is being upgraded,
because the FILE_MODIFY record is specific to the latest format.

create_log_file(): Clear recv_sys.recovery_on only after calling
log_make_checkpoint(), to prevent buf_flush_page_cleaner from
invoking a checkpoint.

innodb_shutdown(): Simplify the logic in mariadb-backup --prepare.

os_aio_wait_until_no_pending_writes(): Update the function comment.
Apart from row_quiesce_table_start() during FLUSH TABLES...FOR EXPORT,
this is being called by buf_flush_list_space(), which is invoked
by ALTER TABLE...IMPORT TABLESPACE as well as some encryption operations.
2022-01-04 07:40:31 +02:00
Marko Mäkelä
2d0847818d Merge 10.4 into 10.5 2021-09-11 11:49:12 +03:00
Marko Mäkelä
101d10b883 Merge 10.3 into 10.4 2021-09-11 11:21:39 +03:00
Marko Mäkelä
bcd25e1066 Merge 10.2 into 10.3 2021-09-11 11:14:18 +03:00
Marko Mäkelä
d09426f9e6 MDEV-26537 InnoDB corrupts files due to incorrect st_blksize calculation
The st_blksize returned by fstat(2) is not documented to be
a power of 2, like we assumed in
commit 58252fff15 (MDEV-26040).
While on Linux, the st_blksize appears to report the file system
block size (which hopefully is not smaller than the sector size
of the underlying block device), on FreeBSD we observed
st_blksize values that might have been something similar to st_size.

Also IBM AIX was affected by this. A simple test case would
lead to a crash when using the minimum innodb_buffer_pool_size=5m
on both FreeBSD and AIX:

seq -f 'create table t%g engine=innodb select * from seq_1_to_200000;' \
1 100|mysql test&
seq -f 'create table u%g engine=innodb select * from seq_1_to_200000;' \
1 100|mysql test&

We will fix this by not trusting st_blksize at all, and assuming that
the smallest allowed write size (for O_DIRECT) is 4096 bytes. We hope
that no storage systems with larger block size exist. Anything larger
than 4096 bytes should be unlikely, given that it is the minimum
virtual memory page size of many contemporary processors.

MariaDB Server on Microsoft Windows was not affected by this.

While the 512-byte sector size of the venerable Seagate ST-225 is still
in widespread use, the minimum innodb_page_size is 4096 bytes, and
innodb_log_file_size can be set in integer multiples of 65536 bytes.

The only occasion where InnoDB uses smaller data file block sizes than
4096 bytes is with ROW_FORMAT=COMPRESSED tables with KEY_BLOCK_SIZE=1
or KEY_BLOCK_SIZE=2 (or innodb_page_size=4096). For such tables,
we will from now on preallocate space in integer multiples of 4096 bytes
and let regular writes extend the file by 1024, 2048, or 3072 bytes.

The view INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES.FS_BLOCK_SIZE
should report the raw st_blksize.

For page_compressed tables, the function fil_space_get_block_size()
will map to 512 any st_blksize value that is larger than 4096.

os_file_set_size(): Assume that the file system block size is 4096 bytes,
and only support extending files to integer multiples of 4096 bytes.

fil_space_extend_must_retry(): Round down the preallocation size to
an integer multiple of 4096 bytes.
2021-09-10 19:15:41 +03:00
Marko Mäkelä
eb2f2c1e5f MDEV-26547 fixup: Wait for read completion
buf_load(): Wait for the submitted reads to finish before updating
innodb_buffer_pool_load_status.
2021-09-07 08:55:08 +03:00
Oleksandr Byelkin
ae6bdc6769 Merge branch '10.4' into 10.5 2021-07-31 23:19:51 +02:00
Oleksandr Byelkin
7841a7eb09 Merge branch '10.3' into 10.4 2021-07-31 22:59:58 +02:00
Marko Mäkelä
f50eb0d398 Merge 10.2 into 10.3 2021-07-27 10:47:17 +03:00
Marko Mäkelä
da094188f6 MDEV-24393 InnoDB disregards --skip-external-locking
On POSIX systems, InnoDB would unconditionally acquire advisory locks
on the files that it opens. On Linux, this would be observable by
a large number of entries in /proc/locks.

Other storage engines would only acquire advisory locks on files
based on the Boolean configuration parameter external_locking.

Let InnoDB do the same.

NOTE: The --skip-external-locking is activated by default. To have
InnoDB acquire advisory locks, --external-locking must be specified.

Reviewed by: Sergei Golubchik
2021-07-27 08:52:59 +03:00
Marko Mäkelä
15dcb8bd3e Merge 10.4 into 10.5 2021-07-02 13:02:26 +03:00
Sergei Petrunia
eebe2090c8 Merge 10.3 -> 10.4 2021-06-30 18:41:46 +03:00