Making changes to wsrep_mysqld.h causes large parts of server code to
be recompiled. The reason is that wsrep_mysqld.h is included by
sql_class.h, even tough very little of wsrep_mysqld.h is needed in
sql_class.h. This commit introduces a new header file, wsrep_on.h,
which is meant to be included from sql_class.h, and contains only
macros and variable declarations used to determine whether wsrep is
enabled.
Also, header wsrep.h should only contain definitions that are also
used outside of sql/. Therefore, move WSREP_TO_ISOLATION* and
WSREP_SYNC_WAIT macros to wsrep_mysqld.h.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
Summary of changes
- MD_CTX_SIZE is increased
- EVP_CIPHER_CTX_buf_noconst(ctx) does not work anymore, points
to nobody knows where. The assumption made previously was that
(since the function does not seem to be documented)
was that it points to the last partial source block.
Add own partial block buffer for NOPAD encryption instead
- SECLEVEL in CipherString in openssl.cnf
had been downgraded to 0, from 1, to make TLSv1.0 and TLSv1.1 possible
(according to https://github.com/openssl/openssl/blob/openssl-3.0.0/NEWS.md
even though the manual for SSL_CTX_get_security_level claims that it
should not be necessary)
- Workaround Ssl_cipher_list issue, it now returns TLSv1.3 ciphers,
in addition to what was set in --ssl-cipher
- ctx_buf buffer now must be aligned to 16 bytes with openssl(
previously with WolfSSL only), ot crashes will happen
- updated aes-t , to be better debuggable
using function, rather than a huge multiline macro
added test that does "nopad" encryption piece-wise, to test
replacement of EVP_CIPHER_CTX_buf_noconst
part of MDEV-29000
The purpose of the compress() wrapper my_compress_buffer() was twofold:
silence Valgrind warnings about uninitialized memory access before
zlib 1.2.4, and have PERFORMANCE_SCHEMA instrumentation of some zlib
related memory allocation. Because of PERFORMANCE_SCHEMA, we cannot
trivially replace my_compress_buffer() with compress().
az_open(): Remove a crc32() call. Any CRC of the empty string is 0.
tv_usec is a (suseconds_t) so we cast to it. Prevents the AIX(gcc-10) warning:
include/my_time.h: In function 'void my_timeval_trunc(timeval*, uint)':
include/my_time.h:249:65: warning: conversion from 'long int' to 'suseconds_t' {aka 'int'} may change value [-Wconversion]
249 | tv->tv_usec-= my_time_fraction_remainder(tv->tv_usec, decimals);
|
macOS is: conversion from 'long int' to '__darwin_suseconds_t' {aka 'int'} may change value
On Windows suseconds_t isn't defined so we use the existing
long return type of my_time_fraction_remainder.
Reviewed by Marko Mäkelä
Closes: #2079
The MemorySanitizer implementation in clang includes some built-in
instrumentation (interceptors) for GNU libc. In GNU libc 2.33, the
interface to the stat() family of functions was changed. Until the
MemorySanitizer interceptors are adjusted, any MSAN code builds
will act as if that the stat() family of functions failed to initialize
the struct stat.
A fix was applied in
https://reviews.llvm.org/rG4e1a6c07052b466a2a1cd0c3ff150e4e89a6d87a
but it fails to cover the 64-bit variants of the calls.
For now, let us work around the MemorySanitizer bug by defining
and using the macro MSAN_STAT_WORKAROUND().
The sys_var class has the deprecation_substitute member to mark the
deprecated variables. As it's set, the server produces warnings when
these variables are used. However, the plugin has no means to utilize
that functionality.
So, the PLUGIN_VAR_DEPRECATED flag is introduced to set the
deprecation_substitute with the empty string. A non-empty string can
make the warning more informative, but there's no nice way seen to
specify it, and not that needed at the moment.
- compile wolfcrypt with kdf.c, to avoid undefined symbols in tls13.c
- define WOLFSSL_HAVE_ERROR_QUEUE to avoid endless loop SSL_get_error
- Do not use SSL_CTX_set_tmp_dh/get_dh2048, this would require additional
compilation options in WolfSSL. Disable it for WolfSSL build, it works
without it anyway.
- fix "macro already defined" Windows warning.
The InnoDB DATA DIRECTORY attribute is not implemented via
symbolic links but something similar, *.isl files that contain
the names of data files.
InnoDB failed to ignore the DATA DIRECTORY attribute even though
the server was started with --skip-symbolic-links.
Native ALTER TABLE in InnoDB will retain the DATA DIRECTORY attribute
of the table, no matter if the table will be rebuilt or not.
Generic ALTER TABLE (with ALGORITHM=COPY) as well as TRUNCATE TABLE
will discard the DATA DIRECTORY attribute.
All tests have been run with and without the ./mtr option
--mysqld=--skip-symbolic-links
and some tests that use the InnoDB DATA DIRECTORY attribute
have been adjusted for this.
SET PASSWORD = PASSWORD('foo') would fail for pam plugin with
ERROR HY000: SET PASSWORD is ignored for users authenticating via pam plugin
but SET PASSWORD = 'foo' would not.
Now it will.
LIMIT history switching requires the number of history partitions to
be marked for read: from first to last non-empty plus one empty. The
least we can do is to fail with error message if the needed partition
was not marked for read. As this is handler interface we require new
handler error code to display user-friendly error message.
Switching by INTERVAL works out-of-the-box with
ER_ROW_DOES_NOT_MATCH_GIVEN_PARTITION_SET error.
This commit contains a fix, where the replication write set for a CREATE TABLE
will contain, as certification keys, table names for all FK references.
With this, all DML for the FK parent tables will conflict with the CREATE TABLE
statement.
There is also new test galera.MDEV-27276 to verify the fix.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
This commit has a mtr test where two two transactions delete a row from
two separate tables, which will cascade a FK delete for the same row in
a third table. Second replica node is configured with 2 applier threads,
and the test will fail if these two transactions are applied in parallel.
The actual fix, in this commit, is to mark a transaction as unsafe for
parallel applying when it traverses into cascade delete operation.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
ALTER TABLE IMPORT doesn't properly handle instant alter metadata.
This patch makes IMPORT read, parse and apply instant alter metadata at the
very beginning of operation. So, cases when source table has some metadata
and destination table doesn't have it now works fine.
DISCARD already removes instant metadata so importing normal table into
instant table worked fine before this patch.
decrypt_decompress(): decrypts and decompresses page if needed
handle_instant_metadata(): this should be the first thing to read source
table. Basically, it applies instant metadata to a destination
dict_table_t object. This is the first thing to read FSP flags so
all possible checks of it were moved to this function.
PageConverter::update_index_page(): it doesn't now read instant metadata.
This logic were moved into handle_instant_metadata()
row_import::match_flags(): this is a first part row_import::match_schema().
As a separate function it's used by handle_instant_metadata().
fil_space_t::is_full_crc32_compressed(): added convenient function
ha_innobase::discard_or_import_tablespace(): do not reload table definition
to read instant metadata because handle_instant_metadata() does it better.
The reverted code was originally added in
4e7ee166a9
ANONYMOUS_VAR: this is a handy thing to use along with make_scope_exit()
full_crc32_import.test shows different results, because no
dict_table_close() and dict_table_open_on_id() happens.
Thus, SHOW CREATE TABLE shows a little bit older table definition.
The crash happened because my_isalnum() does not support character
sets with mbminlen>1.
The value of "ft_boolean_syntax" is converted to utf8 in do_string_check().
So calling my_isalnum() is combination with "default_charset_info" was wrong.
Adding new parameters (size_t length, CHARSET_INFO *cs) to
ft_boolean_check_syntax_string() and passing self->charset(thd)
as the character set.
Create minidump when server fails to shutdown. If process is being
debugged, cause a debug break.
Moves some code which is part of safe_kill into mysys, as both safe_kill,
and mysqltest produce minidumps on different timeouts.
Small cleanup in wait_until_dead() - replace inefficient loop with a single
wait.
Problem was that there was extra condition !thd->lex->no_write_to_binlog
before call to begin TOI. It seems that this variable is not initialized.
TRUNCATE does not support [NO_WRITE_TO_BINLOG | LOCAL] keywords, thus
we should not check this condition. All this was hidden in a macro,
so I decided to remove those macros that were used only a few places
with actual function calls.
This bug happens when one runs aria_chk on multiple tables. It does not
affect REPAIR TABLE.
aria_chk tries to optimize the sort buffer size to minimize memory usage
when used with small tables. The bug was that the adjusted value was
used as a base for the next table, which could cause problems.
init_mutex_v1_t: Stop lying that the mutex parameter is const.
GCC 11.2.0 assumes that it is and could complain about any mysql_mutex_t
being uninitialized even after mysql_mutex_init() as long as
PLUGIN_PERFSCHEMA is enabled.
init_rwlock_v1_t, init_cond_v1_t: Remove untruthful const qualifiers.
Note: init_socket_v1_t is expecting that the socket fd has already
been created before PSI_SOCKET_CALL(init_socket), and therefore that
parameter really is being treated as a pointer to const.
On POSIX systems, InnoDB would unconditionally acquire advisory locks
on the files that it opens. On Linux, this would be observable by
a large number of entries in /proc/locks.
Other storage engines would only acquire advisory locks on files
based on the Boolean configuration parameter external_locking.
Let InnoDB do the same.
NOTE: The --skip-external-locking is activated by default. To have
InnoDB acquire advisory locks, --external-locking must be specified.
Reviewed by: Sergei Golubchik
In commit 5f22511e35 we depend on
Total Store Ordering. For correct operation on ISAs that implement
weaker memory ordering, we must explicitly use release/acquire stores
and loads on buf_page_t::oldest_modification_ to prevent a race condition
when buf_page_t::list does not happen to be on the same cache line.
buf_page_t::clear_oldest_modification(): Assert that the block is
not in buf_pool.flush_list, and use std::memory_order_release.
buf_page_t::oldest_modification_acquire(): Read oldest_modification_
with std::memory_order_acquire. In this way, if the return value is 0,
the caller may safely assume that it will not observe the buf_page_t
as being in buf_pool.flush_list, even if it is not holding
buf_pool.flush_list_mutex.
buf_flush_relocate_on_flush_list(), buf_LRU_free_page():
Invoke buf_page_t::oldest_modification_acquire().
Debian has support for 3 different arm machine and probably one which is
failing is a pretty old version which doesn't support the said instruction.
So we can make it specific to _aarch64_ (as per the arm official reference
manual ISB should be defined by all ARMv8 processors). [as per the ARMv7 specs
even 32 bits processor should support it but not sure which exact version
Debian has under armel].
In either case, the said performance issue will have an impact mainly with a
high-end processors with extreme parallelism so it safe to limit it to _aarch64_.
Corrects: MDEV-24630
The easiest way to compile and test the server with UBSAN is to run:
./BUILD/compile-pentium64-ubsan
and then run mysql-test-run.
After this commit, one should be able to run this without any UBSAN
warnings. There is still a few compiler warnings that should be fixed
at some point, but these do not expose any real bugs.
The 'special' cases where we disable, suppress or circumvent UBSAN are:
- ref10 source (as here we intentionally do some shifts that UBSAN
complains about.
- x86 version of optimized int#korr() methods. UBSAN do not like unaligned
memory access of integers. Fixed by using byte_order_generic.h when
compiling with UBSAN
- We use smaller thread stack with ASAN and UBSAN, which forced me to
disable a few tests that prints the thread stack size.
- Verifying class types does not work for shared libraries. I added
suppression in mysql-test-run.pl for this case.
- Added '#ifdef WITH_UBSAN' when using integer arithmetic where it is
safe to have overflows (two cases, in item_func.cc).
Things fixed:
- Don't left shift signed values
(byte_order_generic.h, mysqltest.c, item_sum.cc and many more)
- Don't assign not non existing values to enum variables.
- Ensure that bool and enum values are properly initialized in
constructors. This was needed as UBSAN checks that these types has
correct values when one copies an object.
(gcalc_tools.h, ha_partition.cc, item_sum.cc, partition_element.h ...)
- Ensure we do not called handler functions on unallocated objects or
deleted objects.
(events.cc, sql_acl.cc).
- Fixed bugs in Item_sp::Item_sp() where we did not call constructor
on Query_arena object.
- Fixed several cast of objects to an incompatible class!
(Item.cc, Item_buff.cc, item_timefunc.cc, opt_subselect.cc, sql_acl.cc,
sql_select.cc ...)
- Ensure we do not do integer arithmetic that causes over or underflows.
This includes also ++ and -- of integers.
(Item_func.cc, Item_strfunc.cc, item_timefunc.cc, sql_base.cc ...)
- Added JSON_VALUE_UNITIALIZED to json_value_types and ensure that
value_type is initialized to this instead of to -1, which is not a valid
enum value for json_value_types.
- Ensure we do not call memcpy() when second argument could be null.
- Fixed that Item_func_str::make_empty_result() creates an empty string
instead of a null string (safer as it ensures we do not do arithmetic
on null strings).
Other things:
- Changed struct st_position to an OBJECT and added an initialization
function to it to ensure that we do not copy or use uninitialized
members. The change to a class was also motived that we used "struct
st_position" and POSITION randomly trough the code which was
confusing.
- Notably big rewrite in sql_acl.cc to avoid using deleted objects.
- Changed in sql_partition to use '^' instead of '-'. This is safe as
the operator is either 0 or 0x8000000000000000ULL.
- Added check for select_nr < INT_MAX in JOIN::build_explain() to
avoid bug when get_select() could return NULL.
- Reordered elements in POSITION for better alignment.
- Changed sql_test.cc::print_plan() to use pointers instead of objects.
- Fixed bug in find_set() where could could execute '1 << -1'.
- Added variable have_sanitizer, used by mtr. (This variable was before
only in 10.5 and up). It can now have one of two values:
ASAN or UBSAN.
- Moved ~Archive_share() from ha_archive.cc to ha_archive.h and marked
it virtual. This was an effort to get UBSAN to work with loaded storage
engines. I kept the change as the new place is better.
- Added in CONNECT engine COLBLK::SetName(), to get around a wrong cast
in tabutil.cpp.
- Added HAVE_REPLICATION around usage of rgi_slave, to get embedded
server to compile with UBSAN. (Patch from Marko).
- Added #ifdef for powerpc64 to avoid a bug in old gcc versions related
to integer arithmetic.
Changes that should not be needed but had to be done to suppress warnings
from UBSAN:
- Added static_cast<<uint16_t>> around shift to get rid of a LOT of
compiler warnings when using UBSAN.
- Had to change some '/' of 2 base integers to shift to get rid of
some compile time warnings.
Reviewed by:
- Json changes: Alexey Botchkov
- Charset changes in ctype-uca.c: Alexander Barkov
- InnoDB changes & Embedded server: Marko Mäkelä
- sql_acl.cc changes: Vicențiu Ciorbaru
- build_explain() changes: Sergey Petrunia
SI_USER is, however in FreeBSD there are a couple of non-kernel
user signal infomations above SI_KERNEL.
Put a fallback just in case there is nothing available.
This was because of a wrong test in encryption code that wrote random
numbers over the LSN for pages for transactional Aria tables during repair.
The effect was that after an ALTER TABLE ENABLE KEYS of a encrypted
recovery of the tables would not work.
Fixed by changing testing of !share->now_transactional to
!share->base.born_transactional.
Other things:
- Extended Aria check_table() to check for wrong (= too big) LSN numbers.
- If check_table() failed just because of wrong LSN or TRN numbers,
a following repair table will just do a zerofill which is much faster.
- Limit number of LSN errors in one check table to MAX_LSN_ERROR (10).
- Removed old obsolete test of 'if (error_count & 2)'. Changed error_count
and warning_count from bits to numbers of errors/warnings as this is
more useful.
memory barrier on ARM
As suggested in the said JIRA ticket based on the contribution done by
the community (in an attempt to optimize the spin-loop) the said approach
was evaluated against MariaDB Server 10.5 and found to help improve
throughput in the range of 2-5%.
Note: 10.6 timing graph and model are different as home-brew
mutexes are replaced with pthread mutexes. Said patch has mixed
impact on 10.6 so not recommended for 10.6.
As pointed out by Andrei Elkin, the previous fix did not fix one
race condition that may have caused the observed hang.
innodb_log_flush_request(): If we are enqueueing the very first
request at the same time the log write is being completed,
we must ensure that a near-concurrent call to log_flush_notify()
will not result in a missed notification. We guarantee this by
release-acquire operations on log_requests.start and
log_sys.flushed_to_disk_lsn.
log_flush_notify_and_unlock(): Cleanup: Always release the mutex.
log_sys_t::get_flushed_lsn(): Use acquire memory order.
log_sys_t::set_flushed_lsn(): Use release memory order.
log_sys_t::set_lsn(): Use release memory order.
log_sys_t::get_lsn(): Use relaxed memory order by default, and
allow the caller to specify acquire memory order explicitly.
Whenever the log_sys.mutex is being held or when log writes are
prohibited during startup, we can use a relaxed load. Likewise,
in some assertions where reading a stale value of log_sys.lsn
should not matter, we can use a relaxed load.
This will cause some additional instructions to be emitted on
architectures that do not implement Total Store Ordering (TSO),
such as POWER, ARM, and RISC-V Weak Memory Ordering (RVWMO).
HAVE_valgrind_or_MSAN to HAVE_valgrind was incorrect in
af784385b4.
In my_valgrind.h when clang exists (hence no __has_feature(memory_sanitizer),
and -DWITH_VALGRIND=1, but without memcheck.h, we end up with a MEM_CHECK_DEFINED
being empty.
If we are also doing a CMAKE_BUILD_TYPE=Debug this results a number of
[-Werror,-Wunused-variable] errors because MEM_CHECK_DEFINED is empty.
With MEM_CHECK_DEFINED empty, there becomes no uses of this of the
fixed field and innodb variables in this patch.
So we stop using HAVE_valgrind as catchall and use the name
HAVE_CHECK_MEM to indicate that a CHECK_MEM_DEFINED function exists.
Reviewer: Monty
Corrects: af784385b4
Incorrect processing of an auto-incrementing field in the
WSREP-related code during applying transactions results in
a duplicate key being created. This is due to the fact that
at the beginning of the write_row() and update_row() functions,
the values of the auto-increment parameters are used, which
are read from the parameters of the current thread, but further
along the code other values are used, which are read from global
variables (when applying a transaction). This can happen when
the cluster configuration has changed while applying a transaction
(for example in the high_priority_service mode for Galera 4).
Further during IST processing duplicating key is detected, and
processing of the DB_DUPLICATE_KEY return code (inside innodb,
in the write_row() handler) results in a call to the
wsrep_thd_self_abort() function.
Like the 10.2 version 1635686b50,
except C++ on internal functions for my_assume_aligned.
volatile != atomic.
volatile has no memory barrier schemantics, its for mmaped IO
so lets allow some optimizer gains and stop pretending it helps
with memory atomicity.
The MDEV lists a SEGV an assumption is made that an address was
partially read. As C packs structs strictly in order and on arm64 the
cache line size is 128 bits. A pointer (link - 64 bits), followed
by a hashnr (uint32 - 32 bits), leaves the following key (uchar *
64 bits), neither naturally aligned to any pointer and worse, split
across a cache line which is the processors view of an atomic
reservation of memory.
lf_dynarray_lvalue is assumed to return a 64 bit aligned address.
As a solution move the 32bit hashnr to the end so we don't get the
*key pointer split across two cache lines.
Tested by: Krunal Bauskar
Reviewer: Marko Mäkelä
One should not change the program arguments!
This change also reduces warnings from the icc compiler.
Almost all changes are just syntax changes (adding const to
'get_one_option function' declarations).
Other changes:
- Added a few cast of 'argument' from 'const char*' to 'char *'. This
was mainly in calls to 'external' functions we don't have control of.
- Ensure that all reset of 'password command line argument' are similar.
(In almost all cases it was just adding a comment and a cast)
- In mysqlbinlog.cc and mysqld.cc there was a few cases that changed
the command line argument. These places where changed to instead allocate
the option in a MEM_ROOT to avoid changing the argument. Some of this
code was changed to ensure that different programs did parsing the
same way. Added a test case for the changes in mysqlbinlog.cc
- Changed a few variables that took their value from command line options
from 'char *' to 'const char *'.
Since 2017 (c2118a08b1) THD::awake() no longer requires LOCK_thd_data.
It uses LOCK_thd_kill, and this latter mutex is used to prevent
a thread of dying, not LOCK_thd_data as before.
Atomic_relaxed<T>: add fetch_or() and fetch_and()
innodb_init(): rely on a zero-initialization of a global variable
monitor_set_tbl: make Atomic_relaxed<ulint> array and use proper operations
for setting bit, unsetting bit and reading bit
Reviewed by: Marko Mäkelä
The clang++ -stdlib=libc++ header file <fstream> depends on
<filesystem> that defines a member function path::root_name(),
which conflicts with the rather unused #define root_name()
that had been introduced in
commit 7c58e97bf6.
Because an instrumented -stdlib=libc++ (rather than the default
-stdlib=libstdc++) is easier to build for a working -fsanitize=memory
(cmake -DWITH_MSAN=ON), let us remove the conflicting #define for now.
This reverts commit 6cf8f05fd9.
Original patch assumed that MAP_HUGETLB as consistent across
achitectures which isn't the case. Defining it unconditionally
broke large pages on every achitecutre where the value differed
from x86_64.
With the EOL for Centos/RHEL6 announced in 10.5.7, <3.8 linux
kernels are no longer supported.
This PR fixes same issue as MDEV-21577 for TRUNCATE TABLE.
MDEV-21577 fixed TOI replication for OPTIMIZE, REPAIR and ALTER TABLE
operating on FK child table. It was later found out that also TRUNCATE
has similar problem and needs a fix.
The actual fix is to do FK parent table lookup before TRUNCATE TOI
isolation and append found FK parent table names in certification key
list for the write set.
PR contains also new test scenario in galera_ddl_fk_conflict test where
FK child has two FK parent tables and there are two DML transactions operating
on both parent tables.
For development convenience, new TO isolation macro was added:
WSREP_TO_ISOLATION_BEGIN_IF and WSREP_TO_ISOLATION_BEGIN_ALTER macro was changed
to skip the goto statement.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
MDEV-23855 broke the handling of innodb_flush_sync=OFF.
That parameter is supposed to limit the page write rate
in case the log capacity is being exceeded and log checkpoints
are needed.
With this fix, the following should pass:
./mtr --mysqld=--loose-innodb-flush-sync=0
One of our best regression tests for page flushing is
encryption.innochecksum. With innodb_page_size=16k and
innodb_flush_sync=OFF it would likely hang without this fix.
log_sys.last_checkpoint_lsn: Declare as Atomic_relaxed<lsn_t>
so that we are allowed to read the value while not holding
log_sys.mutex.
buf_flush_wait_flushed(): Let the page cleaner perform the flushing
also if innodb_flush_sync=OFF. After the page cleaner has
completed, perform a checkpoint if it is needed, because
buf_flush_sync_for_checkpoint() will not be run if
innodb_flush_sync=OFF.
buf_flush_ahead(): Simplify the condition. We do not really care
whether buf_flush_page_cleaner() is running.
buf_flush_page_cleaner(): Evaluate innodb_flush_sync at the low
level. If innodb_flush_sync=OFF, rate-limit the batches to
innodb_io_capacity_max pages per second.
Reviewed by: Vladislav Vaintroub
Some DDL statements appear to acquire MDL locks for a table referenced by
foreign key constraint from the actual affected table of the DDL statement.
OPTIMIZE, REPAIR and ALTER TABLE belong to this class of DDL statements.
Earlier MariaDB version did not take this in consideration, and appended
only affected table in the certification key list in write set.
Because of missing certification information, it could happen that e.g.
OPTIMIZE table for FK child table could be allowed to apply in parallel
with DML operating on the foreign key parent table, and this could lead to
unhandled MDL lock conflicts between two high priority appliers (BF).
The fix in this patch, changes the TOI replication for OPTIMIZE, REPAIR and
ALTER TABLE statements so that before the execution of respective DDL
statement, there is foreign key parent search round. This FK parent search
contains following steps:
* open and lock the affected table (with permissive shared locks)
* iterate over foreign key contstraints and collect and array of Fk parent
table names
* close all tables open for the THD and release MDL locks
* do the actual TOI replication with the affected table and FK parent
table names as key values
The patch contains also new mtr test for verifying that the above mentioned
DDL statements replicate without problems when operating on FK child table.
The mtr test scenario #1, which can be used to check if some other DDL
(on top of OPTIMIZE, REPAIR and ALTER) could cause similar excessive FK
parent table locking.
Reviewed-by: Aleksey Midenkov <aleksey.midenkov@mariadb.com>
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
This follows up commit
commit 94a520ddbe and
commit 7c5519c12d.
After these changes, the default test suites on a
cmake -DWITH_UBSAN=ON build no longer fail due to passing
null pointers as parameters that are declared to never be null,
but plenty of other runtime errors remain.
There are 2 issues here:
Issue #1: memory allocation.
An IO_CACHE that uses encryption uses a larger buffer (it needs space for the encrypted data,
decrypted data, IO_CACHE_CRYPT struct to describe encryption parameters etc).
Issue #2: IO_CACHE::seek_not_done
When IO_CACHE objects are cloned, they still share the file descriptor.
This means, operation on one IO_CACHE may change the file read position
which will confuse other IO_CACHEs using it.
The fix of these issues would be:
Allocate the buffer to also include the extra size needed for encryption.
Perform seek again after one IO_CACHE reads the file.
The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.
The configuration parameters will be changed as follows:
innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)
Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows:
* When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
* As a buf_pool.free limit in buf_LRU_list_batch() for terminating
the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().
The status variables will be changed as follows:
innodb_buffer_pool_pages_flushed: This includes also the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably,
updated one by one in buf_flush_page() to give more real-time
statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.
The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call
When moving to a single buffer pool instance in MDEV-15058, we missed
some opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.
Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.
The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.
The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a unless log_warnings>2.
Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are are currently exclusively locked.
The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message
will be removed, because by design, the buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.
We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.
To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.
For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.
mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().
buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.
buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.
buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.
buf_pool_t::done_flush_list: Condition variable for !n_flush_list.
buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.
buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.
buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().
flush_counters_t::unzip_LRU_evicted: Remove.
IORequest: Make more members const. FIXME: m_fil_node should be removed.
buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).
page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.
pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().
recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().
recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.
srv_started_redo: Replaces srv_start_state.
SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().
buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().
buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
function, which was their only caller, and remove 2 unnecessary
buf_pool.mutex release/re-acquisition that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.
buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.
buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.
buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().
buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.
buf_flush_check_neighbor(): Take id.fold() as a parameter.
buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.
buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.
buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_do_LRU_batch(): Return the number of pages flushed.
buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.
buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.
buf_LRU_check_size_of_non_data_objects(): Simplify the code.
buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.
buf_page_create(): Take a pre-allocated page as a parameter.
buf_pool_t::free_block(): Free a pre-allocated block.
recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.
BtrBulk::logFreeCheck(): Skip a redundant condition.
row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.
sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.
Reviewed by: Vladislav Vaintroub
Add CRC32C code to mysys. The x86-64 implementation uses PCMULQDQ in addition to CRC32 instruction
after Intel whitepaper, and is ported from rocksdb code.
Optimized ARM and POWER CRC32 were already present in mysys.
Remove incorrect BF (brute force) handling from lock_rec_has_to_wait_in_queue
and move condition to correct callers. Add a function to report
BF lock waits and assert if incorrect BF-BF lock wait happens.
wsrep_report_bf_lock_wait
Add a new function to report BF lock wait.
wsrep_assert_no_bf_bf_wait
Add a new function to check do we have a
BF-BF wait and if we have report this case
and assert as it is a bug.
lock_rec_has_to_wait
Use new wsrep_assert_bf_wait to check BF-BF wait.
lock_rec_create_low
lock_table_create
Use new function to report BF lock waits.
lock_rec_insert_by_trx_age
lock_grant_and_move_on_page
lock_grant_and_move_on_rec
Assert that trx is not Galera as VATS is not compatible
with Galera.
lock_rec_add_to_queue
If there is conflicting lock in a queue make sure that
transaction is BF.
lock_rec_has_to_wait_in_queue
Remove incorrect BF handling. If there is conflicting
locks in a queue all transactions must wait.
lock_rec_dequeue_from_page
lock_rec_unlock
If there is conflicting lock make sure it is not
BF-BF case.
lock_rec_queue_validate
Add Galera record locking rules comment and use
new function to report BF lock waits.
All attempts to reproduce the original assertion have been
failed. Therefore, there is no test case on this commit.
This follows up MDEV-14374, which was filed against MariaDB Server 10.3.
Back then, on a 48-core Qualcomm Centriq 2400, the performance of
delay loops for spinloops was tested both with and without the dummy
compare-and-swap operation, and it was decided to keep the dummy
operation.
On target architectures where nothing special is available (other than
x86 (IA-32, AMD64) or POWER), we perform a dummy compare-and-swap operation.
This is contrary to the idea of the x86 PAUSE instruction and the
__ppc_get_timebase(), which aim to keep the memory bus idle for a while,
to allow other cores to better execute code while a spinloop is waiting
for something to be changed.
On MariaDB Server 10.4 and another implementation of the ARMv8 ISA,
omitting the dummy compare-and-swap improved performance by up to 12%.
So, let us avoid the dummy compare-and-swap on ARM.
For now, we are retaining the dummy compare-and-swap on other ISAs
(such as SPARC, MIPS, S390x, RISC-V) because we do not have any
performance data for them.
Raspberry Pi 4 supports crc32 but doesn't support pmull (MDEV-23030).
The PR #1645 offers a solution to fix this issue. But it does not consider
the condition that the target platform does support crc32 but not support PMULL.
In this condition, it should leverage the Arm64 crc32 instruction (__crc32c) and
just only skip parallel computation (pmull/vmull) rather than skip all hardware
crc32 instruction of computation.
The PR also removes unnecessary CRC32_ZERO branch in 'crc32c_aarch64' for MariaDB,
formats the indent and coding style.
Change-Id: I76371a6bd767b4985600e8cca10983d71b7e9459
Signed-off-by: Yuqi Gu <yuqi.gu@arm.com>
In 10.3, DBUG_ASSERT() may expand to something that includes
__builtin_expect(), which expects integer arguments, not pointers.
To avoid any compiler warnings, let us use an explicit rather than
implicit comparison to the null pointer.
MariaDB adopted a hardware optimized crc32c approach on ARM64 starting 10.5.
Said implementation of crc32c needs support from target hardware for crc32
and pmull instructions. Existing logic is checking only for crc32 support
from target hardware through a runtime check and so if target hardware
doesn't support pmull it would cause things to fail/crash.
Expanded runtime check to ensure pmull support is also checked on the target
hardware along with existing crc32.
Thanks to Marko and Daniel for review.
Due to restricted size of the threadpool, execution of client queries can
be delayed (queued) for a while. This delay was interpreted as client
inactivity, and connection is closed, if client idle time + queue time
exceeds wait_timeout.
But users did not expect queue time to be included into wait_timeout.
This patch changes the behavior. We don't close connection anymore,
if there is some unread data present on connection,
even if wait_timeout is exceeded. Unread data means that client
was not idle, it sent a query, which we did not have time to process yet.