When calculate_cond_selectivity_for_table() takes into account multi-
column selectivities from range access, it tries to take-into account
that selectivity for some columns may have been already taken into account.
For example, for range access on IDX1 using {kp1, kp2}, the selectivity
of restrictions on "kp2" might have already been taken into account
to some extent.
So, the code tries to "discount" that using rec_per_key[] estimates.
This seems to be wrong and unreliable: the "discounting" may produce a
rselectivity_multiplier number that hints that the overall selectivity
of range access on IDX1 was greater than 1.
Do a conservative fix: if we arrive at conclusion that selectivity of
range access on condition in IDX1 >1.0, clip it down to 1.
storage/connect/tabfmt.cpp:419:24: error: '%.3d' directive writing between 3 and 10 bytes into a region of size 5 [-Werror=format-overflow=]
419 | sprintf(buf, "COL%.3d", i+1);
row_purge_reset_trx_id(): Reserve large enough offsets for accomodating
the maximum width PRIMARY KEY followed by DB_TRX_ID,DB_ROLL_PTR.
Reviewed by: Thirunarayanan Balathandayuthapani
purge_sys_t::get_page(): Avoid accessing a freed reference to pages[id]
after pages.erase(id). This heap-use-after-free would sometimes be
caught by AddressSanitizer.
purge_sys_t::iterator::free_history_rseg(): Do not crash if undo=nullptr
(the database is corrupted).
Reviewed by: Debarun Banerjee
Analysis:
The value gets appended as string instead of unescaped json value
Fix:
Append the value of json in a temporary string and then store it in the
field instead of directly storing as string.
Another chance for cutting back overhead due to C++ exceptions being
enabled; the `dict_sys_t` class is a good candidate because its
locking methods are called frequently.
Binary size reduction this time:
text data bss dec hex filename
24448622 2436488 9473537 36358647 22ac9f7 build/release/sql/mariadbd
24448474 2436488 9473601 36358563 22ac9a3 build/release/sql/mariadbd
MariaDB is compiled with C++ exceptions enabled, and that disallows
some optimizations (e.g. the stack must always be unwinding-safe). By
adding `noexcept` to functions that are guaranteed to never throw,
some of these optimizations can be regained. Low-level locking
functions that are called often are a good candidate for this.
This shrinks the executable a bit (tested with GCC 14 on aarch64):
text data bss dec hex filename
24448910 2436488 9473185 36358583 22ac9b7 build/release/sql/mariadbd
24448622 2436488 9473537 36358647 22ac9f7 build/release/sql/mariadbd
Don't allow the referencing key column from NULL TO NOT NULL
when
1) Foreign key constraint type is ON UPDATE SET NULL
2) Foreign key constraint type is ON DELETE SET NULL
3) Foreign key constraint type is UPDATE CASCADE and referenced
column declared as NULL
Don't allow the referenced key column from NOT NULL to NULL
when foreign key constraint type is UPDATE CASCADE
and referencing key columns doesn't allow NULL values
get_foreign_key_info(): InnoDB sends the information about
nullability of the foreign key fields and referenced key fields.
fk_check_column_changes(): Enforce the above rules for COPY
algorithm
innobase_check_foreign_drop_col(): Checks whether the dropped
column exists in existing foreign key relation
innobase_check_foreign_low() : Enforce the above rules for
INPLACE algorithm
dict_foreign_t::check_fk_constraint_valid(): This is used
by CREATE TABLE statement to check nullability for foreign
key relation.
The commit cd5808eb introduced a union as a storage for the format
argument passed to the internal API fmt::detail::make_arg. This was done
to solve the issue that the internal API no longer accepted temporary
variables.
However, it's generally better to avoid using internal APIs, as they are
more likely to have breaking changes in the future. Instead, we can use
the public API fmt::dynamic_format_arg_store to dynamically build the
argument list. This API accepts temporary variables, and its behavior is
more stable than the internal API. `libfmt.cmake` is updated to reflect
the change as well.
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
The method was declared to return an unsigned integer, but it is
really a boolean (and used as such by all callers).
A secondary change is the addition of "const" and "noexcept" to this
method.
In ha_mroonga.cpp, I also added "inline" to the two helper methods of
referenced_by_foreign_key(). This allows the compiler to flatten the
method.
We have found that my_errno can be "passed" to the next commad in some cases.
It is practically impossible to check/fix all cases of my_errno in the server,
plugins and engines so we will reset it as we reset other errors.
The test case will be fixed by CSV engine fix so will be added with it
(see part2).
log_file_t::read(), log_file_t::write(): Invoke pread() or pwrite()
directly, so that we can give more accurate diagnostics in case of
a failure, and so that we will avoid the overhead of setting up 5(!)
stack frames and related objects.
tpool::pwrite(): Add a missing const qualifier.
Added new test scenario in galera.galera_bf_kill
test to make the issue surface. The tetst scenario has
a multi statement transaction containing a KILL command.
When the KILL is submitted, another transaction is
replicated, which causes BF abort for the KILL command
processing. Handling BF abort rollback while executing
KILL command causes node hanging, in this scenario.
sql_kill() and sql_kill_user() functions have now fix,
to perform implicit commit before starting the KILL command
execution. BEcause of the implicit commit, the KILL execution
will not happen inside transaction context anymore.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
RESET MASTER waits for storage engines to reply to a binlog checkpoint
requests. If this response is delayed for a long time for some reason, then
RESET MASTER can hang.
Fix this by forcing a log sync in all engines just before waiting for the
checkpoint reply.
(Waiting for old checkpoint responses is needed to preserve durability of
any commits that were synced to disk in the to-be-deleted binlog but not yet
synced in the engine.)
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
(Polished initial patch by Alexey Botchkov)
Make the code handle DEFAULT values of any datatype
- Make Json_table_column::On_response::m_default be Item*, not LEX_STRING.
- Change the parser to use string literal non-terminals for producing
the DEFAULT value
-- Also, stop updating json_table->m_text_literal_cs for the DEFAULT
value literals as it is not used.
A server that was running with innodb_log_file_size=96M and
innodb_buffer_pool_size=6M had inserted some data into a table
that was subsequently dropped. When the server was killed and
restarted, an assertion failed in recv_sys_t::parse() while
a FSP_SIZE change was unnecessarily being processed during
the skip_the_rest: loop in recv_scan_log().
The ib_logfile0 contents was as follows:
1. The checkpoint start LSN points to the start of some mini-transaction.
2. There may be log records for modifying files for which a FILE_MODIFY
had been written before the checkpoint. These records were "purged"
by advancing the checkpoint.
3. At some point during the initial parsing with store=true the space
reserved for recv_sys.pages will run out and recv_scan_log() would switch
to the skip_the_rest: mode.
4. We encounter a log record for extending a tablespace that will be
deleted a bit later. This would trip the bogus debug assertion.
5. Later on, there would be a FILE_DELETE record for this tablespace.
6. The checkpoint end LSN points to a possibly empty sequence of
FILE_MODIFY records and a FILE_CHECKPOINT record. Recovery had parsed these
records first, before rewinding to the checkpoint start LSN.
7. There could be further records following the FILE_CHECKPOINT record.
Recovery will process all records until an inconsistency is found and
it is assumed that the end of the circular ib_logfile0 was reached.
recv_sys_t::parse(): For the template instantiation with store=false,
remove a debug assertion that could fail in a multi-batch recovery,
while recv_scan_log(false) would be in the skip_the_rest: loop.
It is very well possible that we have not encountered all FILE_ records
yet, and therefore we should not complain about unknown tablespaces.
Reviewed by: Debarun Banerjee
When using the default innodb_log_buffer_size=2m, mariadb-backup --backup
would spend a lot of time re-reading and re-parsing the log. For reads,
it would be beneficial to memory-map the entire ib_logfile0 to the
address space (typically 48 bits or 256 TiB) and read it from there,
both during --backup and --prepare.
We will introduce the Boolean read-only parameter innodb_log_file_mmap
that will be OFF by default on most platforms, to avoid aggressive
read-ahead of the entire ib_logfile0 in when only a tiny portion would be
accessed. On Linux and FreeBSD the default is innodb_log_file_mmap=ON,
because those platforms define a specific mmap(2) option for enabling
such read-ahead and therefore it can be assumed that the default would
be on-demand paging. This parameter will only have impact on the initial
InnoDB startup and recovery. Any writes to the log will use regular I/O,
except when the ib_logfile0 is stored in a specially configured file system
that is backed by persistent memory (Linux "mount -o dax").
We also experimented with allowing writes of the ib_logfile0 via a
memory mapping and decided against it. A fundamental problem would be
unnecessary read-before-write in case of a major page fault, that is,
when a new, not yet cached, virtual memory page in the circular
ib_logfile0 is being written to. There appears to be no way to tell
the operating system that we do not care about the previous contents of
the page, or that the page fault handler should just zero it out.
Many references to HAVE_PMEM have been replaced with references to
HAVE_INNODB_MMAP.
The predicate log_sys.is_pmem() has been replaced with
log_sys.is_mmap() && !log_sys.is_opened().
Memory-mapped regular files differ from MAP_SYNC (PMEM) mappings in the
way that an open file handle to ib_logfile0 will be retained. In both
code paths, log_sys.is_mmap() will hold. Holding a file handle open will
allow log_t::clear_mmap() to disable the interface with fewer operations.
It should be noted that ever since
commit 685d958e38 (MDEV-14425)
most 64-bit Linux platforms on our CI platforms
(s390x a.k.a. IBM System Z being a notable exception) read and write
/dev/shm/*/ib_logfile0 via a memory mapping, pretending that it is
persistent memory (mount -o dax). So, the memory mapping based log
parsing that this change is enabling by default on Linux and FreeBSD
has already been extensively tested on Linux.
::log_mmap(): If a log cannot be opened as PMEM and the desired access
is read-only, try to open a read-only memory mapping.
xtrabackup_copy_mmap_snippet(), xtrabackup_copy_mmap_logfile():
Copy the InnoDB log in mariadb-backup --backup from a memory
mapped file.
SSL_CTX_set_ciphersuites() sets the TLSv1.3 cipher suites.
SSL_CTX_set_cipher_list() sets the ciphers for TLSv1.2 and below.
The current TLS configuration logic will not perform SSL_CTX_set_cipher_list()
to configure TLSv1.2 ciphers if the call to SSL_CTX_set_ciphersuites() was
successful. The call to SSL_CTX_set_ciphersuites() is successful if any TLSv1.3
cipher suite is passed into `--ssl-cipher`.
This is a potential security vulnerability because users trying to restrict
specific secure ciphers for TLSv1.3 and TLSv1.2, would unknowingly still have
the database support insecure TLSv1.2 ciphers.
For example:
If setting `--ssl_cipher=TLS_AES_128_GCM_SHA256:ECDHE-RSA-AES128-GCM-SHA256`,
the database would still support all possible TLSv1.2 ciphers rather than only
ECDHE-RSA-AES128-GCM-SHA256.
The solution is to execute both SSL_CTX_set_ciphersuites() and
SSL_CTX_set_cipher_list() even if the first call succeeds.
This allows the configuration of exactly which TLSv1.3 and TLSv1.2 ciphers to
support.
Note that there is 1 behavior change with this. When specifying only TLSv1.3
ciphers to `--ssl-cipher`, the database will not support any TLSv1.2 cipher.
However, this does not impose a security risk and considering TLSv1.3 is the
modern protocol, this behavior should be fine.
All TLSv1.3 ciphers are still supported if only TLSv1.2 ciphers are specified
through `--ssl-cipher`.
All new code of the whole pull request, including one or several files that are
either new files or modified ones, are contributed under the BSD-new license. I
am contributing on behalf of my employer Amazon Web Services, Inc.
It's read for every command execution, and during slave replication
for every applied event.
It's also planned to be used during write set applying, so it means
mostly every server thread is going to compete for the mutex covering
this variable, especially considering how rarely it changes.
Converting wsrep_ready to atomic relaxes the things.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Add wait_until_ready waits after wsrep_on is set on again to
make sure that node is ready for next step before continuing.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Stabilize test by reseting DEBUG_SYNC and add wait_condition
for expected table contents.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
The crash report terminates prematurely when Galera library was
not loaded.
As a fix, check whether the provider is loaded before shutting down
Galera connections.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Move memory allocations performed during Sys_var_gtid_binlog_state::do_check
to Sys_var_gtid_binlog_state::global_update where they will be freed before
the latter method returns.
A call to
dbug_print_join_prefix(join_positions, idx, s)
returns a const char* ponter to string with current join prefix,
including the table being added to it.
All the options that where in buildbot, should
be in the server making it accessible to all
without any special invocation.
If WITH_MSAN=ON, we want to make sure that the
compiler options are supported and it will result
in an error if not supported.
We make the -WITH_MSAN=ON append -stdlib=libc++
to the CXX_FLAGS if supported.
With SECURITY_HARDENING options the bootstrap
currently crashes, so for now, we disable SECRUITY_HARDENING
if there is MSAN enable.
Option WITH_DBUG_TRACE has no effect in MSAN builds.
Each time a listener socket becomes ready, MariaDB calls accept() ten
times (MAX_ACCEPT_RETRY), even if all but the first one return EAGAIN
because there are no more connections. This causes unnecessary CPU
usage - on our server, the CPU load of that thread, which does nothing
but accept(), saturates one CPU core by ~45%. The loop should stop
after the first EAGAIN.
Perf report:
11.01% mariadbd libc.so.6 [.] accept4
6.42% mariadbd [kernel.kallsyms] [k] finish_task_switch.isra.0
5.50% mariadbd [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
5.50% mariadbd [kernel.kallsyms] [k] syscall_enter_from_user_mode
4.59% mariadbd [kernel.kallsyms] [k] __fget_light
3.67% mariadbd [kernel.kallsyms] [k] kmem_cache_alloc
2.75% mariadbd [kernel.kallsyms] [k] fput
2.75% mariadbd [kernel.kallsyms] [k] mod_objcg_state
1.83% mariadbd [kernel.kallsyms] [k] __inode_wait_for_writeback
1.83% mariadbd [kernel.kallsyms] [k] __sys_accept4
1.83% mariadbd [kernel.kallsyms] [k] _raw_spin_unlock_irq
1.83% mariadbd [kernel.kallsyms] [k] alloc_inode
1.83% mariadbd [kernel.kallsyms] [k] call_rcu
Applied SR transaction on the child table was not BF aborted by TOI running
on the parent table for several reasons:
Although SR correctly collected FK-referenced keys to parent, TOI in Galera
disregards common certification index and simply sets itself to depend on
the latest certified write set seqno.
Since this write set was the fragment of SR transaction, TOI was allowed to
run in parallel with SR presuming it would BF abort the latter.
At the same time, DML transactions in the server don't grab MDL locks on
FK-referenced tables, thus parent table wasn't protected by an MDL lock from
SR and it couldn't provoke MDL lock conflict for TOI to BF abort SR transaction.
In InnoDB, DDL transactions grab shared MDL locks on child tables, which is not
enough to trigger MDL conflict in Galera.
InnoDB-level Wsrep patch didn't contain correct conflict resolution logic due to
the fact that it was believed MDL locking should always produce conflicts correctly.
The fix brings conflict resolution rules similar to MDL-level checks to InnoDB,
thus accounting for the problematic case.
Apart from that, wsrep_thd_is_SR() is patched to return true only for executing
SR transactions. It should be safe as any other SR state is either the same as
for any single write set (thus making the two logically equivalent), or it reflects
an SR transaction as being aborting or prepared, which is handled separately in
BF-aborting logic, and for regular execution path it should not matter at all.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>