There were two separate Atomic_counter<uint32_t>, purge_sys.m_SYS_paused
and purge_sys.m_FTS_paused. In purge_sys.wait_FTS() we have to read both
atomically. The solution we used for this was overkill: acquiring
purge_sys.latch and waiting 10 milliseconds between samples. To make
matters worse, the 10-millisecond wait was unconditional, which would
unnecessarily suspend the purge_coordinator_task every now and then.
It turns out that we can fold both "reference counts" into a single
Atomic_relaxed<uint32_t> and avoid the purge_sys.latch.
To assess whether std::memory_order_relaxed is acceptable, we should
consider the operations that read these "reference counts", that is,
purge_sys_t::wait_FTS(bool) and purge_sys_t::must_wait_FTS().
Outside debug assertions, purge_sys.must_wait_FTS() is only invoked in
trx_purge_table_acquire(), which is covered by a shared dict_sys.latch.
We would increment the counter as part of a DDL operation, but before
acquiring an exclusive dict_sys.latch. So, a
purge_sys_t::close_and_reopen() loop could be triggered slightly
prematurely, before a problematic DDL operation is actually executed.
Decrementing the counter is less of an issue; purge_sys.resume_FTS()
or purge_sys.resume_SYS() would mostly be invoked while holding an
exclusive dict_sys.latch; ha_innobase::delete_table() does it outside
that critical section. Still, this would only cause some extra wait in
the purge_coordinator_task, just like at the start of a DDL operation.
There are two calls to purge_sys_t::wait_FTS(bool): in the above-mentioned
purge_sys_t::close_and_reopen() and in purge_sys_t::clone_oldest_view(),
both invoked by the purge_coordinator_task. There is also a
purge_sys.clone_oldest_view<true>() call at startup when no DDL operation
can be in progress.
purge_sys_t::m_SYS_paused: Merged into m_FTS_paused, using a new
multiplier PAUSED_SYS = 65536 (see the sketch after this list).
purge_sys_t::wait_FTS(): Remove an unnecessary sleep as well as the
access to purge_sys.latch. It suffices to poll purge_sys.m_FTS_paused.
purge_sys_t::stop_FTS(): Do not acquire purge_sys.latch.
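A minimal sketch of the folded counter, using std::atomic in place of
Atomic_relaxed and free functions in place of the purge_sys_t members
(names simplified): the low 16 bits count FTS pauses, each SYS pause adds
PAUSED_SYS, and a single relaxed load observes both counts at once.

  #include <atomic>
  #include <cstdint>

  static constexpr uint32_t PAUSED_SYS = 65536;
  static std::atomic<uint32_t> paused{0};

  void pause_FTS()  { paused.fetch_add(1, std::memory_order_relaxed); }
  void pause_SYS()  { paused.fetch_add(PAUSED_SYS, std::memory_order_relaxed); }
  void resume_FTS() { paused.fetch_sub(1, std::memory_order_relaxed); }
  void resume_SYS() { paused.fetch_sub(PAUSED_SYS, std::memory_order_relaxed); }

  // Nonzero means that some FTS or SYS pause is in effect.
  bool must_wait() { return paused.load(std::memory_order_relaxed) != 0; }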
Reviewed by: Debarun Banerjee
Discovered this while working on MDEV-34720: test_if_cheaper_ordering()
uses rec_per_key, while the original estimate for the access method
is produced in best_access_path() by using actual_rec_per_key().
Make test_if_cheaper_ordering() also use actual_rec_per_key().
Also make several getter functions "const" to make this compile.
Also adjusted the test case to handle this (the change was backported
from 11.0).
buf_page_ibuf_merge_try(): A new, separate function for invoking
ibuf_merge_or_delete_for_page() when needed. Use the already requested
page latch for determining if the call is necessary. If it is and
if we are currently holding rw_latch==RW_S_LATCH, upgrading to an exclusive
latch may involve waiting until another thread has acquired and released
a U or X latch on the page. If we have to wait, we must recheck whether the
call to ibuf_merge_or_delete_for_page() is still needed; a sketch of this
recheck pattern follows the list below. If the page turns out to be
corrupted, we will release the latch and fail the operation.
Finally, the exclusive page latch will be downgraded to the originally
requested latch.
ssux_lock_impl::rd_u_upgrade_try(): Attempt to upgrade a shared lock to
an update lock.
sux_lock::s_x_upgrade_try(): Attempt to upgrade a shared lock to
exclusive.
sux_lock::s_x_upgrade(): Upgrade a shared lock to exclusive.
Return whether a wait was elided.
ssux_lock_impl::u_rd_downgrade(), sux_lock::u_s_downgrade():
Downgrade an update lock to shared.
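A sketch of the upgrade-and-recheck pattern, modeled with std::shared_mutex
and a hypothetical Page type (std::shared_mutex has no upgrade primitive,
so the slow path of dropping S and taking X is explicit): another thread
can slip in between the release and the exclusive acquisition, so the
condition must be rechecked after the wait.

  #include <atomic>
  #include <shared_mutex>

  struct Page {
    std::shared_mutex latch;
    std::atomic<bool> needs_merge{true};
  };

  void merge_if_needed(Page &p) {
    p.latch.lock_shared();
    bool needed = p.needs_merge.load(std::memory_order_relaxed);
    p.latch.unlock_shared();
    if (!needed)
      return;
    p.latch.lock();            // "upgrade": another thread may run here
    if (p.needs_merge.load(std::memory_order_relaxed)) { // recheck after wait
      /* ... perform the merge ... */
      p.needs_merge.store(false, std::memory_order_relaxed);
    }
    p.latch.unlock();          // the real code would downgrade instead
  }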
MDEV-33582 (3541bd63f0) changed the "Could not write packet"
error message in net_serv.cc to use the function
sql_print_warning(), instead of my_printf_error(). The flags
argument was not removed in this change though, so the old
flags were printed in place of the file descriptor, all
other arguments were presented in the wrong fields, and the
length was never shown.
This patch removes flags as a parameter to sql_print_warning().
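A minimal demonstration of the argument shift (hypothetical values; the
real message and fields differ): the leftover first argument pushes every
later value one format field to the right, and the last value never gets
printed.

  #include <cstdio>

  int main() {
    int flags = 4, fd = 42, error = 32, length = 100;
    /* Buggy: "flags" was left in as the first argument, so the flags print
       where the fd belongs; the extra trailing argument is ignored. */
    printf("fd: %d error: %d length: %d\n", flags, fd, error, length);
    /* Fixed: fields and arguments line up again. */
    printf("fd: %d error: %d length: %d\n", fd, error, length);
    return 0;
  }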
The original code is correct.
valgrind and asan binaries should be built with a specialized version of
mem_root that makes it easier to find memory overwrites.
This is what the BUILD scripts do.
The specialized mem_root code allocates a new block for every allocation,
which is visible to any test that depends on the original default malloc
size and usage.
Running an UPDATE statement in PS mode with positional
parameter(s) bound to an array of actual values (that is,
prepared to be run in bulk mode) results in incorrect behaviour
in the presence of an ON UPDATE trigger that also executes an UPDATE
statement. The same is true for handling a DELETE statement in the
presence of an ON DELETE trigger. Typically, the visible effect of
such incorrect behaviour is a wrong number of
updated/deleted rows in the target table. Additionally, for an UPDATE
statement, the number of modified rows and the state message returned
by the statement contain wrong information about the number of modified rows.
The reason for the incorrect number of updated/deleted rows is that
the data structure used for binding positional arguments to their
actual values is stored in THD (this is thd->bulk_param) and reused
when processing every INSERT/UPDATE/DELETE statement. This leads to
the actual values bound to the top-level UPDATE/DELETE statement being
consumed by the DML statements in the triggers' bodies.
To fix the issue, temporarily reset thd->bulk_param to nullptr
before invoking triggers and restore its value when they finish
execution.
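A minimal sketch of the save/null/restore idea with a stand-in THD (the
real patch works on the server's THD and its trigger invocation path):

  struct THD { void *bulk_param = nullptr; };

  template <typename Body>
  void invoke_triggers(THD *thd, Body &&body) {
    void *saved = thd->bulk_param;
    thd->bulk_param = nullptr; // nested DML must not consume outer bindings
    body();                    // run the trigger statements
    thd->bulk_param = saved;   // the top-level statement stays in bulk mode
  }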
The second part of the problem, relating to the wrong number of affected
rows reported via the Connector/C API, is caused by the fact that the
diagnostics area is reused by the original DML statement and by a statement
invoked by a trigger. This fact should be taken into account when finalizing
the state of the diagnostics area on completion of the statement.
Important remark: in case the macro DBUG_OFF is defined, calling the method
Diagnostics_area::reset_diagnostics_area()
resets the data members
m_affected_rows and m_statement_warn_count.
Values of these data members of the class Diagnostics_area are used when
sending OK and EOF messages. If a DML statement is executed in PS bulk
mode, such resetting results in wrong values for affected rows being sent
to the client when the DML statement fires triggers. So, reset
these data members only if the current statement being processed
is not run in bulk mode.
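A sketch of the conditional reset (member names taken from the commit text;
the surrounding class is greatly simplified, and the real method reads the
bulk-mode state from the statement rather than from a parameter):

  struct Diagnostics_area {
    unsigned long long m_affected_rows = 0;
    unsigned m_statement_warn_count = 0;
    void reset_diagnostics_area(bool is_bulk_op) {
      if (!is_bulk_op) { // keep the accumulated counts in PS bulk execution
        m_affected_rows = 0;
        m_statement_warn_count = 0;
      }
    }
  };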
When SUX_LOCK_GENERIC is defined, the srw_mutex, srw_lock, sux_lock are
implemented based on pthread_mutex_t and pthread_cond_t. This is the
only option for systems that lack a futex-like system call.
In the SUX_LOCK_GENERIC mode, if pthread_mutex_init() allocates
resources that need to be freed by pthread_mutex_destroy(),
a memory leak could occur when we repeatedly invoke
pthread_mutex_init() without a pthread_mutex_destroy() in between.
pthread_mutex_wrapper::initialized: A debug field to track whether
pthread_mutex_init() has been invoked. This also helps find bugs
like the one that was fixed by
commit 1c8af2ae53 (MDEV-34422);
one simply needs to add -DSUX_LOCK_GENERIC to the CMAKE_CXX_FLAGS
to catch that particular bug on the initial server bootstrap.
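A sketch of the debug field (assuming the server's DBUG_OFF convention;
the real wrapper differs in detail): mismatched init/destroy pairs trip
an assertion instead of silently leaking.

  #include <pthread.h>
  #include <cassert>

  struct pthread_mutex_wrapper {
  #ifndef DBUG_OFF
    bool initialized = false; // set by init(), cleared by destroy()
  #endif
    pthread_mutex_t mutex;

    void init() {
  #ifndef DBUG_OFF
      assert(!initialized);   // catch double init() (would leak the old mutex)
      initialized = true;
  #endif
      pthread_mutex_init(&mutex, nullptr);
    }

    void destroy() {
  #ifndef DBUG_OFF
      assert(initialized);    // catch destroy() of an uninitialized object
      initialized = false;
  #endif
      pthread_mutex_destroy(&mutex);
    }
  };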
buf_block_init(), buf_page_init_for_read(): Invoke block_lock::init()
because buf_page_t::init() will no longer do that.
buf_page_t::init(): Instead of invoking lock.init(), assert that it
has already been invoked (the lock is vacant).
add_fts_index(), build_fts_hidden_table(): Explicitly invoke
index_lock::init() in order to avoid a pthread_mutex_destroy()
invocation on an uninitialized object.
srw_lock_debug::destroy(): Invoke readers_lock.destroy().
trx_sys_t::create(): Invoke trx_rseg_t::init() on all rollback segments
in order to guarantee a deterministic state for shutdown, even if
InnoDB fails to start up.
trx_rseg_array_init(), trx_temp_rseg_create(), trx_rseg_create():
Invoke trx_rseg_t::destroy() before trx_rseg_t::init() in order to
balance pthread_mutex_init() and pthread_mutex_destroy() calls.
- Added plugin_debug.test, multiple_index.test to the innodb_fts suite
from mysql-5.7.
- Commit c5b28e55f6 removes the warning
for InnoDB rebuilding the table to add FTS_DOC_ID.
- In the multiple_index test case, MATCH(a) values are smaller
than in MySQL because ROLLBACK updates stat_n_rows.
- The st_mysql_ftparser_boolean_info structure conveys boolean
metadata to the MySQL search engine for every word in the query.
This structure lacks a position value to store the correct
offset of every word, so phrase search queries in the plugin_debug
test case with boolean mode for the simple parser return
wrong results.
With wsrep_sst_rsync, the node goes into an endless loop when trying
to establish a connection to the donor for IST/SST if the database
is bound to a specific IP address, not "*".
This commit fixes this problem. Separate tests are not
required - the problem can occur in normal configurations
on a number of systems when selecting a bind address other
than "*", especially on FreeBSD and with IPv6 addresses.
int wsrep_thd_append_key(THD*, const wsrep_key*, int, Wsrep_service_key_type)
CREATE TABLE [SELECT|REPLACE SELECT] is CTAS, and the idea was that
we force ROW format. However, this was not correctly enforced,
and keys were appended before the wsrep transaction was started.
In THD::decide_logging_format we should force the statement binlog
format to ROW in the CTAS case and produce a warning if the
binlog format used was not ROW.
In ha_innodb::update_row we should not append keys, similarly
to ha_innodb::write_row, if sql_command is SQLCOM_CREATE_TABLE.
Improved error logging in ::write_row, ::update_row and ::delete_row
if the wsrep key append fails.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
A mixture of a multi-byte *TEXT column and a short binary column
produced a column that was too large.
For example, COALESCE(tinytext_utf8mb4, short_varbinary)
produced a BLOB column instead of the expected TINYBLOB.
- Adding a virtual method Type_all_attributes::character_octet_length(),
returning max_length by default.
- Overriding Item_field::character_octet_length() to extract
the octet length from the underlying Field.
- Overriding Item_ref::character_octet_length() to extract
the octet length from the referenced Item (e.g. for VIEW fields).
- Fixing Type_numeric_attributes::find_max_octet_length() to
take the octet length using the new method character_octet_length()
instead of accessing max_length directly.
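One way to picture the overflow, as a toy calculation (hypothetical
numbers; TINYTEXT holds at most 255 bytes, i.e. 63 four-byte utf8mb4
characters): if the aggregation scales an already-octet length by mbmaxlen
again, 255 becomes 1020, which no longer fits TINYBLOB, while the field's
real octet length keeps it at 255.

  #include <algorithm>
  #include <cstdint>
  #include <cstdio>

  int main() {
    const uint32_t text_octets = 255;   // TINYTEXT capacity in bytes
    const uint32_t mbmaxlen = 4;        // utf8mb4
    const uint32_t binary_octets = 10;  // e.g. VARBINARY(10)

    // Wrong: treat the octet length as a character count and scale again.
    uint32_t wrong = std::max(text_octets * mbmaxlen, binary_octets); // 1020 -> BLOB
    // Right: use the field's character_octet_length() directly.
    uint32_t right = std::max(text_octets, binary_octets);            // 255 -> TINYBLOB

    printf("%u vs %u\n", wrong, right);
    return 0;
  }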
fprintf() on Windows, when used on an unbuffered FILE*, writes bytewise.
This can make crash handler messages harder to read if they are mixed up
with other error log output.
Fixed on Windows by formatting into a small buffer and using fwrite
instead of fprintf whenever the buffer is large enough for the message.
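A sketch of the approach (hypothetical helper; the real patch lives in the
crash handler): format into a stack buffer and emit a single fwrite so the
message reaches the error log in one write, falling back to fprintf when
the message does not fit.

  #include <cstdarg>
  #include <cstdio>

  void write_message(FILE *f, const char *fmt, ...) {
    char buf[1024];
    va_list args;
    va_start(args, fmt);
    int n = vsnprintf(buf, sizeof buf, fmt, args);
    va_end(args);
    if (n >= 0 && (size_t) n < sizeof buf)
      fwrite(buf, 1, (size_t) n, f); // one write instead of bytewise output
    else {
      va_start(args, fmt);           // too long for the buffer: fall back
      vfprintf(f, fmt, args);
      va_end(args);
    }
  }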
The reason is that on Windows, OpenSSL can be built with a different C
runtime than the server (e.g. Debug runtime in a debug vcpkg build).
Overwriting only malloc(), with the CRT that the server is using, can cause
a mixup of incompatible malloc() and free() inside OpenSSL.
To fix, overwrite all memory functions.
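A sketch of the idea using OpenSSL's registration hook,
CRYPTO_set_mem_functions() (OpenSSL 1.1+ signatures; error handling
omitted): all three replacements come from the same C runtime, so memory
allocated inside OpenSSL is always released by the matching free().

  #include <openssl/crypto.h>
  #include <cstdlib>

  static void *wrap_malloc(size_t n, const char *, int) { return malloc(n); }
  static void *wrap_realloc(void *p, size_t n, const char *, int)
  { return realloc(p, n); }
  static void wrap_free(void *p, const char *, int) { free(p); }

  bool init_openssl_memory() {
    /* Must run before OpenSSL allocates anything. */
    return CRYPTO_set_mem_functions(wrap_malloc, wrap_realloc, wrap_free) == 1;
  }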
Assertion `table->field[0]->ptr >= table->record[0] &&
table->field[0]->ptr <= table->record[0] + table->s->reclength' failed in
handler::assert_icp_limitations.
table->move_fields has some limitations:
1. It cannot be used in cascade.
2. It should always have a restoring pair.
Rule 1 is covered by assertions in handler::assert_icp_limitations
and handler::ptr_in_record (commit 30894fe9a9).
Rule 2 should be manually maintained with care. Hopefully, the rule 1
assertions may sometimes help as well.
In ha_myisam::repair, both rules are broken. table->move_fields is used
asymmetrically there: it is set on every param->fix_record call
(i.e. in compute_vcols) but is restored only once, at the end of repair.
The reason for updating the field pointers on every call is that
compute_vcols can (supposedly) be called in parallel, that is, with the
same table but different records.
The condition for "unmoving" the pointers in
ha_myisam::restore_vcos_after_repair is incorrect when stored vcols are
available, and myisam stores a VIRTUAL field if it is the only field in
the table (the record cannot be of zero length).
This patch solves the problem by "unmoving" the pointers symmetrically, in
compute_vcols. That is, both rules will be maintained.
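An illustrative toy (not the server's TABLE API) of what symmetric
handling buys: a scope-bound guard makes rule 2 automatic and turns
rule 1 violations into assertion failures.

  #include <cassert>
  #include <cstddef>

  struct Fields {
    std::ptrdiff_t shift = 0; // offset of field pointers from record[0]
    bool moved = false;
  };

  struct Move_guard {
    Fields &f;
    Move_guard(Fields &fields, std::ptrdiff_t shift) : f(fields) {
      assert(!f.moved);       // rule 1: moves must not cascade
      f.shift = shift;
      f.moved = true;
    }
    ~Move_guard() {           // rule 2: the restoring pair cannot be forgotten
      f.shift = 0;
      f.moved = false;
    }
  };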
Modified node config with longer suspect, inactive, install and
wait_prim timeouts. Increased
node_1 weight to keep it in the primary component when
other nodes are voted out.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Replication of MyISAM and Aria DML is experimental and best
effort only. An earlier change made INSERT SELECT on both
MyISAM and Aria replicate using TOI and STATEMENT
replication. Replication should happen only if the user
has set the needed wsrep_mode setting.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Fixed the configuration used and added a suppression for a warning
message. Test case changes only.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Modified the test configuration file to use wsrep_sync_wait
to make sure committed transactions are replicated before the
next operation.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
New runtime type diagnostic (MDEV-34490) has detected that the classes
Item_func_eq, Item_default_value and Item_date_literal_for_invalid_dates
incorrectly return an instance of their ancestor classes when cloned.
This commit fixes that.
Additionally, it fixes a bug in Item_func_case_simple::do_build_clone()
which led to an endless loop of cloning function calls.
Reviewer: Oleksandr Byelkin <sanja@mariadb.com>
The final touch to fixing memory leaks for PS/SP. This patch turns on
by default the option WITH_PROTECT_STATEMENT_MEMROOT in order to
build server with memory leaks detection mechanism switched on.
Reason:
======
- InnoDB fails to load the instant alter table metadata from the
clustered index while loading the table definition.
The reason is that the InnoDB metadata blob has a column whose length
exceeds the maximum fixed-length column size.
Fix:
===
- InnoDB should treat long fixed-length columns as variable-length
fields that need external storage while initializing
the field map for the instant alter operation.
Before this patch the crash occurred when a single-row dataset was used and
Item::remove_eq_conds() was called for HAVING. This function is not supposed
to be called after the elimination of multiple equalities.
To fix this problem, Item::val_int() is used instead of
Item::remove_eq_conds(). In this case the optimizer tries to evaluate the
condition for the single-row dataset and discovers impossible HAVING
immediately, so the execution phase is skipped.
Approved by Igor Babaev <igor@maridb.com>
Problem:
========
- After the commit ada1074bb1 (MDEV-14398),
fil_crypt_set_encrypt_tables() iterates through all tablespaces to
fill the default_encrypt tables list. This was a trigger to
encrypt or decrypt when the key rotation age is set to 0. But import
tablespace calls fil_crypt_set_encrypt_tables() unnecessarily.
The motivation for the call is to signal the encryption threads.
Fix:
====
ha_innobase::discard_or_import_tablespace: Remove the
fil_crypt_set_encrypt_tables() call and add the imported tablespace
to the default encrypt list if necessary.
Commit a8a75ba2d causes the MariaDB server to crash, usually with signal
11, at random code locations due to invalid pointer values during any
table operation. This issue occurs when the server is built with -O3 and
other customized compiler flags.
For example, the command `use db1;` causes the server to crash in the
`check_table_access` function at line sql_parse.cc:7080 because
`tables->correspondent_table` is an invalid pointer value of 0x1.
The crashes are due to undefined behavior from using uninitialized
variables. The problematic commit a8a75ba2d introduces code that
allocates memory and sets it to 0 using thd->calloc before initializing
it with a placement new operation.
This process depends on setting memory to 0 to initialize member
variables not explicitly set in the constructor. However, the compiler
can optimize out the memset/bfill, leading to uninitialized values and
unpredictable issues.
Once a constructor function initializes an object, any uninitialized
variables within that object are subject to undefined behavior. The
state of memory before the constructor runs, whether it involves
memset or was used for other purposes, is irrelevant after the
placement new operation.
This behavior can be demonstrated with this
[test](https://gcc.godbolt.org/z/5n87z1raG) I wrote to examine the
assembly code. The code in MariaDB can be abstracted to the following,
though it has many layers wrapped around it and more complex logic,
causing slight differences in optimization in the MariaDB build.
To summarize, on x86, the memset in the following code is optimized out
with both -O2 and -O3 in GCC 13, and is only preserved in the much older
GCC 4.9.
struct S {
  int i; // uninitialized in constructor
  S() {}
};

int bar() {
  void *buf = malloc(sizeof(S));
  memset(buf, 0, sizeof(S)); // optimized out
  S *s = new (buf) S;
  return s->i;
}
With GCC 13 -O3:

bar():
        sub     rsp, 8
        mov     edi, 4
        call    malloc
        mov     eax, DWORD PTR [rax]
        add     rsp, 8
        ret

With GCC 4.9 -O3:

bar():
        sub     rsp, 8
        mov     edi, 4
        call    malloc
        mov     DWORD PTR [rax], 0
        xor     eax, eax
        add     rsp, 8
        ret
Now we ensure the constructor initializes variables correctly by running
the reset() function in the constructor to perform the memset/bfill(0)
operation. After applying the fix, the crash is gone.
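A sketch of the fixed pattern on the reduced example above: once the
zeroing is part of the constructor, it belongs to the object's
initialization and cannot be discarded as a dead store on raw malloc'ed
memory.

  #include <cstdlib>
  #include <cstring>
  #include <new>

  struct S {
    int i;
    S() { reset(); } // initialization happens inside the constructor
    void reset() { memset(static_cast<void*>(this), 0, sizeof *this); }
  };

  int bar() {
    void *buf = malloc(sizeof(S));
    S *s = new (buf) S; // s->i is now well-defined (0)
    return s->i;
  }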
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services.
Remove an assert added by fix for MDEV-34417. BNL-H join can be used with
prefix keys. This happens when there are real prefix indexes on the
equi-join columns (although it probably doesn't make a lot of sense).
Anyway, remove the assert. The code receives properly truncated key values
for hashing/comparison so it can handle them just fine.
During an InnoDB root page split, InnoDB does the following:
1) First move the root records to the new page (p1)
2) Empty the root, insert the node pointer into the root page
3) Split the new page and make it a child node
4) Find the split record and allocate another new page (p2)
for the index
5) InnoDB stores the record (ret) as the predecessor of the supremum
record of the page (p2)
6) In page_copy_rec_list_start(), move the records from p1 to p2
up to the split record
7) Since the table uses a compressed row format, InnoDB attempts to
compress the page p2 and fails (due to innodb_compression_level = 0)
8) Since the compression fails, InnoDB gets the number of preceding
records (ret_pos) of the record (ret) on the page (p2)
9) Page (p2) is a new page and ret points to the infimum record, so
ret_pos can legitimately be 0. InnoDB has the wrong condition that
ret_pos must not be 0 and returns corruption. InnoDB has a similar
wrong check in page_copy_rec_list_end()