2664 commits

Author SHA1 Message Date
Brandon Nesterenko
744580d5a7 MDEV-32892: IO Thread Reports False Error When Stopped During Connecting to Primary
The IO thread can report error code 2013 into the error log when it
is stopped during the initial connection process to the primary, as
well as when trying to read an event. However, because the IO thread
is being stopped, its connection to the primary is force-killed by
the signaling thread (see THD::awake_no_mutex()), and thereby these
connection errors should be ignored.
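
The rule described above can be sketched roughly as follows. This is a minimal illustration and not the server's code: `IoThreadState`, `stop_requested` and `should_report_connect_error` are hypothetical names, and only the error code 2013 is taken from the message.

```cpp
#include <atomic>

// Hypothetical stand-in for the IO thread's state; not a MariaDB type.
struct IoThreadState {
  std::atomic<bool> stop_requested{false};  // set by the thread issuing the stop
};

// Error code for a lost connection, as named in the commit message.
constexpr int kConnectionLost = 2013;

// Returns true when a connection error should be written to the error log.
// A lost-connection error seen while a stop is in progress is expected,
// because the stopping thread kills the connection on purpose.
inline bool should_report_connect_error(const IoThreadState &io, int err) {
  if (err == 0) return false;  // no error at all
  if (err == kConnectionLost && io.stop_requested.load()) return false;
  return true;
}
```

In this sketch only error 2013 is suppressed; any other error is still reported even while the thread is stopping.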

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
2024-07-08 10:39:17 -06:00
Julius Goryavsky
52c45332a8 MDEV-34071: Failure during the galera_3nodes_sr.GCF-336 test
This commit fixes sporadic failures in galera_3nodes_sr.GCF-336
test. The following changes have been made here:

1) A small addition to the test itself which should make
   it more deterministic by waiting for non-primary state
   before COMMIT;
2) More careful handling of the wsrep_ready variable in
   the server code (it should always be protected with mutex).
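
Point 2 can be pictured with a small sketch, assuming a hypothetical `ReadyFlag` wrapper (not a type from the server) in which every read and write of the flag takes the same mutex:

```cpp
#include <mutex>

// Hypothetical wrapper, not the server's code: the flag is only ever
// touched while holding the mutex, for reads as well as writes.
class ReadyFlag {
 public:
  void set(bool value) {
    std::lock_guard<std::mutex> guard(mutex_);
    ready_ = value;
  }
  bool get() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return ready_;
  }

 private:
  mutable std::mutex mutex_;  // mutable so that get() can lock it
  bool ready_ = false;
};
```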

No additional tests are required.
2024-05-06 03:16:59 +02:00
Sergei Golubchik
cea083af9f cleanup: use THD_STAGE_INFO, not thd_proc_info
and put master-slave.inc *last* in the series of includes
2024-05-05 21:37:07 +02:00
Kristian Nielsen
57f6a1ca98 MDEV-19415: use-after-free on charsets_dir from slave connect
The slave IO thread sets MYSQL_SET_CHARSET_DIR. The code for this option
however is not thread-safe in sql-common/client.c. The value set is
temporarily written to the mysys global variable `charsets_dir` and can be seen
by other threads running in parallel, which can result in use-after-free
error.

Problem was visible as random failures of test cases in suite multi_source
with Valgrind or MSAN.

Work around this by not setting the option for slave connect; it is redundant
anyway, as it just sets the default value.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-04-20 13:41:08 +02:00
Alexey Botchkov
85517f609a MDEV-33393 audit plugin does not report the user that did the action
The '<replication_slave>' user is assigned to the slave replication
thread so this name appears in the auditing logs.
2024-02-14 00:02:29 +04:00
Sergei Golubchik
01f6abd1d4 Merge branch '10.4' into 10.5 2024-01-31 17:32:53 +01:00
Brandon Nesterenko
c75905cacb MDEV-33327: rpl_seconds_behind_master_spike Sensitive to IO Thread Stop Position
rpl.rpl_seconds_behind_master_spike uses the DEBUG_SYNC mechanism to
count how many format descriptor events (FDEs) have been executed,
to attempt to pause on a specific relay log FDE after executing
transactions. However, depending on when the IO thread is stopped,
it can send an extra FDE before sending the transactions, forcing
the test to pause before executing any transactions. The test then
tries to read a COUNT from a table that does not exist yet.

This patch fixes this by no longer counting FDEs, but rather by
programmatically waiting until the SQL thread has executed the
transaction and then automatically activating the DEBUG_SYNC point
to trigger at the next relay log FDE.
2024-01-30 06:58:44 +01:00
Marko Mäkelä
12995559f9 Merge 10.4 into 10.5 2023-12-19 18:30:58 +02:00
Kristian Nielsen
eaa4968fc5 MDEV-10653: Fix segfault in SHOW MASTER STATUS with NULL inuse_relaylog
The previous patch for MDEV-10653 changes the rpl_parallel::workers_idle()
function to use Relay_log_info::last_inuse_relaylog to check for idle
workers. But the code was missing a NULL check. Also, there was one place
during SQL slave thread start which was missing mutex synchronisation when
updating inuse_relaylog.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-12-19 12:08:54 +01:00
Marko Mäkelä
4ae105a37d Merge 10.4 into 10.5 2023-12-18 08:59:07 +02:00
Brandon Nesterenko
8dad51481b MDEV-10653: SHOW SLAVE STATUS Can Deadlock an Errored Slave
AKA rpl.rpl_parallel, binlog_encryption.rpl_parallel fails in
buildbot with timeout in include

A replication parallel worker thread can deadlock with another
connection running SHOW SLAVE STATUS. That is, if the replication
worker thread is in do_gco_wait() and is killed, it will already
hold the LOCK_parallel_entry, and during error reporting, try to
grab the err_lock. SHOW SLAVE STATUS, however, grabs these locks in
reverse order. It will initially grab the err_lock, and then try to
grab LOCK_parallel_entry. This leads to a deadlock when both threads
have grabbed their first lock without the second.

This patch implements the MDEV-31894 proposed fix to optimize the
workers_idle() check to compare the last in-use relay log’s
queued_count==dequeued_count for idleness. This removes the need for
workers_idle() to grab LOCK_parallel_entry, as these values are
atomically updated.
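
A lock-free idleness check of this kind might look like the sketch below. `InuseRelaylog` is a simplified, hypothetical model; only the counter names `queued_count` and `dequeued_count` come from the message, and treating a null pointer as idle is an assumption of this sketch.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical, simplified model of the last in-use relay log.
struct InuseRelaylog {
  std::atomic<std::uint64_t> queued_count{0};    // events handed to workers
  std::atomic<std::uint64_t> dequeued_count{0};  // events workers have finished
};

// Idle check that takes no mutex: the workers are idle when every queued
// event has also been dequeued. Assumption: no relay log in use means idle.
inline bool workers_idle(const InuseRelaylog *last) {
  if (last == nullptr) return true;
  return last->queued_count.load() == last->dequeued_count.load();
}
```

Because the check reads two atomics and takes no lock, a caller that already holds another lock cannot deadlock on it.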

Huge thanks to Kristian Nielsen for diagnosing the problem!

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
2023-12-11 07:45:23 -07:00
Sergei Golubchik
98a39b0c91 Merge branch '10.4' into 10.5 2023-12-02 01:02:50 +01:00
Monty
1ffa8c5072 Fixed build failure on aarch64-macos
debug_sync.h was wrongly combined with replication
2023-11-28 15:31:44 +02:00
Anel Husakovic
a7d186a17d MDEV-32168: slave_error_param condition is never checked from the wait_for_slave_param.inc
- Reviewer: <knielsen@knielsen-hq.org>
            <brandon.nesterenko@mariadb.com>
            <andrei.elkin@mariadb.com>
2023-11-16 10:41:11 +01:00
Oleksandr Byelkin
6cfd2ba397 Merge branch '10.4' into 10.5 2023-11-08 12:59:00 +01:00
Brandon Nesterenko
c5f776e9fa MDEV-32265: seconds_behind_master is inaccurate for Delayed replication
If a replica is actively delaying a transaction when restarted (STOP
SLAVE/START SLAVE), when the sql thread is back up,
Seconds_Behind_Master will present as 0 until the configured
MASTER_DELAY has passed. That is, before the restart,
last_master_timestamp is updated to the timestamp of the delayed
event. Then after the restart, the negation of sql_thread_caught_up
is skipped because the timestamp of the event has already been used
for the last_master_timestamp, and their update is grouped together
in the same conditional block.

This patch fixes this by separating the negation of
sql_thread_caught_up out of the timestamp-dependent block, so it is
called any time an idle parallel slave queues an event to a worker.

Note that sql_thread_caught_up is still left in the check for internal
events, as SBM should remain idle in such case to not "magically" begin
incrementing.
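
The restructuring can be sketched as follows. `ReplicaState` and `on_event_queued` are hypothetical; the field names mirror the message. The exact timestamp comparison and the `is_internal` flag are assumptions of this sketch.

```cpp
#include <ctime>

// Hypothetical, simplified replica state; field names mirror the message.
struct ReplicaState {
  std::time_t last_master_timestamp = 0;
  bool sql_thread_caught_up = true;
};

// Called when an idle parallel replica queues an event to a worker.
// `is_internal` (assumed flag) marks internal events, which must not make
// Seconds_Behind_Master start counting.
inline void on_event_queued(ReplicaState &s, std::time_t event_ts,
                            bool is_internal) {
  if (is_internal) return;
  // Timestamp-dependent part: only move the timestamp forward.
  if (event_ts > s.last_master_timestamp) s.last_master_timestamp = event_ts;
  // No longer inside the timestamp check: the replica has work to do even
  // when the timestamp was already recorded before a restart.
  s.sql_thread_caught_up = false;
}
```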

Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
2023-10-23 14:25:03 -06:00
Yuchen Pei
cb1965bd9d Merge branch '10.4' into 10.5 2023-09-14 16:30:11 +10:00
Brandon Nesterenko
1407f99963 MDEV-31177: SHOW SLAVE STATUS Last_SQL_Errno Race Condition on Errored Slave Restart
The SQL thread and a user connection executing SHOW SLAVE STATUS
have a race condition on Last_SQL_Errno, such that a slave which
previously errored and stopped, on its next start, SHOW SLAVE STATUS
can show that the SQL Thread is running while the previous error is
also showing.

The fix is to clear the last error, when the SQL thread starts,
before setting the status of Slave_SQL_Running.

Thanks to Kristian Nielsen for his work diagnosing the problem!

Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
Kristian Nielsen <knielsen@knielsen-hq.org>
2023-09-13 12:01:47 -06:00
sjaakola
a3cbc44b24 MDEV-31833 replication breaks when using optimistic replication and replica is a galera node
The MariaDB async replication SQL thread was stopped on any failure
to apply replication events, and the error message logged for the failure
was: "Node has dropped from cluster". The assumption was that an event applying
failure is always due to the node dropping out.
With optimistic parallel replication, event applying can fail for natural
reasons, and applying should be retried to handle the failure. This retry
logic was never exercised because the slave SQL thread was stopped on the first
applying failure.

To support optimistic parallel replication retrying logic this commit will
now skip replication slave abort, if node remains in cluster (wsrep_ready==ON)
and replication is configured for optimistic or aggressive retry logic.

During the development of this fix, galera.galera_as_slave_nonprim test showed
some problems. The test was analyzed, and it appears to need some attention.
One excessive sleep command was removed in this commit, but it will need more
fixes still to be fully deterministic. After this commit galera_as_slave_nonprim
is successful, though.

Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
2023-09-12 02:37:30 +02:00
Kristian Nielsen
7c9837ce74 Merge 10.4 into 10.5
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-08-15 18:02:18 +02:00
Kristian Nielsen
900c4d6920 MDEV-31655: Parallel replication deadlock victim preference code erroneously removed
Restore code to make InnoDB choose the second transaction as a deadlock
victim if two transactions deadlock that need to commit in-order for
parallel replication. This code was erroneously removed when VATS was
implemented in InnoDB.

Also add a test case for InnoDB choosing the right deadlock victim.
Also fixes this bug, with testcase that reliably reproduces:

MDEV-28776: rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master

Note: This should be null-merged to 10.6, as a different fix is needed
there due to InnoDB locking code changes.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-08-15 16:35:30 +02:00
Marko Mäkelä
599c4d9a40 Merge 10.4 into 10.5 2023-08-15 11:10:27 +03:00
Jan Lindström
277968aa4c MDEV-31413 : Node has been dropped from the cluster on Startup / Shutdown with async replica
There were two related problems:

(1) A Galera node that is defined as a slave to an async MariaDB
master might do an SST (state transfer) at restart, and as
part of that it will copy the mysql.gtid_slave_pos table.
The problem is that updates to that table are not replicated
within a cluster. Therefore, the table is copied from a donor
that is not a slave, and the joiner loses the GTID position it
was at and starts executing events from the wrong binlog position.
This incorrect position could break replication and
cause the node to be dropped, requiring user action.

(2) The slave SQL thread might start executing events before
Galera is ready (wsrep_ready=ON), and that could also
cause the node to be dropped from the cluster.

In this fix we enable replication of the mysql.gtid_slave_pos
table within a cluster. This way all nodes in a cluster
will know the GTID slave position, and even after an SST the
joiner knows the correct GTID position to start from.

Furthermore, we wait for Galera to be ready before the slave
SQL thread executes any events, to prevent too-early
execution.

Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
2023-08-08 03:25:56 +02:00
Oleksandr Byelkin
7564be1352 Merge branch '10.4' into 10.5 2023-07-26 16:02:57 +02:00
Brandon Nesterenko
063f4ac25e MDEV-30619: Parallel Slave SQL Thread Can Update Seconds_Behind_Master with Active Workers
MDEV-31749 sporadic assert in MDEV-30619 new test

If the workers of a parallel replica are busy (potentially with long
queues), but the SQL thread has no events left to distribute (so it
goes idle), then the next event that comes from the primary will
update mi->last_master_timestamp with its timestamp, even if the
workers have not yet finished.

This patch changes the parallel replica logic which updates
last_master_timestamp after idling from using solely sql_thread_caught_up
(added in MDEV-29639) to using the latter with rli queued/dequeued
event counters.
That is, if the queued count is equal to the dequeued count, it
means all events have been processed and the replica is considered
idle when the driver thread has also distributed all events.
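
The combined condition can be written as a small predicate. `ParallelReplicaView` and `replica_is_idle` are hypothetical names used only for this sketch; the three inputs are the ones named in the message.

```cpp
#include <cstdint>

// Hypothetical snapshot of the values the message refers to.
struct ParallelReplicaView {
  bool sql_thread_caught_up;     // driver thread has distributed all events
  std::uint64_t queued_count;    // events queued to workers
  std::uint64_t dequeued_count;  // events workers have processed
};

// The replica counts as idle only when the driver thread has nothing left
// to distribute AND the workers have processed everything queued to them.
inline bool replica_is_idle(const ParallelReplicaView &v) {
  return v.sql_thread_caught_up && v.queued_count == v.dequeued_count;
}
```

With only the first condition, a replica whose workers still hold long queues would be treated as idle, which is the case the commit addresses.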

Low level details of the commit include
- to make a more generalized test for Seconds_Behind_Master on
  the parallel replica, rpl_delayed_parallel_slave_sbm.test
  is renamed to rpl_parallel_sbm.test for this purpose.
- pause_sql_thread_on_next_event usage was removed
  with the MDEV-30619 fixes. Rather than remove it, we adapt it
  to the needs of this test case
- added test case to cover SBM spike of relay log read and LMT
  update that was fixed by MDEV-29639
- rpl_seconds_behind_master_spike.test is made to use
  the negate_clock_diff_with_master debug eval.

Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
2023-07-25 16:36:14 +03:00
Oleksandr Byelkin
edf8ce5b97 Merge branch 'bb-10.4-release' into bb-10.5-release 2023-05-02 13:54:54 +02:00
Oleksandr Byelkin
edd0b03e60 Merge branch '10.3' into 10.4 2023-05-02 10:09:27 +02:00
Andrei
55a53949be MDEV-29621: Replica stopped by locks on sequence
When using binlog_row_image=FULL with sequence table inserts, a
replica can deadlock because it treats full inserts in a sequence as DDL
statements by getting an exclusive lock on the sequence table. It
has been observed that with parallel replication, this exclusive
lock on the sequence table can lead to a deadlock where one
transaction has the exclusive lock and is waiting on a prior
transaction to commit, whereas this prior transaction is waiting on
the MDL lock.

The fix for this is on the master side: raise the FL_DDL
flag on the GTID of a full binlog_row_image write of a sequence table.
This forces the slave to execute the statement serially so a deadlock
cannot happen.
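
As a rough sketch of the idea, with hypothetical types and helper names: the bit value chosen for `kFlDdl` is illustrative and should not be read as the server's definition of FL_DDL.

```cpp
#include <cstdint>

// Illustrative flag bit; the numeric value is an assumption of this sketch.
constexpr std::uint8_t kFlDdl = 0x20;

// Hypothetical, reduced GTID event carrying only a flags byte.
struct GtidSketch {
  std::uint8_t flags = 0;
};

// Primary side: mark a full row-image write to a sequence table as DDL.
inline void mark_sequence_write(GtidSketch &g, bool is_sequence_table,
                                bool full_row_image) {
  if (is_sequence_table && full_row_image) g.flags |= kFlDdl;
}

// Replica side: an event flagged as DDL is not run in parallel with others.
inline bool must_run_serially(const GtidSketch &g) {
  return (g.flags & kFlDdl) != 0;
}
```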

A test verifies the deadlock, to prove that it happens on the OLD (pre-fixes)
slave.

The OLD (buggy master) -replication-> NEW (fixed slave) case is also provided for.
As the pre-fixes master's full row-image may represent both
SELECT NEXT VALUE and INSERT, the parallel slave pessimistically
waits for the prior transaction to have committed before taking on the
critical part of the second (like INSERT in the test) event execution.
The waiting exploits a parallel slave's retry mechanism, which is
controlled by `@@global.slave_transaction_retries`.

Note that in order to avoid any persistent 'Deadlock found' 1213 error
in OLD -> NEW, `slave_transaction_retries` may need to be set to a
value higher than the default.
START SLAVE is an effective work-around if this still happens.
2023-04-27 21:55:45 +03:00
Marko Mäkelä
dfa90257f6 MDEV-30936 clang 15.0.7 -fsanitize=memory fails massively
handle_slave_io(), handle_slave_sql(), os_thread_exit():
Remove a redundant pthread_exit(nullptr) call, because it
would cause SIGSEGV.

mysql_print_status(): Add MEM_MAKE_DEFINED() to work around
some missing instrumentation around mallinfo2().

que_graph_free_stat_list(): Invoke que_node_get_next(node) before
que_graph_free_recursive(node). That is the logical and
MSAN_OPTIONS=poison_in_dtor=1 compatible way of freeing memory.

ins_node_t::~ins_node_t(): Invoke mem_heap_free(entry_sys_heap).

que_graph_free_recursive(): Rely on ins_node_t::~ins_node_t().

fts_t::~fts_t(): Invoke mem_heap_free(fts_heap).

fts_free(): Replace with direct calls to fts_t::~fts_t().

The failures in free_root() due to MSAN_OPTIONS=poison_in_dtor=1
will be covered in MDEV-30942.
2023-03-28 11:44:24 +03:00
Marko Mäkelä
c41c79650a Merge 10.4 into 10.5 2023-02-10 12:02:11 +02:00
Brandon Nesterenko
eecd4f1459 MDEV-30608: rpl.rpl_delayed_parallel_slave_sbm sometimes fails with Seconds_Behind_Master should not have used second transaction timestamp
One of the constraints added in the MDEV-29639 patch is that only
the first event after idling should update last_master_timestamp;
and as long as the replica has more events to execute, the variable
should not be updated. The corresponding test,
rpl_delayed_parallel_slave_sbm.test, aims to verify this; however,
if the IO thread takes too long to queue events, the SQL thread can
appear to catch up too fast.

This fix ensures that the relay log has been fully written before
executing the events.

Note that the underlying cause of this test failure needs to be
addressed as a bug-fix; this is a temporary fix to stop test
failures. To track work on the bug-fix for the underlying issue,
please see MDEV-30619.
2023-02-09 13:02:14 -07:00
Oleksandr Byelkin
a977054ee0 Merge branch '10.3' into 10.4 2023-01-28 18:22:55 +01:00
Oleksandr Byelkin
7fa02f5c0b Merge branch '10.4' into 10.5 2023-01-27 13:54:14 +01:00
Oleksandr Byelkin
dd24fa3063 Merge branch '10.3' into 10.4 2023-01-26 10:34:26 +01:00
Brandon Nesterenko
d69e835787 MDEV-29639: Seconds_Behind_Master is incorrect for Delayed, Parallel Replicas
Problem
========
On a parallel, delayed replica, Seconds_Behind_Master will not be
calculated until after MASTER_DELAY seconds have passed and the
event has finished executing, resulting in potentially very large
values of Seconds_Behind_Master (which could be much larger than the
MASTER_DELAY parameter) for the entire duration the event is
delayed. This contradicts the documented MASTER_DELAY behavior,
which specifies how many seconds to withhold replicated events from
execution.

Solution
========
After a parallel replica idles, the first event after idling should
immediately update last_master_timestamp with the time that it began
execution on the primary.
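
The intended behavior can be sketched as below. `SbmState`, `on_event_read` and `seconds_behind` are hypothetical names, and the sketch ignores any clock difference between primary and replica.

```cpp
#include <ctime>

// Hypothetical, simplified state for computing Seconds_Behind_Master.
struct SbmState {
  bool idle = true;
  std::time_t last_master_timestamp = 0;
};

// Only the first event after idling updates the timestamp, and it does so
// immediately, before any configured delay has elapsed.
inline void on_event_read(SbmState &s, std::time_t primary_ts) {
  if (s.idle) {
    s.last_master_timestamp = primary_ts;
    s.idle = false;
  }
}

// Lag in whole seconds, clamped at zero; an idle replica reports zero.
inline long seconds_behind(const SbmState &s, std::time_t now) {
  if (s.idle) return 0;
  long diff = static_cast<long>(now - s.last_master_timestamp);
  return diff < 0 ? 0 : diff;
}
```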

Reviewed By
===========
Andrei Elkin <andrei.elkin@mariadb.com>
2023-01-24 08:11:35 -07:00
Jan Lindström
4eb8e51c26 Merge 10.4 into 10.5 2022-11-30 13:10:52 +02:00
Julius Goryavsky
1ebf0b7372 MDEV-29817: Issues with handling options for SSL CRLs (and some others)
This patch adds the correct setting of the "--tls-version" and
"--ssl-verify-server-cert" options in the client-side utilities
such as mysqltest, mysqlcheck and mysqlslap, as well as the correct
setting of the "--ssl-crl" option when executing queries on the
slave side, and also the correct option codes in the "sslopts-logopts.h"
file (in the latter case, incorrect values are not a problem right
now, but may cause subtle test failures in the future, if the option
handling code changes).
2022-11-22 15:16:12 +01:00
Julius Goryavsky
f0820400ee MDEV-29817: Issues with handling options for SSL CRLs (and some others)
This patch adds the correct setting of the "--ssl-verify-server-cert"
option in the client-side utilities such as mysqlcheck and mysqlslap,
as well as the correct setting of the "--ssl-crl" option when executing
queries on the slave side, and also adds the correct option codes in
the "sslopts-logopts.h" file (in the latter case, incorrect values
are not a problem right now, but may cause subtle test failures in
the future, if the option handling code changes).
2022-11-22 14:07:39 +01:00
Marko Mäkelä
6286a05d80 Merge 10.4 into 10.5 2022-09-26 13:34:38 +03:00
Marko Mäkelä
3c92050d1c Fix build without either ENABLED_DEBUG_SYNC or DBUG_OFF
There are separate flags DBUG_OFF for disabling the DBUG facility
and ENABLED_DEBUG_SYNC for enabling the DEBUG_SYNC facility.
Let us allow debug builds without DEBUG_SYNC.

Note: For CMAKE_BUILD_TYPE=Debug, CMakeLists.txt will continue to
define ENABLED_DEBUG_SYNC.
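
The independence of the two flags can be illustrated with a small preprocessor sketch. The `SKETCH_*` and `k*` names are hypothetical; only `DBUG_OFF` and `ENABLED_DEBUG_SYNC` come from the message.

```cpp
// Sketch only: DBUG_OFF and ENABLED_DEBUG_SYNC are tested independently,
// so a debug build (DBUG_OFF undefined) may still lack DEBUG_SYNC.

#ifndef DBUG_OFF
constexpr bool kDbugEnabled = true;  // DBUG facility compiled in
#else
constexpr bool kDbugEnabled = false;
#endif

#ifdef ENABLED_DEBUG_SYNC
constexpr bool kDebugSyncEnabled = true;
// With DEBUG_SYNC compiled in, the sync point does some work.
#define SKETCH_DEBUG_SYNC(hits) (++(hits))
#else
constexpr bool kDebugSyncEnabled = false;
// Without it, the sync point compiles to nothing, even in a debug build.
#define SKETCH_DEBUG_SYNC(hits) ((void)0)
#endif
```

Code that uses a sync point should therefore be guarded by the DEBUG_SYNC flag rather than by the DBUG one.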
2022-09-23 17:37:52 +03:00
Marko Mäkelä
a69cf6f07e MDEV-29613 Improve WITH_DBUG_TRACE=OFF
In commit 28325b0863
a compile-time option was introduced to disable the macros
DBUG_ENTER and DBUG_RETURN or DBUG_VOID_RETURN.

The parameter name WITH_DBUG_TRACE would hint that it also
covers DBUG_PRINT statements. Let us do that: WITH_DBUG_TRACE=OFF
shall disable DBUG_PRINT() as well.

A few InnoDB recovery tests used to check that some output from
DBUG_PRINT("ib_log", ...) is present. We can live without those checks.

Reviewed by: Vladislav Vaintroub
2022-09-23 13:40:42 +03:00
Marko Mäkelä
a9d0bb12e6 Merge 10.4 into 10.5 2022-06-09 12:22:55 +03:00
Marko Mäkelä
ea1fbd0326 Merge 10.3 into 10.4 2022-06-07 15:55:32 +03:00
Marko Mäkelä
099b9202a5 MDEV-27697 fixup: Exclude debug code from non-debug builds 2022-06-03 10:47:34 +03:00
Sergei Golubchik
7970ac7fe8 Merge branch '10.4' into 10.5 2022-05-18 09:50:26 +02:00
Sergei Golubchik
23ddc3518f Merge branch '10.3' into 10.4 2022-05-18 01:25:30 +02:00
Sergei Golubchik
a0d4f0f306 Merge branch '10.2' into 10.3
commit 84984b79f2 is null-merged
2022-05-18 01:23:47 +02:00
Andrei
726bd8c968 MDEV-28550 improper handling of replication event group that contains
GTID_LIST_EVENT or INCIDENT_EVENT.

It's legal to have either of the two inside a group. E.g.,
  Gtid_event, Gtid_log_list_event, Query_1, ... Xid_log_event
is permitted.
However, the slave IO thread treated both
as the terminal event even when the group represents a DDL query.
That causes a premature Gtid state update, so the slave IO thread would think
the whole group has been collected while in fact Query_1 etc. are yet to be processed.

Fixed by correcting the condition that computes the terminal event
of the group.
Tested with rpl_mysqlbinlog_slave_consistency (of 10.9) and
rpl_gtid_errorlog.test.
2022-05-13 09:45:32 +02:00
Sergei Golubchik
ef781162ff Merge branch '10.4' into 10.5 2022-05-09 22:04:06 +02:00
Sergei Golubchik
a70a1cf3f4 Merge branch '10.3' into 10.4 2022-05-08 23:03:08 +02:00