Commit graph

4074 commits

Author SHA1 Message Date
Sergei Golubchik
9508a44c37 enforce no trailing \n in Diagnostic_area messages
that is in my_error(), push_warning(), etc
2025-01-07 16:31:39 +01:00
Andrei Elkin
bc6121819c MDEV-35098 rpl.rpl_mysqldump_gtid_slave_pos fails in buildbot
The test turns out to be senstive to @@global.gtid_cleanup_batch_size.
With a rather small default value of the latter
SELECTing from mysql.gtid_slave_pos may not be deterministic: tests
that run before may increase a pending for automitic deletion batch.

The test is refined to set its own value for the batch size which
is virtually unreachable.

Thanks to Kristian Nielsen for the analysis.
2024-12-16 19:43:41 +02:00
Kristian Nielsen
b4fde50b1f MDEV-5798: Wrong errorcode for missing partition after TRUNCATE PARTITION
The partitioning error handling code was looking at
thd->lex->alter_info.partition_flags in non-alter-table cases, in which cases
the value is stale and contains whatever was set by any earlier ALTER TABLE.
This could cause the wrong error code to be generated, which then in some cases
can cause replication to break with "different errorcode" error.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-12-05 08:17:35 +01:00
Sergei Golubchik
5ebda30ccc Revert "MDEV-35019 Provide a way to enable "rollback XA on disconnect" behavior we had before 10.5.2"
This reverts commit 8ae462a220.
2024-10-16 13:23:47 +02:00
Kristian Nielsen
8ae462a220 MDEV-35019 Provide a way to enable "rollback XA on disconnect" behavior we had before 10.5.2
Implement variable legacy_xa_rollback_at_disconnect to support
backwards compatibility for applications that rely on the pre-10.5
behavior for connection disconnect, which is to rollback the
transaction (in violation of the XA specification).

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-10-16 10:18:36 +02:00
Lena Startseva
0a5e4a0191 MDEV-31005: Make working cursor-protocol
Updated tests: cases with bugs or which cannot be run
with the cursor-protocol were excluded with
"--disable_cursor_protocol"/"--enable_cursor_protocol"

Fix for v.10.5
2024-09-18 18:39:26 +07:00
Brandon Nesterenko
68938d2b42 MDEV-33500 (part 2): rpl.rpl_parallel_sbm can still fail
The failing test case validates Seconds_Behind_Master for a delayed
slave, while STOP SLAVE is executed during a delay. The test fixes
initially added to the test (commit b04c857596) added a table lock
to ensure a transaction could not finish before validating the
Seconds_Behind_Master field after SLAVE START, but did not address a
possibility that the transaction could finish before running the
STOP SLAVE command, which invalidates the validations for the rest
of the test case. Specifically, this would result in 1) a timeout in
“Waiting for table metadata lock” on the replica, which expects the
transaction to retry after slave restart and hit a lock conflict on
the locked tables (added in b04c857596), and 2) that
Seconds_Behind_Master should have increased, but did not.

The failure can be reproduced by synchronizing the slave to the master
before the MDEV-32265 echo statement (i.e. before the SLAVE STOP).

This patch fixes the test by adding a mechanism to use DEBUG_SYNC to
synchronize a MASTER_DELAY, rather than continually increase the
duration of the delay each time the test fails on buildbot. This is
to ensure that on slow machines, a delay does not pass before the
test gets a chance to validate results. Additionally, it decreases
overall test time because the test can continue immediately after
validation, thereby bypassing the remainder of a full delay for each
transaction.
2024-09-17 06:29:20 -06:00
Kristian Nielsen
8642453ce6 Fix sporadic failure of test case rpl.rpl_start_stop_slave
The test was expecting the I/O thread to be in a specific state, but thread
scheduling may cause it to not yet have reached that state. So just have a
loop that waits for the expected state to occur.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-08-26 14:39:24 +02:00
Kristian Nielsen
214e6c5b3d Fix sporadic failure of test case rpl.rpl_old_master
Remove the test for MDEV-14528. This is supposed to test that parallel
replication from pre-10.0 master will update Seconds_Behind_Master. But
after MDEV-12179 the SQL thread is blocked from even beginning to fetch
events from the relay log due to FLUSH TABLES WITH READ LOCK, so the test
case is no longer testing what is was intended to. And pre-10.0 versions are
long since out of support, so does not seem worthwhile to try to rewrite the
test to work another way.

The root cause of the test failure is MDEV-34778. Briefly, depending on
exact timing during slave stop, the rli->sql_thread_caught_up flag may end
up with different value. If it ends up as "true", this causes
Seconds_Behind_Master to be 0 during next slave start; and this caused test
case timeout as the test was waiting for Seconds_Behind_Master to become
non-zero.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-08-26 14:39:24 +02:00
Kristian Nielsen
7dc4ea5649 Fix sporadic test failure in rpl.rpl_create_drop_event
Depending on timing, an extra event run could start just when the event
scheduler is shut down and delay running until after the table has been
dropped; this would cause the test to fail with a "table does not exist"
error in the log.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-08-26 14:39:24 +02:00
Kristian Nielsen
33854d7324 Restore skiping rpl.rpl_mdev6020 under Valgrind
(Revert a change done by mistake when XtraDB was removed.)

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-08-26 14:39:24 +02:00
Brandon Nesterenko
001608de7e MDEV-15393: Fix rpl_mysqldump_gtid_slave_pos
The slave would try to sync_with_master_gtid.inc,
but the master never actually saved its gtid position
so the test would move on too quickly.
2024-07-31 14:17:46 -06:00
Andrei
b8f92ade57 MDEV-15393 gtid_slave_pos duplicate key errors after mysqldump restore
When mysqldump is run to dump the `mysql` system database, it generates
INSERT statements into the table `mysql.gtid_slave_pos`.
After running the backup script
those inserts did not produce the expected gtid state on slave. In
particular the maximum of mysql.gtid_slave_pos.sub_id did not make
into
   rpl_global_gtid_slave_state.last_sub_id

an in-memory object that is supposed to match the current state of the
table. And that was regardless of whether --gtid option was specified
or not. Later when the backup recipient server starts as slave
in *non-gtid* mode this desychronization may lead to a duplicate key
error.

This effect is corrected for --gtid mode mysqldump/mariadb-dump only
as the following.  The fixes ensure the insert block of the dump
script is followed with a "summing-up" SET @global.gtid_slave_pos
assignment.

For the implemenation part, note a deferred print-out of
SET-gtid_slave_pos and associated comments is prefered over relocating
of the entire blocks if (opt_master,slave_data &&
do_show_master,slave_status) ...  because of compatiblity
concern. Namely an error inside do_show_*() is handled in the new code
the same way, as early as, as before.

A regression test can be run in how-to-reproduce mode as well.
One affected mtr test observed.
rpl_mysqldump_slave.result "mismatch" shows now the new deferring print
of SET-gtid_slave_pos policy in action.
2024-07-19 21:44:12 +03:00
Brandon Nesterenko
a061ae1079 MDEV-33921: Fix rpl_xa_empty_transaction.test
The test was missing a save_master_gtid.inc on the master,
leading to the slave thinking it was in sync after executing
sync_with_master_gtid.inc, despite not having executed the
latest transaction. This skipped transaction, XA COMMIT,
was supposed to error-to-be-ignored because its XID could not
be found, but be thrown out because the replication filters
would filter out the target database. However, if the slave
was able to stop before executing the transaction, then
the replication filer is reset (to empty), and when the
slave is later restarted, that transactions error would
no longer be ignored.

Additionally, as the test cases added in MDEV-33921 rely
on GTID synchronization, the test cases now force
master_use_gtid=slave_pos for consistency
2024-07-17 16:38:26 -06:00
Sergei Golubchik
d60f5c11ea MDEV-34318 mariadb-dump SQL syntax error with MAX_STATEMENT_TIME against Percona MySQL server
protect MariaDB conditional comments from a bug
in Percona MySQL comment parser
2024-07-17 21:25:40 +02:00
Daniel Black
e8bcc4e455 MDEV-34568 rpl.rpl_mdev12179 - correct for Windows
Simplify in an attempt to avoid:

mysqltest: At line 275: File already exist: on the write_file
lines.

Using write_line as that's what a lot of other tests
do for writing small bits to a expect file.

Review thanks Valdislav Vaintroub
2024-07-12 12:55:28 +02:00
Brandon Nesterenko
ea9869504d MDEV-33921: Replication breaks when filtering two-phase XA transactions
There are two problems.

First, replication fails when XA transactions are used where the
slave has replicate_do_db set and the client has touched a different
database when running DML such as inserts. This is because XA
commands are not treated as keywords, and are thereby not exempt
from the replication filter. The effect of this is that during an XA
transaction, if its logged “use db” from the master is filtered out
by the replication filter, then XA END will be ignored, yet its
corresponding XA PREPARE will be executed in an invalid state,
thereby breaking replication.

Second, if the slave replicates an XA transaction which results in
an empty transaction, the XA START through XA PREPARE first phase of
the transaction won’t be binlogged, yet the XA COMMIT will be
binlogged. This will break replication in chain configurations.

The first problem is fixed by treating XA commands in
Query_log_event as keywords, thus allowing them to bypass the
replication filter. Note that Query_log_event::is_trans_keyword() is
changed to accept a new parameter to define its mode, to either
check for XA commands or regular transaction commands, but not both.
In addition, mysqlbinlog is adapted to use this mode so its
--database filter does not remove XA commands from its output.

The second problem fixed by overwriting the XA state in the XID
cache to be XA_ROLLBACK_ONLY, so at commit time, the server knows to
rollback the transaction and skip its binlogging. If the xid cache
is cleared before an XA transaction receives its completion command
(e.g. on server shutdown), then before reporting ER_XAER_NOTA when
the completion command is executed, the filter is first checked if
the database is ignored, and if so, the error is ignored.

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
2024-07-10 14:37:39 -06:00
Brandon Nesterenko
744580d5a7 MDEV-32892: IO Thread Reports False Error When Stopped During Connecting to Primary
The IO thread can report error code 2013 into the error log when it
is stopped during the initial connection process to the primary, as
well as when trying to read an event. However, because the IO thread
is being stopped, its connection to the primary is force-killed by
the signaling thread (see THD::awake_no_mutex()), and thereby these
connection errors should be ignored.

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
2024-07-08 10:39:17 -06:00
Brandon Nesterenko
cbc1898e82 MDEV-25607: Auto-generated DELETE from HEAP table can break replication
The special logic used by the memory storage engine
to keep slaves in sync with the master on a restart can
break replication. In particular, after a restart, the
master writes DELETE statements in the binlog for
each MEMORY-based table so the slave can empty its
data. If the DELETE is not executable, e.g. due to
invalid triggers, the slave will error and fail, whereas
the master will never see the problem.

Instead of DELETE statements, use TRUNCATE to
keep slaves in-sync with the master, thereby bypassing
triggers.

Reviewed By:
===========
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
2024-07-05 12:00:09 -06:00
Brandon Nesterenko
6cab2f75fe MDEV-23857: replication master password length
After MDEV-4013, the maximum length of replication passwords was extended to
96 ASCII characters. After a restart, however, slaves only read the first 41
characters of MASTER_PASSWORD from the master.info file. This lead to slaves
unable to reconnect to the master after a restart.

After a slave restart, if a master.info file is detected, use the full
allowable length of the password rather than 41 characters.

Reviewed By:
============
Sergei Golubchik <serg@mariadb.com>
2024-06-18 07:21:18 -06:00
Sergei Golubchik
7a789e2027 sporadic failures of rpl.rpl_parallel_sbm
the test waits for the event to get stuck on MASTER_DELAY,
but on a slow/overloaded slave the event might pass MASTER_DELAY
before the test starts waiting.

Wait for the event to get stuck on the LOCK TABLES (after MASTER_DELAY),
the event cannot avoid that,
2024-05-05 21:37:07 +02:00
Sergei Golubchik
cea083af9f cleanup: use THD_STAGE_INFO, not thd_proc_info
and put master-slave.inc *last* in the series of includes
2024-05-05 21:37:07 +02:00
Kristian Nielsen
553a4d6271 MDEV-33602: Sporadic test failure in rpl.rpl_gtid_stop_start
The test could fail with a duplicate key error because switching to non-GTID
mode could start at the wrong old-style position. The position could be
wrong when the previous GTID connect was stopped before receiving the fake
GTID list event which gives the old-style position corresponding to the GTID
connected position.

Work-around by injecting an extra event and syncing the slave before
switching to non-GTID mode.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-04-25 11:00:45 +02:00
Kristian Nielsen
0c249ad718 MDEV-30232: rpl.rpl_gtid_crash fails sporadically in BB
The root cause of the failure is a bug in the Linux network stack:

  https://lore.kernel.org/netdev/87sf0ldk41.fsf@urd.knielsen-hq.org/T/#u

If the slave does a connect(2) at the exact same time that kill -9 of the
master process closes the listening socket, the FIN or RST packet is lost in
the kernel, and the slave ends up timing out waiting for the initial
communication from the server. This timeout defaults to
--slave-net-timeout=120, which causes include/master_gtid_wait.inc to time
out first and fail the test.

Work-around this problem by reducing the --slave-net-timeout for this test
case. If this problem turns up in other tests, we can consider reducing the
default value for all tests.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-04-20 13:41:08 +02:00
Brandon Nesterenko
0ad52e4d6a MDEV-27512: Assertion !thd->transaction_rollback_request failed in rows_event_stmt_cleanup
If replicating an event in ROW format, and InnoDB detects a deadlock
while searching for a row, the row event will error and rollback in
InnoDB and indicate that the binlog cache also needs to be cleared,
i.e. by marking thd->transaction_rollback_request. In the normal
case, this will trigger an error in Rows_log_event::do_apply_event()
and cause a rollback. During the Rows_log_event::do_apply_event()
cleanup of a successful event application, there is a DBUG_ASSERT in
log_event_server.cc::rows_event_stmt_cleanup(), which sets the
expectation that thd->transaction_rollback_request cannot be set
because the general rollback (i.e. not the InnoDB rollback) should
have happened already. However, if the replica is configured to skip
deadlock errors, the rows event logic will clear the error and
continue on, as if no error happened. This results in
thd->transaction_rollback_request being set while in
rows_event_stmt_cleanup(), thereby triggering the assertion.

This patch fixes this in the following ways:
 1) The assertion is invalid, and thereby removed.
 2) The rollback case is forced in rows_event_stmt_cleanup() if
transaction_rollback_request is set.

Note the differing behavior between transactions which are skipped
due to deadlock errors and other errors. When a transaction is
skipped due to an ignored deadlock error, the entire transaction is
rolled back and skipped (though note MDEV-33930 which allows
statements in the same transaction after the deadlock-inducing one
to commit). When a transaction is skipped due to ignoring a
different error, only the erroring statements are rolled-back and
skipped - the rest of the transaction will execute as normal. The
effect of this can be seen in the test results. The added test case
to rpl_skip_error.test shows that only statements which are ignored
due to non-deadlock errors are ignored in larger transactions. A
diff between rpl_temporary_error2_skip_all.result and
rpl_temporary_error2.result shows that all statements in the errored
transaction are rolled back (diff pasted below):

: diff rpl_temporary_error2.result rpl_temporary_error2_skip_all.result
49c49
< 2	1
---
> 2	NULL
51c51
< 4	1
---
> 4	NULL
53c53
< * There will be two rows in t2 due to the retry.
---
> * There will be one row in t2 because the ignored deadlock does not retry.
57d56
< 1
59c58
< 1
---
> 0

Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
2024-04-17 11:14:21 -06:00
Vladislav Vaintroub
061adae9a2 MDEV-16944 Fix file sharing issues on Windows in mysqltest
On Windows systems, occurrences of ERROR_SHARING_VIOLATION due to
conflicting share modes between processes accessing the same file can
result in CreateFile failures.

mysys' my_open() already incorporates a workaround by implementing
wait/retry logic on Windows.

But this does not help if files are opened using shell redirection like
mysqltest traditionally did it, i.e via

--echo exec "some text" > output_file

In such cases, it is cmd.exe, that opens the output_file, and it
won't do any sharing-violation retries.

This commit addresses the issue by introducing a new built-in command,
'write_line', in mysqltest. This new command serves as a brief alternative
to 'write_file', with a single line output, that also resolves variables
like "exec" would.

Internally, this command will use my_open(), and therefore retry-on-error
logic.

Hopefully this will eliminate the very sporadic "can't open file because
it is used by another process" error on CI.
2024-04-17 16:52:37 +02:00
Daniel Black
ea810b04cb MDEV-30676 rpl.parallel_backup* tests sometimes fail
Raise innodb_lock_wait_timeout from 1 to 5
2024-04-15 15:45:03 +10:00
Brandon Nesterenko
a6aecbb036 MDEV-10684: rpl.rpl_domain_id_filter_restart fails in buildbot
The test failure in rpl.rpl_domain_id_filter_restart is caused by
MDEV-33887. That is, the test uses master_pos_wait() (called
indirectly by sync_slave_with_master) to try and wait for the
replica to catch up to the master. However, the waited on
transaction is ignored by the configured
  CHANGE MASTER TO IGNORE_DOMAIN_IDS=()
As MDEV-33887 reports, due to the IO thread updating the binlog
coordinates and the SQL thread updating the GTID state, if the
replica is stopped in-between these updates, the replica state will
be inconsistent. That is, the test expects that the GTID state will
be updated, so upon restart, the replica will be up-to-date.
However, if the replica is stopped before the SQL thread updates its
GTID state, then upon restart, the replica will fetch the previously
ignored event, which is no longer ignored upon restart, and execute
it. This leads to the sporadic extra row in t2.

This patch changes master_pos_wait() to use master_gtid_wait() to
ensure the replica state is consistent with the master state.
2024-04-11 09:49:20 -06:00
Sergei Golubchik
2d2172a5cf sporadic failures of rpl.rpl_semi_sync_master_shutdown
increase the MASTER_CONNECT_RETRY time under valgrind,
otherwise the slave gives up retrying before the master is ready

also, cosmetic cleanup of rpl_semi_sync_master_shutdown.test
2024-04-10 19:38:39 +02:00
Andrei
0da1653f1b MDEV-31779 Server crash in Rows_log_event::update_sequence upon replaying binary log
The crash at running mysqlbinlog on a SEQUENCE containing binlog file
was caused MDEV-29621 fixes that did not check which of the slave
or binlog applier executes a block introduced there.

The block is meaningful only for the parallel slave applier, so
it's safe to fix this bug with identified the actual applier and
skipping the block when it's the mysqlbinlog one.
2024-04-10 19:31:39 +03:00
Brandon Nesterenko
952ab9a596 MDEV-30260: Slave crashed:reload_acl_and_cache during shutdown
The signal handler thread can use various different runtime
resources when processing a SIGHUP (e.g. master-info information)
due to calling into reload_acl_and_cache(). Currently, the shutdown
process waits for the termination of the signal thread after
performing cleanup. However, this could cause resources actively
used by the signal handler to be freed while reload_acl_and_cache()
is processing.

The specific resource that caused MDEV-30260 is a race condition for
the hostname_cache, such that mysqld would delete it in
clean_up()::hostname_cache_free(), before the signal handler would
use it in reload_acl_and_cache()::hostname_cache_refresh().

Another similar resource is the active_mi/master_info_index. There
was a race between its deletion by the main thread in end_slave(),
and their usage by the Signal Handler as a part of
Master_info_index::flush_all_relay_logs.read(active_mi) in
reload_acl_and_cache().

This patch fixes these race conditions by relocating where server
shutdown waits for the signal handler to die until after
server-level threads have been killed (i.e., as a last step of
close_connections()). With respect to the hostname_cache, active_mi
and master_info_cache, this ensures that they cannot be destroyed
while the signal handler is still active, and potentially using
them.

Additionally:

 1) This requires that Events memory is still in place for SIGHUP
handling's mysql_print_status(). So event deinitialization is moved
into clean_up(), but the event scheduler still needs to be stopped
in close_connections() at the same spot.

 2) The function kill_server_thread is no longer used, so it is
deleted

 3) The timeout to wait for the death of the signal thread was not
consistent with the comment. The comment mentioned up to 10 seconds,
whereas it was actually 0.01s. The code has been fixed to wait up to
10 seconds.

 4) A warning has been added if the signal handler thread fails to
exit in time.

 5) Added pthread_join() to end of wait_for_signal_thread_to_end()
if it hadn't ended in 10s with a warning. Note this also removes
the pthread_detached attribute from the signal_thread to allow
for the pthread_join().

Reviewed By:
===========
Vladislav Vaintroub <wlad@mariadb.com>
Andrei Elkin <andrei.elkin@mariadb.com>
2024-04-09 14:25:13 -06:00
Kristian Nielsen
fb774eb1eb Fix occasional test failure of rpl.rpl_parallel_stop_slave
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-03-15 15:04:09 +01:00
Marko Mäkelä
f703e72bd8 Merge 10.4 into 10.5 2024-03-11 10:08:20 +02:00
Brandon Nesterenko
b04c857596 MDEV-33500: rpl.rpl_parallel_sbm can fail on slow machines, e.g. MSAN/Valgrind builders
In an addition to test rpl.rpl_parallel_sbm added by MDEV-32265, the
test uses sleep statements alone to test Seconds_Behind_Master with
delayed replication. On slow running machines, the test can pass the
intended MASTER_DELAY duration and Seconds_Behind_Master can become
0, when the test expects the transaction to still be actively in a
delaying state.

This can be consistently reproduced by adding a sleep statement
before the call to

--let = query_get_value(SHOW SLAVE STATUS, Seconds_Behind_Master, 1)

to sleep past the delay end point.

This patch fixes this by locking the table which the delayed
transaction targets so Second_Behind_Master cannot be updated before
the test reads it for validation.
2024-02-20 08:19:18 -07:00
Kristian Nielsen
c73c6aea63 MDEV-33426: Aria temptables wrong thread-specific memory accounting in slave thread
Aria temporary tables account allocated memory as specific to the current
THD. But this fails for slave threads, where the temporary tables need to be
detached from any specific THD.

Introduce a new flag to mark temporary tables in replication as "global",
and use that inside Aria to not account memory allocations as thread
specific for such tables.

Based on original suggestion by Monty.

Reviewed-by: Monty <monty@mariadb.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-02-16 12:48:30 +01:00
Alexey Botchkov
85517f609a MDEV-33393 audit plugin do not report user did the action..
The '<replication_slave>' user is assigned to the slave replication
thread so this name appears in the auditing logs.
2024-02-14 00:02:29 +04:00
Brandon Nesterenko
03d1346e7f MDEV-29369: rpl.rpl_semi_sync_shutdown_await_ack fails regularly with Result content mismatch
This test was prone to failures for a few reasons, summarized below:

 1) MDEV-32168 introduced “only_running_threads=1” to
slave_stop.inc, which allowed the stop logic to bypass an
attempting-to-reconnect IO thread. That is, the IO thread could
realize the master shutdown in `read_event()`, and thereby call into
`try_to_reconnect()`. This would leave the IO thread up when the
test expected it to be stopped. Fixed by explicitly stopping the
IO thread and allowing an error state, as the above case would
lead to errno 2003.

 2) On slow systems (or those running profiling tools, e.g. MSAN),
the waiting-for-ack transaction can complete before the system
processes the `SHUTDOWN WAIT FOR ALL SLAVES`. There was shutdown
preparation logic in-between the transaction and shutdown itself,
which contributes to this problem. This patch also moves this
preparation logic before the transaction, so there is less to do
in-between the calls.

 3) Changed work-around for MDEV-28141 to use debug_sync instead
of sleep delay, as it was still possible to hit the bug on very
slow systems.

 4) Masked MTR variable reset with disable/enable query log

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
2024-02-12 05:48:18 -07:00
Brandon Nesterenko
ee89558312 MDEV-14357: rpl.rpl_domain_id_filter_io_crash failed in buildbot with wrong result
A race condition with the SQL thread, where depending on if it was
killed before or after it had executed the fake/generated IGN_GTIDS
Gtid_list_log_event, may or may not update gtid_slave_pos with the
position of the ignored events. Then, the slave would be restarted
while resetting IGNORE_DOMAIN_IDS to be empty, which would result in
the slave requesting different starting locations, depending on
whether or not gtid_slave_pos was updated. And, because previously
ignored events could now be requested and executed (no longer
ignored), their presence would fail the test.

This patch fixes this in two ways. First, to use GTID positions for
synchronization rather than binlog file positions. Then second, to
synchronize the SQL thread’s gtid_slave_pos with the ignored events
before killing the SQL thread.

To consistently reproduce the test failure, the following patch can
be applied:

diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
index f51f5b7deec..de62233acff 100644
--- a/sql/log_event_server.cc
+++ b/sql/log_event_server.cc
@@ -3686,6 +3686,12 @@ Gtid_list_log_event::do_apply_event(rpl_group_info *rgi)
     void *hton= NULL;
     uint32 i;

+    sleep(1);
+    if (rli->sql_driver_thd->killed || rli->abort_slave)
+    {
+      return 0;
+    }
+

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
2024-02-12 05:48:18 -07:00
Marko Mäkelä
8ec12e0d6d Merge 10.4 into 10.5 2024-02-12 11:38:13 +02:00
Sergei Golubchik
01f6abd1d4 Merge branch '10.4' into 10.5 2024-01-31 17:32:53 +01:00
Monty
57ffcd686f MDEV-21472: ALTER TABLE ... ANALYZE PARTITION ... with EITS reads and locks all rows
This was fixed in 10.2 in 2020 but merging the code to 10.3 caused the
bug to come back.
2024-01-30 09:19:01 +02:00
Brandon Nesterenko
c75905cacb MDEV-33327: rpl_seconds_behind_master_spike Sensitive to IO Thread Stop Position
rpl.rpl_seconds_behind_master_spike uses the DEBUG_SYNC mechanism to
count how many format descriptor events (FDEs) have been executed,
to attempt to pause on a specific relay log FDE after executing
transactions. However, depending on when the IO thread is stopped,
it can send an extra FDE before sending the transactions, forcing
the test to pause before executing any transactions, resulting in a
table not existing, that is attempted to be read for COUNT.

This patch fixes this by no longer counting FDEs, but rather by
programmatically waiting until the SQL thread has executed the
transaction and then automatically activating the DEBUG_SYNC point
to trigger at the next relay log FDE.
2024-01-30 06:58:44 +01:00
Brandon Nesterenko
e4f221a5f2 MDEV-33327: rpl_seconds_behind_master_spike Sensitive to IO Thread Stop Position
rpl.rpl_seconds_behind_master_spike uses the DEBUG_SYNC mechanism to
count how many format descriptor events (FDEs) have been executed,
to attempt to pause on a specific relay log FDE after executing
transactions. However, depending on when the IO thread is stopped,
it can send an extra FDE before sending the transactions, forcing
the test to pause before executing any transactions, resulting in a
table not existing, that is attempted to be read for COUNT.

This patch fixes this by no longer counting FDEs, but rather by
programmatically waiting until the SQL thread has executed the
transaction and then automatically activating the DEBUG_SYNC point
to trigger at the next relay log FDE.
2024-01-29 15:17:57 -07:00
Brandon Nesterenko
112eb14f7e MDEV-27850: rpl_seconds_behind_master_spike debug_sync fix
A debug_sync signal could remain for the SQL thread that should have begun
a wait_for upon seeing a GTID event, but would instead see the old signal
and continue on without waiting. This broke an "idle" condition in
SHOW SLAVE STATUS
which should have automatically negated Seconds_Behind_Master. Instead,
because the SQL thread had already processed the GTID event, it set
sql_thread_caught_up to false, and thereby calculated the value of
Seconds_behind_master, when the test expected 0.

This patch fixes this by resetting the debug_sync state before creating a
new transaction which sends a GTID event to the replica
2024-01-26 11:43:34 -07:00
Brandon Nesterenko
01ca57ec16 MDEV-32168: Postpush fix for rpl_domain_id_filter_master_crash
While a replica may be reading events from the
primary, the primary is killed. Left to its own
devices, the IO thread may or may not stop in
error, depending on what it is doing when its
connection to the primary is killed (e.g. a
failed read results in an error, whereas if the
IO thread is idly waiting for events when the
connection dies, it will enter into a reconnect
loop and reconnect). MDEV-32168 changed the test
to always wait for the reconnect, thus breaking
the error case, as the IO thread would be stopped
at a time of expecting it to be running.

The fix is to manually stop/start the IO thread
to ensure it is in a consistent state.

Note that rpl_domain_id_filter_master_crash.test
will need additional changes after fixing MDEV-33268

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
2024-01-22 07:30:52 -07:00
Alexander Barkov
fa3171df08 MDEV-27666 User variable not parsed as geometry variable in geometry function
Adding GEOMETRY type user variables.
2024-01-16 18:53:23 +04:00
Marko Mäkelä
12995559f9 Merge 10.4 into 10.5 2023-12-19 18:30:58 +02:00
Kristian Nielsen
a204ce2788 MDEV-33045: Server crashes in Item_func_binlog_gtid_pos::val_str / Binary_string::c_ptr_safe
Item::val_str() sets the Item::null_value flag, so call it before checking
the flag, not after.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-12-19 12:08:53 +01:00
Marko Mäkelä
4ae105a37d Merge 10.4 into 10.5 2023-12-18 08:59:07 +02:00
Brandon Nesterenko
8dad51481b MDEV-10653: SHOW SLAVE STATUS Can Deadlock an Errored Slave
AKA rpl.rpl_parallel, binlog_encryption.rpl_parallel fails in
buildbot with timeout in include

A replication parallel worker thread can deadlock with another
connection running SHOW SLAVE STATUS. That is, if the replication
worker thread is in do_gco_wait() and is killed, it will already
hold the LOCK_parallel_entry, and during error reporting, try to
grab the err_lock. SHOW SLAVE STATUS, however, grabs these locks in
reverse order. It will initially grab the err_lock, and then try to
grab LOCK_parallel_entry. This leads to a deadlock when both threads
have grabbed their first lock without the second.

This patch implements the MDEV-31894 proposed fix to optimize the
workers_idle() check to compare the last in-use relay log’s
queued_count==dequeued_count for idleness. This removes the need for
workers_idle() to grab LOCK_parallel_entry, as these values are
atomically updated.

Huge thanks to Kristian Nielsen for diagnosing the problem!

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
2023-12-11 07:45:23 -07:00