Commit graph

1295 commits

Author SHA1 Message Date
Sergei Golubchik
f9807aadef Merge branch '10.11' into 11.0 2024-05-12 12:18:28 +02:00
Sergei Golubchik
018d537ec1 Merge branch '10.6' into 10.11 2024-04-22 15:23:10 +02:00
Sergei Golubchik
e83d92ee5e sporadic failures of rpl.rpl_semi_sync_fail_over
in the $case=2 - it's wrong to kill after the first binlog EOF,
because that might happen between INSERT(4) and INSERT(5).

So, wait for the slave to acknowledge INSERT(5) before killing
the master, that is, both connection threads must pass
repl_semisync_master.wait_after_sync()
2024-04-21 22:54:52 +02:00
Marko Mäkelä
829cb1a49c Merge 10.5 into 10.6 2024-04-17 14:14:58 +03:00
Kristian Nielsen
16aa4b5f59 Merge from 10.4 to 10.5
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-04-15 17:46:49 +02:00
Marko Mäkelä
fec2fd6add Merge 10.11 into 11.0 2024-03-28 10:51:36 +02:00
Marko Mäkelä
59e7289b6c Fix g++-14 -Wmaybe-uninitialized 2024-03-19 08:10:42 +02:00
Kristian Nielsen
86a0b57689 MDEV-32976: Un-deprecate MASTER_USE_GTID=Current_Pos
Remove incorrect deprecation.

Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2024-03-15 18:18:42 +01:00
Sergei Golubchik
87e13722a9 Merge branch '10.6' into 10.11 2024-02-01 18:36:14 +01:00
Oleksandr Byelkin
fe490f85bb Merge branch '10.11' into 11.0 2024-01-30 08:54:10 +01:00
Oleksandr Byelkin
14d930db5d Merge branch '10.6' into 10.11 2024-01-30 08:17:58 +01:00
Michael Widenius
7af50e4df4 MDEV-32551: "Read semi-sync reply magic number error" warnings on master
rpl_semi_sync_slave_enabled_consistent.test and the first part of
the commit message comes from Brandon Nesterenko.

A test to show how to induce the "Read semi-sync reply magic number
error" message on a primary. In short, if semi-sync is turned on
during the hand-shake process between a primary and replica, but
later a user negates the rpl_semi_sync_slave_enabled variable while
the replica's IO thread is running; if the io thread exits, the
replica can skip a necessary call to kill_connection() in
repl_semisync_slave.slave_stop() due to its reliance on a global
variable. Then, the replica will send a COM_QUIT packet to the
primary on an active semi-sync connection, causing the magic number
error.

The test in this patch exits the IO thread by forcing an error;
though note a call to STOP SLAVE could also do this, but it ends up
needing more synchronization. That is, the STOP SLAVE command also
tries to kill the VIO of the replica, which makes a race with the IO
thread to try and send the COM_QUIT before this happens (which would
need more debug_sync to get around). See THD::awake_no_mutex for
details as to the killing of the replica’s vio.

Notes:
- The MariaDB documentation does not make it clear that when one
  enables semi-sync replication it does not matter if one enables
  it first in the master or slave. Any order works.

Changes done:
- The rpl_semi_sync_slave_enabled variable is now a default value for
  when semisync is started. The variable does not anymore affect
  semisync if it is already running. This fixes the original reported
  bug.  Internally we now use repl_semisync_slave.get_slave_enabled()
  instead of rpl_semi_sync_slave_enabled. To check if semisync is
  active on should check the @@rpl_semi_sync_slave_status variable (as
  before).
- The semisync protocol conflicts in the way that the original
  MySQL/MariaDB client-server protocol was designed (client-server
  send and reply packets are strictly ordered and includes a packet
  number to allow one to check if a packet is lost). When using
  semi-sync the master and slave can send packets at 'any time', so
  packet numbering does not work. The 'solution' has been that each
  communication starts with packet number 1, but in some cases there
  is still a chance that the packet number check can fail.  Fixed by
  adding a flag (pkt_nr_can_be_reset) in the NET struct that one can
  use to signal that packet number checking should not be done. This
  is flag is set when semi-sync is used.
- Added Master_info::semi_sync_reply_enabled to allow one to configure
  some slaves with semisync and other other slaves without semisync.
  Removed global variable semi_sync_need_reply that would not work
  with multi-master.
- Repl_semi_sync_master::report_reply_packet() can now recognize
  the COM_QUIT packet from semisync slave and not give a
  "Read semi-sync reply magic number error" error for this case.
  The slave will be removed from the Ack listener.
- On Windows, don't stop semisync Ack listener just because one
  slave connection is using socket_id > FD_SETSIZE.
- Removed busy loop in Ack_receiver::run() by using
 "Self-pipe trick" to signal new slave and stop Ack_receiver.
- Changed some Repl_semi_sync_slave functions that always returns 0
  from int to void.
- Added Repl_semi_sync_slave::slave_reconnect().
- Removed dummy_function Repl_semi_sync_slave::reset_slave().
- Removed some duplicate semisync notes from the error log.
- Add test of "if (get_slave_enabled() && semi_sync_need_reply)"
  before calling Repl_semi_sync_slave::slave_reply().
  (Speeds up the code as we can skip all initializations).
- If epl_semisync_slave.slave_reply() fails, we disable semisync
  for that connection.
- We do not call semisync.switch_off() if there are no active slaves.
  Instead we check in Repl_semi_sync_master::commit_trx() if there are
  no active threads. This simplices the code.
- Changed assert() to DBUG_ASSERT() to ensure that the DBUG log is
  flushed in case of asserts.
- Removed the internal rpl_semi_sync_slave_status as it is not needed
  anymore. The @@rpl_semi_sync_slave_status status variable is now
  mapped to rpl_semi_sync_enabled.
- Removed rpl_semi_sync_slave_enabled  as it is not needed anymore.
  Repl_semi_sync_slave::get_slave_enabled() contains the active status.
- Added checking that we do not add a slave twice with
  Ack_receiver::add_slave(). This could happen with old code.
- Removed Repl_semi_sync_master::check_and_switch() as it is not
  needed anymore.
- Ensure that when we call Ack_receiver::remove_slave() that the slave
  is removed from the listener before function returns.
- Call listener.listen_on_sockets() outside of mutex for better
  performance and less contested mutex.
- Ensure that listening is ignoring newly added slaves when checking for
  responses.
- Fixed the master ack_receiver listener is not killed if there are no
  connected slaves (and thus stop semisync handling of future
  connections). This could happen if all slaves sockets where would be
  marked as unreliable.
- Added unlink() to base_ilist_iterator and remove() to
  I_List_iterator. This enables us to remove 'dead' slaves in
  Ack_recever::run().
- kill_zombie_dump_threads() now does killing of dump threads properly.
  - It can now kill several threads (should be impossible but could
    happen if IO slaves reconnects very fast).
  - We now wait until the dump thread is done before starting the
    dump.
- Added an error if kill_zombie_dump_threads() fails.
- Set thd->variables.server_id before calling
  kill_zombie_dump_threads(). This simplies the code.
- Added a lot of comments both in code and tests.
- Removed DBUG_EVALUATE_IF "failed_slave_start" as it is not used.

Test changes:
- rpl.rpl_session_var2 added which runs rpl.rpl_session_var test with
  semisync enabled.
- Some timings changed slight with startup of slave which caused
  rpl_binlog_dump_slave_gtid_state_info.text to fail as it checked the
  error log file before the slave had started properly. Fixed by
  adding wait_for_pattern_in_file.inc that allows waiting for the
  pattern to appear in the log file.
- Tests have been updated so that we first set
  rpl_semi_sync_master_enabled on the master and then set
  rpl_semi_sync_slave_enabled on the slaves (this is according to how
  the MariaDB documentation document how to setup semi-sync).
- Error text "Master server does not have semi-sync enabled" has been
  replaced with "Master server does not support semi-sync" for the
  case when the master supports semi-sync but semi-sync is not
  enabled.

Other things:
- Some trivial cleanups in Repl_semi_sync_master::update_sync_header().
- We should in 11.3 changed the default value for
  rpl-semi-sync-master-wait-no-slave from TRUE to FALSE as the TRUE
  does not make much sense as default. The main difference with using
  FALSE is that we do not wait for semisync Ack if there are no slave
  threads.  In the case of TRUE we wait once, which did not bring any
  notable benefits except slower startup of master configured for
  using semisync.

Co-author: Brandon Nesterenko <brandon.nesterenko@mariadb.com>

This solves the problem reported in MDEV-32960 where a new
slave may not be registered in time and the master disables
semi sync because of that.
2024-01-23 13:03:11 +02:00
Marko Mäkelä
9374772ecd Merge 10.11 into 11.0 2024-01-19 09:07:48 +02:00
Marko Mäkelä
9d20853c74 Merge 10.6 into 10.11 2024-01-18 19:22:23 +02:00
Marko Mäkelä
3a96eba25f Merge 10.5 into 10.6 2024-01-17 13:35:05 +02:00
Alexander Barkov
fa3171df08 MDEV-27666 User variable not parsed as geometry variable in geometry function
Adding GEOMETRY type user variables.
2024-01-16 18:53:23 +04:00
Sergei Golubchik
8c8bce05d2 Merge branch '10.11' into 11.0 2023-12-19 15:53:18 +01:00
Sergei Golubchik
fd0b47f9d6 Merge branch '10.6' into 10.11 2023-12-18 11:19:04 +01:00
Sergei Golubchik
e95bba9c58 Merge branch '10.5' into 10.6 2023-12-17 11:20:43 +01:00
Sergei Golubchik
98a39b0c91 Merge branch '10.4' into 10.5 2023-12-02 01:02:50 +01:00
Brandon Nesterenko
c42aadc388 MDEV-32628: Cryptic ERROR message & inconsistent behavior on incorrect SHOW BINLOG EVENTS FROM ...
Calling SHOW BINLOG EVENTS FROM <offset> with an invalid offset
writes error messages into the server log about invalid reads. The
read errors that occur from this command should only be relayed back
to the user though, and not written into the server log. This is
because they are read-only and have no impact on server operation,
and the client only need be informed to correct the parameter.

This patch fixes this by omitting binary log read errors from the
server when the invocation happens from SHOW BINLOG EVENTS.
Additionally, redundant error messages are omitted when calling the
string based read_log_event from the IO_Cache based read_log_event,
as the later already will report the error of the former.

Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
2023-11-17 09:43:56 -08:00
Kristian Nielsen
6fa69ad747 MDEV-27436: binlog corruption (/tmp no space left on device at the same moment)
This commit fixes several bugs in error handling around disk full when
writing the statement/transaction binlog caches:

1. If the error occurs during a non-transactional statement, the code
attempts to binlog the partially executed statement (as it cannot roll
back). The stmt_cache->error was still set from the disk full error. This
caused MYSQL_BIN_LOG::write_cache() to get an error while trying to read the
cache to copy it to the binlog. This was then wrongly interpreted as a disk
full error writing to the binlog file. As a result, a partial event group
containing just a GTID event (no query or commit) was binlogged. Fixed by
checking if an error is set in the statement cache, and if so binlog an
INCIDENT event instead of a corrupt event group, as for other errors.

2. For LOAD DATA LOCAL INFILE, if a disk full error occured while writing to
the statement cache, the code would attempt to abort and read-and-discard
any remaining data sent by the client. The discard code would however
continue trying to write data to the statement cache, and wrongly interpret
another disk full error as end-of-file from the client. This left the client
connection with extra data which corrupts the communication for the next
command, as well as again causing an corrupt/incomplete event to be
binlogged. Fixed by restoring the default read function before reading any
remaining data from the client connection.

Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-10-31 11:48:00 +01:00
Sergei Petrunia
c7fe8e51de Merge 10.11 into 11.0 2023-04-17 16:50:01 +03:00
Marko Mäkelä
a009280e60 Merge 10.9 into 10.10 2023-04-14 12:24:14 +03:00
Marko Mäkelä
5bada1246d Merge 10.5 into 10.6 2023-04-11 16:15:19 +03:00
Oleksandr Byelkin
ac5a534a4c Merge remote-tracking branch '10.4' into 10.5 2023-03-31 21:32:41 +02:00
Anel Husakovic
c596ad734d MDEV-30269: Remove rpl_semi_sync_[slave,master] usage in code
- Description:
  - Before 10.3.8 semisync was a plugin that is built into the server with
    MDEV-13073,starting with commit cbc71485e2.
    There are still some usage of `rpl_semi_sync_master` in mtr.
Note:
  - To recognize the replica in the `dump_thread`, replica is creating
    local variable `rpl_semi_sync_slave` (the keyword of plugin) in
    function `request_transmit`, that is catched by primary in
    `is_semi_sync_slave()`. This is the user variable and as such not
    related to the obsolete plugin.

 - Found in `sys_vars.all_vars` and `rpl_semi_sync_wait_point` tests,
   usage of plugins `rpl_semi_sync_master`, `rpl_semi_sync_slave`.
   The former test is disabled by default (`sys_vars/disabled.def`)
   and marked as `obsolete`, however this patch will remove the queries.

- Add cosmetic fixes to semisync codebase

Reviewer: <brandon.nesterenko@mariadb.com>
Closes PR #2528, PR #2380
2023-03-23 13:39:46 +01:00
Sergei Golubchik
760d149067 MDEV-30128 remove support for 5.1- replication events
including patches from Andrei Elkin
2023-02-05 22:02:30 +01:00
Oleksandr Byelkin
bb76dcbec7 Merge branch '10.9' into 10.10 2022-10-04 13:32:38 +02:00
Sergei Golubchik
900d7bf360 Merge branch '10.5' into 10.6 2022-10-02 22:14:21 +02:00
Sergei Golubchik
3a2116241b Merge branch '10.4' into 10.5 2022-10-02 14:38:13 +02:00
Sergei Golubchik
d4f6d2f08f Merge branch '10.3' into 10.4 2022-10-01 23:07:26 +02:00
Mikhail Chalov
9de9f105b5
Use memory safe snprintf() in Connect Engine and elsewhere (#2210)
Continue with similar changes as done in 19af1890 to replace sprintf(buf, ...)
with snprintf(buf, sizeof(buf), ...), specifically in the "easy" cases where buf
is allocated with a size known at compile time.

All new code of the whole pull request, including one or several files that are
either new files or modified ones, are contributed under the BSD-new license.  I
am contributing on behalf of my employer Amazon Web Services, Inc.
2022-09-28 15:45:25 +01:00
Marko Mäkelä
829e8111c7 Merge 10.5 into 10.6 2022-09-26 14:34:43 +03:00
Marko Mäkelä
6286a05d80 Merge 10.4 into 10.5 2022-09-26 13:34:38 +03:00
Marko Mäkelä
3c92050d1c Fix build without either ENABLED_DEBUG_SYNC or DBUG_OFF
There are separate flags DBUG_OFF for disabling the DBUG facility
and ENABLED_DEBUG_SYNC for enabling the DEBUG_SYNC facility.
Let us allow debug builds without DEBUG_SYNC.

Note: For CMAKE_BUILD_TYPE=Debug, CMakeLists.txt will continue to
define ENABLED_DEBUG_SYNC.
2022-09-23 17:37:52 +03:00
Brandon Nesterenko
90c3b2835d MDEV-20122: Deprecate MASTER_USE_GTID=Current_Pos to favor new MASTER_DEMOTE_TO_SLAVE option
New Feature:
========
This feature adds a safe replacement to the
MASTER_USE_GTID=Current_Pos option for CHANGE MASTER TO as
MASTER_DEMOTE_TO_SLAVE=<bool>. The use case of Current_Pos is to
transition a master to become a slave; however, can break
replication state if the slave executes local transactions due to
actively updating gtid_current_pos with gtid_binlog_pos and
gtid_slave_pos.

MASTER_DEMOTE_TO_SLAVE changes this use case by forcing users to set
Using_Gtid=Slave_Pos and merging gtid_binlog_pos into gtid_slave_pos
once at CHANGE MASTER TO time. Note that if gtid_slave_pos is more
recent than gtid_binlog_pos (as in the case of chain replication),
the replication state should be preserved.

Additionally, deprecate the `Current_Pos` option of MASTER_USE_GTID
to suggest the safe alternative option MASTER_DEMOTE_TO_SLAVE=TRUE.

Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
2022-07-26 16:35:24 -06:00
Brandon Nesterenko
5ab5ff08b0 MDEV-19801: Change defaults for CHANGE MASTER TO so that GTID-based replication is used by default if master supports it
This commit makes replicas crash-safe by default by changing the
Using_Gtid value to be Slave_Pos on a fresh slave start and after
RESET SLAVE is issued. If the primary server does not support GTIDs
(i.e., version < 10), the replica will fall back to Using_Gtid=No on
slave start and after RESET SLAVE.

The following additional informational messages/warnings are added:

 1. When Using_Gtid is automatically changed. That is, if RESET
SLAVE reverts Using_Gtid back to Slave_Pos, or Using_Gtid is
inferred to No from a CHANGE MASTER TO given with log coordinates
without MASTER_USE_GTID.
 2. If options are ignored in CHANGE MASTER TO. If CHANGE MASTER TO
is given with log coordinates, yet also specifies
MASTER_USE_GTID=Slave_Pos, a warning message is given that the log
coordinate options are ignored.

Additionally, an MTR macro has been added for RESET SLAVE,
reset_slave.inc, which provides modes/options for resetting a slave
in log coordinate or gtid modes. When in log coordinates mode, the
macro will execute CHANGE MASTER TO MASTER_USE_GTID=No after the
RESET SLAVE command. When in GTID mode, an extra parameter,
reset_slave_keep_gtid_state, can be set to reset or preserve the
value of gtid_slave_pos.

Reviewed By:
===========
Andrei Elkin <andrei.elkin@mariadb.com>
2022-07-26 13:31:27 -06:00
sjaakola
ef2dbb8dbc MDEV-23328 Server hang due to Galera lock conflict resolution
Mutex order violation when wsrep bf thread kills a conflicting trx,
the stack is

          wsrep_thd_LOCK()
          wsrep_kill_victim()
          lock_rec_other_has_conflicting()
          lock_clust_rec_read_check_and_lock()
          row_search_mvcc()
          ha_innobase::index_read()
          ha_innobase::rnd_pos()
          handler::ha_rnd_pos()
          handler::rnd_pos_by_record()
          handler::ha_rnd_pos_by_record()
          Rows_log_event::find_row()
          Update_rows_log_event::do_exec_row()
          Rows_log_event::do_apply_event()
          Log_event::apply_event()
          wsrep_apply_events()

and mutexes are taken in the order

          lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data

When a normal KILL statement is executed, the stack is

          innobase_kill_query()
          kill_handlerton()
          plugin_foreach_with_mask()
          ha_kill_query()
          THD::awake()
          kill_one_thread()

        and mutexes are

          victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex

This patch is the plan D variant for fixing potetial mutex locking
order exercised by BF aborting and KILL command execution.

In this approach, KILL command is replicated as TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting to happen in parallel with KILL command execution
either. Potential mutex deadlocks between the different mutex
access paths with KILL command execution and BF aborting cannot
therefore happen.

TOI replication is used, in this approach,  purely as means
to provide isolated KILL command execution in the first node.
KILL command should not (and must not) be applied in secondary
nodes. In this patch, we make this sure by skipping KILL
execution in secondary nodes, in applying phase, where we
bail out if applier thread is trying to execute KILL command.
This is effective, but skipping the applying of KILL command
could happen much earlier as well.

This also fixed unprotected calls to wsrep_thd_abort
that will use wsrep_abort_transaction. This is fixed
by holding THD::LOCK_thd_data while we abort transaction.

Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
2021-10-29 20:40:35 +02:00
Jan Lindström
d5bc05798f MDEV-25114: Crash: WSREP: invalid state ROLLED_BACK (FATAL)
Revert "MDEV-23328 Server hang due to Galera lock conflict resolution"

This reverts commit eac8341df4.
2021-10-29 20:38:11 +02:00
sjaakola
5c230b21bf MDEV-23328 Server hang due to Galera lock conflict resolution
Mutex order violation when wsrep bf thread kills a conflicting trx,
the stack is

          wsrep_thd_LOCK()
          wsrep_kill_victim()
          lock_rec_other_has_conflicting()
          lock_clust_rec_read_check_and_lock()
          row_search_mvcc()
          ha_innobase::index_read()
          ha_innobase::rnd_pos()
          handler::ha_rnd_pos()
          handler::rnd_pos_by_record()
          handler::ha_rnd_pos_by_record()
          Rows_log_event::find_row()
          Update_rows_log_event::do_exec_row()
          Rows_log_event::do_apply_event()
          Log_event::apply_event()
          wsrep_apply_events()

and mutexes are taken in the order

          lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data

When a normal KILL statement is executed, the stack is

          innobase_kill_query()
          kill_handlerton()
          plugin_foreach_with_mask()
          ha_kill_query()
          THD::awake()
          kill_one_thread()

        and mutexes are

          victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex

This patch is the plan D variant for fixing potetial mutex locking
order exercised by BF aborting and KILL command execution.

In this approach, KILL command is replicated as TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting to happen in parallel with KILL command execution
either. Potential mutex deadlocks between the different mutex
access paths with KILL command execution and BF aborting cannot
therefore happen.

TOI replication is used, in this approach,  purely as means
to provide isolated KILL command execution in the first node.
KILL command should not (and must not) be applied in secondary
nodes. In this patch, we make this sure by skipping KILL
execution in secondary nodes, in applying phase, where we
bail out if applier thread is trying to execute KILL command.
This is effective, but skipping the applying of KILL command
could happen much earlier as well.

This also fixed unprotected calls to wsrep_thd_abort
that will use wsrep_abort_transaction. This is fixed
by holding THD::LOCK_thd_data while we abort transaction.

Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
2021-10-29 09:52:52 +03:00
Jan Lindström
aa7ca987db MDEV-25114: Crash: WSREP: invalid state ROLLED_BACK (FATAL)
Revert "MDEV-23328 Server hang due to Galera lock conflict resolution"

This reverts commit eac8341df4.
2021-10-29 09:52:40 +03:00
Marko Mäkelä
73f5cbd0b6 Merge 10.5 into 10.6 2021-10-21 16:06:34 +03:00
Marko Mäkelä
5f8561a6bc Merge 10.4 into 10.5 2021-10-21 15:26:25 +03:00
Marko Mäkelä
489ef007be Merge 10.3 into 10.4 2021-10-21 14:57:00 +03:00
Marko Mäkelä
e4a7c15dd6 Merge 10.2 into 10.3 2021-10-21 13:41:04 +03:00
Brandon Nesterenko
2291f8ef73 MDEV-25284: Assertion `info->type == READ_CACHE || info->type == WRITE_CACHE' failed
Problem:
========
This patch addresses two issues.

First, if a CHANGE MASTER command is issued and an error happens
while locating the replica’s relay logs, the logs can be put into an
invalid state where future updates fail and future CHANGE MASTER
calls crash the server. More specifically, right before a replica
purges the relay logs (part of the `CHANGE MASTER TO` logic), the
relay log is temporarily closed with state LOG_TO_BE_OPENED. If the
server errors in-between the temporary log closure and purge, i.e.
during the function find_log_pos, the log should be closed.
MDEV-25284 reveals the log is not properly closed.

Second, upon issuing a RESET SLAVE ALL command, a slave’s GTID
filters are not cleared (DO_DOMAIN_IDS, IGNORE_DOMIAN_IDS,
IGNORE_SERVER_IDS). MySQL had a similar bug report, Bug #18816897,
which fixed this issue to clear IGNORE_SERVER_IDS after issuing
RESET SLAVE ALL in version 5.7.

Solution:
=========

To fix the first problem, the CHANGE MASTER error handling logic was
extended to transition the relay log state to LOG_CLOSED from
LOG_TO_BE_OPENED.

To fix the second problem, the RESET SLAVE ALL logic is extended to
clear the domain_id filter and ignore_server_ids.

Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
2021-10-18 10:43:51 -06:00
Marko Mäkelä
f3fcf5f45c Merge 10.5 to 10.6 2021-08-19 12:25:00 +03:00
Sujatha
475f69b985 MDEV-25958: rpl_semi_sync_fail_over.test fails in buildbot
Analysis:
========
In case multi binlog truncation scenario debug sync points are in the
following order.

Two inserts are done on master as shown below.

INSERT INTO t1 VALUES (4, REPEAT("x", 4100)
commit_after_release_LOCK_after_binlog_sync

INSERT INTO t1 VALUES (5, REPEAT("x", 4100)
commit_after_release_LOCK_log

First insert debug sync ensures that transaction is synced to binlog and
not committed but it reached slave through semi sync.

Second insert debug sync ensures that transaction is synced to binlog and
not committed. It doesn't ensure that 'INSERT 5' reached slave.

Most of the times INSERT 5 reaches slave, hence when it is promoted as
master it sends 4,5 to slave. But occasionally 5 may not reach slave in
those cases post recovery master will have only 4. When row 6 is inserted
Master has 4-6 and Slave has 4,5,6.

This results in test failure.

Fix:
===
For the first insert use 'commit_before_get_LOCK_commit_ordered' debug sync
point, it will ensure that binlog was sent to slave and slave has
acknowledged the receipt. Now enable debug code such that when the next
transaction is written to binary log, dump thread will read and send it
across the network and notify the server to be get killed. Insert row 5
and wait for notification from dump thread. Kill the server. This ensures
that both 4 and 5 have reached the semi-sync slave.

Added a new test case:
Insert two rows on master such that first is present in master's binlog and
reached semi sync slave. Second insert should be flushed to binlog but not
sent to slave. Now crash and fail over to slave. The promoted master will send
the extra transaction to slave.
2021-08-19 11:59:39 +05:30
Marko Mäkelä
4a25957274 Merge 10.4 into 10.5 2021-08-18 18:22:35 +03:00