mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-17 04:22:27 +01:00

Author	SHA1	Message	Date
Brandon Nesterenko	1407f99963	MDEV-31177: SHOW SLAVE STATUS Last_SQL_Errno Race Condition on Errored Slave Restart The SQL thread and a user connection executing SHOW SLAVE STATUS have a race condition on Last_SQL_Errno: when a slave that previously errored and stopped is next started, SHOW SLAVE STATUS can show that the SQL thread is running while the previous error is still showing. The fix is to clear the last error, when the SQL thread starts, before setting the status of Slave_SQL_Running. Thanks to Kristian Nielsen for his work diagnosing the problem! Reviewed By: ============ Andrei Elkin <andrei.elkin@mariadb.com> Kristian Nielsen <knielsen@knielsen-hq.org>	2023-09-13 12:01:47 -06:00
Kristian Nielsen	900c4d6920	MDEV-31655: Parallel replication deadlock victim preference code erroneously removed Restore code to make InnoDB choose the second transaction as a deadlock victim if two transactions deadlock that need to commit in-order for parallel replication. This code was erroneously removed when VATS was implemented in InnoDB. Also add a test case for InnoDB choosing the right deadlock victim. Also fixes this bug, with testcase that reliably reproduces: MDEV-28776: rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master Note: This should be null-merged to 10.6, as a different fix is needed there due to InnoDB locking code changes. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-15 16:35:30 +02:00
Jan Lindström	277968aa4c	MDEV-31413 : Node has been dropped from the cluster on Startup / Shutdown with async replica There were two related problems: (1) A Galera node that is defined as a slave to an async MariaDB master might do an SST (state transfer) at restart and, as part of that, copy the mysql.gtid_slave_pos table. The problem is that updates to that table are not replicated in a cluster. Therefore, the table from a donor that is not a slave is copied, and the joiner loses the GTID position it was at and starts executing events from the wrong position of the binlog. This incorrect position could break replication and cause the node to be dropped, requiring user action. (2) The slave SQL thread might start executing events before Galera is ready (wsrep_ready=ON), and that could also cause the node to be dropped from the cluster. In this fix we enable replication of the mysql.gtid_slave_pos table in a cluster. This way all nodes in a cluster know the GTID slave position, and even after SST the joiner knows the correct GTID position to start from. Furthermore, we wait for Galera to be ready before the slave SQL thread executes any events, to prevent too-early execution. Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>	2023-08-08 03:25:56 +02:00
Brandon Nesterenko	063f4ac25e	MDEV-30619: Parallel Slave SQL Thread Can Update Seconds_Behind_Master with Active Workers MDEV-31749 sporadic assert in MDEV-30619 new test If the workers of a parallel replica are busy (potentially with long queues), but the SQL thread has no events left to distribute (so it goes idle), then the next event that comes from the primary will update mi->last_master_timestamp with its timestamp, even if the workers have not yet finished. This patch changes the parallel replica logic which updates last_master_timestamp after idling from using solely sql_thread_caught_up (added in MDEV-29639) to using the latter with rli queued/dequeued event counters. That is, if the queued count is equal to the dequeued count, it means all events have been processed and the replica is considered idle when the driver thread has also distributed all events. Low level details of the commit include - to make a more generalized test for Seconds_Behind_Master on the parallel replica, rpl_delayed_parallel_slave_sbm.test is renamed to rpl_parallel_sbm.test for this purpose. - pause_sql_thread_on_next_event usage was removed with the MDEV-30619 fixes. Rather than remove it, we adapt it to the needs of this test case - added test case to cover SBM spike of relay log read and LMT update that was fixed by MDEV-29639 - rpl_seconds_behind_master_spike.test is made to use the negate_clock_diff_with_master debug eval. Reviewed By: ============ Andrei Elkin <andrei.elkin@mariadb.com>	2023-07-25 16:36:14 +03:00
Oleksandr Byelkin	edd0b03e60	Merge branch '10.3' into 10.4	2023-05-02 10:09:27 +02:00
Andrei	55a53949be	MDEV-29621: Replica stopped by locks on sequence When using binlog_row_image=FULL with sequence table inserts, a replica can deadlock because it treats full inserts in a sequence as DDL statements by getting an exclusive lock on the sequence table. It has been observed that with parallel replication, this exclusive lock on the sequence table can lead to a deadlock where one transaction has the exclusive lock and is waiting on a prior transaction to commit, whereas this prior transaction is waiting on the MDL lock. This fix for this is on the master side, to raise FL_DDL flag on the GTID of a full binlog_row_image write of a sequence table. This forces the slave to execute the statement serially so a deadlock cannot happen. A test verifies the deadlock also to prove it happen on the OLD (pre-fixes) slave. OLD (buggy master) -replication-> NEW (fixed slave) is provided. As the pre-fixes master's full row-image may represent both SELECT NEXT VALUE and INSERT, the parallel slave pessimistically waits for the prior transaction to have committed before to take on the critical part of the second (like INSERT in the test) event execution. The waiting exploits a parallel slave's retry mechanism which is controlled by `@@global.slave_transaction_retries`. Note that in order to avoid any persistent 'Deadlock found' 2013 error in OLD -> NEW, `slave_transaction_retries` may need to be set to a higher than the default value. START-SLAVE is an effective work-around if this still happens.	2023-04-27 21:55:45 +03:00
Brandon Nesterenko	eecd4f1459	MDEV-30608: rpl.rpl_delayed_parallel_slave_sbm sometimes fails with Seconds_Behind_Master should not have used second transaction timestamp One of the constraints added in the MDEV-29639 patch, is that only the first event after idling should update last_master_timestamp; and as long as the replica has more events to execute, the variable should not be updated. The corresponding test, rpl_delayed_parallel_slave_sbm.test, aims to verify this; however, if the IO thread takes too long to queue events, the SQL thread can appear to catch up too fast. This fix ensures that the relay log has been fully written before executing the events. Note that the underlying cause of this test failure needs to be addressed as a bug-fix, this is a temporary fix to stop test failures. To track work on the bug-fix for the underlying issue, please see MDEV-30619.	2023-02-09 13:02:14 -07:00
Oleksandr Byelkin	a977054ee0	Merge branch '10.3' into 10.4	2023-01-28 18:22:55 +01:00
Brandon Nesterenko	d69e835787	MDEV-29639: Seconds_Behind_Master is incorrect for Delayed, Parallel Replicas Problem ======== On a parallel, delayed replica, Seconds_Behind_Master will not be calculated until after MASTER_DELAY seconds have passed and the event has finished executing, resulting in potentially very large values of Seconds_Behind_Master (which could be much larger than the MASTER_DELAY parameter) for the entire duration the event is delayed. This contradicts the documented MASTER_DELAY behavior, which specifies how many seconds to withhold replicated events from execution. Solution ======== After a parallel replica idles, the first event after idling should immediately update last_master_timestamp with the time that it began execution on the primary. Reviewed By =========== Andrei Elkin <andrei.elkin@mariadb.com>	2023-01-24 08:11:35 -07:00
Julius Goryavsky	1ebf0b7372	MDEV-29817: Issues with handling options for SSL CRLs (and some others) This patch adds the correct setting of the "--tls-version" and "--ssl-verify-server-cert" options in the client-side utilities such as mysqltest, mysqlcheck and mysqlslap, as well as the correct setting of the "--ssl-crl" option when executing queries on the slave side, and also the correct option codes in the "sslopts-logopts.h" file (in the latter case, incorrect values are not a problem right now, but may cause subtle test failures in the future, if the option handling code changes).	2022-11-22 15:16:12 +01:00
Julius Goryavsky	f0820400ee	MDEV-29817: Issues with handling options for SSL CRLs (and some others) This patch adds the correct setting of the "--ssl-verify-server-cert" option in the client-side utilities such as mysqlcheck and mysqlslap, as well as the correct setting of the "--ssl-crl" option when executing queries on the slave side, and also add the correct option codes in the "sslopts-logopts.h" file (in the latter case, incorrect values are not a problem right now, but may cause subtle test failures in the future, if the option handling code changes).	2022-11-22 14:07:39 +01:00
Marko Mäkelä	3c92050d1c	Fix build without either ENABLED_DEBUG_SYNC or DBUG_OFF There are separate flags DBUG_OFF for disabling the DBUG facility and ENABLED_DEBUG_SYNC for enabling the DEBUG_SYNC facility. Let us allow debug builds without DEBUG_SYNC. Note: For CMAKE_BUILD_TYPE=Debug, CMakeLists.txt will continue to define ENABLED_DEBUG_SYNC.	2022-09-23 17:37:52 +03:00
Marko Mäkelä	a69cf6f07e	MDEV-29613 Improve WITH_DBUG_TRACE=OFF In commit `28325b0863` a compile-time option was introduced to disable the macros DBUG_ENTER and DBUG_RETURN or DBUG_VOID_RETURN. The parameter name WITH_DBUG_TRACE would hint that it also covers DBUG_PRINT statements. Let us do that: WITH_DBUG_TRACE=OFF shall disable DBUG_PRINT() as well. A few InnoDB recovery tests used to check that some output from DBUG_PRINT("ib_log", ...) is present. We can live without those checks. Reviewed by: Vladislav Vaintroub	2022-09-23 13:40:42 +03:00
Marko Mäkelä	ea1fbd0326	Merge 10.3 into 10.4	2022-06-07 15:55:32 +03:00
Marko Mäkelä	099b9202a5	MDEV-27697 fixup: Exclude debug code from non-debug builds	2022-06-03 10:47:34 +03:00
Sergei Golubchik	23ddc3518f	Merge branch '10.3' into 10.4	2022-05-18 01:25:30 +02:00
Sergei Golubchik	a0d4f0f306	Merge branch '10.2' into 10.3 commit `84984b79f2` is null-merged	2022-05-18 01:23:47 +02:00
Andrei	726bd8c968	MDEV-28550 improper handling of replication event group that contains GTID_LIST_EVENT or INCIDENT_EVENT. It's legal to have either of the two inside a group. E.g Gtid_event, Gtid_log_list_event, Query_1, ... Xid_log_event is permitted. However, the slave IO thread treated both as the terminal even when the group represents a DDL query. That causes a premature Gtid state update so the slave IO would think the whole group has been collected while in fact Query_1 etc are yet to process. Fixed with correcting a condition to compute the terminal event of the group. Tested with rpl_mysqlbinlog_slave_consistency (of 10.9) and rpl_gtid_errorlog.test.	2022-05-13 09:45:32 +02:00
Sergei Golubchik	a70a1cf3f4	Merge branch '10.3' into 10.4	2022-05-08 23:03:08 +02:00
Oleksandr Byelkin	9614fde1aa	Merge branch '10.2' into 10.3	2022-05-03 10:59:54 +02:00
Sergei Golubchik	1430cf7873	MDEV-28428 Master_SSL_Crl shows Master_SSL_CA value in SHOW SLAVE STATUS output it was showing ca and capath instead of crl and crl_path	2022-04-28 13:21:04 +02:00
Andrei	388032e990	MDEV-27697. Removed a false assert.	2022-04-26 19:47:59 +03:00
Andrei	945245aea4	MDEV-27697. Two affected tests fixed. A result file is updated in one case and former error simulation got refined.	2022-04-26 17:05:40 +03:00
Andrei	1bcdc3e9eb	MDEV-27697 slave must recognize incomplete replication event group In cases of a faulty master or an incorrect binlog event producer, that slave is working with, sends an incomplete group of events slave must react with an error to not to log into the relay-log any new events that do not belong to the incomplete group. Fixed with extending received event properties check when slave connects to master in gtid mode. Specifically for the event that can be a part of a group its relay-logging is permitted only when its position within the group is validated. Otherwise slave IO thread stops with ER_SLAVE_RELAY_LOG_WRITE_FAILURE.	2022-04-25 16:00:35 +03:00
Brandon Nesterenko	a83c7ab1ea	MDEV-11853: semisync thread can be killed after sync binlog but before ACK in the sync state Problem: ======== If a primary is shutdown during an active semi-sync connection during the period when the primary is awaiting an ACK, the primary hard kills the active communication thread and does not ensure the transaction was received by a replica. This can lead to an inconsistent replication state. Solution: ======== During shutdown, the primary should wait for an ACK or timeout before hard killing a thread which is awaiting a communication. We extend the `SHUTDOWN WAIT FOR SLAVES` logic to identify and ignore any threads waiting for a semi-sync ACK in phase 1. Then, before stopping the ack receiver thread, the shutdown is delayed until all waiting semi-sync connections receive an ACK or time out. The connections are then killed in phase 2. Notes: 1) There remains an unresolved corner case that affects this patch. MDEV-28141: Slave crashes with Packets out of order when connecting to a shutting down master. Specifically, If a slave is connecting to a master which is actively shutting down, the slave can crash with a "Packets out of order" assertion error. To get around this issue in the MTR tests, the primary will wait a small amount of time before phase 1 killing threads to let the replicas safely stop (if applicable). 2) This patch also fixes MDEV-28114: Semi-sync Master ACK Receiver Thread Can Error on COM_QUIT Reviewed By ============ Andrei Elkin <andrei.elkin@mariadb.com>	2022-04-22 12:59:54 -06:00
Oleksandr Byelkin	a576a1cea5	Merge branch '10.3' into 10.4	2022-01-30 09:46:52 +01:00
Oleksandr Byelkin	41a163ac5c	Merge branch '10.2' into 10.3	2022-01-29 15:41:05 +01:00
Brandon Nesterenko	96de6bfd5e	MDEV-16091: Seconds_Behind_Master spikes to millions of seconds Problem: ======== A slave’s relay log format description event is used when calculating Seconds_Behind_Master (SBM). This forces the SBM value to spike when processing these events, as their creation date is set to the timestamp that the IO thread begins. Solution: ======== When the slave generates a format description event, mark the event as a relay log event so it does not update the rli->last_master_timestamp variable. Reviewed By: ============ Andrei Elkin <andrei.elkin@mariadb.com>	2022-01-04 11:21:33 -07:00
sjaakola	5c230b21bf	MDEV-23328 Server hang due to Galera lock conflict resolution Mutex order violation when wsrep bf thread kills a conflicting trx, the stack is wsrep_thd_LOCK() wsrep_kill_victim() lock_rec_other_has_conflicting() lock_clust_rec_read_check_and_lock() row_search_mvcc() ha_innobase::index_read() ha_innobase::rnd_pos() handler::ha_rnd_pos() handler::rnd_pos_by_record() handler::ha_rnd_pos_by_record() Rows_log_event::find_row() Update_rows_log_event::do_exec_row() Rows_log_event::do_apply_event() Log_event::apply_event() wsrep_apply_events() and mutexes are taken in the order lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data When a normal KILL statement is executed, the stack is innobase_kill_query() kill_handlerton() plugin_foreach_with_mask() ha_kill_query() THD::awake() kill_one_thread() and mutexes are victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex This patch is the plan D variant for fixing the potential mutex locking order violation exercised by BF aborting and KILL command execution. In this approach, the KILL command is replicated as a TOI operation. This guarantees total isolation for the KILL command execution in the first node: there is no concurrent replication applying and no concurrent DDL executing. Therefore there is no risk of BF aborting happening in parallel with KILL command execution either. Potential mutex deadlocks between the different mutex access paths with KILL command execution and BF aborting therefore cannot happen. TOI replication is used, in this approach, purely as a means to provide isolated KILL command execution in the first node. The KILL command should not (and must not) be applied in secondary nodes. In this patch, we ensure this by skipping KILL execution in secondary nodes, in the applying phase, where we bail out if the applier thread is trying to execute a KILL command. This is effective, but skipping the applying of the KILL command could happen much earlier as well. 
This also fixed unprotected calls to wsrep_thd_abort that will use wsrep_abort_transaction. This is fixed by holding THD::LOCK_thd_data while we abort transaction. Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>	2021-10-29 09:52:52 +03:00
Jan Lindström	aa7ca987db	MDEV-25114: Crash: WSREP: invalid state ROLLED_BACK (FATAL) Revert "MDEV-23328 Server hang due to Galera lock conflict resolution" This reverts commit `eac8341df4`.	2021-10-29 09:52:40 +03:00
Julius Goryavsky	7948a1dc53	MDEV-26914: Unreleased mutex in the exec_relay_log_event() function In the replication-related code, in the exec_relay_log_event() (slave.cc) function, where the "data_lock" mutex is captured, this mutex is then not released on one of the early return branches within a specific insert for WSREP, namely under the branch: "if (wsrep_before_statement(thd))". As a result, the mutex remains captured, resulting in errors or hangs. This commit fixes this issue, which is now showing up as intermittent failures in mtr tests for galera and galera_sr suites.	2021-10-28 03:17:12 +02:00
Marko Mäkelä	ca3f497564	Merge 10.2 into 10.3, except MDEV-25682	2021-05-18 08:40:19 +03:00
Sachin Kumar	355dc74b76	MDEV-22370 safe_mutex: Trying to lock uninitialized mutex at /data/src/10.4-bug/sql/rpl_parallel.cc, line 470 upon shutdown during FTWRL Problem:- When we issue FTWRL with shutdown in parallel, there is race between FTWRL and shutdown. Shutdown might destroy the mutex (pool->LOCK_rpl_thread_pool) before FTWRL can lock it. So we can get crash on FTWRL thread Solution:- mysql_mutex_destroy(pool->LOCK_rpl_thread_pool) should wait for FTWRL thread to complete its work , and then destroy. So slave_prepare_for_shutdown will just deactivate the pool, and mutex is destroyed later in end_slave()	2021-05-14 11:49:46 +01:00
Andrei Elkin	3616640a31	MDEV-20821 parallel slave server shutdown hang Parallel slave server shutdown found to be hanging in close_connections() triggered by shutdown due to a slave worker thread would not be notified to exit in case the worker was sitting idle. Fixed with destroying the worker pool earlier that is in slave_prepare_for_shutdown() when all their driver threads have already left. A test file is added to simulate the bug condition as well as check multi-sourced and not-idle worker cases.	2021-05-14 11:49:26 +01:00
Sergei Golubchik	eac8341df4	MDEV-23328 Server hang due to Galera lock conflict resolution adaptation of `29bbcac0ee` for 10.4	2021-02-12 18:17:06 +01:00
Sergei Golubchik	9703cffa8c	don't take mutexes conditionally	2021-02-12 18:14:20 +01:00
Sergei Golubchik	00a313ecf3	Merge branch 'bb-10.3-release' into bb-10.4-release Note, the fix for "MDEV-23328 Server hang due to Galera lock conflict resolution" was null-merged. 10.4 version of the fix is coming up separately	2021-02-12 17:44:22 +01:00
Sergei Golubchik	60ea09eae6	Merge branch '10.2' into 10.3	2021-02-01 13:49:33 +01:00
Sergei Golubchik	6a1cb449fe	cleanup: remove slave background thread, use handle_manager thread instead	2021-01-24 11:35:55 +01:00
Daniel Black	29d9897fe2	MDEV-10272: add master host/port info to slave thread exit messages Sample log error message generated: 2021-01-21 2:33:24 139912137520896 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 369 33:24 139912137520896 [Note] master was 127.0.0.1:16400 2021-01-21 2:33:24 139912137828096 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 369 2021-01-21 2:33:24 139912137828096 [Note] master was 127.0.0.1:16400 Based on work by Hartmut Holzgraefe. Reviewer: knielsen@knielsen-hq.org, Andrei, Sachin	2021-01-22 10:06:33 +11:00
Sujatha	b2029c0300	Merge branch '10.3' into 10.4	2020-11-12 15:39:02 +05:30
Sujatha	bafb011a82	Merge branch '10.2' into 10.3	2020-11-12 14:10:05 +05:30
Sujatha	984a06db2c	MDEV-4633: multi_source.simple test fails sporadically Analysis: ======== Writes to 'rli->log_space_total' needs to be synchronized, otherwise both SQL_THREAD and IO_THREAD can try to modify the variable simultaneously resulting in incorrect rli->log_space_total. In the current test scenario SQL_THREAD is trying to decrement 'rli->log_space_total' in 'purge_first_log' and IO_THREAD is trying to increment the 'rli->log_space_total' in 'queue_event' simultaneously. Hence test occasionally fails with result mismatch. Fix: === Convert 'rli->log_space_total' variable to atomic type.	2020-11-12 13:04:39 +05:30
Marko Mäkelä	9216114ce7	Merge 10.3 into 10.4	2020-07-31 18:09:08 +03:00
Marko Mäkelä	66ec3a770f	Merge 10.2 into 10.3	2020-07-31 13:51:28 +03:00
Sujatha	b3dd95e035	MDEV-14203: rpl.rpl_extra_col_master_myisam, rpl.rpl_slave_load_tmpdir_not_exist failed in buildbot with a warning Problem: ======= rpl.rpl_slave_load_tmpdir_not_exist 'stmt' w3 [ fail ] Found warnings/errors in server log file! Test ended at 2017-09-27 20:34:55 [Warning] Master is configured to log replication events with checksum, but will not send such events to slaves that cannot process them ^ Found warnings in /mnt/buildbot/build/mariadb-10.2.10/mysql-test/var/3/log/mysqld.1.err ok Analysis: ======== When the slave tries to connect to the master, the 'get_master_version_and_clock' function is invoked to perform an elaborate slave-master handshake. During this process the slave server queries the master server to learn whether it is checksum aware, and at the same time the master is notified about the slave's CRC-awareness. The master's side instant value of @@global.binlog_checksum is stored in the dump thread's uservar area as well as cached locally to become known in consensus by master and slave. Post handshake, the slave requests a binlog dump from the master. It sends 'COM_BINLOG_DUMP'. This command is sent to the master by a 'cli_advanced_command' call. If there is some temporary network failure during this request_dump call, 'end_server' is invoked to close the current connection between master and slave. Upon connection close the dump thread on the master gets terminated and it clears the 'uservar' data it got through the master-slave handshake. The 'COM_BINLOG_DUMP' command is sent once again without a master-slave handshake. Since the checksum data is not available to the new dump thread, a warning gets reported. Fix: === Upon a network write error, do not attempt to reconnect; proceed to the master-slave handshake. This ensures that the master is aware of the slave's capability to use checksums.	2020-07-23 12:54:40 +05:30
Sachin	592a10d079	MDEV-22370 safe_mutex: Trying to lock uninitialized mutex at /data/src/10.4-bug/sql/rpl_parallel.cc, line 470 upon shutdown during FTWRL Problem:- When we issue FTWRL with shutdown in parallel, there is race between FTWRL and shutdown. Shutdown might destroy the mutex (pool->LOCK_rpl_thread_pool) before FTWRL can lock it. So we can get crash on FTWRL thread Solution:- mysql_mutex_destroy(pool->LOCK_rpl_thread_pool) should wait for FTWRL thread to complete its work , and then destroy. So slave_prepare_for_shutdown will just deactivate the pool, and mutex is destroyed later in end_slave()	2020-06-17 02:22:46 +05:30
Marko Mäkelä	6da14d7b4a	Merge 10.3 into 10.4	2020-05-30 11:04:27 +03:00
Marko Mäkelä	dad7a8ee7d	Merge 10.2 into 10.3	2020-05-27 17:10:39 +03:00
Andrei Elkin	0c1f97b3ab	MDEV-15152 Optimistic parallel slave doesn't cope well with START SLAVE UNTIL The immediate bug was caused by a failure to recognize the correct position at which to stop the slave applier run in optimistic parallel mode. The analysis unveiled the following set of issues: 1 incorrect estimate for the event binlog position passed to is_until_satisfied 2 the driver thread's wait for workers to complete did not account for non-group events that could be left unprocessed, and thus mixed up the last executed binlog group's file and position: the file remained old while the position related to the new rotated file 3 incorrect 'slave reached file:pos' report by the parallel slave in the error log 4 relay log UNTIL missed out the parallel slave branch in is_until_satisfied. The patch addresses all of them to simplify the logic of log change notification in both the master and relay-log until cases. P.1 is addressed by passing the event into is_until_satisfied() for proper analysis by the function. P.2 is fixed by changes in handle_queued_pos_update(). P.4 required removing relay-log change notification by workers. Instead the driver thread updates the notion of the current relay-log fully itself with the aid of the newly introduced bool Relay_log_info::until_relay_log_names_defer. An extra printout of the requested until file:pos is arranged with --log-warning=3.	2020-05-26 18:49:43 +03:00


2603 commits