mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-16 12:02:42 +01:00

Author	SHA1	Message	Date
Brandon Nesterenko	744580d5a7	MDEV-32892: IO Thread Reports False Error When Stopped During Connecting to Primary The IO thread can report error code 2013 into the error log when it is stopped during the initial connection process to the primary, as well as when trying to read an event. However, because the IO thread is being stopped, its connection to the primary is force-killed by the signaling thread (see THD::awake_no_mutex()), and thereby these connection errors should be ignored. Reviewed By: ============ Kristian Nielsen <knielsen@knielsen-hq.org>	2024-07-08 10:39:17 -06:00
Julius Goryavsky	52c45332a8	MDEV-34071: Failure during the galera_3nodes_sr.GCF-336 test This commit fixes sporadic failures in galera_3nodes_sr.GCF-336 test. The following changes have been made here: 1) A small addition to the test itself which should make it more deterministic by waiting for non-primary state before COMMIT; 2) More careful handling of the wsrep_ready variable in the server code (it should always be protected with mutex). No additional tests are required.	2024-05-06 03:16:59 +02:00
Sergei Golubchik	cea083af9f	cleanup: use THD_STAGE_INFO, not thd_proc_info and put master-slave.inc last in the series of includes	2024-05-05 21:37:07 +02:00
Kristian Nielsen	57f6a1ca98	MDEV-19415: use-after-free on charsets_dir from slave connect The slave IO thread sets MYSQL_SET_CHARSET_DIR. The code for this option however is not thread-safe in sql-common/client.c. The value set is temporarily written to mysys global variable `charsets-dir` and can be seen by other threads running in parallel, which can result in use-after-free error. Problem was visible as random failures of test cases in suite multi_source with Valgrind or MSAN. Work-around by not setting this option for slave connect, it is redundant anyway as it is just setting the default value. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-04-20 13:41:08 +02:00
Alexey Botchkov	85517f609a	MDEV-33393 audit plugin do not report user did the action.. The '<replication_slave>' user is assigned to the slave replication thread so this name appears in the auditing logs.	2024-02-14 00:02:29 +04:00
Sergei Golubchik	01f6abd1d4	Merge branch '10.4' into 10.5	2024-01-31 17:32:53 +01:00
Brandon Nesterenko	c75905cacb	MDEV-33327: rpl_seconds_behind_master_spike Sensitive to IO Thread Stop Position rpl.rpl_seconds_behind_master_spike uses the DEBUG_SYNC mechanism to count how many format descriptor events (FDEs) have been executed, to attempt to pause on a specific relay log FDE after executing transactions. However, depending on when the IO thread is stopped, it can send an extra FDE before sending the transactions, forcing the test to pause before executing any transactions, resulting in a table not existing, that is attempted to be read for COUNT. This patch fixes this by no longer counting FDEs, but rather by programmatically waiting until the SQL thread has executed the transaction and then automatically activating the DEBUG_SYNC point to trigger at the next relay log FDE.	2024-01-30 06:58:44 +01:00
Marko Mäkelä	12995559f9	Merge 10.4 into 10.5	2023-12-19 18:30:58 +02:00
Kristian Nielsen	eaa4968fc5	MDEV-10653: Fix segfault in SHOW MASTER STATUS with NULL inuse_relaylog The previous patch for MDEV-10653 changes the rpl_parallel::workers_idle() function to use Relay_log_info::last_inuse_relaylog to check for idle workers. But the code was missing a NULL check. Also, there was one place during SQL slave thread start which was missing mutex synchronisation when updating inuse_relaylog. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-12-19 12:08:54 +01:00
Marko Mäkelä	4ae105a37d	Merge 10.4 into 10.5	2023-12-18 08:59:07 +02:00
Brandon Nesterenko	8dad51481b	MDEV-10653: SHOW SLAVE STATUS Can Deadlock an Errored Slave AKA rpl.rpl_parallel, binlog_encryption.rpl_parallel fails in buildbot with timeout in include A replication parallel worker thread can deadlock with another connection running SHOW SLAVE STATUS. That is, if the replication worker thread is in do_gco_wait() and is killed, it will already hold the LOCK_parallel_entry, and during error reporting, try to grab the err_lock. SHOW SLAVE STATUS, however, grabs these locks in reverse order. It will initially grab the err_lock, and then try to grab LOCK_parallel_entry. This leads to a deadlock when both threads have grabbed their first lock without the second. This patch implements the MDEV-31894 proposed fix to optimize the workers_idle() check to compare the last in-use relay log’s queued_count==dequeued_count for idleness. This removes the need for workers_idle() to grab LOCK_parallel_entry, as these values are atomically updated. Huge thanks to Kristian Nielsen for diagnosing the problem! Reviewed By: ============ Kristian Nielsen <knielsen@knielsen-hq.org> Andrei Elkin <andrei.elkin@mariadb.com>	2023-12-11 07:45:23 -07:00
Sergei Golubchik	98a39b0c91	Merge branch '10.4' into 10.5	2023-12-02 01:02:50 +01:00
Monty	1ffa8c5072	Fixed build failure on aarch64-macos debug_sync.h was wrongly combined with replication	2023-11-28 15:31:44 +02:00
Anel Husakovic	a7d186a17d	MDEV-32168: slave_error_param condition is never checked from the wait_for_slave_param.inc - Reviewer: <knielsen@knielsen-hq.org> <brandon.nesterenko@mariadb.com> <andrei.elkin@mariadb.com>	2023-11-16 10:41:11 +01:00
Oleksandr Byelkin	6cfd2ba397	Merge branch '10.4' into 10.5	2023-11-08 12:59:00 +01:00
Brandon Nesterenko	c5f776e9fa	MDEV-32265: seconds_behind_master is inaccurate for Delayed replication If a replica is actively delaying a transaction when restarted (STOP SLAVE/START SLAVE), when the sql thread is back up, Seconds_Behind_Master will present as 0 until the configured MASTER_DELAY has passed. That is, before the restart, last_master_timestamp is updated to the timestamp of the delayed event. Then after the restart, the negation of sql_thread_caught_up is skipped because the timestamp of the event has already been used for the last_master_timestamp, and their update is grouped together in the same conditional block. This patch fixes this by separating the negation of sql_thread_caught_up out of the timestamp-dependent block, so it is called any time an idle parallel slave queues an event to a worker. Note that sql_thread_caught_up is still left in the check for internal events, as SBM should remain idle in such case to not "magically" begin incrementing. Reviewed By: ============ Andrei Elkin <andrei.elkin@mariadb.com>	2023-10-23 14:25:03 -06:00
Yuchen Pei	cb1965bd9d	Merge branch '10.4' into 10.5	2023-09-14 16:30:11 +10:00
Brandon Nesterenko	1407f99963	MDEV-31177: SHOW SLAVE STATUS Last_SQL_Errno Race Condition on Errored Slave Restart The SQL thread and a user connection executing SHOW SLAVE STATUS have a race condition on Last_SQL_Errno, such that a slave which previously errored and stopped, on its next start, SHOW SLAVE STATUS can show that the SQL Thread is running while the previous error is also showing. The fix is to move when the last error is cleared when the SQL thread starts to occur before setting the status of Slave_SQL_Running. Thanks to Kristian Nielson for his work diagnosing the problem! Reviewed By: ============ Andrei Elkin <andrei.elkin@mariadb.com> Kristian Nielson <knielsen@knielsen-hq.org>	2023-09-13 12:01:47 -06:00
sjaakola	a3cbc44b24	MDEV-31833 replication breaks when using optimistic replication and replica is a galera node MariaDB async replication SQL thread was stopped for any failure in applying of replication events and error message logged for the failure was: "Node has dropped from cluster". The assumption was that event applying failure is always due to node dropping out. With optimistic parallel replication, event applying can fail for natural reasons and applying should be retried to handle the failure. This retry logic was never exercised because the slave SQL thread was stopped with first applying failure. To support optimistic parallel replication retrying logic this commit will now skip replication slave abort, if node remains in cluster (wsrep_ready==ON) and replication is configured for optimistic or aggressive retry logic. During the development of this fix, galera.galera_as_slave_nonprim test showed some problems. The test was analyzed, and it appears to need some attention. One excessive sleep command was removed in this commit, but it will need more fixes still to be fully deterministic. After this commit galera_as_slave_nonprim is successful, though. Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>	2023-09-12 02:37:30 +02:00
Kristian Nielsen	7c9837ce74	Merge 10.4 into 10.5 Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-15 18:02:18 +02:00
Kristian Nielsen	900c4d6920	MDEV-31655: Parallel replication deadlock victim preference code errorneously removed Restore code to make InnoDB choose the second transaction as a deadlock victim if two transactions deadlock that need to commit in-order for parallel replication. This code was erroneously removed when VATS was implemented in InnoDB. Also add a test case for InnoDB choosing the right deadlock victim. Also fixes this bug, with testcase that reliably reproduces: MDEV-28776: rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master Note: This should be null-merged to 10.6, as a different fix is needed there due to InnoDB locking code changes. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-15 16:35:30 +02:00
Marko Mäkelä	599c4d9a40	Merge 10.4 into 10.5	2023-08-15 11:10:27 +03:00
Jan Lindström	277968aa4c	MDEV-31413 : Node has been dropped from the cluster on Startup / Shutdown with async replica There was two related problems: (1) Galera node that is defined as a slave to async MariaDB master at restart might do SST (state stransfer) and part of that it will copy mysql.gtid_slave_pos table. Problem is that updates on that table are not replicated on a cluster. Therefore, table from donor that is not slave is copied and joiner looses gtid position it was and start executing events from wrong position of the binlog. This incorrect position could break replication and causes node to be dropped and requiring user action. (2) Slave sql thread might start executing events before galera is ready (wsrep_ready=ON) and that could also cause node to be dropped from the cluster. In this fix we enable replication of mysql.gtid_slave_pos table on a cluster. In this way all nodes in a cluster will know gtid slave position and even after SST joiner knows correct gtid position to start. Furthermore, we wait galera to be ready before slave sql thread executes any events to prevent too early execution. Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>	2023-08-08 03:25:56 +02:00
Oleksandr Byelkin	7564be1352	Merge branch '10.4' into 10.5	2023-07-26 16:02:57 +02:00
Brandon Nesterenko	063f4ac25e	MDEV-30619: Parallel Slave SQL Thread Can Update Seconds_Behind_Master with Active Workers MDEV-31749 sporadic assert in MDEV-30619 new test If the workers of a parallel replica are busy (potentially with long queues), but the SQL thread has no events left to distribute (so it goes idle), then the next event that comes from the primary will update mi->last_master_timestamp with its timestamp, even if the workers have not yet finished. This patch changes the parallel replica logic which updates last_master_timestamp after idling from using solely sql_thread_caught_up (added in MDEV-29639) to using the latter with rli queued/dequeued event counters. That is, if the queued count is equal to the dequeued count, it means all events have been processed and the replica is considered idle when the driver thread has also distributed all events. Low level details of the commit include - to make a more generalized test for Seconds_Behind_Master on the parallel replica, rpl_delayed_parallel_slave_sbm.test is renamed to rpl_parallel_sbm.test for this purpose. - pause_sql_thread_on_next_event usage was removed with the MDEV-30619 fixes. Rather than remove it, we adapt it to the needs of this test case - added test case to cover SBM spike of relay log read and LMT update that was fixed by MDEV-29639 - rpl_seconds_behind_master_spike.test is made to use the negate_clock_diff_with_master debug eval. Reviewed By: ============ Andrei Elkin <andrei.elkin@mariadb.com>	2023-07-25 16:36:14 +03:00
Oleksandr Byelkin	edf8ce5b97	Merge branch 'bb-10.4-release' into bb-10.5-release	2023-05-02 13:54:54 +02:00
Oleksandr Byelkin	edd0b03e60	Merge branch '10.3' into 10.4	2023-05-02 10:09:27 +02:00
Andrei	55a53949be	MDEV-29621: Replica stopped by locks on sequence When using binlog_row_image=FULL with sequence table inserts, a replica can deadlock because it treats full inserts in a sequence as DDL statements by getting an exclusive lock on the sequence table. It has been observed that with parallel replication, this exclusive lock on the sequence table can lead to a deadlock where one transaction has the exclusive lock and is waiting on a prior transaction to commit, whereas this prior transaction is waiting on the MDL lock. This fix for this is on the master side, to raise FL_DDL flag on the GTID of a full binlog_row_image write of a sequence table. This forces the slave to execute the statement serially so a deadlock cannot happen. A test verifies the deadlock also to prove it happen on the OLD (pre-fixes) slave. OLD (buggy master) -replication-> NEW (fixed slave) is provided. As the pre-fixes master's full row-image may represent both SELECT NEXT VALUE and INSERT, the parallel slave pessimistically waits for the prior transaction to have committed before to take on the critical part of the second (like INSERT in the test) event execution. The waiting exploits a parallel slave's retry mechanism which is controlled by `@@global.slave_transaction_retries`. Note that in order to avoid any persistent 'Deadlock found' 2013 error in OLD -> NEW, `slave_transaction_retries` may need to be set to a higher than the default value. START-SLAVE is an effective work-around if this still happens.	2023-04-27 21:55:45 +03:00
Marko Mäkelä	dfa90257f6	MDEV-30936 clang 15.0.7 -fsanitize=memory fails massively handle_slave_io(), handle_slave_sql(), os_thread_exit(): Remove a redundant pthread_exit(nullptr) call, because it would cause SIGSEGV. mysql_print_status(): Add MEM_MAKE_DEFINED() to work around some missing instrumentation around mallinfo2(). que_graph_free_stat_list(): Invoke que_node_get_next(node) before que_graph_free_recursive(node). That is the logical and MSAN_OPTIONS=poison_in_dtor=1 compatible way of freeing memory. ins_node_t::~ins_node_t(): Invoke mem_heap_free(entry_sys_heap). que_graph_free_recursive(): Rely on ins_node_t::~ins_node_t(). fts_t::~fts_t(): Invoke mem_heap_free(fts_heap). fts_free(): Replace with direct calls to fts_t::~fts_t(). The failures in free_root() due to MSAN_OPTIONS=poison_in_dtor=1 will be covered in MDEV-30942.	2023-03-28 11:44:24 +03:00
Marko Mäkelä	c41c79650a	Merge 10.4 into 10.5	2023-02-10 12:02:11 +02:00
Brandon Nesterenko	eecd4f1459	MDEV-30608: rpl.rpl_delayed_parallel_slave_sbm sometimes fails with Seconds_Behind_Master should not have used second transaction timestamp One of the constraints added in the MDEV-29639 patch, is that only the first event after idling should update last_master_timestamp; and as long as the replica has more events to execute, the variable should not be updated. The corresponding test, rpl_delayed_parallel_slave_sbm.test, aims to verify this; however, if the IO thread takes too long to queue events, the SQL thread can appear to catch up too fast. This fix ensures that the relay log has been fully written before executing the events. Note that the underlying cause of this test failure needs to be addressed as a bug-fix, this is a temporary fix to stop test failures. To track work on the bug-fix for the underlying issue, please see MDEV-30619.	2023-02-09 13:02:14 -07:00
Oleksandr Byelkin	a977054ee0	Merge branch '10.3' into 10.4	2023-01-28 18:22:55 +01:00
Oleksandr Byelkin	7fa02f5c0b	Merge branch '10.4' into 10.5	2023-01-27 13:54:14 +01:00
Oleksandr Byelkin	dd24fa3063	Merge branch '10.3' into 10.4	2023-01-26 10:34:26 +01:00
Brandon Nesterenko	d69e835787	MDEV-29639: Seconds_Behind_Master is incorrect for Delayed, Parallel Replicas Problem ======== On a parallel, delayed replica, Seconds_Behind_Master will not be calculated until after MASTER_DELAY seconds have passed and the event has finished executing, resulting in potentially very large values of Seconds_Behind_Master (which could be much larger than the MASTER_DELAY parameter) for the entire duration the event is delayed. This contradicts the documented MASTER_DELAY behavior, which specifies how many seconds to withhold replicated events from execution. Solution ======== After a parallel replica idles, the first event after idling should immediately update last_master_timestamp with the time that it began execution on the primary. Reviewed By =========== Andrei Elkin <andrei.elkin@mariadb.com>	2023-01-24 08:11:35 -07:00
Jan Lindström	4eb8e51c26	Merge 10.4 into 10.5	2022-11-30 13:10:52 +02:00
Julius Goryavsky	1ebf0b7372	MDEV-29817: Issues with handling options for SSL CRLs (and some others) This patch adds the correct setting of the "--tls-version" and "--ssl-verify-server-cert" options in the client-side utilities such as mysqltest, mysqlcheck and mysqlslap, as well as the correct setting of the "--ssl-crl" option when executing queries on the slave side, and also the correct option codes in the "sslopts-logopts.h" file (in the latter case, incorrect values are not a problem right now, but may cause subtle test failures in the future, if the option handling code changes).	2022-11-22 15:16:12 +01:00
Julius Goryavsky	f0820400ee	MDEV-29817: Issues with handling options for SSL CRLs (and some others) This patch adds the correct setting of the "--ssl-verify-server-cert" option in the client-side utilities such as mysqlcheck and mysqlslap, as well as the correct setting of the "--ssl-crl" option when executing queries on the slave side, and also add the correct option codes in the "sslopts-logopts.h" file (in the latter case, incorrect values are not a problem right now, but may cause subtle test failures in the future, if the option handling code changes).	2022-11-22 14:07:39 +01:00
Marko Mäkelä	6286a05d80	Merge 10.4 into 10.5	2022-09-26 13:34:38 +03:00
Marko Mäkelä	3c92050d1c	Fix build without either ENABLED_DEBUG_SYNC or DBUG_OFF There are separate flags DBUG_OFF for disabling the DBUG facility and ENABLED_DEBUG_SYNC for enabling the DEBUG_SYNC facility. Let us allow debug builds without DEBUG_SYNC. Note: For CMAKE_BUILD_TYPE=Debug, CMakeLists.txt will continue to define ENABLED_DEBUG_SYNC.	2022-09-23 17:37:52 +03:00
Marko Mäkelä	a69cf6f07e	MDEV-29613 Improve WITH_DBUG_TRACE=OFF In commit `28325b0863` a compile-time option was introduced to disable the macros DBUG_ENTER and DBUG_RETURN or DBUG_VOID_RETURN. The parameter name WITH_DBUG_TRACE would hint that it also covers DBUG_PRINT statements. Let us do that: WITH_DBUG_TRACE=OFF shall disable DBUG_PRINT() as well. A few InnoDB recovery tests used to check that some output from DBUG_PRINT("ib_log", ...) is present. We can live without those checks. Reviewed by: Vladislav Vaintroub	2022-09-23 13:40:42 +03:00
Marko Mäkelä	a9d0bb12e6	Merge 10.4 into 10.5	2022-06-09 12:22:55 +03:00
Marko Mäkelä	ea1fbd0326	Merge 10.3 into 10.4	2022-06-07 15:55:32 +03:00
Marko Mäkelä	099b9202a5	MDEV-27697 fixup: Exclude debug code from non-debug builds	2022-06-03 10:47:34 +03:00
Sergei Golubchik	7970ac7fe8	Merge branch '10.4' into 10.5	2022-05-18 09:50:26 +02:00
Sergei Golubchik	23ddc3518f	Merge branch '10.3' into 10.4	2022-05-18 01:25:30 +02:00
Sergei Golubchik	a0d4f0f306	Merge branch '10.2' into 10.3 commit `84984b79f2` is null-merged	2022-05-18 01:23:47 +02:00
Andrei	726bd8c968	MDEV-28550 improper handling of replication event group that contains GTID_LIST_EVENT or INCIDENT_EVENT. It's legal to have either of the two inside a group. E.g Gtid_event, Gtid_log_list_event, Query_1, ... Xid_log_event is permitted. However, the slave IO thread treated both as the terminal even when the group represents a DDL query. That causes a premature Gtid state update so the slave IO would think the whole group has been collected while in fact Query_1 etc are yet to process. Fixed with correcting a condition to compute the terminal event of the group. Tested with rpl_mysqlbinlog_slave_consistency (of 10.9) and rpl_gtid_errorlog.test.	2022-05-13 09:45:32 +02:00
Sergei Golubchik	ef781162ff	Merge branch '10.4' into 10.5	2022-05-09 22:04:06 +02:00
Sergei Golubchik	a70a1cf3f4	Merge branch '10.3' into 10.4	2022-05-08 23:03:08 +02:00

1 2 3 4 5 ...

2664 commits