mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-31 02:51:44 +01:00

Author	SHA1	Message	Date
Sergei Golubchik	6242783f24	rpl.rpl_semi_sync_fail_over improve debugability	2024-04-21 14:03:26 +02:00
Sergei Golubchik	a4b6409ff6	sporadic failures of binlog_encryption.rpl_parallel_slave_bgc_kill do CHANGE MASTER before sync_with_master to have the slave in a predictable fully synced state before the next test	2024-04-21 01:17:31 +02:00
Sergei Golubchik	d8368ae289	Merge '10.5' into 10.6	2024-04-20 14:47:26 +02:00
Kristian Nielsen	0c249ad718	MDEV-30232: rpl.rpl_gtid_crash fails sporadically in BB The root cause of the failure is a bug in the Linux network stack: https://lore.kernel.org/netdev/87sf0ldk41.fsf@urd.knielsen-hq.org/T/#u If the slave does a connect(2) at the exact same time that kill -9 of the master process closes the listening socket, the FIN or RST packet is lost in the kernel, and the slave ends up timing out waiting for the initial communication from the server. This timeout defaults to --slave-net-timeout=120, which causes include/master_gtid_wait.inc to time out first and fail the test. Work-around this problem by reducing the --slave-net-timeout for this test case. If this problem turns up in other tests, we can consider reducing the default value for all tests. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-04-20 13:41:08 +02:00
Marko Mäkelä	bb2e125d07	Merge 10.5 into 10.6 This excludes commit `040069f4ba` because it is specific to innodb_sync_debug, which had been removed in commit `ff5d306e29`.	2024-04-18 07:14:56 +03:00
Brandon Nesterenko	0ad52e4d6a	MDEV-27512: Assertion !thd->transaction_rollback_request failed in rows_event_stmt_cleanup If replicating an event in ROW format, and InnoDB detects a deadlock while searching for a row, the row event will error and rollback in InnoDB and indicate that the binlog cache also needs to be cleared, i.e. by marking thd->transaction_rollback_request. In the normal case, this will trigger an error in Rows_log_event::do_apply_event() and cause a rollback. During the Rows_log_event::do_apply_event() cleanup of a successful event application, there is a DBUG_ASSERT in log_event_server.cc::rows_event_stmt_cleanup(), which sets the expectation that thd->transaction_rollback_request cannot be set because the general rollback (i.e. not the InnoDB rollback) should have happened already. However, if the replica is configured to skip deadlock errors, the rows event logic will clear the error and continue on, as if no error happened. This results in thd->transaction_rollback_request being set while in rows_event_stmt_cleanup(), thereby triggering the assertion. This patch fixes this in the following ways: 1) The assertion is invalid, and thereby removed. 2) The rollback case is forced in rows_event_stmt_cleanup() if transaction_rollback_request is set. Note the differing behavior between transactions which are skipped due to deadlock errors and other errors. When a transaction is skipped due to an ignored deadlock error, the entire transaction is rolled back and skipped (though note MDEV-33930 which allows statements in the same transaction after the deadlock-inducing one to commit). When a transaction is skipped due to ignoring a different error, only the erroring statements are rolled-back and skipped - the rest of the transaction will execute as normal. The effect of this can be seen in the test results. The added test case to rpl_skip_error.test shows that only statements which are ignored due to non-deadlock errors are ignored in larger transactions. A diff between rpl_temporary_error2_skip_all.result and rpl_temporary_error2.result shows that all statements in the errored transaction are rolled back (diff pasted below): : diff rpl_temporary_error2.result rpl_temporary_error2_skip_all.result 49c49 < 2 1 --- > 2 NULL 51c51 < 4 1 --- > 4 NULL 53c53 < * There will be two rows in t2 due to the retry. --- > * There will be one row in t2 because the ignored deadlock does not retry. 57d56 < 1 59c58 < 1 --- > 0 Reviewed By: ============ Andrei Elkin <andrei.elkin@mariadb.com>	2024-04-17 11:14:21 -06:00
Vladislav Vaintroub	061adae9a2	MDEV-16944 Fix file sharing issues on Windows in mysqltest On Windows systems, occurrences of ERROR_SHARING_VIOLATION due to conflicting share modes between processes accessing the same file can result in CreateFile failures. mysys' my_open() already incorporates a workaround by implementing wait/retry logic on Windows. But this does not help if files are opened using shell redirection like mysqltest traditionally did it, i.e via --echo exec "some text" > output_file In such cases, it is cmd.exe, that opens the output_file, and it won't do any sharing-violation retries. This commit addresses the issue by introducing a new built-in command, 'write_line', in mysqltest. This new command serves as a brief alternative to 'write_file', with a single line output, that also resolves variables like "exec" would. Internally, this command will use my_open(), and therefore retry-on-error logic. Hopefully this will eliminate the very sporadic "can't open file because it is used by another process" error on CI.	2024-04-17 16:52:37 +02:00
Marko Mäkelä	829cb1a49c	Merge 10.5 into 10.6	2024-04-17 14:14:58 +03:00
Daniel Black	ea810b04cb	MDEV-30676 rpl.parallel_backup* tests sometimes fail Raise innodb_lock_wait_timeout from 1 to 5	2024-04-15 15:45:03 +10:00
Brandon Nesterenko	a6aecbb036	MDEV-10684: rpl.rpl_domain_id_filter_restart fails in buildbot The test failure in rpl.rpl_domain_id_filter_restart is caused by MDEV-33887. That is, the test uses master_pos_wait() (called indirectly by sync_slave_with_master) to try and wait for the replica to catch up to the master. However, the waited on transaction is ignored by the configured CHANGE MASTER TO IGNORE_DOMAIN_IDS=() As MDEV-33887 reports, due to the IO thread updating the binlog coordinates and the SQL thread updating the GTID state, if the replica is stopped in-between these updates, the replica state will be inconsistent. That is, the test expects that the GTID state will be updated, so upon restart, the replica will be up-to-date. However, if the replica is stopped before the SQL thread updates its GTID state, then upon restart, the replica will fetch the previously ignored event, which is no longer ignored upon restart, and execute it. This leads to the sporadic extra row in t2. This patch changes master_pos_wait() to use master_gtid_wait() to ensure the replica state is consistent with the master state.	2024-04-11 09:49:20 -06:00
Sergei Golubchik	340d93a8cc	cleanup: rpl.rpl_semi_sync_shutdown_await_ack avoid using multiple files with the same functionality.	2024-04-11 15:28:55 +02:00
Sergei Golubchik	41296a07c8	Merge branch '10.5' into 10.6	2024-04-11 13:58:22 +02:00
Sergei Golubchik	2d2172a5cf	sporadic failures of rpl.rpl_semi_sync_master_shutdown increase the MASTER_CONNECT_RETRY time under valgrind, otherwise the slave gives up retrying before the master is ready also, cosmetic cleanup of rpl_semi_sync_master_shutdown.test	2024-04-10 19:38:39 +02:00
Andrei	0da1653f1b	MDEV-31779 Server crash in Rows_log_event::update_sequence upon replaying binary log The crash at running mysqlbinlog on a SEQUENCE containing binlog file was caused MDEV-29621 fixes that did not check which of the slave or binlog applier executes a block introduced there. The block is meaningful only for the parallel slave applier, so it's safe to fix this bug with identified the actual applier and skipping the block when it's the mysqlbinlog one.	2024-04-10 19:31:39 +03:00
Brandon Nesterenko	952ab9a596	MDEV-30260: Slave crashed:reload_acl_and_cache during shutdown The signal handler thread can use various different runtime resources when processing a SIGHUP (e.g. master-info information) due to calling into reload_acl_and_cache(). Currently, the shutdown process waits for the termination of the signal thread after performing cleanup. However, this could cause resources actively used by the signal handler to be freed while reload_acl_and_cache() is processing. The specific resource that caused MDEV-30260 is a race condition for the hostname_cache, such that mysqld would delete it in clean_up()::hostname_cache_free(), before the signal handler would use it in reload_acl_and_cache()::hostname_cache_refresh(). Another similar resource is the active_mi/master_info_index. There was a race between its deletion by the main thread in end_slave(), and their usage by the Signal Handler as a part of Master_info_index::flush_all_relay_logs.read(active_mi) in reload_acl_and_cache(). This patch fixes these race conditions by relocating where server shutdown waits for the signal handler to die until after server-level threads have been killed (i.e., as a last step of close_connections()). With respect to the hostname_cache, active_mi and master_info_cache, this ensures that they cannot be destroyed while the signal handler is still active, and potentially using them. Additionally: 1) This requires that Events memory is still in place for SIGHUP handling's mysql_print_status(). So event deinitialization is moved into clean_up(), but the event scheduler still needs to be stopped in close_connections() at the same spot. 2) The function kill_server_thread is no longer used, so it is deleted 3) The timeout to wait for the death of the signal thread was not consistent with the comment. The comment mentioned up to 10 seconds, whereas it was actually 0.01s. The code has been fixed to wait up to 10 seconds. 4) A warning has been added if the signal handler thread fails to exit in time. 5) Added pthread_join() to end of wait_for_signal_thread_to_end() if it hadn't ended in 10s with a warning. Note this also removes the pthread_detached attribute from the signal_thread to allow for the pthread_join(). Reviewed By: =========== Vladislav Vaintroub <wlad@mariadb.com> Andrei Elkin <andrei.elkin@mariadb.com>	2024-04-09 14:25:13 -06:00
Kristian Nielsen	d90a2b44ad	MDEV-33668: More precise dependency tracking of XA XID in parallel replication Keep track of each recently active XID, recording which worker it was queued on. If an XID might still be active, choose the same worker to queue event groups that refer to the same XID to avoid conflicts. Otherwise, schedule the XID freely in the next round-robin slot. This way, XA PREPARE can normally be scheduled without restrictions (unless duplicate XID transactions come close together). This improves scheduling and parallelism over the old method, where the worker thread to schedule XA PREPARE on was fixed based on a hash value of the XID. XA COMMIT will normally be scheduled on the same worker as XA PREPARE, but can be a different one if the XA PREPARE is far back in the event history. Testcase and code for trimming dynamic array due to Andrei. Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-04-09 11:42:34 +03:00
Brandon Nesterenko	89c907bd4f	MDEV-33672: Gtid_log_event Construction from File Should Ensure Event Length When Using Extra Flags A GTID event can have variable length, with contributing factors such as the variable length from the flags2 and optional extra flags fields. These fields are bitmaps, where each set bit indicates an additional value that should be appended to the event, e.g. multi-engine transactions append a number to indicate the number of additional engines a transaction uses. However, if a flags bit is set, and no additional fields are appended to the event, MDEV-33672 reports that the server can still try to read from memory as if it did exist. Note, however, in debug builds, this condition is asserted for FL_EXTRA_MULTI_ENGINE. This patch fixes this to check that the length of the event is aligned with the expectation set by the flags for FL_PREPARED_XA, FL_COMPLETED_XA, and FL_EXTRA_MULTI_ENGINE. Reviewed By ============ Kristian Nielsen <knielsen@knielsen-hq.org>	2024-04-08 07:57:14 -06:00
Brandon Nesterenko	9a4991a089	MDEV-33799: mysql_manager_submit Segfault at Startup Still Possible During Recovery MDEV-26473 fixed a segmentation fault at startup between the handle manager thread and the binlog background thread, such that the binlog background thread could be started and submit a job to the handle manager, before it had initialized. Where MDEV-26473 made it so the handle manager would initialize before the main thread started the normal binary logs, it did not account for the recovery case. That is, there is still a possibility of a segmentation fault when a server is recovering using the binary logs such that it can open the binary logs, start the binlog background thread, and submit a job to the handle manager before it is initialized. This patch fixes this by moving the initialization of the mysql handler manager to happen prior to recovery. Reviewed By: ============ Andrei Elkin <andrei.elkin@mariadb.com>	2024-04-03 11:55:18 -06:00
Brandon Nesterenko	75c7c6dc39	MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency When using semi-sync replication with rpl_semi_sync_master_wait_point=AFTER_COMMIT, the performance of the primary can significantly reduce compared to AFTER_SYNC's performance for workloads with many concurrent users executing transactions. This is because all connections on the primary share the same cond_wait variable/mutex pair, so any time an ACK is received from a replica, all waiting connections are awoken to check if the ACK was for itself, which is done in mutual exclusion. This patch changes this such that the waiting THD will use its own local condition variable, and the ACK receiver thread only signals connections which have been ACKed for wakeup. That is, the THD::LOCK_wakeup_ready condition variable is re-used for this purpose, and the Active_tranx queue nodes are extended to hold the waiting thread, so it can be signalled once ACKed. Additionally: 1) Removed part of MDEV-11853 additions, which allowed suspended connection threads awaiting their semi-sync ACKs to live until their ACKs had been received. This part, however, wasn't needed. That is, all that was needed was for the Ack_thread to survive. So now the connection threads are killed during phase 1. Thereby THD::is_awaiting_semisync_ack, and all its related code was removed. 2) COND_binlog_send is repurposed to signal on the condition when Active_tranx is emptied during clear_active_tranx_nodes. 3) At master shutdown (when waiting for slaves), instead of the main loop individually waiting for each ACK, await_slave_reply() (renamed await_all_slave_replies()) just waits once for the repurposed COND_binlog_send to signal it is empty. 4) Test rpl_semi_sync_shutdown_await_ack is updates as following: 4.1) Added test case (adapted from Kristian Nielsen) to ensure that if a thread awaiting its ACK is killed while SHUTDOWN WAIT FOR ALL SLAVES is issued, the primary will still wait for the ACK from the killed thread. 4.2) As connections which by-passed phase 1 of thread killing no longer are delayed for kill until phase 2, we can no longer query yes/no tx after receiving an ACK/timeout. The check for these variables is removed. 4.3) Comment descriptions are updated which mention that the connection is alive; and adjusted to be the Ack_thread. Reviewed By: ============ Kristian Nielsen <knielsen@knielsen-hq.org>	2024-03-21 08:42:18 -06:00
Brandon Nesterenko	ca07f62992	MDEV-33716: rpl.rpl_semi_sync_slave_enabled_consistent Fails with Error Condition Reached Though the test itself doesn't create any transactions directly, the added test suppressions are replicated, and when the SQL thread is stopped mid-execution, it is set into an error state because these are non-transactional events being aborted. This patch fixes the test by ensuring that the test suppressions are fully replicated before continuing	2024-03-19 14:16:07 -06:00
Marko Mäkelä	50715bd2ed	Merge 10.5 into 10.6	2024-03-18 17:07:32 +02:00
Kristian Nielsen	1fb00f37ae	MDEV-33303: slave_parallel_mode=optimistic should not report the mode's specific temporary errors An earlier patch for MDEV-13577 fixed the most common instances of this, but missed one case for tables without primary key when the scan reaches the end of the table. This patch adds similar code to handle this case, converting the error to HA_ERR_RECORD_CHANGED when doing optimistic parallel apply. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-03-15 15:55:06 +01:00
Kristian Nielsen	fb774eb1eb	Fix occasional test failure of rpl.rpl_parallel_stop_slave Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-03-15 15:04:09 +01:00
Sergei Golubchik	f71d7f2f0f	Merge branch '10.5' into 10.6	2024-03-13 21:02:34 +01:00
Marko Mäkelä	c3a00dfa53	Merge 10.5 into 10.6	2024-03-12 09:19:57 +02:00
Marko Mäkelä	f703e72bd8	Merge 10.4 into 10.5	2024-03-11 10:08:20 +02:00
Monty	b3d507ff13	Suppressed new warning for rpl_get_lock on amd-freebsd and aarch64-macos	2024-03-09 12:20:58 +02:00
Monty	567c097359	MDEV-33582 Add more warnings to be able to better diagnose network issues Warnings are added to net_server.cc when global_system_variables.log_warnings >= 4. When the above condition holds then: - All communication errors from net_serv.cc is also written to the error log. - In case of a of not being able to read or write a packet, a more detailed error is given. Other things: - Added detection of slaves that has hangup to Ack_receiver::run() - vio_close() is now first marking the socket closed before closing it. The reason for this is to ensure that the connection that gets a read error can check if the reason was that the socket was closed. - Add a new state to vio to be able to detect if vio is acive, shutdown or closed. This is used to detect if socket is closed by another thread. - Testing of the new warnings is done in rpl_get_lock.test - Suppress some of the new warnings in mtr to allow one to run some of the tests with -mysqld=--log-warnings=4. All test in the 'rpl' suite can now be run with this option. - Ensure that global.log_warnings are restored at test end in a way that allows one to use mtr --mysqld=--log-warnings=4. Reviewed-by: <serg@mariadb.org>,<brandon.nesterenko@mariadb.com>	2024-03-05 20:19:49 +02:00
Brandon Nesterenko	bd604add76	MDEV-33546: Rpl_semi_sync_slave_status is ON When Replication Is Not Configured If a server has a default configuration (e.g. in a my.cnf file) with rpl_semi_sync_slave_enabled set, on server start, the corresponding rpl_semi_sync_slave_status variable will also be ON initially, even if the slave was never configured/started. This is because the Repl_semi_sync_slave initialization logic (function init_object()) sets the running status to the enabled value during init_server_components(). This patch fixes this by removing the statement which sets the semi-sync slave running status from the initialization logic. An additional change needed from this is to semi-sync recovery: this status variable was used as a condition to determine binlog truncation during server recovery. This patch also switches this condition to reference the global rpl_semi_sync_slave_enabled variable. Though note, the semi-sync recovery condition is to be changed entirely with the MDEV-33424 agenda. Reviewed By: ============ Andrei Elkin <andrei.elkin@mariadb.com>	2024-02-29 07:38:55 -07:00
Brandon Nesterenko	b04c857596	MDEV-33500: rpl.rpl_parallel_sbm can fail on slow machines, e.g. MSAN/Valgrind builders In an addition to test rpl.rpl_parallel_sbm added by MDEV-32265, the test uses sleep statements alone to test Seconds_Behind_Master with delayed replication. On slow running machines, the test can pass the intended MASTER_DELAY duration and Seconds_Behind_Master can become 0, when the test expects the transaction to still be actively in a delaying state. This can be consistently reproduced by adding a sleep statement before the call to --let = query_get_value(SHOW SLAVE STATUS, Seconds_Behind_Master, 1) to sleep past the delay end point. This patch fixes this by locking the table which the delayed transaction targets so Second_Behind_Master cannot be updated before the test reads it for validation.	2024-02-20 08:19:18 -07:00
Kristian Nielsen	c73c6aea63	MDEV-33426: Aria temptables wrong thread-specific memory accounting in slave thread Aria temporary tables account allocated memory as specific to the current THD. But this fails for slave threads, where the temporary tables need to be detached from any specific THD. Introduce a new flag to mark temporary tables in replication as "global", and use that inside Aria to not account memory allocations as thread specific for such tables. Based on original suggestion by Monty. Reviewed-by: Monty <monty@mariadb.org> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-02-16 12:48:30 +01:00
Alexey Botchkov	85517f609a	MDEV-33393 audit plugin do not report user did the action.. The '<replication_slave>' user is assigned to the slave replication thread so this name appears in the auditing logs.	2024-02-14 00:02:29 +04:00
Marko Mäkelä	691f923906	Merge 10.5 into 10.6	2024-02-13 20:42:59 +02:00
Brandon Nesterenko	03d1346e7f	MDEV-29369: rpl.rpl_semi_sync_shutdown_await_ack fails regularly with Result content mismatch This test was prone to failures for a few reasons, summarized below: 1) MDEV-32168 introduced “only_running_threads=1” to slave_stop.inc, which allowed the stop logic to bypass an attempting-to-reconnect IO thread. That is, the IO thread could realize the master shutdown in `read_event()`, and thereby call into `try_to_reconnect()`. This would leave the IO thread up when the test expected it to be stopped. Fixed by explicitly stopping the IO thread and allowing an error state, as the above case would lead to errno 2003. 2) On slow systems (or those running profiling tools, e.g. MSAN), the waiting-for-ack transaction can complete before the system processes the `SHUTDOWN WAIT FOR ALL SLAVES`. There was shutdown preparation logic in-between the transaction and shutdown itself, which contributes to this problem. This patch also moves this preparation logic before the transaction, so there is less to do in-between the calls. 3) Changed work-around for MDEV-28141 to use debug_sync instead of sleep delay, as it was still possible to hit the bug on very slow systems. 4) Masked MTR variable reset with disable/enable query log Reviewed By: ============ Kristian Nielsen <knielsen@knielsen-hq.org>	2024-02-12 05:48:18 -07:00
Brandon Nesterenko	ee89558312	MDEV-14357: rpl.rpl_domain_id_filter_io_crash failed in buildbot with wrong result A race condition with the SQL thread, where depending on if it was killed before or after it had executed the fake/generated IGN_GTIDS Gtid_list_log_event, may or may not update gtid_slave_pos with the position of the ignored events. Then, the slave would be restarted while resetting IGNORE_DOMAIN_IDS to be empty, which would result in the slave requesting different starting locations, depending on whether or not gtid_slave_pos was updated. And, because previously ignored events could now be requested and executed (no longer ignored), their presence would fail the test. This patch fixes this in two ways. First, to use GTID positions for synchronization rather than binlog file positions. Then second, to synchronize the SQL thread’s gtid_slave_pos with the ignored events before killing the SQL thread. To consistently reproduce the test failure, the following patch can be applied: diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc index f51f5b7deec..de62233acff 100644 --- a/sql/log_event_server.cc +++ b/sql/log_event_server.cc @@ -3686,6 +3686,12 @@ Gtid_list_log_event::do_apply_event(rpl_group_info rgi) void hton= NULL; uint32 i; + sleep(1); + if (rli->sql_driver_thd->killed \|\| rli->abort_slave) + { + return 0; + } + Reviewed By: ============ Kristian Nielsen <knielsen@knielsen-hq.org>	2024-02-12 05:48:18 -07:00
Marko Mäkelä	8ec12e0d6d	Merge 10.4 into 10.5	2024-02-12 11:38:13 +02:00
Sergei Golubchik	3f6038bc51	Merge branch '10.5' into 10.6	2024-01-31 18:04:03 +01:00
Sergei Golubchik	01f6abd1d4	Merge branch '10.4' into 10.5	2024-01-31 17:32:53 +01:00
Monty	57ffcd686f	MDEV-21472: ALTER TABLE ... ANALYZE PARTITION ... with EITS reads and locks all rows This was fixed in 10.2 in 2020 but merging the code to 10.3 caused the bug to come back.	2024-01-30 09:19:01 +02:00
Brandon Nesterenko	c75905cacb	MDEV-33327: rpl_seconds_behind_master_spike Sensitive to IO Thread Stop Position rpl.rpl_seconds_behind_master_spike uses the DEBUG_SYNC mechanism to count how many format descriptor events (FDEs) have been executed, to attempt to pause on a specific relay log FDE after executing transactions. However, depending on when the IO thread is stopped, it can send an extra FDE before sending the transactions, forcing the test to pause before executing any transactions, resulting in a table not existing, that is attempted to be read for COUNT. This patch fixes this by no longer counting FDEs, but rather by programmatically waiting until the SQL thread has executed the transaction and then automatically activating the DEBUG_SYNC point to trigger at the next relay log FDE.	2024-01-30 06:58:44 +01:00
Brandon Nesterenko	e4f221a5f2	MDEV-33327: rpl_seconds_behind_master_spike Sensitive to IO Thread Stop Position rpl.rpl_seconds_behind_master_spike uses the DEBUG_SYNC mechanism to count how many format descriptor events (FDEs) have been executed, to attempt to pause on a specific relay log FDE after executing transactions. However, depending on when the IO thread is stopped, it can send an extra FDE before sending the transactions, forcing the test to pause before executing any transactions, resulting in a table not existing, that is attempted to be read for COUNT. This patch fixes this by no longer counting FDEs, but rather by programmatically waiting until the SQL thread has executed the transaction and then automatically activating the DEBUG_SYNC point to trigger at the next relay log FDE.	2024-01-29 15:17:57 -07:00
Brandon Nesterenko	112eb14f7e	MDEV-27850: rpl_seconds_behind_master_spike debug_sync fix A debug_sync signal could remain for the SQL thread that should have begun a wait_for upon seeing a GTID event, but would instead see the old signal and continue on without waiting. This broke an "idle" condition in SHOW SLAVE STATUS which should have automatically negated Seconds_Behind_Master. Instead, because the SQL thread had already processed the GTID event, it set sql_thread_caught_up to false, and thereby calculated the value of Seconds_behind_master, when the test expected 0. This patch fixes this by resetting the debug_sync state before creating a new transaction which sends a GTID event to the replica	2024-01-26 11:43:34 -07:00
Monty	26c86c39fc	Fixed some mtr tests that failed on windows Most things where wrong in the test suite. The one thing that was a bug was that table_map_id was in some places defined as ulong and in other places as ulonglong. On Linux 64 bit this is not a problem as ulong == ulonglong, but on windows this caused failures. Fixed by ensuring that all instances of table_map_id are ulonglong.	2024-01-23 13:03:12 +02:00
Michael Widenius	7af50e4df4	MDEV-32551: "Read semi-sync reply magic number error" warnings on master rpl_semi_sync_slave_enabled_consistent.test and the first part of the commit message comes from Brandon Nesterenko. A test to show how to induce the "Read semi-sync reply magic number error" message on a primary. In short, if semi-sync is turned on during the hand-shake process between a primary and replica, but later a user negates the rpl_semi_sync_slave_enabled variable while the replica's IO thread is running; if the io thread exits, the replica can skip a necessary call to kill_connection() in repl_semisync_slave.slave_stop() due to its reliance on a global variable. Then, the replica will send a COM_QUIT packet to the primary on an active semi-sync connection, causing the magic number error. The test in this patch exits the IO thread by forcing an error; though note a call to STOP SLAVE could also do this, but it ends up needing more synchronization. That is, the STOP SLAVE command also tries to kill the VIO of the replica, which makes a race with the IO thread to try and send the COM_QUIT before this happens (which would need more debug_sync to get around). See THD::awake_no_mutex for details as to the killing of the replica’s vio. Notes: - The MariaDB documentation does not make it clear that when one enables semi-sync replication it does not matter if one enables it first in the master or slave. Any order works. Changes done: - The rpl_semi_sync_slave_enabled variable is now a default value for when semisync is started. The variable does not anymore affect semisync if it is already running. This fixes the original reported bug. Internally we now use repl_semisync_slave.get_slave_enabled() instead of rpl_semi_sync_slave_enabled. To check if semisync is active on should check the @@rpl_semi_sync_slave_status variable (as before). - The semisync protocol conflicts in the way that the original MySQL/MariaDB client-server protocol was designed (client-server send and reply packets are strictly ordered and includes a packet number to allow one to check if a packet is lost). When using semi-sync the master and slave can send packets at 'any time', so packet numbering does not work. The 'solution' has been that each communication starts with packet number 1, but in some cases there is still a chance that the packet number check can fail. Fixed by adding a flag (pkt_nr_can_be_reset) in the NET struct that one can use to signal that packet number checking should not be done. This is flag is set when semi-sync is used. - Added Master_info::semi_sync_reply_enabled to allow one to configure some slaves with semisync and other other slaves without semisync. Removed global variable semi_sync_need_reply that would not work with multi-master. - Repl_semi_sync_master::report_reply_packet() can now recognize the COM_QUIT packet from semisync slave and not give a "Read semi-sync reply magic number error" error for this case. The slave will be removed from the Ack listener. - On Windows, don't stop semisync Ack listener just because one slave connection is using socket_id > FD_SETSIZE. - Removed busy loop in Ack_receiver::run() by using "Self-pipe trick" to signal new slave and stop Ack_receiver. - Changed some Repl_semi_sync_slave functions that always returns 0 from int to void. - Added Repl_semi_sync_slave::slave_reconnect(). - Removed dummy_function Repl_semi_sync_slave::reset_slave(). - Removed some duplicate semisync notes from the error log. - Add test of "if (get_slave_enabled() && semi_sync_need_reply)" before calling Repl_semi_sync_slave::slave_reply(). (Speeds up the code as we can skip all initializations). - If epl_semisync_slave.slave_reply() fails, we disable semisync for that connection. - We do not call semisync.switch_off() if there are no active slaves. Instead we check in Repl_semi_sync_master::commit_trx() if there are no active threads. This simplices the code. - Changed assert() to DBUG_ASSERT() to ensure that the DBUG log is flushed in case of asserts. - Removed the internal rpl_semi_sync_slave_status as it is not needed anymore. The @@rpl_semi_sync_slave_status status variable is now mapped to rpl_semi_sync_enabled. - Removed rpl_semi_sync_slave_enabled as it is not needed anymore. Repl_semi_sync_slave::get_slave_enabled() contains the active status. - Added checking that we do not add a slave twice with Ack_receiver::add_slave(). This could happen with old code. - Removed Repl_semi_sync_master::check_and_switch() as it is not needed anymore. - Ensure that when we call Ack_receiver::remove_slave() that the slave is removed from the listener before function returns. - Call listener.listen_on_sockets() outside of mutex for better performance and less contested mutex. - Ensure that listening is ignoring newly added slaves when checking for responses. - Fixed the master ack_receiver listener is not killed if there are no connected slaves (and thus stop semisync handling of future connections). This could happen if all slaves sockets where would be marked as unreliable. - Added unlink() to base_ilist_iterator and remove() to I_List_iterator. This enables us to remove 'dead' slaves in Ack_recever::run(). - kill_zombie_dump_threads() now does killing of dump threads properly. - It can now kill several threads (should be impossible but could happen if IO slaves reconnects very fast). - We now wait until the dump thread is done before starting the dump. - Added an error if kill_zombie_dump_threads() fails. - Set thd->variables.server_id before calling kill_zombie_dump_threads(). This simplies the code. - Added a lot of comments both in code and tests. - Removed DBUG_EVALUATE_IF "failed_slave_start" as it is not used. Test changes: - rpl.rpl_session_var2 added which runs rpl.rpl_session_var test with semisync enabled. - Some timings changed slight with startup of slave which caused rpl_binlog_dump_slave_gtid_state_info.text to fail as it checked the error log file before the slave had started properly. Fixed by adding wait_for_pattern_in_file.inc that allows waiting for the pattern to appear in the log file. - Tests have been updated so that we first set rpl_semi_sync_master_enabled on the master and then set rpl_semi_sync_slave_enabled on the slaves (this is according to how the MariaDB documentation document how to setup semi-sync). - Error text "Master server does not have semi-sync enabled" has been replaced with "Master server does not support semi-sync" for the case when the master supports semi-sync but semi-sync is not enabled. Other things: - Some trivial cleanups in Repl_semi_sync_master::update_sync_header(). - We should in 11.3 changed the default value for rpl-semi-sync-master-wait-no-slave from TRUE to FALSE as the TRUE does not make much sense as default. The main difference with using FALSE is that we do not wait for semisync Ack if there are no slave threads. In the case of TRUE we wait once, which did not bring any notable benefits except slower startup of master configured for using semisync. Co-author: Brandon Nesterenko <brandon.nesterenko@mariadb.com> This solves the problem reported in MDEV-32960 where a new slave may not be registered in time and the master disables semi sync because of that.	2024-01-23 13:03:11 +02:00
Brandon Nesterenko	01ca57ec16	MDEV-32168: Postpush fix for rpl_domain_id_filter_master_crash While a replica may be reading events from the primary, the primary is killed. Left to its own devices, the IO thread may or may not stop in error, depending on what it is doing when its connection to the primary is killed (e.g. a failed read results in an error, whereas if the IO thread is idly waiting for events when the connection dies, it will enter into a reconnect loop and reconnect). MDEV-32168 changed the test to always wait for the reconnect, thus breaking the error case, as the IO thread would be stopped at a time of expecting it to be running. The fix is to manually stop/start the IO thread to ensure it is in a consistent state. Note that rpl_domain_id_filter_master_crash.test will need additional changes after fixing MDEV-33268 Reviewed By: ============ Kristian Nielsen <knielsen@knielsen-hq.org>	2024-01-22 07:30:52 -07:00
Marko Mäkelä	3a96eba25f	Merge 10.5 into 10.6	2024-01-17 13:35:05 +02:00
Alexander Barkov	fa3171df08	MDEV-27666 User variable not parsed as geometry variable in geometry function Adding GEOMETRY type user variables.	2024-01-16 18:53:23 +04:00
Marko Mäkelä	2b01e5103d	Merge 10.5 into 10.6	2023-12-19 18:41:42 +02:00
Marko Mäkelä	12995559f9	Merge 10.4 into 10.5	2023-12-19 18:30:58 +02:00
Kristian Nielsen	a204ce2788	MDEV-33045: Server crashes in Item_func_binlog_gtid_pos::val_str / Binary_string::c_ptr_safe Item::val_str() sets the Item::null_value flag, so call it before checking the flag, not after. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-12-19 12:08:53 +01:00

1 2 3 4 5 ...

4162 commits