mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-28 01:34:17 +01:00

Author	SHA1	Message	Date
Monty	25b5c63905	MDEV-33856: Alternative Replication Lag Representation via Received/Executed Master Binlog Event Timestamps This commit adds 3 new status variables to 'show all slaves status': - Master_last_event_time ; timestamp of the last event read from the master by the IO thread. - Slave_last_event_time ; Master timestamp of the last event committed on the slave. - Master_Slave_time_diff: The difference of the above two timestamps. All the above variables are NULL until the slave has started and the slave has read one query event from the master that changes data. - Added information_schema.slave_status, which allows us to remove: - show_master_info(), show_master_info_get_fields(), send_show_master_info_data(), show_all_master_info() - class Sql_cmd_show_slave_status. - Protocol::store(I_List<i_string_pair>* str_list) as it is not used anymore. - Changed old SHOW SLAVE STATUS and SHOW ALL SLAVES STATUS to use the SELECT code path, as all other SHOW ... STATUS commands. Other things: - Xid_log_time is set to time of commit to allow slave that reads the binary log to calculate Master_last_event_time and Slave_last_event_time. This is needed as there is not 'exec_time' for row events. - Fixed that Load_log_event calculates exec_time identically to Query_event. - Updated RESET SLAVE to reset Master/Slave_last_event_time - Updated SQL thread's update on first transaction read-in to only update Slave_last_event_time on group events. - Fixed possible (unlikely) bugs in sql_show.cc ...old_format() functions if allocation of 'field' would fail. Reviewed By: Brandon Nesterenko <brandon.nesterenko@mariadb.com> Kristian Nielsen <knielsen@knielsen-hq.org>	2024-07-25 08:57:27 -06:00
Vladislav Vaintroub	186a1afe63	MDEV-32537 Due to a Linux limit, restrict thread names to 15 characters, also in PS. Rename some threads to work around this restriction, e.g. "rpl_parallel_thread" -> "rpl_parallel", "slave_background" -> "slave_bg", etc.	2024-07-09 13:20:49 +02:00
Vladislav Vaintroub	5bd0516488	MDEV-32537 Name threads to improve debugging experience and diagnostics. Use SetThreadDescription/pthread_setname_np to give threads a name.	2024-07-09 13:17:20 +02:00
Vladislav Vaintroub	584fc85e21	MDEV-32537 Name threads to improve debugging experience and diagnostics. Use SetThreadDescription/pthread_setname_np to give threads a name.	2024-07-09 13:17:20 +02:00
Alexander Barkov	8f4ec79d09	Merge remote-tracking branch 'origin/11.4' into 11.5	2024-07-08 12:25:04 +04:00
Marko Mäkelä	22ba7e4ff8	Merge 10.6 into 10.11	2024-05-30 16:04:00 +03:00
Marko Mäkelä	5ba542e9ee	Merge 10.5 into 10.6	2024-05-30 14:27:07 +03:00
Monty	2464ee758a	MDEV-33655 Remove alter_algorithm Remove alter_algorithm but keep the variable as no-op (with a warning). The reasons for removing alter_algorithm are: - alter_algorithm was introduced as a replacement for the old_alter_table that was used to force the usage of the original alter table algorithm (copy) in the cases where the new alter algorithm did not work. The new option was added as a way to force the usage of a specific algorithm when it should instead have made it possible to disable algorithms that would not work for some reason. - alter_algorithm introduced some cases where ALTER TABLE would not work without specifying the ALGORITHM=XXX option together with ALTER TABLE. - Having different values of alter_algorithm on master and slave could cause the slave to stop unexpectedly. - ALTER TABLE FORCE, as used by mariadb-upgrade, would not always work if alter_algorithm was set for the server. - As part of the MDEV-33449 "improving repair of tables" it became clear that alter_algorithm made it harder to provide a better and more consistent ALTER TABLE FORCE and REPAIR TABLE and it would be better to remove it.	2024-05-27 12:39:03 +02:00
Monty	dfdedd46e4	MDEV-32188 make TIMESTAMP use whole 32-bit unsigned range This patch extends the maximum TIMESTAMP value from 2038-01-19 03:14:07.999999 to 2106-02-07 06:28:15.999999 for 64 bit hardware and OS where 'long' is 64 bits. This is true for 64 bit Linux but not for Windows. This is done by treating the 32 bit stored int as unsigned instead of signed. This is safe as MariaDB has never accepted dates before the epoch (1970). The benefit of this approach is that for normal timestamps the storage is compatible with earlier versions. However, for tables using system versioning we previously stored a timestamp with the year 2038 as the 'max timestamp', which is used to detect current values. This patch stores the new 2106 year max value as the max timestamp. This means that old tables using system versioning need to be updated with mariadb-upgrade when moving them to 11.4. That will be done in a separate commit.	2024-05-27 12:39:02 +02:00
Robin Newhouse	dc38d8ea80	Minimize unsafe C functions with safe_strcpy() Similar to #2480. `567b681` introduced safe_strcpy() to minimize the use of C with potentially unsafe memory overflow with strcpy() whose use is discouraged. Replace instances of strcpy() with safe_strcpy() where possible, limited here to files in the `sql/` directory. All new code of the whole pull request, including one or several files that are either new files or modified ones, are contributed under the BSD-new license. I am contributing on behalf of my employer Amazon Web Services, Inc.	2024-05-17 13:33:16 +01:00
Kristian Nielsen	383ee364dc	Merge 10.6 to 10.11	2024-05-07 08:45:31 +02:00
Kristian Nielsen	596921dab8	MDEV-34042: Deadlock kill of XA PREPARE can break replication / rpl.rpl_parallel_multi_domain_xa sporadic failure Clear any pending deadlock kill after completing XA PREPARE, and before updating the mysql.gtid_slave_pos table in a separate transaction. Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-05-02 21:07:51 +02:00
Sergei Golubchik	018d537ec1	Merge branch '10.6' into 10.11	2024-04-22 15:23:10 +02:00
Marko Mäkelä	829cb1a49c	Merge 10.5 into 10.6	2024-04-17 14:14:58 +03:00
Kristian Nielsen	16aa4b5f59	Merge from 10.4 to 10.5 Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-04-15 17:46:49 +02:00
Kristian Nielsen	d90a2b44ad	MDEV-33668: More precise dependency tracking of XA XID in parallel replication Keep track of each recently active XID, recording which worker it was queued on. If an XID might still be active, choose the same worker to queue event groups that refer to the same XID to avoid conflicts. Otherwise, schedule the XID freely in the next round-robin slot. This way, XA PREPARE can normally be scheduled without restrictions (unless duplicate XID transactions come close together). This improves scheduling and parallelism over the old method, where the worker thread to schedule XA PREPARE on was fixed based on a hash value of the XID. XA COMMIT will normally be scheduled on the same worker as XA PREPARE, but can be a different one if the XA PREPARE is far back in the event history. Testcase and code for trimming dynamic array due to Andrei. Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-04-09 11:42:34 +03:00
Kristian Nielsen	f9ecaa87ce	MDEV-33668: Refactor parallel replication round-robin scheduling to use explicit FIFO This is a preparatory patch to facilitate the next commit to improve the scheduling of XA transactions in parallel replication. When choosing the scheduling bucket for the next event group in rpl_parallel_entry::choose_thread(), use an explicit FIFO for the round-robin selection instead of a simple cyclic counter i := (i+1) % N. This allows to schedule XA COMMIT/ROLLBACK dependencies explicitly without changing the round-robin scheduling of other event groups. Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-04-09 11:42:34 +03:00
Marko Mäkelä	788953463d	Merge 10.6 into 10.11 Some fixes related to commit `f838b2d799` and Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row() for system-versioned tables were provided by Nikita Malyavin. This was required by test versioning.rpl,trx_id,row.	2024-03-28 09:16:57 +02:00
Kristian Nielsen	0a6f46965a	MDEV-33475: --gtid-ignore-duplicate can double-apply event in case of parallel replication retry When rolling back and retrying a transaction in parallel replication, don't release the domain ownership (for --gtid-ignore-duplicates) as part of the rollback. Otherwise another master connection could grab the ownership and double-apply the transaction in parallel with the retry. Reviewed-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2024-03-13 16:59:10 +01:00
Monty	9a132d423a	MDEV-33620 Improve times and states in show processlist for replication This makes it easier to find out what replication workers are doing and what they are waiting for. Things changed in processlist: - Slave_SQL time was not consistent. Now time for state "Slave has read all relay log; waiting for more updates" shows how long it has waited for getting the next event. - Slave_worker threads did often show "Closing tables" for a long time. Now the state is reverted to the previous state after "Closing tables" is done. - Commit and Rollback states were not shown for replication (and some other threads). Now Commit and Rollback states are always shown and the state is reverted to the previous state when the Commit/Rollback has finished. Code changes: - Added thd->set_time_for_next_stage() for parallel replication when starting to wait for prior transactions to commit, group commit, and FTWRL and for free space in thread pool. Before we reset the time only after the above events. - Moved THD_STAGE_INFO(stage_rollback) and THD_STAGE_INFO(stage_commit) from sql_parse.cc to transaction.cc to ensure this is done for all commits and not only 'normal connection queries'. Test case changes: - close_thread_tables() reverting stage to previous stage caused the counter in performance_schema to be increased. In many cases it is the 'sql/starting' stage that was affected. - We only change to "Commit" stage if there is a need for a commit. This caused some "Commit" stages to disappear from perfschema reports. TODO in 11.#: - Slave_IO always shows "Waiting for master to send event" and the time is from SLAVE START. We should in 11.# change this to be the time since reading the last event.	2024-03-08 15:23:17 +02:00
Marko Mäkelä	2b99e5f7ef	Merge 10.6 into 10.11	2023-12-20 15:58:36 +02:00
Marko Mäkelä	2b01e5103d	Merge 10.5 into 10.6	2023-12-19 18:41:42 +02:00
Marko Mäkelä	12995559f9	Merge 10.4 into 10.5	2023-12-19 18:30:58 +02:00
Kristian Nielsen	eaa4968fc5	MDEV-10653: Fix segfault in SHOW MASTER STATUS with NULL inuse_relaylog The previous patch for MDEV-10653 changes the rpl_parallel::workers_idle() function to use Relay_log_info::last_inuse_relaylog to check for idle workers. But the code was missing a NULL check. Also, there was one place during SQL slave thread start which was missing mutex synchronisation when updating inuse_relaylog. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-12-19 12:08:54 +01:00
Kristian Nielsen	1cbba45e6e	Attempt to fix rare race in test for MDEV-8031 The error-injection inject_mdev8031 simulates a deadlock kill in a specific place, by setting killed_for_retry to RETRY_KILL_KILLED directly. If a real deadlock kill triggers at the same time, it is possible for the thread to complete its transaction retry and set rgi_slave to NULL before the real deadlock kill can complete in the background. This will cause a segfault due to null-pointer access. Fix by changing the error injection to do a real background deadlock kill, which ensures that the thread will wait for any pending background kills to complete. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-12-19 12:08:53 +01:00
Marko Mäkelä	4ae105a37d	Merge 10.4 into 10.5	2023-12-18 08:59:07 +02:00
Brandon Nesterenko	8dad51481b	MDEV-10653: SHOW SLAVE STATUS Can Deadlock an Errored Slave AKA rpl.rpl_parallel, binlog_encryption.rpl_parallel fails in buildbot with timeout in include A replication parallel worker thread can deadlock with another connection running SHOW SLAVE STATUS. That is, if the replication worker thread is in do_gco_wait() and is killed, it will already hold the LOCK_parallel_entry, and during error reporting, try to grab the err_lock. SHOW SLAVE STATUS, however, grabs these locks in reverse order. It will initially grab the err_lock, and then try to grab LOCK_parallel_entry. This leads to a deadlock when both threads have grabbed their first lock without the second. This patch implements the MDEV-31894 proposed fix to optimize the workers_idle() check to compare the last in-use relay log’s queued_count==dequeued_count for idleness. This removes the need for workers_idle() to grab LOCK_parallel_entry, as these values are atomically updated. Huge thanks to Kristian Nielsen for diagnosing the problem! Reviewed By: ============ Kristian Nielsen <knielsen@knielsen-hq.org> Andrei Elkin <andrei.elkin@mariadb.com>	2023-12-11 07:45:23 -07:00
Marko Mäkelä	d5e15424d8	Merge 10.6 into 10.10 The MDEV-29693 conflict resolution is from Monty, as well as is a bug fix where ANALYZE TABLE wrongly built histograms for single-column PRIMARY KEY. Also includes a fix for safe_malloc error reporting. Other things: - Copied main.log_slow from 10.4 to avoid mtr issue Disabled test: - spider/bugfix.mdev_27239 because we started to get +Error 1429 Unable to connect to foreign data source: localhost -Error 1158 Got an error reading communication packets - main.delayed - Bug#54332 Deadlock with two connections doing LOCK TABLE+INSERT DELAYED This part is disabled for now as it fails randomly with different warnings/errors (no corruption).	2023-10-14 13:36:11 +03:00
Marko Mäkelä	0f9acce3f2	Merge 10.5 into 10.6	2023-09-14 09:01:15 +03:00
sjaakola	a3cbc44b24	MDEV-31833 replication breaks when using optimistic replication and replica is a galera node MariaDB async replication SQL thread was stopped for any failure in applying of replication events and error message logged for the failure was: "Node has dropped from cluster". The assumption was that event applying failure is always due to node dropping out. With optimistic parallel replication, event applying can fail for natural reasons and applying should be retried to handle the failure. This retry logic was never exercised because the slave SQL thread was stopped with first applying failure. To support optimistic parallel replication retrying logic this commit will now skip replication slave abort, if node remains in cluster (wsrep_ready==ON) and replication is configured for optimistic or aggressive retry logic. During the development of this fix, galera.galera_as_slave_nonprim test showed some problems. The test was analyzed, and it appears to need some attention. One excessive sleep command was removed in this commit, but it will need more fixes still to be fully deterministic. After this commit galera_as_slave_nonprim is successful, though. Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>	2023-09-12 02:37:30 +02:00
Marko Mäkelä	0dd25f28f7	Merge 10.5 into 10.6	2023-09-11 14:46:39 +03:00
Marko Mäkelä	f8f7d9de2c	Merge 10.4 into 10.5	2023-09-11 11:29:31 +03:00
Kristian Nielsen	e937a64d46	MDEV-10356: rpl.rpl_parallel_temptable failure due to incorrect commit optimization of temptables The problem was that parallel replication of temporary tables using statement-based binlogging could overlap the COMMIT in one thread with a DML or DROP TEMPORARY TABLE in another thread using the same temporary table. Temporary tables are not safe for concurrent access, so this caused reference to freed memory and possibly other nastiness. The fix is to disable the optimisation with overlapping commits of one transaction with the start of a later transaction, when temporary tables are in use. Then the following event groups will be blocked from starting until the one using temporary tables is completed. This also fixes occasional test failures of rpl.rpl_parallel_temptable seen in Buildbot. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-09-07 14:40:05 +02:00
Marko Mäkelä	448c2077fb	Merge 10.5 into 10.6	2023-08-21 15:50:31 +03:00
Marko Mäkelä	5895a3622b	Merge 10.4 into 10.5	2023-08-17 10:33:36 +03:00
Marko Mäkelä	9cd2989589	Merge 10.6 into 10.10	2023-08-16 15:28:42 +03:00
Kristian Nielsen	34e8585437	MDEV-29974: Missed kill waiting for worker queues to drain When the SQL driver thread goes to wait for room in the parallel slave worker queue, there was a race where a kill at the right moment could be ignored and the wait proceed uninterrupted by the kill. Fix by moving the THD::check_killed() to occur _after_ doing ENTER_COND(). This bug was seen as sporadic failure of the testcase rpl.rpl_parallel (rpl.rpl_parallel_gco_wait_kill since 10.5), with "Slave stopped with wrong error code". Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-16 14:07:06 +02:00
Kristian Nielsen	7c9837ce74	Merge 10.4 into 10.5 Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-15 18:02:18 +02:00
Kristian Nielsen	18acbaf416	MDEV-31655: Parallel replication deadlock victim preference code errorneously removed Restore code to make InnoDB choose the second transaction as a deadlock victim if two transactions deadlock that need to commit in-order for parallel replication. This code was erroneously removed when VATS was implemented in InnoDB. Also add a test case for InnoDB choosing the right deadlock victim. Also fixes this bug, with testcase that reliably reproduces: MDEV-28776: rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master Reviewed-by: Marko Mäkelä <marko.makela@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-15 16:39:49 +02:00
Kristian Nielsen	900c4d6920	MDEV-31655: Parallel replication deadlock victim preference code errorneously removed Restore code to make InnoDB choose the second transaction as a deadlock victim if two transactions deadlock that need to commit in-order for parallel replication. This code was erroneously removed when VATS was implemented in InnoDB. Also add a test case for InnoDB choosing the right deadlock victim. Also fixes this bug, with testcase that reliably reproduces: MDEV-28776: rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master Note: This should be null-merged to 10.6, as a different fix is needed there due to InnoDB locking code changes. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-08-15 16:35:30 +02:00
Oleksandr Byelkin	34a8e78581	Merge branch '10.6' into 10.9	2023-08-04 08:01:06 +02:00
Oleksandr Byelkin	6bf8483cac	Merge branch '10.5' into 10.6	2023-08-01 15:08:52 +02:00
Oleksandr Byelkin	7564be1352	Merge branch '10.4' into 10.5	2023-07-26 16:02:57 +02:00
Oleksandr Byelkin	f52954ef42	Merge commit '10.4' into 10.5	2023-07-20 11:54:52 +02:00
Kristian Nielsen	08585b0949	MDEV-31509: Lost data with FTWRL and STOP SLAVE The largest_started_sub_id needs to be set under LOCK_parallel_entry together with testing stop_sub_id. However, in-between was the logic for do_ftwrl_wait(), which temporarily releases the mutex. This could lead to inconsistent stopping amongst worker threads and lost data. Fix by moving all the stop-related logic out from unrelated do_gco_wait() and do_ftwrl_wait() and into its own function do_stop_handling(). Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-07-12 09:41:32 +02:00
Kristian Nielsen	5d61442c85	MDEV-31448: Killing a replica thread awaiting its GCO can hang/crash a parallel replica The problem is that when a worker thread is (user) killed in wait_for_prior_commit, the event group may complete out-of-order since the wait for prior commit was aborted by the kill. This fix ensures that event groups will always complete in-order, even in the error case. This is done in finish_event_group() by doing an extra wait_for_prior_commit(), if necessary, that ignores kills. This fix supersedes the fix for MDEV-30780, so the earlier fix for that is reverted in this patch. Also fix that an error from wait_for_prior_commit() inside finish_event_group() would not signal the error to wakeup_subsequent_commits(). Based on earlier work by Brandon Nesterenko and Andrei Elkin, with some changes to simplify the semantics of wait_for_prior_commit() and make the code more robust to future changes. Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-07-12 09:41:32 +02:00
Kristian Nielsen	a8ea6627a4	MDEV-31448: Killing a replica thread awaiting its GCO can hang/crash a parallel replica The problem was an incorrect unmark_start_commit() in signal_error_to_sql_driver_thread(). If an event group gets an error, this unmark could run after the following GCO started, and the subsequent re-marking could access de-allocated GCO. The offending unmark_start_commit() looks obviously incorrect, and the fix is to just remove it. It was introduced in the MDEV-8302 patch, the commit message of which suggests it was added there solely to satisfy an assertion in ha_rollback_trans(). So update this assertion instead to not trigger for event groups that experienced an error (rgi->worker_error). When an error occurs in an event group, all following event groups are skipped anyway, so the unmark should never be needed in this case. Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-07-12 09:41:32 +02:00
Kristian Nielsen	60bec1d54d	MDEV-13915: STOP SLAVE takes very long time on a busy system At STOP SLAVE, worker threads will continue applying event groups until the end of the current GCO before stopping. This is a left-over from when only conservative mode was available. In optimistic and aggressive mode, often _all_ queued event will be in the same GCO, and slave stop will be needlessly delayed. This patch instead records at STOP SLAVE time the latest (highest sub_id) event group that has started. Then worker threads will continue to apply event groups up to that event group, but skip any following. The result is that each worker thread will complete its currently running event group, and then the slave will stop. If the slave is caught up, and STOP SLAVE is run in the middle of an event group that is already executing in a worker thread, then that event group will be rolled back and the slave stop immediately, as normal. Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-07-12 09:41:32 +02:00
Brandon Nesterenko	8ed88e3455	Revert "MDEV-13915: STOP SLAVE takes very long time on a busy system" This reverts commit `0a99d457b3` because it should go into only 10.5+	2023-06-06 08:11:38 -06:00
Brandon Nesterenko	0a99d457b3	MDEV-13915: STOP SLAVE takes very long time on a busy system The problem is that a parallel replica would not immediately stop running/queued transactions when issued STOP SLAVE. That is, it allowed the current group of transactions to run, and sometimes the transactions which belong to the next group could be started and run through commit after STOP SLAVE was issued too, if the last group had started committing. This would lead to long periods to wait for all waiting transactions to finish. This patch updates a parallel replica to try and abort immediately and roll-back any ongoing transactions. The exception to this is any transactions which are non-transactional (e.g. those modifying sequences or non-transactional tables), and any prior transactions, will be run to completion. The specifics are as follows: 1. A new stage was added to SHOW PROCESSLIST output for the SQL Thread when it is waiting for a replica thread to either rollback or finish its transaction before stopping. This stage presents as “Waiting for worker thread to stop” 2. Worker threads which error or are killed no longer perform GCO cleanup if there is a concurrently running prior transaction. This is because a worker thread scheduled to run in a future GCO could be killed and incorrectly perform cleanup of the active GCO. 3. Refined cases when the FL_TRANSACTIONAL flag is added to GTID binlog events to disallow adding it to transactions which modify both transactional and non-transactional engines when the binlogging configuration allow the modifications to exist in the same event, i.e. when using binlog_direct_non_trans_update == 0 and binlog_format == statement. 4. A few existing MTR tests relied on the completion of certain transactions after issuing STOP SLAVE, and were re-recorded (potentially with added synchronizations) under the new rollback behavior. Reviewed By =========== Andrei Elkin <andrei.elkin@mariadb.com>	2023-06-05 10:03:06 -06:00
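
The MDEV-32188 entry above (dfdedd46e4) quotes two limits for TIMESTAMP: 2038-01-19 03:14:07 when the stored 32-bit value is read as signed, and 2106-02-07 06:28:15 when it is read as unsigned. The following is a small illustrative sketch of that arithmetic only, not MariaDB code; the helper name max_timestamp is hypothetical, and the results are in UTC with the fractional-second part ignored.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch in UTC; TIMESTAMP values are counted in seconds from here.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def max_timestamp(bits: int, signed: bool) -> datetime:
    """Return the latest whole second representable in `bits` bits.

    Hypothetical helper for illustration; it only mirrors the arithmetic
    described in the commit message, not the server implementation.
    """
    max_seconds = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return EPOCH + timedelta(seconds=max_seconds)


signed_max = max_timestamp(32, signed=True)
unsigned_max = max_timestamp(32, signed=False)

print(signed_max.isoformat())    # 2038-01-19T03:14:07+00:00
print(unsigned_max.isoformat())  # 2106-02-07T06:28:15+00:00
```

Reading the same 32 stored bits as unsigned doubles the range without changing the on-disk size, which is consistent with the commit's note that storage stays compatible for normal timestamps.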


295 commits