mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-16 20:12:31 +01:00

Author	SHA1	Message	Date
Oleksandr Byelkin	6bf8483cac	Merge branch '10.5' into 10.6	2023-08-01 15:08:52 +02:00
Oleksandr Byelkin	7564be1352	Merge branch '10.4' into 10.5	2023-07-26 16:02:57 +02:00
Oleksandr Byelkin	f52954ef42	Merge commit '10.4' into 10.5	2023-07-20 11:54:52 +02:00
Kristian Nielsen	08585b0949	MDEV-31509: Lost data with FTWRL and STOP SLAVE The largest_started_sub_id needs to be set under LOCK_parallel_entry together with testing stop_sub_id. However, in-between was the logic for do_ftwrl_wait(), which temporarily releases the mutex. This could lead to inconsistent stopping amongst worker threads and lost data. Fix by moving all the stop-related logic out from unrelated do_gco_wait() and do_ftwrl_wait() and into its own function do_stop_handling(). Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-07-12 09:41:32 +02:00
Kristian Nielsen	5d61442c85	MDEV-31448: Killing a replica thread awaiting its GCO can hang/crash a parallel replica The problem is that when a worker thread is (user) killed in wait_for_prior_commit, the event group may complete out-of-order since the wait for prior commit was aborted by the kill. This fix ensures that event groups will always complete in-order, even in the error case. This is done in finish_event_group() by doing an extra wait_for_prior_commit(), if necessary, that ignores kills. This fix supersedes the fix for MDEV-30780, so the earlier fix for that is reverted in this patch. Also fix that an error from wait_for_prior_commit() inside finish_event_group() would not signal the error to wakeup_subsequent_commits(). Based on earlier work by Brandon Nesterenko and Andrei Elkin, with some changes to simplify the semantics of wait_for_prior_commit() and make the code more robust to future changes. Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-07-12 09:41:32 +02:00
Kristian Nielsen	a8ea6627a4	MDEV-31448: Killing a replica thread awaiting its GCO can hang/crash a parallel replica The problem was an incorrect unmark_start_commit() in signal_error_to_sql_driver_thread(). If an event group gets an error, this unmark could run after the following GCO started, and the subsequent re-marking could access de-allocated GCO. The offending unmark_start_commit() looks obviously incorrect, and the fix is to just remove it. It was introduced in the MDEV-8302 patch, the commit message of which suggests it was added there solely to satisfy an assertion in ha_rollback_trans(). So update this assertion instead to not trigger for event groups that experienced an error (rgi->worker_error). When an error occurs in an event group, all following event groups are skipped anyway, so the unmark should never be needed in this case. Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-07-12 09:41:32 +02:00
Kristian Nielsen	60bec1d54d	MDEV-13915: STOP SLAVE takes very long time on a busy system At STOP SLAVE, worker threads will continue applying event groups until the end of the current GCO before stopping. This is a left-over from when only conservative mode was available. In optimistic and aggressive mode, often _all_ queued events will be in the same GCO, and slave stop will be needlessly delayed. This patch instead records at STOP SLAVE time the latest (highest sub_id) event group that has started. Then worker threads will continue to apply event groups up to that event group, but skip any following. The result is that each worker thread will complete its currently running event group, and then the slave will stop. If the slave is caught up, and STOP SLAVE is run in the middle of an event group that is already executing in a worker thread, then that event group will be rolled back and the slave stops immediately, as normal. Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com> Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>	2023-07-12 09:41:32 +02:00
Brandon Nesterenko	8ed88e3455	Revert "MDEV-13915: STOP SLAVE takes very long time on a busy system" This reverts commit `0a99d457b3` because it should go into only 10.5+	2023-06-06 08:11:38 -06:00
Brandon Nesterenko	0a99d457b3	MDEV-13915: STOP SLAVE takes very long time on a busy system The problem is that a parallel replica would not immediately stop running/queued transactions when issued STOP SLAVE. That is, it allowed the current group of transactions to run, and sometimes the transactions which belong to the next group could be started and run through commit after STOP SLAVE was issued too, if the last group had started committing. This would lead to long periods to wait for all waiting transactions to finish. This patch updates a parallel replica to try and abort immediately and roll-back any ongoing transactions. The exception to this is any transactions which are non-transactional (e.g. those modifying sequences or non-transactional tables), and any prior transactions, will be run to completion. The specifics are as follows: 1. A new stage was added to SHOW PROCESSLIST output for the SQL Thread when it is waiting for a replica thread to either rollback or finish its transaction before stopping. This stage presents as “Waiting for worker thread to stop” 2. Worker threads which error or are killed no longer perform GCO cleanup if there is a concurrently running prior transaction. This is because a worker thread scheduled to run in a future GCO could be killed and incorrectly perform cleanup of the active GCO. 3. Refined cases when the FL_TRANSACTIONAL flag is added to GTID binlog events to disallow adding it to transactions which modify both transactional and non-transactional engines when the binlogging configuration allows the modifications to exist in the same event, i.e. when using binlog_direct_non_trans_update == 0 and binlog_format == statement. 4. A few existing MTR tests relied on the completion of certain transactions after issuing STOP SLAVE, and were re-recorded (potentially with added synchronizations) under the new rollback behavior. Reviewed By =========== Andrei Elkin <andrei.elkin@mariadb.com>	2023-06-05 10:03:06 -06:00
Oleksandr Byelkin	de703a2b21	Merge branch '10.4' into 10.4.29 release	2023-05-11 09:07:45 +02:00
Oleksandr Byelkin	043d69bbcc	Merge branch '10.5' into 10.6	2023-05-03 09:51:25 +02:00
Monty	4f7317579e	Fixed "Trying to lock uninitialized mutex" in parallel replication The problem was that mutex_init() was called after the worker was put into the domain_hash, which allowed other threads to access it before the mutex was initialized.	2023-05-02 23:43:07 +03:00
Oleksandr Byelkin	edf8ce5b97	Merge branch 'bb-10.4-release' into bb-10.5-release	2023-05-02 13:54:54 +02:00
Oleksandr Byelkin	edd0b03e60	Merge branch '10.3' into 10.4	2023-05-02 10:09:27 +02:00
Andrei	55a53949be	MDEV-29621: Replica stopped by locks on sequence When using binlog_row_image=FULL with sequence table inserts, a replica can deadlock because it treats full inserts in a sequence as DDL statements by getting an exclusive lock on the sequence table. It has been observed that with parallel replication, this exclusive lock on the sequence table can lead to a deadlock where one transaction has the exclusive lock and is waiting on a prior transaction to commit, whereas this prior transaction is waiting on the MDL lock. The fix for this is on the master side, to raise the FL_DDL flag on the GTID of a full binlog_row_image write of a sequence table. This forces the slave to execute the statement serially so a deadlock cannot happen. A test verifies the deadlock also to prove it happens on the OLD (pre-fixes) slave. OLD (buggy master) -replication-> NEW (fixed slave) is provided. As the pre-fixes master's full row-image may represent both SELECT NEXT VALUE and INSERT, the parallel slave pessimistically waits for the prior transaction to have committed before taking on the critical part of the second (like INSERT in the test) event execution. The waiting exploits a parallel slave's retry mechanism which is controlled by `@@global.slave_transaction_retries`. Note that in order to avoid any persistent 'Deadlock found' 2013 error in OLD -> NEW, `slave_transaction_retries` may need to be set to a value higher than the default. START-SLAVE is an effective work-around if this still happens.	2023-04-27 21:55:45 +03:00
Marko Mäkelä	bb1d1dc846	Merge 10.5 into 10.6	2023-04-27 09:48:27 +03:00
Marko Mäkelä	902c622215	Merge 10.4 into 10.5	2023-04-27 09:39:53 +03:00
Andrei	e22a57da82	MDEV-30620 Trying to lock uninitialized LOCK_parallel_entry The error was seen by a number of mtr tests being caused by overdue initialization of rpl_parallel::LOCK_parallel_entry. Specifically, SHOW-SLAVE-STATUS might find in rpl_parallel::workers_idle() a gtid domain hash entry already inserted whose mutex had not yet been initialized with mysql_mutex_init(). Fixed by swapping the mutex init and its entry's stack insertion. Tested with a generous number of `mtr --repeat` of a few of the tests reported to fail, including rpl.parallel_backup.	2023-04-25 19:43:04 +03:00
Marko Mäkelä	5bada1246d	Merge 10.5 into 10.6	2023-04-11 16:15:19 +03:00
Oleksandr Byelkin	ac5a534a4c	Merge remote-tracking branch '10.4' into 10.5	2023-03-31 21:32:41 +02:00
Andrei	216d99bb39	MDEV-26071: rpl.rpl_perfschema_applier_status_by_worker failed in bb … …with: Test assertion failed Problem: ======= Assertion text: 'Value returned by SSS and PS table for Last_Error_Number should be same.' Assertion condition: '"1146" = "0"' Assertion condition, interpolated: '"1146" = "0"' Assertion result: '0' Analysis: ======== In parallel replication, when the slave is started the worker pool gets activated, and it gets cleared when the slave stops. Each time the worker pool gets activated, a backup worker pool also gets created to store worker-specific performance schema information in case of errors. On error, all relevant information is copied from rpl_parallel_thread to rli and it gets cleared from the thread. Then the server waits for all workers to complete their work; during this stage the performance schema table's worker-specific info is stored into the backup pool, and finally the actual pool gets cleared. If users query the performance schema table to know the status of workers, the information from the backup pool will be used. The test simulates an ER_NO_SUCH_TABLE error and verifies the worker information in the pfs table. The test works fine if execution occurs in the following order. Step 1. Error occurred 'worker information is copied to backup pool'. Step 2. handle_slave_sql invokes 'rpl_parallel_resize_pool_if_no_slaves' to deactivate worker pool, it marks the pool->count=0 Step 3. PFS table is queried, since actual pool is deactivated backup pool information is read. If Step 3 happens prior to Step 2, the pool is yet to be deactivated and the actual pool is read, which doesn't have any error details as they were cleared. Hence the test occasionally fails. Fix: === Upon error, mark the backup pool as being active, so that if the PFS table is queried, the backup pool's information will be read because it is flagged as valid; if it is not flagged, the regular pool will be read. This work is one of the last pieces created by the late Sujatha Sivakumar.	2023-03-24 15:56:24 +02:00
Andrei	d4339620be	MDEV-30780 optimistic parallel slave hangs after hit an error The hang could be seen as show slave status displaying an error like Last_Error: Could not execute Write_rows_v1 along with Slave_SQL_Running: Yes accompanied with one of the replication threads in show-processlist characteristically having status like 2394 | system user | | NULL | Slave_worker | 50852 | closing tables It turns out that closing tables worker got entrapped in endless looping in mark_start_commit_inner() across already garbage-collected gco items. The reclaimed gco links are explained with actually possible out-of-order groups of events termination due to the Last_Error. This patch reinforces the correct ordering to perform finish_event_group's cleanup actions, incl unlinking gco:s from the active list.	2023-03-16 18:55:19 +02:00
Oleksandr Byelkin	c3a5cf2b5b	Merge branch '10.5' into 10.6	2023-01-31 09:31:42 +01:00
Oleksandr Byelkin	a977054ee0	Merge branch '10.3' into 10.4	2023-01-28 18:22:55 +01:00
Oleksandr Byelkin	7fa02f5c0b	Merge branch '10.4' into 10.5	2023-01-27 13:54:14 +01:00
Oleksandr Byelkin	dd24fa3063	Merge branch '10.3' into 10.4	2023-01-26 10:34:26 +01:00
Brandon Nesterenko	d69e835787	MDEV-29639: Seconds_Behind_Master is incorrect for Delayed, Parallel Replicas Problem ======== On a parallel, delayed replica, Seconds_Behind_Master will not be calculated until after MASTER_DELAY seconds have passed and the event has finished executing, resulting in potentially very large values of Seconds_Behind_Master (which could be much larger than the MASTER_DELAY parameter) for the entire duration the event is delayed. This contradicts the documented MASTER_DELAY behavior, which specifies how many seconds to withhold replicated events from execution. Solution ======== After a parallel replica idles, the first event after idling should immediately update last_master_timestamp with the time that it began execution on the primary. Reviewed By =========== Andrei Elkin <andrei.elkin@mariadb.com>	2023-01-24 08:11:35 -07:00
Marko Mäkelä	829e8111c7	Merge 10.5 into 10.6	2022-09-26 14:34:43 +03:00
Marko Mäkelä	6286a05d80	Merge 10.4 into 10.5	2022-09-26 13:34:38 +03:00
Marko Mäkelä	3c92050d1c	Fix build without either ENABLED_DEBUG_SYNC or DBUG_OFF There are separate flags DBUG_OFF for disabling the DBUG facility and ENABLED_DEBUG_SYNC for enabling the DEBUG_SYNC facility. Let us allow debug builds without DEBUG_SYNC. Note: For CMAKE_BUILD_TYPE=Debug, CMakeLists.txt will continue to define ENABLED_DEBUG_SYNC.	2022-09-23 17:37:52 +03:00
Marko Mäkelä	ca3f497564	Merge 10.2 into 10.3, except MDEV-25682	2021-05-18 08:40:19 +03:00
Sachin Kumar	355dc74b76	MDEV-22370 safe_mutex: Trying to lock uninitialized mutex at /data/src/10.4-bug/sql/rpl_parallel.cc, line 470 upon shutdown during FTWRL Problem:- When we issue FTWRL with shutdown in parallel, there is a race between FTWRL and shutdown. Shutdown might destroy the mutex (pool->LOCK_rpl_thread_pool) before FTWRL can lock it, so the FTWRL thread can crash. Solution:- mysql_mutex_destroy(pool->LOCK_rpl_thread_pool) should wait for the FTWRL thread to complete its work, and then destroy. So slave_prepare_for_shutdown will just deactivate the pool, and the mutex is destroyed later in end_slave()	2021-05-14 11:49:46 +01:00
Sujatha	f9bd7f2012	MDEV-20220: Merge 5.7 P_S replication table 'replication_applier_status_by_worker' Step 3: ====== Preserve worker pool information on either STOP SLAVE/Error. In case STOP SLAVE is executed, the worker threads will be gone and hence unavailable. Querying the table at this stage will give empty rows. To address this case when worker threads are about to stop, due to an error or forced stop, create a backup pool and preserve the data which is relevant to populate the performance schema table. Clear the backup pool upon slave start.	2021-04-08 17:19:51 +05:30
Sujatha	036ee61246	MDEV-20220: Merge 5.7 P_S replication table 'replication_applier_status_by_worker' Step2: ===== Add two extra columns mentioned below. Columns (name: description): WORKER_IDLE_TIME: Total idle time in seconds that the worker thread has spent waiting for work from co-ordinator thread; LAST_TRANS_RETRY_COUNT: Total number of retries attempted by last transaction.	2021-04-08 17:19:51 +05:30
Sujatha	94f1d0f84d	MDEV-20220: Merge 5.7 P_S replication table 'replication_applier_status_by_worker' Step1: ===== Backport 'replication_applier_status_by_worker' from upstream. Iterate through rpl_parallel_thread_pool and display slave worker thread specific information as part of 'replication_applier_status_by_worker' table. Columns (name: description): CHANNEL_NAME: Name of replication channel through which the transaction is received; THREAD_ID: Thread_Id as displayed in 'performance_schema.threads' table for thread with name 'thread/sql/rpl_parallel_thread' (THREAD_ID will be NULL when worker threads are stopped due to an error/force stop); SERVICE_STATE: Thread is running or not; LAST_SEEN_TRANSACTION: Last GTID executed by worker; LAST_ERROR_NUMBER: Last error that occurred on a particular worker; LAST_ERROR_MESSAGE: Last error specific message; LAST_ERROR_TIMESTAMP: Time stamp of last error. CHANNEL_NAME will be empty when the worker has not processed any transaction. Channel_name points to valid source channel_name when it is processing a transaction/event group.	2021-04-08 17:19:51 +05:30
Sergei Golubchik	25d9d2e37f	Merge branch 'bb-10.4-release' into bb-10.5-release	2021-02-15 16:43:15 +01:00
Sergei Golubchik	00a313ecf3	Merge branch 'bb-10.3-release' into bb-10.4-release Note, the fix for "MDEV-23328 Server hang due to Galera lock conflict resolution" was null-merged. 10.4 version of the fix is coming up separately	2021-02-12 17:44:22 +01:00
Sergei Golubchik	60ea09eae6	Merge branch '10.2' into 10.3	2021-02-01 13:49:33 +01:00
Sujatha	eb75e8705d	MDEV-8134: The relay-log is not flushed after the slave-relay-log.999999 showed Problem: ======== Auto purge of relaylogs stops when relay-log-file is 'slave-relay-log.999999' and slave_parallel_threads is enabled. Analysis: ========= The problem is in the Relay_log_info::inc_group_relay_log_pos() function. When two log names are compared via strcmp(), the result is correct as long as the log name sequence numbers have the same number of digits (6 digits). But when the number goes to 7 digits, 999999 compares greater than 1000000, which is wrong, hence the bug. Fix: ==== Extract the numeric extension part of the file name, convert it into unsigned long and compare. Thanks to David Zhao for the contribution.	2021-01-21 13:00:02 +05:30
Sergei Golubchik	4668e079ee	Merge branch '10.2' into 10.3	2020-08-06 17:01:44 +02:00
Sergei Golubchik	fbcae42c2a	Merge branch '10.1' into 10.2	2020-08-06 16:47:39 +02:00
Oleksandr Byelkin	48b5777ebd	Merge branch '10.4' into 10.5	2020-08-04 17:24:15 +02:00
Sachin	e3c18b8e84	MDEV-23089 rpl_parallel2 fails in 10.5 Problem:- rpl_parallel2 was failing non-deterministically Analysis:- When FLUSH TABLES WITH READ LOCK is executed, it will allow all worker threads to complete their ongoing transactions and then it will pause them. At this state FTWRL will proceed to acquire global read lock. FTWRL first blocks threads from starting new commits, then upgrades the lock to block commit of existing transactions. Step1: FLUSH TABLES WITH READ LOCK - Blocks new commits Step2: * STOP SLAVE command enables 'force_abort=1' which unblocks workers, they continue to execute events. * T1: Waits in 'record_gtid' call to update 'gtid_slave_pos' table with its current GTID, but it is blocked because of Step1. * T2: Holds COMMIT lock and waits for T1 to commit. Step3: FLUSH TABLES WITH READ LOCK - Waiting to get BLOCK_COMMIT. This results in deadlock. When STOP SLAVE command allows paused workers to proceed, workers should skip the execution of all further events, similar to 'conservative' parallel mode. Solution:- We will assign 1 to skip_event_group when we are aborted in do_ftwrl_wait. rpl_parallel_entry->pause_sub_id is only reset when force_abort is off in rpl_pause_after_ftwrl.	2020-08-04 11:28:26 +05:30
Sachin	706a7101bf	MDEV-23089 rpl_parallel2 fails in 10.5 Problem:- rpl_parallel2 was failing non-deterministically Analysis:- When FLUSH TABLES WITH READ LOCK is executed, it will allow all worker threads to complete their ongoing transactions and then it will pause them. At this state FTWRL will proceed to acquire global read lock. FTWRL first blocks threads from starting new commits, then upgrades the lock to block commit of existing transactions. Step1: FLUSH TABLES WITH READ LOCK - Blocks new commits Step2: * STOP SLAVE command enables 'force_abort=1' which unblocks workers, they continue to execute events. * T1: Waits in 'record_gtid' call to update 'gtid_slave_pos' table with its current GTID, but it is blocked because of Step1. * T2: Holds COMMIT lock and waits for T1 to commit. Step3: FLUSH TABLES WITH READ LOCK - Waiting to get BLOCK_COMMIT. This results in deadlock. When STOP SLAVE command allows paused workers to proceed, workers should skip the execution of all further events, similar to 'conservative' parallel mode. Solution:- We will assign 1 to skip_event_group when we are aborted in do_ftwrl_wait. rpl_parallel_entry->pause_sub_id is only reset when force_abort is off in rpl_pause_after_ftwrl.	2020-08-03 17:07:16 +05:30
Marko Mäkelä	4ec032b492	Merge 10.4 into 10.5	2020-07-21 17:33:16 +03:00
Monty	fc48c8ff4c	MDEV-21953 deadlock between BACKUP STAGE BLOCK_COMMIT and parallel repl. The issue was: T1, a parallel slave worker thread, is waiting for another worker thread to commit. While waiting, it has the MDL_BACKUP_COMMIT lock. T2, working for mariabackup, is doing BACKUP STAGE BLOCK_COMMIT and blocks all commits. This causes a deadlock as the thread T1 is waiting for can't commit. Fixed by moving locking of MDL_BACKUP_COMMIT from ha_commit_trans() to commit_one_phase_2() Other things: - Added a new argument to ha_comit_one_phase() to signal if the transaction was a write transaction. - Ensured that ha_maria::implicit_commit() is always called under MDL_BACKUP_COMMIT. This code is not needed in 10.5 - Ensure that MDL_Request values 'type' and 'ticket' are always initialized. This makes it easier to check the state of the MDL_Request. - Moved thd->store_globals() earlier in handle_rpl_parallel_thread() as thd->init_for_queries() could use a MDL that could crash if store_globals were not called. - Don't call ha_enable_transactions() in THD::init_for_queries() as this is both slow (uses MDL locks) and not needed.	2020-07-21 12:42:42 +03:00
Marko Mäkelä	c515b1d092	Merge 10.4 into 10.5	2020-06-18 13:58:54 +03:00
Sachin	592a10d079	MDEV-22370 safe_mutex: Trying to lock uninitialized mutex at /data/src/10.4-bug/sql/rpl_parallel.cc, line 470 upon shutdown during FTWRL Problem:- When we issue FTWRL with shutdown in parallel, there is a race between FTWRL and shutdown. Shutdown might destroy the mutex (pool->LOCK_rpl_thread_pool) before FTWRL can lock it, so the FTWRL thread can crash. Solution:- mysql_mutex_destroy(pool->LOCK_rpl_thread_pool) should wait for the FTWRL thread to complete its work, and then destroy. So slave_prepare_for_shutdown will just deactivate the pool, and the mutex is destroyed later in end_slave()	2020-06-17 02:22:46 +05:30
Marko Mäkelä	4a0b56f604	Merge 10.4 into 10.5	2020-05-31 10:28:59 +03:00
Marko Mäkelä	6da14d7b4a	Merge 10.3 into 10.4	2020-05-30 11:04:27 +03:00
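
The MDEV-8134 entry in the log above describes a string comparison of relay log names that breaks once the sequence number grows from 6 to 7 digits, and a fix that compares the numeric extension instead. The following is a minimal sketch of that idea, not the server's actual code: the helper names `relay_log_seq` and `compare_relay_log_names` are hypothetical, and it assumes the sequence number is the decimal text after the last '.' in the file name.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not a MariaDB function): return the numeric
// extension after the last '.' in a relay log name, or 0 if there is none.
static unsigned long relay_log_seq(const char *name)
{
  const char *dot = std::strrchr(name, '.');
  return dot ? std::strtoul(dot + 1, nullptr, 10) : 0UL;
}

// Hypothetical comparison: negative, zero, or positive like strcmp(),
// but ordered by the numeric extension rather than byte by byte.
static int compare_relay_log_names(const char *a, const char *b)
{
  unsigned long sa = relay_log_seq(a);
  unsigned long sb = relay_log_seq(b);
  return (sa > sb) - (sa < sb);
}
```

With these helpers, `std::strcmp("slave-relay-log.999999", "slave-relay-log.1000000")` is positive because '9' sorts after '1', while `compare_relay_log_names` on the same two names is negative, which is the ordering the commit message says is needed.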


243 commits