mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-26 00:34:18 +01:00

Author	SHA1	Message	Date
Kristian Nielsen	ec47beaba6	Merge parallel replication async deadlock kill into 10.2. Conflicts: sql/mysqld.cc sql/slave.cc	2016-09-09 12:15:53 +02:00
Kristian Nielsen	7e0c9de864	Parallel replication async deadlock kill When a deadlock kill is detected inside the storage engine, the kill is not done immediately, to avoid calling back into the storage engine kill_query method with various lock subsystem mutexes held. Instead the kill is queued and done later by a slave background thread. This patch in preparation for fixing TokuDB optimistic parallel replication, as well as for removing locking hacks in InnoDB/XtraDB in 10.2. Signed-off-by: Kristian Nielsen <knielsen at knielsen-hq.org>	2016-09-08 15:25:40 +02:00
Monty	96e95b5465	Better SHOW PROCESSLIST for replication - When waiting for events, start time is now counted from start of wait - Instead of having "Connect" as "Command" for all replication threads we now have: - Slave_IO for Slave thread reading relay log - Slave_SQL for slave executing SQL commands or distribution queries to Slave workers - Slave_worker for slave threads executin SQL commands in parallel replication	2016-08-29 13:10:17 +03:00
Monty	89685d55d7	Reuse THD for new user connections - To ensure that mallocs are marked for the correct THD, even if it's allocated in another thread, I added the thread_id to the THD constructor - Added st_my_thread_var to thr_lock_info_init() to avoid a call to my_thread_var - Moved things from THD::THD() to THD::init() - Moved some things to THD::cleanup() - Added THD::free_connection() and THD::reset_for_reuse() - Added THD to CONNECT::create_thd() - Added THD::thread_dbug_id and st_my_thread_var->dbug_id. These are needed to ensure that we have a constant thread_id used for debugging with a THD, even if it changes thread_id (=connection_id) - Set variables.pseudo_thread_id in constructor. Removed not needed sets.	2016-06-04 09:06:00 +02:00
Monty	732adec0a4	Removed some not needed when doing delete thd, which caused warnings about wrong mutex usage from safe_mutex. Ensure that LOCK_status is always taken before LOCK_thread_count	2016-04-28 13:39:55 +03:00
Monty	cdd4043117	Cleanups: - Removed some QQ markers - Removed some rows not compatible with valgrind 3.9.0 - Made mysql_install_db.sh more silent by default. --verbose now gives more information - Added assert that auto-increment doesn't generate 0 (safety) - Removed thd->set_time() in some places as it's set in init_for_queries() - Fixed some --big tests in tokudb - Fixed a bug in mysql_client_test.cc where sql_mode was not properly reset	2016-04-05 18:00:04 +03:00
Sergei Golubchik	f67a2211ec	Merge branch '10.1' into 10.2	2016-03-23 22:36:46 +01:00
Sergei Golubchik	3b0c7ac1f9	Merge branch '10.0' into 10.1	2016-03-21 13:02:53 +01:00
Otto Kekäläinen	1777fd5f55	Fix spelling: occurred, execute, which etc	2016-03-04 02:09:37 +02:00
Monty	3d4a7390c1	MDEV-6150 Speed up connection speed by moving creation of THD to new thread Creating a CONNECT object on client connect and pass this to the working thread which creates the THD. Split LOCK_thread_count to different mutexes Added LOCK_thread_start to syncronize threads Moved most usage of LOCK_thread_count to dedicated functions Use next_thread_id() instead of thread_id++ Other things: - Thread id now starts from 1 instead of 2 - Added cast for thread_id as thread id is now of type my_thread_id - Made THD->host const (To ensure it's not changed) - Removed some DBUG_PRINT() about entering/exiting mutex as these was already logged by mutex code - Fixed that aborted_connects and connection_errors_internal are counted in all cases - Don't take locks for current_linfo when we set it (not needed as it was 0 before)	2016-02-07 10:34:03 +02:00
Sergei Golubchik	a2bcee626d	Merge branch '10.0' into 10.1	2015-12-21 21:24:22 +01:00
Monty	c3018b0ff4	Fixes to get all test to run on MacosX Lion 10.7 This includes fixing all utilities to not have any memory leaks, as safemalloc warnings stopped tests from passing on MacOSX. - Ensure that all clients takes character-set-dir, as the libmysqlclient library will use it. - mysql-test-run now passes character-set-dir to all external clients. - Changed dynstr_free() so that it can be called twice (made freeing code easier) - Changed rpl_global_gtid_slave_state to be allocated dynamicly as it includes a mutex that needs to be initizlied/destroyed before my_end() is called. - Removed rpl_slave_state::init() and rpl_slave_stage::deinit() as their job are better handling by constructor and delete. - Print alias instead of table_name in check_duplicate_key as table_name may have been converted to lower case. Other things: - Fixed a case in time_to_datetime_with_warn() where we where using && instead of & in tests	2015-11-29 17:51:23 +02:00
Monty	b30a768e7b	Fixed failures in rpl_parallel2 Problem was that we used same condition variable with 2 different mutex. Fixed by changing to use COND_rpl_thread_stop instead of COND_parallel_entry for stopping threads. Patch by Kristian Nielsen	2015-11-23 19:58:30 +02:00
Kristian Nielsen	8f2e05f41c	Merge branch 'mdev7818-4' into 10.1 Conflicts: mysql-test/suite/perfschema/r/stage_mdl_global.result sql/rpl_rli.cc sql/sql_parse.cc	2015-11-13 14:24:40 +01:00
Kristian Nielsen	ba02550166	MDEV-7818: Deadlock occurring with parallel replication and FTWRL Problem is that FLUSH TABLES WITH READ LOCK first blocks threads from starting new commits, then waits for running commits to complete. But in-order parallel replication needs commits to happen in a particular order, so this can easily deadlock. To fix this problem, this patch introduces a way to temporarily pause the parallel replication worker threads. Before starting FTWRL, we let all worker threads complete in-progress transactions, and then wait. Then we proceed to take the global read lock. Once the lock is obtained, we unpause the worker threads. Now commits are blocked from starting by the global read lock, so the deadlock will no longer occur.	2015-11-13 14:02:15 +01:00
Kristian Nielsen	6d96fab7dd	MDEV-7818: Deadlock occurring with parallel replication and FTWRL Preparation patch, moving the GCO wait into a separate function, in preparation for adding a separate wait phase for FLUSH TABLES WITH READ LOCK.	2015-11-13 14:02:14 +01:00
Kristian Nielsen	75dc267101	Change Seconds_behind_master to be updated only at commit in parallel replication Before, the Seconds_behind_master was updated already when an event was queued for a worker thread to execute later. This might lead users to interpret a low value as the slave being almost up to date with the master, while in reality there might still be lots and lots of events still queued up waiting to be applied by the slave. See https://lists.launchpad.net/maria-developers/msg08958.html for more detailed discussions.	2015-11-13 10:24:53 +01:00
Kristian Nielsen	df9b8aee58	Merge MDEV-8193 into 10.1 Conflicts: sql/rpl_rli.cc	2015-09-11 12:01:48 +02:00
Kristian Nielsen	51eaa7fe53	MDEV-8193: UNTIL clause in START SLAVE is sporadically disobeyed by parallel replication The code was using the wrong variable when comparing the binlog name for the UNTIL position. This could cause the comparison to fail after binlog rotation, in turn causing the UNTIL clause to not trigger slave stop.	2015-09-11 10:51:56 +02:00
Sergei Golubchik	b85a00161e	MDEV-8264 encryption for binlog * Start_encryption_log_event * --encrypt-binlog command line option based on google patches.	2015-09-04 10:33:55 +02:00
Kristian Nielsen	ef82cb7c2c	Merge MDEV-8725 into 10.1	2015-09-02 10:53:37 +02:00
Kristian Nielsen	999c43aeb7	MDEV-8725: Assertion `!(thd->rgi_slave && thd-> rgi_slave->did_mark_start_commit)' failed in ha_rollback_trans The assertion is there to catch cases where we rollback while mark_start_commit() is active. This can allow following event groups to be replicated too early, causing conflicts. But in this case, we have an _explicit_ ROLLBACK event in the binlog, which should not assert. We fix this by delaying the mark_start_commit() in the explicit ROLLBACK case. It seems safest to delay this in ROLLBACK case anyway, and there should be no reason to try to optimise this corner case.	2015-09-02 09:57:18 +02:00
Kristian Nielsen	dbd205797b	Merge MDEV-8302 into 10.1	2015-08-04 12:39:22 +02:00
Kristian Nielsen	9b9c5e890c	MDEV-8302: Duplicate key with parallel replication This bug is essentially another variant of MDEV-7458. If a transaction conflict caused a deadlock kill of T2 in record_gtid() during commit, the code would do a rollback _before_ running rgi->unmark_start_commit(). This creates a race where following transactions could start too early (before T2 has completed its transaction retry). This in turn could lead to replication failure, if there was a conflict that caused eg. duplicate key error or similar. The fix is to remove these rollbacks (in Query_log_event::do_apply_event() and Xid_log_event::do_apply_event(). They seem out-of-place; code in log_event.cc generally does not roll back on error, this is handled higher up. In addition, because of the extreme difficulty of reproducing bugs like MDEV-7458 and MDEV-8302, this patch adds some extra precations to try to detect (in debug builds) or prevent (in release builds) similar bugs. ha_rollback_trans() will now call unmark_start_commit() if needed (and assert in debug build when a caller does rollback without unmark first). We also add an extra check for thd->killed() so that we avoid doing mark_start_commit() if we already have a pending deadlock kill. And we add a missing unmark_start_commit() call in the error case, found by the above assertion.	2015-08-04 11:40:19 +02:00
Kristian Nielsen	903f8dc72d	Merge MDEV-8147 into 10.1	2015-05-26 15:03:22 +02:00
Kristian Nielsen	e5f1e841dc	MDEV-8147: Assertion `m_lock_type == 2' failed in handler::ha_close() during parallel replication When the slave processes the master restart format_description event, parallel replication needs to complete any prior events before processing the restart event (which closes temporary tables and such stuff). This happens in wait_for_workers_idle(), however it was not waiting long enough. The wait was using wait_for_prior_commit(), but at that points table can still be open. This lead to assertion in this case. So change wait_for_workers_idle() to wait until all worker threads have reached finish_event_group(), at which point all tables should have been closed.	2015-05-26 13:04:15 +02:00
Sergey Vojtovich	9851a8193f	MDEV-8001 - mysql_reset_thd_for_next_command() takes 0.04% in OLTP RO Removed yet more mysql_reset_thd_for_next_command(). Call THD::reset_for_next_command() directly instead.	2015-05-13 15:28:34 +04:00
Kristian Nielsen	8bedb638d7	MDEV-8113: Parallel slave: slave hangs on ALTER TABLE (or other DDL) as the first event after slave start In optimistic parallel replication, it is not safe to try to run a following transaction in parallel with a DDL statement, and there is code to prevent this. However, the code was missing the case where the DDL is the very first event after slave start. In this case, following transactions could run in parallel with the DDL, which can cause the slave to hang or even corrupt slave in unlucky cases.	2015-05-11 12:43:38 +02:00
Kristian Nielsen	c2dd88ac85	Merge MDEV-8031 into 10.1	2015-04-23 14:40:10 +02:00
Kristian Nielsen	b616991a68	MDEV-8031: Parallel replication stops on "connection killed" error (probably incorrectly handled deadlock kill) There was a rare race, where a deadlock error might not be correctly handled, causing the slave to stop with something like this in the error log: 150423 14:04:10 [ERROR] Slave SQL: Connection was killed, Gtid 0-1-2, Internal MariaDB error code: 1927 150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927 150423 14:04:10 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213 150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927 150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927 150423 14:04:10 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001 position 1234 The problem was incorrect error handling. When a deadlock is detected, it causes a KILL CONNECTION on the offending thread. This error is then later converted to a deadlock error, and the transaction is retried. However, the deadlock error was not cleared at the start of the retry, nor was the lingering kill signal. So it was possible to get another deadlock kill early during retry. If this happened with particular thread scheduling/timing, it was possible that the new KILL CONNECTION error was masked by the earlier deadlock error, so that the second kill was not properly converted into a deadlock error and retry. This patch adds code that clears the old error and killed flag before starting the retry. It also adds code to handle a deadlock kill caught in a couple of places where it was not handled before.	2015-04-23 14:09:15 +02:00
Kristian Nielsen	167332597f	Merge 10.0 -> 10.1. Conflicts: mysql-test/suite/multi_source/multisource.result sql/sql_base.cc	2015-04-17 15:18:44 +02:00
Kristian Nielsen	accdabd668	Merge MDEV-7888 and MDEV-7929 into 10.0.	2015-04-08 13:19:22 +02:00
Kristian Nielsen	48c10fb5f7	Merge MDEV-7888 and MDEV-7929 into 10.1.	2015-04-08 11:04:24 +02:00
Kristian Nielsen	3b961347db	MDEV-7888, MDEV-7929: Parallel replication hangs sometimes on ANALYZE TABLE or DDL The hangs occur when the group_commit_orderer object is freed before the last mark_start_commit() call on it - this loses the wakeup to other waiting worker threads, causing them to hang until killed manually. The object was freed because wakeup_subsequent_commits() was called two early in two places. For MDEV-7888, during ANALYZE TABLE, and for MDEV-7929 during record_gtid() after processing a DDL event. The group_commit_orderer object can be freed when its last transaction has called wait_for_prior_commit(). Fix by implementing a suspend/resume mechanism for wakeup_subsequent_commits() that can be used in places where a transaction is committed without this being the commit of the actual replication event group. Also add a protection mechanism (that asserts in debug builds) which can prevent the too-early free and hang if other similar bugs should remain in other parts of the code.	2015-04-08 11:01:18 +02:00
Kristian Nielsen	f573b65e41	Merge MDEV-7847 and MDEV-7882 into 10.0. Conflicts: mysql-test/suite/rpl/r/rpl_parallel.result sql/rpl_parallel.cc	2015-03-30 15:10:29 +02:00
Kristian Nielsen	c41e4d3b49	Merge MDEV-7847 and MDEV-7882 into 10.0. Conflicts: mysql-test/suite/rpl/r/rpl_parallel.result mysql-test/suite/rpl/t/rpl_parallel.test	2015-03-30 14:51:25 +02:00
Kristian Nielsen	880f2273fd	MDEV-7847: "Slave worker thread retried transaction 10 time(s) in vain, giving up", followed by replication hanging This patch fixes a bug in the error handling in parallel replication, when one worker thread gets a failure and other worker threads processing later transactions have to rollback and abort. The problem was with the lifetime of group_commit_orderer objects (GCOs). A GCO is freed when we register that its last event group has committed. This relies on register_wait_for_prior_commit() and wait_for_prior_commit() to ensure that the fact that T2 has committed implies that any earlier T1 has also committed, and can thus no longer execute mark_start_commit(). However, in the error case, the code was skipping the register_wait_for_prior_commit() and wait_for_prior_commit() calls. Thus commit ordering was not guaranteed, and a GCO could be freed too early. Then a later mark_start_commit() would reference deallocated GCO, which could lead to lost wakeup (causing slave threads to hang) or other corruption. This patch makes also the error case respect commit order. This way, also the error case gets the GCO lifetime correct, and the hang no longer occurs.	2015-03-30 14:33:44 +02:00
Kristian Nielsen	a4082918c8	MDEV-7882: Excessive transaction retry in parallel replication When a transaction in parallel replication needs to retry (eg. because of deadlock kill), first wait for all prior transactions to commit before doing the retry. This way, we avoid the retry once again conflicting with a prior transaction, requiring yet another retry. Without this patch, we saw "in the wild" that transactions had to be retried more than 10 times to succeed, which exceeds the default --slave_transaction_retries value and is in any case undesirable. (We already do this in 10.1 in "optimistic" parallel replication mode; this patch just makes the code use the same logic for "conservative" mode (only mode in 10.0)).	2015-03-30 14:16:57 +02:00
Kristian Nielsen	bd2ae787ea	MDEV-7825: Parallel replication race condition on gco->flags, possibly resulting in slave hang The patch for optimistic parallel replication as a memory optimisation moved the gco->installed field into a bit in gco->flags. However, that is just plain wrong. The gco->flags field is owned by the SQL driver thread, but gco->installed is used by the worker threads, so this will cause a race condition. The user-visible problem might be conflicts between transactions and/or slave threads hanging. So revert this part of the optimistic parallel replication patch, going back to using a separate field gco->installed like in 10.0.	2015-03-24 16:33:51 +01:00
Kristian Nielsen	ed04c40b01	MDEV-5289: master server starts slave parallel threads Delay spawning parallel replication worker threads until a slave SQL thread is running, and de-spawn them when the last SQL thread stops. This is especially useful to avoid needless threads on a master in a setup where same my.cnf is used on masters and slaves.	2015-03-11 09:18:16 +01:00
Sergei Golubchik	2db62f686e	Merge branch '10.0' into 10.1	2015-03-07 13:21:02 +01:00
Kristian Nielsen	95d7208859	Merge MDEV-6589 and MDEV-6403 into 10.1. Conflicts: sql/log.cc sql/rpl_rli.cc sql/sql_repl.cc	2015-03-04 13:49:37 +01:00
Kristian Nielsen	3ef0b9b235	Merge MDEV-6589 and MDEV-6403 into 10.0.	2015-03-04 13:36:54 +01:00
Kristian Nielsen	ad0d203f2e	MDEV-6589: Incorrect relay log start position when restarting SQL thread after error in parallel replication The problem occurs in parallel replication in GTID mode, when we are using multiple replication domains. In this case, if the SQL thread stops, the slave GTID position may refer to a different point in the relay log for each domain. The bug was that when the SQL thread was stopped and restarted (but the IO thread was kept running), the SQL thread would resume applying the relay log from the point of the most advanced replication domain, silently skipping all earlier events within other domains. This caused replication corruption. This patch solves the problem by storing, when the SQL thread stops with multiple parallel replication domains active, the current GTID position. Additionally, the current position in the relay logs is moved back to a point known to be earlier than the current position of any replication domain. Then when the SQL thread restarts from the earlier position, GTIDs encountered are compared against the stored GTID position. Any GTID that was already applied before the stop is skipped to avoid duplicate apply. This patch should have no effect if multi-domain GTID parallel replication is not used. Similarly, if both SQL and IO thread are stopped and restarted, the patch has no effect, as in this case the existing relay logs are removed and re-fetched from the master at the current global @@gtid_slave_pos.	2015-03-04 13:36:04 +01:00
Kristian Nielsen	a227cf8046	MDEV-7335: Potential parallel slave deadlock with specific binlog corruption If somehow the COMMIT or XID event in an event group was missing, the code in parallel replication to handle this was not sufficient, leading to server deadlock.	2015-02-24 14:39:15 +01:00
Kristian Nielsen	8672339328	MDEV-6676: Optimistic parallel replication Adjust the configuration options, as discussed on the maria-developers@ mailing list. The option to hint a transaction to not be replicated in parallel is now called @@skip_parallel_replication, consistent with @@skip_replication. And the --slave-parallel-mode is now simplified to have just one of the following values: none minimal conservative optimistic aggressive This reflects successively harder efforts to find opportunities to run things in parallel on the slave. It allows to extend the server with more automatic heuristics in the future without having to introduce a new configuration option for each and every one.	2015-02-07 09:42:58 +01:00
Daniel Black	2deaa292e7	MDEV-7201: parallel threads resizing - potential race condition to access freed memory pool->threads is freed before being reassigned the new pool. Although not really a memory barrier I though it prudent to keep the pool thread count to be the lower of the old/new thread list before the new threads is allocated.	2015-02-07 09:42:57 +01:00
Sergei Golubchik	4b21cd21fe	Merge branch '10.0' into merge-wip	2015-01-31 21:48:47 +01:00
Sergei Golubchik	e695db0f2d	MDEV-7437 remove suport for "atomics" with rwlocks	2015-01-13 10:15:21 +01:00
Kristian Nielsen	f27817c1d0	MDEV-7326: Server deadlock in connection with parallel replication The bug occurs when a transaction does a retry after all transactions have done mark_start_commit() in a batch of group commit from the master. In this case, the retrying transaction can unmark_start_commit() after the following batch has already started running and de-allocated the GCO. Then after retry, the transaction will re-do mark_start_commit() on a de-allocated GCO, and also wakeup of later GCOs can be lost. This was seen "in the wild" by a user, even though it is not known exactly what circumstances can lead to retry of one transaction after all transactions in a group have reached the commit phase. The lifetime around GCO was somewhat clunky anyway. With this patch, a GCO lives until rpl_parallel_entry::last_committed_sub_id has reached the last transaction in the GCO. This guarantees that the GCO will still be alive when a transaction does mark_start_commit(). Also, we now loop over the list of active GCOs for wakeup, to ensure we do not lose a wakeup even in the problematic case.	2015-01-07 14:45:39 +01:00

1 2 3

136 commits