Commit graph

143 commits

Author SHA1 Message Date
Sergei Golubchik
4a5d25c338 Merge branch '10.1' into 10.2 2016-12-29 13:23:18 +01:00
Sergei Golubchik
2f20d297f8 Merge branch '10.0' into 10.1 2016-12-11 09:53:42 +01:00
Kristian Nielsen
390f2a013b Fix incorrect reading of events from relaylog in parallel replication.
The SQL thread keeps track of the position in the current relay log from
which to read the next event. This position is not normally used, but a
certain interaction with the IO thread can cause the SQL thread to re-open
the relay log and seek to the stored position.

In parallel replication, there were a couple of places where the position
was not updated. This created a race where a re-open of the relay log could
seek to the wrong position and start re-reading and processing events
already handled once, causing various kinds of problems.

Fix this by moving the position update into a single place in
apply_event_and_update_pos(), which should ensure that the position is
always updated in the parallel replication case.

This problem was found from the testcase of MDEV-10863, but it is logically
a separate problem.
2016-11-16 11:00:38 +01:00
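The idea behind 390f2a013b, routing every position update through one function, can be illustrated with a small model. This is a sketch in Python, not MariaDB code: `RelayLogReader`, `reopen_and_seek` and the event tuples are hypothetical names chosen for the example; only `apply_event_and_update_pos` is taken from the commit message.

```python
class RelayLogReader:
    """Toy model of an SQL thread tracking its relay log read position."""

    def __init__(self, events):
        self.events = events      # list of (end_position, payload) tuples
        self.stored_pos = 0       # position to seek to after a re-open
        self.applied = []

    def apply_event_and_update_pos(self, index):
        # The only place where stored_pos advances. Every code path that
        # applies an event goes through here, so none can skip the update.
        end_pos, payload = self.events[index]
        self.applied.append(payload)
        self.stored_pos = end_pos

    def reopen_and_seek(self):
        # After a re-open, resume at the first event ending past stored_pos.
        for i, (end_pos, _) in enumerate(self.events):
            if end_pos > self.stored_pos:
                return i
        return len(self.events)


reader = RelayLogReader([(100, "e1"), (250, "e2"), (400, "e3")])
reader.apply_event_and_update_pos(0)
reader.apply_event_and_update_pos(1)
next_index = reader.reopen_and_seek()   # index of the first unapplied event
```

If any path applied an event without advancing `stored_pos`, `reopen_and_seek()` would return an index that had already been processed, which is the re-reading problem the commit describes.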
Kristian Nielsen
c06bc66816 MDEV-11065: Compressed binary log
Minor review comments/changes:

 - A bunch of style-fixes.

 - Change macros to static inline functions.

 - Update check_event_type() with compressed event types.

 - Small .result file update.
2016-10-20 18:00:59 +02:00
Kristian Nielsen
e1ef99c3dc MDEV-7145: Delayed replication
Merge feature into 10.2 from feature branch.

Delayed replication adds an option

  CHANGE MASTER TO master_delay=<seconds>

Replication will then delay applying events by that many
seconds. This creates a replication slave that reflects the state of
the master some time in the past.

Feature is ported from MySQL source tree.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2016-10-16 23:44:44 +02:00
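The effect of `master_delay` in e1ef99c3dc can be pictured as a simple wait computation. The sketch below assumes the delay is measured from the event's timestamp on the master; the function name `seconds_to_wait` and its parameters are hypothetical and do not come from the server source.

```python
def seconds_to_wait(event_timestamp, master_delay, now):
    """Return how long a delayed slave should still wait before applying an event.

    All three arguments are in seconds; event_timestamp is when the event
    was executed on the master (an assumption of this sketch).
    """
    return max(0, event_timestamp + master_delay - now)


# An event from t=1000 with a one-hour delay, examined at t=1500.
remaining = seconds_to_wait(1000, 3600, 1500)
```

Once `now` passes `event_timestamp + master_delay`, the function returns 0 and the event can be applied immediately.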
Kristian Nielsen
3011060b2a MDEV-7145: Delayed slave.
Extend to work also for parallel replication.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2016-10-14 23:15:59 +02:00
Kristian Nielsen
50f19ca809 Remove unnecessary global mutex in parallel replication.
The function apply_event_and_update_pos() is called with the
rli->data_lock mutex held. However, there seems to be nothing in the
function actually needing the mutex to be held. Certainly not in the
parallel replication case, where sql_slave_skip_counter is always 0
since the non-zero case is handled by the SQL driver thread.

So this patch makes parallel replication use a variant of
apply_event_and_update_pos() without the need to take the
rli->data_lock mutex. This avoids one contended global mutex for each
event executed, which might improve performance on CPU-bound workloads
somewhat.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2016-10-14 22:44:40 +02:00
Kristian Nielsen
ec47beaba6 Merge parallel replication async deadlock kill into 10.2.
Conflicts:
	sql/mysqld.cc
	sql/slave.cc
2016-09-09 12:15:53 +02:00
Kristian Nielsen
7e0c9de864 Parallel replication async deadlock kill
When a deadlock kill is detected inside the storage engine, the kill
is not done immediately, to avoid calling back into the storage engine
kill_query method with various lock subsystem mutexes held. Instead the
kill is queued and done later by a slave background thread.

This patch is in preparation for fixing TokuDB optimistic parallel
replication, as well as for removing locking hacks in InnoDB/XtraDB in
10.2.

Signed-off-by: Kristian Nielsen <knielsen at knielsen-hq.org>
2016-09-08 15:25:40 +02:00
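The deferred kill in 7e0c9de864 follows a common pattern: the code that detects the problem only enqueues a request, and a background thread carries it out later, outside the locks held at detection time. A minimal Python sketch of that pattern follows; `request_deadlock_kill`, `background_killer` and the `killed` list are hypothetical stand-ins, not server functions.

```python
import queue
import threading

kill_queue = queue.Queue()
killed = []   # stand-in for "connections that were actually killed"


def request_deadlock_kill(victim_id):
    # Called from inside the (simulated) storage engine with its locks held:
    # only enqueue here, never call back into the engine.
    kill_queue.put(victim_id)


def background_killer():
    # Runs in its own thread, outside any engine lock.
    while True:
        victim_id = kill_queue.get()
        if victim_id is None:        # sentinel: shut down
            break
        killed.append(victim_id)     # stand-in for the real kill


worker = threading.Thread(target=background_killer)
worker.start()
request_deadlock_kill(42)
request_deadlock_kill(7)
kill_queue.put(None)
worker.join()
```

The queue preserves request order, so the victims are processed in the order in which the deadlocks were detected.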
Monty
96e95b5465 Better SHOW PROCESSLIST for replication
- When waiting for events, start time is now counted from start of wait
- Instead of having "Connect" as "Command" for all replication threads we
  now have:
  - Slave_IO for Slave thread reading relay log
  - Slave_SQL for slave executing SQL commands or distributing queries to
    Slave workers
  - Slave_worker for slave threads executing SQL commands in parallel replication
2016-08-29 13:10:17 +03:00
Monty
89685d55d7 Reuse THD for new user connections
- To ensure that mallocs are marked for the correct THD, even if it's
  allocated in another thread, I added the thread_id to the THD constructor
- Added st_my_thread_var to thr_lock_info_init() to avoid a call to my_thread_var
- Moved things from THD::THD() to THD::init()
- Moved some things to THD::cleanup()
- Added THD::free_connection() and THD::reset_for_reuse()
- Added THD to CONNECT::create_thd()
- Added THD::thread_dbug_id and st_my_thread_var->dbug_id. These are needed
  to ensure that we have a constant thread_id used for debugging with a THD,
  even if it changes thread_id (=connection_id)
- Set variables.pseudo_thread_id in constructor. Removed not needed sets.
2016-06-04 09:06:00 +02:00
Monty
732adec0a4 Removed some not needed when doing delete thd, which caused warnings about
wrong mutex usage from safe_mutex.
Ensure that LOCK_status is always taken before LOCK_thread_count
2016-04-28 13:39:55 +03:00
Monty
cdd4043117 Cleanups:
- Removed some QQ markers
- Removed some rows not compatible with valgrind 3.9.0
- Made mysql_install_db.sh more silent by default. --verbose now gives more information
- Added assert that auto-increment doesn't generate 0 (safety)
- Removed thd->set_time() in some places as it's set in init_for_queries()
- Fixed some --big tests in tokudb
- Fixed a bug in mysql_client_test.cc where sql_mode was not properly reset
2016-04-05 18:00:04 +03:00
Sergei Golubchik
f67a2211ec Merge branch '10.1' into 10.2 2016-03-23 22:36:46 +01:00
Sergei Golubchik
3b0c7ac1f9 Merge branch '10.0' into 10.1 2016-03-21 13:02:53 +01:00
Otto Kekäläinen
1777fd5f55 Fix spelling: occurred, execute, which etc 2016-03-04 02:09:37 +02:00
Monty
3d4a7390c1 MDEV-6150 Speed up connection speed by moving creation of THD to new thread
Creating a CONNECT object on client connect and pass this to the working thread which creates the THD.
Split LOCK_thread_count to different mutexes
Added LOCK_thread_start to synchronize threads
Moved most usage of LOCK_thread_count to dedicated functions
Use next_thread_id() instead of thread_id++

Other things:
- Thread id now starts from 1 instead of 2
- Added cast for thread_id as thread id is now of type my_thread_id
- Made THD->host const (To ensure it's not changed)
- Removed some DBUG_PRINT() about entering/exiting mutex as these were already logged by mutex code
- Fixed that aborted_connects and connection_errors_internal are counted in all cases
- Don't take locks for current_linfo when we set it (not needed as it was 0 before)
2016-02-07 10:34:03 +02:00
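Replacing `thread_id++` with `next_thread_id()` in 3d4a7390c1 puts the increment behind a single function that can be made safe for concurrent callers. The sketch below shows one way to do that in Python with a lock; the lock-based approach and the helper names other than `next_thread_id` are assumptions of this example, not a description of the server's implementation.

```python
import threading

_thread_id_lock = threading.Lock()
_last_thread_id = 0   # ids start from 1, as the commit message states


def next_thread_id():
    """Return a new, unique thread id; safe to call from many threads."""
    global _last_thread_id
    with _thread_id_lock:
        _last_thread_id += 1
        return _last_thread_id


ids = []
ids_lock = threading.Lock()


def connect():
    # Each simulated connection asks for its own id.
    tid = next_thread_id()
    with ids_lock:
        ids.append(tid)


threads = [threading.Thread(target=connect) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the read-increment-return sequence happens under one lock, no two connections can receive the same id.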
Sergei Golubchik
a2bcee626d Merge branch '10.0' into 10.1 2015-12-21 21:24:22 +01:00
Monty
c3018b0ff4 Fixes to get all tests to run on Mac OS X Lion 10.7
This includes fixing all utilities to not have any memory leaks,
as safemalloc warnings stopped tests from passing on Mac OS X.

- Ensure that all clients take character-set-dir, as the
  libmysqlclient library will use it.
- mysql-test-run now passes character-set-dir to all external clients.
- Changed dynstr_free() so that it can be called twice (made freeing code easier)
- Changed rpl_global_gtid_slave_state to be allocated dynamically as it
  includes a mutex that needs to be initialized/destroyed before my_end() is called.
- Removed rpl_slave_state::init() and rpl_slave_state::deinit() as
  their jobs are better handled by constructor and delete.
- Print alias instead of table_name in check_duplicate_key as
  table_name may have been converted to lower case.

Other things:
- Fixed a case in time_to_datetime_with_warn() where we were
  using && instead of & in tests
2015-11-29 17:51:23 +02:00
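The `&&` versus `&` mix-up mentioned in c3018b0ff4 is easy to reproduce in any language with both operators. In the Python sketch below, `and` plays the role of C's `&&`; the flag names are hypothetical and unrelated to the flags used in `time_to_datetime_with_warn()`.

```python
FLAG_A = 0x1   # hypothetical flag bits
FLAG_B = 0x2


def has_flag_logical(flags, mask):
    # Mirrors the mistaken `flags && mask`: true whenever both are non-zero,
    # regardless of which bits are set.
    return bool(flags and mask)


def has_flag_bitwise(flags, mask):
    # Mirrors the intended `flags & mask`: true only if a masked bit is set.
    return bool(flags & mask)


wrong = has_flag_logical(FLAG_A, FLAG_B)
right = has_flag_bitwise(FLAG_A, FLAG_B)
```

With only `FLAG_A` set, the logical test wrongly reports that `FLAG_B` is present, while the bitwise test does not.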
Monty
b30a768e7b Fixed failures in rpl_parallel2
Problem was that we used the same condition variable with two different mutexes.
Fixed by changing to use COND_rpl_thread_stop instead of COND_parallel_entry
for stopping threads.

Patch by Kristian Nielsen
2015-11-23 19:58:30 +02:00
Kristian Nielsen
8f2e05f41c Merge branch 'mdev7818-4' into 10.1
Conflicts:
	mysql-test/suite/perfschema/r/stage_mdl_global.result
	sql/rpl_rli.cc
	sql/sql_parse.cc
2015-11-13 14:24:40 +01:00
Kristian Nielsen
ba02550166 MDEV-7818: Deadlock occurring with parallel replication and FTWRL
Problem is that FLUSH TABLES WITH READ LOCK first blocks threads from
starting new commits, then waits for running commits to complete. But
in-order parallel replication needs commits to happen in a particular
order, so this can easily deadlock.

To fix this problem, this patch introduces a way to temporarily pause
the parallel replication worker threads. Before starting FTWRL, we let
all worker threads complete in-progress transactions, and then
wait. Then we proceed to take the global read lock. Once the lock is
obtained, we unpause the worker threads. Now commits are blocked from
starting by the global read lock, so the deadlock will no longer occur.
2015-11-13 14:02:15 +01:00
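The pause/unpause protocol described in ba02550166 can be modelled with a condition variable. The sketch below is a simplified Python illustration under stated assumptions: `WorkerPool` and its methods are hypothetical, and the global read lock itself is represented only by a log entry.

```python
import threading


class WorkerPool:
    """Toy model of pausing parallel replication workers around FTWRL."""

    def __init__(self):
        self._cond = threading.Condition()
        self._paused = False
        self._busy = 0            # transactions currently in progress

    def begin_transaction(self):
        with self._cond:
            while self._paused:   # new transactions wait while paused
                self._cond.wait()
            self._busy += 1

    def end_transaction(self):
        with self._cond:
            self._busy -= 1
            self._cond.notify_all()

    def pause(self):
        with self._cond:
            self._paused = True
            while self._busy > 0:  # let in-progress transactions finish
                self._cond.wait()

    def unpause(self):
        with self._cond:
            self._paused = False
            self._cond.notify_all()


pool = WorkerPool()
log = []


def worker(name):
    pool.begin_transaction()
    log.append(name)
    pool.end_transaction()


pool.pause()                               # nothing in progress, returns at once
t = threading.Thread(target=worker, args=("trx1",))
t.start()                                  # blocks in begin_transaction()
log.append("global read lock taken")
pool.unpause()
t.join()
```

The worker cannot start its transaction until `unpause()` runs, so the lock acquisition is always logged first.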
Kristian Nielsen
6d96fab7dd MDEV-7818: Deadlock occurring with parallel replication and FTWRL
Preparation patch, moving the GCO wait into a separate function, in
preparation for adding a separate wait phase for FLUSH TABLES WITH
READ LOCK.
2015-11-13 14:02:14 +01:00
Kristian Nielsen
75dc267101 Change Seconds_behind_master to be updated only at commit in parallel replication
Before, the Seconds_behind_master was updated already when an event
was queued for a worker thread to execute later. This might lead users
to interpret a low value as the slave being almost up to date with the
master, while in reality there might still be lots and lots of events
still queued up waiting to be applied by the slave.

See https://lists.launchpad.net/maria-developers/msg08958.html for
more detailed discussions.
2015-11-13 10:24:53 +01:00
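The change in 75dc267101 amounts to moving the timestamp update from the queuing step to the commit step. A minimal sketch, assuming the lag is computed as the difference between the current time and the timestamp of the last committed event; `SlaveStatus` and its methods are hypothetical names.

```python
class SlaveStatus:
    """Toy model of when Seconds_behind_master is refreshed."""

    def __init__(self):
        self.last_master_timestamp = 0

    def on_event_queued(self, event_timestamp):
        # Deliberately a no-op: queuing an event for a worker says little
        # about how far the slave has actually applied changes.
        pass

    def on_commit(self, event_timestamp):
        # Only a commit advances the timestamp used for the lag estimate.
        self.last_master_timestamp = max(self.last_master_timestamp,
                                         event_timestamp)

    def seconds_behind_master(self, now):
        return max(0, now - self.last_master_timestamp)


status = SlaveStatus()
status.on_commit(100)
status.on_event_queued(190)                    # queued but not yet applied
lag_while_queued = status.seconds_behind_master(200)
status.on_commit(190)
lag_after_commit = status.seconds_behind_master(200)
```

While the newer event is merely queued, the reported lag stays high; it drops only once that event has been committed.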
Kristian Nielsen
df9b8aee58 Merge MDEV-8193 into 10.1
Conflicts:
	sql/rpl_rli.cc
2015-09-11 12:01:48 +02:00
Kristian Nielsen
51eaa7fe53 MDEV-8193: UNTIL clause in START SLAVE is sporadically disobeyed by parallel replication
The code was using the wrong variable when comparing the binlog name
for the UNTIL position. This could cause the comparison to fail after
binlog rotation, in turn causing the UNTIL clause to not trigger slave
stop.
2015-09-11 10:51:56 +02:00
Sergei Golubchik
b85a00161e MDEV-8264 encryption for binlog
* Start_encryption_log_event
* --encrypt-binlog command line option

based on google patches.
2015-09-04 10:33:55 +02:00
Kristian Nielsen
ef82cb7c2c Merge MDEV-8725 into 10.1 2015-09-02 10:53:37 +02:00
Kristian Nielsen
999c43aeb7 MDEV-8725: Assertion `!(thd->rgi_slave && thd-> rgi_slave->did_mark_start_commit)' failed in ha_rollback_trans
The assertion is there to catch cases where we rollback while
mark_start_commit() is active. This can allow following event groups
to be replicated too early, causing conflicts.

But in this case, we have an _explicit_ ROLLBACK event in the binlog,
which should not assert.

We fix this by delaying the mark_start_commit() in the explicit
ROLLBACK case. It seems safest to delay this in ROLLBACK case anyway,
and there should be no reason to try to optimise this corner case.
2015-09-02 09:57:18 +02:00
Kristian Nielsen
dbd205797b Merge MDEV-8302 into 10.1 2015-08-04 12:39:22 +02:00
Kristian Nielsen
9b9c5e890c MDEV-8302: Duplicate key with parallel replication
This bug is essentially another variant of MDEV-7458.

If a transaction conflict caused a deadlock kill of T2 in record_gtid()
during commit, the code would do a rollback _before_ running
rgi->unmark_start_commit(). This creates a race where following transactions
could start too early (before T2 has completed its transaction retry). This
in turn could lead to replication failure, if there was a conflict that
caused eg. duplicate key error or similar.

The fix is to remove these rollbacks (in Query_log_event::do_apply_event()
and Xid_log_event::do_apply_event()). They seem out-of-place; code in
log_event.cc generally does not roll back on error, this is handled higher
up.

In addition, because of the extreme difficulty of reproducing bugs like
MDEV-7458 and MDEV-8302, this patch adds some extra precautions to try to
detect (in debug builds) or prevent (in release builds) similar bugs.
ha_rollback_trans() will now call unmark_start_commit() if needed (and
assert in debug build when a caller does rollback without unmark first).

We also add an extra check for thd->killed() so that we avoid doing
mark_start_commit() if we already have a pending deadlock kill.

And we add a missing unmark_start_commit() call in the error case, found by
the above assertion.
2015-08-04 11:40:19 +02:00
Kristian Nielsen
903f8dc72d Merge MDEV-8147 into 10.1 2015-05-26 15:03:22 +02:00
Kristian Nielsen
e5f1e841dc MDEV-8147: Assertion `m_lock_type == 2' failed in handler::ha_close() during parallel replication
When the slave processes the master restart format_description event,
parallel replication needs to complete any prior events before processing
the restart event (which closes temporary tables and such stuff).

This happens in wait_for_workers_idle(), however it was not waiting long
enough. The wait was using wait_for_prior_commit(), but at that point tables
can still be open. This led to the assertion in this case.

So change wait_for_workers_idle() to wait until all worker threads have
reached finish_event_group(), at which point all tables should have been
closed.
2015-05-26 13:04:15 +02:00
Sergey Vojtovich
9851a8193f MDEV-8001 - mysql_reset_thd_for_next_command() takes 0.04% in OLTP RO
Removed yet more mysql_reset_thd_for_next_command(). Call
THD::reset_for_next_command() directly instead.
2015-05-13 15:28:34 +04:00
Kristian Nielsen
8bedb638d7 MDEV-8113: Parallel slave: slave hangs on ALTER TABLE (or other DDL) as the first event after slave start
In optimistic parallel replication, it is not safe to try to run a following
transaction in parallel with a DDL statement, and there is code to prevent
this.

However, the code was missing the case where the DDL is the very first event
after slave start. In this case, following transactions could run in
parallel with the DDL, which can cause the slave to hang or even corrupt
the slave in unlucky cases.
2015-05-11 12:43:38 +02:00
Kristian Nielsen
c2dd88ac85 Merge MDEV-8031 into 10.1 2015-04-23 14:40:10 +02:00
Kristian Nielsen
b616991a68 MDEV-8031: Parallel replication stops on "connection killed" error (probably incorrectly handled deadlock kill)
There was a rare race, where a deadlock error might not be correctly
handled, causing the slave to stop with something like this in the error
log:

150423 14:04:10 [ERROR] Slave SQL: Connection was killed, Gtid 0-1-2, Internal MariaDB error code: 1927
150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927
150423 14:04:10 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927
150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927
150423 14:04:10 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001 position 1234

The problem was incorrect error handling. When a deadlock is detected, it
causes a KILL CONNECTION on the offending thread. This error is then later
converted to a deadlock error, and the transaction is retried.

However, the deadlock error was not cleared at the start of the retry, nor
was the lingering kill signal. So it was possible to get another deadlock
kill early during retry. If this happened with particular thread
scheduling/timing, it was possible that the new KILL CONNECTION error was
masked by the earlier deadlock error, so that the second kill was not
properly converted into a deadlock error and retry.

This patch adds code that clears the old error and killed flag before
starting the retry. It also adds code to handle a deadlock kill caught in a
couple of places where it was not handled before.
2015-04-23 14:09:15 +02:00
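The fix in b616991a68 clears stale state at the start of each retry. The Python sketch below models that loop; `WorkerState`, `run_with_retry` and `flaky_transaction` are hypothetical names, and the conversion of a kill into a deadlock error is reduced to a single assignment.

```python
DEADLOCK = "deadlock"


class WorkerState:
    def __init__(self):
        self.killed = False
        self.error = None


def run_with_retry(state, attempt_fn, max_retries=10):
    """Run attempt_fn until it succeeds; return the number of retries used."""
    for retries in range(max_retries + 1):
        # Reset leftovers from the previous attempt so that a new deadlock
        # kill during this attempt is not masked by stale state.
        state.killed = False
        state.error = None
        attempt_fn(state, retries)
        if state.killed:
            state.error = DEADLOCK   # convert the kill into a deadlock error
        if state.error is None:
            return retries
    raise RuntimeError("retried in vain, giving up")


def flaky_transaction(state, retries):
    # Hypothetical workload: deadlock-killed on the first two attempts.
    if retries < 2:
        state.killed = True


state = WorkerState()
retries_used = run_with_retry(state, flaky_transaction)
```

Without the two reset lines at the top of the loop, the `killed` flag from the first attempt would still be set on the third, and the loop would never observe a clean run.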
Kristian Nielsen
167332597f Merge 10.0 -> 10.1.
Conflicts:
	mysql-test/suite/multi_source/multisource.result
	sql/sql_base.cc
2015-04-17 15:18:44 +02:00
Kristian Nielsen
accdabd668 Merge MDEV-7888 and MDEV-7929 into 10.0. 2015-04-08 13:19:22 +02:00
Kristian Nielsen
48c10fb5f7 Merge MDEV-7888 and MDEV-7929 into 10.1. 2015-04-08 11:04:24 +02:00
Kristian Nielsen
3b961347db MDEV-7888, MDEV-7929: Parallel replication hangs sometimes on ANALYZE TABLE or DDL
The hangs occur when the group_commit_orderer object is freed before the last
mark_start_commit() call on it - this loses the wakeup to other waiting worker
threads, causing them to hang until killed manually.

The object was freed because wakeup_subsequent_commits() was called too early
in two places. For MDEV-7888, during ANALYZE TABLE, and for MDEV-7929 during
record_gtid() after processing a DDL event. The group_commit_orderer object
can be freed when its last transaction has called wait_for_prior_commit().

Fix by implementing a suspend/resume mechanism for wakeup_subsequent_commits()
that can be used in places where a transaction is committed without this being
the commit of the actual replication event group.

Also add a protection mechanism (that asserts in debug builds) which can
prevent the too-early free and hang if other similar bugs should remain in
other parts of the code.
2015-04-08 11:01:18 +02:00
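The suspend/resume mechanism from 3b961347db can be pictured as a guard around the wakeup call. This is a minimal sketch: the class `CommitWaker` and its counter are hypothetical, and only the name `wakeup_subsequent_commits` is taken from the commit message.

```python
class CommitWaker:
    """Toy model of suppressing wakeups during intermediate commits."""

    def __init__(self):
        self.suspended = False
        self.wakeups_sent = 0

    def suspend(self):
        self.suspended = True

    def resume(self):
        self.suspended = False

    def wakeup_subsequent_commits(self):
        # While suspended, a commit must not wake up later transactions,
        # because it is not the commit of the replication event group.
        if self.suspended:
            return
        self.wakeups_sent += 1


waker = CommitWaker()
waker.suspend()
waker.wakeup_subsequent_commits()   # intermediate commit: suppressed
waker.resume()
waker.wakeup_subsequent_commits()   # commit of the event group itself
```

Only the second call, made after `resume()`, is counted as a wakeup.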
Kristian Nielsen
f573b65e41 Merge MDEV-7847 and MDEV-7882 into 10.0.
Conflicts:
	mysql-test/suite/rpl/r/rpl_parallel.result
	sql/rpl_parallel.cc
2015-03-30 15:10:29 +02:00
Kristian Nielsen
c41e4d3b49 Merge MDEV-7847 and MDEV-7882 into 10.0.
Conflicts:
	mysql-test/suite/rpl/r/rpl_parallel.result
	mysql-test/suite/rpl/t/rpl_parallel.test
2015-03-30 14:51:25 +02:00
Kristian Nielsen
880f2273fd MDEV-7847: "Slave worker thread retried transaction 10 time(s) in vain, giving up", followed by replication hanging
This patch fixes a bug in the error handling in parallel replication, when one
worker thread gets a failure and other worker threads processing later
transactions have to rollback and abort.

The problem was with the lifetime of group_commit_orderer objects (GCOs).
A GCO is freed when we register that its last event group has committed. This
relies on register_wait_for_prior_commit() and wait_for_prior_commit() to
ensure that the fact that T2 has committed implies that any earlier T1 has
also committed, and can thus no longer execute mark_start_commit().

However, in the error case, the code was skipping the
register_wait_for_prior_commit() and wait_for_prior_commit() calls. Thus
commit ordering was not guaranteed, and a GCO could be freed too early. Then a
later mark_start_commit() would reference deallocated GCO, which could lead to
lost wakeup (causing slave threads to hang) or other corruption.

This patch makes the error case respect commit order as well. This way, the
error case also gets the GCO lifetime correct, and the hang no longer occurs.
2015-03-30 14:33:44 +02:00
Kristian Nielsen
a4082918c8 MDEV-7882: Excessive transaction retry in parallel replication
When a transaction in parallel replication needs to retry (eg. because of
deadlock kill), first wait for all prior transactions to commit before doing
the retry. This way, we avoid the retry once again conflicting with a prior
transaction, requiring yet another retry.

Without this patch, we saw "in the wild" that transactions had to be retried
more than 10 times to succeed, which exceeds the default
--slave_transaction_retries value and is in any case undesirable.

(We already do this in 10.1 in "optimistic" parallel replication mode; this
patch just makes the code use the same logic for "conservative" mode, the
only mode in 10.0.)
2015-03-30 14:16:57 +02:00
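The ordering rule introduced by a4082918c8, wait for prior commits before retrying, can be sketched with events. The Python model below is an illustration only: the `Transaction` class is hypothetical, and just the method name `wait_for_prior_commit` echoes the server function of the same name.

```python
import threading


class Transaction:
    def __init__(self, name, prior=None):
        self.name = name
        self.prior = prior                 # the transaction that must commit first
        self.committed = threading.Event()

    def wait_for_prior_commit(self):
        if self.prior is not None:
            self.prior.committed.wait()

    def retry(self, log):
        # Waiting first means the retry cannot conflict with the prior
        # transaction again, since that transaction has already committed.
        self.wait_for_prior_commit()
        log.append("retry " + self.name)
        self.committed.set()


log = []
t1 = Transaction("T1")
t2 = Transaction("T2", prior=t1)

retry_thread = threading.Thread(target=t2.retry, args=(log,))
retry_thread.start()          # blocks until T1 has committed
log.append("commit T1")
t1.committed.set()
retry_thread.join()
```

The retry of T2 is always logged after the commit of T1, whatever the thread scheduling.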
Kristian Nielsen
bd2ae787ea MDEV-7825: Parallel replication race condition on gco->flags, possibly resulting in slave hang
The patch for optimistic parallel replication as a memory optimisation moved
the gco->installed field into a bit in gco->flags. However, that is just plain
wrong. The gco->flags field is owned by the SQL driver thread, but
gco->installed is used by the worker threads, so this will cause a race
condition.

The user-visible problem might be conflicts between transactions and/or slave
threads hanging.

So revert this part of the optimistic parallel replication patch, going back
to using a separate field gco->installed like in 10.0.
2015-03-24 16:33:51 +01:00
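The race described in bd2ae787ea is the classic lost update on a shared word: two threads each read the flags, set their own bit, and write the whole word back. The Python sketch below replays one such interleaving deterministically, without real threads. `FLAG_FORCE_NEW` is a hypothetical second flag, and which thread sets which bit is an assumption of this example.

```python
FLAG_INSTALLED = 0x1
FLAG_FORCE_NEW = 0x2   # hypothetical flag owned by the SQL driver thread


def interleaved_shared_word():
    """Both 'threads' update bits in the same flags word."""
    flags = 0
    driver_copy = flags                      # driver loads the word
    worker_copy = flags                      # worker loads the same old value
    flags = driver_copy | FLAG_FORCE_NEW     # driver stores its bit
    flags = worker_copy | FLAG_INSTALLED     # worker's store overwrites it
    return flags


def interleaved_separate_fields():
    """Same interleaving, but 'installed' lives in its own field."""
    flags = 0
    installed = False
    driver_copy = flags
    installed = True                         # worker touches only its own field
    flags = driver_copy | FLAG_FORCE_NEW
    return flags, installed
```

In the first function the driver's bit is lost; in the second, both updates survive because no thread rewrites data owned by the other.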
Kristian Nielsen
ed04c40b01 MDEV-5289: master server starts slave parallel threads
Delay spawning parallel replication worker threads until a slave SQL
thread is running, and de-spawn them when the last SQL thread stops.

This is especially useful to avoid needless threads on a master in a
setup where same my.cnf is used on masters and slaves.
2015-03-11 09:18:16 +01:00
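The lifecycle change in ed04c40b01 is essentially reference counting: the worker pool exists only while at least one SQL thread is running. A minimal sketch, with a hypothetical `WorkerPoolManager` class and worker names used purely as placeholders:

```python
class WorkerPoolManager:
    """Toy model: spawn workers with the first SQL thread, drop with the last."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.running_sql_threads = 0
        self.workers = []                    # empty until an SQL thread starts

    def sql_thread_started(self):
        self.running_sql_threads += 1
        if self.running_sql_threads == 1:    # first SQL thread: spawn the pool
            self.workers = ["worker-%d" % i for i in range(self.pool_size)]

    def sql_thread_stopped(self):
        self.running_sql_threads -= 1
        if self.running_sql_threads == 0:    # last SQL thread: de-spawn
            self.workers = []


manager = WorkerPoolManager(pool_size=4)
idle_count = len(manager.workers)            # a pure master never spawns any
manager.sql_thread_started()
manager.sql_thread_started()
manager.sql_thread_stopped()
active_count = len(manager.workers)          # one SQL thread still running
manager.sql_thread_stopped()
final_count = len(manager.workers)
```

A server that never starts an SQL thread, such as a master sharing its my.cnf with the slaves, keeps an empty pool throughout.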
Sergei Golubchik
2db62f686e Merge branch '10.0' into 10.1 2015-03-07 13:21:02 +01:00
Kristian Nielsen
95d7208859 Merge MDEV-6589 and MDEV-6403 into 10.1.
Conflicts:
	sql/log.cc
	sql/rpl_rli.cc
	sql/sql_repl.cc
2015-03-04 13:49:37 +01:00
Kristian Nielsen
3ef0b9b235 Merge MDEV-6589 and MDEV-6403 into 10.0. 2015-03-04 13:36:54 +01:00