Commit graph

733 commits

Author SHA1 Message Date
Jan Lindström
468e56bfde Add missing includes. 2020-07-24 19:25:32 +03:00
Teemu Ollakka
1e2a4ed7ed MDEV-21718 Assertion in wsrep::client_state::before_command().
An assertion

  `server_state_.rollback_mode() == wsrep::server_state::rm_async`

fired in before_command() when
- thread-handling was set to pool-of-threads and
- a BF abort happened between client session calls to
  wait_rollback_complete_and_acquire_ownership() and before_command().

This commit introduces a test case to reproduce the crash and
updates wsrep-lib submodule to fixed version.
2020-07-24 13:26:21 +03:00
Jan Lindström
134a6a8d2f Silence unnecessary warning. 2020-07-24 12:05:39 +03:00
sjaakola
95132ade6d MDEV-20928 mtr test galera.galera_var_innodb_disallow_writes test failure
The sporadic test hangs happen because of a mutex deadlock between innodb
background threads and two test connection executions.
The test sets variable innodb_disallow_writes, which blocks all writes
to the filesystem. The test logic is to execute an INSERT, which should hang
because filesystem writes are blocked, and through another session
verify by SELECT that this hang happens. The SELECT session will then
release the innodb_disallow_writes blocking.

However, filesystem write blocking also affects innodb background threads
and they may hang while keeping some other resources locked.
As an example, in one test hang situation, buffer pool access was blocked.
And, if buffer pool is blocked, the test connections will be blocked as well,
and the SELECT session will not be able to continue to release the
innodb_disallow_writes.

The fix in this commit is refactoring of the test logic.
The test will now set first innodb_disallow_writes blocking, and then record
a hash of data directory's filesystem contents. This works as checksum of the
state of data on the datadirectory.

Then some SQL load is tried on both nodes, these sessions will be blocking
due to frozen file system state. The test will have a short sleep to allow
innodb background threads to loop and possibly encounter innodb_disallow_writes
blocking as well.

After the sleep, the test will record the file system checksum for the second time,
and then release the innodb_disallow_writes blocking.

Finally, the two checksums are compared, they should be identical to verify that
nothing was written on datadirectory during the test execution.

The checksum is implemented by an md5sum hash over all files found in the datadirectory
by the find command. All these file hashes are hashed together by one more md5sum.

The test therefore depends on md5sum and find. find may work differently with some
OS distributions, e.g. freebsd may be problematic.
2020-07-24 12:05:39 +03:00
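The data directory checksum described in the MDEV-20928 message above can be sketched in a few lines. This is only an illustration, not the test's actual implementation: the commit uses the md5sum and find commands, while the sketch below swaps in Python's hashlib and os.walk, and the function name datadir_checksum is hypothetical.

```python
import hashlib
import os
import tempfile


def datadir_checksum(datadir: str) -> str:
    """Hash every regular file under datadir, then hash the per-file digests.

    Hypothetical helper: mirrors the idea of piping per-file md5sum output
    into one more md5sum, but is not the code used by the mtr test.
    """
    outer = hashlib.md5()
    for root, dirs, files in os.walk(datadir):
        dirs.sort()  # fix the traversal order so the result is reproducible
        for name in sorted(files):
            path = os.path.join(root, name)
            inner = hashlib.md5()
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(65536), b""):
                    inner.update(chunk)
            # Include the relative path so renames change the checksum too.
            rel = os.path.relpath(path, datadir)
            outer.update(f"{inner.hexdigest()}  {rel}\n".encode())
    return outer.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "ibdata1"), "wb") as fh:
            fh.write(b"page")
        before = datadir_checksum(d)
        # ... SQL load would be attempted here while writes are blocked ...
        after = datadir_checksum(d)
        print(before == after)
```

Sorting the directory and file names sidesteps the ordering differences between find implementations that the commit message warns about.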
mkaruza
4b4372af6a MDEV-22458: Server with WSREP hangs after INSERT, wrong usage of mutex 'LOCK_thd_data' and 'share->intern_lock' / 'lock->mutex'
Add `find_thread_by_id_with_thd_data_lock` which will be used only when killing thread.
This version needs to take `thd->LOCK_thd_data` lock.
2020-07-24 12:05:39 +03:00
Jan Lindström
8c7f7bae47 Fix regex on test. 2020-07-22 08:48:14 +03:00
Julius Goryavsky
956f21c3b0 Merge remote-tracking branch 'origin/bb-10.4-MDEV-21910' into 10.4 2020-07-16 13:03:29 +02:00
Julius Goryavsky
df1846aeea Merge branch '10.4-MDEV-18838' of https://github.com/codership/mariadb-server into 10.4-MDEV-22966 2020-07-14 09:36:38 +02:00
Julius Goryavsky
1bf863a91a Merge branch '10.4-MDEV-22222' of https://github.com/codership/mariadb-server into 10.4-MDEV-22222 2020-07-03 16:17:59 +02:00
MikkoJaakola
7b8319f3f1 MDEV-22966- Hang on galera_toi_truncate test case
galera_toi_truncate test launches a long term INSERT statement in node 2,
and then submits an offending TRUNCATE through node 1. The idea is that
the replicated TRUNCATE will conflict with INSERT in node 2, and force
the INSERT to abort.

The test first issues --send INSERT in node 2, and then switches to node 1
to launch --send TRUNCATE. As the INSERT is launched asynchronously by --send,
it may happen that INSERT has not yet started to process, before the TRUNCATE
is replicated. The net effect may be that TRUNCATE processes to completion in node 2,
and only after that INSERT starts to execute. As the INSERT is a very long query,
it will last longer than the mtr test suite max test time, and the test will fail with a timeout.

The fix in this commit uses another connection in node 2, to wait until the INSERT
has started to process in node 2. TRUNCATE in node 1, will be submitted in node 1
after this wait condition.
2020-07-02 10:59:00 +03:00
Marko Mäkelä
f347b3e0e6 Merge 10.3 into 10.4 2020-07-02 07:39:33 +03:00
Marko Mäkelä
1df1a63924 Merge 10.2 into 10.3 2020-07-02 06:17:51 +03:00
Marko Mäkelä
ea2bc974dc Merge 10.1 into 10.2 2020-07-01 12:03:55 +03:00
Julius Goryavsky
8e8f9671cb MDEV-21773: added missing include file to mtr tests 2020-06-30 14:03:22 +02:00
mkaruza
2b8b7394a1 MDEV-22222: Assertion `state() == s_executing || state() == s_preparing || state() == s_prepared || state() == s_must_abort || state() == s_aborting || state() == s_cert_failed || state() == s_must_replay' failed in wsrep::transaction::before_rollback
LOCK TABLE will do implicit commit, we need to properly handle transaction after commit.
2020-06-28 23:07:41 +02:00
sjaakola
5a7794d3a8 MDEV-21910 Deadlock between BF abort and manual KILL command
When high priority replication slave applier encounters lock conflict in innodb,
it will force the conflicting lock holder transaction (victim) to rollback.
This is a must in the multi-master synchronous replication model to avoid cluster lock-up.
This high priority victim abort (aka "brute force" (BF) abort), is started
from innodb lock manager while holding the victim's transaction's (trx) mutex.
Depending on the execution state of the victim transaction, it may happen that the
BF abort will call for THD::awake() to wake up the victim transaction for the rollback.
Now, if BF abort requires THD::awake() to be called, then the applier thread executed
locking protocol of: victim trx mutex -> victim THD::LOCK_thd_data

If, at the same time another DBMS super user issues KILL command to abort the same victim,
it will execute locking protocol of: victim THD::LOCK_thd_data  -> victim trx mutex.
These two locking protocol acquire mutexes in opposite order, hence unresolvable mutex locking
deadlock may occur.

The fix in this commit adds THD::wsrep_aborter flag to synchronize who can kill the victim.
This flag is set both when BF is called for from innodb and by KILL command.
Either path of victim killing will bail out if victim's wsrep_killed is already
set to avoid mutex conflicts with the other aborter execution. THD::wsrep_aborter
records the aborter THD's ID. This is needed to preserve the right to kill
the victim from different locations for the same aborter thread.
It is also good for error logging, to see who is responsible for the abort.

A new test case was added in galera.galera_bf_kill_debug.test for scenario where
wsrep applier thread and manual KILL command try to kill same idle victim
2020-06-26 09:56:23 +03:00
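The aborter flag idea from the MDEV-21910 message above can be illustrated with a small sketch. It is not MariaDB code: the Victim class, its guard lock and the claim_abort method are hypothetical stand-ins, and the thread IDs 11 and 22 are made up. The point shown is only that whichever path records its ID first keeps the right to kill, and the other path bails out instead of competing for the victim's mutexes.

```python
import threading


class Victim:
    """Hypothetical stand-in for the victim THD; not MariaDB's structure."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self.aborter = 0  # 0 means nobody has claimed the abort yet

    def claim_abort(self, aborter_id: int) -> bool:
        """Return True if aborter_id may proceed with killing this victim."""
        with self._guard:
            if self.aborter not in (0, aborter_id):
                return False  # another thread is already aborting: bail out
            self.aborter = aborter_id  # remember who owns the abort
            return True


victim = Victim()
results = {}


def attempt(aborter_id: int) -> None:
    results[aborter_id] = victim.claim_abort(aborter_id)


# One thread plays the BF aborter, the other the manual KILL (IDs assumed).
threads = [threading.Thread(target=attempt, args=(tid,)) for tid in (11, 22)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results.values()))  # prints 1: exactly one aborter wins
# The winner may claim again from a different code location.
print(victim.claim_abort(victim.aborter))  # prints True
```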
Julius Goryavsky
141b390d82 Merge branch '10.4-MDEV-22729-2' into 10.4 2020-06-25 13:06:51 +02:00
Jan Lindström
bffa8264aa Stabilize galera_var_cluster_conf_id test case. 2020-06-24 17:16:38 +03:00
sjaakola
33de71c2f8 MDEV-22632 wsrep XID checkpointing can happen out of order for certification failure
When a transaction fails in the certification phase, it has consumed one GTID, but as
the transaction must roll back, it will not go for commit ordering, and because of this
the wsrep XID checkpointing can also happen out of order.
This PR will make the thread that has failed certification wait for its
commit order turn before checkpointing the wsrep XID in the innodb rollback segment.

There is a specific test for wsrep XID checkpointing ordering in mtr test:
mysql-wsrep-bugs-607, which is added in this PR.

Test galera_slave_replay depends also on this fix, as the second test phase
may also assert for bad wsrep XID checkpointing order.
galera_slave_replay.test also had other problems, which caused the test to
fail immediately; these are now fixed in this PR as well.
2020-06-24 17:16:38 +03:00
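The ordering rule in the MDEV-22632 message above (a transaction that failed certification still waits for its commit order turn before checkpointing) can be sketched as a simple turnstile. This is an assumption-laden illustration, not the wsrep implementation: the CommitOrder class and its enter and leave methods are hypothetical names, and the sequence numbers are made up.

```python
import threading


class CommitOrder:
    """Hypothetical turnstile: seqno n may proceed only after n-1 has left."""

    def __init__(self, first_seqno: int = 1) -> None:
        self._cond = threading.Condition()
        self._next = first_seqno

    def enter(self, seqno: int) -> None:
        with self._cond:
            self._cond.wait_for(lambda: self._next == seqno)

    def leave(self, seqno: int) -> None:
        with self._cond:
            self._next = seqno + 1
            self._cond.notify_all()


order = CommitOrder()
checkpoints = []


def finish(seqno: int, certified: bool) -> None:
    # Committing and certification-failed transactions both wait their turn,
    # so the checkpoint sequence never goes backwards.
    order.enter(seqno)
    checkpoints.append((seqno, "commit" if certified else "rollback"))
    order.leave(seqno)


# Threads start in reverse order; seqno 2 plays the certification failure.
threads = [threading.Thread(target=finish, args=(s, s != 2)) for s in (3, 2, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(checkpoints)  # [(1, 'commit'), (2, 'rollback'), (3, 'commit')]
```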
Jan Lindström
9fb8d87d2d Test fixes. 2020-06-24 09:38:54 +03:00
Julius Goryavsky
7bd11fb46f MDEV-22729: additional changes after merge 2020-06-23 12:56:08 +02:00
Jan Lindström
eba9189777 Test case cleanups. 2020-06-23 07:46:35 +03:00
MikkoJaakola
51c8289ed6 MDEV-21759 galera.galera_parallel_autoinc_manytrx sporadic failures.
The galera.galera_parallel_autoinc_manytrx mtr test opens and runs test
scenario through 3 connections to node 1 and one connection to node 2.
In the test initialization phase, the test creates two tables 't1' and 'ten'
and then creates a stored procedure 'p1' to operate on these tables.
These 3 create DDL statements are issued through same connection to node 1.

In the next test phase, the mtr script uses send command to launch the call
for the p1 stored procedure through all 3 connections to node 1 and through
one connection to node 2. As the mtr send command is asynchronous,
this test phase is non blocking and fast operation.
Now, if the replication between nodes is slow, it may happen that the
initialization phase DDL statements have not been received or have not been
fully applied in node 2. Therefore there is no guarantee that the test tables
and the stored procedure have been created in node 2. Yet, the test is trying
to call p1 in node 2.

In the failure case error logs, there is error message
"MTR failed: query 'reap' failed: 1305: PROCEDURE test.p1 does not exist"

The reap command through connection to node 2, is the first place where test
execution may observe that test tables and/or stored procedure are not yet
created in node 2.

The fix in this commit adds a wait condition in connection to node 2, to wait
until the stored procedure is created before calling the stored procedure.
The wait is implemented by looking in information_schema.routines for the p1
stored procedure.
2020-06-23 07:46:35 +03:00
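The wait condition described in the MDEV-21759 message above is a poll-until-true loop with a timeout. The sketch below shows that pattern in isolation. The wait_condition function, its parameters and the ROUTINE_EXISTS_QUERY text are assumptions for illustration; the actual mtr test uses its own include file and query, which are not reproduced in the message.

```python
import time
from typing import Callable

# Assumed shape of the check; the real test's query text is not shown above.
ROUTINE_EXISTS_QUERY = (
    "SELECT COUNT(*) = 1 FROM information_schema.routines "
    "WHERE routine_schema = 'test' AND routine_name = 'p1'"
)


def wait_condition(check: Callable[[], bool], timeout: float = 30.0,
                   interval: float = 0.1) -> bool:
    """Poll check() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    polls = []

    def procedure_visible() -> bool:
        # Stand-in for running ROUTINE_EXISTS_QUERY on node 2: pretend the
        # replicated CREATE PROCEDURE becomes visible on the third poll.
        polls.append(1)
        return len(polls) >= 3

    print(wait_condition(procedure_visible, timeout=5.0, interval=0.01))
```

Only after the loop returns True would the test go on to call p1 on node 2.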
Jan Lindström
5d7e067cce MDEV-22125 : galera.galera_drop_multi MTR failed: InnoDB: MySQL is trying to drop database fts.`` though there are still open handles
MDEV-22140 galera.galera_drop_database MTR failed: InnoDB: MySQL is trying to drop database `fts`.`` though there are still open handles

Add wait conditions to wait that all operations are done in both
nodes.
2020-06-23 07:46:35 +03:00
Jan Lindström
319886eca7 MDEV-20928 : Galera test failure on galera.galera_var_innodb_disallow_writes: Result length mismatch
Add wait_conditions to force desired execution.
2020-06-23 07:46:35 +03:00
Jan Lindström
b80b52394d Test case cleanups. 2020-06-22 13:25:25 +03:00
Julius Goryavsky
4b4e77db64 Merge branch '10.4-MDEV-22729' of https://github.com/codership/mariadb-server into 10.4-MDEV-22729-2 2020-06-19 18:01:15 +02:00
MikkoJaakola
0128e13e62 MDEV-21759 galera.galera_parallel_autoinc_manytrx sporadic failures.
The galera.galera_parallel_autoinc_manytrx mtr test opens and runs test
scenario through 3 connections to node 1 and one connection to node 2.
In the test initialization phase, the test creates two tables 't1' and 'ten'
and then creates a stored procedure 'p1' to operate on these tables.
These 3 create DDL statements are issued through same connection to node 1.

In the next test phase, the mtr script uses send command to launch the call
for the p1 stored procedure through all 3 connections to node 1 and through
one connection to node 2. As the mtr send command is asynchronous,
this test phase is non blocking and fast operation.
Now, if the replication between nodes is slow, it may happen that the
initialization phase DDL statements have not been received or have not been
fully applied in node 2. Therefore there is no guarantee that the test tables
and the stored procedure have been created in node 2. Yet, the test is trying
to call p1 in node 2.

In the failure case error logs, there is error message
"MTR failed: query 'reap' failed: 1305: PROCEDURE test.p1 does not exist"

The reap command through connection to node 2, is the first place where test
execution may observe that test tables and/or stored procedure are not yet
created in node 2.

The fix in this commit adds a wait condition in connection to node 2, to wait
until the stored procedure is created before calling the stored procedure.
The wait is implemented by looking in information_schema.routines for the p1
stored procedure.
2020-06-16 11:43:31 +03:00
Jan Lindström
7710f28eec Add missing include as test requires galera debug library 2020-06-15 09:29:17 +03:00
Marko Mäkelä
b3e395a13e Merge 10.2 into 10.3 2020-06-06 18:50:25 +03:00
Julius Goryavsky
5f55f69e4a Merge 10.1 into 10.2 2020-06-05 18:32:37 +02:00
Julius Goryavsky
3f019d1771 Added missing include files to check for debug_sync 2020-06-03 15:34:44 +02:00
sjaakola
8ec0e9111a MDEV-22763 backporting MDEV-20225 fix into 10.1
Backported the support for aborting and replaying stored procedure and fix for trigger
key assignments from 10.4 version.
Backported also two mtr tests: wsrep_sp_bf_abort and MDEV-20225
2020-06-03 15:34:44 +02:00
sjaakola
ccec6b887b MDEV-22729 fixes for galera.galera_slave_replay test
The test was changing wsrep_on option in node_3, which is native
MariaDB server (i.e. not a cluster node). Native MariaDB server
should not manipulate wsrep replication state, this problem is fixed.

galera.galera_slave_replay test phase 2 will cause certification failure
for async slave SQL handler thread. This certification failure is now
monitored and required to happen in the test.

The test phase 2, generates scenario, where async slave SQL handler faces
certification failure and galera slave applier is paused when this happens.
This makes the test vulnerable for anomaly described in MDEV-22632.
Therefore the fix in this commit depends on MDEV-22632, and should be merged
after the fix for MDEV-22632.
2020-05-27 21:21:24 +03:00
Julius Goryavsky
e04999c460 Forgotten include files were added to check the necessary conditions for running the test 2020-05-26 14:01:13 +02:00
sjaakola
1af6e92f0b MDEV-22666 galera.MW-328A hang
The hang can happen between a lock connection issuing KILL CONNECTION for a victim,
which is in committing phase.
A two-resource deadlock happens, where the killer is holding the victim's
LOCK_thd_data and requires trx mutex for the victim.
The victim, otoh, holds his own trx mutex, but requires LOCK_thd_data
in wsrep_commit_ordered(). Hence a classic two thread deadlock happens.

The fix in this commit changes innodb commit so that wsrep_commit_ordered()
is not called while holding trx mutex. With this, wsrep patch commit time mutex
locking does not violate the locking protocol of KILL command
(i.e. LOCK_thd_data -> trx mutex)

Also, a new test case has been added in galera.galera_bf_kill.test for scenario
where a client connection is killed in committing phase.
2020-05-25 19:30:23 +03:00
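The locking change in the MDEV-22666 message above comes down to never waiting for LOCK_thd_data while still holding the trx mutex. The sketch below illustrates that rule with two plain locks. It is not InnoDB code: the lock objects, the function names and the number of rounds are hypothetical, and the real commit path does far more work than shown.

```python
import threading

lock_thd_data = threading.Lock()  # stand-in for the victim's LOCK_thd_data
trx_mutex = threading.Lock()      # stand-in for the victim's trx mutex
log = []
ROUNDS = 200  # arbitrary repeat count for the demonstration


def kill_connection() -> None:
    # KILL path: LOCK_thd_data first, then the trx mutex.
    with lock_thd_data:
        with trx_mutex:
            log.append("kill")


def commit() -> None:
    # Commit path after the fix: finish the work under the trx mutex,
    # release it, and only then run the step that needs LOCK_thd_data.
    with trx_mutex:
        log.append("commit: trx work")
    with lock_thd_data:
        log.append("commit: ordered step")


stuck = False
for _ in range(ROUNDS):
    pair = [threading.Thread(target=kill_connection),
            threading.Thread(target=commit)]
    for t in pair:
        t.start()
    for t in pair:
        t.join(timeout=5)
        stuck = stuck or t.is_alive()

print(stuck)  # prints False: neither path waits while holding the other lock
```

Taking LOCK_thd_data inside the `with trx_mutex` block of commit() would recreate the opposite-order acquisition that the message describes.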
Marko Mäkelä
d8dc3c72b6 Merge 10.3 into 10.4 2020-05-20 12:25:23 +03:00
Marko Mäkelä
f4f0ef3e37 Merge 10.2 into 10.3 2020-05-20 11:41:51 +03:00
Jan Lindström
ad0f85bcd2 MDEV-18838 : galera.galera_toi_truncate: Test failure: mysqltest: query 'reap' succeeded - should have failed with errno 1213
Test cleanup.
2020-05-20 09:34:50 +03:00
Jan Lindström
fde94b4cd6 MDEV-21483 : Galera MTR tests failed: galera.MW-328A galera.MW-328B
Enable tests with additional galera output to find out actual
reason for test failures.
2020-05-18 14:21:12 +03:00
Jan Lindström
523d67a272 MDEV-22494 : Galera assertion lock_sys.mutex.is_owned() at lock_trx_handle_wait_low
Problem was that trx->lock.was_chosen_as_wsrep_victim variable was
not set back to false after it was set true.

wsrep_thd_bf_abort
	Add assertions for correct mutex status and take necessary
	mutexes before calling thd->awake_no_mutex().

innobase_rollback_trx()
	Reset trx->lock.was_chosen_as_wsrep_victim

wsrep_abort_slave_trx()
	Removed unused function.

wsrep_innobase_kill_one_trx()
	Added function comment, removed unnecessary parameters
	and added debug assertions to enforce correct usage. Added
	more debug output to help out on error analysis.

wsrep_abort_transaction()
	Added debug assertions and removed unused variables.

trx0trx.h
	Removed assert_trx_is_free macro and replaced it with
	assert_freed() member function.

trx_create()
	Use above assert_freed() and initialize wsrep variables.

trx_free()
	Use assert_freed()

trx_t::commit_in_memory()
	Reset lock.was_chosen_as_wsrep_victim

trx_rollback_for_mysql()
	Reset trx->lock.was_chosen_as_wsrep_victim

Add test case galera_bf_kill
2020-05-15 09:04:02 +03:00
Marko Mäkelä
38f6c47f8a Merge 10.3 into 10.4 2020-05-13 12:52:57 +03:00
Marko Mäkelä
15fa70b840 Merge 10.2 into 10.3 2020-05-13 11:45:05 +03:00
Jan Lindström
748fb55093 MDEV-21483 : Galera MTR tests failed: galera.MW-328A galera.MW-328B
Enable tests with additional galera output to find out actual
reason for test failures.
2020-05-08 11:35:15 +03:00
Jan Lindström
a878344ee5 MDEV-21421 : Galera test sporadic failure on galera.galera_as_slave_gtid_myisam: Result length mismatch
Add wait_condition so that drop table has time to replicate to
Galera cluster.
2020-05-08 09:16:37 +03:00
Jan Lindström
40d0b64167 MDEV-21421 : Galera test sporadic failure on galera.galera_as_slave_gtid_myisam: Result length mismatch
Add wait_condition so that drop table has time to replicate to
Galera cluster.
2020-05-08 09:13:47 +03:00
Jan Lindström
057a700a2a MDEV-22466 : Galera missing .test or .result files
Add missing .test and .result files.
2020-05-07 14:23:33 +03:00
Jan Lindström
e6301d8f67 MDEV-21515 : Galera test sporadic failure on galera.galera_wsrep_new_cluster: Result content mismatch
The test starts two servers and we do not know the order in which they really start,
thus wsrep_local_index can be 1 or 2.
2020-05-06 17:32:08 +03:00
Marko Mäkelä
2c3c851d2c Merge 10.3 into 10.4 2020-05-05 20:33:10 +03:00
Jan Lindström
37a01aceca MDEV-21489 : wsrep_cluster_conf_id has wrong value
Do not show the exact value, as it depends on the order of test execution.
Instead use # for correct values and ERROR for incorrect.
2020-05-05 09:48:03 +03:00