mariadb/mysql-test/suite/rpl/t/rpl_parallel_deadlock_victim2.test
Marko Mäkelä ddd7d5d8e3 MDEV-24035 Failing assertion: UT_LIST_GET_LEN(lock.trx_locks) == 0 causing disruption and replication failure
Under unknown circumstances, the SQL layer may wrongly disregard an
invocation of thd_mark_transaction_to_rollback() when an InnoDB
transaction had been aborted (rolled back) due to one of the following errors:
* HA_ERR_LOCK_DEADLOCK
* HA_ERR_RECORD_CHANGED (if innodb_snapshot_isolation=ON)
* HA_ERR_LOCK_WAIT_TIMEOUT (if innodb_rollback_on_timeout=ON)

Such an error used to cause a crash of InnoDB during transaction commit.
These changes aim to catch and report the error earlier, so that not only
this crash can be avoided but also the original root cause be found and
fixed more easily later.

The idea of this fix is from Michael 'Monty' Widenius.

HA_ERR_ROLLBACK: A new error code that will be translated into
ER_ROLLBACK_ONLY, signalling that the current transaction
has been aborted and the only allowed action is ROLLBACK.

trx_t::state: Add TRX_STATE_ABORTED that is like
TRX_STATE_NOT_STARTED, but noting that the transaction had been
rolled back and aborted.

trx_t::is_started(): Replaces trx_is_started().

ha_innobase: Check the transaction state in various places.
Simplify the logic around SAVEPOINT.

ha_innobase::is_valid_trx(): Replaces ha_innobase::is_read_only().

The InnoDB logic around transaction savepoints, commit, and rollback
was unnecessarily complex and might have contributed to this
inconsistency. So, we are simplifying that logic as well.

trx_savept_t: Replace with const undo_no_t*. When we rollback to
a savepoint, all we need to know is the number of undo log records
that must survive.

trx_named_savept_t, DB_NO_SAVEPOINT: Remove. We can store undo_no_t
directly in the space allocated at innobase_hton->savepoint_offset.

fts_trx_create(): Do not copy previous savepoints.

fts_savepoint_rollback(): If a savepoint was not found, roll back
everything after the default savepoint of fts_trx_create().
The test innodb_fts.savepoint is extended to cover this code.

Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
2024-12-12 18:02:00 +02:00

87 lines
3.1 KiB
Text

--source include/master-slave.inc
--source include/have_innodb.inc
--source include/have_debug.inc
--source include/have_binlog_format_statement.inc
--disable_query_log
call mtr.add_suppression("InnoDB: Transaction was aborted due to ");
--enable_query_log
--connection master
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
CREATE TABLE t1(a INT) ENGINE=INNODB;
INSERT INTO t1 VALUES(1);
--source include/save_master_gtid.inc
--connection slave
--source include/sync_with_master_gtid.inc
--source include/stop_slave.inc
--let $save_transaction_retries= `SELECT @@global.slave_transaction_retries`
--let $save_slave_parallel_threads= `SELECT @@global.slave_parallel_threads`
--let $save_slave_parallel_mode= `SELECT @@global.slave_parallel_mode`
set @@global.slave_parallel_threads= 2;
set @@global.slave_parallel_mode= OPTIMISTIC;
set @@global.slave_transaction_retries= 2;
--echo *** MDEV-28776: rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master
# This was a failure where a transaction T1 could deadlock multiple times
# with T2, eventually exceeding the default --slave-transaction-retries=10.
# Root cause was MDEV-31655, causing InnoDB to wrongly choose T1 as deadlock
# victim over T2. If thread scheduling is right, it was possible for T1 to
# repeatedly deadlock, roll back, and have time to grab an S lock again before
# T2 woke up and got its waiting X lock, thus repeating the same deadlock over
# and over.
# Once the bug is fixed, it is not possible to re-create the same execution
# and thread scheduling. Instead we inject small sleeps in a way that
# triggered the problem when the bug was there, to demonstrate that the
# problem no longer occurs.
--connection master
# T1
SET @@gtid_seq_no= 100;
INSERT INTO t1 SELECT 1+a FROM t1;
# T2
SET @@gtid_seq_no= 200;
INSERT INTO t1 SELECT 2+a FROM t1;
SELECT * FROM t1 ORDER BY a;
--source include/save_master_gtid.inc
--connection slave
SET @save_dbug= @@GLOBAL.debug_dbug;
# Inject various delays to hint thread scheduling to happen in the way that
# triggered MDEV-28776.
# Small delay starting T1 so it will be the youngest trx and be chosen over
# T2 as the deadlock victim by default in InnoDB.
SET GLOBAL debug_dbug="+d,rpl_parallel_delay_gtid_0_x_100_start";
# Small delay before taking insert X lock to give time for both T1 and T2 to
# get the S lock first and cause a deadlock.
SET GLOBAL debug_dbug="+d,rpl_write_record_small_sleep_gtid_100_200";
# Small delay after T2's wait on the X lock, to give time for T1 retry to
# re-aquire the T1 S lock first.
SET GLOBAL debug_dbug="+d,small_sleep_after_lock_wait";
# Delay deadlock kill of T2.
SET GLOBAL debug_dbug="+d,rpl_delay_deadlock_kill";
--source include/start_slave.inc
--source include/sync_with_master_gtid.inc
SET GLOBAL debug_dbug= @save_dbug;
SELECT * FROM t1 ORDER BY a;
# Cleanup.
--connection slave
--source include/stop_slave.inc
eval SET @@global.slave_parallel_threads= $save_slave_parallel_threads;
eval SET @@global.slave_parallel_mode= $save_slave_parallel_mode;
eval SET @@global.slave_transaction_retries= $save_transaction_retries;
--source include/start_slave.inc
--connection master
DROP TABLE t1;
--source include/rpl_end.inc