trx_t::is_recovered: Revert most of the changes that were made by the
merge of MDEV-15326 from 10.2. The trx_sys.rw_trx_hash and the recovery
of transactions at startup is quite different in 10.3.
trx_free_at_shutdown(): Avoid excessive mutex protection. Reading fields
that can only be modified by the current thread (owning the transaction)
can be done outside mutex.
trx_t::commit_state(): Restore a tighter assertion.
trx_rollback_recovered(): Clarify why there is no potential race condition
with other transactions.
lock_trx_release_locks(): Merge with trx_t::release_locks(),
and avoid holding lock_sys.mutex unnecessarily long.
rw_trx_hash_t::find(): Remove redundant code, and avoid starving the
committer by checking trx_t::state before trx_t::reference().
Backport the applicable part of Sergey Vojtovich's commit
0ca2ea1a65 from MariaDB Server 10.3.
trx reference counter was updated under mutex and read without any
protection. This is both slow and unsafe. Use atomic operations for
reference counter accesses.
MySQL 5.7.9 (and MariaDB 10.2.2) introduced a race condition
between InnoDB transaction commit and the conversion of implicit
locks into explicit ones.
The assertion failure can be triggered with a test that runs
3 concurrent single-statement transactions in a loop on a simple
table:
CREATE TABLE t (a INT PRIMARY KEY) ENGINE=InnoDB;
thread1: INSERT INTO t SET a=1;
thread2: DELETE FROM t;
thread3: SELECT * FROM t FOR UPDATE; -- or DELETE FROM t;
The failure scenarios are like the following:
(1) The INSERT statement is being committed, waiting for lock_sys->mutex.
(2) At the time of the failure, both the DELETE and SELECT transactions
are active but have not logged any changes yet.
(3) The transaction where the !other_lock assertion fails started
lock_rec_convert_impl_to_expl().
(4) After this point, the commit of the INSERT removed the transaction from
trx_sys->rw_trx_set, in trx_erase_lists().
(5) The other transaction consulted trx_sys->rw_trx_set and determined
that there is no implicit lock. Hence, it grabbed the lock.
(6) The !other_lock assertion fails in lock_rec_add_to_queue()
for the lock_rec_convert_impl_to_expl(), because the lock was 'stolen'.
This assertion failure looks genuine, because the INSERT transaction
is still active (trx->state=TRX_STATE_ACTIVE).
The problematic step (4) was introduced in
mysql/mysql-server@e27e0e0bb7
which fixed something related to MVCC (covered by the test
innodb.innodb-read-view). Basically, it reintroduced an error
that had been mentioned in an earlier commit
mysql/mysql-server@a17be6963f:
"The active transaction was removed from trx_sys->rw_trx_set prematurely."
Our fix goes along the following lines:
(a) Implicit locks will released by assigning
trx->state=TRX_STATE_COMMITTED_IN_MEMORY as the first step.
This transition will no longer be protected by lock_sys_t::mutex,
only by trx->mutex. This idea is by Sergey Vojtovich.
(b) We detach the transaction from trx_sys before starting to release
explicit locks.
(c) All callers of trx_rw_is_active() and trx_rw_is_active_low() must
recheck trx->state after acquiring trx->mutex.
(d) Before releasing any explicit locks, we will ensure that any activity
by other threads to convert implicit locks into explicit will have ceased,
by checking !trx_is_referenced(trx). There was a glitch
in this check when it was part of lock_trx_release_locks(); at the end
we would release trx->mutex and acquire lock_sys->mutex and trx->mutex,
and fail to recheck (trx_is_referenced() is protected by trx_t::mutex).
(e) Explicit locks can be released in batches (LOCK_RELEASE_INTERVAL=1000)
just like we did before.
trx_t::state: Document that the transition to COMMITTED is only
protected by trx_t::mutex, no longer by lock_sys_t::mutex.
trx_rw_is_active_low(), trx_rw_is_active(): Document that the transaction
state should be rechecked after acquiring trx_t::mutex.
trx_t::commit_state(): New function to change a transaction to committed
state, to release implicit locks.
trx_t::release_locks(): New function to release the explicit locks
after commit_state().
lock_trx_release_locks(): Move much of the logic to the caller
(which must invoke trx_t::commit_state() and trx_t::release_locks()
as needed), and assert that the transaction will have locks.
trx_get_trx_by_xid(): Make the parameter a pointer to const.
lock_rec_other_trx_holds_expl(): Recheck trx->state after acquiring
trx->mutex, and avoid a redundant lookup of the transaction.
lock_rec_queue_validate(): Recheck impl_trx->state while holding
impl_trx->mutex.
row_vers_impl_x_locked(), row_vers_impl_x_locked_low():
Document that the transaction state must be rechecked after
trx_mutex_enter().
trx_free_prepared(): Adjust for the changes to lock_trx_release_locks().
Some bugs are detected only after a table definition has been evicted
and then reloaded to the InnoDB data dictionary cache.
For debug builds, introduce the settable Boolean configuration parameter
innodb_evict_tables_on_commit_debug that can be set to request InnoDB
to attempt to evict table definitions from the data dictionary cache
whenever a transaction is committed.
This has been tested on 10.3 and 10.4 with the following:
./mysql-test-run.pl --mysqld=--loose-innodb-evict-tables-on-commit-debug
You can also use the following:
SET GLOBAL innodb_evict_tables_on_commit_debug=ON;
SET GLOBAL innodb_evict_tables_on_commit_debug=OFF;
The parameter affects the commit (or rollback or abort) of
transactions that have modified persistent InnoDB tables.
Revert part of fa2a74e08d.
trx_reference(): Remove, and merge the relevant part to the only caller
trx_rw_is_active(). If the statements trx = NULL; were ever executed,
the function would have dereferenced a NULL pointer and crashed in
trx_mutex_exit(trx). Hence, those statements must have been unreachable,
and they can be replaced with debug assertions.
trx_rw_is_active(): Avoid unnecessary acquisition and release of trx->mutex
when do_ref_count=false.
lock_trx_release_locks(): Do not reset trx->id=0. Had the statement been
necessary, we would have experienced crashes in trx_reference().
log_checkpoint(), log_make_checkpoint_at(): Remove the parameter
write_always. It seems that the primary purpose of this parameter
was to ensure in the function recv_reset_logs() that both checkpoint
header pages will be overwritten, when the function is called from
the never-enabled function recv_recovery_from_archive_start().
create_log_files(): Merge recv_reset_logs() to its only caller.
Debug instrumentation: Prefer to flush the redo log, instead of
triggering a redo log checkpoint.
page_header_set_field(): Disable a debug assertion that will
always fail due to MDEV-19344, now that we no longer initiate
a redo log checkpoint before an injected crash.
In recv_reset_logs() there used to be two calls to
log_make_checkpoint_at(). The apparent purpose of this was
to ensure that both InnoDB redo log checkpoint header pages
will be initialized or overwritten.
The second call was removed (without any explanation) in MySQL 5.6.3:
mysql/mysql-server@4ca37968da
In MySQL 5.6.8 WL#6494, starting with
mysql/mysql-server@00a0ba8ad9
the function recv_reset_logs() was not only invoked during
InnoDB data file initialization, but also during a regular
startup when the redo log is being resized.
mysql/mysql-server@45e9167983
in MySQL 5.7.2 removed the UNIV_LOG_ARCHIVE code, but still
did not remove the parameter write_always.
This reverts commit 21b2fada7a
and commit 81d71ee6b2.
The MDEV-18464 change introduces a few data race issues. Contrary to
the documentation, the field trx_t::victim is not always being protected
by lock_sys_t::mutex and trx_t::mutex. Most importantly, it seems
that KILL QUERY could wrongly avoid acquiring both mutexes when
invoking lock_trx_handle_wait_low(), in case another thread had
already set trx->victim=true.
We also revert MDEV-12009, because it should depend on the MDEV-18464
fix being present.
1) Avoid writing of MLOG_INDEX_LOAD redo log record during inplace
alter table when the table is empty and also for spatial index.
2) Avoid creation of temporary merge file for spatial index during
index creation process.
Pushed the decision for innodb transaction and system
locking down to lock0lock.cc level. With this,
we can avoid releasing these mutexes for executions
where these mutexes were acquired upfront.
This patch will also fix BF aborting of native threads, e.g.
threads which have declared wsrep_on=OFF. Earlier, we have
used, for innodb trx locks, was_chosen_as_deadlock_victim
flag, for marking inodb transactions, which are victims for
wsrep BF abort. With native threads (wsrep_on==OFF), re-using
was_chosen_as_deadlock_victim flag may lead to inteference
with real deadlock, and to deal with this, the patch has added new
flag for marking wsrep BF aborts only: victim=true
Similar way if replication decides to abort one of the threads
we mark victim by: victim=true
innobase_kill_query
Remove lock sys and trx mutex handling.
wsrep_innobase_kill_one_trx
Mark victim trx with victim=true
trx0trx.h
Remove trx_abort_t type and abort type variable from
trx struct. Add victim variable to trx.
wsrep_kill_victim
Remove abort_type
lock_report_waiters_to_mysql
Take also trx mutex and mark trx as a victim for
replication abort.
lock_trx_handle_wait_low
New low level function to check whether the transaction
has already been rolled back because it was selected as
a deadlock victim, or if it has to wait then cancel
the wait lock.
lock_trx_handle_wait
If transaction is not marked as victim take lock sys
and trx mutex before calling lock_trx_handle_wait_low
and release them after that.
row_search_for_mysql
Remove lock sys and trx mutex taking and releasing.
trx_rollback_to_savepoint_for_mysql_low
trx_commit_in_memory
Clean up victim variable.
During database recovery, a transaction with wsrep XID is
recovered from InnoDB in prepared state. However, when the
transaction is looked up with trx_get_trx_by_xid() in
innobase_commit_by_xid(), trx->xid gets cleared in
trx_get_trx_by_xid_low() and commit time serialization history
write does not update the wsrep XID in trx sys header for
that recovered trx. As a result the transaction gets
committed during recovery but the wsrep position does not
get updated appropriately.
As a fix, we preserve trx->xid for Galera over transaction
commit in recovery phase.
Fix authored by: Teemu Ollakka (GaleraCluster) and Marko Mäkelä.
modified: mysql-test/suite/galera/disabled.def
modified: mysql-test/suite/galera/r/galera_gcache_recover_full_gcache.result
modified: mysql-test/suite/galera/r/galera_gcache_recover_manytrx.result
modified: mysql-test/suite/galera/t/galera_gcache_recover_full_gcache.test
modified: mysql-test/suite/galera/t/galera_gcache_recover_manytrx.test
modified: storage/innobase/trx/trx0trx.cc
modified: storage/xtradb/trx/trx0trx.cc
main.derived_cond_pushdown: Move all 10.3 tests to the end,
trim trailing white space, and add an "End of 10.3 tests" marker.
Add --sorted_result to tests where the ordering is not deterministic.
main.win_percentile: Add --sorted_result to tests where the
ordering is no longer deterministic.
The parameters bool sync=true, bool read_only=false of mtr_t::start()
were added in
eca5b0fc17
(MySQL 5.7.3).
The parameter read_only was never used anywhere.
The parameter sync was only copied around, and would be returned
by the unused function mtr_t::is_async().
We do not need this dead code in MariaDB.
It turned out that ha_innobase::truncate() would prematurely
commit the transaction already before the completion of the
ha_innobase::create(). All of this must be atomic.
innodb.truncate_crash: Use the correct DEBUG_SYNC point, and
tolerate non-truncation of the table, because the redo log
for the TRUNCATE transaction commit might be flushed due to
some InnoDB background activity.
dict_build_tablespace_for_table(): Merge to the function
dict_build_table_def_step().
dict_build_table_def_step(): If a table is being created during
an already started data dictionary transaction (such as TRUNCATE),
persistently write the table_id to the undo log header before
creating any file. In this way, the recovery of TRUNCATE will be
able to delete the new file before rolling back the rename of
the original table.
dict_table_rename_in_cache(): Add the parameter replace_new_file,
used as part of rolling back a TRUNCATE operation.
fil_rename_tablespace_check(): Add the parameter replace_new.
If the parameter is set and a file identified by new_path exists,
remove a possible tablespace and also the file.
create_table_info_t::create_table_def(): Remove some debug assertions
that no longer hold. During TRUNCATE, the transaction will already
have been started (and performed a rename operation) before the
table is created. Also, remove a call to dict_build_tablespace_for_table().
create_table_info_t::create_table(): Add the parameter create_fk=true.
During TRUNCATE TABLE, do not add FOREIGN KEY constraints to the
InnoDB data dictionary, because they will also not be removed.
row_table_add_foreign_constraints(): If trx=NULL, do not modify
the InnoDB data dictionary, but only load the FOREIGN KEY constraints
from the data dictionary.
ha_innobase::create(): Lock the InnoDB data dictionary cache only
if no transaction was passed by the caller. Unlock it in any case.
innobase_rename_table(): Add the parameter commit = true.
If !commit, do not lock or unlock the data dictionary cache.
ha_innobase::truncate(): Lock the data dictionary before invoking
rename or create, and let ha_innobase::create() unlock it and
also commit or roll back the transaction.
trx_undo_mark_as_dict(): Renamed from trx_undo_mark_as_dict_operation()
and declared global instead of static.
row_undo_ins_parse_undo_rec(): If table_id is set, this must
be rolling back the rename operation in TRUNCATE TABLE, and
therefore replace_new_file=true.
If trx_free() and trx_create_low() were called while a call to
trx_reference() was pending, we could get a reference to a wrong
transaction object.
trx_reference(): Return NULL if the trx->id no longer matches.
lock_trx_release_locks(): Assign trx->id = 0, so that trx_reference()
will not return a reference to this object.
trx_cleanup_at_db_startup(): Assign trx->id = 0.
assert_trx_is_free(): Assert !trx->n_ref. Assert trx->id == 0,
now that it will be cleared as part of a transaction commit.
Allocate trx->lock.rec_pool and trx->lock.table_pool directly from trx_t.
Remove unnecessary use of std::vector.
In order to do this, move some definitions from lock0priv.h to
lock0types.h, so that ib_lock_t will not be an opaque type.
trx_set_rw_mode() is never called for read-only transactions, this is guarded
by callers.
Removing this condition from critical section immediately gives 5% scalability
improvement in OLTP index updates benchmark.
In InnoDB, an INSERT will not create an explicit lock object. Instead,
the inserted record is initially implicitly locked by the transaction
that wrote its trx_t::id to the hidden system column DB_TRX_ID.
(Other transactions would check if DB_TRX_ID is referring to a
transaction that has not been committed.)
If a record was inserted in the current transaction, it would be
implicitly locked by that transaction. Only if some other transaction
is requesting access to the record, the implicit lock should be
converted to an explicit one, so that the waits-for graph can be
constructed for detecting deadlocks and lock wait timeouts.
Before this fix, InnoDB would convert implicit locks to
explicit ones, even if no conflict exists.
lock_rec_convert_impl_to_expl(): Return whether caller_trx
already holds an explicit lock that covers the record.
row_vers_impl_x_locked_low(): Avoid a lookup if the record matches
caller_trx->id.
lock_trx_has_expl_x_lock(): Renamed from lock_trx_has_rec_x_lock().
row_upd_clust_step(): In a debug assertion, check for implicit lock
before invoking lock_trx_has_expl_x_lock().
rw_trx_hash_t::find(): Make do_ref_count a mandatory parameter.
Assert that trx_id is not 0 (the caller should check it).
trx_sys_t::is_registered(): Only invoke find() if id != 0.
trx_sys_t::find(): Add the optional parameter do_ref_count.
lock_rec_queue_validate(): Avoid lookup for trx_id == 0.
Thanks to Sergey Vojtovich for feedback and many ideas.
purge_state_t: Remove. The states are replaced with
purge_sys_t::enabled() and purge_sys_t::paused() as follows:
PURGE_STATE_INIT, PURGE_STATE_EXIT, PURGE_STATE_DISABLED: !enabled().
PURGE_STATE_RUN, PURGE_STATE_STOP: paused() distinguishes these.
purge_sys_t::m_paused: Renamed from purge_sys_t::n_stop.
Protected by atomic memory access only, not purge_sys_t::latch.
purge_sys_t::m_enabled: An atomically updated Boolean that
replaces purge_sys_t::state.
purge_sys_t:🏃 Remove, because it duplicates
srv_sys.n_threads_active[SRV_PURGE].
purge_sys_t::running(): Accessor for srv_sys.n_threads_active[SRV_PURGE].
purge_sys_t::stop(): Renamed from trx_purge_stop().
purge_sys_t::resume(): Renamed from trx_purge_run().
Do not acquire latch; solely rely on atomics.
purge_sys_t::is_initialised(), purge_sys_t::m_initialised: Remove.
purge_sys_t::create(), purge_sys_t::close(): Instead of invoking
is_initialised(), check whether event is NULL.
purge_sys_t::event: Move before latch, so that fields that are
protected by latch can reside on the same cache line with latch.
srv_start_wait_for_purge_to_start(): Merge to the only caller srv_start().