Problem:
=======
During dropping of fts index, InnoDB waits for fts_optimize_remove_table()
and it holds dict_sys->mutex and dict_operaiton_lock even though the
table id is not present in the queue. But fts_optimize_thread does wait
for dict_sys->mutex to process the unrelated table id from the slot.
Solution:
========
Whenever table is added to fts_optimize_wq, update the fts_status
of in-memory fts subsystem to TABLE_IN_QUEUE. Whenever drop index
wants to remove table from the queue, it can check the fts_status
to decide whether it should send the MSG_DELETE_TABLE to the queue.
Removed the following functions because these are all deadcode.
dict_table_wait_for_bg_threads_to_exit(),
fts_wait_for_background_thread_to_start(),fts_start_shutdown(), fts_shudown().
Problem:
=======
Transaction left with nonempty table locks list. This leads to
assumption that table_locks is not subset of trx_locks. Problem is that
lock_wait_timeout_thread() doesn't remove the table lock from
table_locks for transaction.
Solution:
========
In lock_wait_timeout_thread(), remove the lock from table vector of
transaction.
- During trx_undo_report_rename(), InnoDB can fail to write undo log
for it if undo log doesn't fit in the undo page. In that case, InnoDB
adds one more undo log page and retry to write the rename undo log.
But the assert is wrong and it doesn't allow to fail even for one time.
trx_t::is_recovered: Revert most of the changes that were made by the
merge of MDEV-15326 from 10.2. The trx_sys.rw_trx_hash and the recovery
of transactions at startup is quite different in 10.3.
trx_free_at_shutdown(): Avoid excessive mutex protection. Reading fields
that can only be modified by the current thread (owning the transaction)
can be done outside mutex.
trx_t::commit_state(): Restore a tighter assertion.
trx_rollback_recovered(): Clarify why there is no potential race condition
with other transactions.
lock_trx_release_locks(): Merge with trx_t::release_locks(),
and avoid holding lock_sys.mutex unnecessarily long.
rw_trx_hash_t::find(): Remove redundant code, and avoid starving the
committer by checking trx_t::state before trx_t::reference().
Backport the applicable part of Sergey Vojtovich's commit
0ca2ea1a65 from MariaDB Server 10.3.
trx reference counter was updated under mutex and read without any
protection. This is both slow and unsafe. Use atomic operations for
reference counter accesses.
MySQL 5.7.9 (and MariaDB 10.2.2) introduced a race condition
between InnoDB transaction commit and the conversion of implicit
locks into explicit ones.
The assertion failure can be triggered with a test that runs
3 concurrent single-statement transactions in a loop on a simple
table:
CREATE TABLE t (a INT PRIMARY KEY) ENGINE=InnoDB;
thread1: INSERT INTO t SET a=1;
thread2: DELETE FROM t;
thread3: SELECT * FROM t FOR UPDATE; -- or DELETE FROM t;
The failure scenarios are like the following:
(1) The INSERT statement is being committed, waiting for lock_sys->mutex.
(2) At the time of the failure, both the DELETE and SELECT transactions
are active but have not logged any changes yet.
(3) The transaction where the !other_lock assertion fails started
lock_rec_convert_impl_to_expl().
(4) After this point, the commit of the INSERT removed the transaction from
trx_sys->rw_trx_set, in trx_erase_lists().
(5) The other transaction consulted trx_sys->rw_trx_set and determined
that there is no implicit lock. Hence, it grabbed the lock.
(6) The !other_lock assertion fails in lock_rec_add_to_queue()
for the lock_rec_convert_impl_to_expl(), because the lock was 'stolen'.
This assertion failure looks genuine, because the INSERT transaction
is still active (trx->state=TRX_STATE_ACTIVE).
The problematic step (4) was introduced in
mysql/mysql-server@e27e0e0bb7
which fixed something related to MVCC (covered by the test
innodb.innodb-read-view). Basically, it reintroduced an error
that had been mentioned in an earlier commit
mysql/mysql-server@a17be6963f:
"The active transaction was removed from trx_sys->rw_trx_set prematurely."
Our fix goes along the following lines:
(a) Implicit locks will released by assigning
trx->state=TRX_STATE_COMMITTED_IN_MEMORY as the first step.
This transition will no longer be protected by lock_sys_t::mutex,
only by trx->mutex. This idea is by Sergey Vojtovich.
(b) We detach the transaction from trx_sys before starting to release
explicit locks.
(c) All callers of trx_rw_is_active() and trx_rw_is_active_low() must
recheck trx->state after acquiring trx->mutex.
(d) Before releasing any explicit locks, we will ensure that any activity
by other threads to convert implicit locks into explicit will have ceased,
by checking !trx_is_referenced(trx). There was a glitch
in this check when it was part of lock_trx_release_locks(); at the end
we would release trx->mutex and acquire lock_sys->mutex and trx->mutex,
and fail to recheck (trx_is_referenced() is protected by trx_t::mutex).
(e) Explicit locks can be released in batches (LOCK_RELEASE_INTERVAL=1000)
just like we did before.
trx_t::state: Document that the transition to COMMITTED is only
protected by trx_t::mutex, no longer by lock_sys_t::mutex.
trx_rw_is_active_low(), trx_rw_is_active(): Document that the transaction
state should be rechecked after acquiring trx_t::mutex.
trx_t::commit_state(): New function to change a transaction to committed
state, to release implicit locks.
trx_t::release_locks(): New function to release the explicit locks
after commit_state().
lock_trx_release_locks(): Move much of the logic to the caller
(which must invoke trx_t::commit_state() and trx_t::release_locks()
as needed), and assert that the transaction will have locks.
trx_get_trx_by_xid(): Make the parameter a pointer to const.
lock_rec_other_trx_holds_expl(): Recheck trx->state after acquiring
trx->mutex, and avoid a redundant lookup of the transaction.
lock_rec_queue_validate(): Recheck impl_trx->state while holding
impl_trx->mutex.
row_vers_impl_x_locked(), row_vers_impl_x_locked_low():
Document that the transaction state must be rechecked after
trx_mutex_enter().
trx_free_prepared(): Adjust for the changes to lock_trx_release_locks().
Some bugs are detected only after a table definition has been evicted
and then reloaded to the InnoDB data dictionary cache.
For debug builds, introduce the settable Boolean configuration parameter
innodb_evict_tables_on_commit_debug that can be set to request InnoDB
to attempt to evict table definitions from the data dictionary cache
whenever a transaction is committed.
This has been tested on 10.3 and 10.4 with the following:
./mysql-test-run.pl --mysqld=--loose-innodb-evict-tables-on-commit-debug
You can also use the following:
SET GLOBAL innodb_evict_tables_on_commit_debug=ON;
SET GLOBAL innodb_evict_tables_on_commit_debug=OFF;
The parameter affects the commit (or rollback or abort) of
transactions that have modified persistent InnoDB tables.
Revert part of fa2a74e08d.
trx_reference(): Remove, and merge the relevant part to the only caller
trx_rw_is_active(). If the statements trx = NULL; were ever executed,
the function would have dereferenced a NULL pointer and crashed in
trx_mutex_exit(trx). Hence, those statements must have been unreachable,
and they can be replaced with debug assertions.
trx_rw_is_active(): Avoid unnecessary acquisition and release of trx->mutex
when do_ref_count=false.
lock_trx_release_locks(): Do not reset trx->id=0. Had the statement been
necessary, we would have experienced crashes in trx_reference().
Starting with commit 210855ce5d
Valgrind became aware that the unused tail of the buffer that
is returned by thd_get_xid() is actually uninitialized.
The problem should exist already in MySQL 5.0. I was able to
repeat it on MariaDB Server 5.5 with some additional instrumentation.
InnoDB is allocating 128+4+4 bytes for the XID and the lengths of
its components, even when the XID is shorter than 64+64 bytes.
In MariaDB Server 10.3, while running the test main.xa_binlog,
in the xid_t::set() that is called by sql_yacc.yy, the 128-byte data
buffer was uninitialized according to Valgrind, and only the first bytes
were initialized. When the xid_t::data was copied to
thd.transaction.xid_state.xid.data, it happened so that the entire
target buffer was considered initialized. With MariaDB Server 10.4 since
the said commit, Valgrind will correctly be detect the tail of the buffer
as uninitialized.
The impact of this bug is as follows:
(1) InnoDB will write unnecessarily much redo log for XA PREPARE.
(2) InnoDB will write garbage bytes to the redo log and undo log pages.
(3) The garbage should be 'harmless', because on recovery, only the
actual payload of the XID will be used, based on the written length.
trx_rseg_write_wsrep_checkpoint(), trx_undo_write_xid(): Write only
the actually used length of xid->data to the data page, and
zero out the rest of the buffer by mlog_memset().
btr_push_update_extern_fields(): Add a parameter for the original number
of fields in the record before btr_cur_trim(). Assume that this function
will only be called for the clustered index, which is the only index
that can contain off-page columns.
trx_undo_prev_version_build(), btr_cur_pessimistic_update():
Only invoke btr_push_update_extern_fields() for the clustered index.
C++11 defines the singly-linked std::forward_list. Prefer it to
the doubly-linked std::list in cases where we dot really need it.
Also, clean up some code.
dict_index_remove_from_v_col_list(): Remove.
Obsoleted by dict_index_t::detach_columns().
There is no std::forward_list::push_back(). Use push_front() instead.
The ordering does not really matter.
dict_v_col_t::n_v_indexes: Added. There is no std::forward_list::size(),
and trx_undo_log_v_idx() needs to know the size.
rtr_info_track_t::rtr_active: Encapsulate. There really was no justification
for the pointer indirection.
There is only one InnoDB crash recovery subsystem.
Allocating recv_sys statically removes one level of pointer indirection
and makes code more readable, and removes the awkward initialization of
recv_sys->dblwr.
recv_sys_t::create(): Replaces recv_sys_init().
recv_sys_t::debug_free(): Replaces recv_sys_debug_free().
recv_sys_t::close(): Replaces recv_sys_close().
recv_sys_t::add(): Replaces recv_add_to_hash_table().
recv_sys_t::empty(): Replaces recv_sys_empty_hash().
Bootstrapping a new cluster from a backup created from a MariaDB
version prior to 10.3.5 may result in error "SST position can't be
set in past" when attempting to join additional nodes.
The problem stems from the fact that when reading the wsrep position
from InnoDB, the position is looked up in two places:
the TRX_SYS page, where versions prior to 10.3.5 used to store
WSREP's position; and rollback segments, this is where newer versions
store the position.
When starting a new cluster, the starting seqno is 0 and a new cluster
UUID is generated. This is persisted in rollback segments, but the old
UUID and seqno are not cleared from TRX_SYS page.
Subsequently, when reading back the position,
trx_rseg_read_wsrep_checkpoint() is going to return the maximum seqno
found in both TRX_SYS page and rollback segments. So in the case of a
newly bootstrapped cluster, it's always going to return the old
cluster information.
The fix consists of changing trx_rseg_read_wsrep_checkpoint() so that
only rollback segments are looked up. On startup, position is read
from the TRX_SYS page, and if present, it is copied to rollback
segments (unless a newer position is already present in the rollback
segments).
Finally the position stored in TRX_SYS page is cleared.
dict_sys_t::create(): Renamed from dict_init().
dict_sys_t::close(): Renamed from dict_close().
dict_sys_t::add(): Sliced from dict_table_t::add_to_cache().
dict_sys_t::remove(): Renamed from dict_table_remove_from_cache().
dict_sys_t::prevent_eviction(): Renamed from
dict_table_move_from_lru_to_non_lru().
dict_sys_t::acquire(): Replaces dict_move_to_mru() and some more logic.
dict_sys_t::resize(): Renamed from dict_resize().
dict_sys_t::find(): Replaces dict_lru_find_table() and
dict_non_lru_find_table().
In MySQL 5.7.8 an extra level of pointer indirection was added to
dict_operation_lock and some other rw_lock_t without solid justification,
in mysql/mysql-server@52720f1772.
Let us revert that change and remove the rather useless rw_lock_t
constructor and destructor and the magic_n field. In this way,
some unnecessary pointer dereferences and heap allocation will be avoided
and debugging might be a little easier.
Some places didn't match the previous rules, making the Floor
address wrong.
Additional sed rules:
sed -i -e 's/Place.*Suite .*, Boston/Street, Fifth Floor, Boston/g'
sed -i -e 's/Suite .*, Boston/Fifth Floor, Boston/g'