This feature adds the functionality of ignorability for indexes.
Indexes are not ignored be default.
To control index ignorability explicitly for a new index,
use IGNORE or NOT IGNORE as part of the index definition for
CREATE TABLE, CREATE INDEX, or ALTER TABLE.
Primary keys (explicit or implicit) cannot be made ignorable.
The table INFORMATION_SCHEMA.STATISTICS get a new column named IGNORED that
would store whether an index needs to be ignored or not.
In btr_index_rec_validate(), externally stored column
check is missing while matching the length of the field
with the length of the field data stored in record.
Fetch the length of the externally stored part and compare it
with the fixed field length.
When doing a truncate on an Innodb under lock tables, InnoDB would rename
the old table to #sql-... and recreate a new 't1' table. The table lock
would still be on the #sql-table.
When doing ALTER TABLE, Innodb would do the changes on the #sql table
(which would disappear on close).
When the SQL layer, as part of inline alter table, would close the
original t1 table (#sql in InnoDB) and then reopen the t1 table, Innodb
would notice that this does not match it's own (old) t1 table and
generate an error.
Fixed by adding code in truncate table that if we are under lock tables
and truncating an InnoDB table, we would close, reopen and lock the
table after truncate. This will remove the #sql table and ensure that
lock tables is using the new empty table.
Reviewer: Marko Mäkelä
A performance regression was introduced by
commit e71e613353 (MDEV-24671)
and mostly addressed by
commit 455514c800.
The regression is likely caused by increased contention
lock_sys.latch (former lock_sys.mutex), possibly indirectly
caused by contention on lock_sys.wait_mutex. This change aims to
reduce both, but further improvements will be needed.
lock_wait(): Minimize the lock_sys.wait_mutex hold time.
lock_sys_t::deadlock_check(): Add a parameter for indicating
whether lock_sys.latch is exclusively locked.
trx_t::was_chosen_as_deadlock_victim: Always use atomics.
lock_wait_wsrep(): Assume that no mutex is being held.
Deadlock::report(): Always kill the victim transaction.
lock_sys_t::timeout: New counter to back MONITOR_TIMEOUT.
trx_t::commit_in_memory(): Invoke mod_tables.clear().
trx_free_at_shutdown(): Invoke mod_tables.clear() for transactions
that are discarded on shutdown.
Everywhere else, assert mod_tables.empty() on freed transaction objects.
commit a993310593 (MDEV-24537)
introduced the regression that the page cleaner will keep sleeping
even if there is work to do.
innodb_max_dirty_pages_pct_update(): Always wake up the page cleaner
on any SET GLOBAL innodb_max_dirty_pages_pct= assignment.
buf_flush_page_cleaner(): If innodb_max_dirty_pages_pct is nonzero,
consult only that parameter when determining whether there is work
to do. Else, consult innodb_max_dirty_pages.
A new configuration parameter innodb_deadlock_report is introduced:
* innodb_deadlock_report=off: Do not report any details of deadlocks.
* innodb_deadlock_report=basic: Report transactions and waiting locks.
* innodb_deadlock_report=full (default): Report also the blocking locks.
The improved deadlock checker will consider all involved transactions
in one loop, even if the deadlock loop includes several transactions.
The theoretical maximum number of transactions that can be involved in
a deadlock is `innodb_page_size` * 8, limited by the persistent data
structures.
Note: Similar to
mysql/mysql-server@3859219875
our deadlock checker will consider at most one blocking transaction
for each waiting transaction. The new field trx->lock.wait_trx be
nullptr if and only if trx->lock.wait_lock is nullptr. Note that
trx->lock.wait_lock->trx == trx (the waiting transaction), while
trx->lock.wait_trx points to one of the transactions whose lock is
conflicting with trx->lock.wait_lock.
Considering only one blocking transaction will greatly simplify
our deadlock checker, but it may also make the deadlock checker
blind to some deadlocks where the deadlock cycle is 'hidden' by
the fact that the registered trx->lock.wait_trx is not actually
waiting for any InnoDB lock, but something else. So, instead of
deadlocks, sometimes lock wait timeout may be reported.
To improve on this, whenever trx->lock.wait_trx is changed, we
will register further 'candidate' transactions in Deadlock::to_check(),
and check for 'revealed' deadlocks as soon as possible, in lock_release()
and innobase_kill_query().
The old DeadlockChecker was holding lock_sys.latch, even though using
lock_sys.wait_mutex should be less contended (and thus preferred)
in the likely case that no deadlock is present.
lock_wait(): Defer the deadlock check to this function, instead of
executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting().
DeadlockChecker: Complete rewrite:
(1) Explicitly keep track of transactions that are being waited for,
in trx->lock.wait_trx, protected by lock_sys.wait_mutex. Previously,
we were painstakingly traversing the lock heaps while blocking
concurrent registration or removal of any locks (even uncontended ones).
(2) Use Brent's cycle-detection algorithm for deadlock detection,
traversing each trx->lock.wait_trx edge at most 2 times.
(3) If a deadlock is detected, release lock_sys.wait_mutex,
acquire LockMutexGuard, re-acquire lock_sys.wait_mutex and re-invoke
find_cycle() to find out whether the deadlock is still present.
(4) Display information on all transactions that are involved in the
deadlock, and choose a victim to be rolled back.
lock_sys.deadlocks: Replaces lock_deadlock_found. Protected by wait_mutex.
Deadlock::find_cycle(): Quickly find a cycle of trx->lock.wait_trx...
using Brent's cycle detection algorithm.
Deadlock::report(): Report a deadlock cycle that was found by
Deadlock::find_cycle(), and choose a victim with the least weight.
Altogether, we may traverse each trx->lock.wait_trx edge up to 5
times (2*find_cycle()+1 time for reporting and choosing the victim).
Deadlock::check_and_resolve(): Find and resolve a deadlock.
lock_wait_rpl_report(): Report the waits-for information to
replication. This used to be executed as part of DeadlockChecker.
Replication must know the waits-for relations even if no deadlocks
are present in InnoDB.
Reviewed by: Vladislav Vaintroub
Let us avoid the excessive allocation of explicit record locks
(a work-around of MDEV-24813) so that the test will execute
much faster under AddressSanitizer, MemorySanitizer, Valgrind.
The test innodb.innodb_bug60049 used to check that the record
(ID,NAME)=(12,'SYS_FOREIGN_COLS') is the last record in the
secondary index of the system table SYS_TABLES.
But, ever since commit 2336558423
or mysql/mysql-server@082d59670f
that record no longer is the last one in the table!
The more recent test innodb.purge_secondary covers the purge
functionality much better.
trx_t::commit_tables(): Ensure that mod_tables will be empty.
This was broken in commit b08448de64
where the query cache invalidation was moved from lock_release().
innobase_rename_column_try(): When renaming SYS_FIELDS records
for secondary indexes, try to use both formats of SYS_FIELDS.POS
as keys, in case the PRIMARY KEY includes a column prefix.
Without this fix, an ALTER TABLE that renames a column followed
by a server restart (or LRU eviction of the table definition
from dict_sys) would make the table inaccessible.
This failure is caused by commit 43ca6059ca
(MDEV-24720). InnoDB fails to remove the ahi entries
during rollback of bulk insert operation. InnoDB should
remove the AHI entries of root page before reinitialising it.
Reviewed-by: Marko Mäkelä
Now that an INSERT into an empty table is replicated more efficiently
during online ALTER, an old test case started to fail. Let us disable
the MDEV-515 logic for the critical INSERT statement.
This is caused by commit 3cef4f8f0f
(MDEV-515). dict_table_t::clear() frees all the blob during
rollback of bulk insert.But online log tries to read the
freed blob while applying the log. It can be fixed if we
truncate the online log during rollback of bulk insert operation.
The DeadlockChecker expects to be able to freeze the waits-for graph.
Hence, it is best executed somewhere where we are not holding any
additional mutexes.
lock_wait(): Defer the deadlock check to this function, instead
of executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting().
DeadlockChecker::trx_rollback(): Merge with the only caller,
check_and_resolve().
LockMutexGuard: RAII accessor for lock_sys.mutex.
lock_sys.deadlocks: Replaces lock_deadlock_found.
trx_t: Clean up some comments.
InnoDB fails to remove the ahi entries during rollback
of bulk insert operation. InnoDB throws the error when
validates the ahi hash tables. InnoDB should remove
the ahi entries while freeing the segment only during
bulk index rollback operation.
Reviewed-by: Marko Mäkelä
The purpose of the test was to ensure that the SX (update) mode of
index tree and buffer page latches are being used.
The test has become unstable, possibly due to changes related to
buf_pool.mutex and buf_pool.page_hash, or to the use of MDL in the
purge of transaction history.
In 10.6, the test depends on instrumentation that was refactored
or removed in MDEV-24142.
The use of different latching modes can better be indirectly observed
through high-concurrency benchmarks. For MDEV-14637, a performance test
was conducted where the finer-grained latching and
BTR_CUR_FINE_HISTORY_LENGTH were removed. It caused a 20% performance
regression for UPDATE and somewhat smaller for INSERT.
Any new problem with latching granularity should be easily caught by
performance testing, or by stress tests with Random Query Generator.
In commit 3cef4f8f0f (MDEV-515)
we inadvertently broke CREATE TABLE...REPLACE SELECT statements
by wrongly disabling row-level undo logging.
select_create::prepare(): Only invoke extra(HA_EXTRA_BEGIN_ALTER_COPY)
if no special treatment of duplicates is needed.
After an ignored INSERT IGNORE statement into an empty table, we would
wrongly use the MDEV-515 table-level undo logging for a subsequent
REPLACE statement.
ha_innobase::reset_template(): Clear m_prebuilt->ins_node->bulk_insert
on every statement boundary.
ha_innobase::start_stmt(): Invoke end_bulk_insert().
ha_innobase::extra(): Avoid accessing m_prebuilt->trx. Do not call
thd_to_trx(). Invoke end_bulk_insert() and try to reset bulk_insert
when changing the REPLACE or IGNORE settings.
trx_mod_table_time_t::WAS_BULK: Use a distinct value from BULK.
trx_undo_report_row_operation(): Add debug assertions.
Note: Some calls to end_bulk_insert() may be redundant, but statement
boundaries are not always clear in the API (especially in the
presence of LOCK TABLES or stored procedures).
We implement an idea that was suggested by Michael 'Monty' Widenius
in October 2017: When InnoDB is inserting into an empty table or partition,
we can write a single undo log record TRX_UNDO_EMPTY, which will cause
ROLLBACK to clear the table.
For this to work, the insert into an empty table or partition must be
covered by an exclusive table lock that will be held until the transaction
has been committed or rolled back, or the INSERT operation has been
rolled back (and the table is empty again), in lock_table_x_unlock().
Clustered index records that are covered by the TRX_UNDO_EMPTY record
will carry DB_TRX_ID=0 and DB_ROLL_PTR=1<<55, and thus they cannot
be distinguished from what MDEV-12288 leaves behind after purging the
history of row-logged operations.
Concurrent non-locking reads must be adjusted: If the read view was
created before the INSERT into an empty table, then we must continue
to imagine that the table is empty, and not try to read any records.
If the read view was created after the INSERT was committed, then
all records must be visible normally. To implement this, we introduce
the field dict_table_t::bulk_trx_id.
This special handling only applies to the very first INSERT statement
of a transaction for the empty table or partition. If a subsequent
statement in the transaction is modifying the initially empty table again,
we must enable row-level undo logging, so that we will be able to
roll back to the start of the statement in case of an error (such as
duplicate key).
INSERT IGNORE will continue to use row-level logging and locking, because
implementing it would require the ability to roll back the latest row.
Since the undo log that we write only allows us to roll back the entire
statement, we cannot support INSERT IGNORE. We will introduce a
handler::extra() parameter HA_EXTRA_IGNORE_INSERT to indicate to storage
engines that INSERT IGNORE is being executed.
In many test cases, we add an extra record to the table, so that during
the 'interesting' part of the test, row-level locking and logging will
be used.
Replicas will continue to use row-level logging and locking until
MDEV-24622 has been addressed. Likewise, this optimization will be
disabled in Galera cluster until MDEV-24623 enables it.
dict_table_t::bulk_trx_id: The latest active or committed transaction
that initiated an insert into an empty table or partition.
Protected by exclusive table lock and a clustered index leaf page latch.
ins_node_t::bulk_insert: Whether bulk insert was initiated.
trx_t::mod_tables: Use C++11 style accessors (emplace instead of insert).
Unlike earlier, this collection will cover also temporary tables.
trx_mod_table_time_t: Add start_bulk_insert(), end_bulk_insert(),
is_bulk_insert(), was_bulk_insert().
trx_undo_report_row_operation(): Before accessing any undo log pages,
invoke trx->mod_tables.emplace() in order to determine whether undo
logging was disabled, or whether this is the first INSERT and we are
supposed to write a TRX_UNDO_EMPTY record.
row_ins_clust_index_entry_low(): If we are inserting into an empty
clustered index leaf page, set the ins_node_t::bulk_insert flag for
the subsequent trx_undo_report_row_operation() call.
lock_rec_insert_check_and_lock(), lock_prdt_insert_check_and_lock():
Remove the redundant parameter 'flags' that can be checked in the caller.
btr_cur_ins_lock_and_undo(): Simplify the logic. Correctly write
DB_TRX_ID,DB_ROLL_PTR after invoking trx_undo_report_row_operation().
trx_mark_sql_stat_end(), ha_innobase::extra(HA_EXTRA_IGNORE_INSERT),
ha_innobase::external_lock(): Invoke trx_t::end_bulk_insert() so that
the next statement will not be covered by table-level undo logging.
ReadView::changes_visible(trx_id_t) const: New accessor for the case
where the trx_id_t is not read from a potentially corrupted index page
but directly from the memory. In this case, we can skip a sanity check.
row_sel(), row_sel_try_search_shortcut(), row_search_mvcc():
row_sel_try_search_shortcut_for_mysql(),
row_merge_read_clustered_index(): Check dict_table_t::bulk_trx_id.
row_sel_clust_sees(): Replaces lock_clust_rec_cons_read_sees().
lock_sec_rec_cons_read_sees(): Replaced with lower-level code.
btr_root_page_init(): Refactored from btr_create().
dict_index_t::clear(), dict_table_t::clear(): Empty an index or table,
for the ROLLBACK of an INSERT operation.
ROW_T_EMPTY, ROW_OP_EMPTY: Note a concurrent ROLLBACK of an INSERT
into an empty table.
This is joint work with Thirunarayanan Balathandayuthapani,
who created a working prototype.
Thanks to Matthias Leich for extensive testing.