This is not fixing the reported problem, but a potential problem that was
introduced in MDEV-11369.
row_sel_try_search_shortcut(), row_sel_try_search_shortcut_for_mysql():
When an adaptive hash index search lands on top of rec_is_default_row(),
we must skip the candidate and perform a normal search. This is because
the adaptive hash index latch only protects the record from being deleted
but does not prevent concurrent inserts into the page. Therefore, it is not
safe to dereference the next-record pointer.
InnoDB originally skipped the redo logging of PAGE_MAX_TRX_ID changes
until I enabled it in commit e76b873f24
that was part of MySQL 5.5.5 already.
Later, when a more complete history of the InnoDB Plugin for MySQL 5.1
(aka branches/zip in the InnoDB subversion repository) and of the
planned-to-be closed-source branches/innodb+ that became the basis of
InnoDB in MySQL 5.5 was pushed to the MySQL source repository, the
change was part of commit 509e761f06:
------------------------------------------------------------------------
r5038 | marko | 2009-05-19 22:59:07 +0300 (Tue, 19 May 2009) | 30 lines
branches/zip: Write PAGE_MAX_TRX_ID to the redo log. Otherwise,
transactions that are started before the rollback of incomplete
transactions has finished may have an inconsistent view of the
secondary indexes.
dict_index_is_sec_or_ibuf(): Auxiliary function for controlling
updates and checks of PAGE_MAX_TRX_ID: check whether an index is a
secondary index or the insert buffer tree.
page_set_max_trx_id(), page_update_max_trx_id(),
lock_rec_insert_check_and_lock(),
lock_sec_rec_modify_check_and_lock(), btr_cur_ins_lock_and_undo(),
btr_cur_upd_lock_and_undo(): Add the parameter mtr.
page_set_max_trx_id(): Allow mtr to be NULL. When mtr==NULL, do not
attempt to write to the redo log. This only occurs when creating a
page or reorganizing a compressed page. In these cases, the
PAGE_MAX_TRX_ID will be set correctly during the application of redo
log records, even though there is no explicit log record about it.
btr_discard_only_page_on_level(): Preserve PAGE_MAX_TRX_ID. This
function should be unreachable, though.
btr_cur_pessimistic_update(): Update PAGE_MAX_TRX_ID.
Add some assertions for checking that PAGE_MAX_TRX_ID is set on all
secondary index leaf pages.
rb://115 tested by Michael, fixes Issue #211
------------------------------------------------------------------------
After this fix, some bogus references to recv_recovery_is_on()
remained. Also, some references could be replaced with
references to index->is_dummy to prepare us for MDEV-14481
(background redo log apply).
In CREATE SEQUENCE or CREATE TEMPORARY SEQUENCE, we should not start
an InnoDB transaction for inserting the sequence status record into
the underlying no-rollback table. Because we did this, a debug assertion
failure would fail in START TRANSACTION WITH CONSISTENT SNAPSHOT after
CREATE TEMPORARY SEQUENCE was executed.
row_ins_step(): Do not start the transaction. Let the caller do that.
que_thr_step(): Start the transaction before calling row_ins_step().
row_ins_clust_index_entry(): Skip locking and undo logging for no-rollback
tables, even for temporary no-rollback tables.
row_ins_index_entry(): Allow trx->id==0 for no-rollback tables.
row_insert_for_mysql(): Do not start a transaction for no-rollback tables.
trx reference counter was updated under mutex and read without any
protection. This is both slow and unsafe. Use atomic operations for
reference counter accesses.
trx_sys_t::rw_trx_set is implemented as std::set, which does a few quite
expensive operations under trx_sys_t::mutex protection: e.g. malloc/free
when adding/removing elements. Traversing b-tree is not that cheap either.
This has negative scalability impact, which is especially visible when running
oltp_update_index.lua benchmark on a ramdisk.
To reduce trx_sys_t::mutex contention std::set is replaced with LF_HASH. None
of LF_HASH operations require trx_sys_t::mutex (nor any other global mutex)
protection.
Another interesting issue observed with std::set is reproducible ~2% performance
decline after benchmark is ran for ~60 seconds. With LF_HASH results are stable.
All in all this patch optimises away one of three trx_sys->mutex locks per
oltp_update_index.lua query. The other two critical sections became smaller.
Relevant clean-ups:
Replaced rw_trx_set iteration at startup with local set. The latter is needed
because values inserted to rw_trx_list must be ordered by trx->id.
Removed redundant conditions from trx_reference(): it is (and even was) never
called with transactions that have trx->state == TRX_STATE_COMMITTED_IN_MEMORY.
do_ref_count doesn't (and probably even didn't) make any sense: now it is called
only when reference counter increment is actually requested.
Moved condition out of mutex in trx_erase_lists().
trx_rw_is_active(), trx_rw_is_active_low() and trx_get_rw_trx_by_id() were
greatly simplified and replaced by appropriate trx_rw_hash_t methods.
Compared to rw_trx_set, rw_trx_hash holds transactions only in PREPARED or
ACTIVE states. Transactions in COMMITTED state were required to be found
at InnoDB startup only. They are now looked up in the local set.
Removed unused trx_assert_recovered().
Removed unused innobase_get_trx() declaration.
Removed rather semantically incorrect trx_sys_rw_trx_add().
Moved information printout from trx_sys_init_at_db_start() to
trx_lists_init_at_db_start().
trx_undo_page_report_modify(): For SPATIAL INDEX, keep logging
updated off-page columns twice, so that
the minimum bounding rectangle (MBR) will be logged.
Avoiding the redundant logging would require larger changes
to the undo log format.
row_build_index_entry_low(): Handle SPATIAL_UNKNOWN more robustly,
by refusing to purge the record from the spatial index.
We can get this code when processing old undo log from 10.2.10 or
10.2.11 (the releases affected by MDEV-14799, which was a regression
from MDEV-14051).
This is a regression caused by MDEV-14051 'Undo log record is too big.'
Purge in the secondary index is wrongly skipped in
row_purge_upd_exist_or_extern() because node->row only does not contain all
indexed columns.
trx_undo_rec_get_partial_row(): Add the parameter for node->update
so that the updated columns will be copied from the initial part
of the undo log record.
row_log_table_apply_insert_low(), row_log_table_apply_update():
When reporting the error_key_num, only count the clustered index
if it corresponds to a key in the SQL layer.
The assertion failure was probably introduced by the (incomplete)
MySQL 5.6.28 bug fix
Bug #21364096 THE BOGUS DUPLICATE KEY ERROR IN ONLINE DDL
WITH INCORRECT KEY NAME
which we are improving.
Side note: the fix was incorrectly merged to MySQL 5.7.10;
incorrect key names will continue to be reported in MySQL 5.7.
Now that MDEV-14717 made RENAME TABLE crash-safe within InnoDB,
it should be safe to drop the #sql- tables within InnoDB during
crash recovery. These tables can be one of two things:
(1) #sql-ib related to deferred DROP TABLE (follow-up to MDEV-13407)
or to table-rebuilding ALTER TABLE...ALGORITHM=INPLACE
(since MDEV-14378, only related to the intermediate copy of a table),
(2) #sql- related to the intermediate copy of a table during
ALTER TABLE...ALGORITHM=COPY
We will not drop tables whose name starts with #sql2, because
the server can be killed during an ALGORITHM=COPY operation at
a point where the original table was renamed to #sql2 but the
finished intermediate copy was not yet renamed from #sql-
to the original table name.
InnoDB in MariaDB 10.2 appears to only write MLOG_FILE_RENAME2
redo log records during table-rebuilding ALGORITHM=INPLACE operations.
We must write the records for any .ibd file renames, so that the
operations are crash-safe.
If InnoDB is killed during a RENAME TABLE operation, it can happen that
the transaction for updating the data dictionary will be rolled back.
But, nothing will roll back the renaming of the .ibd file
(the MLOG_FILE_RENAME2 only guarantees roll-forward), or for that matter,
the renaming of the dict_table_t::name in the dict_sys cache. We introduce
the undo log record TRX_UNDO_RENAME_TABLE to fix this.
fil_space_for_table_exists_in_mem(): Remove the parameters
adjust_space, table_id and some code that was trying to work around
these deficiencies.
fil_name_write_rename(): Write a MLOG_FILE_RENAME2 record.
dict_table_rename_in_cache(): Invoke fil_name_write_rename().
trx_undo_rec_copy(): Set the first 2 bytes to the length of the
copied undo log record.
trx_undo_page_report_rename(), trx_undo_report_rename():
Write a TRX_UNDO_RENAME_TABLE record with the old table name.
row_rename_table_for_mysql(): Invoke trx_undo_report_rename()
before modifying any data dictionary tables.
row_undo_ins_parse_undo_rec(): Roll back TRX_UNDO_RENAME_TABLE
by invoking dict_table_rename_in_cache(), which will take care
of both renaming the table and the file.
The InnoDB background DROP TABLE queue is something that we should
really remove, but are unable to until we remove dict_operation_lock
so that DDL and DML operations can be combined in a single transaction.
Because the queue is not persistent, it is not crash-safe. We should
in some way ensure that the deferred-dropped tables will be dropped
after server restart.
The existence of two separate transactions complicates the error handling
of CREATE TABLE...SELECT. We should really not break locks in DROP TABLE.
Our solution to these problems is to rename the table to a temporary
name, and to drop such-named tables on InnoDB startup. Also, the
queue will use table IDs instead of names from now on.
check-testcase.test: Ignore #sql-ib*.ibd files, because tables may enter
the background DROP TABLE queue shortly before the test finishes.
innodb.drop_table_background: Test CREATE...SELECT and the creation of
tables whose file name starts with #sql-ib.
innodb.alter_crash: Adjust the recovery, now that the #sql-ib tables
will be dropped on InnoDB startup.
row_mysql_drop_garbage_tables(): New function, to drop all #sql-ib tables
on InnoDB startup.
row_drop_table_for_mysql_in_background(): Remove an unnecessary and
misplaced call to log_buffer_flush_to_disk(). (The call should have been
after the transaction commit. We do not care about flushing the redo log
here, because the table would be dropped again at server startup.)
Remove the entry from the list after the table no longer exists.
If server shutdown has been initiated, empty the list without actually
dropping any tables. They will be dropped again on startup.
row_drop_table_for_mysql(): Do not call lock_remove_all_on_table().
Instead, if locks exist, defer the DROP TABLE until they do not exist.
If the table name does not start with #sql-ib, rename it to that prefix
before adding it to the background DROP TABLE queue.
The InnoDB background DROP TABLE queue is something that we should
really remove, but are unable to until we remove dict_operation_lock
so that DDL and DML operations can be combined in a single transaction.
Because the queue is not persistent, it is not crash-safe. In stable
versions of MariaDB, we can only try harder to drop all enqueued
tables before server shutdown.
row_mysql_drop_t::table_id: Replaces table_name.
row_drop_tables_for_mysql_in_background():
Do not remove the entry from the list as long as the table exists.
In this way, the table should eventually be dropped.
trx_roll_must_shutdown(): During the rollback of recovered transactions,
report progress and check if the rollback should be interrupted because
of a pending shutdown.
trx_roll_max_undo_no, trx_roll_progress_printed_pct: Remove, along with
the messages that were interleaved with other messages.
row_undo_step(), trx_rollback_active(): Abort the rollback of a
recovered ordinary transaction if fast shutdown has been initiated.
trx_rollback_resurrected(): Convert an aborted-rollback transaction
into a fake XA PREPARE transaction, so that fast shutdown can proceed.
row_quiesce_table_start(), row_quiesce_table_complete():
Use the more appropriate predicate srv_undo_sources for skipping
purge control. (This change alone is insufficient; it is possible
that this predicate will change during the call to trx_purge_stop()
or trx_purge_run().)
trx_purge_stop(), trx_purge_run(): Tolerate PURGE_STATE_EXIT.
It is very well possible to initiate shutdown soon after the statement
FLUSH TABLES FOR EXPORT has been submitted to execution.
srv_purge_coordinator_thread(): Ensure that the wait for purge_sys->event
in trx_purge_stop() will terminate when the coordinator thread exits.
When the transaction isolation level is SERIALIZABLE, or when
a locking read is performed in the REPEATABLE READ isolation level,
InnoDB must lock delete-marked records in order to prevent another
transaction from inserting something.
However, at READ UNCOMMITTED or READ COMMITTED isolation level or
when the parameter innodb_locks_unsafe_for_binlog is set, the
repeatability of the reads does not matter, and there is no need
to lock any records.
row_search_mvcc(): Skip locks on delete-marked committed records upfront,
instead of invoking row_unlock_for_mysql() afterwards. The unlocking
never worked for secondary index records.
Remove volatile modifier from waiters: it's not supposed for inter-thread
communication, use appropriate atomic operations instead.
Changed waiters to int32_t, my_atomic friendly type.
Allow DROP TABLE `#mysql50##sql-...._.` to drop tables that were
being rebuilt by ALGORITHM=INPLACE
NOTE: If the server is killed after the table-rebuilding ALGORITHM=INPLACE
commits inside InnoDB but before the .frm file has been replaced, then
the recovery will involve something else than DROP TABLE.
NOTE: If the server is killed in a true inplace ALTER TABLE commits
inside InnoDB but before the .frm file has been replaced, then we
are really out of luck. To properly handle that situation, we would
need a transactional mysql.ddl_fixup table that directs recovery to
rename or remove files.
prepare_inplace_alter_table_dict(): Use the altered_table->s->table_name
for generating the new_table_name.
table_name_t::part_suffix: The start of the partition name suffix.
table_name_t::dbend(): Return the end of the schema name.
table_name_t::dblen(): Return the length of the schema name, in bytes.
table_name_t::basename(): Return the name without the schema name.
table_name_t::part(): Return the partition name, or NULL if none.
row_drop_table_for_mysql(): Assert for #sql, not #sql-ib.
When logging ROW_T_INSERT or ROW_T_UPDATE records, we did not normalize
the DB_TRX_ID of the current transaction into 0 if the current transaction
had started (modifying other tables) before the ALTER TABLE started.
MDEV-13654 introduced this normalization for ROW_T_DELETE
and for all operations with ADD PRIMARY KEY, in row_log_table_get_pk().