Commit graph

1032 commits

Author SHA1 Message Date
Marko Mäkelä
4179f93d28 MDEV-18976 Implement OPT_PAGE_CHECKSUM log record for improved validation
We will introduce an optional log record OPT_PAGE_CHECKSUM for recording
page checksums, so that more inconsistencies on crash recovery may be
caught.

mtr_t::page_checksum(const buf_page_t&): Write OPT_PAGE_CHECKSUM
(currently not for ROW_FORMAT=COMPRESSED pages).

mtr_t::do_write(): Write OPT_PAGE_CHECKSUM records for all pages
(currently, in debug builds only).

mtr_t::is_logged(): Return whether log should be written.

mtr_t::set_log_mode_sub(const mtr_t&): Set the logging mode of
a sub-minitransaction when another mini-transaction is holding
latches on some modified pages. When creating or freeing BLOB pages,
we may only write OPT_PAGE_CHECKSUM records in the main mini-transaction,
after all changes have been written to the log.

MTR_LOG_SUB: Log mode for a sub-mini-transaction.

mtr_t::free(): Define non-inline, and invoke MarkFreed.

MarkFreed: For any matching page in the mini-transaction log,
change the first entry to say MTR_MEMO_PAGE_X_MODIFY and any subsequent
entries to MTR_MEMO_PAGE_X_FIX.

FindModified: Simplify a condition. MTR_MEMO_MODIFY can only be set
if MTR_MEMO_PAGE_X_FIX or MTR_MEMO_PAGE_SX_FIX are set.

FindBlockX: Consider also MTR_MEMO_PAGE_X_MODIFY.

recv_sys_t::parse(): Store OPT_PAGE_CHECKSUM records.

log_phys_t::apply(): Validate OPT_PAGE_CHECKSUM records.

log_phys_t::page_checksum(): Validate an OPT_PAGE_CHECKSUM record.

Tested by: Matthias Leich
2022-06-06 14:05:01 +03:00
Marko Mäkelä
0b47c126e3 MDEV-13542: Crashing on corrupted page is unhelpful
The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.

We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.

This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.

We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.

Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.

buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.

btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.

btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().

btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().

FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.

dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.

trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().

trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.

trx_sys_create_sys_pages(): Merged with trx_sysf_create().

dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.

dir_pathname(): Replaces os_file_make_new_pathname().

row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.

btr_decryption_failed(): Report decryption failures

dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.

dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.

dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.

dict_table_is_corrupted(): Remove.

dict_index_t::is_btree(): Determine if the index is a valid B-tree.

BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.

UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures, but error codes being returned.

buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().

fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.

btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.

opt_search_plan_for_table(): Simplify the code.

row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.

row_undo_mod_upd_exist_sec(): Simplify the code.

row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().

fut_get_ptr(): Replace with buf_page_get_gen() calls.

buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.

buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.

fseg_page_is_allocated(): Replaces fseg_page_is_free().

fts_drop_common_tables(): Return an error if the transaction
was rolled back.

fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.

fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.

Clean up mtr_t::page_lock()

buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.

buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.

mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().

recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.

recv_recover_page(): Return whether the operation succeeded.

recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.

Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.
2022-06-06 14:03:22 +03:00
Marko Mäkelä
75096c84b4 MDEV-28525 Some conditions around btr_latch_mode could be eliminated
The types btr_latch_mode and mtr_memo_type_t are partly derived from
rw_lock_type_t. Despite that, some code for converting between them
is using conditions instead of bitwise arithmetics.

Let us define btr_latch_mode in such a way that more conversions to
rw_lock_type_t are possible by bitwise and.

Some SPATIAL INDEX code that assumed !(BTR_MODIFY_TREE & BTR_MODIFY_LEAF)
was adjusted.
2022-06-06 11:56:29 +03:00
Marko Mäkelä
1b03db11d2 MDEV-15528 fixup: Remove some dead code
btr_page_split_and_insert(): Declare all parameters nonnull.
btr_pessimistic_scrub() was removed
in commit a5584b13d1 (MDEV-15528).
2022-06-06 09:52:11 +03:00
Marko Mäkelä
2f8d0af883 Merge 10.5 into 10.6 2022-06-02 17:39:13 +03:00
Marko Mäkelä
5909e0ec31 Cleanup: btr_store_big_rec_extern_fields() does not really modify pcur 2022-06-02 17:22:16 +03:00
Sergei Golubchik
3bc98a4ec4 Merge branch '10.5' into 10.6 2022-05-10 14:01:23 +02:00
Sergei Golubchik
ef781162ff Merge branch '10.4' into 10.5 2022-05-09 22:04:06 +02:00
Sergei Golubchik
a70a1cf3f4 Merge branch '10.3' into 10.4 2022-05-08 23:03:08 +02:00
Marko Mäkelä
57a9626fe4 Merge 10.5 into 10.6 2022-05-06 11:11:04 +03:00
Marko Mäkelä
26d46234e8 MDEV-28478: INSERT into SPATIAL INDEX in TEMPORARY table writes log
This is based on commit 20ae4816bb
with some adjustments for MDEV-12353.

row_ins_sec_index_entry_low(): If a separate mini-transaction is
needed to adjust the minimum bounding rectangle (MBR) in the parent
page, we must disable redo logging if the table is a temporary table.
For temporary tables, no log is supposed to be written, because
the temporary tablespace will be reinitialized on server restart.

rtr_update_mbr_field(), rtr_merge_and_update_mbr(): Changed the return
type to void and removed unreachable code. In older versions, these
used to return a different value for temporary tables.

page_id_t: Add constexpr to most member functions.

mtr_t::log_write(): Catch log writes to invalid tablespaces
so that the test case would crash without the fix to
row_ins_sec_index_entry_low().
2022-05-06 10:12:31 +03:00
Marko Mäkelä
c844a5881a MDEV-21452 fixup: Remove an unused variable 2022-05-03 13:31:59 +03:00
Marko Mäkelä
0806592ac8 MDEV-28422 Page split breaks a gap lock
btr_insert_into_right_sibling(): Inherit any gap lock from the
left sibling to the right sibling before inserting the record
to the right sibling and updating the node pointer(s).

lock_update_node_pointer(): Update locks in case a node pointer
will move.

Based on mysql/mysql-server@c7d93c274f
2022-04-27 13:38:08 +03:00
Marko Mäkelä
2ca1123464 MDEV-26217 Failing assertion: list.count > 0 in ut_list_remove or Assertion `lock->trx == this' failed in dberr_t trx_t::drop_table
This follows up the previous fix in
commit c3c53926c4 (MDEV-26554).

ha_innobase::delete_table(): Work around the insufficient
metadata locking (MDL) during DML operations by acquiring exclusive
InnoDB table locks on all child tables. Previously, this was only
done on TRUNCATE and ALTER.

ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke
lock_update_delete() during change buffer operations.
The revised trx_t::commit(std::vector<pfs_os_file_t>&) will
hold exclusive lock_sys.latch while invoking fil_delete_tablespace(),
which in turn may invoke ibuf_delete_rec().

dict_index_t::has_locking(): A new predicate, replacing the dummy
!dict_table_is_locking_disabled(index->table). Used for skipping lock
operations during ibuf_delete_rec().

trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks
and remove the table from the cache while holding exclusive
lock_sys.latch.

trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds.

trx_t::commit(): Reset dict_operation before invoking commit_in_memory()
via commit_persist().

lock_release_on_drop(): Release locks while lock_sys.latch is
exclusively locked.

lock_table(): Add a parameter for a pointer to the table.
We must not dereference the table before a lock_sys.latch has
been acquired. If the pointer to the table does not match the table
at that point, the table is invalid and DB_DEADLOCK will be returned.

row_ins_foreign_check_on_constraint(): Improve the checks.
Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed
before commit c5fd9aa562 (MDEV-25919).

row_upd_check_references_constraints(),
wsrep_row_upd_check_foreign_constraints(): Simplify checks.
2022-04-26 18:09:03 +03:00
Thirunarayanan Balathandayuthapani
4b80c11f52 MDEV-15250 UPSERT during ALTER TABLE results in 'Duplicate entry' error for alter
- InnoDB DDL results in `Duplicate entry' if concurrent DML throws
duplicate key error. The following scenario explains the problem

connection con1:
  ALTER TABLE t1 FORCE;

connection con2:
  INSERT INTO t1(pk, uk) VALUES (2, 2), (3, 2);

In connection con2, InnoDB throws the 'DUPLICATE KEY' error because
of unique index. Alter operation will throw the error when applying
the concurrent DML log.

- Inserting the duplicate key for unique index logs the insert
operation for online ALTER TABLE. When insertion fails,
transaction does rollback and it leads to logging of
delete operation for online ALTER TABLE.
While applying the insert log entries, alter operation
encounters 'DUPLICATE KEY' error.

- To avoid the above fake duplicate scenario, InnoDB should
not write any log for online ALTER TABLE before DML transaction
commit.

- User thread which does DML can apply the online log if
InnoDB ran out of online log and index is marked as completed.
Set online log error if apply phase encountered any error.
It can also clear all other indexes log, marks the newly
added indexes as corrupted.

- Removed the old online code which was a part of DML operations

commit_inplace_alter_table() : Does apply the online log
for the last batch of secondary index log and does frees
the log for the completed index.

trx_t::apply_online_log: Set to true while writing the undo
log if the modified table has active DDL

trx_t::apply_log(): Apply the DML changes to online DDL tables

dict_table_t::is_active_ddl(): Returns true if the table
has an active DDL

dict_index_t::online_log_make_dummy(): Assign dummy value
for clustered index online log to indicate the secondary
indexes are being rebuild.

dict_index_t::online_log_is_dummy(): Check whether the online
log has dummy value

ha_innobase_inplace_ctx::log_failure(): Handle the apply log
failure for online DDL transaction

row_log_mark_other_online_index_abort(): Clear out all other
online index log after encountering the error during
row_log_apply()

row_log_get_error(): Get the error happened during row_log_apply()

row_log_online_op(): Does apply the online log if index is
completed and ran out of memory. Returns false if apply log fails

UndorecApplier: Introduced a class to maintain the undo log
record, latched undo buffer page, parse the undo log record,
maintain the undo record type, info bits and update vector

UndorecApplier::get_old_rec(): Get the correct version of the
clustered index record that was modified by the current undo
log record

UndorecApplier::clear_undo_rec(): Clear the undo log related
information after applying the undo log record

UndorecApplier::log_update(): Handle the update, delete undo
log and apply it on online indexes

UndorecApplier::log_insert(): Handle the insert undo log
and apply it on online indexes

UndorecApplier::is_same(): Check whether the given roll pointer
is generated by the current undo log record information

trx_t::rollback_low(): Set apply_online_log for the transaction
after partially rollbacked transaction has any active DDL

prepare_inplace_alter_table_dict(): After allocating the online
log, InnoDB does create fulltext common tables. Fulltext index
doesn't allow the index to be online. So removed the dead
code of online log removal

Thanks to Marko Mäkelä for providing the initial prototype and
Matthias Leich for testing the issue patiently.
2022-04-25 18:52:19 +05:30
Marko Mäkelä
8f8ba75855 MDEV-27234: Data dictionary recovery was not READ COMMITTED
This also fixes MDEV-20198: Instant ALTER TABLE is not crash safe

InnoDB dictionary recovery wrongly used the READ UNCOMMITTED isolation
level, causing some mismatch. For example, if a table was renamed or
replaced in a transaction, according to READ UNCOMMITTED the table might
not exist at all.

We implement READ COMMITTED isolation level for accessing the dictionary
tables SYS_TABLES, SYS_COLUMNS, SYS_INDEXES, SYS_FIELDS, SYS_VIRTUAL,
SYS_FOREIGN, SYS_FOREIGN_COLS. For most of these tables, no secondary
index exists. For the secondary indexes (on SYS_TABLES.ID,
SYS_FOREIGN.FOR_NAME, SYS_FOREIGN.REF_NAME), we will always look up
the primary key in the clustered index and check if the record actually
is a committed version.

dict_check_sys_tables(): Recover tablespaces also from delete-marked
committed records, so that if a matching .ibd file exists, it will
be removed by fil_delete_tablespace() when the committed delete-marked
SYS_INDEXES record of the clustered index is purged
in row_purge_remove_clust_if_poss_low().

fil_ibd_open(): Change the Boolean parameter "validate" to a ternary
one, to suppress error messages when the file might not exist.
It is possible that a .ibd file was deleted and the server shut down
before the SYS_INDEXES and SYS_TABLES records were purged. Hence, if
dict_check_sys_tables() finds a committed delete-marked record,
we must not complain if the tablespace file is not found.
On Windows, we msut treat ERROR_PATH_NOT_FOUND (directory not found)
in the same way as ERROR_FILE_NOT_FOUND. This fixes a few failures where
a previous test successfully executed DROP DATABASE (and deleted all
files and the directory), but a committed delete-marked SYS_TABLES
record had not been purged before server restart.

dict_getnext_system_low(): Do not filter out delete-marked records.

dict_startscan_system(), dict_getnext_system(): Do filter out
delete-marked records, for accessing the INFORMATION_SCHEMA tables.

dict_sys_tables_rec_read(): Return the DB_TRX_ID of the committed
version of the record. This is needed in dict_load_table_low().

dict_load_foreign_cols(), dict_load_foreign(): Add a parameter for
the current transaction identifier. In some DDL operations, the
FOREIGN KEY constraints are being loaded from the data dictionary
before the DDL transaction has been committed. For SYS_FOREIGN
and SYS_FOREIGN_COLS, we must implement the special case of
READ COMMITTED that the changes of the uncommitted current transaction
are visible.

dict_load_foreign(): Validate the table name. We could find a
SYS_FOREIGN.ID via a committed delete-marked secondary index record
that does not match the REF_NAME or FOR_NAME of the secondary index record.

dict_load_index_low(): Optionally take the table as a parameter,
so that table->def_trx_id can be updated in case of a
committed delete-marked SYS_INDEXES record corresponding
to DROP INDEX, but not corresponding to an index stub of ADD INDEX.

dict_load_indexes(): Do not update table->def_trx_id
in case of delete-marked records.

rec_is_metadata(), rec_offs_make_valid(), rec_get_offsets_func(),
row_build_low(): Relax some assertions. We may now have
!index->is_instant() even if a metadata record is present in the index.
Previously, the recovery of instant ADD/DROP COLUMN assumed
that READ UNCOMMITTED of the data dictionary will be performed.
Now, we will have a READ COMMITTED copy of the data dictionary
cache, and a READ UNCOMMITTED copy of the metadata record.

btr_page_reorganize_low(): Correctly update the FIL_PAGE_TYPE
when rolling back an instant ADD/DROP COLUMN operation.

row_rec_to_index_entry_impl(): Relax some assertions,
and disallow accessing "extra" fields. This fixes the recovery
of a crash during an instant ADD COLUMN after a successful
instant DROP COLUMN, in the test innodb.instant_alter_crash.

Tested by: Matthias Leich
2022-03-28 08:37:51 +03:00
Vlad Lesin
202316a38f Merge 10.5 into 10.6 2022-03-07 18:42:47 +03:00
Marko Mäkelä
2dce3bad9c Merge 10.4 into 10.5 2022-03-07 09:26:50 +02:00
Marko Mäkelä
7b97020d40 Merge 10.3 into 10.4 2022-03-07 09:05:36 +02:00
Marko Mäkelä
02da00a98c Merge 10.2 into 10.3 2022-03-04 14:29:36 +02:00
Marko Mäkelä
a92f07f4bd MDEV-27993 Assertion failed in btr_page_reorganize_low()
btr_cur_optimistic_insert(): Disregard DEBUG_DBUG injection to
invoke btr_page_reorganize() if the page (and the table) is empty.
Otherwise, an assertion would fail in btr_page_reorganize_low()
because PAGE_MAX_TRX_ID is 0 in an empty secondary index leaf page.
2022-03-03 11:51:25 +02:00
Vlad Lesin
f6f055a191 Merge 10.3 into 10.4 2022-02-21 14:10:27 +03:00
Vlad Lesin
a6f258e47f MDEV-20605 Awaken transaction can miss inserted by other transaction records due to wrong persistent cursor restoration
Backported from 10.5 20e9e804c1 and
5948d7602e.

sel_restore_position_for_mysql() moves forward persistent cursor
position after btr_pcur_restore_position() call if cursor relative position
is BTR_PCUR_ON and the cursor points to the record with NOT the same field
values as in a stored record(and some other not important for this case
conditions).

It was done because btr_pcur_restore_position() sets
page_cur_mode_t mode  to PAGE_CUR_LE for cursor->rel_pos ==  BTR_PCUR_ON
before opening cursor. So we are searching for the record less or equal
to stored one. And if the found record is not equal to stored one, then
it is less and we need to move cursor forward.

But there can be a situation when the stored record was purged, but the
new one with the same key but different value was inserted while
row_search_mvcc() was suspended. In this case, when the thread is
awaken, it will invoke sel_restore_position_for_mysql(), which, in turns,
invoke btr_pcur_restore_position(), which will return false because found
record don't match stored record, and
sel_restore_position_for_mysql() will move forward cursor position.

The above can lead to the case when awaken row_search_mvcc() do not see
records inserted by other transactions while it slept. The mtr test case
shows the example how it can be.

The fix is to return special value from persistent cursor restoring
function which would notify its caller that uniq fields of restored
record and stored record are the same, and in this case
sel_restore_position_for_mysql() don't move cursor forward.

Delete-marked records are correctly processed in row_search_mvcc().
Non-unique secondary indexes are "uniquified" by adding the PK, the
index->n_uniq should then be index->n_fields. So there is no need in
additional checks in the fix.

If transaction's readview can't see the changes made in secondary index
record, it requests clustered index record in row_search_mvcc() to check
its transaction id and get the correspondent record version. After this
row_search_mvcc() commits mtr to preserve clustered index latching
order, and starts mtr. Between those mtr commit and start secondary
index pages are unlatched, and purge has the ability to remove stored in
the cursor record, what causes rows duplication in result set for
non-locking reads, as cursor position is restored to the previously
visited record.

To solve this the changes are just switched off for non-locking reads,
it's quite simple solution, besides the changes don't make sense for
non-locking reads.

The more complex and effective from performance perspective solution is
to create mtr savepoint before clustered record requesting and rolling
back to that savepoint after that. See MDEV-27557.

One more solution is to have per-record transaction id for secondary
indexes. See MDEV-17598.

If any of those is implemented, just remove select_lock_type argument in
sel_restore_position_for_mysql().
2022-02-21 12:49:54 +03:00
Vlad Lesin
f2f22c382b Merge 10.5 into 10.6 2022-02-14 18:30:51 +03:00
Vlad Lesin
20e9e804c1 MDEV-20605 Awaken transaction can miss inserted by other transaction records due to wrong persistent cursor restoration
sel_restore_position_for_mysql() moves forward persistent cursor
position after btr_pcur_restore_position() call if cursor relative position
is BTR_PCUR_ON and the cursor points to the record with NOT the same field
values as in a stored record(and some other not important for this case
conditions).

It was done because btr_pcur_restore_position() sets
page_cur_mode_t mode  to PAGE_CUR_LE for cursor->rel_pos ==  BTR_PCUR_ON
before opening cursor. So we are searching for the record less or equal
to stored one. And if the found record is not equal to stored one, then
it is less and we need to move cursor forward.

But there can be a situation when the stored record was purged, but the
new one with the same key but different value was inserted while
row_search_mvcc() was suspended. In this case, when the thread is
awaken, it will invoke sel_restore_position_for_mysql(), which, in turns,
invoke btr_pcur_restore_position(), which will return false because found
record don't match stored record, and
sel_restore_position_for_mysql() will move forward cursor position.

The above can lead to the case when awaken row_search_mvcc() do not see
records inserted by other transactions while it slept. The mtr test case
shows the example how it can be.

The fix is to return special value from persistent cursor restoring
function which would notify its caller that uniq fields of restored
record and stored record are the same, and in this case
sel_restore_position_for_mysql() don't move cursor forward.

Delete-marked records are correctly processed in row_search_mvcc().
Non-unique secondary indexes are "uniquified" by adding the PK, the
index->n_uniq should then be index->n_fields. So there is no need in
additional checks in the fix.

If transaction's readview can't see the changes made in secondary index
record, it requests clustered index record in row_search_mvcc() to check
its transaction id and get the correspondent record version. After this
row_search_mvcc() commits mtr to preserve clustered index latching
order, and starts mtr. Between those mtr commit and start secondary
index pages are unlatched, and purge has the ability to remove stored in
the cursor record, what causes rows duplication in result set for
non-locking reads, as cursor position is restored to the previously
visited record.

To solve this the changes are just switched off for non-locking reads,
it's quite simple solution, besides the changes don't make sense for
non-locking reads.

The more complex and effective from performance perspective solution is
to create mtr savepoint before clustered record requesting and rolling
back to that savepoint after that. See MDEV-27557.

One more solution is to have per-record transaction id for secondary
indexes. See MDEV-17598.

If any of those is implemented, just remove select_lock_type argument in
sel_restore_position_for_mysql().
2022-02-14 17:35:04 +03:00
Marko Mäkelä
ba5ef63ae1 MDEV-27476 heap-use-after-free in buf_pool_t::is_block_field()
This follows up commit 017d1b867b.
In commit aaef2e1d8c (MDEV-27058)
some more problematic debug assertions were added.

btr_search_update_block_hash_info(), trx_purge_truncate_history():
Use simpler assertions to check that an uncompressed page is present.
2022-01-12 12:34:07 +02:00
Marko Mäkelä
aaef2e1d8c MDEV-27058: Reduce the size of buf_block_t and buf_page_t
buf_page_t::frame: Moved from buf_block_t::frame.
All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED
pages will have frame=nullptr, while all 'fat' buf_block_t
will have a non-null frame pointing to aligned innodb_page_size bytes.
This eliminates the need for separate states for
BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE.

buf_page_t:🔒 Moved from buf_block_t::lock. That is, all block
descriptors will have a page latch. The IO_PIN state that was used
for discarding or creating the uncompressed page frame of a
ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix
and page X-latch.

page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status
of buf_page_t with a single std::atomic<uint32_t>. All modifications
will use store(), fetch_add(), fetch_sub(). This space was previously
wasted to alignment on 64-bit systems. We will use the following encoding
that combines a state (partly read-fix or write-fix) and a buffer-fix
count:

buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED)
buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY)
buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH)
buf_page_t::FREED=3 + fix: pages marked as freed in the file
buf_page_t::UNFIXED=1U<<29 + fix: normal pages
buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge
buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite)
buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched)
buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched)
buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf
buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite)

buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to
UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch.

buf_page_t::read_complete(): Renamed from buf_page_read_complete().
Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch.

buf_page_t::can_relocate(): If the page latch is being held or waited for,
or the block is buffer-fixed or io-fixed, return false. (The condition
on the page latch is new.)

Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we
will acquire the page latch before fix(), and unfix() before unlocking.

buf_page_t::flush(): Replaces buf_flush_page(). Optimize the
handling of FREED pages.

buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held
by the caller.

buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates.

buf_page_get_low(): Ignore guesses that are read-fixed because they
may not yet be registered in buf_pool.page_hash and buf_pool.LRU.

buf_page_optimistic_get(): Acquire latch before buffer-fixing.

buf_page_make_young(): Leave read-fixed blocks alone, because they
might not be registered in buf_pool.LRU yet.

recv_sys_t::recover_deferred(), recv_sys_t::recover_low():
Possibly fix MDEV-26326, by holding a page X-latch instead of
only buffer-fixing the page.
2021-11-18 17:47:19 +02:00
Marko Mäkelä
02e72f7b44 MDEV-26769 follow-up: Remove unnecessary page locking
btr_cur_optimistic_latch_leaves(): Use transactional_shared_lock_guard.

btr_cur_latch_leaves(): Avoid acquiring some page latches, because
the changes are already blocked by index->lock.

btr_cur_search_to_nth_level_func(): Remove a redundant variable
retrying_for_search_prev=!!prev_tree_blocks, and avoid acquiring
some page latches.
2021-11-18 17:44:33 +02:00
Marko Mäkelä
1f02280904 MDEV-26769 InnoDB does not support hardware lock elision
This implements memory transaction support for:

* Intel Restricted Transactional Memory (RTM), also known as TSX-NI
(Transactional Synchronization Extensions New Instructions)
* POWER v2.09 Hardware Trace Monitor (HTM) on GNU/Linux

transactional_lock_guard, transactional_shared_lock_guard:
RAII lock guards that try to elide the lock acquisition
when transactional memory is available.

buf_pool.page_hash: Try to elide latches whenever feasible.
Related to the InnoDB change buffer and ROW_FORMAT=COMPRESSED
tables, this is not always possible.
In buf_page_get_low(), memory transactions only work reasonably
well for validating a guessed block address.

TMLockGuard, TMLockTrxGuard, TMLockMutexGuard: RAII lock guards
that try to elide lock_sys.latch and related latches.
2021-10-22 12:38:45 +03:00
Marko Mäkelä
c091a0bc8d MDEV-26826 Duplicated computations of buf_pool.page_hash addresses
Since commit bd5a6403ca (MDEV-26033)
we can actually calculate the buf_pool.page_hash cell and latch
addresses while not holding buf_pool.mutex.

buf_page_alloc_descriptor(): Remove the MEM_UNDEFINED.
We now expect buf_page_t::hash to be zero-initialized.

buf_pool_t::hash_chain: Dedicated data type for buf_pool.page_hash.array.

buf_LRU_free_one_page(): Merged to the only caller
buf_pool_t::corrupted_evict().
2021-10-22 12:33:37 +03:00
Marko Mäkelä
d95361107c Merge 10.5 into 10.6 2021-09-24 14:38:52 +03:00
Marko Mäkelä
7e2b42324c Merge 10.4 into 10.5 2021-09-24 08:42:23 +03:00
Marko Mäkelä
9024498e88 Merge 10.3 into 10.4 2021-09-22 18:26:54 +03:00
Marko Mäkelä
b46cf33ab8 Merge 10.2 into 10.3 2021-09-22 18:01:41 +03:00
Marko Mäkelä
3209bc667f MDEV-26636: InnoDB defragmentation statistics cause races on TEMPORARY TABLE
btr_defragment_save_defrag_stats_if_needed(): Do not save
defragmentation statistics for temporary tables.
They are exempt of defragmentation anyway
(ha_innobase::optimize() never invokes defragmentation for them),
and the user-visible names are not available inside InnoDB.

Furthermore, InnoDB assumes that temporary tables are never accessed
by other threads than the one that handles the session with which
the temporary table is associated with.

Furthermore, we simplify the test innodb.innodb_defrag_stats
and include a test case that demonstrates that defragmentation
statistics are no longer being saved for temporary tables.
2021-09-18 15:47:52 +03:00
Marko Mäkelä
1e9c922fa7 MDEV-26623 Possible race condition between statistics and bulk insert
When computing statistics, let us play it safe and check whether
an insert into an empty table is in progress, once we have acquired
the root page latch. If yes, let us pretend that the table is empty,
just like MVCC reads would do.

It is unlikely that this race condition could lead into any crashes
before MDEV-24621, because the index tree structure should be protected
by the page latches. But, at least we can avoid some busy work and
return earlier.

As part of this, some code that is only used for statistics calculation
is being moved into static functions in that compilation unit.
2021-09-17 14:40:41 +03:00
Marko Mäkelä
1a4a7dddb6 Cleanup: Make btr_root_block_get() more robust
btr_root_block_get(): Check for index->page == FIL_NULL.

btr_root_get(): Declare static. Other callers can invoke
btr_root_block_get() directly.

btr_get_size(): Remove conditions that are checked in
btr_root_block_get().
2021-09-17 14:40:41 +03:00
Marko Mäkelä
277ba134ad MDEV-26467: Avoid futile spin loops
Typically, index_lock and fil_space_t::latch will be held for a longer
time than the spin loop in latch acquisition would be waiting for.
Let us avoid spin loops for those as well as dict_sys.latch, which
could be held in exclusive mode for a longer time (while loading
metadata into the buffer pool and the dictionary cache).

Performance testing on a dual Intel Xeon E5-2630 v4 (2 NUMA nodes)
suggests that the buffer pool page latch (block_lock) benefits from a
spin loop in both read-only and read-write workloads where the working
set is slightly larger than the buffer pool. Presumably, most contention
would occur on leaf page latches. Contention on upper level pages in
the buffer pool should intuitively last longer.

We introduce srw_spin_lock and srw_spin_mutex to allow users of
srw_lock or srw_mutex to opt in for the spin loop.
On Microsoft Windows, a spin loop variant was and will not be available;
srw_mutex and srw_lock will simply wrap SRWLOCK.
That is, on Microsoft Windows, the parameters innodb_sync_spin_loops
and innodb_spin_wait_delay will only affect block_lock.
2021-09-06 12:32:24 +03:00
Marko Mäkelä
7730dd392b Merge 10.5 into 10.6 2021-09-06 10:31:32 +03:00
Marko Mäkelä
84c578c795 MDEV-26547 Restoring InnoDB buffer pool dump is single-threaded for no reason
buf_read_page_background(): Remove the parameter "bool sync"
and always actually initiate a page read in the background.

buf_load(): Always submit asynchronous reads. This allows
page checksums to be verified in concurrent threads as
soon as the reads are completed.
2021-09-06 10:14:24 +03:00
Marko Mäkelä
c5fd9aa562 MDEV-25919: Lock tables before acquiring dict_sys.latch
In commit 1bd681c8b3 (MDEV-25506 part 3)
we introduced a "fake instant timeout" when a transaction would wait
for a table or record lock while holding dict_sys.latch. This prevented
a deadlock of the server but could cause bogus errors for operations
on the InnoDB persistent statistics tables.

A better fix is to ensure that whenever a transaction is being
executed in the InnoDB internal SQL parser (which will for now
require dict_sys.latch to be held), it will already have acquired
all locks that could be required for the execution. So, we will
acquire the following locks upfront, before acquiring dict_sys.latch:

(1) MDL on the affected user table (acquired by the SQL layer)
(2) If applicable (not for RENAME TABLE): InnoDB table lock
(3) If persistent statistics are going to be modified:
(3.a) MDL_SHARED on mysql.innodb_table_stats, mysql.innodb_index_stats
(3.b) exclusive table locks on the statistics tables
(4) Exclusive table locks on the InnoDB data dictionary tables
(not needed in ANALYZE TABLE and the like)

Note: Acquiring exclusive locks on the statistics tables may cause
more locking conflicts between concurrent DDL operations.
Notably, RENAME TABLE will lock the statistics tables
even if no persistent statistics are enabled for the table.

DROP DATABASE will only acquire locks on statistics tables if
persistent statistics are enabled for the tables on which the
SQL layer is invoking ha_innobase::delete_table().
For any "garbage collection" in innodb_drop_database(), a timeout
while acquiring locks on the statistics tables will result in any
statistics not being deleted for any tables that the SQL layer
did not know about.

If innodb_defragment=ON, information may be written to the statistics
tables even for tables for which InnoDB persistent statistics are
disabled. But, DROP TABLE will no longer attempt to delete that
information if persistent statistics are not enabled for the table.

This change should also fix the hangs related to InnoDB persistent
statistics and STATS_AUTO_RECALC (MDEV-15020) as well as
a bug that running ALTER TABLE on the statistics tables
concurrently with running ALTER TABLE on InnoDB tables could
cause trouble.

lock_rec_enqueue_waiting(), lock_table_enqueue_waiting():
Do not issue a fake instant timeout error when the transaction
is holding dict_sys.latch. Instead, assert that the dict_sys.latch
is never being held here.

lock_sys_tables(): A new function to acquire exclusive locks on all
dictionary tables, in case DROP TABLE or similar operation is
being executed. Locking non-hard-coded tables is optional to avoid
a crash in row_merge_drop_temp_indexes(). The SYS_VIRTUAL table was
introduced in MySQL 5.7 and MariaDB Server 10.2. Normally, we require
all these dictionary tables to exist before executing any DDL, but
the function row_merge_drop_temp_indexes() is an exception.
When upgrading from MariaDB Server 10.1 or MySQL 5.6 or earlier,
the table SYS_VIRTUAL would not exist at this point.

ha_innobase::commit_inplace_alter_table(): Invoke
log_write_up_to() while not holding dict_sys.latch.

dict_sys_t::remove(), dict_table_close(): No longer try to
drop index stubs that were left behind by aborted online ADD INDEX.
Such indexes should be dropped from the InnoDB data dictionary by
row_merge_drop_indexes() as part of the failed DDL operation.
Stubs for aborted indexes may only be left behind in the
data dictionary cache.

dict_stats_fetch_from_ps(): Use a normal read-only transaction.

ha_innobase::delete_table(), ha_innobase::truncate(), fts_lock_table():
While waiting for purge to stop using the table,
do not hold dict_sys.latch.

ha_innobase::delete_table(): Implement a work-around for the rollback
of ALTER TABLE...ADD PARTITION. MDL_EXCLUSIVE would not be held if
ALTER TABLE hits lock_wait_timeout while trying to upgrade the MDL
due to a conflicting LOCK TABLES, such as in the first ALTER TABLE
in the test case of Bug#53676 in parts.partition_special_innodb.
Therefore, we must explicitly stop purge, because it would not be
stopped by MDL.

dict_stats_func(), btr_defragment_chunk(): Allocate a THD so that
we can acquire MDL on the InnoDB persistent statistics tables.

mysqltest_embedded: Invoke ha_pre_shutdown() before free_used_memory()
in order to avoid ASAN heap-use-after-free related to acquire_thd().

trx_t::dict_operation_lock_mode: Changed the type to bool.

row_mysql_lock_data_dictionary(), row_mysql_unlock_data_dictionary():
Implemented as macros.

rollback_inplace_alter_table(): Apply an infinite timeout to lock waits.

innodb_thd_increment_pending_ops(): Wrapper for
thd_increment_pending_ops(). Never attempt async operation for
InnoDB background threads, such as the trx_t::commit() in
dict_stats_process_entry_from_recalc_pool().

lock_sys_t::cancel(trx_t*): Make dictionary transactions immune to KILL.

lock_wait(): Make dictionary transactions immune to KILL, and to
lock wait timeout when waiting for locks on dictionary tables.

parts.partition_special_innodb: Use lock_wait_timeout=0 to instantly
get ER_LOCK_WAIT_TIMEOUT.

main.mdl: Filter out MDL on InnoDB persistent statistics tables

Reviewed by: Thirunarayanan Balathandayuthapani
2021-08-31 13:54:44 +03:00
Marko Mäkelä
82b7c561b7 MDEV-24258 Merge dict_sys.mutex into dict_sys.latch
In the parent commit, dict_sys.latch could theoretically have been
replaced with a mutex. But, we can do better and merge dict_sys.mutex
into dict_sys.latch. Generally, every occurrence of dict_sys.mutex_lock()
will be replaced with dict_sys.lock().

The PERFORMANCE_SCHEMA instrumentation for dict_sys_mutex
will be removed along with dict_sys.mutex. The dict_sys.latch
will remain instrumented as dict_operation_lock.

Some use of dict_sys.lock() will be replaced with dict_sys.freeze(),
which we will reintroduce for the new shared mode. Most notably,
concurrent table lookups are possible as long as the tables are present
in the dict_sys cache. In particular, this will allow more concurrency
among InnoDB purge workers.

Because dict_sys.mutex will no longer 'throttle' the threads that purge
InnoDB transaction history, a performance degradation may be observed
unless innodb_purge_threads=1.

The table cache eviction policy will become FIFO-like,
similar to what happened to fil_system.LRU
in commit 45ed9dd957.
The name of the list dict_sys.table_LRU will become somewhat misleading;
that list contains tables that may be evicted, even though the
eviction policy no longer is least-recently-used but first-in-first-out.
(Note: Tables can never be evicted as long as locks exist on them or
the tables are in use by some thread.)

As demonstrated by the test perfschema.sxlock_func, there
will be less contention on dict_sys.latch, because some previous
use of exclusive latches will be replaced with shared latches.

fts_parse_sql_no_dict_lock(): Replaced with pars_sql().

fts_get_table_name_prefix(): Merged to fts_optimize_create().

dict_stats_update_transient_for_index(): Deduplicated some code.

ha_innobase::info_low(), dict_stats_stop_bg(): Use a combination
of dict_sys.latch and table->stats_mutex_lock() to cover the
changes of BG_STAT_SHOULD_QUIT, because the flag is being read
in dict_stats_update_persistent() while not holding dict_sys.latch.

row_discard_tablespace_for_mysql(): Protect stats_bg_flag by
exclusive dict_sys.latch, like most other code does.

row_quiesce_table_has_fts_index(): Remove unnecessary mutex
acquisition. FLUSH TABLES...FOR EXPORT is protected by MDL.

row_import::set_root_by_heuristic(): Remove unnecessary mutex
acquisition. ALTER TABLE...IMPORT TABLESPACE is protected by MDL.

row_ins_sec_index_entry_low(): Replace a call
to dict_set_corrupted_index_cache_only(). Reads of index->type
were not really protected by dict_sys.mutex, and writes
(flagging an index corrupted) should be extremely rare.

dict_stats_process_entry_from_defrag_pool(): Only freeze the dictionary,
do not lock it exclusively.

dict_stats_wait_bg_to_stop_using_table(), DICT_BG_YIELD: Remove trx.
We can simply invoke dict_sys.unlock() and dict_sys.lock() directly.

dict_acquire_mdl_shared()<trylock=false>: Assert that dict_sys.latch is
only held in shared more, not exclusive mode. Only acquire it in
exclusive mode if the table needs to be loaded to the cache.

dict_sys_t::acquire(): Remove. Relocating elements in dict_sys.table_LRU
would require holding an exclusive latch, which we want to avoid
for performance reasons.

dict_sys_t::allow_eviction(): Add the table first to dict_sys.table_LRU,
to compensate for the removal of dict_sys_t::acquire(). This function
is only invoked by INFORMATION_SCHEMA.INNODB_SYS_TABLESTATS.

dict_table_open_on_id(), dict_table_open_on_name(): If dict_locked=false,
try to acquire dict_sys.latch in shared mode. Only acquire the latch in
exclusive mode if the table is not found in the cache.

Reviewed by: Thirunarayanan Balathandayuthapani
2021-08-31 13:51:35 +03:00
Marko Mäkelä
49f95c4065 Merge 10.5 into 10.6 2021-08-23 11:21:33 +03:00
Marko Mäkelä
2c9f2a4c8c Merge 10.4 into 10.5 2021-08-23 11:10:59 +03:00
Marko Mäkelä
2b66cd2493 Merge 10.3 into 10.4 2021-08-23 10:44:06 +03:00
Marko Mäkelä
cfbdb5d210 Merge 10.2 into 10.3 2021-08-23 10:14:01 +03:00
Marko Mäkelä
ca89489716 MDEV-26383 fixup: Consistently protect freed_indexes with autoinc_mutex
To avoid potential race conditions between concurrent access to
dict_table_t::freed_indexes, let us consistently use
dict_table_t::autoinc_mutex.

dict_table_remove_from_cache_low(): To avoid extensive hold time
of table->autoinc_mutex, unconditionally free the FTS data structures.
2021-08-23 10:06:21 +03:00
Thirunarayanan Balathandayuthapani
08e5a3d2e3 MDEV-26383 ASAN heap-use-after-free failure in btr_search_lazy_free
Problem:
=======
The last AHI page for two indexes of an dropped table is being
freed at the same time by two threads. One thread frees the
table heap and other thread tries to access table heap again.
It leads to asan failure in btr_search_lazy_free().

Solution:
========
InnoDB uses autoinc_mutex to avoid the race condition
in btr_search_lazy_free()
2021-08-21 12:38:10 +05:30
Marko Mäkelä
f3fcf5f45c Merge 10.5 to 10.6 2021-08-19 12:25:00 +03:00
Marko Mäkelä
4a25957274 Merge 10.4 into 10.5 2021-08-18 18:22:35 +03:00