For clang compiler the compiler's flag -Wno-unused-but-set-variable
was set based on compiler version. This approach could result in
false positive detection for presence of compiler option since
only first three groups of digits in compiler version taken into account
and it could lead to inaccuracy in determining of supported compiler's
features.
Correct way to detect options supported by a compiler is to use
the macros MY_CHECK_CXX_COMPILER_FLAG and to check the result of
variable with prefix have_CXX__
So, to check whether compiler does support the option
-Wno-unused-but-set-variable
the macros
MY_CHECK_CXX_COMPILER_FLAG(-Wno-unused-but-set-variable)
should be called and the result variable
have_CXX__Wno_unused_but_set_variable
be tested for assigned value.
To avoid heap memory allocation overhead for mtr_t::m_memo,
we will allocate a small number of elements statically in
mtr_t::m_memo::small. Only if that preallocated data is
insufficient, we will invoke my_alloc() or my_realloc() for
more storage. The implementation of the data structure is
inspired by llvm::SmallVector.
btr_cur_t: Zero-initialize all fields in the default constructor.
btr_cur_t::index: Remove; it duplicated page_cur.index.
Many functions: Remove arguments that were duplicating
page_cur_t::index and page_cur_t::block.
page_cur_open_level(), btr_pcur_open_level(): Replaces
btr_cur_open_at_index_side() for dict_stats_analyze_index().
At the end, release all latches except the dict_index_t::lock
and the buf_page_t::lock on the requested page.
dict_stats_analyze_index(): Rely on mtr_t::rollback_to_savepoint()
to release all uninteresting page latches.
btr_search_guess_on_hash(): Simplify the logic, and invoke
mtr_t::rollback_to_savepoint().
We will use plain C++ std::vector<mtr_memo_slot_t> for mtr_t::m_memo.
In this way, we can avoid setting mtr_memo_slot_t::object to nullptr
and instead just remove garbage from m_memo.
mtr_t::rollback_to_savepoint(): Shrink the vector. We will be needing this
in dict_stats_analyze_index(), where we will release page latches and
only retain the index->lock in mtr_t::m_memo.
mtr_t::release_last_page(): Release the last acquired page latch.
Replaces btr_leaf_page_release().
mtr_t::release(const buf_block_t&): Release a single page latch.
Used in btr_pcur_move_backward_from_page().
mtr_t::memo_release(): Replaced with mtr_t::release().
mtr_t::upgrade_buffer_fix(): Acquire a latch for a buffer-fixed page.
This replaces the double bookkeeping in btr_cur_t::open_leaf().
Reviewed by: Vladislav Lesin
Until now, the attribute EXTENDED of CHECK TABLE was ignored by InnoDB,
and InnoDB only counted the records in each index according
to the current read view. Unless the attribute QUICK was specified, the
function btr_validate_index() would be invoked to validate the B-tree
structure (the sibling and child links between index pages).
The EXTENDED check will not only count all index records according to the
current read view, but also ensure that any delete-marked records in the
clustered index are waiting for the purge of history, and that all
secondary index records point to a version of the clustered index record
that is waiting for the purge of history. In other words, no index may
contain orphan records. Normal MVCC reads and the non-EXTENDED version
of CHECK TABLE would ignore these orphans.
Unpurged records merely result in warnings (at most one per index),
not errors, and no indexes will be flagged as corrupted due to such
garbage. It will remain possible to SELECT data from such indexes or
tables (which will skip such records) or to rebuild the table to
reclaim some space.
We introduce purge_sys.end_view that will be (almost) a copy of
purge_sys.view at the end of a batch of purging committed transaction
history. It is not an exact copy, because if the size of a purge batch
is limited by innodb_purge_batch_size, some records that
purge_sys.view would allow to be purged will be left over for
subsequent batches.
The purge_sys.view is relevant in the purge of committed transaction
history, to determine if records are safe to remove. The new
purge_sys.end_view is relevant in MVCC operations and in
CHECK TABLE ... EXTENDED. It tells which undo log records are
safe to access (have not been discarded at the end of a purge batch).
purge_sys.clone_oldest_view<true>(): In trx_lists_init_at_db_start(),
clone the oldest read view similar to purge_sys_t::clone_end_view()
so that CHECK TABLE ... EXTENDED will not report bogus failures between
InnoDB restart and the completed purge of committed transaction history.
purge_sys_t::is_purgeable(): Replaces purge_sys_t::changes_visible()
in the case that purge_sys.latch will not be held by the caller.
Among other things, this guards access to BLOBs. It is not safe to
dereference any BLOBs of a delete-marked purgeable record, because
they may have already been freed.
purge_sys_t::view_guard::view(): Return a reference to purge_sys.view
that will be protected by purge_sys.latch, held by purge_sys_t::view_guard.
purge_sys_t::end_view_guard::view(): Return a reference to
purge_sys.end_view while it is protected by purge_sys.end_latch.
Whenever a thread needs to retrieve an older version of a clustered
index record, it will hold a page latch on the clustered index page
and potentially also on a secondary index page that points to the
clustered index page. If these pages contain purgeable records that
would be accessed by a currently running purge batch, the progress of
the purge batch would be blocked by the page latches. Hence, it is
safe to make a copy of purge_sys.end_view while holding an index page
latch, and consult the copy of the view to determine whether a record
should already have been purged.
btr_validate_index(): Remove a redundant check.
row_check_index_match(): Check if a secondary index record and a
version of a clustered index record match each other.
row_check_index(): Replaces row_scan_index_for_mysql().
Count the records in each index directly, duplicating the relevant
logic from row_search_mvcc(). Initialize check_table_extended_view
for CHECK ... EXTENDED while holding an index leaf page latch.
If we encounter an orphan record, the copy of purge_sys.end_view that
we make is safe for visibility checks, and trx_undo_get_undo_rec() will
check for the safety to access each undo log record. Should that check
fail, we should return DB_MISSING_HISTORY to report a corrupted index.
The EXTENDED check tries to match each secondary index record with
every available clustered index record version, by duplicating the logic
of row_vers_build_for_consistent_read() and invoking
trx_undo_prev_version_build() directly.
Before invoking row_check_index_match() on delete-marked clustered index
record versions, we will consult purge_sys.is_purgeable() in order to
avoid accessing freed BLOBs.
We will always check that the DB_TRX_ID or PAGE_MAX_TRX_ID does not
exceed the global maximum. Orphan secondary index records will be
flagged only if everything up to PAGE_MAX_TRX_ID has been purged.
We warn also about clustered index records whose nonzero DB_TRX_ID
should have been reset in purge or rollback.
trx_set_rw_mode(): Move an assertion from ReadView::set_creator_trx_id().
trx_undo_prev_version_build(): Remove two debug-only parameters,
and return an error code instead of a Boolean.
trx_undo_get_undo_rec(): Return a pointer to the undo log record,
or nullptr if one cannot be retrieved. Instead of consulting the
purge_sys.view, consult the purge_sys.end_view to determine which
records can be accessed.
trx_undo_get_rec_if_purgeable(): A variant of trx_undo_get_undo_rec()
that will consult purge_sys.view instead of purge_sys.end_view.
TRX_UNDO_CHECK_PURGEABILITY: A new parameter to
trx_undo_prev_version_build(), passed by row_vers_old_has_index_entry()
so that purge_sys.view instead of purge_sys.end_view will be consulted
to determine whether a secondary index record may be safely purged.
row_upd_changes_disowned_external(): Remove. This should be more
expensive than briefly latching purge_sys in trx_undo_prev_version_build()
(which may make use of transactional memory).
row_sel_reset_old_vers_heap(): New function, split from
row_sel_build_prev_vers_for_mysql().
row_sel_build_prev_vers_for_mysql(): Reorder some parameters
to simplify the call to row_sel_reset_old_vers_heap().
row_search_for_mysql(): Replaced with direct calls to row_search_mvcc().
sel_node_get_nth_plan(): Define inline in row0sel.h
open_step(): Define at the call site, in simplified form.
sel_node_reset_cursor(): Merged with the only caller open_step().
---
ReadViewBase::check_trx_id_sanity(): Remove.
Let us handle "future" DB_TRX_ID in a more meaningful way:
row_sel_clust_sees(): Return DB_SUCCESS if the record is visible,
DB_SUCCESS_LOCKED_REC if it is invisible, and DB_CORRUPTION if
the DB_TRX_ID is in the future.
row_undo_mod_must_purge(), row_undo_mod_clust(): Silently ignore
corrupted DB_TRX_ID. We are in ROLLBACK, and we should have noticed
that corruption when we were about to modify the record in the first
place (leading us to refuse the operation).
row_vers_build_for_consistent_read(): Return DB_CORRUPTION if
DB_TRX_ID is in the future.
Tested by: Matthias Leich
Reviewed by: Vladislav Lesin
The reason why mysql/mysql-server@8020cfac20
split the files was some unit tests that never existed in the
MariaDB Server code base. The storage/innobase/unittest/ works just fine
with this file.
This is reverting part of 2e814d4702
which applied InnoDB changes from MySQL 5.7.9.
trx_undo_rec_copy(): Return nullptr if the undo record is corrupted.
trx_undo_rec_get_undo_no(): Define inline with the declaration.
trx_purge_dummy_rec: Replaced with a -1 pointer.
row_undo_rec_get(), UndorecApplier::apply_undo_rec(): Check
if trx_undo_rec_copy() returned nullptr.
trx_purge_get_next_rec(): Return nullptr upon encountering any
corruption, to signal the end of purge.
fil_page_type_validate(): Remove. This debug check was mostly redundant
and added little value to the code paths that deal with page_compressed
or encrypted pages.
fil_get_page_type_name(): Remove; unused function.
fil_space_decrypt(): Return an error if the page is not
supposed to be encrypted. It is possible that an unencrypted page
contains a nonzero key_version field even though it is not supposed
to be encrypted. Previously we would crash in such a situation.
buf_page_decrypt_after_read(): Simplify the code. Remove some
unnecessary error message about temporary tablespace corruption.
This is where we would usually invoke fil_space_decrypt().
The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.
We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.
This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.
We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.
Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.
buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.
btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.
btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().
FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.
trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
trx_sys_create_sys_pages(): Merged with trx_sysf_create().
dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.
dir_pathname(): Replaces os_file_make_new_pathname().
row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.
btr_decryption_failed(): Report decryption failures
dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.
dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.
dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.
dict_table_is_corrupted(): Remove.
dict_index_t::is_btree(): Determine if the index is a valid B-tree.
BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures, but error codes being returned.
buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().
fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.
btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.
opt_search_plan_for_table(): Simplify the code.
row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.
row_undo_mod_upd_exist_sec(): Simplify the code.
row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().
fut_get_ptr(): Replace with buf_page_get_gen() calls.
buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.
buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.
fseg_page_is_allocated(): Replaces fseg_page_is_free().
fts_drop_common_tables(): Return an error if the transaction
was rolled back.
fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.
fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.
Clean up mtr_t::page_lock()
buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.
buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.
recv_recover_page(): Return whether the operation succeeded.
recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.
- InnoDB DDL results in `Duplicate entry' if concurrent DML throws
duplicate key error. The following scenario explains the problem
connection con1:
ALTER TABLE t1 FORCE;
connection con2:
INSERT INTO t1(pk, uk) VALUES (2, 2), (3, 2);
In connection con2, InnoDB throws the 'DUPLICATE KEY' error because
of unique index. Alter operation will throw the error when applying
the concurrent DML log.
- Inserting the duplicate key for unique index logs the insert
operation for online ALTER TABLE. When insertion fails,
transaction does rollback and it leads to logging of
delete operation for online ALTER TABLE.
While applying the insert log entries, alter operation
encounters 'DUPLICATE KEY' error.
- To avoid the above fake duplicate scenario, InnoDB should
not write any log for online ALTER TABLE before DML transaction
commit.
- User thread which does DML can apply the online log if
InnoDB ran out of online log and index is marked as completed.
Set online log error if apply phase encountered any error.
It can also clear all other indexes log, marks the newly
added indexes as corrupted.
- Removed the old online code which was a part of DML operations
commit_inplace_alter_table() : Does apply the online log
for the last batch of secondary index log and does frees
the log for the completed index.
trx_t::apply_online_log: Set to true while writing the undo
log if the modified table has active DDL
trx_t::apply_log(): Apply the DML changes to online DDL tables
dict_table_t::is_active_ddl(): Returns true if the table
has an active DDL
dict_index_t::online_log_make_dummy(): Assign dummy value
for clustered index online log to indicate the secondary
indexes are being rebuild.
dict_index_t::online_log_is_dummy(): Check whether the online
log has dummy value
ha_innobase_inplace_ctx::log_failure(): Handle the apply log
failure for online DDL transaction
row_log_mark_other_online_index_abort(): Clear out all other
online index log after encountering the error during
row_log_apply()
row_log_get_error(): Get the error happened during row_log_apply()
row_log_online_op(): Does apply the online log if index is
completed and ran out of memory. Returns false if apply log fails
UndorecApplier: Introduced a class to maintain the undo log
record, latched undo buffer page, parse the undo log record,
maintain the undo record type, info bits and update vector
UndorecApplier::get_old_rec(): Get the correct version of the
clustered index record that was modified by the current undo
log record
UndorecApplier::clear_undo_rec(): Clear the undo log related
information after applying the undo log record
UndorecApplier::log_update(): Handle the update, delete undo
log and apply it on online indexes
UndorecApplier::log_insert(): Handle the insert undo log
and apply it on online indexes
UndorecApplier::is_same(): Check whether the given roll pointer
is generated by the current undo log record information
trx_t::rollback_low(): Set apply_online_log for the transaction
after partially rollbacked transaction has any active DDL
prepare_inplace_alter_table_dict(): After allocating the online
log, InnoDB does create fulltext common tables. Fulltext index
doesn't allow the index to be online. So removed the dead
code of online log removal
Thanks to Marko Mäkelä for providing the initial prototype and
Matthias Leich for testing the issue patiently.
CMAKE_SYSTEM_PROCESSOR on AIX is "powerpc". To
deconflict with the Linux 32bit arch of the same
name, CMAKE_SYSTEM_NAME was used in the CMakeLists.txt
test to enable -mhtm in the same way that was required
for Linux ppc64{,le} compilers in MDEV-27936
Per https://gcc.gnu.org/onlinedocs/gcc/PowerPC-Hardware-Transactional-Memory-Built-in-Functions.html
The .. high level HTM interface .. is common between PowerPC and S/390
Reimplemented the transactional_lock_enabled() detection mechanism for
s390x and POWER based on SIGILL. This also gives non-Linux based unixes
the ability to use HTM. The implementation is based off openssl.
(ref:
1c0eede982/crypto/s390xcap.c (L104))
The other ppc64{,le} problems with getauxvec based detection:
* Checking PPC_FEATURE2_HTM_NOSC not needed as we do not do syscalls while
in a transactional state.
* As we don't use, and never should use PPC_FEATURE2_HTM_NO_SUSPEND,
or do syscalls while in transactional state, don't test it.
From: https://www.kernel.org/doc/html/v5.4/powerpc/syscall64-abi.html#transactional-memory
S390x high level __builtin_tbegin functions in the htmxlintrin.h are not
inline. This header file can be included once in the entire set of sources for
a linked target, otherwise duplicate symbols occur. While we could use inline
xabort/xend functions using the low level interface, we keep this the same as
ppc64 for simplicity.
SLES-15, gcc-7, appeared to want everything that included the htmlxlintrin to
be compiled with -mhtm otherwise the __builtin_t{func} where not defined
(in addition to a #ifdef __HTM__ #error). Debian sid gcc-11.2 wanted the same
on ppc64le/ppc64. In general we want to avoid a wide spread use of architecture
cflags as it makes justifications for selective optimizations easier.
(ref: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006702)
There is only a very small range of gcc compiler versions
that allow the built_{htm} functions to be defined without -mhtm
being specified as a global C{,XX}FLAGS.
Because the design is centered around enable HTM only in the
functional blocks that use it, this breaks on the inclusion
of the htmxlintrin.h header that includes this.
As a partial mitigation, extented to GNU/clang compilers,
transaction functions gain the attribute "hot".
In general the use of htm is around the optimistic
transaction ability of the function. The key part of using the
hot attribute is to place these functions together so that
a maximization of icache, tlb and OS paging can ensure that
these can be ready to execute by any thread/cpu with the
minimum amount of overhead.
POWER is particularly affected here because the xbegin/xend
functions are not inline.
srw_lock.cc requires the -mhtm cflag, both in the storage
engine and the unit tests.
buf_page_t::frame: Moved from buf_block_t::frame.
All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED
pages will have frame=nullptr, while all 'fat' buf_block_t
will have a non-null frame pointing to aligned innodb_page_size bytes.
This eliminates the need for separate states for
BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE.
buf_page_t:🔒 Moved from buf_block_t::lock. That is, all block
descriptors will have a page latch. The IO_PIN state that was used
for discarding or creating the uncompressed page frame of a
ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix
and page X-latch.
page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status
of buf_page_t with a single std::atomic<uint32_t>. All modifications
will use store(), fetch_add(), fetch_sub(). This space was previously
wasted to alignment on 64-bit systems. We will use the following encoding
that combines a state (partly read-fix or write-fix) and a buffer-fix
count:
buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED)
buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY)
buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH)
buf_page_t::FREED=3 + fix: pages marked as freed in the file
buf_page_t::UNFIXED=1U<<29 + fix: normal pages
buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge
buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite)
buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched)
buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched)
buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf
buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite)
buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to
UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch.
buf_page_t::read_complete(): Renamed from buf_page_read_complete().
Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch.
buf_page_t::can_relocate(): If the page latch is being held or waited for,
or the block is buffer-fixed or io-fixed, return false. (The condition
on the page latch is new.)
Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we
will acquire the page latch before fix(), and unfix() before unlocking.
buf_page_t::flush(): Replaces buf_flush_page(). Optimize the
handling of FREED pages.
buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held
by the caller.
buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates.
buf_page_get_low(): Ignore guesses that are read-fixed because they
may not yet be registered in buf_pool.page_hash and buf_pool.LRU.
buf_page_optimistic_get(): Acquire latch before buffer-fixing.
buf_page_make_young(): Leave read-fixed blocks alone, because they
might not be registered in buf_pool.LRU yet.
recv_sys_t::recover_deferred(), recv_sys_t::recover_low():
Possibly fix MDEV-26326, by holding a page X-latch instead of
only buffer-fixing the page.
InnoDB commit fails when consecutive FTS_DOC_ID value
is greater than 4294967295.
Fix is that InnoDB should remove the delta FTS_DOC_ID
value limitations and fts should encode 8 byte value,
remove FTS_DOC_ID_MAX_STEP variable. Replaced the
fts0vlc.ic file with fts0vlc.h
fts_encode_int(): Should be able to encode 10 bytes value
fts_get_encoded_len(): Should get the length of the value
which has 10 bytes
fts_decode_vlc(): Add debug assertion to verify the maximum
length allowed is 10.
mach_read_uint64_little_endian(): Reads 64 bit stored in
little endian format
Added a unit test case which check for minimum and maximum
value to do the fts encoding
1. rename option DEPENDENCIES in MYSQL_ADD_PLUGIN() to DEPENDS
to be consistent with other cmake commands and macros
2. use this DEPENDS option in plugins
3. add dependencies to the plugin embedded target too
4. plugins don't need to add GenError dependency explicitly,
all plugins depend on it automatically
This is a complete rewrite of DROP TABLE, also as part of other DDL,
such as ALTER TABLE, CREATE TABLE...SELECT, TRUNCATE TABLE.
The background DROP TABLE queue hack is removed.
If a transaction needs to drop and create a table by the same name
(like TRUNCATE TABLE does), it must first rename the table to an
internal #sql-ib name. No committed version of the data dictionary
will include any #sql-ib tables, because whenever a transaction
renames a table to a #sql-ib name, it will also drop that table.
Either the rename will be rolled back, or the drop will be committed.
Data files will be unlinked after the transaction has been committed
and a FILE_RENAME record has been durably written. The file will
actually be deleted when the detached file handle returned by
fil_delete_tablespace() will be closed, after the latches have been
released. It is possible that a purge of the delete of the SYS_INDEXES
record for the clustered index will execute fil_delete_tablespace()
concurrently with the DDL transaction. In that case, the thread that
arrives later will wait for the other thread to finish.
HTON_TRUNCATE_REQUIRES_EXCLUSIVE_USE: A new handler flag.
ha_innobase::truncate() now requires that all other references to
the table be released in advance. This was implemented by Monty.
ha_innobase::delete_table(): If CREATE TABLE..SELECT is detected,
we will "hijack" the current transaction, drop the table in
the current transaction and commit the current transaction.
This essentially fixes MDEV-21602. There is a FIXME comment about
making the check less failure-prone.
ha_innobase::truncate(), ha_innobase::delete_table():
Implement a fast path for temporary tables. We will no longer allow
temporary tables to use the adaptive hash index.
dict_table_t::mdl_name: The original table name for the purpose of
acquiring MDL in purge, to prevent a race condition between a
DDL transaction that is dropping a table, and purge processing
undo log records of DML that had executed before the DDL operation.
For #sql-backup- tables during ALTER TABLE...ALGORITHM=COPY, the
dict_table_t::mdl_name will differ from dict_table_t::name.
dict_table_t::parse_name(): Use mdl_name instead of name.
dict_table_rename_in_cache(): Update mdl_name.
For the internal FTS_ tables of FULLTEXT INDEX, purge would
acquire MDL on the FTS_ table name, but not on the main table,
and therefore it would be able to run concurrently with a
DDL transaction that is dropping the table. Previously, the
DROP TABLE queue hack prevented a race between purge and DDL.
For now, we introduce purge_sys.stop_FTS() to prevent purge from
opening any table, while a DDL transaction that may drop FTS_
tables is in progress. The function fts_lock_table(), which will
be invoked before the dictionary is locked, will wait for
purge to release any table handles.
trx_t::drop_table_statistics(): Drop statistics for the table.
This replaces dict_stats_drop_index(). We will drop or rename
persistent statistics atomically as part of DDL transactions.
On lock conflict for dropping statistics, we will fail instantly
with DB_LOCK_WAIT_TIMEOUT, because we will be holding the
exclusive data dictionary latch.
trx_t::commit_cleanup(): Separated from trx_t::commit_in_memory().
Relax an assertion around fts_commit() and allow DB_LOCK_WAIT_TIMEOUT
in addition to DB_DUPLICATE_KEY. The call to fts_commit() is
entirely misplaced here and may obviously break the consistency
of transactions that affect FULLTEXT INDEX. It needs to be fixed
separately.
dict_table_t::n_foreign_key_checks_running: Remove (MDEV-21175).
The counter was a work-around for missing meta-data locking (MDL)
on the SQL layer, and not really needed in MariaDB.
ER_TABLE_IN_FK_CHECK: Replaced with ER_UNUSED_28.
HA_ERR_TABLE_IN_FK_CHECK: Remove.
row_ins_check_foreign_constraints(): Do not acquire
dict_sys.latch either. The SQL-layer MDL will protect us.
This was reviewed by Thirunarayanan Balathandayuthapani
and tested by Matthias Leich.
Many InnoDB data dictionary cache operations require that the
table name be copied so that it will be NUL terminated.
(For example, SYS_TABLES.NAME is not guaranteed to be NUL-terminated.)
dict_table_t::is_garbage_name(): Check if a name belongs to
the background drop table queue.
dict_check_if_system_table_exists(): Remove.
dict_sys_t::load_sys_tables(): Load the non-hard-coded system tables
SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL on startup.
dict_sys_t::create_or_check_sys_tables(): Replaces
dict_create_or_check_foreign_constraint_tables() and
dict_create_or_check_sys_virtual().
dict_sys_t::load_table(): Replaces dict_table_get_low()
and dict_load_table().
dict_sys_t::find_table(): Renamed from get_table().
dict_sys_t::sys_tables_exist(): Check whether all the non-hard-coded
tables SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL exist.
trx_t::has_stats_table_lock(): Moved to dict0stats.cc.
Some error messages will now report table names in the internal
databasename/tablename format, instead of `databasename`.`tablename`.
In the SUX_LOCK_GENERIC implementation, we can remember at most
one pending exclusive lock request. If multiple exclusive lock
requests are pending, the WRITER_WAITING flag will be cleared when
the first waiting writer acquires the exclusive lock.
ssux_lock_low::update_lock(): If WRITER_WAITING is set, wake up
the writer even if the UPDATER flag is set, because the waiting
writer may be in the process of upgrading its U lock to X.
rw_lock::read_unlock(): Also indicate that an X lock waiter must
be woken up if an U lock exists.
This fix may cause unnecessary wake-ups and system calls, but this
is the best that we can do. Ideally we would use the MDEV-25404
idea of a separate 'writer' mutex, but there is no portable way to
request that a non-recursive mutex be created, and InnoDB requires
the ability to transfer buf_block_t::lock ownership to an I/O thread.
To allow problems like this to be caught more reliably in the future,
we add a unit test for srw_mutex, srw_lock, ssux_lock, sux_lock.
- use FIND_PACKAGE(LIBAIO) to find libaio
- Use standard CMake conventions in Find{PMEM,URING}.cmake
- Drop the LIB from LIB{PMEM,URING}_{INCLUDE_DIR,LIBRARIES}
It is cleaner, and consistent with how other packages are handled in CMake.
e.g successful FIND_PACKAGE(PMEM) now sets PMEM_FOUND, PMEM_LIBRARIES,
PMEM_INCLUDE_DIR, not LIBPMEM_{FOUND,LIBRARIES,INCLUDE_DIR}.
- Decrease the output. use FIND_PACKAGE with QUIET argument.
- for Linux packages, either liburing, or libaio is required
If liburing is installed, libaio does not need to be present .
Use FIND_PACKAGE([LIBAIO|URING] REQUIRED) if either library is required.
The new default values WITH_URING:BOOL=OFF, WITH_PMEM:BOOL=OFF imply
that the dependencies are optional.
An explicit request WITH_URING=ON or WITH_PMEM=ON will cause the
build to fail if the requested dependencies are not available.
Last, to prevent a feature to be built in even though the built-time
dependencies are available, the following can be used:
cmake -DCMAKE_DISABLE_FIND_PACKAGE_URING=1
cmake -DCMAKE_DISABLE_FIND_PACKAGE_PMEM=1
This cleanup was suggested by Vladislav Vaintroub.
The DeadlockChecker expects to be able to freeze the waits-for graph.
Hence, it is best executed somewhere where we are not holding any
additional mutexes.
lock_wait(): Defer the deadlock check to this function, instead
of executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting().
DeadlockChecker::trx_rollback(): Merge with the only caller,
check_and_resolve().
LockMutexGuard: RAII accessor for lock_sys.mutex.
lock_sys.deadlocks: Replaces lock_deadlock_found.
trx_t: Clean up some comments.
WaitOnAddress() turns out to be too CPU-heavy for the specific scenario,
which makes it prominent in profiler output on several benchmarks with
contended sux_lock.
The condition variable implementation does not show the same behavior.
Thus, defined SRWLOCK_DUMMY for Windows
srw_mutex should remain mapped to SRWLOCK on Windows (since SRWLOCK is
smaller).
SHOW ENGINE INNODB MUTEX functionality is completely removed,
as are the InnoDB latching order checks.
We will enforce innodb_fatal_semaphore_wait_threshold
only for dict_sys.mutex and lock_sys.mutex.
dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex.
lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex.
FIXME: srv_sys should be removed altogether; it is duplicating tpool
functionality.
fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must
not hold fil_system.mutex.
fil_close_all_files(): To prevent SAFE_MUTEX warnings for
fil_space_destroy_crypt_data(), we must not hold fil_system.mutex
while invoking fil_space_free_low() on a detached tablespace.
We will default to MUTEXTYPE=sys (using OSTrackMutex) for those
ib_mutex_t that have not been replaced yet.
The view INFORMATION_SCHEMA.INNODB_SYS_SEMAPHORE_WAITS is removed.
The parameter innodb_sync_array_size is removed.
FIXME: innodb_fatal_semaphore_wait_threshold will no longer be enforced.
We should enforce it for lock_sys.mutex and dict_sys.mutex somehow!
innodb_sync_debug=ON might still cover ib_mutex_t.
InnoDB buffer pool block and index tree latches depend on a
special kind of read-update-write lock that allows reentrant
(recursive) acquisition of the 'update' and 'write' locks
as well as an upgrade from 'update' lock to 'write' lock.
The 'update' lock allows any number of reader locks from
other threads, but no concurrent 'update' or 'write' lock.
If there were no requirement to support an upgrade from 'update'
to 'write', we could compose the lock out of two srw_lock
(implemented as any type of native rw-lock, such as SRWLOCK on
Microsoft Windows). Removing this requirement is very difficult,
so in commit f7e7f487d4b06695f91f6fbeb0396b9d87fc7bbf we
implemented an 'update' mode to our srw_lock.
Re-entrant or recursive locking is mostly needed when writing or
freeing BLOB pages, but also in crash recovery or when merging
buffered changes to an index page. The re-entrancy allows us to
attach a previously acquired page to a sub-mini-transaction that
will be committed before whatever else is holding the page latch.
The SUX lock supports Shared ('read'), Update, and eXclusive ('write')
locking modes. The S latches are not re-entrant, but a single S latch
may be acquired even if the thread already holds an U latch.
The idea of the U latch is to allow a write of something that concurrent
readers do not care about (such as the contents of BTR_SEG_LEAF,
BTR_SEG_TOP and other page allocation metadata structures, or
the MDEV-6076 PAGE_ROOT_AUTO_INC). (The PAGE_ROOT_AUTO_INC field
is only updated when a dict_table_t for the table exists, and only
read when a dict_table_t for the table is being added to dict_sys.)
block_lock::u_lock_try(bool for_io=true) is used in buf_flush_page()
to allow concurrent readers but no concurrent modifications while the
page is being written to the data file. That latch will be released
by buf_page_write_complete() in a different thread. Hence, we use
the special lock owner value FOR_IO.
The index_lock::u_lock() improves concurrency on operations that
involve non-leaf index pages.
The interface has been cleaned up a little. We will use
x_lock_recursive() instead of x_lock() when we know that a
lock is already held by the current thread. Similarly,
a lock upgrade from U to X is only allowed via u_x_upgrade()
or x_lock_upgraded() but not via x_lock().
We will disable the LatchDebug and sync_array interfaces to
InnoDB rw-locks.
The SEMAPHORES section of SHOW ENGINE INNODB STATUS output
will no longer include any information about InnoDB rw-locks,
only TTASEventMutex (cmake -DMUTEXTYPE=event) waits.
This will make a part of the 'innotop' script dead code.
The block_lock buf_block_t::lock will not be covered by any
PERFORMANCE_SCHEMA instrumentation.
SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES
will no longer output source code file names or line numbers.
The dict_index_t::lock will be identified by index and table names,
which should be much more useful. PERFORMANCE_SCHEMA is lumping
information about all dict_index_t::lock together as
event_name='wait/synch/sxlock/innodb/index_tree_rw_lock'.
buf_page_free(): Remove the file,line parameters. The sux_lock will
not store such diagnostic information.
buf_block_dbg_add_level(): Define as empty macro, to be removed
in a subsequent commit.
Unless the build was configured with cmake -DPLUGIN_PERFSCHEMA=NO
the index_lock dict_index_t::lock will be instrumented via
PERFORMANCE_SCHEMA. Similar to
commit 1669c8890c
we will distinguish lock waits by registering shared_lock,exclusive_lock
events instead of try_shared_lock,try_exclusive_lock.
Actual 'try' operations will not be instrumented at all.
rw_lock_list: Remove. After MDEV-24167, this only covered
buf_block_t::lock and dict_index_t::lock. We will output their
information by traversing buf_pool or dict_sys.