An InnoDB check for the validity of index pages would occasionally fail
in the test encryption.innodb_encryption_discard_import.
An analysis of a "rr replay" failure trace revealed that the problem
basically is a combination of two old anomalies, and a recently
implemented optimization in MariaDB 10.5.
MDEV-15528 allows InnoDB to discard buffer pool pages that were freed.
PageBulk::init() will disable the InnoDB validity check, because
during native ALTER TABLE (rebuilding tables or creating indexes)
we could write inconsistent index pages to data files.
In the occasional test failure, page 8:6 would have been written
from the buffer pool to the data file and subsequently freed.
However, fil_crypt_thread may perform dummy writes to pages that
have been freed. In case we are causing an inconsistent page to
be re-encrypted on page flush, we should disable the check.
In the analyzed "rr replay" trace, a fil_crypt_thread attempted
to access page 8:6 twice after it had been freed.
On the first call, buf_page_get_gen(..., BUF_PEEK_IF_IN_POOL, ...)
returned NULL. The second call succeeded, and shortly thereafter,
the server intentionally crashed due to writing the corrupted page.
This is a race condition where a table on which a 10.3-style
instant ADD COLUMN is emptied during the execution of
ALTER TABLE ... DROP COLUMN ..., DROP INDEX ..., ALGORITHM=NOCOPY.
In commit 2c4844c9e7 the
function instant_metadata_lock() would prevent this race condition.
But, it would also hold a page latch on the leftmost leaf page of
clustered index for the duration of a possible DROP INDEX operation.
The race could be fixed by restoring the function
instant_metadata_lock() that was removed in
commit ea37b14409
but it would be more future-proof to prevent the
dict_index_t::clear_instant_add() call from being issued at all.
We at some point support DROP COLUMN ..., ADD INDEX ..., ALGORITHM=NOCOPY
and that would spend a non-trivial amount of
execution time in ha_innobase::inplace_alter(),
making a server hang possible. Currently this is not supported
and our added test case will notice when the support is introduced.
dict_index_t::must_avoid_clear_instant_add(): Determine if
a call to clear_instant_add() must be avoided.
btr_discard_only_page_on_level(): Preserve the metadata record
if must_avoid_clear_instant_add() holds.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Do not remove the metadata record even if the table becomes empty
but must_avoid_clear_instant_add() holds.
btr_pcur_store_position(): Relax a debug assertion.
This is joint work with Thirunarayanan Balathandayuthapani.
Apply this patch from Percona Server (amended for 10.5):
commit cd7201514fee78aaf7d3eb2b28d2573c76f53b84
Author: Laurynas Biveinis <laurynas.biveinis@gmail.com>
Date: Tue Nov 14 06:34:19 2017 +0200
Fix bug 1704195 / 87065 / TDB-83 (Stop ANALYZE TABLE from flushing table definition cache)
Make ANALYZE TABLE stop flushing affected tables from the table
definition cache, which has the effect of not blocking any subsequent
new queries involving the table if there's a parallel long-running
query:
- new table flag HA_ONLINE_ANALYZE, return it for InnoDB and TokuDB
tables;
- in mysql_admin_table, if we are performing ANALYZE TABLE, and the
table flag is set, do not remove the table from the table
definition cache, do not invalidate query cache;
- in partitioning handler, refresh the query optimizer statistics
after ANALYZE if the underlying handler supports HA_ONLINE_ANALYZE;
- new testcases main.percona_nonflushing_analyze_debug,
parts.percona_nonflushing_abalyze_debug and a supporting debug sync
point.
For TokuDB, this change exposes bug TDB-83 (Index cardinality stats
updated for handler::info(HA_STATUS_CONST), not often enough for
tokudb_cardinality_scale_percent). TokuDB may return different
rec_per_key values depending on dynamic variable
tokudb_cardinality_scale_percent value. The server does not have a way
of knowing that changing this variable invalidates the previous
rec_per_key values in any opened table shares, and so does not call
info(HA_STATUS_CONST) again. Fix by updating rec_per_key for both
HA_STATUS_CONST and HA_STATUS_VARIABLE. This also forces a re-record
of tokudb.bugs.db756_card_part_hash_1_pick, with the new output
seeming to be more correct.
MDEV-15053 did not remove all unnecessary buf_pool.page_hash S-latch
acquisition. There are code paths where we are holding buf_pool.mutex
(which will sufficiently protect buf_pool.page_hash against changes)
and unnecessarily acquire the latch. Many invocations of
buf_page_hash_get_locked() can be replaced with the much simpler
buf_pool.page_hash_get_low().
In the worst case the thread that is holding buf_pool.mutex will become
a victim of MDEV-22871, suffering from a spurious reader-reader conflict
with another thread that genuinely needs to acquire a buf_pool.page_hash
S-latch.
In many places, we were also evaluating page_id_t::fold() while holding
buf_pool.mutex. Low-level functions such as buf_pool.page_hash_get_low()
must get the page_id_t::fold() as a parameter.
buf_buddy_relocate(): Defer the hash_lock acquisition to the critical
section that starts by calling buf_page_t::can_relocate().
fil_space_t::freed_ranges: Store ranges of freed page numbers.
fil_space_t::last_freed_lsn: Store the most recent LSN of
freeing a page.
fil_space_t::freed_mutex: Protects freed_ranges, last_freed_lsn.
fil_space_create(): Initialize the freed_range mutex.
fil_space_free_low(): Frees the freed_range mutex.
range_set: Ranges of page numbers.
buf_page_create(): Removes the page from freed_ranges when page
is being reused.
btr_free_root(): Remove the PAGE_INDEX_ID invalidation. Because
btr_free_root() and dict_drop_index_tree() are executed in
the same atomic mini-transaction, there is no need to
invalidate the root page.
buf_release_freed_page(): Split from buf_flush_freed_page().
Skip any I/O
buf_flush_freed_pages(): Get the freed ranges from tablespace and
Write punch-hole or zeroes of the freed ranges.
buf_flush_try_neighbors(): Handles the flushing of freed ranges.
mtr_t::freed_pages: Variable to store the list of freed pages.
mtr_t::add_freed_pages(): To add freed pages.
mtr_t::clear_freed_pages(): To clear the freed pages.
mtr_t::m_freed_in_system_tablespace: Variable to indicate whether page has
been freed in system tablespace.
mtr_t::m_trim_pages: Variable to indicate whether the space has been trimmed.
mtr_t::commit(): Add the freed page and update the last freed lsn
in the tablespace and clear the tablespace freed range if space is
trimmed.
file_name_t::freed_pages: Store the freed pages during recovery.
file_name_t::add_freed_page(), file_name_t::remove_freed_page(): To
add and remove freed page during recovery.
store_freed_or_init_rec(): Store or remove the freed pages while
encountering FREE_PAGE or INIT_PAGE redo log record.
recv_init_crash_recovery_spaces(): Add the freed page encountered
during recovery to respective tablespace.
For INET6 columns the values are stored as BINARY columns and returned to the client in TEXT format.
For rocksdb the indexes store mem-comparable images for columns, so use the pack_length() to store
the mem-comparable form for INET6 columns. This would also remain consistent with CHAR columns.
For reads, the buf_pool.page_hash is protected by buf_pool.mutex or
by the hash_lock. There is no need to compute or acquire hash_lock
if we are not modifying the buf_pool.page_hash.
However, the buf_pool.page_hash latch must be held exclusively
when changing buf_page_t::in_file(), or if we desire to prevent
buf_page_t::can_relocate() or buf_page_t::buf_fix_count()
from changing.
rw_lock_lock_word_decr(): Add a comment that explains the polling logic.
buf_page_t::set_state(): When in_file() is to be changed, assert that
an exclusive buf_pool.page_hash latch is being held. Unfortunately
we cannot assert this for set_state(BUF_BLOCK_REMOVE_HASH) because
set_corrupt_id() may already have been called.
buf_LRU_free_page(): Check buf_page_t::can_relocate() before
aqcuiring the hash_lock.
buf_block_t::initialise(): Initialize also page.buf_fix_count().
buf_page_create(): Initialize buf_fix_count while not holding
any mutex or hash_lock. Acquire the hash_lock only for the
duration of inserting the block to the buf_pool.page_hash.
buf_LRU_old_init(), buf_LRU_add_block(),
buf_page_t::belongs_to_unzip_LRU(): Do not assert buf_page_t::in_file(),
because buf_page_create() will invoke buf_LRU_add_block()
before acquiring hash_lock and buf_page_t::set_state().
buf_pool_t::validate(): Rely on the buf_pool.mutex and do not
unnecessarily acquire any buf_pool.page_hash latches.
buf_page_init_for_read(): Clarify that we must acquire the hash_lock
upfront in order to prevent a race with buf_pool_t::watch_remove().
ut_filename_hash(): Add better casts to please the compiler:
warning C4307: '*': integral constant overflow
This regression was introduced in
commit dd77f072f9 (MDEV-22841).
MONITOR_SRV_MEM_VALIDATE_MICROSECOND, MEM_PERIODIC_CHECK,
SRV_MASTER_MEM_VALIDATE_INTERVAL: Remove. These were unused
ever since UNIV_MEM_DEBUG was removed.
MONITOR_SRV_PURGE_MICROSECOND: Remove. This was always unused.
Problematic mutex is dict_sys.mutex.
Idea of the patch: unlink() fd under that mutex while
it's still open. This way unlink() will be fast and
actual file removal will happen on close().
And close() will be called outside of dict_sys.mutex.
This should be safe against crash which may happen between
unlink() and close(): file will be removed by OS anyway.
The same applies to both *nix and Windows.
I created and removed a 4G file on some NVMe SSD on ext4:
write(3, "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1"..., 1048576) = 1048576 <0.000519>
fdatasync(3) = 0 <3.533763>
close(3) = 0 <0.000011>
unlink("file") = 0 <0.411563>
write(3, "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1"..., 1048576) = 1048576 <0.000520>
fdatasync(3) = 0 <3.544938>
unlink("file") = 0 <0.000029>
close(3) = 0 <0.407057>
Such systems can benefit of this patch.
fil_node_t::detach(): closes fil_node_t but not file handle,
returns that file handle
fil_node_t::prepare_to_close_or_deatch(): 'closes' fil_node_t
fil_node_t:close_to_free(): new argument detach_handle
fil_system_t::detach(): now can detach file handles
fil_delete_tablespace(): now can detach file handles
row_drop_table_for_mysql(): performs actual file removal
To install Spider one can simply drop a /etc/mysql/conf.d/spider.cnf like
[mariadb]
plugin-load-add=ha_spider.so
This is automatically generated and installed when plugin is correctly
registered to plugin.cmake with its own component name. Many other plugins
such as Connect and RocksDB install in the same way.
This solved MDEV-19917 as the mere adding and removing of spider.cnf
automatically installs and uninstalls it.
Remove the overly complex and uncecessary install.sql from Spider,
if should not be needed in modern times anymore.
With this change there is no need for a uninstall.sql either.
Change how lookup for the "auto" PSI_memory_keys is done.
Lookup for filename hashes (integers), instead of C strings
Generate these hashes at the compile time with constexpr,
rather than at runtime.
Let us invoke the debug member functions of mtr_t directly.
mtr_t::memo_contains(): Change the parameter type to
const rw_lock_t&. This function cannot be invoked on
buf_block_t::lock.
The function mtr_t::memo_contains_flagged() is intended to be invoked
on buf_block_t* or rw_lock_t*, and it along with
mtr_t::memo_contains_page_flagged() are the way to check whether
a buffer pool page has been latched within a mini-transaction.
xdes_get_state(), fseg_get_nth_frag_page_no(),
fseg_find_free_frag_page_slot(), fseg_find_last_used_frag_page_slot(),
fseg_get_n_frag_pages(), fseg_n_reserved_pages_low(),
fseg_print_low(): Remove the unused parameter mtr, and add
a const qualifier to the pointer to the buffer pool page frame.
svr_n_page_hash_locks: Increase from 16 to 64. Before MDEV-15058,
we used to have the buf_pool.page_hash partitioned to each instance.
rw_lock_lock_word_decr(): Sleep a little in the spinloop.
rw_lock_s_lock_low(): Correct a comment. The function does perform
spinning.
This improves scalability in read-only workloads on a 32-CPU system
when the number of concurrent connections exceeds the CPU core count.
Thanks to Axel Schwenke for running benchmarks.
reduce the amount of engine-specific code in the server,
particularly as it does not serve any purpose now.
may be needed for VP engine,
to be reconsidered in MDEV-7795
buf_LRU_make_block_young(): Merge with buf_page_make_young().
buf_pool_check_no_pending_io(): Remove. Replaced with
buf_pool.any_io_pending() and buf_pool.io_pending(),
which do not unnecessarily acquire buf_pool.mutex.
buf_pool_t::init_flush[]: Use atomic access, so that
buf_flush_wait_LRU_batch_end() can avoid acquiring buf_pool.mutex.
buf_pool_t::try_LRU_scan: Declare as bool.
When MDEV-22769 introduced srv_shutdown_state=SRV_SHUTDOWN_INITIATED in
commit efc70da5fd
we forgot to adjust a few checks for SRV_SHUTDOWN_NONE.
In the initial shutdown step, we are waiting for the background
DROP TABLE queue to be processed or discarded. At that time,
some background tasks (such as buffer pool resizing or dumping
or encryption key rotation) may be terminated, but others must
remain running normally.
srv_purge_coordinator_suspend(), srv_purge_coordinator_thread(),
srv_start_wait_for_purge_to_start(): Treat SRV_SHUTDOWN_NONE
and SRV_SHUTDOWN_INITIATED equally.
create_table_info_t::create_foreign_keys(): Make the create_name buffer
long enough for both the database and table name. It is still not long
enough to hold partition or subpartition names. Because we do never
supported FOREIGN KEY constraints on partitions, we can simply skip
the call to innobase_convert_name() on CREATE TABLE.
lock_check_trx_id_sanity(): Because the argument of UNIV_LIKELY
or __builtin_expect() can be less than sizeof(trx_id_t) on 32-bit
systems, it cannot reliably perform an implicit comparison to 0.
page_zip_fields_decode(): Do not dereference index=NULL.
Instead, return NULL early. The only caller does not care
about the values of output parameters in that case.
This bug was introduced in MySQL 5.7.6 by
mysql/mysql-server@9eae0edb7a
and in MariaDB 10.2.2 by
commit 2e814d4702.
Thanks to my son for pointing this out after investigating
the output of a static analysis tool.
fil_aio_callback(): Remove a bogus assertion that was added in
commit b1ab211dee (MDEV-15053).
We will have !bpage for any write by buf_flush_freed_page().
The background drop table queue in InnoDB is a work-around for
cases where the SQL layer is requesting DDL on tables on which
transactional locks exist.
One such case are XA transactions. Our test case exploits the
fact that the recovery of XA PREPARE transactions will
only resurrect InnoDB table locks, but not MDL that should
block any concurrent DDL.
srv_shutdown_t: Introduce the srv_shutdown_state=SRV_SHUTDOWN_INITIATED
for the initial part of shutdown, to wait for the background drop
table queue to be emptied.
srv_shutdown_bg_undo_sources(): Assign
srv_shutdown_state=SRV_SHUTDOWN_INITIATED
before waiting for the background drop table queue to be emptied.
row_drop_tables_for_mysql_in_background(): On slow shutdown, if
no active transactions exist (excluding ones that are in
XA PREPARE state), skip any tables on which locks exist.
row_drop_table_for_mysql(): Do not unnecessarily attempt to
drop InnoDB persistent statistics for tables that have
already been added to the background drop table queue.
row_mysql_close(): Relax an assertion, and free all memory
even if innodb_force_recovery=2 would prevent the background
drop table queue from being emptied.
This race condition was introduced by
commit ad6171b91c (MDEV-22456).
In the observed case, two threads were executing
btr_search_drop_page_hash_index() on the same block,
to free a stale entry that was attached to a dropped index.
Both threads were holding an S latch on the block.
We must prevent the double-free of block->index by holding
block->lock in exclusive mode.
btr_search_guess_on_hash(): Do not invoke
btr_search_drop_page_hash_index(block) to get rid of
stale entries, because we are not necessarily holding
an exclusive block->lock here.
buf_defer_drop_ahi(): New function, to safely drop stale
entries in buf_page_mtr_lock(). We will skip the call to
btr_search_drop_page_hash_index(block) when only requesting
bufferfixing (no page latch), because in that case, we should
not be accessing the adaptive hash index, and we might get
a deadlock if we acquired the page latch.