In commit 1bd681c8b3 (MDEV-25506 part 3)
we introduced a "fake instant timeout" when a transaction would wait
for a table or record lock while holding dict_sys.latch. This prevented
a deadlock of the server but could cause bogus errors for operations
on the InnoDB persistent statistics tables.
A better fix is to ensure that whenever a transaction is being
executed in the InnoDB internal SQL parser (which will for now
require dict_sys.latch to be held), it will already have acquired
all locks that could be required for the execution. So, we will
acquire the following locks upfront, before acquiring dict_sys.latch:
(1) MDL on the affected user table (acquired by the SQL layer)
(2) If applicable (not for RENAME TABLE): InnoDB table lock
(3) If persistent statistics are going to be modified:
(3.a) MDL_SHARED on mysql.innodb_table_stats, mysql.innodb_index_stats
(3.b) exclusive table locks on the statistics tables
(4) Exclusive table locks on the InnoDB data dictionary tables
(not needed in ANALYZE TABLE and the like)
Note: Acquiring exclusive locks on the statistics tables may cause
more locking conflicts between concurrent DDL operations.
Notably, RENAME TABLE will lock the statistics tables
even if no persistent statistics are enabled for the table.
DROP DATABASE will only acquire locks on statistics tables if
persistent statistics are enabled for the tables on which the
SQL layer is invoking ha_innobase::delete_table().
For any "garbage collection" in innodb_drop_database(), a timeout
while acquiring locks on the statistics tables will result in any
statistics not being deleted for any tables that the SQL layer
did not know about.
If innodb_defragment=ON, information may be written to the statistics
tables even for tables for which InnoDB persistent statistics are
disabled. But, DROP TABLE will no longer attempt to delete that
information if persistent statistics are not enabled for the table.
This change should also fix the hangs related to InnoDB persistent
statistics and STATS_AUTO_RECALC (MDEV-15020) as well as
a bug that running ALTER TABLE on the statistics tables
concurrently with running ALTER TABLE on InnoDB tables could
cause trouble.
lock_rec_enqueue_waiting(), lock_table_enqueue_waiting():
Do not issue a fake instant timeout error when the transaction
is holding dict_sys.latch. Instead, assert that the dict_sys.latch
is never being held here.
lock_sys_tables(): A new function to acquire exclusive locks on all
dictionary tables, in case DROP TABLE or similar operation is
being executed. Locking non-hard-coded tables is optional to avoid
a crash in row_merge_drop_temp_indexes(). The SYS_VIRTUAL table was
introduced in MySQL 5.7 and MariaDB Server 10.2. Normally, we require
all these dictionary tables to exist before executing any DDL, but
the function row_merge_drop_temp_indexes() is an exception.
When upgrading from MariaDB Server 10.1 or MySQL 5.6 or earlier,
the table SYS_VIRTUAL would not exist at this point.
ha_innobase::commit_inplace_alter_table(): Invoke
log_write_up_to() while not holding dict_sys.latch.
dict_sys_t::remove(), dict_table_close(): No longer try to
drop index stubs that were left behind by aborted online ADD INDEX.
Such indexes should be dropped from the InnoDB data dictionary by
row_merge_drop_indexes() as part of the failed DDL operation.
Stubs for aborted indexes may only be left behind in the
data dictionary cache.
dict_stats_fetch_from_ps(): Use a normal read-only transaction.
ha_innobase::delete_table(), ha_innobase::truncate(), fts_lock_table():
While waiting for purge to stop using the table,
do not hold dict_sys.latch.
ha_innobase::delete_table(): Implement a work-around for the rollback
of ALTER TABLE...ADD PARTITION. MDL_EXCLUSIVE would not be held if
ALTER TABLE hits lock_wait_timeout while trying to upgrade the MDL
due to a conflicting LOCK TABLES, such as in the first ALTER TABLE
in the test case of Bug#53676 in parts.partition_special_innodb.
Therefore, we must explicitly stop purge, because it would not be
stopped by MDL.
dict_stats_func(), btr_defragment_chunk(): Allocate a THD so that
we can acquire MDL on the InnoDB persistent statistics tables.
mysqltest_embedded: Invoke ha_pre_shutdown() before free_used_memory()
in order to avoid ASAN heap-use-after-free related to acquire_thd().
trx_t::dict_operation_lock_mode: Changed the type to bool.
row_mysql_lock_data_dictionary(), row_mysql_unlock_data_dictionary():
Implemented as macros.
rollback_inplace_alter_table(): Apply an infinite timeout to lock waits.
innodb_thd_increment_pending_ops(): Wrapper for
thd_increment_pending_ops(). Never attempt async operation for
InnoDB background threads, such as the trx_t::commit() in
dict_stats_process_entry_from_recalc_pool().
lock_sys_t::cancel(trx_t*): Make dictionary transactions immune to KILL.
lock_wait(): Make dictionary transactions immune to KILL, and to
lock wait timeout when waiting for locks on dictionary tables.
parts.partition_special_innodb: Use lock_wait_timeout=0 to instantly
get ER_LOCK_WAIT_TIMEOUT.
main.mdl: Filter out MDL on InnoDB persistent statistics tables
Reviewed by: Thirunarayanan Balathandayuthapani
que_eval_sql(): Remove the parameter lock_dict. The only caller
with lock_dict=true was dict_stats_exec_sql(), which will now
explicitly invoke dict_sys.lock() and dict_sys.unlock() by itself.
row_import_cleanup(): Do not unnecessarily lock the dictionary.
Concurrent access to the table during ALTER TABLE...IMPORT TABLESPACE
is prevented by MDL and the fact that there cannot exist any
undo log or change buffer records that would refer to the table
or tablespace.
row_import_for_mysql(): Do not unnecessarily lock the dictionary
while accessing fil_system. Thanks to MDL_EXCLUSIVE that was acquired
by the SQL layer, only one IMPORT may be in effect for the table name.
row_quiesce_set_state(): Do not unnecessarily lock the dictionary.
The dict_table_t::quiesce state is documented to be protected by
all index latches, which we are acquiring.
dict_table_close(): Introduce a simpler variant with fewer parameters.
dict_table_close(): Reduce the amount of calls.
We can simply invoke dict_table_t::release() on startup or
in DDL operations, or when the table is inaccessible.
In none of these cases, there is no need to invalidate the
InnoDB persistent statistics.
pars_info_t::graph_owns_us: Remove (unused).
pars_info_free(): Define inline.
fts_delete(), trx_t::evict_table(), row_prebuilt_free(),
row_rename_table_for_mysql(): Simplify.
row_mysql_lock_data_dictionary(): Remove some references;
use dict_sys.lock() and dict_sys.unlock() instead.
row_mysql_lock_table(): Remove. Use lock_table_for_trx() instead.
ha_innobase::check_if_supported_inplace_alter(),
row_create_table_for_mysql(): Simply assert dict_sys.sys_tables_exist().
In commit 49e2c8f0a6 and
commit 1bd681c8b3 srv_start()
actually guarantees that the system tables will exist,
or the server is in read-only mode, or startup will fail.
Reviewed by: Thirunarayanan Balathandayuthapani
sym_tab_free_private(): Do not call dict_table_close(), but
simply invoke dict_table_t::release(), which we can do without
locking the whole dictionary cache. (Note: On user tables it
may still be necessary to invoke dict_table_close(), so that
InnoDB persistent statistics will be deinitialized as expected.)
fts_check_corrupt(), row_fts_merge_insert(): Invoke
aux_table->release() to simplify the code. This is never a user table.
fts_que_graph_free(), fts_que_graph_free_check_lock(): Replaced with
que_graph_free().
Reviewed by: Thirunarayanan Balathandayuthapani
In the parent commit, dict_sys.latch could theoretically have been
replaced with a mutex. But, we can do better and merge dict_sys.mutex
into dict_sys.latch. Generally, every occurrence of dict_sys.mutex_lock()
will be replaced with dict_sys.lock().
The PERFORMANCE_SCHEMA instrumentation for dict_sys_mutex
will be removed along with dict_sys.mutex. The dict_sys.latch
will remain instrumented as dict_operation_lock.
Some use of dict_sys.lock() will be replaced with dict_sys.freeze(),
which we will reintroduce for the new shared mode. Most notably,
concurrent table lookups are possible as long as the tables are present
in the dict_sys cache. In particular, this will allow more concurrency
among InnoDB purge workers.
Because dict_sys.mutex will no longer 'throttle' the threads that purge
InnoDB transaction history, a performance degradation may be observed
unless innodb_purge_threads=1.
The table cache eviction policy will become FIFO-like,
similar to what happened to fil_system.LRU
in commit 45ed9dd957.
The name of the list dict_sys.table_LRU will become somewhat misleading;
that list contains tables that may be evicted, even though the
eviction policy no longer is least-recently-used but first-in-first-out.
(Note: Tables can never be evicted as long as locks exist on them or
the tables are in use by some thread.)
As demonstrated by the test perfschema.sxlock_func, there
will be less contention on dict_sys.latch, because some previous
use of exclusive latches will be replaced with shared latches.
fts_parse_sql_no_dict_lock(): Replaced with pars_sql().
fts_get_table_name_prefix(): Merged to fts_optimize_create().
dict_stats_update_transient_for_index(): Deduplicated some code.
ha_innobase::info_low(), dict_stats_stop_bg(): Use a combination
of dict_sys.latch and table->stats_mutex_lock() to cover the
changes of BG_STAT_SHOULD_QUIT, because the flag is being read
in dict_stats_update_persistent() while not holding dict_sys.latch.
row_discard_tablespace_for_mysql(): Protect stats_bg_flag by
exclusive dict_sys.latch, like most other code does.
row_quiesce_table_has_fts_index(): Remove unnecessary mutex
acquisition. FLUSH TABLES...FOR EXPORT is protected by MDL.
row_import::set_root_by_heuristic(): Remove unnecessary mutex
acquisition. ALTER TABLE...IMPORT TABLESPACE is protected by MDL.
row_ins_sec_index_entry_low(): Replace a call
to dict_set_corrupted_index_cache_only(). Reads of index->type
were not really protected by dict_sys.mutex, and writes
(flagging an index corrupted) should be extremely rare.
dict_stats_process_entry_from_defrag_pool(): Only freeze the dictionary,
do not lock it exclusively.
dict_stats_wait_bg_to_stop_using_table(), DICT_BG_YIELD: Remove trx.
We can simply invoke dict_sys.unlock() and dict_sys.lock() directly.
dict_acquire_mdl_shared()<trylock=false>: Assert that dict_sys.latch is
only held in shared more, not exclusive mode. Only acquire it in
exclusive mode if the table needs to be loaded to the cache.
dict_sys_t::acquire(): Remove. Relocating elements in dict_sys.table_LRU
would require holding an exclusive latch, which we want to avoid
for performance reasons.
dict_sys_t::allow_eviction(): Add the table first to dict_sys.table_LRU,
to compensate for the removal of dict_sys_t::acquire(). This function
is only invoked by INFORMATION_SCHEMA.INNODB_SYS_TABLESTATS.
dict_table_open_on_id(), dict_table_open_on_name(): If dict_locked=false,
try to acquire dict_sys.latch in shared mode. Only acquire the latch in
exclusive mode if the table is not found in the cache.
Reviewed by: Thirunarayanan Balathandayuthapani
This will essentially make dict_sys.latch a mutex
(it is only acquired in exclusive mode).
The subsequent commit will merge dict_sys.mutex into dict_sys.latch
and reintroduce dict_sys.freeze() for those cases where we currently
acquire only dict_sys.latch but not dict_sys.mutex. The case where
both are acquired will be mapped to dict_sys.lock().
i_s_sys_tables_fill_table_stats(): Invoke dict_sys.prevent_eviction()
and the new function dict_sys.allow_eviction() to avoid table eviction
while a row in INFORMATION_SCHEMA.INNODB_SYS_TABLESTATS is being
produced.
Reviewed by: Thirunarayanan Balathandayuthapani
To avoid potential race conditions between concurrent access to
dict_table_t::freed_indexes, let us consistently use
dict_table_t::autoinc_mutex.
dict_table_remove_from_cache_low(): To avoid extensive hold time
of table->autoinc_mutex, unconditionally free the FTS data structures.
Problem:
=======
The last AHI page for two indexes of an dropped table is being
freed at the same time by two threads. One thread frees the
table heap and other thread tries to access table heap again.
It leads to asan failure in btr_search_lazy_free().
Solution:
========
InnoDB uses autoinc_mutex to avoid the race condition
in btr_search_lazy_free()
pars_info_bind_id(): Remove the parameter copy_name. It was always
being passed as constant TRUE or true. It turns out that copying
the string is completely unnecessary. In all calls except the one
in fts_get_select_columns_str() and fts_doc_fetch_by_doc_id(),
the parameter is being passed as a compile-time constant, and therefore
the pointer cannot become stale. In that special call, the string
that is being passed is allocated from the same memory heap that
pars_info_bind_id() would have been using.
pars_info_add_id(): Remove (unused declaration).
Before entering DML or DDL execution in the storage engine, the SQL layer
will have acquired metadata lock (MDL) on the current table name as well
as the names of FOREIGN KEY (grand)child tables (that is,
tables whose REFERENCES clauses point to the current table).
The MDL prevents any metadata changes to these tables, such as
RENAME, TRUNCATE, DROP, ALTER.
While the MDL on the current table prevents dict_table_t::foreign_set
from being modified, it does not prevent the table metadata that the
stored pointers are pointing to from being modified.
The MDL on the child tables will prevent both dict_table_t::referenced_set
as well as the pointed child table metadata from being modified.
wsrep_row_upd_index_is_foreign(): Do not unnecessarily acquire the
data dictionary latch if Galera replication is not enabled.
ha_innobase::can_switch_engines(): Rely on MDL. We are not dereferencing
any pointers stored in the sets.
row_mysql_freeze_data_dictionary(), row_mysql_unfreeze_data_dictionary():
Remove.
row_update_for_mysql(): Call init_fts_doc_id_for_ref() only once.
In ALTER TABLE...IMPORT TABLESPACE and FLUSH TABLES...FOR EXPORT
the SQL layer is protecting the current table with MDL. We do not
need InnoDB latches.
btr_scrub_start_space(): Avoid an unnecessary tablespace lookup
and related acquisition of fil_system->mutex. In MariaDB Server 10.3
we would get deadlocks between that mutex and a crypt_data mutex.
The fix was developed by Thirunarayanan Balathandayuthapani.
trx_t::will_lock: Changed the type to bool.
trx_t::is_autocommit_non_locking(): Replaces
trx_is_autocommit_non_locking().
trx_is_ac_nl_ro(): Remove (replaced with equivalent assertion expressions).
assert_trx_nonlocking_or_in_list(): Remove.
Replaced with at least as strict checks in each place.
check_trx_state(): Moved to a static function; partially replaced with
individual debug assertions implementing equivalent or stricter checks.
This is a backport of commit 7b51d11cca
from 10.5.
This patch changes it so that we do not free old BP `page_hash`, but rather modify it's parameters, during resize.
RB: 26084
Reviewed-by: Marcin Babij <marcin.babij@oracle.com>
Reviewed-by: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
mysql/mysql-server@ea3adc6a11
InnoDB tablespace identifiers and page numbers are 32-bit numbers.
Let us use a 32-bit type for them in innochecksum.
The changes in commit 1918bdf32c
broke the build on 32-bit Windows.
Thanks to Vicențiu Ciorbaru for an initial version of this fixup.
It is implementation-defined whether alignment requirements
that are larger than std::max_align_t (typically 8 or 16 bytes)
will be honored by the compiler and linker.
It turns out that on IBM AIX, both alignas() and MY_ALIGNED()
only guarantees alignment up to 16 bytes.
For some data structures, specifying alignment to the CPU
cache line size (typically 64 or 128 bytes) is a mere performance
optimization, and we do not really care whether the requested
alignment is guaranteed.
But, for the correct operation of direct I/O, we do require that
the buffers be aligned at a block size boundary.
field_ref_zero: Define as a pointer, not an array.
For innochecksum, we can make this point to unaligned memory;
for anything else, we will allocate an aligned buffer from the heap.
This buffer will be used for overwriting freed data pages when
innodb_immediate_scrub_data_uncompressed=ON. And exactly that code
hit an assertion failure on AIX, in the test innodb.innodb_scrub.
log_sys.checkpoint_buf: Define as a pointer to aligned memory
that is allocated from heap.
log_t::file::write_header_durable(): Reuse log_sys.checkpoint_buf
instead of trying to allocate an aligned buffer from the stack.
fil_ibd_create(): Remove code that should have been removed in
commit 86dc7b4d4c already.
We no longer wrote an initialized page to the file, but we would
still allocate a page image in memory and write it.
xb_space_create_file(): Remove an unnecessary page write.
(This is a functional change for Mariabackup.)
The replacement of buf_pool.page_hash with a different type of
hash table in commit 5155a300fa (MDEV-22871)
introduced a race condition with buffer pool resizing.
We have an execution trace where buf_pool.page_hash.array is changed
to point to something else while page_hash_latch::read_lock() is
executing. The same should also affect page_hash_latch::write_lock().
We fix the race condition by never resizing (and reallocating) the
buf_pool.page_hash. We assume that resizing the buffer pool is
a rare operation. Yes, there might be a performance regression if a
server is first started up with a tiny buffer pool, which is later
enlarged. In that case, the tiny buf_pool.page_hash.array could cause
increased use of the hash bucket lists. That problem can be worked
around by initially starting up the server with a larger buffer pool
and then shrinking that, until changing to a larger size again.
buf_pool_t::resize_hash(): Remove.
buf_pool_t::page_hash_table::lock(): Do not attempt to deal with
hash table resizing. If we really wanted that in a safe manner,
we would probably have to introduce a global rw-lock around the
operation, or at the very least, poll buf_pool.resizing, both of
which would be detrimental to performance.
With commit 1bd681c8b3 (MDEV-25506)
it no longer is necessary to run DDL and DML operations in
separate transactions. Let us remove the flag trx_t::internal.
Dictionary transactions will be distinguished by trx_t::dict_operation.
The practical maximum value of the parameter innodb_lock_wait_timeout
is 100,000,000. Any value larger than that specifies an infinite timeout.
Therefore, we should make 100,000,000 the maximum value of the parameter.
The MariaDB implementation of page_compressed tables for InnoDB used
sparse files. In the worst case, in the data file, every data page
will consist of some data followed by a hole. This may be extremely
inefficient in some file systems.
If the underlying storage device is thinly provisioned (can compress
data on the fly), it would be good to write regular files (with sequences
of NUL bytes at the end of each page_compressed block) and let the
storage device take care of compressing the data.
For reads, sparse file regions and regions containing NUL bytes will be
indistinguishable.
my_test_if_disable_punch_hole(): A new predicate for detecting thinly
provisioned storage. (Not implemented yet.)
innodb_atomic_writes: Correct the comment.
buf_flush_page(): Support all values of fil_node_t::punch_hole.
On a thinly provisioned storage device, we will always write
NUL-padded innodb_page_size bytes also for page_compressed tables.
buf_flush_freed_pages(): Remove a redundant condition.
fil_space_t::atomic_write_supported: Remove. (This was duplicating
fil_node_t::atomic_write.)
fil_space_t::punch_hole: Remove. (Duplicated fil_node_t::punch_hole.)
fil_node_t: Remove magic_n, and consolidate flags into bitfields.
For punch_hole we introduce a third value that indicates a
thinly provisioned storage device.
fil_node_t::find_metadata(): Detect all attributes of the file.
In commit 5f22511e35 we depend on
Total Store Ordering. For correct operation on ISAs that implement
weaker memory ordering, we must explicitly use release/acquire stores
and loads on buf_page_t::oldest_modification_ to prevent a race condition
when buf_page_t::list does not happen to be on the same cache line.
buf_page_t::clear_oldest_modification(): Assert that the block is
not in buf_pool.flush_list, and use std::memory_order_release.
buf_page_t::oldest_modification_acquire(): Read oldest_modification_
with std::memory_order_acquire. In this way, if the return value is 0,
the caller may safely assume that it will not observe the buf_page_t
as being in buf_pool.flush_list, even if it is not holding
buf_pool.flush_list_mutex.
buf_flush_relocate_on_flush_list(), buf_LRU_free_page():
Invoke buf_page_t::oldest_modification_acquire().
In commit 22b62edaed (MDEV-25113)
we introduced a race condition. buf_LRU_free_page() would read
buf_page_t::oldest_modification() as 0 and assume that
buf_page_t::list can be used (for attaching the block to the
buf_pool.free list). In the observed race condition,
buf_pool_t::delete_from_flush_list() had cleared the field,
and buf_pool_t::delete_from_flush_list_low() was executing
concurrently with buf_LRU_block_free_non_file_page(),
which resulted in buf_pool.flush_list.end becoming corrupted.
buf_pool_t::delete_from_flush_list(), buf_flush_relocate_on_flush_list():
First remove the block from buf_pool.flush_list, and only then
invoke buf_page_t::clear_oldest_modification(), to ensure that
reading oldest_modification()==0 really implies that the block
no longer is in buf_pool.flush_list.
buf_LRU_get_free_block(): Initially wait for a single block to be
freed, signaled by buf_pool.done_free. Only if that fails and no
LRU eviction flushing batch is already running, we initiate a
flushing batch that should serve all threads that are currently
waiting in buf_LRU_get_free_block().
Note: In an extreme case, this may introduce a performance regression
at larger numbers of connections. We observed this in sysbench
oltp_update_index with 512MiB buffer pool, 4GiB of data on fast NVMe,
and 1000 concurrent connections, on a 20-thread CPU. The contention point
appears to be buf_pool.mutex, and the improvement would turn into a
regression somewhere beyond 32 concurrent connections.
On slower storage, such regression was not observed; instead, the
throughput was improving and maximum latency was reduced.
The excessive waits were pointed out by Vladislav Vaintroub.