increased scalability through even distribution
Rollback segments are allocated to transactions in round-robin fashion.
This is controlled by incrementing a static-scope counter named rseg_slot.
This logic is not protected by any mutex, nor is the counter atomic.
As a result, the same rollback segment can potentially get allocated to
N different transactions that request allocation at the same time.
This is not a correctness issue, because a rollback segment can host
multiple transactions, but from a contention (performance) perspective
it is better to allocate rollback segments in round-robin fashion.
The fix ports the use of an atomic for the counter, which ensures that
the original design semantic (even distribution through round-robin)
is retained.
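A minimal sketch of the idea, assuming a plain std::atomic counter. Only
the name rseg_slot comes from the text above; the function name and the
n_rsegs parameter are hypothetical and do not reflect the actual InnoDB
code.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Shared slot counter; std::atomic makes each increment indivisible.
static std::atomic<std::size_t> rseg_slot{0};

// Return the index of the next rollback segment in round-robin order.
// fetch_add() hands every caller a distinct counter value, so concurrent
// callers are spread over consecutive slots instead of colliding.
std::size_t next_rseg_slot(std::size_t n_rsegs)
{
  assert(n_rsegs > 0);
  return rseg_slot.fetch_add(1, std::memory_order_relaxed) % n_rsegs;
}
```

Relaxed ordering is enough in this sketch because the counter only picks
a slot and does not publish any other data.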
This is one more follow-up fix to MDEV-22641.
Explicitly specify the dependency of the innobase library on mysys.
Also, remove stale references to CRC32_LIBRARY, which should have
been removed in commit dec3f8ca69.
MCOL-3875 Columnstore write cache
The main change is to change thr_lock function get_status to
return a value that indicates we have to abort the lock.
Other things:
- Made start_bulk_insert() and end_bulk_insert() protected so that the
insert cache can use these
MDEV-22689 MSAN use-of-uninitialized-value in decode_bytes()
This was not a user-visible issue, as the Huffman code lookup tables would
automatically ignore any of the uninitialized bits.
Fixed by adding an end-zero byte to the bit-stream buffer.
Other things:
- Fixed a (for this case) wrong assert in strmov() for myisamchk
and aria_chk by removing the strmov()
The code used is largely based on code from Tencent.
The problem is that in some rare cases there may be a conflict between .frm
files and the files in the storage engine. In this case the DROP TABLE
was not able to properly drop the table.
Some MariaDB/MySQL forks have solved this by adding a FORCE option to
DROP TABLE. After some discussion among MariaDB developers, we concluded
that users expect that DROP TABLE should always work, even if the
table would not be consistent. There should not be a need to use a
separate keyword to ensure that the table is really deleted.
The solution used is:
- If a .frm table doesn't exist, try dropping the table from all storage
engines.
- If the .frm table exists but the table does not exist in the engine,
try dropping the table from all storage engines.
- Update storage engines using many table files (CSV, MyISAM, Aria) to
succeed with the drop even if some of the files are missing.
- Add HTON_AUTOMATIC_DELETE_TABLE to handlertons where delete_table()
is not needed and always succeeds. This is used by ha_delete_table_force()
to know which handlers to ignore when trying to drop a table without
a .frm file.
The disadvantage of this solution is that a DROP TABLE on a nonexistent
table will be a bit slower, as we have to ask all active storage engines
if they know anything about the table.
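The forced drop described above can be sketched roughly as follows. This
is a simplified model, not the real handlerton interface: the Engine
struct, its fields and drop_table_force() are hypothetical names, and
visiting every engine without stopping early is an assumption of this
sketch.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified model of a storage engine.
struct Engine
{
  std::string name;
  bool automatic_delete;  // models HTON_AUTOMATIC_DELETE_TABLE
  // Returns true if the engine knew the table and dropped it.
  std::function<bool(const std::string&)> drop_table;
};

// Ask every engine that needs an explicit delete to drop the table.
// Returns true if at least one engine dropped something.
bool drop_table_force(std::vector<Engine>& engines, const std::string& table)
{
  bool dropped = false;
  for (Engine& e : engines)
  {
    if (e.automatic_delete)
      continue;  // nothing to clean up in this engine
    if (e.drop_table(table))
      dropped = true;
  }
  return dropped;
}
```

The loop also shows where the extra cost for a nonexistent table comes
from: every remaining engine has to be asked before the drop can report
that nothing was found.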
Other things:
- Added a new flag MY_IGNORE_ENOENT to my_delete() to not give an error
if the file doesn't exist. This simplifies some of the code.
- Don't clear thd->error in ha_delete_table() if there was an active
error. This is a bug fix.
- handler::delete_table() will not abort if the first file doesn't exist.
This is a bug fix to handle the case when a DROP TABLE was aborted in
the middle.
- Cleaned up mysql_rm_table_no_locks() to ensure that if_exists uses
the same code path as when it's not used.
- Use non_existing_Table_error() to detect if the table didn't exist.
Old code used different error tests in different positions.
- Table_triggers_list::drop_all_triggers() now drops the trigger file if
it can't be parsed instead of leaving it hanging around (bug fix)
- InnoDB no longer prints an error about the .frm file being out of sync
with the InnoDB directory if the .frm file does not exist. This change was
required to be able to try to drop an InnoDB file when .frm doesn't exist.
- Fixed bug in mi_delete_table() where the .MYD file would not be dropped
if the .MYI file didn't exist.
- Fixed memory leak in Mroonga when deleting a nonexistent table
- Fixed memory leak in Connect when deleting a nonexistent table
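The effect of the MY_IGNORE_ENOENT flag from the list above can be
illustrated with a small POSIX sketch. This is not the my_delete()
implementation: delete_file() and its boolean parameter are hypothetical
stand-ins for the flag.

```cpp
#include <cassert>
#include <cerrno>
#include <unistd.h>

// Delete a file. With ignore_enoent set, a missing file counts as success,
// so callers do not need a separate existence check.
// Returns 0 on success and -1 on error (errno is left as set by unlink()).
int delete_file(const char* path, bool ignore_enoent)
{
  if (unlink(path) == 0)
    return 0;
  if (ignore_enoent && errno == ENOENT)
    return 0;
  return -1;
}
```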
Fixed bugs that were introduced by the original version of this commit:
MDEV-22826 Presence of Spider prevents tables from being force-deleted from
other engines
An InnoDB check for the validity of index pages would occasionally fail
in the test encryption.innodb_encryption_discard_import.
An analysis of a "rr replay" failure trace revealed that the problem
basically is a combination of two old anomalies, and a recently
implemented optimization in MariaDB 10.5.
MDEV-15528 allows InnoDB to discard buffer pool pages that were freed.
PageBulk::init() will disable the InnoDB validity check, because
during native ALTER TABLE (rebuilding tables or creating indexes)
we could write inconsistent index pages to data files.
In the occasional test failure, page 8:6 would have been written
from the buffer pool to the data file and subsequently freed.
However, fil_crypt_thread may perform dummy writes to pages that
have been freed. In case we are causing an inconsistent page to
be re-encrypted on page flush, we should disable the check.
In the analyzed "rr replay" trace, a fil_crypt_thread attempted
to access page 8:6 twice after it had been freed.
On the first call, buf_page_get_gen(..., BUF_PEEK_IF_IN_POOL, ...)
returned NULL. The second call succeeded, and shortly thereafter,
the server intentionally crashed due to writing the corrupted page.
This is a race condition where a table on which a 10.3-style
instant ADD COLUMN is emptied during the execution of
ALTER TABLE ... DROP COLUMN ..., DROP INDEX ..., ALGORITHM=NOCOPY.
In commit 2c4844c9e7 the
function instant_metadata_lock() would prevent this race condition.
But, it would also hold a page latch on the leftmost leaf page of
clustered index for the duration of a possible DROP INDEX operation.
The race could be fixed by restoring the function
instant_metadata_lock() that was removed in
commit ea37b14409
but it would be more future-proof to prevent the
dict_index_t::clear_instant_add() call from being issued at all.
We may at some point support DROP COLUMN ..., ADD INDEX ..., ALGORITHM=NOCOPY
and that would spend a non-trivial amount of
execution time in ha_innobase::inplace_alter(),
making a server hang possible. Currently this is not supported
and our added test case will notice when the support is introduced.
dict_index_t::must_avoid_clear_instant_add(): Determine if
a call to clear_instant_add() must be avoided.
btr_discard_only_page_on_level(): Preserve the metadata record
if must_avoid_clear_instant_add() holds.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Do not remove the metadata record even if the table becomes empty
but must_avoid_clear_instant_add() holds.
btr_pcur_store_position(): Relax a debug assertion.
This is joint work with Thirunarayanan Balathandayuthapani.
Apply this patch from Percona Server (amended for 10.5):
commit cd7201514fee78aaf7d3eb2b28d2573c76f53b84
Author: Laurynas Biveinis <laurynas.biveinis@gmail.com>
Date: Tue Nov 14 06:34:19 2017 +0200
Fix bug 1704195 / 87065 / TDB-83 (Stop ANALYZE TABLE from flushing table definition cache)
Make ANALYZE TABLE stop flushing affected tables from the table
definition cache, which has the effect of not blocking any subsequent
new queries involving the table if there's a parallel long-running
query:
- new table flag HA_ONLINE_ANALYZE, return it for InnoDB and TokuDB
tables;
- in mysql_admin_table, if we are performing ANALYZE TABLE, and the
table flag is set, do not remove the table from the table
definition cache, do not invalidate query cache;
- in partitioning handler, refresh the query optimizer statistics
after ANALYZE if the underlying handler supports HA_ONLINE_ANALYZE;
- new testcases main.percona_nonflushing_analyze_debug,
parts.percona_nonflushing_analyze_debug and a supporting debug sync
point.
For TokuDB, this change exposes bug TDB-83 (Index cardinality stats
updated for handler::info(HA_STATUS_CONST), not often enough for
tokudb_cardinality_scale_percent). TokuDB may return different
rec_per_key values depending on dynamic variable
tokudb_cardinality_scale_percent value. The server does not have a way
of knowing that changing this variable invalidates the previous
rec_per_key values in any opened table shares, and so does not call
info(HA_STATUS_CONST) again. Fix by updating rec_per_key for both
HA_STATUS_CONST and HA_STATUS_VARIABLE. This also forces a re-record
of tokudb.bugs.db756_card_part_hash_1_pick, with the new output
seeming to be more correct.
MDEV-15053 did not remove all unnecessary buf_pool.page_hash S-latch
acquisition. There are code paths where we are holding buf_pool.mutex
(which will sufficiently protect buf_pool.page_hash against changes)
and unnecessarily acquire the latch. Many invocations of
buf_page_hash_get_locked() can be replaced with the much simpler
buf_pool.page_hash_get_low().
In the worst case the thread that is holding buf_pool.mutex will become
a victim of MDEV-22871, suffering from a spurious reader-reader conflict
with another thread that genuinely needs to acquire a buf_pool.page_hash
S-latch.
In many places, we were also evaluating page_id_t::fold() while holding
buf_pool.mutex. Low-level functions such as buf_pool.page_hash_get_low()
must get the page_id_t::fold() as a parameter.
buf_buddy_relocate(): Defer the hash_lock acquisition to the critical
section that starts by calling buf_page_t::can_relocate().
fil_space_t::freed_ranges: Store ranges of freed page numbers.
fil_space_t::last_freed_lsn: Store the most recent LSN of
freeing a page.
fil_space_t::freed_mutex: Protects freed_ranges, last_freed_lsn.
fil_space_create(): Initialize the freed_range mutex.
fil_space_free_low(): Frees the freed_range mutex.
range_set: Ranges of page numbers.
buf_page_create(): Removes the page from freed_ranges when page
is being reused.
btr_free_root(): Remove the PAGE_INDEX_ID invalidation. Because
btr_free_root() and dict_drop_index_tree() are executed in
the same atomic mini-transaction, there is no need to
invalidate the root page.
buf_release_freed_page(): Split from buf_flush_freed_page().
Skip any I/O.
buf_flush_freed_pages(): Get the freed ranges from the tablespace and
write punch-hole or zeroes for the freed ranges.
buf_flush_try_neighbors(): Handles the flushing of freed ranges.
mtr_t::freed_pages: Variable to store the list of freed pages.
mtr_t::add_freed_pages(): To add freed pages.
mtr_t::clear_freed_pages(): To clear the freed pages.
mtr_t::m_freed_in_system_tablespace: Variable to indicate whether page has
been freed in system tablespace.
mtr_t::m_trim_pages: Variable to indicate whether the space has been trimmed.
mtr_t::commit(): Add the freed page and update the last freed lsn
in the tablespace and clear the tablespace freed range if space is
trimmed.
file_name_t::freed_pages: Store the freed pages during recovery.
file_name_t::add_freed_page(), file_name_t::remove_freed_page(): To
add and remove freed page during recovery.
store_freed_or_init_rec(): Store or remove the freed pages while
encountering FREE_PAGE or INIT_PAGE redo log record.
recv_init_crash_recovery_spaces(): Add the freed page encountered
during recovery to respective tablespace.
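To illustrate what a set of page number ranges such as range_set might
look like, here is a minimal sketch built on std::map. The class name
page_range_set and its member functions are hypothetical and are not
taken from the patch; only the idea of storing freed pages as ranges
comes from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

// A set of page numbers stored as disjoint inclusive ranges.
// Key = first page of a range, value = last page of that range.
class page_range_set
{
  std::map<uint32_t, uint32_t> ranges;

public:
  bool contains(uint32_t page) const
  {
    auto it = ranges.upper_bound(page);  // first range starting after page
    if (it == ranges.begin())
      return false;
    --it;
    return page <= it->second;
  }

  // Add one page, merging with adjacent ranges where possible.
  void add(uint32_t page)
  {
    if (contains(page))
      return;
    uint32_t first = page, last = page;
    auto next = ranges.upper_bound(page);
    if (next != ranges.begin())
    {
      auto prev = std::prev(next);
      if (prev->second + 1 == page)  // extends the range on the left
      {
        first = prev->first;
        ranges.erase(prev);
      }
    }
    if (next != ranges.end() && next->first == page + 1)  // joins the right
    {
      last = next->second;
      ranges.erase(next);
    }
    ranges[first] = last;
  }

  // Remove one page (for example when it is reused), splitting its range.
  void remove(uint32_t page)
  {
    auto it = ranges.upper_bound(page);
    if (it == ranges.begin())
      return;
    --it;
    if (page > it->second)
      return;  // the page was not in the set
    const uint32_t first = it->first, last = it->second;
    ranges.erase(it);
    if (first < page)
      ranges[first] = page - 1;
    if (page < last)
      ranges[page + 1] = last;
  }

  std::size_t n_ranges() const { return ranges.size(); }
};
```

Storing ranges rather than individual pages keeps the structure small
when many consecutive pages are freed, and lets a flush handle a whole
range at once.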
For INET6 columns the values are stored as BINARY columns and returned to the client in TEXT format.
For rocksdb the indexes store mem-comparable images for columns, so use the pack_length() to store
the mem-comparable form for INET6 columns. This would also remain consistent with CHAR columns.
For reads, the buf_pool.page_hash is protected by buf_pool.mutex or
by the hash_lock. There is no need to compute or acquire hash_lock
if we are not modifying the buf_pool.page_hash.
However, the buf_pool.page_hash latch must be held exclusively
when changing buf_page_t::in_file(), or if we desire to prevent
buf_page_t::can_relocate() or buf_page_t::buf_fix_count()
from changing.
rw_lock_lock_word_decr(): Add a comment that explains the polling logic.
buf_page_t::set_state(): When in_file() is to be changed, assert that
an exclusive buf_pool.page_hash latch is being held. Unfortunately
we cannot assert this for set_state(BUF_BLOCK_REMOVE_HASH) because
set_corrupt_id() may already have been called.
buf_LRU_free_page(): Check buf_page_t::can_relocate() before
acquiring the hash_lock.
buf_block_t::initialise(): Initialize also page.buf_fix_count().
buf_page_create(): Initialize buf_fix_count while not holding
any mutex or hash_lock. Acquire the hash_lock only for the
duration of inserting the block to the buf_pool.page_hash.
buf_LRU_old_init(), buf_LRU_add_block(),
buf_page_t::belongs_to_unzip_LRU(): Do not assert buf_page_t::in_file(),
because buf_page_create() will invoke buf_LRU_add_block()
before acquiring hash_lock and buf_page_t::set_state().
buf_pool_t::validate(): Rely on the buf_pool.mutex and do not
unnecessarily acquire any buf_pool.page_hash latches.
buf_page_init_for_read(): Clarify that we must acquire the hash_lock
upfront in order to prevent a race with buf_pool_t::watch_remove().
ut_filename_hash(): Add better casts to please the compiler:
warning C4307: '*': integral constant overflow
This regression was introduced in
commit dd77f072f9 (MDEV-22841).
MONITOR_SRV_MEM_VALIDATE_MICROSECOND, MEM_PERIODIC_CHECK,
SRV_MASTER_MEM_VALIDATE_INTERVAL: Remove. These were unused
ever since UNIV_MEM_DEBUG was removed.
MONITOR_SRV_PURGE_MICROSECOND: Remove. This was always unused.
Problematic mutex is dict_sys.mutex.
Idea of the patch: unlink() fd under that mutex while
it's still open. This way unlink() will be fast and
actual file removal will happen on close().
And close() will be called outside of dict_sys.mutex.
This should be safe against crash which may happen between
unlink() and close(): file will be removed by OS anyway.
The same applies to both *nix and Windows.
I created and removed a 4G file on some NVMe SSD on ext4:
write(3, "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1"..., 1048576) = 1048576 <0.000519>
fdatasync(3) = 0 <3.533763>
close(3) = 0 <0.000011>
unlink("file") = 0 <0.411563>
write(3, "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1"..., 1048576) = 1048576 <0.000520>
fdatasync(3) = 0 <3.544938>
unlink("file") = 0 <0.000029>
close(3) = 0 <0.407057>
Such systems can benefit from this patch.
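The idea can be sketched for POSIX systems as follows. The mutex and
both function names are hypothetical stand-ins (dict_mutex only plays
the role of dict_sys.mutex), and the Windows side is not covered by
this sketch.

```cpp
#include <cassert>
#include <fcntl.h>
#include <mutex>
#include <unistd.h>

// Hypothetical stand-in for a contended global mutex.
static std::mutex dict_mutex;

// Remove the file name while holding the mutex, but keep the descriptor
// open, so that the expensive space reclamation is deferred to close().
// Returns the still-open descriptor, or -1 if unlink() failed.
int unlink_keep_open(const char* path, int fd)
{
  assert(fd >= 0);
  std::lock_guard<std::mutex> guard(dict_mutex);
  return unlink(path) == 0 ? fd : -1;
}

// Called after the mutex has been released: this is where the file system
// actually frees the blocks of the unlinked file.
void close_detached(int fd)
{
  if (fd >= 0)
    close(fd);
}
```

This matches the second trace above, where unlink() returns almost
immediately and the cost moves to close().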
fil_node_t::detach(): closes fil_node_t but not file handle,
returns that file handle
fil_node_t::prepare_to_close_or_detach(): 'closes' fil_node_t
fil_node_t::close_to_free(): new argument detach_handle
fil_system_t::detach(): now can detach file handles
fil_delete_tablespace(): now can detach file handles
row_drop_table_for_mysql(): performs actual file removal
To install Spider one can simply drop a /etc/mysql/conf.d/spider.cnf like
[mariadb]
plugin-load-add=ha_spider.so
This is automatically generated and installed when plugin is correctly
registered to plugin.cmake with its own component name. Many other plugins
such as Connect and RocksDB install in the same way.
This solved MDEV-19917 as the mere adding and removing of spider.cnf
automatically installs and uninstalls it.
Remove the overly complex and unnecessary install.sql from Spider;
it should not be needed in modern times anymore.
With this change there is no need for an uninstall.sql either.
Change how the lookup for the "auto" PSI_memory_keys is done.
Look up filename hashes (integers) instead of C strings.
Generate these hashes at compile time with constexpr,
rather than at runtime.
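A minimal sketch of compile-time filename hashing, written in C++11
style. FNV-1a is used only as an example hash function, since the text
above does not say which hash the patch uses; the table, the key values
and the file names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// FNV-1a over a NUL-terminated string, evaluated recursively so that it
// is a valid C++11 constexpr function.
constexpr uint32_t filename_hash(const char* s, uint32_t h = 2166136261u)
{
  return *s
    ? filename_hash(s + 1, (h ^ static_cast<unsigned char>(*s)) * 16777619u)
    : h;
}

// Hypothetical table mapping file-name hashes to instrumentation keys.
struct auto_key { uint32_t hash; int key; };

constexpr auto_key auto_keys[] = {
  {filename_hash("btr0cur.cc"), 1},
  {filename_hash("buf0buf.cc"), 2},
};

// Look up a key by hash: integer comparisons only, no string comparison.
// Returns -1 if the hash is not in the table.
constexpr int find_key(uint32_t hash, std::size_t i = 0)
{
  return i == sizeof auto_keys / sizeof *auto_keys ? -1
    : auto_keys[i].hash == hash ? auto_keys[i].key
    : find_key(hash, i + 1);
}

static_assert(find_key(filename_hash("buf0buf.cc")) == 2,
              "hash and lookup are evaluated at compile time");
```

The static_assert shows that both the hashing and the lookup can be
resolved by the compiler, so no string hashing is left for runtime when
the file name is a literal.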
Let us invoke the debug member functions of mtr_t directly.
mtr_t::memo_contains(): Change the parameter type to
const rw_lock_t&. This function cannot be invoked on
buf_block_t::lock.
The function mtr_t::memo_contains_flagged() is intended to be invoked
on buf_block_t* or rw_lock_t*, and it along with
mtr_t::memo_contains_page_flagged() are the way to check whether
a buffer pool page has been latched within a mini-transaction.
xdes_get_state(), fseg_get_nth_frag_page_no(),
fseg_find_free_frag_page_slot(), fseg_find_last_used_frag_page_slot(),
fseg_get_n_frag_pages(), fseg_n_reserved_pages_low(),
fseg_print_low(): Remove the unused parameter mtr, and add
a const qualifier to the pointer to the buffer pool page frame.
srv_n_page_hash_locks: Increase from 16 to 64. Before MDEV-15058,
we used to have the buf_pool.page_hash partitioned to each instance.
rw_lock_lock_word_decr(): Sleep a little in the spinloop.
rw_lock_s_lock_low(): Correct a comment. The function does perform
spinning.
This improves scalability in read-only workloads on a 32-CPU system
when the number of concurrent connections exceeds the CPU core count.
Thanks to Axel Schwenke for running benchmarks.
reduce the amount of engine-specific code in the server,
particularly as it does not serve any purpose now.
may be needed for VP engine,
to be reconsidered in MDEV-7795
buf_LRU_make_block_young(): Merge with buf_page_make_young().
buf_pool_check_no_pending_io(): Remove. Replaced with
buf_pool.any_io_pending() and buf_pool.io_pending(),
which do not unnecessarily acquire buf_pool.mutex.
buf_pool_t::init_flush[]: Use atomic access, so that
buf_flush_wait_LRU_batch_end() can avoid acquiring buf_pool.mutex.
buf_pool_t::try_LRU_scan: Declare as bool.
When MDEV-22769 introduced srv_shutdown_state=SRV_SHUTDOWN_INITIATED in
commit efc70da5fd
we forgot to adjust a few checks for SRV_SHUTDOWN_NONE.
In the initial shutdown step, we are waiting for the background
DROP TABLE queue to be processed or discarded. At that time,
some background tasks (such as buffer pool resizing or dumping
or encryption key rotation) may be terminated, but others must
remain running normally.
srv_purge_coordinator_suspend(), srv_purge_coordinator_thread(),
srv_start_wait_for_purge_to_start(): Treat SRV_SHUTDOWN_NONE
and SRV_SHUTDOWN_INITIATED equally.
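The kind of check being adjusted can be sketched as follows. The enum
here is a simplified copy in which the states after
SRV_SHUTDOWN_INITIATED and their order are assumptions, and the helper
name is hypothetical.

```cpp
#include <cassert>

// Simplified shutdown states; only the first two names are taken from
// the text above, the rest are assumed for this sketch.
enum srv_shutdown_t
{
  SRV_SHUTDOWN_NONE = 0,
  SRV_SHUTDOWN_INITIATED,
  SRV_SHUTDOWN_CLEANUP,
  SRV_SHUTDOWN_EXIT_THREADS
};

// Hypothetical helper: tasks that must keep running normally during the
// initial shutdown step treat INITIATED the same as NONE, instead of
// testing for state == SRV_SHUTDOWN_NONE only.
inline bool runs_normally(srv_shutdown_t state)
{
  return state <= SRV_SHUTDOWN_INITIATED;
}
```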