In commit 412533b4a7 (MDEV-18582),
one of the counters that was ported from XtraDB is useless.
Innodb_buffered_aio_submitted would be 0 or 1, depending on whether
is_linux_native_aio_supported() was executed to the point where
it would be incremented.
Let us remove this counter, because it has no practical value.
Even if its value were 1, io_setup() can still fail and we may
end up with innodb_use_native_aio=0.
std version has an advantage of a more convenient units implementation from
std::chrono. Now it's no need to multipy/divide to bring anything to
micro seconds.
Tests with 4096-byte sector size confirm that it is
safe to use O_DIRECT with page_compressed tables.
That had been disabled on Linux, in an attempt to fix MDEV-21584
which had been filed for the O_DIRECT problems earlier.
The fil_node_t::block_size was being set mostly correctly until
commit 10dd290b4b (MDEV-17380)
introduced a regression in MariaDB Server 10.4.4.
fil_node_open_file(): Only avoid setting O_DIRECT on
ROW_FORMAT=COMPRESSED tables that use KEY_BLOCK_SIZE=1 or 2
(1024 or 2048 bytes).
fil_ibd_create(): Avoid setting O_DIRECT on ROW_FORMAT=COMPRESSED tables
that use KEY_BLOCK_SIZE=1 or 2 (1024 or 2048 bytes).
fil_node_t::find_metadata(): Require fstat() to be always invoked
outside Microsoft Windows, so that fil_node_t::block_size can be set.
fil_node_t::read_page0(): Rely on find_metadata() to assign block_size.
Thanks to Vladislav Vaintroub for testing this on Microsoft Windows
using an old-fashioned rotational hard disk with 4KiB sector size.
Reviewed by: Vladislav Vaintroub
This is a port of commit 00f620b27e
and commit 6505662c23 from 10.2.
This had been originally added in
mysql/mysql-server@192bb153b6
with the motivation to disable O_DIRECT for the dedicated tablespace
for temporary tables. In MariaDB Server,
commit 5eb539555b (MDEV-12227)
should be a better solution.
The code became orphaned later in
mysql/mysql-server@c61244c0e6
and it had been applied to MariaDB Server 10.2.2 in
commit 2e814d4702 and
commit fec844aca8.
Thanks to Vladislav Vaintroub for spotting this.
If the user "opts in" (as in the parent
commit 92b2a911e5),
we can optimize multiple INSERT statements to use table-level locking
and undo logging.
There will be a change of behavior:
CREATE TABLE t(a PRIMARY KEY) ENGINE=InnoDB;
SET foreign_key_checks=0, unique_checks=0;
BEGIN; INSERT INTO t SET a=1; INSERT INTO t SET a=1; COMMIT;
will end up with an empty table, because in case of an error,
the entire transaction will be rolled back, instead of rolling
back the failing statement. Previously, the second INSERT statement
would have been logged row by row, and only that second statement
would have been rolled back, leaving the first INSERT intact.
lock_table_x_unlock(), trx_mod_table_time_t::WAS_BULK: Remove.
Because we cannot really support statement rollback in this
optimized mode, we will not optimize the locking. The exclusive
table lock will be held until the end of the transaction.
fil_op_replay_rename(): Remove.
fil_rename_tablespace_check(): Remove a parameter is_discarded=false.
recv_sys_t::parse(): Instead of applying FILE_RENAME operations,
buffer the operations in renamed_spaces.
recv_sys_t::apply(): In the last_batch, apply renamed_spaces.
Online log for insert operation of redundant table fails with
index->is_instant() assert. Purge can reset the n_core_fields when
alter is waiting to upgrade MDL for commit phase of DDL. In the
meantime, any insert DML tries to log the operation fails with
index is not being instant.
row_log_get_n_core_fields(): Get the n_core_fields of online log
for the given index.
rec_get_converted_size_comp_prefix_low(): Use n_core_fields of online
log when InnoDB calculates the size of data tuple during redundant
row format table rebuild.
rec_convert_dtuple_to_rec_comp(): Use n_core_fields of online log
when InnoDB does the conversion of data tuple to record during
redudant row format table rebuild.
- Adding the test case which has more than 129 instant columns.
n_page_gets is a global counter that is updated on each page access.
This also means it is updated pretty often and with a multi-core machine
it easily boils up to be the hottest counter as also reported by perf.
Using existing distributed counter framework help ease the contention
and improve the performance.
Patch also tend to increase the slot of the distributed counter from original
64 to 128 given that is new normal for next-generation machines.
The original idea and patch came from Daniel Black which is now ported
to 10.6 with some improvement and adjustment.
In commit 118e258aaa (part of MDEV-23855)
we inadvertently broke crash recovery, reintroducing MDEV-11556.
fil_system_t::extend_to_recv_size(): Extend all open tablespace files
to the recovered size.
recv_sys_t::apply(): Invoke fil_system.extend_to_recv_size() at the
start of each batch. In this way, any fil_space_t::recv_size
changes that were parsed after the file was opened will be applied.
page_apply_insert_redundant(): Replace a too strict condition
hdr_c > pextra_size. It turns out that page_cur_insert_rec_low()
is not even computing the extra_size of cur->rec when it is trying
to reuse header bytes of the preceding record.
Historically, InnoDB supported a buggy page checksum algorithm that did not
compute a checksum over the full page. Later, well before MySQL 4.1
introduced .ibd files and the innodb_file_per_table option, the algorithm
was corrected and the first 4 bytes of each page were redefined to be
a checksum.
The original checksum was so slow that an option to disable page checksum
was introduced for benchmarketing purposes.
The Intel Nehalem microarchitecture introduced the SSE4.2 instruction set
extension, which includes instructions for faster computation of CRC-32C.
In MySQL 5.6 (and MariaDB 10.0), innodb_checksum_algorithm=crc32 was
implemented to make of that. As that option was changed to be the default
in MySQL 5.7, a bug was found on big-endian platforms and some work-around
code was added to weaken that checksum further. MariaDB disables that
work-around by default since MDEV-17958.
Later, SIMD-accelerated CRC-32C has been implemented in MariaDB for POWER
and ARM and also for IA-32/AMD64, making use of carry-less multiplication
where available.
Long story short, innodb_checksum_algorithm=crc32 is faster and more secure
than the pre-MySQL 5.6 checksum, called innodb_checksum_algorithm=innodb.
It should have removed any need to use innodb_checksum_algorithm=none.
The setting innodb_checksum_algorithm=crc32 is the default in
MySQL 5.7 and MariaDB Server 10.2, 10.3, 10.4. In MariaDB 10.5,
MDEV-19534 made innodb_checksum_algorithm=full_crc32 the default.
It is even faster and more secure.
The default settings in MariaDB do allow old data files to be read,
no matter if a worse checksum algorithm had been used.
(Unfortunately, before innodb_checksum_algorithm=full_crc32,
the data files did not identify which checksum algorithm is being used.)
The non-default settings innodb_checksum_algorithm=strict_crc32 or
innodb_checksum_algorithm=strict_full_crc32 would only allow CRC-32C
checksums. The incompatibility with old data files is why they are
not the default.
The newest server not to support innodb_checksum_algorithm=crc32
were MySQL 5.5 and MariaDB 5.5. Both have reached their end of life.
A valid reason for using innodb_checksum_algorithm=innodb could have
been the ability to downgrade. If it is really needed, data files
can be converted with an older version of the innochecksum utility.
Because there is no good reason to allow data files to be written
with insecure checksums, we will reject those option values:
innodb_checksum_algorithm=none
innodb_checksum_algorithm=innodb
innodb_checksum_algorithm=strict_none
innodb_checksum_algorithm=strict_innodb
Furthermore, the following innochecksum options will be removed,
because only strict crc32 will be supported:
innochecksum --strict-check=crc32
innochecksum -C crc32
innochecksum --write=crc32
innochecksum -w crc32
If a user wishes to convert a data file to use a different checksum
(so that it might be used with the no-longer-supported
MySQL 5.5 or MariaDB 5.5, which do not support IMPORT TABLESPACE
nor system tablespace format changes that were made in MariaDB 10.3),
then the innochecksum tool from MariaDB 10.2, 10.3, 10.4, 10.5 or
MySQL 5.7 can be used.
Reviewed by: Thirunarayanan Balathandayuthapani
- Currently page cleaner thread will stop flushing if
dirty_pct < innodb_max_dirty_pages_pct_lwm.
- If the server is not performing any activity then said resources/time
could be used to flush the pending dirty pages and keep buffer pool
clean for the next burst of the cycle. This flushing is called idle flushing.
- flushing logic underwent a complete revamp in 10.5.7/8
and as part of the revamp idle flushing logic got removed.
- New proposed logic of idle flushing is based on updated logic of the
page cleaner that will enable idle flushing if
- buf page cleaner is idle
- there are dirty pages (< innodb_max_dirty_pages_pct_lwm)
- server is not performing any activity
Logic will kickstart the idle flushing bounded by innodb_io_capacity.
(Thanks to Marko Makela for reviewing the patch and idea
right from the its inception).
The maximum number of concurrently waiting transactions is one less
than the maximum number of concurrent transactions.
A 45-bit cumulative counter of lock waits will support more than
one million lock waits per second for a year.
Let us add the status variable innodb_buffer_pool_pages_LRU_freed
to monitor the number of pages that were freed by a buffer pool LRU
eviction scan, without flushing.
Also, let us simplify the monitor interface:
MONITOR_LRU_BATCH_FLUSH_COUNT, MONITOR_LRU_BATCH_FLUSH_PAGES,
MONITOR_LRU_BATCH_EVICT_COUNT, MONITOR_LRU_BATCH_EVICT_PAGES:
Remove.
MONITOR_LRU_BATCH_FLUSH_TOTAL_PAGE: Track buf_lru_flush_page_count
(innodb_buffer_pool_pages_LRU_flushed).
MONITOR_LRU_BATCH_EVICT_TOTAL_PAGE: Track buf_lru_freed_page_count
(buffer_pool_pages_LRU_freed).
Reviewed by: Vladislav Vaintroub
In theory, read-only workload shouldn't have anything to do with
dict_sys mutex. dict_sys mutex is meant to protect database metadata.
But then why does dict_sys mutex shows up as part of a read-only workload?
This workload needs to fetch stats for query processing and while reading
these stats dict_sys mutex is taken.
Is this really needed?
No. For the traditional reasons, it was the default global mutex used.
Based on 10.6 changes, flow can now use table->lock_mutex to protect
update/access of these stats. table mutexes being table specific
global contention arising out of dict_sys is reduced.
Thanks to Marko Makela for his early suggestion around proposed alternative
and review of the draft patch.
row_prebuilt_t::m_no_prefetch: Remove (it was always false).
row_prebuilt_t::m_read_virtual_key: Remove (it was always false).
Only ha_innopart ever set these fields.
In commit 8d16da1487 (MDEV-24789)
we accidentally introduced a race condition. During the time a
waiting lock request is being removed, the request might be
moved to another page due to a concurrent page split or merge.
To prevent this, we must hold exclusive lock_sys.latch when releasing
a record lock.
lock_release_autoinc_locks(): Avoid a potential hang.
No dict_table_t::lock_mutex must be waited for while already holding
lock_sys.wait_mutex or trx_t::mutex.
lock_cancel_waiting_and_release(): Correctly handle AUTO_INCREMENT locks.
lock_sys_t::deadlock_check(): Assume that only lock_sys.wait_mutex
is being held by the caller.
lock_sys_t::rd_lock_try(): New function.
lock_sys_t::cancel(trx_t*): Kill an active transaction that may be
holding a lock.
lock_sys_t::cancel(trx_t*, lock_t*): Cancel a waiting lock request.
lock_trx_handle_wait(): Avoid acquiring mutexes in some cases,
and in never acquire lock_sys.latch in exclusive mode.
This function is only invoked in a semi-consistent read
(locking a clustered index record only if it matches the search condition).
Normally, lock_wait() will take care of lock waits.
lock_wait(): Invoke the new function lock_sys_t::cancel() at the end,
to avoid acquiring exclusive lock_sys.latch.
lock_rec_other_trx_holds_expl(): Use LockGuard instead of LockMutexGuard.
lock_release_autoinc_locks(): Explicitly acquire table->lock_mutex,
in case only a shared lock_sys.latch is being held. Deadlock::report()
will still hold exclusive lock_sys.latch while invoking
lock_cancel_waiting_and_release().
lock_cancel_waiting_and_release(): Acquire trx->mutex in this function,
instead of expecting the caller to do so.
lock_unlock_table_autoinc(): Only acquire shared lock_sys.latch.
lock_table_has_locks(): Do not acquire lock_sys.latch at all.
Deadlock::check_and_resolve(): Only acquire shared lock_sys.latchm
for invoking lock_sys_t::cancel(trx, wait_lock).
innobase_query_caching_table_check_low(),
row_drop_tables_for_mysql_in_background(): Do not acquire lock_sys.latch.
A performance regression was introduced by
commit e71e613353 (MDEV-24671)
and mostly addressed by
commit 455514c800.
The regression is likely caused by increased contention
lock_sys.latch (former lock_sys.mutex), possibly indirectly
caused by contention on lock_sys.wait_mutex. This change aims to
reduce both, but further improvements will be needed.
lock_wait(): Minimize the lock_sys.wait_mutex hold time.
lock_sys_t::deadlock_check(): Add a parameter for indicating
whether lock_sys.latch is exclusively locked.
trx_t::was_chosen_as_deadlock_victim: Always use atomics.
lock_wait_wsrep(): Assume that no mutex is being held.
Deadlock::report(): Always kill the victim transaction.
lock_sys_t::timeout: New counter to back MONITOR_TIMEOUT.
rw_lock::upgrade_trylock(): If the compare-and-swap fails,
only assert that we are still holding the U lock and that
no conflicting lock exists. If the upgrade to X would fail due
to some thread holding an S latch, we will terminate the loop.
trx_t::commit_in_memory(): Invoke mod_tables.clear().
trx_free_at_shutdown(): Invoke mod_tables.clear() for transactions
that are discarded on shutdown.
Everywhere else, assert mod_tables.empty() on freed transaction objects.
Let us calculate the hash table cell address while we are calculating
the latch address, to avoid repeated computations of the address.
The latch address can be derived from the cell address with a simple
bitmask operation.
We have innodb_use_native_aio=ON by default since the introduction of
that parameter in commit 2f9fb41b05
(MySQL 5.5 and MariaDB 5.5).
However, to really benefit from the setting, the files should be
opened in O_DIRECT mode, to bypass the file system cache.
In this way, the reads and writes can be submitted with DMA, using
the InnoDB buffer pool directly, and no processor cycles need to be
used for copying data. The use of O_DIRECT benefits not only the
current libaio implementation, but also liburing.
os_file_set_nocache(): Test innodb_flush_method in the function,
not in the callers.
The fix of MDEV-23328 introduced a background thread for
killing conflicting transactions.
Thanks to the refactoring that was conducted in MDEV-24671,
the high-priority ("brute-force") applier thread can kill the
conflicting transactions itself, before waiting for the
locks to be finally released (after the conflicting transactions
have been rolled back).
This also allows us to remove the hack LockGGuard that had to
be added in MDEV-20612, and remove Galera-related function
parameters from lock creation.
A new configuration parameter innodb_deadlock_report is introduced:
* innodb_deadlock_report=off: Do not report any details of deadlocks.
* innodb_deadlock_report=basic: Report transactions and waiting locks.
* innodb_deadlock_report=full (default): Report also the blocking locks.
The improved deadlock checker will consider all involved transactions
in one loop, even if the deadlock loop includes several transactions.
The theoretical maximum number of transactions that can be involved in
a deadlock is `innodb_page_size` * 8, limited by the persistent data
structures.
Note: Similar to
mysql/mysql-server@3859219875
our deadlock checker will consider at most one blocking transaction
for each waiting transaction. The new field trx->lock.wait_trx be
nullptr if and only if trx->lock.wait_lock is nullptr. Note that
trx->lock.wait_lock->trx == trx (the waiting transaction), while
trx->lock.wait_trx points to one of the transactions whose lock is
conflicting with trx->lock.wait_lock.
Considering only one blocking transaction will greatly simplify
our deadlock checker, but it may also make the deadlock checker
blind to some deadlocks where the deadlock cycle is 'hidden' by
the fact that the registered trx->lock.wait_trx is not actually
waiting for any InnoDB lock, but something else. So, instead of
deadlocks, sometimes lock wait timeout may be reported.
To improve on this, whenever trx->lock.wait_trx is changed, we
will register further 'candidate' transactions in Deadlock::to_check(),
and check for 'revealed' deadlocks as soon as possible, in lock_release()
and innobase_kill_query().
The old DeadlockChecker was holding lock_sys.latch, even though using
lock_sys.wait_mutex should be less contended (and thus preferred)
in the likely case that no deadlock is present.
lock_wait(): Defer the deadlock check to this function, instead of
executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting().
DeadlockChecker: Complete rewrite:
(1) Explicitly keep track of transactions that are being waited for,
in trx->lock.wait_trx, protected by lock_sys.wait_mutex. Previously,
we were painstakingly traversing the lock heaps while blocking
concurrent registration or removal of any locks (even uncontended ones).
(2) Use Brent's cycle-detection algorithm for deadlock detection,
traversing each trx->lock.wait_trx edge at most 2 times.
(3) If a deadlock is detected, release lock_sys.wait_mutex,
acquire LockMutexGuard, re-acquire lock_sys.wait_mutex and re-invoke
find_cycle() to find out whether the deadlock is still present.
(4) Display information on all transactions that are involved in the
deadlock, and choose a victim to be rolled back.
lock_sys.deadlocks: Replaces lock_deadlock_found. Protected by wait_mutex.
Deadlock::find_cycle(): Quickly find a cycle of trx->lock.wait_trx...
using Brent's cycle detection algorithm.
Deadlock::report(): Report a deadlock cycle that was found by
Deadlock::find_cycle(), and choose a victim with the least weight.
Altogether, we may traverse each trx->lock.wait_trx edge up to 5
times (2*find_cycle()+1 time for reporting and choosing the victim).
Deadlock::check_and_resolve(): Find and resolve a deadlock.
lock_wait_rpl_report(): Report the waits-for information to
replication. This used to be executed as part of DeadlockChecker.
Replication must know the waits-for relations even if no deadlocks
are present in InnoDB.
Reviewed by: Vladislav Vaintroub
ssux_lock_low::write_lock(): Before invoking writer_wait(), keep
attempting write_lock_wait_try() as long as no conflict exists.
rw_lock::upgrade_trylock(): Relax a bogus assertion and correct
the acquisition operation. Another thread may be executing in
ssux_lock_low::write_lock() on the same latch. Because we are the
only thread that can make progress on that latch, we must become
the writer. Any waiting thread will be eventually woken up by
ssux_lock_low::u_unlock() or ssux_lock_low::wr_unlock(), but not
by wr_u_downgrade() because the upgrade is a very rare operation.