Problem:
========
- During shutdown, InnoDB tries to free the asynchronous
I/O slots and hangs. The reason is that InnoDB disables
asynchronous I/O before waiting for pending
asynchronous I/O to finish.
buf_load(): InnoDB aborts the buffer pool load due to a
user-requested shutdown and doesn't wait for the pending asynchronous
reads to complete. This could lead to a debug assertion failure
in buf_flush_buffer_pool() during shutdown.
Fix:
===
os_aio_free(): Should wait for all read_slots and write_slots
to finish before disabling asynchronous I/O.
buf_load(): Should wait for the pending read requests to complete
even though the load was aborted.
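A minimal sketch of the drain-before-teardown ordering that the fix
establishes, using standard C++ primitives; the slot arrays and names
below are stand-ins, not the real InnoDB structures:

#include <condition_variable>
#include <mutex>

struct aio_array
{
  std::mutex m;
  std::condition_variable cv;
  unsigned pending= 0;                      // slots currently in use

  void release_slot()                       // called when an I/O completes
  {
    std::lock_guard<std::mutex> g(m);
    if (--pending == 0)
      cv.notify_all();
  }

  void wait_until_empty()                   // called from shutdown
  {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return pending == 0; });
  }
};

void os_aio_free_sketch(aio_array &read_slots, aio_array &write_slots)
{
  read_slots.wait_until_empty();            // drain all pending reads first
  write_slots.wait_until_empty();           // ...and all pending writes
  // only now disable asynchronous I/O and free the slot arrays
}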
- The columns stat_value and sample_size in the mysql.innodb_index_stats
table are declared as BIGINT UNSIGNED without any check constraint.
A user can manually update the values of stat_value and sample_size
to zero. InnoDB then aborts the server while reading the statistics
information because InnoDB expects at least one leaf
page to exist for the index.
- To fix this issue, InnoDB should interpret the values of
stat_n_leaf_pages and stat_index_size in innodb_index_stats, and
stat_clustered_index_size and stat_sum_of_other_index_sizes
in innodb_table_stats, as valid even though the user
set them to 0.
DML transactions on FK-child tables also get table locks
on FK-parent tables. If there is a DML transaction holding
such a lock, and a TOI transaction starts, the latter
BF-aborts the former and puts itself into a waiting state.
If at this moment another DML transaction on an FK-child table
starts, it doesn't check that the transaction waiting on
a parent-table lock is TOI, and it erroneously BF-aborts
the waiting TOI transaction.
The fix: don't roll back a high-priority transaction waiting
on a lock in InnoDB; instead, roll back the incoming DML
transaction.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Rectify cases of mismatched brackets and address
possible cases of division by zero by checking if
the denominator is zero before dividing.
No functional changes were made.
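A minimal sketch of the guard pattern described above; the names are
illustrative and not taken from the patched code:

static double safe_ratio(double numerator, double denominator)
{
  // check the denominator before dividing, so a zero denominator can
  // never trigger a division by zero
  return denominator == 0 ? 0 : numerator / denominator;
}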
All new code of the whole pull request, including one or several
files that are either new files or modified ones, are contributed
under the BSD-new license. I am contributing on behalf of my
employer Amazon Web Services, Inc.
The issue was that when repairing an Aria table of row format PAGE and
the data file was bigger than 4G, the data file length was cut short
because of wrong parameters to MY_ALIGN().
The effect was that ALTER TABLE, OPTIMIZE TABLE or REPAIR TABLE would fail
on these tables, possibly corrupting them.
The MDEV also exposed a bug where the error state was not propagated properly
to the upper level if the number of rows in the table changed.
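To illustrate the truncation (the MY_ALIGN definition below is reproduced
from memory and the values are made up): when the macro is fed a 32-bit
operand, a data file length above 4G wraps around and the aligned length
is cut short.

#include <cstdint>
#include <cstdio>

#define MY_ALIGN(A, L) (((A) + (L) - 1) & ~((L) - 1))

int main()
{
  uint64_t file_length= 5ULL << 30;         // a 5 GiB data file
  uint32_t block_size= 8192;

  // Wrong: a 32-bit operand truncates the length before aligning it.
  uint32_t cut_short= MY_ALIGN(uint32_t(file_length), block_size);
  // Right: keep the whole expression in 64 bits.
  uint64_t full= MY_ALIGN(file_length, uint64_t(block_size));

  std::printf("truncated=%u full=%llu\n", cut_short,
              (unsigned long long) full);
  return 0;
}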
In read-only mode, InnoDB does not allow a checkpoint to happen.
So InnoDB should issue a warning when it tries to
force a checkpoint while innodb_read_only = 1 or
innodb_force_recovery = 6.
The symptoms were: take a server with no activity and a table that's
not in the buffer pool. Run a query that reads the whole table and
observe that r_engine_stats.pages_read_count shows about 2% of the table
was read. Who reads the rest?
The cause was that page prefetching done inside InnoDB was not counted.
This patch counts page prefetch requests made in buf_read_ahead_random() and
buf_read_ahead_linear() and makes them visible in:
- ANALYZE: r_engine_stats.pages_prefetch_read_count
- Slow Query Log: Pages_prefetched:
This patch intentionally doesn't attempt to count the time to read the
prefetched pages:
* there's no obvious place where one can do it
* prefetch reads may be done in parallel (right?), it is not clear how
to count the time in this case.
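The counting itself is straightforward; a hedged sketch of the idea
(illustrative names, not the actual InnoDB or handler code):

#include <atomic>
#include <cstddef>

struct engine_stats
{
  std::atomic<std::size_t> pages_read_count{0};
  std::atomic<std::size_t> pages_prefetch_read_count{0};
};

std::size_t read_ahead_linear_sketch(engine_stats &stats,
                                     std::size_t pages_to_issue)
{
  // ... queue the asynchronous reads for the extent ...
  stats.pages_prefetch_read_count+= pages_to_issue;  // previously not counted
  return pages_to_issue;
}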
The performance regression seen while loading the BP is caused by the
deadlock fix given in MDEV-33543. The area of impact is wider but is
more visible when the BP is being loaded initially via DMLs. Specifically,
the response time could be impacted for DML doing pessimistic operations
on an index (split/merge) when the leaf pages are not found in the
buffer pool. It is more likely to occur with a small BP size.
The origin of the issue dates back to MDEV-30400 that introduced
btr_cur_t::search_leaf() replacing btr_cur_search_to_nth_level() for
leaf page searches. In btr_latch_prev, we use RW_NO_LATCH to get the
previous page fixed in BP without latching. When the page is not in BP,
we try to acquire and wait for an S latch, violating the latching order.
This deadlock was analyzed in MDEV-33543 and fixed by using the already
present wait logic in buf_page_get_gen() instead of waiting for latch.
The wait logic is inferior to the usual S latch wait and is simply a
repeated sleep of 100 microseconds (the actual sleep time could be longer
depending on the platform). The bug was seen with the "change-buffering"
code path and the idea was that this path should be exercised less. That
judgement was not correct: the path is actually quite frequent and does
impact performance when pages are not in the BP and are being loaded by
DML expanding/shrinking large data.
Fix: While trying to get a page with RW_NO_LATCH, if we are attempting an
"out of order" latch, return from buf_page_get_gen() immediately instead
of waiting, and follow the ordered latching path.
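A schematic mock of the decision (not the real buf_page_get_gen(); all
names and stubs below are hypothetical): on a miss, an RW_NO_LATCH
request that would latch out of order now fails fast so the caller can
retry along the ordered latching path instead of sleeping in a loop.

enum rw_mode { RW_S_LATCH, RW_X_LATCH, RW_NO_LATCH };
struct buf_page { int dummy; };

static buf_page *lookup_in_pool() { return nullptr; }   // stub: page not cached
static buf_page *read_and_wait()                         // stub: blocking read path
{
  static buf_page p;
  return &p;
}

buf_page *get_page_sketch(rw_mode mode, bool out_of_order_latch)
{
  if (buf_page *bpage= lookup_in_pool())
    return bpage;                     // fast path: already in the buffer pool
  if (mode == RW_NO_LATCH && out_of_order_latch)
    return nullptr;                   // the fix: return immediately, no repeated sleeps
  return read_and_wait();             // ordinary miss: read the page and wait
}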
Valgrind regards the assertions as examining uninitialized values.
As the assertions are exercised in other debug builds, we know
they aren't entirely invalid.
Account for Valgrind by removing the assertion under
WITH_VALGRIND=1 compilation.
InnoDB transactions may be reused after being committed:
- when taken from the transaction pool
- during a DDL operation execution
In this case the wsrep flag on the trx object is cleared, which may cause wrong
execution logic afterwards (wsrep-related hooks are not run).
Initialize trx->wsrep from the THD object only once, at InnoDB transaction
start, and don't change it throughout the transaction's lifetime.
The flag is reset at commit time as before.
Unconditionally set wsrep=OFF for THD objects that represent InnoDB background
threads.
Make Wsrep_schema::store_view() operate in its own transaction.
Fix the rollback of streaming replication transactions' fragments so that it
does not switch the THD->wsrep value during the transaction's execution
(use THD->wsrep_ignore_table as a workaround).
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
During a Sysbench oltp_point_select workload with 1 table and 400
concurrent connections, a bottleneck on dict_table_t::lock_mutex was
observed in ha_innobase::info_low().
dict_table_t::lock_latch: Replaces lock_mutex.
In ha_innobase::info_low() and several other places, we will acquire
a shared dict_table_t::lock_latch or we may elide the latch if
hardware memory transactions are available.
innobase_build_v_templ(): Remove the parameter "bool locked", and
require the caller to hold exclusive dict_table_t::lock_latch
(instead of holding an exclusive dict_sys.latch).
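As an analogy in standard C++ (not the actual dict_table_t code, and
ignoring the hardware-memory-transaction elision), statistics readers now
take a shared latch, so concurrent ha_innobase::info_low() calls no longer
serialize on a plain mutex:

#include <shared_mutex>

struct table_stats_sketch
{
  mutable std::shared_mutex lock_latch;   // replaces the former lock_mutex
  unsigned long n_rows= 0;

  unsigned long read_n_rows() const
  {
    std::shared_lock<std::shared_mutex> s(lock_latch);  // many readers in parallel
    return n_rows;
  }

  void set_n_rows(unsigned long n)
  {
    std::unique_lock<std::shared_mutex> x(lock_latch);  // writers stay exclusive
    n_rows= n;
  }
};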
Tested by: Vladislav Vaintroub
Reviewed by: Vladislav Vaintroub
The PFS_atomic class contains wrappers around my_atomic_* operations, which
are macros around the GNU atomic operations (__atomic_*). Due to differences
between compiler implementations, clang may encounter errors when compiling
on the x86_32 architecture.
The following functions are replaced with the C++ std::atomic type in
the performance schema code base:
- PFS_atomic::store_*()
-> my_atomic_store*
-> __atomic_store_n()
=> std::atomic<T>::store()
- PFS_atomic::load_*()
-> my_atomic_load*
-> __atomic_load_n()
=> std::atomic<T>::load()
- PFS_atomic::add_*()
-> my_atomic_add*
-> __atomic_fetch_add()
=> std::atomic<T>::fetch_add()
- PFS_atomic::cas_*()
-> my_atomic_cas*
-> __atomic_compare_exchange_n()
=> std::atomic<T>::compare_exchange_strong()
and the PFS_atomic class could be dropped completely.
Note that in the wrappers the memory order passed to the original GNU atomic
extensions is hard-coded as `__ATOMIC_SEQ_CST`, which is equivalent to
`std::memory_order_seq_cst` in C++ and is the default parameter for the
std::atomic_* functions.
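For instance, a counter that previously went through the wrappers can use
std::atomic directly (illustrative code, not an excerpt from the patch):

#include <atomic>
#include <cstdint>

static std::atomic<uint32_t> counter{0};

void atomics_demo()
{
  counter.store(1);                                  // was PFS_atomic::store_*()
  uint32_t v= counter.load();                        // was PFS_atomic::load_*()
  counter.fetch_add(1);                              // was PFS_atomic::add_*()
  uint32_t expected= v;
  counter.compare_exchange_strong(expected, v + 2);  // was PFS_atomic::cas_*()
  // All of the calls above default to std::memory_order_seq_cst, matching
  // the hard-coded __ATOMIC_SEQ_CST in the old wrappers.
}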
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services.
Just like the spider/bugfix suite.
One caveat is that my_2_3.cnf needs something under mysqld.2.3 group,
otherwise mtr will fail with something like:
There is no group named 'mysqld.2.3' that can be used to resolve
'port' for ...
This will allow new tests under the spider suite to use what is
needed. It also somehow fixes issues of running a test followed by
spider.slave_trx_isolation.
ha_innobase::info_low(): For HA_STATUS_VARIABLE without
HA_STATUS_VARIABLE_EXTRA, let us avoid unnecessary and costly updates
of the data_free statistics, which are only needed for SHOW TABLE STATUS.
This optimization had been enabled in
commit 247ecb7597 but not utilized until now.
The fixes in b8a6719889 did not disable
semi-consistent read for innodb_snapshot_isolation=ON mode; they just allowed
reading an uncommitted version of a record, which is why the test for MDEV-26643
worked well.
The semi-consistent read should be disabled at the upper level, in
row_search_mvcc(), for the READ COMMITTED isolation level.
Reviewed by Marko Mäkelä.
- InnoDB tries to write a FILE_CHECKPOINT marker during
early recovery when the log file size is insufficient.
While updating the log checkpoint at the end of the recovery,
InnoDB must already have written out all pending changes
to the persistent files. To complete the checkpoint, InnoDB
has to write some log records for the checkpoint and to
update the checkpoint header. If the server gets killed
before updating the checkpoint header, the log file would
become unrecoverable.
- This patch avoids writing the FILE_CHECKPOINT marker during
early recovery and narrows down the window of opportunity for
making the log file unrecoverable.
- The deadlock counter was moved from
Deadlock::find_cycle into Deadlock::report, because
the find_cycle method is called multiple times during deadlock
detection flow, which means it shouldn't have such side effects.
But report() can have such side effects, as it is called only once for
a victim transaction.
- Also, the deadlock_detect.test and *.result test cases
have been extended to cover the fix.
The problem was that there were two non-conflicting local idle
transactions in node_1 that both inserted a key into a primary key.
Then two transactions from other nodes also inserted a key into the
primary key, such that the insert from node_2 conflicted with one of
the local transactions in node_1, so there would be a duplicate key
if both were committed. For this, the insert from the other node tries
to acquire an S-lock on this record, and because this insert is a
high-priority brute force (BF) transaction, it will kill the idle
local transaction.
Concurrently, the second insert from node_3 conflicts with the second
idle insert transaction in node_1. Again, it tries to acquire an
S-lock on this record and kills the idle local transaction.
At this point we have two non-conflicting high-priority
transactions holding S-locks on different records in node_1.
For example like this: rec s-lock-node2-rec s-lock-node3-rec rec.
Because these high-priority BF-transactions do not wait for
each other, the insert from node_3, which has a later seqno than
the insert from node_2, can continue. It will try to acquire an
insert intention lock for the record it tries to insert (to avoid
a duplicate key being inserted by a local transaction). However,
it will note that there is a conflicting S-lock in the same gap
between records. This leads to a deadlock error, as we have
defined that BF-transactions may not wait for record locks,
but we can't kill the conflicting BF-transaction because
it has a lower seqno and should commit first.
The BF-transactions are executed concurrently because their
primary key values are different, i.e. they do not
conflict.
Galera certification will make sure that inserts from
other nodes, i.e. these high-priority BF-transactions,
can't insert duplicate keys. Local transactions naturally
can, but they will be killed when a BF-transaction
acquires the required record locks.
Therefore, we can allow the situation where there is a conflicting
S-lock and an insert intention lock regardless of their seqno
order, and let both continue without waiting. This leads
to a situation where we need to allow a BF-transaction
to wait when lock_rec_has_to_wait_in_queue is called,
because this function is also called from
lock_rec_queue_validate, and because the lock is waiting
there would be an assertion failure in ut_a(lock->is_gap()
|| lock_rec_has_to_wait_in_queue(cell, lock));
lock_wait_wsrep_kill
Add debug sync points for BF-transactions killing
a local transaction.
wsrep_assert_no_bf_bf_wait
Also print the requested lock information.
lock_rec_has_to_wait
Add a function to handle wsrep transaction lock wait
cases.
lock_rec_has_to_wait_wsrep
New function to handle wsrep transaction lock wait
exceptions.
lock_rec_has_to_wait_in_queue
Remove the wsrep exception; in this function all
conflicting locks need to wait in the queue.
Conflicts between BF and local transactions
are handled in lock_wait.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
In an I/O bound concurrent INSERT test conducted by Mark Callaghan,
spin loops on dict_index_t::lock turn out to be beneficial.
This is a mixed bag; enabling the spin loops will improve throughput
and latency on some workloads and degrade them on others.
Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
Performance tested by: Axel Schwenke
srw_mutex_impl<spinloop>::wait_and_lock(): Invoke srw_pause() and
reload the lock word on each loop. Thanks to Mark Callaghan for
suggesting this.
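A generic illustration of that pattern (not the srw_mutex_impl code):
pause and reload the lock word on every iteration of a bounded spin
before giving up and falling back to the blocking wait.

#include <atomic>
#if defined __x86_64__ || defined __i386__ || defined _M_X64 || defined _M_IX86
# include <immintrin.h>
# define cpu_pause() _mm_pause()
#else
# define cpu_pause() ((void) 0)
#endif

bool try_spin_lock(std::atomic<unsigned> &lock_word, unsigned spin_rounds)
{
  for (unsigned i= 0; i < spin_rounds; i++)
  {
    unsigned expected= 0;
    if (lock_word.compare_exchange_weak(expected, 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed))
      return true;   // acquired without blocking
    cpu_pause();     // srw_pause() analogue before the next reload-and-retry
  }
  return false;      // caller falls back to the blocking wait
}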
ssux_lock_impl<spinloop>::rd_wait(): Actually implement a spin loop
on the rw-lock component without blocking on the mutex component.
If there is a conflict with wr_lock(), wait for writer.lock to be
released without actually acquiring it.
Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
When MariaDB is built with PERFORMANCE_SCHEMA support enabled
and with futex-based rw-locks (not srw_lock_), we were unnecessarily
releasing and reacquiring lock.writer in srw_lock_impl::psi_wr_lock()
and ssux_lock::psi_wr_lock().
If there is a conflict with rd_lock(), let us hold the lock.writer
and execute u_wr_upgrade() to wait for rd_unlock().
Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
The U lock mode of the sux_lock that was introduced in
commit 03ca6495df (MDEV-24142)
is unnecessarily complex.
Internally, sux_lock comprises two parts, each with its own wait queue
inside the operating system kernel: a mutex and an rw-lock.
We can map the operations as follows:
x_lock(): (X,X)
u_lock(): (X,_)
s_lock(): (_,S)
The Update lock mode, which is mutually exclusive with itself and with
X (exclusive) locks but not with shared (S) locks, was unnecessarily
acquiring a shared lock on the second component. The mutual exclusion
is guaranteed by the first component.
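A schematic sketch of the simplified mapping in standard C++ (not the
actual sux_lock code): the first component alone gives U and X their
mutual exclusion, so u_lock() no longer touches the second component.

#include <mutex>
#include <shared_mutex>

class sux_like_lock
{
  std::mutex writer;          // component 1: U and X exclude each other here
  std::shared_mutex readers;  // component 2: S locks and the X lock meet here
public:
  void x_lock() { writer.lock(); readers.lock(); }    // (X,X)
  void u_lock() { writer.lock(); }                    // (X,_): no shared part acquired
  void s_lock() { readers.lock_shared(); }            // (_,S)

  void u_x_upgrade() { readers.lock(); }              // U -> X adds the missing half

  void x_unlock() { readers.unlock(); writer.unlock(); }
  void u_unlock() { writer.unlock(); }
  void s_unlock() { readers.unlock_shared(); }
};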
We might simplify the #ifdef SUX_LOCK_GENERIC case further by omitting
srw_mutex_impl::lock, because it is kind-of duplicating the mutex
that we will use for having a wait queue. However, the predicate
buf_page_t::can_relocate() would depend on the predicate
is_locked_or_waiting(), which is not available for pthread_mutex_t.
Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
- During recovery, InnoDB may fail to shrink the undo tablespaces
when there are no pages to recover while applying the redo log.
This issue exists only when innodb_undo_truncate is enabled.
trx_lists_init_at_db_start() could've applied the redo logs
for undo tablespace page0.
On a UBSAN clang-15 build, if running with the UBSAN option
halt_on_error=1 (the issue doesn't show up without it),
MTR fails during mysqld --bootstrap with the UBSAN error:
call to function io_callback(tpool::aiocb*) through pointer to incorrect function type 'void (*)(void *)'
This patch corrects the parameter type of io_callback
to match the type expected by callback_func,
i.e. (void*).
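A hedged sketch of the fix (the aiocb and callback_func stand-ins below are
reconstructed from the description, not quoted from the tree): the function's
own type now matches the pointer type it is called through, and the cast
happens inside the body.

#include <cstdio>

struct aiocb { int ret; };                 // stand-in for tpool::aiocb
typedef void (*callback_func)(void *);     // the expected callback type

void io_callback(void *data)               // matches callback_func exactly
{
  aiocb *cb= static_cast<aiocb *>(data);   // cast inside the function body
  std::printf("completed with ret=%d\n", cb->ret);
}

int main()
{
  aiocb cb{0};
  callback_func f= io_callback;            // no function-pointer cast needed
  f(&cb);                                  // well-defined call, no UBSAN report
  return 0;
}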
Reviewed By:
============
<TODO>
In cmake -DWITH_UBSAN=ON builds with clang but not with GCC,
-fsanitize=undefined will flag several runtime errors on
function pointer mismatch related to the lock-free hash table LF_HASH.
Let us use matching function signatures and remove function pointer
casts in order to avoid potential bugs due to undefined behaviour.
These errors could be caught at compilation time by
-Wcast-function-type-strict, which is available starting with clang-16,
but not available in any version of GCC as of now. The old GCC flag
-Wcast-function-type is enabled as part of -Wextra, but it specifically
does not catch these errors.
Reviewed by: Vladislav Vaintroub
A few different incorrect function type UBSAN issues have been
grouped into this patch.
The only real potentially undefined behavior is an error about
show_func_mutex_instances_lost, which when invoked in
sql_show.cc::show_status_array(), puts 5 arguments onto the stack;
however, the implementing function only actually has 3 parameters (so
only 3 would be popped). This was fixed by adding in the remaining
parameters to satisfy the type mysql_show_var_func.
The rest of the findings are pointer type mismatches that wouldn't
lead to actual undefined behavior. The lf_hash_initializer function
type definition is
typedef void (*lf_hash_initializer)(LF_HASH *hash, void *dst, const void *src);
but the MDL_lock and table cache's implementations of this function
do not have that signature. The MDL_lock has specific MDL object
parameters:
static void lf_hash_initializer(LF_HASH *hash __attribute__((unused)),
MDL_lock *lock, MDL_key *key_arg)
and the table cache has specific TDC parameters:
static void tdc_hash_initializer(LF_HASH *,
TDC_element *element, LEX_STRING *key)
leading to UBSAN runtime errors when invoking these functions.
This patch fixes these type mis-matches by changing the
implementing functions to use void * and const void * for their
respective parameters, and later casting them to their expected
type in the function body.
Note too that the functions tdc_hash_key and tc_purge_callback had
a similar problem to tdc_hash_initializer and were fixed
similarly.
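A hedged reconstruction of the pattern (simplified types, not the actual
MDL code): the implementation now uses the exact pointer types of the
typedef and casts to the concrete types inside the body.

#include <cstring>

struct LF_HASH;
struct MDL_key  { char buf[16]; };
struct MDL_lock { MDL_key key; };

typedef void (*lf_hash_initializer)(LF_HASH *hash, void *dst, const void *src);

static void mdl_lock_initializer(LF_HASH *, void *dst, const void *src)
{
  MDL_lock *lock= static_cast<MDL_lock *>(dst);
  const MDL_key *key_arg= static_cast<const MDL_key *>(src);
  std::memcpy(&lock->key, key_arg, sizeof lock->key);
}

int main()
{
  lf_hash_initializer init= mdl_lock_initializer;  // types match, no cast, no UB
  MDL_lock lock;
  MDL_key key{};
  init(nullptr, &lock, &key);
  return 0;
}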
Reviewed By:
============
Sergei Golubchik <serg@mariadb.com>
number of non-user tablespace.
fil_space_t::try_to_close(): Don't try to close
the tablespace that was acquired by the caller of
the function.
Added a suppression message to the open_files_limit test case.
number of non-user tablespace.
- InnoDB only closes user tablespaces when the number of open
files exceeds the innodb_open_files limit. In that case, InnoDB should
make sure that the innodb_open_files value is greater
than the number of undo tablespace, system and temporary tablespace files.
Problem:
=======
- This commit is a merge of mysql commit 129ee47ef994652081a11ee9040c0488e5275b14.
InnoDB FTS can be left in an inconsistent state when the server is
terminated during a sync operation before the operation has been
committed. This could lead to an incorrect synced doc id and
incorrect query results.
Solution:
========
- During the sync commit operation, InnoDB should pass
the sync transaction when updating the max doc id
in the config table.
fts_read_synced_doc_id(): This function is used
to read only the synced doc id from the config table.
In commit 99bd22605938c42d876194f2ec75b32e658f00f5 (MDEV-31558)
we wrongly thought that there would be minimal overhead for accessing
a thread-local variable mariadb_stats.
It turns out that in C++11, each access to an extern thread_local
variable requires conditionally invoking an initialization function.
In fact, the initializer expression of mariadb_stats is dynamic, and
those calls were actually unavoidable.
In C++20, one could declare constinit thread_local variables, but
the address of a thread_local variable (&mariadb_dummy_stats) is not
a compile-time constant. We did not want to declare mariadb_dummy_stats
without thread_local, because then the dummy accesses could lead to
cache line contention between threads.
mariadb_stats: Declare as __thread or __declspec(thread) so that
there will be no dynamic initialization, but zero-initialization.
mariadb_dummy_stats: Remove. It is a lesser evil to let
the environment perform zero-initialization and check if
!mariadb_stats.
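A sketch of the difference (illustrative declarations, not the actual
header): an extern thread_local with a dynamic initializer makes every
access from another translation unit go through a guarded initialization
call, whereas zero-initialized __thread / __declspec(thread) data needs
no such check, and callers simply test for a null pointer.

#if defined _MSC_VER
__declspec(thread) void *mariadb_stats;   // zero-initialized, no init guard
#else
__thread void *mariadb_stats;             // zero-initialized, no init guard
#endif

void note_stat_update()
{
  if (mariadb_stats)   // check for nullptr instead of pointing at a dummy object
  {
    // ... update the per-thread statistics ...
  }
}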
Reviewed by: Sergei Petrunia
btr_cur_t::search_leaf(): Invoke btr_cur_need_opposite_intention() after
positioning page_cur.rec so that the record will be in the intended page.
This is something that was broken in
commit f2096478d5 or
commit de4030e4d4 or related changes.
btr_cur_need_opposite_intention(): Add a debug assertion that would
catch the misuse.
The "next line of defence" that should have caught this bug in debug builds
are assertions that mtr_t::m_memo contains MTR_MEMO_X_LOCK for the
dict_index_t::lock. When btr_cur_need_opposite_intention() holds,
we should escalate to acquiring an exclusive index->lock in
btr_cur_t::pessimistic_search_leaf().
Reviewed by: Debarun Banerjee
buf_pool_invalidate(): Properly wait for
os_aio_wait_until_no_pending_writes() to ensure that there
are no pending buf_page_t::write_complete() or buf_page_write_complete()
operations. This will avoid a failure of buf_pool.assert_all_freed().
This bug should affect debug builds only. At this point, the
buf_pool.flush_list should be clear and all changes should have
been written out. The loop around buf_LRU_scan_and_free_block() should
have eventually completed and freed all pages as soon as
buf_page_t::write_complete() had a chance to release the page latches.
It is worth noting that buf_flush_wait() is working as intended.
As soon as buf_flush_page_cleaner() invokes
buf_pool.get_oldest_modification() it will observe that
buf_page_t::write_complete() had assigned oldest_modification_ to 1,
and remove such blocks from buf_pool.flush_list. Upon reaching
buf_pool.flush_list.count=0 the buf_flush_page_cleaner() will mark
itself idle and wake buf_flush_wait() by broadcasting
buf_pool.done_flush_list.
This regression was introduced in
commit a55b951e60 (MDEV-26827).
Reviewed by: Debarun Banerjee