Commit graph

28061 commits

Author SHA1 Message Date
Thirunarayanan Balathandayuthapani
834c013b64 MDEV-34519 innodb_log_checkpoint_now crashes when innodb_read_only is enabled
During read only mode, InnoDB doesn't allow checkpoint to happen.
So InnoDB should throw the warning when InnoDB tries to
force the checkpoint when innodb_read_only = 1 or
innodb_force_recovery = 6.
2024-07-05 15:26:05 +05:30
Sergei Petrunia
513c827041 MDEV-34190: r_engine_stats.pages_read_count is unrealistically low
The symptoms were: take a server with no activity and a table that's
not in the buffer pool. Run a query that reads the whole table and
observe that r_engine_stats.pages_read_count shows about 2% of the table
was read. Who reads the rest?

The cause was that page prefetching done inside InnoDB was not counted.

This counts page prefetch requests made in buf_read_ahead_random() and
buf_read_ahead_linear() and makes them visible in:

- ANALYZE: r_engine_stats.pages_prefetch_read_count
- Slow Query Log: Pages_prefetched:

This patch intentionally doesn't attempt to count the time to read the
prefetched pages:
* there's no obvious place where one can do it
* prefetch reads may be done in parallel (right?), it is not clear how
  to count the time in this case.
2024-07-04 15:24:49 +03:00
Oleksandr Byelkin
034a175982 Merge branch '10.6' into 10.11 2024-07-04 11:52:07 +02:00
mariadb-DebarunBanerjee
73ad436e16 MDEV-34458 wait_for_read in buf_page_get_low hurts performance
The performance regression seen while loading BP is caused by the
deadlock fix given in MDEV-33543. The area of impact is wider but is
more visible when BP is being loaded initially via DMLs.  Specifically
the response time could be impacted in DML doing pessimistic operation
on index(split/merge) and the leaf pages are not found in buffer pool.
It is more likely to occur with small BP size.

The origin of the issue dates back to MDEV-30400 that introduced
btr_cur_t::search_leaf() replacing btr_cur_search_to_nth_level() for
leaf page searches. In btr_latch_prev, we use RW_NO_LATCH to get the
previous page fixed in BP without latching. When the page is not in BP,
we try to acquire and wait for S latch violating the latching order.

This deadlock was analyzed in MDEV-33543 and fixed by using the already
present wait logic in buf_page_get_gen() instead of waiting for latch.
The wait logic is inferior to usual S latch wait and is simply a
repeated sleep 100 of micro-sec (The actual sleep time could be more
depending on platforms). The bug was seen with "change-buffering" code
path and the idea was that this path should be less exercised. The
judgement was not correct and the path is actually quite frequent and
does impact performance when pages are not in BP and being loaded by
DML expanding/shrinking large data.

Fix: While trying to get a page with RW_NO_LATCH and we are attempting
"out of order" latch, return from buf_page_get_gen immediately instead
of waiting and follow the ordered latching path.
2024-07-03 18:08:43 +05:30
Oleksandr Byelkin
dcd8a64892 Merge branch '10.5' into 10.6 2024-07-03 13:27:23 +02:00
Daniel Black
25c6e3e4b7 MDEV-34502 InnoDB debug mode build - asserts with Valgrind
Valgrind looks as the assertions as examining uninitalized values.

As the assertions are tested in other Debug builds we know
it isn't all invalid.

Account for Valgrind by removing the assertion under
the WITH_VALGRIND=1 compulation.
2024-07-03 20:07:04 +10:00
Denis Protivensky
f6fcfc1a6a MDEV-33064 amendment: replace trx->wsrep=0 with an assert in trx_rollback_for_mysql()
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
2024-07-01 13:08:11 +02:00
Denis Protivensky
cfbd57dfb7 MDEV-33064: Sync trx->wsrep state from THD on trx start
InnoDB transactions may be reused after committed:
- when taken from the transaction pool
- during a DDL operation execution

In this case wsrep flag on trx object is cleared, which may cause wrong
execution logic afterwards (wsrep-related hooks are not run).

Make trx->wsrep flag initialize from THD object only once on InnoDB transaction
start and don't change it throughout the transaction's lifetime.
The flag is reset at commit time as before.

Unconditionally set wsrep=OFF for THD objects that represent InnoDB background
threads.

Make Wsrep_schema::store_view() operate in its own transaction.

Fix streaming replication transactions' fragments rollback to not switch
THD->wsrep value during transaction's execution
(use THD->wsrep_ignore_table as a workaround).

Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
2024-07-01 13:07:39 +02:00
Marko Mäkelä
1d76794aba Merge 10.6 into 10.11 2024-06-28 16:03:28 +03:00
Marko Mäkelä
d1ecf5cc5f MDEV-32176 Contention in ha_innobase::info_low()
During a Sysbench oltp_point_select workload with 1 table and 400
concurrent connections, a bottleneck on dict_table_t::lock_mutex was
observed in ha_innobase::info_low().

dict_table_t::lock_latch: Replaces lock_mutex.

In ha_innobase::info_low() and several other places, we will acquire
a shared dict_table_t::lock_latch or we may elide the latch if
hardware memory transactions are available.

innobase_build_v_templ(): Remove the parameter "bool locked", and
require the caller to hold exclusive dict_table_t::lock_latch
(instead of holding an exclusive dict_sys.latch).

Tested by: Vladislav Vaintroub
Reviewed by: Vladislav Vaintroub
2024-06-28 15:57:07 +03:00
Marko Mäkelä
4ca355d863 MDEV-33894: Resurrect innodb_log_write_ahead_size
As part of commit 685d958e38 (MDEV-14425)
the parameter innodb_log_write_ahead_size was removed, because it was
thought that determining the physical block size would be a sufficient
replacement.

However, we can only determine the physical block size on Linux or
Microsoft Windows. On some file systems, the physical block size
is not relevant. For example, XFS uses a block size of 4096 bytes
even if the underlying block size may be smaller.

On Linux, we failed to determine the physical block size if
innodb_log_file_buffered=OFF was not requested or possible.
This will be fixed.

log_sys.write_size: The value of the reintroduced parameter
innodb_log_write_ahead_size. To keep it simple, this is read-only
and a power of two between 512 and 4096 bytes, so that the previous
alignment guarantees are fulfilled. This will replace the previous
log_sys.get_block_size().

log_sys.block_size, log_t::get_block_size(): Remove.

log_t::set_block_size(): Ensure that write_size will not be less
than the physical block size. There is no point to invoke this
function with 512 or less, because that is the minimum value of
write_size.

innodb_params_adjust(): Add some disabled code for adjusting
the minimum value and default value of innodb_log_write_ahead_size
to reflect the log_sys.write_size.

log_t::set_recovered(): Mark the recovery completed. This is the
place to adjust some things if we want to allow write_size>4096.

log_t::resize_write_buf(): Refer to write_size.

log_t::resize_start(): Refer to write_size instead of get_block_size().

log_write_buf(): Simplify some arithmetics and remove a goto.

log_t::write_buf(): Refer to write_size. If we are writing less than
that, do not switch buffers, but keep writing to the same buffer.
Move some code to improve the locality of reference.

recv_scan_log(): Refer to write_size instead of get_block_size().

os_file_create_func(): For type==OS_LOG_FILE on Linux, always invoke
os_file_log_maybe_unbuffered(), so that log_sys.set_block_size() will
be invoked even if we are not attempting to use O_DIRECT.

recv_sys_t::find_checkpoint(): Read the entire log header
in a single 12 KiB request into log_sys.buf.

Tested with:
./mtr --loose-innodb-log-write-ahead-size=4096
./mtr --loose-innodb-log-write-ahead-size=2048
2024-06-27 16:38:08 +03:00
Marko Mäkelä
27a3366663 Merge 10.6 into 10.11 2024-06-27 10:26:09 +03:00
Meng-Hsiu Chiang
55db59f16d [MDEV-28162] Replace PFS_atomic with std::atomic<T>
PFS_atomic class contains wrappers around my_atomic_* operations, which
are macros to GNU atomic operations (__atomic_*). Due to different
implementations of compilers, clang may encounter errors when compiling
on x86_32 architecture.

The following functions are replaced with C++ std::atomic type in
performance schema code base:
  - PFS_atomic::store_*()
      -> my_atomic_store*
        -> __atomic_store_n()
    => std::atomic<T>::store()

  - PFS_atomic::load_*()
      -> my_atomic_load*
        -> __atomic_load_n()
    => std::atomic<T>::load()

  - PFS_atomic::add_*()
      -> my_atomic_add*
        -> __atomic_fetch_add()
    => std::atomic<T>::fetch_add()

  - PFS_atomic::cas_*()
    -> my_atomic_cas*
      -> __atomic_compare_exchange_n()
    => std::atomic<T>::compare_exchange_strong()

and PFS_atomic class could be dropped completely.

Note that in the wrapper memory order passed to original GNU atomic
extensions are hard-coded as `__ATOMIC_SEQ_CST`, which is equivalent to
`std::memory_order_seq_cst` in C++, and is the default parameter for
std::atomic_* functions.

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services.
2024-06-27 09:27:12 +03:00
Marko Mäkelä
2f6df93748 MDEV-34458 wait_for_read in buf_page_get_low hurts performance
BTR_MODIFY_PREV: Remove. This mode was only used by the change buffer,
which commit f27e9c8947 (MDEV-29694)
removed.

buf_page_get_gen(): Revert the change that was made in
commit 90b95c6149 (MDEV-33543)
because it is not applicable after MDEV-29694. This fixes the
performance regression that Vladislav Vaintroub reported.

This is a 11.x specific fix; this needs to be fixed differently
in older major versions where the change buffer is present.
2024-06-26 13:51:38 +03:00
Marko Mäkelä
cc1363071a MDEV-34455 innodb_read_only=ON fails to imply innodb_doublewrite=OFF
innodb_doublewrite_update(): Disallow any change if srv_read_only_mode
holds, that is, the server was started with innodb_read_only=ON or
innodb_force_recovery=6.

This fixes up commit 1122ac978e (MDEV-33545).
2024-06-26 08:23:54 +03:00
Yuchen Pei
c9212b7a43
MDEV-33746 [fixup] Add suggested overrides in mroonga 2024-06-26 11:08:56 +08:00
Yuchen Pei
d7042ec4da
Merge branch '10.5' into 10.6 2024-06-26 09:16:54 +08:00
Yuchen Pei
ad0ee8cdb9
MDEV-33746 [fixup] Add suggested overrides in oqgraph 2024-06-25 14:22:25 +08:00
Yuchen Pei
01289dac41
MDEV-34361 Split my.cnf in the spider suite.
Just like the spider/bugfix suite.

One caveat is that my_2_3.cnf needs something under mysqld.2.3 group,
otherwise mtr will fail with something like:

There is no group named 'mysqld.2.3' that can be used to resolve
'port' for ...

This will allow new tests under the spider suite to use what is
needed. It also somehow fixes issues of running a test followed by
spider.slave_trx_isolation.
2024-06-25 13:45:04 +08:00
Yuchen Pei
aebd2397cc
MDEV-34404 Use safe_str in spider udfs to avoid passing NULL str 2024-06-25 13:45:04 +08:00
Marko Mäkelä
0076eb3d4e Merge 10.5 into 10.6 2024-06-24 13:09:47 +03:00
Marko Mäkelä
acc077ffa1 MDEV-34443 ha_innobase::info_low() does not distinguish HA_STATUS_VARIABLE_EXTRA
ha_innobase::info_low(): For HA_STATUS_VARIABLE without
HA_STATUS_VARIABLE_EXTRA, let us avoid unnecessary and costly updates
of the data_free statistics, which are only needed for SHOW TABLE STATUS.
This optimization had been enabled in
commit 247ecb7597 but not utilized until now.
2024-06-24 10:39:13 +03:00
Vladislav Vaintroub
7c5fdc9b6a MDEV-33748 get rid of pthread_(get_/set_)specific, use thread_local
Apart from better performance when accessing thread local variables,
we'll get rid of things that depend on initialization/cleanup of
pthread_key_t variables.

Where appropriate, use compiler-dependent pre-C++11 thread-local
equivalents, where it makes sense, to avoid initialization check overhead
that non-static thread_local can suffer from.
2024-06-21 13:46:41 +02:00
Dave Gosselin
db0c28eff8 MDEV-33746 Supply missing override markings
Find and fix missing virtual override markings.  Updates cmake
maintainer flags to include -Wsuggest-override and
-Winconsistent-missing-override.
2024-06-20 11:32:13 -04:00
Vlad Lesin
0a199cb810 MDEV-34108 Inappropriate semi-consistent read in RC if innodb_snapshot_isolation=ON
The fixes in b8a6719889 have not disabled
semi-consistent read for innodb_snapshot_isolation=ON mode, they just allowed
read uncommitted version of a record, that's why the test for MDEV-26643 worked
well.

The semi-consistent read should be disabled on upper level in
row_search_mvcc() for READ COMMITTED isolation level.

Reviewed by Marko Mäkelä.
2024-06-20 16:11:54 +03:00
Thirunarayanan Balathandayuthapani
ab448d4b34 MDEV-34389 Avoid log overwrite in early recovery
- InnoDB tries to write FILE_CHECKPOINT marker during
early recovery when log file size is insufficient.
While updating the log checkpoint at the end of the recovery,
InnoDB must already have written out all pending changes
to the persistent files. To complete the checkpoint, InnoDB
has to write some log records for the checkpoint and to
update the checkpoint header. If the server gets killed
before updating the checkpoint header then it would lead
the logfile to be unrecoverable.

- This patch avoids FILE_CHECKPOINT marker during early
recovery and narrows down the window of opportunity to
make the log file unrecoverable.
2024-06-20 17:54:57 +05:30
Iaroslav Babanin
5d49a2add7 MDEV-33935 fix deadlock counter
- The deadlock counter was moved from
Deadlock::find_cycle into Deadlock::report, because
the find_cycle method is called multiple times during deadlock
detection flow, which means it shouldn't have such side effects.
But report() can, which called only once for
a victim transaction.
- Also the deadlock_detect.test and *.result test case
has been extended to handle the fix.
2024-06-19 20:43:33 +03:00
Jan Lindström
ee974ca5e0 MDEV-31658 : Deadlock found when trying to get lock during applying
Problem was that there was two non-conflicting local idle
transactions in node_1 that both inserted a key to primary key.
Then two transactions from other nodes inserted also
a key to primary key so that insert from node_2 conflicted
one of the local transactions in node_1 so that there would
be duplicate key if both are committed. For this insert
from other node tries to acquire S-lock for this record
and because this insert is high priority brute force (BF)
transaction it will kill idle local transaction.

Concurrently, second insert from node_3 conflicts the second
idle insert transaction in node_1. Again, it tries to acquire
S-lock for this record and kills idle local transaction.

At this point we have two non-conflicting high priority
transactions holding S-lock on different records in node_1.
For example like this: rec s-lock-node2-rec s-lock-node3-rec rec.

Because these high priority BF-transactions do not wait
each other insert from node3 that has later seqno compared
to insert from node2 can continue. It will try to acquire
insert intention for record it tries to insert (to avoid
duplicate key to be inserted by local transaction). Hower,
it will note that there is conflicting S-lock in same gap
between records. This will lead deadlock error as we have
defined that BF-transactions may not wait for record lock
but we can't kill conflicting BF-transaction because
it has lower seqno and it should commit first.

BF-transactions are executed concurrently because their
values to primary key are different i.e. they do not
conflict.

Galera certification will make sure that inserts from
other nodes i.e these high priority BF-transactions
can't insert duplicate keys. Local transactions naturally
can but they will be killed when BF-transaction
acquires required record locks.

Therefore, we can allow situation where there is conflicting
S-lock and insert intention lock regardless of their seqno
order and let both continue with no wait. This will lead
to situation where we need to allow BF-transaction
to wait when lock_rec_has_to_wait_in_queue is called
because this function is also called from
lock_rec_queue_validate and because lock is waiting
there would be assertion in ut_a(lock->is_gap()
|| lock_rec_has_to_wait_in_queue(cell, lock));

lock_wait_wsrep_kill
  Add debug sync points for BF-transactions killing
  local transaction.

wsrep_assert_no_bf_bf_wait
  Print also requested lock information

lock_rec_has_to_wait
  Add function to handle wsrep transaction lock wait
  cases.

lock_rec_has_to_wait_wsrep
  New function to handle wsrep transaction lock wait
  exceptions.

lock_rec_has_to_wait_in_queue
  Remove wsrep exception, in this function all
  conflicting locks need to wait in queue.
  Conflicts between BF and local transactions
  are handled in lock_wait.

Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
2024-06-19 14:09:11 +02:00
Marko Mäkelä
f9e717cb48 MDEV-34426: Assertion failure on bootstrap
Let us relax an assertion that would fail in
./mtr main.1st --mysqld=--innodb-undo-tablespaces=127
2024-06-19 15:08:19 +03:00
Marko Mäkelä
34813c1aa0 Merge 10.6 into 10.11 2024-06-19 15:04:07 +03:00
Marko Mäkelä
5b26a07698 MDEV-34178: Enable spinloop for index_lock
In an I/O bound concurrent INSERT test conducted by Mark Callaghan,
spin loops on dict_index_t::lock turn out to be beneficial.

This is a mixed bag; enabling the spin loops will improve throughput
and latency on some workloads and degrade in others.

Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
Performance tested by: Axel Schwenke
2024-06-19 13:41:11 +03:00
Marko Mäkelä
f8d213bd0e MDEV-34178: Improve the spin loops
srw_mutex_impl<spinloop>::wait_and_lock(): Invoke srw_pause() and
reload the lock word on each loop. Thanks to Mark Callaghan for
suggesting this.

ssux_lock_impl<spinloop>::rd_wait(): Actually implement a spin loop
on the rw-lock component without blocking on the mutex component.
If there is a conflict with wr_lock(), wait for writer.lock to be
released without actually acquiring it.

Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
2024-06-19 13:40:57 +03:00
Marko Mäkelä
6cde03aedc MDEV-34178: Improve PERFORMANCE_SCHEMA instrumentation
When MariaDB is built with PERFORMANCE_SCHEMA support enabled
and with futex-based rw-locks (not srw_lock_), we were unnecessarily
releasing and reacquiring lock.writer in srw_lock_impl::psi_wr_lock()
and ssux_lock::psi_wr_lock().

If there is a conflict with rd_lock(), let us hold the lock.writer
and execute u_wr_upgrade() to wait for rd_unlock().

Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
2024-06-19 13:30:23 +03:00
Marko Mäkelä
2bd661ca10 MDEV-34178: Simplify the U lock
The U lock mode of the sux_lock that was introduced in
commit 03ca6495df (MDEV-24142)
is unnecessarily complex.

Internally, sux_lock comprises two parts, each with their own wait queue
inside the operating system kernel: a mutex and a rw-lock.

We can map the operations as follows:

x_lock(): (X,X)
u_lock(): (X,_)
s_lock(): (_,S)

The Update lock mode, which is mutually exclusive with itself and with
X (exclusive) locks but not with shared (S) locks, was unnecessarily
acquiring a shared lock on the second component. The mutual exclusion
is guaranteed by the first component.

We might simplify the #ifdef SUX_LOCK_GENERIC case further by omitting
srw_mutex_impl::lock, because it is kind-of duplicating the mutex
that we will use for having a wait queue. However, the predicate
buf_page_t::can_relocate() would depend on the predicate
is_locked_or_waiting(), which is not available for pthread_mutex_t.

Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
2024-06-18 18:22:28 +03:00
Alexander Barkov
c4bf4ce948 Merge remote-tracking branch 'origin/11.2' into 11.4 2024-06-17 15:46:39 +04:00
Marko Mäkelä
a21e49cbcc Merge 11.1 into 11.2 2024-06-17 12:02:03 +03:00
Marko Mäkelä
d34289a3e2 Merge 10.11 into 11.1 2024-06-17 09:21:50 +03:00
Marko Mäkelä
346a0c1402 Merge 10.6 into 10.11 2024-06-17 09:08:07 +03:00
Marko Mäkelä
e60acae655 Merge 10.5 into 10.6 2024-06-17 08:40:07 +03:00
Marko Mäkelä
4b4c371fe7 MDEV-34297 fixup: -Wconversion on 32-bit 2024-06-14 13:21:19 +03:00
Thirunarayanan Balathandayuthapani
3271588bb7 MDEV-34381 During innodb_undo_truncate=ON recovery, InnoDB may fail to shrink undo* files
- During recovery, InnoDB may fail to shrink the undo tablespaces
when there are no pages to recover while applying the redo log.
This issue exists only when innodb_undo_truncate is enabled.
trx_lists_init_at_db_start() could've applied the redo logs
for undo tablespace page0.
2024-06-14 12:46:02 +05:30
Marko Mäkelä
5b89cab44f Merge 10.6 into 10.11 2024-06-13 08:16:49 +03:00
Thirunarayanan Balathandayuthapani
b40f9d3d5c MDEV-34374 Shrinking tablespace logic fails to handle error condition
- InnoDB ignores the error while traversing the used
extents during shrinking process. Made changes in
fsp_traverse_extents() to handle error condition
correctly
2024-06-12 17:36:12 +05:30
Brandon Nesterenko
d3a7e46bb4 MDEV-34365: UBSAN runtime error: call to function io_callback(tpool::aiocb*)
On an UBSAN clang-15 build, if running with UBSAN option
halt_on_error=1 (the issue doesn't show up without it),
MTR fails during mysqld --bootstrap with UBSAN error:

call to function io_callback(tpool::aiocb*) through pointer to incorrect function type 'void (*)(void *)'

This patch corrects the parameter type of io_callback
to match its expected type defined by callback_func,
i.e. (void*).

Reviewed By:
============
<TODO>
2024-06-12 08:39:41 +03:00
Marko Mäkelä
fc9005adc4 Merge 10.5 into 10.6 2024-06-12 07:51:28 +03:00
Thirunarayanan Balathandayuthapani
5b39ded713 MDEV-34156 InnoDB fails to apply the redo log for compressed tablespace
Problem:
=======
During recovery, InnoDB fails to apply the redo log for
compressed tablespace. The reason is that InnoDB assumes
that pages has been freed while applying the redo log for it.
During multiple scan of redo logs, InnoDB stores the freed
page information when it have sufficient buffer pool pages.
Once it ran out of memory, InnoDB doesn't store freed page
information. But InnoDB assigns the freed page ranges to tablespace
in recv_init_crash_recovery_spaces() even though
InnoDB doesn't have complete freed range information.
While applying the redo log, InnoDB wrongly assumes that page
has been freed and it could lead to corruption of tablespace.
This issue is caused by commit 941af1fa58 (MDEV-31803)
and commit 2f9e264781 (MDEV-29911).

Solution:
========
During recovery, set recovery size and freed page information
for all tablespace irrespective of memory.
2024-06-11 16:14:36 +05:30
Marko Mäkelä
b81d717387 Merge 10.6 into 10.11 2024-06-11 12:50:10 +03:00
Yuchen Pei
d524cb5b3d
MDEV-34002 Initialise fields in spider_db_handler
Otherwise it may result in nonsensical values like 190 for a boolean.
2024-06-11 09:18:42 +10:00
Marko Mäkelä
27834ebc91 Merge 10.5 into 10.6 2024-06-10 15:22:15 +03:00
Marko Mäkelä
a2bd936c52 MDEV-33161 Function pointer signature mismatch in LF_HASH
In cmake -DWITH_UBSAN=ON builds with clang but not with GCC,
-fsanitize=undefined will flag several runtime errors on
function pointer mismatch related to the lock-free hash table LF_HASH.

Let us use matching function signatures and remove function pointer
casts in order to avoid potential bugs due to undefined behaviour.

These errors could be caught at compilation time by
-Wcast-function-type-strict, which is available starting with clang-16,
but not available in any version of GCC as of now. The old GCC flag
-Wcast-function-type is enabled as part of -Wextra, but it specifically
does not catch these errors.

Reviewed by: Vladislav Vaintroub
2024-06-10 12:35:33 +03:00