Commit graph

665 commits

Author SHA1 Message Date
Marko Mäkelä
f074223ae7 MDEV-32068 Some calls to buf_read_ahead_linear() seem to be useless
The linear read-ahead (enabled by nonzero innodb_read_ahead_threshold)
works best if index leaf pages or undo log pages have been allocated
on adjacent page numbers. The read-ahead is assumed not to be helpful
in other types of page accesses, such as non-leaf index pages.

buf_page_get_low(): Do not invoke buf_page_t::set_accessed(),
buf_page_make_young_if_needed(), or buf_read_ahead_linear().
We will invoke them in those callers of buf_page_get_gen() or
buf_page_get() where it makes sense: the access is not
one-time-on-startup and the page and not going to be freed soon.

btr_copy_blob_prefix(), btr_pcur_move_to_next_page(),
trx_undo_get_prev_rec_from_prev_page(),
trx_undo_get_first_rec(), btr_cur_t::search_leaf(),
btr_cur_t::open_leaf(): Invoke buf_read_ahead_linear().

We will not invoke linear read-ahead in functions that would
essentially allocate or free pages, because pages that are
freshly allocated are expected to be initialized by buf_page_create()
and not read from the data file. Likewise, freeing pages should
not involve accessing any sibling pages, except for freeing
singly-linked lists of BLOB pages.

We will not invoke read-ahead in btr_cur_t::pessimistic_search_leaf()
or in a pessimistic operation of btr_cur_t::open_leaf(), because
it is assumed that pessimistic operations should be preceded by
optimistic operations, which should already have invoked read-ahead.

buf_page_make_young_if_needed(): Invoke also buf_page_t::set_accessed()
and return the result.

btr_cur_nonleaf_make_young(): Like buf_page_make_young_if_needed(),
but do not invoke buf_page_t::set_accessed().

Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
2023-12-05 12:31:29 +02:00
Marko Mäkelä
bb511def1d MDEV-32371 Deadlock between buf_page_get_zip() and buf_pool_t::corrupted_evict()
buf_page_get_zip(): Do not wait for the page latch while holding hash_lock.
If the latch is not available, ensure that any concurrent
buf_pool_t::corrupted_evict() will be able to acquire the hash_lock,
and then retry the lookup. If the page was corrupted, we will finally
"goto must_read_page", retry the read once more, and then report an error.

Reviewed by: Thirunarayanan Balathandayuthapani
2023-11-30 10:35:53 +02:00
Marko Mäkelä
d963584d4c Merge 10.5 into 10.6 2023-11-22 16:56:47 +02:00
Marko Mäkelä
78c9a12c8f MDEV-32861 InnoDB hangs when running out of I/O slots
When the constant OS_AIO_N_PENDING_IOS_PER_THREAD is changed from 256 to 1
and the server is run with the minimum parameters
innodb_read_io_threads=1 and innodb_write_io_threads=2, two hangs
were observed.

tpool::cache<T>::put(T*): Ensure that get() in io_slots::acquire()
will be woken up when the cache previously was empty.

buf_pool_t::io_buf_t::reserve(): Schedule a possibly partial doublewrite
batch so that os_aio_wait_until_no_pending_writes() has a chance of
returning. Add a Boolean parameter and pass wait_for_reads=false inside
buf_page_decrypt_after_read(), because those calls will be executed
inside a read completion callback, and therefore
os_aio_wait_until_no_pending_reads() would block indefinitely.
2023-11-22 16:54:41 +02:00
Marko Mäkelä
a3d0d5fc33 MDEV-26055: Improve adaptive flushing
This is a 10.5 backport from 10.6
commit 9593cccf28.

Adaptive flushing is enabled by setting innodb_max_dirty_pages_pct_lwm>0
(not default) and innodb_adaptive_flushing=ON (default).
There is also the parameter innodb_adaptive_flushing_lwm
(default: 10 per cent of the log capacity). It should enable some
adaptive flushing even when innodb_max_dirty_pages_pct_lwm=0.
That is not being changed here.

This idea was first presented by Inaam Rana several years ago,
and I discussed it with Jean-François Gagné at FOSDEM 2023.

buf_flush_page_cleaner(): When we are not near the log capacity limit
(neither buf_flush_async_lsn nor buf_flush_sync_lsn are set),
also try to move clean blocks from the buf_pool.LRU list to buf_pool.free
or initiate writes (but not the eviction) of dirty blocks, until
the remaining I/O capacity has been consumed.

buf_flush_LRU_list_batch(): Add the parameter bool evict, to specify
whether dirty least recently used pages (from buf_pool.LRU) should
be evicted immediately after they have been written out. Callers outside
buf_flush_page_cleaner() will pass evict=true, to retain the existing
behaviour.

buf_do_LRU_batch(): Add the parameter bool evict.
Return counts of evicted and flushed pages.

buf_flush_LRU(): Add the parameter bool evict.
Assume that the caller holds buf_pool.mutex and
will invoke buf_dblwr.flush_buffered_writes() afterwards.

buf_flush_list_holding_mutex(): A low-level variant of buf_flush_list()
whose caller must hold buf_pool.mutex and invoke
buf_dblwr.flush_buffered_writes() afterwards.

buf_flush_wait_batch_end_acquiring_mutex(): Remove. It is enough to have
buf_flush_wait_batch_end().

page_cleaner_flush_pages_recommendation(): Avoid some floating-point
arithmetics.

buf_flush_page(), buf_flush_check_neighbor(), buf_flush_check_neighbors(),
buf_flush_try_neighbors(): Rename the parameter "bool lru" to "bool evict".

buf_free_from_unzip_LRU_list_batch(): Remove the parameter.
Only actual page writes will contribute towards the limit.

buf_LRU_free_page(): Evict freed pages of temporary tables.

buf_pool.done_free: Broadcast whenever a block is freed
(and buf_pool.try_LRU_scan is set).

buf_pool_t::io_buf_t::reserve(): Retry indefinitely.
During the test encryption.innochecksum we easily run out of
these buffers for PAGE_COMPRESSED or ENCRYPTED pages.

Tested by Matthias Leich and Axel Schwenke
2023-11-16 17:45:18 +02:00
Marko Mäkelä
b5e43a1d35 MDEV-32552 Write-ahead logging is broken for freed pages
buf_page_free(): Flag the freed page as modified if it is found in
the buffer pool.

buf_flush_page(): If the page has been freed, ensure that the log
for it has been durably written, before removing the page
from buf_pool.flush_list.

FindBlockX: Find also MTR_MEMO_PAGE_X_MODIFY in order to avoid an
occasional failure of innodb.innodb_defrag_concurrent, which involves
freeing and reallocating pages in the same mini-transaction.

This fixes a regression that was introduced in
commit a35b4ae898 (MDEV-15528).

This logic was tested by commenting out the $shutdown_timeout line
from a test and running the following:

./mtr --rr innodb.scrub
rr replay var/log/mysqld.1.rr/mariadbd-0

A breakpoint in the modified buf_flush_page() was hit, and the
FIL_PAGE_LSN of that page had been last modified during the
mtr_t::commit() of a mini-transaction where buf_page_free()
had been executed on that page.
2023-10-23 16:13:16 +03:00
Marko Mäkelä
a60462d93e Remove bogus references to replaced Google contributions
In commit 03ca6495df and
commit ff5d306e29
we forgot to remove some Google copyright notices related to
a contribution of using atomic memory access in the old InnoDB
mutex_t and rw_lock_t implementation.

The copyright notices had been mostly added in
commit c6232c06fa
due to commit a1bb700fd2.

The following Google contributions remain:
* some logic related to the parameter innodb_io_capacity
* innodb_encrypt_tables, added in MariaDB Server 10.1
2023-08-21 15:51:16 +03:00
Marko Mäkelä
6cc88c3db1 Clean up buf0buf.inl
Let us move some #include directives from buf0buf.inl to
the compilation units where they are really used.
2023-08-21 15:51:10 +03:00
Monty
99bd226059 MDEV-31558 Add InnoDB engine information to the slow query log
The new statistics is enabled by adding the "engine", "innodb" or "full"
option to --log-slow-verbosity

Example output:

 # Pages_accessed: 184  Pages_read: 95  Pages_updated: 0  Old_rows_read: 1
 # Pages_read_time: 17.0204  Engine_time: 248.1297

Page_read_time is time doing physical reads inside a storage engine.
(Writes cannot be tracked as these are usually done in the background).
Engine_time is the time spent inside the storage engine for the full
duration of the read/write/update calls. It uses the same code as
'analyze statement' for calculating the time spent.

The engine statistics is done with a generic interface that should be
easy for any engine to use. It can also easily be extended to provide
even more statistics.

Currently only InnoDB has counters for Pages_% and Undo_% status.
Engine_time works for all engines.

Implementation details:

class ha_handler_stats holds all engine stats.  This class is included
in handler and THD classes.
While a query is running, all statistics is updated in the handler. In
close_thread_tables() the statistics is added to the THD.

handler::handler_stats is a pointer to where statistics should be
collected. This is set to point to handler::active_handler_stats if
stats are requested. If not, it is set to 0.
handler_stats has also an element, 'active' that is 1 if stats are
requested. This is to allow engines to avoid doing any 'if's while
updating the statistics.

Cloned or partition tables have the pointer set to the base table if
status are requested.

There is a small performance impact when using --log-slow-verbosity=engine:
- All engine calls in 'select' will be timed.
- IO calls for InnoDB reads will be timed.
- Incrementation of counters are done on local variables and accesses
  are inline, so these should have very little impact.
- Statistics has to be reset for each statement for the THD and each
  used handler. This is only 40 bytes, which should be neglectable.
- For partition tables we have to loop over all partitions to update
  the handler_status as part of table_init(). Can be optimized in the
  future to only do this is log-slow-verbosity changes. For this to work
  we have to update handler_status for all opened partitions and
  also for all partitions opened in the future.

Other things:
- Added options 'engine' and 'full' to log-slow-verbosity.
- Some of the new files in the test suite comes from Percona server, which
  has similar status information.
- buf_page_optimistic_get(): Do not increment any counter, since we are
  only validating a pointer, not performing any buf_pool.page_hash lookup.
- Added THD argument to save_explain_data_intern().
- Switched arguments for save_explain_.*_data() to have
  always THD first (generates better code as other functions also have THD
  first).
2023-07-07 12:53:18 +03:00
Marko Mäkelä
f569e06e03 MDEV-31385 Change buffer stale entries leads to corruption while reusing page
buf_page_free(): If buffered changes existed for the page, drop them.

Co-developed with Thirunarayanan Balathandayuthapani
2023-06-02 11:06:09 +03:00
Marko Mäkelä
0976afec88 MDEV-31114 Assertion !...is_waiting() failed in os_aio_wait_until_no_pending_writes()
os_aio_wait_until_no_pending_reads(), os_aio_wait_until_pending_writes():
Add a Boolean parameter to indicate whether the wait should be declared
in the thread pool.

buf_flush_wait(): The callers have already declared a wait, so let us
avoid doing that again, just call os_aio_wait_until_pending_writes(false).

buf_flush_wait_flushed(): Do not declare a wait in the rare case that
the buf_flush_page_cleaner thread has been shut down already.

buf_flush_page_cleaner(), buf_flush_buffer_pool(): In the code that runs
during shutdown, do not declare waits.

buf_flush_buffer_pool(): Remove a debug assertion that might fail.
What really matters here is buf_pool.flush_list.count==0.

buf_read_recv_pages(), srv_prepare_to_delete_redo_log_file():
Do not declare waits during InnoDB startup.
2023-04-24 09:57:58 +03:00
Marko Mäkelä
a091d6ac4e MDEV-26827 fixup: Do not duplicate io_slots::pending_io_count()
os_aio_pending_reads_approx(), os_aio_pending_reads(): Replaces
buf_pool.n_pend_reads.

os_aio_pending_writes(): Replaces buf_dblwr.pending_writes().

buf_dblwr_t::write_cond, buf_dblwr_t::writes_pending: Remove.
2023-04-12 13:49:57 +03:00
Marko Mäkelä
a55b951e60 MDEV-26827 Make page flushing even faster
For more convenient monitoring of something that could greatly affect
the volume of page writes, we add the status variable
Innodb_buffer_pool_pages_split that was previously only available
via information_schema.innodb_metrics as "innodb_page_splits".
This was suggested by Axel Schwenke.

buf_flush_page_count: Replaced with buf_pool.stat.n_pages_written.
We protect buf_pool.stat (except n_page_gets) with buf_pool.mutex
and remove unnecessary export_vars indirection.

buf_pool.flush_list_bytes: Moved from buf_pool.stat.flush_list_bytes.
Protected by buf_pool.flush_list_mutex.

buf_pool_t::page_cleaner_status: Replaces buf_pool_t::n_flush_LRU_,
buf_pool_t::n_flush_list_, and buf_pool_t::page_cleaner_is_idle.
Protected by buf_pool.flush_list_mutex. We will exclusively broadcast
buf_pool.done_flush_list by the buf_flush_page_cleaner thread,
and only wait for it when communicating with buf_flush_page_cleaner.
There is no need to keep a count of pending writes by the
buf_pool.flush_list processing. A single flag suffices for that.

Waits for page write completion can be performed by
simply waiting on block->page.lock, or by invoking
buf_dblwr.wait_for_page_writes().

buf_LRU_block_free_non_file_page(): Broadcast buf_pool.done_free and
set buf_pool.try_LRU_scan when freeing a page. This would be
executed also as part of buf_page_write_complete().

buf_page_write_complete(): Do not broadcast buf_pool.done_flush_list,
and do not acquire buf_pool.mutex unless buf_pool.LRU eviction is needed.
Let buf_dblwr count all writes to persistent pages and broadcast a
condition variable when no outstanding writes remain.

buf_flush_page_cleaner(): Prioritize LRU flushing and eviction right after
"furious flushing" (lsn_limit). Simplify the conditions and reduce the
hold time of buf_pool.flush_list_mutex. Refuse to shut down
or sleep if buf_pool.ran_out(), that is, LRU eviction is needed.

buf_pool_t::page_cleaner_wakeup(): Add the optional parameter for_LRU.

buf_LRU_get_free_block(): Protect buf_lru_free_blocks_error_printed
with buf_pool.mutex. Invoke buf_pool.page_cleaner_wakeup(true) to
to ensure that buf_flush_page_cleaner() will process the LRU flush
request.

buf_do_LRU_batch(), buf_flush_list(), buf_flush_list_space():
Update buf_pool.stat.n_pages_written when submitting writes
(while holding buf_pool.mutex), not when completing them.

buf_page_t::flush(), buf_flush_discard_page(): Require that
the page U-latch be acquired upfront, and remove
buf_page_t::ready_for_flush().

buf_pool_t::delete_from_flush_list(): Remove the parameter "bool clear".

buf_flush_page(): Count pending page writes via buf_dblwr.

buf_flush_try_neighbors(): Take the block of page_id as a parameter.
If the tablespace is dropped before our page has been written out,
release the page U-latch.

buf_pool_invalidate(): Let the caller ensure that there are no
outstanding writes.

buf_flush_wait_batch_end(false),
buf_flush_wait_batch_end_acquiring_mutex(false):
Replaced with buf_dblwr.wait_for_page_writes().

buf_flush_wait_LRU_batch_end(): Replaces buf_flush_wait_batch_end(true).

buf_flush_list(): Remove some broadcast of buf_pool.done_flush_list.

buf_flush_buffer_pool(): Invoke also buf_dblwr.wait_for_page_writes().

buf_pool_t::io_pending(), buf_pool_t::n_flush_list(): Remove.
Outstanding writes are reflected by buf_dblwr.pending_writes().

buf_dblwr_t::init(): New function, to initialize the mutex and
the condition variables, but not the backing store.

buf_dblwr_t::is_created(): Replaces buf_dblwr_t::is_initialised().

buf_dblwr_t::pending_writes(), buf_dblwr_t::writes_pending:
Keeps track of writes of persistent data pages.

buf_flush_LRU(): Allow calls while LRU flushing may be in progress
in another thread.

Tested by Matthias Leich (correctness) and Axel Schwenke (performance)
2023-03-16 17:19:58 +02:00
Marko Mäkelä
9593cccf28 MDEV-26055: Improve adaptive flushing
Adaptive flushing is enabled by setting innodb_max_dirty_pages_pct_lwm>0
(not default) and innodb_adaptive_flushing=ON (default).
There is also the parameter innodb_adaptive_flushing_lwm
(default: 10 per cent of the log capacity). It should enable some
adaptive flushing even when innodb_max_dirty_pages_pct_lwm=0.
That is not being changed here.

This idea was first presented by Inaam Rana several years ago,
and I discussed it with Jean-François Gagné at FOSDEM 2023.

buf_flush_page_cleaner(): When we are not near the log capacity limit
(neither buf_flush_async_lsn nor buf_flush_sync_lsn are set),
also try to move clean blocks from the buf_pool.LRU list to buf_pool.free
or initiate writes (but not the eviction) of dirty blocks, until
the remaining I/O capacity has been consumed.

buf_flush_LRU_list_batch(): Add the parameter bool evict, to specify
whether dirty least recently used pages (from buf_pool.LRU) should
be evicted immediately after they have been written out. Callers outside
buf_flush_page_cleaner() will pass evict=true, to retain the existing
behaviour.

buf_do_LRU_batch(): Add the parameter bool evict.
Return counts of evicted and flushed pages.

buf_flush_LRU(): Add the parameter bool evict.
Assume that the caller holds buf_pool.mutex and
will invoke buf_dblwr.flush_buffered_writes() afterwards.

buf_flush_list_holding_mutex(): A low-level variant of buf_flush_list()
whose caller must hold buf_pool.mutex and invoke
buf_dblwr.flush_buffered_writes() afterwards.

buf_flush_wait_batch_end_acquiring_mutex(): Remove. It is enough to have
buf_flush_wait_batch_end().

page_cleaner_flush_pages_recommendation(): Avoid some floating-point
arithmetics.

buf_flush_page(), buf_flush_check_neighbor(), buf_flush_check_neighbors(),
buf_flush_try_neighbors(): Rename the parameter "bool lru" to "bool evict".

buf_free_from_unzip_LRU_list_batch(): Remove the parameter.
Only actual page writes will contribute towards the limit.

buf_LRU_free_page(): Evict freed pages of temporary tables.

buf_pool.done_free: Broadcast whenever a block is freed
(and buf_pool.try_LRU_scan is set).

buf_pool_t::io_buf_t::reserve(): Retry indefinitely.
During the test encryption.innochecksum we easily run out of
these buffers for PAGE_COMPRESSED or ENCRYPTED pages.

Tested by Matthias Leich and Axel Schwenke
2023-03-16 17:09:08 +02:00
Marko Mäkelä
54c0ac72e3 MDEV-30134 Assertion failed in buf_page_t::unfix() in buf_pool_t::watch_unset()
buf_pool_t::watch_set(): Always buffer-fix a block if one was found,
no matter if it is a watch sentinel or a buffer page. The type of
the block descriptor will be rechecked in buf_page_t::watch_unset().
Do not expect the caller to acquire the page hash latch. Starting with
commit bd5a6403ca it is safe to release
buf_pool.mutex before acquiring a buf_pool.page_hash latch.

buf_page_get_low(): Adjust to the changed buf_pool_t::watch_set().

This simplifies the logic and fixes a bug that was reproduced when
using debug builds and the setting innodb_change_buffering_debug=1.
2023-02-16 08:29:44 +02:00
Marko Mäkelä
9c15799462 MDEV-30397: MariaDB crash due to DB_FAIL reported for a corrupted page
buf_read_page_low(): Map the buf_page_t::read_complete() return
value DB_FAIL to DB_PAGE_CORRUPTED. The purpose of the DB_FAIL
return value is to avoid error log noise when read-ahead brings
in an unused page that is typically filled with NUL bytes.

If a synchronous read is bringing in a corrupted page where the
page frame does not contain the expected tablespace identifier and
page number, that must be treated as an attempt to read a corrupted
page. The correct error code for this is DB_PAGE_CORRUPTED.
The error code DB_FAIL is not handled by row_mysql_handle_errors().

This was missed in commit 0b47c126e3
(MDEV-13542).
2023-02-16 08:28:14 +02:00
Marko Mäkelä
de4030e4d4 MDEV-30400 Assertion height == btr_page_get_level(...) on INSERT
This also fixes part of MDEV-29835 Partial server freeze
which is caused by violations of the latching order that was
defined in https://dev.mysql.com/worklog/task/?id=6326
(WL#6326: InnoDB: fix index->lock contention). Unless the
current thread is holding an exclusive dict_index_t::lock,
it must acquire page latches in a strict parent-to-child,
left-to-right order. Not all cases of MDEV-29835 are fixed yet.
Failure to follow the correct latching order will cause deadlocks
of threads due to lock order inversion.

As part of these changes, the BTR_MODIFY_TREE mode is modified
so that an Update latch (U a.k.a. SX) will be acquired on the
root page, and eXclusive latches (X) will be acquired on all pages
leading to the leaf page, as well as any left and right siblings
of the pages along the path. The DEBUG_SYNC test innodb.innodb_wl6326
will be removed, because at the time the DEBUG_SYNC point is hit,
the thread is actually holding several page latches that will be
blocking a concurrent SELECT statement.

We also remove double bookkeeping that was caused due to excessive
information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo
store information of latched pages, and ensure that
mtr_memo_slot_t::object is never a null pointer.
The tree_blocks[] and tree_savepoints[] were redundant.

buf_page_get_low(): If innodb_change_buffering_debug=1, to avoid
a hang, do not try to evict blocks if we are holding a latch on
a modified page. The test innodb.innodb-change-buffer-recovery
will be removed, because change buffering may no longer be forced
by debug injection when the change buffer comprises multiple pages.
Remove a debug assertion that could fail when
innodb_change_buffering_debug=1 fails to evict a page.
For other cases, the assertion is redundant, because we already
checked that right after the got_block: label. The test
innodb.innodb-change-buffering-recovery will be removed, because
due to this change, we will be unable to evict the desired page.

mtr_t::lock_register(): Register a change of a page latch
on an unmodified buffer-fixed block.

mtr_t::x_latch_at_savepoint(), mtr_t::sx_latch_at_savepoint():
Replaced by the use of mtr_t::upgrade_buffer_fix(), which now
also handles RW_S_LATCH.

mtr_t::set_modified(): For temporary tables, invoke
buf_page_t::set_modified() here and not in mtr_t::commit().
We will never set the MTR_MEMO_MODIFY flag on other than
persistent data pages, nor set mtr_t::m_modifications when
temporary data pages are modified.

mtr_t::commit(): Only invoke the buf_flush_note_modification() loop
if persistent data pages were modified.

mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo.
This avoids many redundant entries in mtr_t::m_memo, as well as
redundant calls to buf_page_get_gen() for blocks that had already
been looked up in a mini-transaction.

btr_get_latched_root(): Return a pointer to an already latched root page.
This replaces btr_root_block_get() in cases where the mini-transaction
has already latched the root page.

btr_page_get_parent(): Fetch a parent page that was already latched
in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched().
If needed, upgrade the root page U latch to X.
This avoids bloating mtr_t::m_memo as well as performing redundant
buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for
B-tree defragmentation, we will invoke btr_cur_search_to_nth_level().

btr_cur_search_to_nth_level(): This will only be used for non-leaf
(level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE
or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be
removed altogether, or retained for the case of
CHECK TABLE without QUICK.

btr_cur_t::left_block: Remove. btr_pcur_move_backward_from_page()
can retrieve the left sibling from the end of mtr_t::m_memo.

btr_cur_t::open_leaf(): Some clean-up.

btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level()
for searches to level=0 (the leaf level). We will never release
parent page latches before acquiring leaf page latches. If we need to
temporarily release the level=1 page latch in the BTR_SEARCH_PREV or
BTR_MODIFY_PREV latch_mode, we will reposition the cursor on the
child node pointer so that we will land on the correct leaf page.

btr_cur_t::pessimistic_search_leaf(): Implement new BTR_MODIFY_TREE
latching logic in the case that page splits or merges will be needed.
The parent pages (and their siblings) should already be latched on
the first dive to the leaf and be present in mtr_t::m_memo; there
should be no need for BTR_CONT_MODIFY_TREE. This pre-latching almost
suffices; it must be revised in MDEV-29835 and work-arounds removed
for cases where mtr_t::get_already_latched() fails to find a block.

rtr_search_to_nth_level(): A SPATIAL INDEX version of
btr_search_to_nth_level() that can search to any level
(including the leaf level).

rtr_search_leaf(), rtr_insert_leaf(): Wrappers for
rtr_search_to_nth_level().

rtr_search(): Replaces rtr_pcur_open().

rtr_latch_leaves(): Replaces btr_cur_latch_leaves(). Note that unlike
in the B-tree code, there is no error handling in case the sibling
pages are corrupted.

rtr_cur_restore_position(): Remove an unused constant parameter.

btr_pcur_open_on_user_rec(): Remove the constant parameter
mode=PAGE_CUR_GE.

row_ins_clust_index_entry_low(): Use a new
mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page
when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC.

BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove.

BTR_CONT_MODIFY_TREE: Note that this is only used by
rtr_search_to_nth_level().

btr_pcur_optimistic_latch_leaves(): Replaces
btr_cur_optimistic_latch_leaves().

ibuf_delete_rec(): Acquire exclusive ibuf.index->lock in order
to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV).

btr_blob_log_check_t(): Acquire a U latch on the root page,
so that btr_page_alloc() in btr_store_big_rec_extern_fields()
will avoid a deadlock.

btr_store_big_rec_extern_fields(): Assert that the root page latch
is being held.

Tested by: Matthias Leich
Reviewed by: Vladislav Lesin
2023-01-24 14:09:21 +02:00
Marko Mäkelä
a8a5c8a1b8 Merge 10.5 into 10.6 2022-12-13 16:58:58 +02:00
Marko Mäkelä
1dc2f35598 Merge 10.4 into 10.5 2022-12-13 14:39:18 +02:00
Marko Mäkelä
fdf43b5c78 Merge 10.3 into 10.4 2022-12-13 11:37:33 +02:00
Marko Mäkelä
6d40274f65 Merge 10.5 into 10.6 2022-11-23 18:13:28 +02:00
Thirunarayanan Balathandayuthapani
71c93fb8fd MDEV-28462 Race condition between instant alter and AHI access
- InnoDB AHI tries to access the concurrent instant alter column,
leads to asan failure. Instant alter column should acquire the
clustered index search latch in exclusive mode before changing
the table cache definition.

- Removed the default parameter for the function
btr_search_drop_page_hash_index()

- Addressed the DWITH_INNODB_AHI=0 compilation failure
by passing two parameters from all callers of
btr_search_drop_page_hash_index()
2022-11-22 15:24:44 +05:30
Marko Mäkelä
4e5e8166b4 MDEV-19514 fixup: Fix recovery with innodb_change_buffering_debug=1
During crash recovery, recv_sys.apply(true) invokes
mlog_init.mark_ibuf_exist(), which in turn may invoke
recv_sys.apply(true) via the buf_flush_sync() call in
buf_page_get_low(). The simplest fix is to disable the
innodb_change_buffering_debug=1 instrumentation
during crash recovery.
2022-11-21 17:55:35 +02:00
Marko Mäkelä
2ac1edb1c3 Merge 10.5 into 10.6 2022-11-08 17:37:22 +02:00
Marko Mäkelä
a732d5e2ba Merge 10.4 into 10.5 2022-11-08 17:01:28 +02:00
Marko Mäkelä
8fb176c3c1 MDEV-27121 fixup: mariabackup.mdev-14447,full_crc32 2022-11-08 16:59:36 +02:00
Marko Mäkelä
93b4f84ab2 Merge 10.3 into 10.4 2022-11-08 16:04:01 +02:00
Marko Mäkelä
eabb3b35d5 MDEV-27121 fixup: mariabackup.mdev-14447 fault injection 2022-11-08 08:53:49 +02:00
Marko Mäkelä
65d0c57c1a Merge 10.3 into 10.4 2022-10-05 20:30:57 +03:00
Vlad Lesin
c0eda62aec MDEV-27927 row_sel_try_search_shortcut_for_mysql() does not latch a page, violating read view isolation
btr_search_guess_on_hash() would only acquire an index page latch if it
is invoked with ahi_latch=NULL. If it's invoked from
row_sel_try_search_shortcut_for_mysql() with ahi_latch!=NULL, a page
will not be latched, and row_search_mvcc() will get a pointer to the
record, which can be changed by some other transaction before the record
was stored in result buffer with row_sel_store_mysql_rec() call.

ahi_latch argument of btr_cur_search_to_nth_level_func() and
btr_pcur_open_with_no_init_func() is used only for
row_sel_try_search_shortcut_for_mysql().
btr_cur_search_to_nth_level_func(..., ahi_latch !=0, ...) is invoked
only from btr_pcur_open_with_no_init_func(..., ahi_latch !=0, ...),
which, in turns, is invoked only from
row_sel_try_search_shortcut_for_mysql().

I suppose that separate case with ahi_latch!=0 was intentionally
implemented to protect row_sel_store_mysql_rec() call in
row_search_mvcc() just after row_sel_try_search_shortcut_for_mysql()
call. After the ahi_latch was moved from row_seach_mvcc() to
row_sel_try_search_shortcut_for_mysql(), there is no need in it at all
if btr_search_guess_on_hash() latches a page unconditionally. And if
btr_search_guess_on_hash() latched the page, any access to the record in
row_sel_try_search_shortcut_for_mysql() after btr_pcur_open_with_no_init()
call will be protected with the page latch.

The fix is to remove ahi_latch argument from
btr_pcur_open_with_no_init_func(), btr_cur_search_to_nth_level_func()
and btr_search_guess_on_hash().

There will not be test, as to test it we need to freeze some SELECT
execution in the point between row_sel_try_search_shortcut_for_mysql()
and row_sel_store_mysql_rec() calls in row_search_mvcc(), and to change
the record in some other transaction to let row_sel_store_mysql_rec() to
store changed record in result buffer. Buf we can't do this with the
fix, as the page will be latched in btr_search_guess_on_hash() call.
2022-10-05 17:35:21 +03:00
Marko Mäkelä
0c0a569028 Merge 10.3 into 10.4 2022-09-20 12:38:25 +03:00
Marko Mäkelä
c22dff21a5 InnoDB cleanup: Replace UNIV_LINUX, UNIV_SOLARIS, UNIV_AIX
Let us use the normal platform-specific preprocessor symbols
__linux__, __sun__, _AIX instead of some homebrew ones.

The preprocessor symbol UNIV_HPUX must have lost its meaning
by f6deb00a56 (note: the symbol
UNIV_HPUX10 is being checked for, but only UNIV_HPUX is defined).
2022-09-19 12:20:53 +03:00
Marko Mäkelä
9203249987 MDEV-27983: InnoDB hangs after loading a ROW_FORMAT=COMPRESSED page
If multiple threads invoke buf_page_get_low() on a ROW_FORMAT=COMPRESSED
page that does not reside in the buffer pool, then one of the threads
will end up acquiring an exclusive page latch (the "if" statement
right before the new wait_for_unzip: label) and other threads will
end up waiting for a shared latch while holding a buffer-fix.
The exclusive latch holder would then wait for the buffer-fixes to
be released while the buffer-fix holders are waiting for the shared latch.

buf_page_get_low(): Prevent the hang that was introduced
in commit 9436c778c3 (MDEV-27058),
by releasing the buffer-fix, sleeping some time, and retrying the
page lookup.
2022-08-31 17:52:23 +03:00
Marko Mäkelä
bdf62ece6c MDEV-29374 InnoDB recovery fails with "Data structure corruption"
recv_sys_t::free_corrupted_page(): Identify the corrupted page in
an error or warning message.

buf_page_free(): Just in case, register the page as modified.
This should already have been done in mtr_t::free() as part of
fseg_free_page_low().

mtr_t::memo_push(): Simplify a condition, so that when invoked
with MTR_MEMO_PAGE_X_MODIFY, we will do the right thing.

fseg_free_page_low(): Remove an accidentally added return statement
that prevented mtr_t::free() from being called. This fixes a regression
that was introduced in
commit 0b47c126e3 (MDEV-13542).
2022-08-31 17:52:16 +03:00
Marko Mäkelä
76bb671e42 Merge 10.5 into 10.6 2022-08-25 16:02:44 +03:00
Marko Mäkelä
9929301ecd Merge 10.4 into 10.5 2022-08-25 15:31:19 +03:00
Marko Mäkelä
851058a3e6 Merge 10.3 into 10.4 2022-08-25 15:17:20 +03:00
Marko Mäkelä
d1a80c42ee MDEV-29384 Hangs caused by innodb_adaptive_hash_index=ON
buf_defer_drop_ahi(): Remove. Ever since
commit c7f8cfc9e7 (MDEV-27700)
it is safe to invoke btr_search_drop_page_hash_index(block, true)
to remove an orphan adaptive hash index.

Any attempt to upgrade page latches is prone to deadlocks. Recently,
we observed a few hangs that involved nothing more than a small table
consisting of one clustered index page, one secondary index page and
some undo pages.
2022-08-25 15:14:38 +03:00
Marko Mäkelä
fbb2b1f55f Merge 10.5 into 10.6 2022-08-23 08:47:21 +03:00
Marko Mäkelä
3b656ac8c1 Merge 10.4 into 10.5 2022-08-22 19:49:56 +03:00
Marko Mäkelä
b68ae6dc1d Merge 10.3 into 10.4 2022-08-22 16:22:09 +03:00
Thirunarayanan Balathandayuthapani
c7f8cfc9e7 MDEV-27700 ASAN: Heap_use_after_free in btr_search_drop_page_hash_index()
Reason:
=======
Race condition between btr_search_drop_hash_index() and
btr_search_lazy_free(). One thread does resizing of buffer pool
and clears the ahi on all pages in the buffer pool, frees the
index and table while removing the last reference. At the same time,
other thread access index->heap in btr_search_drop_hash_index().

Solution:
=========
Acquire the respective ahi latch before checking index->freed()

btr_search_drop_page_hash_index(): Added new parameter to indicate
that drop ahi entries only if the index is marked as freed

btr_search_check_marked_free_index(): Acquire all ahi latches and
return true if the index was freed
2022-08-22 16:29:46 +05:30
Oleksandr Byelkin
d2f1c3ed6c Merge branch '10.5' into bb-10.6-release 2022-08-03 12:19:59 +02:00
Oleksandr Byelkin
af143474d8 Merge branch '10.4' into 10.5 2022-08-03 07:12:27 +02:00
Oleksandr Byelkin
48e35b8cf6 Merge branch '10.3' into 10.4 2022-08-02 14:15:39 +02:00
Daniel Black
182a6383cd MDEV-16605 Always include buf_madvise_do_dump in binaries
The "used" attribute seems to do this

ref: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
2022-08-02 12:29:11 +10:00
Marko Mäkelä
62a20f8047 Merge 10.5 into 10.6 2022-07-01 15:24:50 +03:00
Marko Mäkelä
f09687094c Merge 10.4 into 10.5 2022-07-01 14:42:02 +03:00
Marko Mäkelä
392ee571c1 Merge 10.3 into 10.4 2022-07-01 13:10:36 +03:00
Marko Mäkelä
7c35ad16e3 MDEV-28389 fixup: Fix pre-GCC 10 -Wconversion
Before version 10, GCC would think that a right shift of an
unsigned char returns int. Let us explicitly cast that back,
to silence a bogus -Wconversion warning.
2022-07-01 13:02:43 +03:00