Issue: When getting a page (buf_page_get_gen) with no latch option
(RW_NO_LATCH), the caller is not expected to follow the B-tree latching
order. However in buf_page_get_low we try to acquire shared page latch
unconditionally to wait for a page that is being loaded by another
thread concurrently. In general it could lead to latch order violation
and deadlock.
Currently it affects the change buffer insert path btr_latch_prev()
which tries to load the previous page out of order with RW_NO_LATCH and
two concurrent inserts into IBUF tree cause deadlock. This problem is
introduced in 10.6 by following commit.
commit 9436c778c3 (MDEV-27058)
Fix: While trying to latch a page with RW_NO_LATCH, always use the
"*lock_try" interface and retry operation on failure after unfixing the
page.
In so-called optimistic buffer pool lookups, we must not
dereference a block descriptor before we have made sure that
it is accessible. While buf_pool_t::resize() is running,
block descriptors could become invalid.
The buf::Block_hint class was essentially duplicating
a buf_pool.page_hash lookup that was executed in
buf_page_optimistic_get() anyway. For better locality of
reference, we had better execute that lookup only once.
buf_page_optimistic_fix(): Prepare for buf_page_optimistic_get().
This basically is a simpler version of Buf::Block_hint.
buf_page_optimistic_get(): Assume that buf_page_optimistic_fix()
has been called and the page identifier verified. Should the block
be evicted, the block->modify_clock will be invalidated; we do not
need to check the block->page.id() again. It suffices to check
the block->modify_clock after acquiring the page latch.
btr_pcur_t::old_page_id: Store the expected page identifier
for buf_page_optimistic_fix().
btr_pcur_t::block_when_stored: Remove. This was duplicating
page_cur_t::block.
btr_pcur_optimistic_latch_leaves(): Remove redundant parameters.
First, invoke buf_page_optimistic_fix() on the requested page.
If needed, acquire a latch on the left page. Finally, acquire
a latch on the target page and recheck the block->modify_clock.
If the page had been freed while we were not holding a page latch,
fall back to the slow path. Validate the FIL_PAGE_PREV after
acquiring a latch on the current page. The block->modify_clock
is only being incremented when records are deleted or pages
reorganized or evicted; it does not guard against concurrent
page splits.
Reviewed by: Debarun Banerjee
Some fixes related to commit f838b2d799 and
Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row()
for system-versioned tables were provided by Nikita Malyavin.
This was required by test versioning.rpl,trx_id,row.
By design, InnoDB has always hung when permanently running out of
buffer pool, for example when several threads are waiting to allocate
a block, and all of the buffer pool is buffer-fixed by the active threads.
The hang that we are fixing here occurs when the buffer pool is only
temporarily running out and the situation could be rescued by writing out
some dirty pages or evicting some clean pages.
buf_LRU_get_free_block(): Simplify the way how we wait for
the buf_flush_page_cleaner thread. This fixes occasional hangs
of the test encryption.innochecksum that were introduced by
commit a55b951e60 (MDEV-26827).
To play it safe, we use a timed wait when waiting for the
buf_flush_page_cleaner() thread to perform its job. Should that
thread get stuck, we will invoke buf_pool.LRU_warn() in order to
display a message that pages could not be freed, and keep trying
to wake up the buf_flush_page_cleaner() thread.
The INFORMATION_SCHEMA.INNODB_METRICS counters
buffer_LRU_single_flush_failure_count and
buffer_LRU_get_free_waits will be removed.
The latter is represented by buffer_pool_wait_free.
Also removed will be the message
"InnoDB: Difficult to find free blocks in the buffer pool"
because in d34479dc66 we
introduced a more precise message
"InnoDB: Could not free any blocks in the buffer pool"
in the buf_flush_page_cleaner thread.
buf_pool_t::LRU_warn(): Issue the warning message that we could
not free any blocks in the buffer pool. This may also be invoked
by buf_LRU_get_free_block() if buf_flush_page_cleaner() appears
to be stuck.
buf_pool_t::n_flush_dec(): Remove.
buf_pool_t::n_flush_dec_holding_mutex(): Rename to n_flush_dec().
buf_flush_LRU_list_batch(): Increment the eviction counter for blocks
of temporary, discarded or dropped tablespaces.
buf_flush_LRU(): Make static, and remove the constant parameter
evict=false. The only caller will be the buf_flush_page_cleaner()
thread.
IORequest::is_LRU(): Remove. The only case of evicting pages on
write completion will be when we are writing out pages of the
temporary tablespace. Those pages are not in buf_pool.flush_list,
only in buf_pool.LRU.
buf_page_t::flush(): Remove the parameter evict.
buf_page_t::write_complete(): Change the parameter "bool temporary"
to "bool persistent" and add a parameter for an already read state().
Reviewed by: Debarun Banerjee
Let us skip the recently added test main.mysql-interactive if
an instrumented ncurses library is not available.
In InnoDB, let us work around an uninstrumented libnuma, by
declaring that the objects returned by numa_get_mems_allowed()
are initialized.
Errors outputted as notes are weird, and when a user
cannot do anything about them, why output them at all.
Point taken, removed these, and left positive message
on initialization (from Marko).
Thanks Elena Stepanova.
The linear read-ahead (enabled by nonzero innodb_read_ahead_threshold)
works best if index leaf pages or undo log pages have been allocated
on adjacent page numbers. The read-ahead is assumed not to be helpful
in other types of page accesses, such as non-leaf index pages.
buf_page_get_low(): Do not invoke buf_page_t::set_accessed(),
buf_page_make_young_if_needed(), or buf_read_ahead_linear().
We will invoke them in those callers of buf_page_get_gen() or
buf_page_get() where it makes sense: the access is not
one-time-on-startup and the page and not going to be freed soon.
btr_copy_blob_prefix(), btr_pcur_move_to_next_page(),
trx_undo_get_prev_rec_from_prev_page(),
trx_undo_get_first_rec(), btr_cur_t::search_leaf(),
btr_cur_t::open_leaf(): Invoke buf_read_ahead_linear().
We will not invoke linear read-ahead in functions that would
essentially allocate or free pages, because pages that are
freshly allocated are expected to be initialized by buf_page_create()
and not read from the data file. Likewise, freeing pages should
not involve accessing any sibling pages, except for freeing
singly-linked lists of BLOB pages.
We will not invoke read-ahead in btr_cur_t::pessimistic_search_leaf()
or in a pessimistic operation of btr_cur_t::open_leaf(), because
it is assumed that pessimistic operations should be preceded by
optimistic operations, which should already have invoked read-ahead.
buf_page_make_young_if_needed(): Invoke also buf_page_t::set_accessed()
and return the result.
btr_cur_nonleaf_make_young(): Like buf_page_make_young_if_needed(),
but do not invoke buf_page_t::set_accessed().
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
buf_page_get_zip(): Do not wait for the page latch while holding hash_lock.
If the latch is not available, ensure that any concurrent
buf_pool_t::corrupted_evict() will be able to acquire the hash_lock,
and then retry the lookup. If the page was corrupted, we will finally
"goto must_read_page", retry the read once more, and then report an error.
Reviewed by: Thirunarayanan Balathandayuthapani
Eventfds have a simplier interface and are one file
descriptor rather than two.
Reuse the patten of the accepting socket connections
by testing for abort after a poll returns. This way
the same event descriptor can be used for Quit
and debugging trigger.
Also correct the registration of mem pressure file
descriptors.
When the constant OS_AIO_N_PENDING_IOS_PER_THREAD is changed from 256 to 1
and the server is run with the minimum parameters
innodb_read_io_threads=1 and innodb_write_io_threads=2, two hangs
were observed.
tpool::cache<T>::put(T*): Ensure that get() in io_slots::acquire()
will be woken up when the cache previously was empty.
buf_pool_t::io_buf_t::reserve(): Schedule a possibly partial doublewrite
batch so that os_aio_wait_until_no_pending_writes() has a chance of
returning. Add a Boolean parameter and pass wait_for_reads=false inside
buf_page_decrypt_after_read(), because those calls will be executed
inside a read completion callback, and therefore
os_aio_wait_until_no_pending_reads() would block indefinitely.
buf_page_t::set_os_unused(): Remove the system call that had been added in
commit 16c9718758 and revised in
commit c1fd082e9c for Microsoft Windows.
buf_pool_t::garbage_collect(): A new function to collect any garbage
from the InnoDB buffer pool that can be removed without writing any
log or data files. This will also invoke madvise() for all of buf_pool.free.
To trigger this the following MDEV is implemented:
MDEV-24670 avoid OOM by linux kernel co-operative memory management
To avoid frequent triggers that caused the MDEV-31953 regression, while
still preserving the 10.11 functionality of non-greedy kernel memory
usage, memory triggers are used.
On the triggering of memory pressure, if supported in the Linux kernel,
trigger the garbage collection of the innodb buffer pool.
The hard coded triggers occur where there is:
* some memory pressure in 5 of the last 10 seconds
* a full stall on memory pressure for 10ms in the last 2 seconds
The kernel will trigger only one in each of these time windows. To avoid
mariadb being in a constant state of memory garbage collection, this has
been limited to once per minute.
For a small set of kernels in 2023 (6.5, 6.6), there was a limit requiring
CAP_SYS_RESOURCE that was lifted[1] to support the use case of user
memory pressure. It not currently possible to set CAP_SYS_RESOURCES in
a systemd service as its setting a capability inside a usernamespace.
Running under systemd v254+ requires the default MemoryPressureWatch=auto
(or alternately "on").
Functionality was tested in a 6.4 kernel Fedora successfully under a
systemd service.
Running in a container requires that (unmask=)/sys/fs/cgroup be writable
by the mariadbd process.
To aid testing, the buf_pool_resize was a convient trigger point on
which to trigger garbage collection.
ref [1]: https://lore.kernel.org/all/CAMw=ZnQ56cm4Txgy5EhGYvR+Jt4s-KVgoA9_65HKWVMOXp7a9A@mail.gmail.com/T/#m3bd2a73c5ee49965cb73a830b1ccaa37ccf4e427
Co-Author: Daniel Black (on memory pressure trigger)
Reviewed by: Marko Mäkelä, Vladislav Vaintroub, Vladislav Lesin,
Thirunarayanan Balathandayuthapani
Tested by: Matthias Leich
This is a 10.5 backport from 10.6
commit 9593cccf28.
Adaptive flushing is enabled by setting innodb_max_dirty_pages_pct_lwm>0
(not default) and innodb_adaptive_flushing=ON (default).
There is also the parameter innodb_adaptive_flushing_lwm
(default: 10 per cent of the log capacity). It should enable some
adaptive flushing even when innodb_max_dirty_pages_pct_lwm=0.
That is not being changed here.
This idea was first presented by Inaam Rana several years ago,
and I discussed it with Jean-François Gagné at FOSDEM 2023.
buf_flush_page_cleaner(): When we are not near the log capacity limit
(neither buf_flush_async_lsn nor buf_flush_sync_lsn are set),
also try to move clean blocks from the buf_pool.LRU list to buf_pool.free
or initiate writes (but not the eviction) of dirty blocks, until
the remaining I/O capacity has been consumed.
buf_flush_LRU_list_batch(): Add the parameter bool evict, to specify
whether dirty least recently used pages (from buf_pool.LRU) should
be evicted immediately after they have been written out. Callers outside
buf_flush_page_cleaner() will pass evict=true, to retain the existing
behaviour.
buf_do_LRU_batch(): Add the parameter bool evict.
Return counts of evicted and flushed pages.
buf_flush_LRU(): Add the parameter bool evict.
Assume that the caller holds buf_pool.mutex and
will invoke buf_dblwr.flush_buffered_writes() afterwards.
buf_flush_list_holding_mutex(): A low-level variant of buf_flush_list()
whose caller must hold buf_pool.mutex and invoke
buf_dblwr.flush_buffered_writes() afterwards.
buf_flush_wait_batch_end_acquiring_mutex(): Remove. It is enough to have
buf_flush_wait_batch_end().
page_cleaner_flush_pages_recommendation(): Avoid some floating-point
arithmetics.
buf_flush_page(), buf_flush_check_neighbor(), buf_flush_check_neighbors(),
buf_flush_try_neighbors(): Rename the parameter "bool lru" to "bool evict".
buf_free_from_unzip_LRU_list_batch(): Remove the parameter.
Only actual page writes will contribute towards the limit.
buf_LRU_free_page(): Evict freed pages of temporary tables.
buf_pool.done_free: Broadcast whenever a block is freed
(and buf_pool.try_LRU_scan is set).
buf_pool_t::io_buf_t::reserve(): Retry indefinitely.
During the test encryption.innochecksum we easily run out of
these buffers for PAGE_COMPRESSED or ENCRYPTED pages.
Tested by Matthias Leich and Axel Schwenke
buf_page_free(): Flag the freed page as modified if it is found in
the buffer pool.
buf_flush_page(): If the page has been freed, ensure that the log
for it has been durably written, before removing the page
from buf_pool.flush_list.
FindBlockX: Find also MTR_MEMO_PAGE_X_MODIFY in order to avoid an
occasional failure of innodb.innodb_defrag_concurrent, which involves
freeing and reallocating pages in the same mini-transaction.
This fixes a regression that was introduced in
commit a35b4ae898 (MDEV-15528).
This logic was tested by commenting out the $shutdown_timeout line
from a test and running the following:
./mtr --rr innodb.scrub
rr replay var/log/mysqld.1.rr/mariadbd-0
A breakpoint in the modified buf_flush_page() was hit, and the
FIL_PAGE_LSN of that page had been last modified during the
mtr_t::commit() of a mini-transaction where buf_page_free()
had been executed on that page.
The MDEV-29693 conflict resolution is from Monty, as well as is
a bug fix where ANALYZE TABLE wrongly built histograms for
single-column PRIMARY KEY.
Also includes a fix for safe_malloc error reporting.
Other things:
- Copied main.log_slow from 10.4 to avoid mtr issue
Disabled test:
- spider/bugfix.mdev_27239 because we started to get
+Error 1429 Unable to connect to foreign data source: localhost
-Error 1158 Got an error reading communication packets
- main.delayed
- Bug#54332 Deadlock with two connections doing LOCK TABLE+INSERT DELAYED
This part is disabled for now as it fails randomly with different
warnings/errors (no corruption).
In commit 03ca6495df and
commit ff5d306e29
we forgot to remove some Google copyright notices related to
a contribution of using atomic memory access in the old InnoDB
mutex_t and rw_lock_t implementation.
The copyright notices had been mostly added in
commit c6232c06fa
due to commit a1bb700fd2.
The following Google contributions remain:
* some logic related to the parameter innodb_io_capacity
* innodb_encrypt_tables, added in MariaDB Server 10.1
The new statistics is enabled by adding the "engine", "innodb" or "full"
option to --log-slow-verbosity
Example output:
# Pages_accessed: 184 Pages_read: 95 Pages_updated: 0 Old_rows_read: 1
# Pages_read_time: 17.0204 Engine_time: 248.1297
Page_read_time is time doing physical reads inside a storage engine.
(Writes cannot be tracked as these are usually done in the background).
Engine_time is the time spent inside the storage engine for the full
duration of the read/write/update calls. It uses the same code as
'analyze statement' for calculating the time spent.
The engine statistics is done with a generic interface that should be
easy for any engine to use. It can also easily be extended to provide
even more statistics.
Currently only InnoDB has counters for Pages_% and Undo_% status.
Engine_time works for all engines.
Implementation details:
class ha_handler_stats holds all engine stats. This class is included
in handler and THD classes.
While a query is running, all statistics is updated in the handler. In
close_thread_tables() the statistics is added to the THD.
handler::handler_stats is a pointer to where statistics should be
collected. This is set to point to handler::active_handler_stats if
stats are requested. If not, it is set to 0.
handler_stats has also an element, 'active' that is 1 if stats are
requested. This is to allow engines to avoid doing any 'if's while
updating the statistics.
Cloned or partition tables have the pointer set to the base table if
status are requested.
There is a small performance impact when using --log-slow-verbosity=engine:
- All engine calls in 'select' will be timed.
- IO calls for InnoDB reads will be timed.
- Incrementation of counters are done on local variables and accesses
are inline, so these should have very little impact.
- Statistics has to be reset for each statement for the THD and each
used handler. This is only 40 bytes, which should be neglectable.
- For partition tables we have to loop over all partitions to update
the handler_status as part of table_init(). Can be optimized in the
future to only do this is log-slow-verbosity changes. For this to work
we have to update handler_status for all opened partitions and
also for all partitions opened in the future.
Other things:
- Added options 'engine' and 'full' to log-slow-verbosity.
- Some of the new files in the test suite comes from Percona server, which
has similar status information.
- buf_page_optimistic_get(): Do not increment any counter, since we are
only validating a pointer, not performing any buf_pool.page_hash lookup.
- Added THD argument to save_explain_data_intern().
- Switched arguments for save_explain_.*_data() to have
always THD first (generates better code as other functions also have THD
first).
os_aio_wait_until_no_pending_reads(), os_aio_wait_until_pending_writes():
Add a Boolean parameter to indicate whether the wait should be declared
in the thread pool.
buf_flush_wait(): The callers have already declared a wait, so let us
avoid doing that again, just call os_aio_wait_until_pending_writes(false).
buf_flush_wait_flushed(): Do not declare a wait in the rare case that
the buf_flush_page_cleaner thread has been shut down already.
buf_flush_page_cleaner(), buf_flush_buffer_pool(): In the code that runs
during shutdown, do not declare waits.
buf_flush_buffer_pool(): Remove a debug assertion that might fail.
What really matters here is buf_pool.flush_list.count==0.
buf_read_recv_pages(), srv_prepare_to_delete_redo_log_file():
Do not declare waits during InnoDB startup.
For more convenient monitoring of something that could greatly affect
the volume of page writes, we add the status variable
Innodb_buffer_pool_pages_split that was previously only available
via information_schema.innodb_metrics as "innodb_page_splits".
This was suggested by Axel Schwenke.
buf_flush_page_count: Replaced with buf_pool.stat.n_pages_written.
We protect buf_pool.stat (except n_page_gets) with buf_pool.mutex
and remove unnecessary export_vars indirection.
buf_pool.flush_list_bytes: Moved from buf_pool.stat.flush_list_bytes.
Protected by buf_pool.flush_list_mutex.
buf_pool_t::page_cleaner_status: Replaces buf_pool_t::n_flush_LRU_,
buf_pool_t::n_flush_list_, and buf_pool_t::page_cleaner_is_idle.
Protected by buf_pool.flush_list_mutex. We will exclusively broadcast
buf_pool.done_flush_list by the buf_flush_page_cleaner thread,
and only wait for it when communicating with buf_flush_page_cleaner.
There is no need to keep a count of pending writes by the
buf_pool.flush_list processing. A single flag suffices for that.
Waits for page write completion can be performed by
simply waiting on block->page.lock, or by invoking
buf_dblwr.wait_for_page_writes().
buf_LRU_block_free_non_file_page(): Broadcast buf_pool.done_free and
set buf_pool.try_LRU_scan when freeing a page. This would be
executed also as part of buf_page_write_complete().
buf_page_write_complete(): Do not broadcast buf_pool.done_flush_list,
and do not acquire buf_pool.mutex unless buf_pool.LRU eviction is needed.
Let buf_dblwr count all writes to persistent pages and broadcast a
condition variable when no outstanding writes remain.
buf_flush_page_cleaner(): Prioritize LRU flushing and eviction right after
"furious flushing" (lsn_limit). Simplify the conditions and reduce the
hold time of buf_pool.flush_list_mutex. Refuse to shut down
or sleep if buf_pool.ran_out(), that is, LRU eviction is needed.
buf_pool_t::page_cleaner_wakeup(): Add the optional parameter for_LRU.
buf_LRU_get_free_block(): Protect buf_lru_free_blocks_error_printed
with buf_pool.mutex. Invoke buf_pool.page_cleaner_wakeup(true) to
to ensure that buf_flush_page_cleaner() will process the LRU flush
request.
buf_do_LRU_batch(), buf_flush_list(), buf_flush_list_space():
Update buf_pool.stat.n_pages_written when submitting writes
(while holding buf_pool.mutex), not when completing them.
buf_page_t::flush(), buf_flush_discard_page(): Require that
the page U-latch be acquired upfront, and remove
buf_page_t::ready_for_flush().
buf_pool_t::delete_from_flush_list(): Remove the parameter "bool clear".
buf_flush_page(): Count pending page writes via buf_dblwr.
buf_flush_try_neighbors(): Take the block of page_id as a parameter.
If the tablespace is dropped before our page has been written out,
release the page U-latch.
buf_pool_invalidate(): Let the caller ensure that there are no
outstanding writes.
buf_flush_wait_batch_end(false),
buf_flush_wait_batch_end_acquiring_mutex(false):
Replaced with buf_dblwr.wait_for_page_writes().
buf_flush_wait_LRU_batch_end(): Replaces buf_flush_wait_batch_end(true).
buf_flush_list(): Remove some broadcast of buf_pool.done_flush_list.
buf_flush_buffer_pool(): Invoke also buf_dblwr.wait_for_page_writes().
buf_pool_t::io_pending(), buf_pool_t::n_flush_list(): Remove.
Outstanding writes are reflected by buf_dblwr.pending_writes().
buf_dblwr_t::init(): New function, to initialize the mutex and
the condition variables, but not the backing store.
buf_dblwr_t::is_created(): Replaces buf_dblwr_t::is_initialised().
buf_dblwr_t::pending_writes(), buf_dblwr_t::writes_pending:
Keeps track of writes of persistent data pages.
buf_flush_LRU(): Allow calls while LRU flushing may be in progress
in another thread.
Tested by Matthias Leich (correctness) and Axel Schwenke (performance)
Adaptive flushing is enabled by setting innodb_max_dirty_pages_pct_lwm>0
(not default) and innodb_adaptive_flushing=ON (default).
There is also the parameter innodb_adaptive_flushing_lwm
(default: 10 per cent of the log capacity). It should enable some
adaptive flushing even when innodb_max_dirty_pages_pct_lwm=0.
That is not being changed here.
This idea was first presented by Inaam Rana several years ago,
and I discussed it with Jean-François Gagné at FOSDEM 2023.
buf_flush_page_cleaner(): When we are not near the log capacity limit
(neither buf_flush_async_lsn nor buf_flush_sync_lsn are set),
also try to move clean blocks from the buf_pool.LRU list to buf_pool.free
or initiate writes (but not the eviction) of dirty blocks, until
the remaining I/O capacity has been consumed.
buf_flush_LRU_list_batch(): Add the parameter bool evict, to specify
whether dirty least recently used pages (from buf_pool.LRU) should
be evicted immediately after they have been written out. Callers outside
buf_flush_page_cleaner() will pass evict=true, to retain the existing
behaviour.
buf_do_LRU_batch(): Add the parameter bool evict.
Return counts of evicted and flushed pages.
buf_flush_LRU(): Add the parameter bool evict.
Assume that the caller holds buf_pool.mutex and
will invoke buf_dblwr.flush_buffered_writes() afterwards.
buf_flush_list_holding_mutex(): A low-level variant of buf_flush_list()
whose caller must hold buf_pool.mutex and invoke
buf_dblwr.flush_buffered_writes() afterwards.
buf_flush_wait_batch_end_acquiring_mutex(): Remove. It is enough to have
buf_flush_wait_batch_end().
page_cleaner_flush_pages_recommendation(): Avoid some floating-point
arithmetics.
buf_flush_page(), buf_flush_check_neighbor(), buf_flush_check_neighbors(),
buf_flush_try_neighbors(): Rename the parameter "bool lru" to "bool evict".
buf_free_from_unzip_LRU_list_batch(): Remove the parameter.
Only actual page writes will contribute towards the limit.
buf_LRU_free_page(): Evict freed pages of temporary tables.
buf_pool.done_free: Broadcast whenever a block is freed
(and buf_pool.try_LRU_scan is set).
buf_pool_t::io_buf_t::reserve(): Retry indefinitely.
During the test encryption.innochecksum we easily run out of
these buffers for PAGE_COMPRESSED or ENCRYPTED pages.
Tested by Matthias Leich and Axel Schwenke
buf_pool_t::watch_set(): Always buffer-fix a block if one was found,
no matter if it is a watch sentinel or a buffer page. The type of
the block descriptor will be rechecked in buf_page_t::watch_unset().
Do not expect the caller to acquire the page hash latch. Starting with
commit bd5a6403ca it is safe to release
buf_pool.mutex before acquiring a buf_pool.page_hash latch.
buf_page_get_low(): Adjust to the changed buf_pool_t::watch_set().
This simplifies the logic and fixes a bug that was reproduced when
using debug builds and the setting innodb_change_buffering_debug=1.
buf_read_page_low(): Map the buf_page_t::read_complete() return
value DB_FAIL to DB_PAGE_CORRUPTED. The purpose of the DB_FAIL
return value is to avoid error log noise when read-ahead brings
in an unused page that is typically filled with NUL bytes.
If a synchronous read is bringing in a corrupted page where the
page frame does not contain the expected tablespace identifier and
page number, that must be treated as an attempt to read a corrupted
page. The correct error code for this is DB_PAGE_CORRUPTED.
The error code DB_FAIL is not handled by row_mysql_handle_errors().
This was missed in commit 0b47c126e3
(MDEV-13542).
This also fixes part of MDEV-29835 Partial server freeze
which is caused by violations of the latching order that was
defined in https://dev.mysql.com/worklog/task/?id=6326
(WL#6326: InnoDB: fix index->lock contention). Unless the
current thread is holding an exclusive dict_index_t::lock,
it must acquire page latches in a strict parent-to-child,
left-to-right order. Not all cases of MDEV-29835 are fixed yet.
Failure to follow the correct latching order will cause deadlocks
of threads due to lock order inversion.
As part of these changes, the BTR_MODIFY_TREE mode is modified
so that an Update latch (U a.k.a. SX) will be acquired on the
root page, and eXclusive latches (X) will be acquired on all pages
leading to the leaf page, as well as any left and right siblings
of the pages along the path. The DEBUG_SYNC test innodb.innodb_wl6326
will be removed, because at the time the DEBUG_SYNC point is hit,
the thread is actually holding several page latches that will be
blocking a concurrent SELECT statement.
We also remove double bookkeeping that was caused due to excessive
information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo
store information of latched pages, and ensure that
mtr_memo_slot_t::object is never a null pointer.
The tree_blocks[] and tree_savepoints[] were redundant.
buf_page_get_low(): If innodb_change_buffering_debug=1, to avoid
a hang, do not try to evict blocks if we are holding a latch on
a modified page. The test innodb.innodb-change-buffer-recovery
will be removed, because change buffering may no longer be forced
by debug injection when the change buffer comprises multiple pages.
Remove a debug assertion that could fail when
innodb_change_buffering_debug=1 fails to evict a page.
For other cases, the assertion is redundant, because we already
checked that right after the got_block: label. The test
innodb.innodb-change-buffering-recovery will be removed, because
due to this change, we will be unable to evict the desired page.
mtr_t::lock_register(): Register a change of a page latch
on an unmodified buffer-fixed block.
mtr_t::x_latch_at_savepoint(), mtr_t::sx_latch_at_savepoint():
Replaced by the use of mtr_t::upgrade_buffer_fix(), which now
also handles RW_S_LATCH.
mtr_t::set_modified(): For temporary tables, invoke
buf_page_t::set_modified() here and not in mtr_t::commit().
We will never set the MTR_MEMO_MODIFY flag on other than
persistent data pages, nor set mtr_t::m_modifications when
temporary data pages are modified.
mtr_t::commit(): Only invoke the buf_flush_note_modification() loop
if persistent data pages were modified.
mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo.
This avoids many redundant entries in mtr_t::m_memo, as well as
redundant calls to buf_page_get_gen() for blocks that had already
been looked up in a mini-transaction.
btr_get_latched_root(): Return a pointer to an already latched root page.
This replaces btr_root_block_get() in cases where the mini-transaction
has already latched the root page.
btr_page_get_parent(): Fetch a parent page that was already latched
in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched().
If needed, upgrade the root page U latch to X.
This avoids bloating mtr_t::m_memo as well as performing redundant
buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for
B-tree defragmentation, we will invoke btr_cur_search_to_nth_level().
btr_cur_search_to_nth_level(): This will only be used for non-leaf
(level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE
or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be
removed altogether, or retained for the case of
CHECK TABLE without QUICK.
btr_cur_t::left_block: Remove. btr_pcur_move_backward_from_page()
can retrieve the left sibling from the end of mtr_t::m_memo.
btr_cur_t::open_leaf(): Some clean-up.
btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level()
for searches to level=0 (the leaf level). We will never release
parent page latches before acquiring leaf page latches. If we need to
temporarily release the level=1 page latch in the BTR_SEARCH_PREV or
BTR_MODIFY_PREV latch_mode, we will reposition the cursor on the
child node pointer so that we will land on the correct leaf page.
btr_cur_t::pessimistic_search_leaf(): Implement new BTR_MODIFY_TREE
latching logic in the case that page splits or merges will be needed.
The parent pages (and their siblings) should already be latched on
the first dive to the leaf and be present in mtr_t::m_memo; there
should be no need for BTR_CONT_MODIFY_TREE. This pre-latching almost
suffices; it must be revised in MDEV-29835 and work-arounds removed
for cases where mtr_t::get_already_latched() fails to find a block.
rtr_search_to_nth_level(): A SPATIAL INDEX version of
btr_search_to_nth_level() that can search to any level
(including the leaf level).
rtr_search_leaf(), rtr_insert_leaf(): Wrappers for
rtr_search_to_nth_level().
rtr_search(): Replaces rtr_pcur_open().
rtr_latch_leaves(): Replaces btr_cur_latch_leaves(). Note that unlike
in the B-tree code, there is no error handling in case the sibling
pages are corrupted.
rtr_cur_restore_position(): Remove an unused constant parameter.
btr_pcur_open_on_user_rec(): Remove the constant parameter
mode=PAGE_CUR_GE.
row_ins_clust_index_entry_low(): Use a new
mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page
when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC.
BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove.
BTR_CONT_MODIFY_TREE: Note that this is only used by
rtr_search_to_nth_level().
btr_pcur_optimistic_latch_leaves(): Replaces
btr_cur_optimistic_latch_leaves().
ibuf_delete_rec(): Acquire exclusive ibuf.index->lock in order
to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV).
btr_blob_log_check_t(): Acquire a U latch on the root page,
so that btr_page_alloc() in btr_store_big_rec_extern_fields()
will avoid a deadlock.
btr_store_big_rec_extern_fields(): Assert that the root page latch
is being held.
Tested by: Matthias Leich
Reviewed by: Vladislav Lesin