in buf_dblwr_t::init_or_load_pages()
- InnoDB fails to set the TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED
flag in the transaction system header page while recreating
the undo log tablespaces
buf_dblwr_t::init_or_load_pages(): Tries to reset the
space id and to write into the doublewrite buffer even
when read_only mode is enabled.
In srv_all_undo_tablespaces_open(), InnoDB should try to
open the extra unused undo tablespaces instead of trying to
create them.
Issue: When getting a page (buf_page_get_gen) with no latch option
(RW_NO_LATCH), the caller is not expected to follow the B-tree latching
order. However, in buf_page_get_low we try to acquire a shared page
latch unconditionally in order to wait for a page that is being loaded
by another thread concurrently. In general this could lead to a latch
order violation and a deadlock.
Currently this affects the change buffer insert path btr_latch_prev(),
which tries to load the previous page out of order with RW_NO_LATCH,
so that two concurrent inserts into the IBUF tree can deadlock. This
problem was introduced in 10.6 by the following commit:
commit 9436c778c3 (MDEV-27058)
Fix: While trying to latch a page with RW_NO_LATCH, always use the
"*lock_try" interface and retry the operation on failure after
unfixing the page.
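As an illustration of this retry pattern, here is a minimal standalone
sketch; the type and its member functions are simplified stand-ins, not
the actual buf0buf interface:

    #include <shared_mutex>

    struct block_t
    {
      std::shared_mutex latch;     // stand-in for the page latch
      void fix() {}                // stand-in for buffer-fixing the block
      void unfix() {}
    };

    // With RW_NO_LATCH the caller does not follow the B-tree latching
    // order, so we must never block on the latch: use the try-lock
    // interface and restart after unfixing the block.
    void wait_until_loaded(block_t *block)
    {
      for (;;)
      {
        block->fix();
        if (block->latch.try_lock_shared())
        {
          // The page is no longer being loaded by another thread.
          block->latch.unlock_shared();
          return;
        }
        block->unfix();
        // A real implementation would redo the buf_pool.page_hash
        // lookup here before retrying, because the block may have
        // been evicted in the meantime.
      }
    }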
Because the Red Hat Enterprise Linux 8 core repository does not include
libpmem, let us implement the necessary subset ourselves.
pmem_persist(): Implement for 64-bit x86, ARM, POWER, RISC-V, Loongarch
in a way that should be compatible with the https://github.com/pmem/pmdk/
implementation of pmem_persist().
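As an illustration only (not the actual source), the 64-bit x86 flush
loop might look like the following sketch, which assumes a 64-byte
cache line and a build that enables CLFLUSHOPT (e.g. -mclflushopt);
the real implementation chooses between CLWB, CLFLUSHOPT and CLFLUSH
at runtime:

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    static void pmem_persist_sketch(const void *buf, size_t size)
    {
      const uintptr_t end= uintptr_t(buf) + size;
      for (uintptr_t addr= uintptr_t(buf) & ~uintptr_t{63}; addr < end;
           addr+= 64)
        _mm_clflushopt(reinterpret_cast<void*>(addr));
      // CLFLUSHOPT is weakly ordered; fence before continuing.
      _mm_sfence();
    }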
The CMake option WITH_INNODB_PMEM can be used for enabling or disabling
this interface at compile time. By default, it is enabled on all applicable
systems that are covered by our CI system.
Note: libpmem had not been previously enabled for Loongarch in our
Debian packaging. It was enabled for RISC-V, but we will not enable it
by default on RISC-V or Loongarch because we lack CI coverage.
The generated code for x86_64 was reviewed and tested on two
Intel implementations: one that only supports clflush, and
another that supports both clflushopt and clwb.
The generated machine code was also reviewed on https://godbolt.org
using various compiler versions. Godbolt helpfully includes an option
to compile to binary code and display the encoding, which was
useful on POWER.
Reviewed by: Vladislav Vaintroub
In so-called optimistic buffer pool lookups, we must not
dereference a block descriptor before we have made sure that
it is accessible. While buf_pool_t::resize() is running,
block descriptors could become invalid.
The buf::Block_hint class was essentially duplicating
a buf_pool.page_hash lookup that was executed in
buf_page_optimistic_get() anyway. For better locality of
reference, we had better execute that lookup only once.
buf_page_optimistic_fix(): Prepare for buf_page_optimistic_get().
This is basically a simpler version of buf::Block_hint.
buf_page_optimistic_get(): Assume that buf_page_optimistic_fix()
has been called and the page identifier verified. Should the block
be evicted, the block->modify_clock will be invalidated; we do not
need to check the block->page.id() again. It suffices to check
the block->modify_clock after acquiring the page latch.
btr_pcur_t::old_page_id: Store the expected page identifier
for buf_page_optimistic_fix().
btr_pcur_t::block_when_stored: Remove. This was duplicating
page_cur_t::block.
btr_pcur_optimistic_latch_leaves(): Remove redundant parameters.
First, invoke buf_page_optimistic_fix() on the requested page.
If needed, acquire a latch on the left page. Finally, acquire
a latch on the target page and recheck the block->modify_clock.
If the page had been freed while we were not holding a page latch,
fall back to the slow path. Validate the FIL_PAGE_PREV after
acquiring a latch on the current page. The block->modify_clock
is only incremented when records are deleted or pages are
reorganized or evicted; it does not guard against concurrent
page splits.
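A condensed standalone sketch of the resulting pattern; the types and
members are simplified stand-ins, and the real code additionally
buffer-fixes the block and guards against buf_pool_t::resize():

    #include <cstdint>
    #include <shared_mutex>

    struct block_sketch
    {
      uint64_t page_id;
      uint64_t modify_clock;   // bumped on delete, reorganize or evict
      std::shared_mutex lock;  // the page latch
    };

    struct saved_position      // what the cursor remembers while unlatched
    {
      block_sketch *block;     // page_cur_t::block
      uint64_t old_page_id;    // btr_pcur_t::old_page_id
      uint64_t old_modify_clock;
    };

    bool optimistic_restore(const saved_position &pos)
    {
      // "buf_page_optimistic_fix()": verify the identifier, fix the block.
      if (pos.block->page_id != pos.old_page_id)
        return false;                 // fall back to the slow path
      // "buf_page_optimistic_get()": latch, then only recheck modify_clock.
      std::shared_lock<std::shared_mutex> latch{pos.block->lock};
      return pos.block->modify_clock == pos.old_modify_clock;
    }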
Reviewed by: Debarun Banerjee
Some fixes related to commit f838b2d799 and to
Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row()
for system-versioned tables were provided by Nikita Malyavin.
They were required by the test versioning.rpl,trx_id,row.
MONITOR_INC_VALUE_CUMULATIVE is a multiline macro, so the second statement
would always be executed, regardless of the "if" condition.
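The pitfall, illustrated with a hypothetical two-statement macro rather
than the real MONITOR_INC_VALUE_CUMULATIVE definition:

    // Hypothetical macro whose expansion is two statements:
    #define INC_TWO_COUNTERS(a, b) ++(a); ++(b)

    void example(bool cond, int &x, int &y)
    {
      if (cond)
        INC_TWO_COUNTERS(x, y);
      // expands to:  if (cond) ++(x); ++(y);
      // so ++(y) is executed unconditionally.
      // The usual cure is to wrap the macro body in do { ... } while (0)
      // or to add braces around the call site.
    }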
These problems first started with
commit b1ab211dee (MDEV-15053).
Thanks to Yury Chaikou from ServiceNow for the report.
By design, InnoDB has always hung when permanently running out of
buffer pool, for example when several threads are waiting to allocate
a block, and all of the buffer pool is buffer-fixed by the active threads.
The hang that we are fixing here occurs when the buffer pool is only
temporarily running out and the situation could be rescued by writing out
some dirty pages or evicting some clean pages.
buf_LRU_get_free_block(): Simplify the way we wait for
the buf_flush_page_cleaner thread. This fixes occasional hangs
of the test encryption.innochecksum that were introduced by
commit a55b951e60 (MDEV-26827).
To play it safe, we use a timed wait when waiting for the
buf_flush_page_cleaner() thread to perform its job. Should that
thread get stuck, we will invoke buf_pool.LRU_warn() in order to
display a message that pages could not be freed, and keep trying
to wake up the buf_flush_page_cleaner() thread.
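A standalone sketch of that timed wait, using standard C++ primitives
instead of the InnoDB ones; the one-second timeout is only for
illustration:

    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    std::mutex m;
    std::condition_variable done_flushing;
    // In this sketch, the page cleaner stand-in would set the flag
    // and notify done_flushing once blocks have been freed.
    bool have_free_block= false;

    inline void wake_page_cleaner() {}  // stand-in for signalling the thread
    inline void LRU_warn() {}           // stand-in for buf_pool.LRU_warn()

    void wait_for_free_block()
    {
      std::unique_lock<std::mutex> lk{m};
      while (!have_free_block)
      {
        wake_page_cleaner();
        if (done_flushing.wait_for(lk, std::chrono::seconds(1)) ==
            std::cv_status::timeout)
          LRU_warn();  // the page cleaner appears stuck; keep trying
      }
    }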
The INFORMATION_SCHEMA.INNODB_METRICS counters
buffer_LRU_single_flush_failure_count and
buffer_LRU_get_free_waits will be removed.
The latter is represented by buffer_pool_wait_free.
Also removed will be the message
"InnoDB: Difficult to find free blocks in the buffer pool"
because in d34479dc66 we
introduced a more precise message
"InnoDB: Could not free any blocks in the buffer pool"
in the buf_flush_page_cleaner thread.
buf_pool_t::LRU_warn(): Issue the warning message that we could
not free any blocks in the buffer pool. This may also be invoked
by buf_LRU_get_free_block() if buf_flush_page_cleaner() appears
to be stuck.
buf_pool_t::n_flush_dec(): Remove.
buf_pool_t::n_flush_dec_holding_mutex(): Rename to n_flush_dec().
buf_flush_LRU_list_batch(): Increment the eviction counter for blocks
of temporary, discarded or dropped tablespaces.
buf_flush_LRU(): Make static, and remove the constant parameter
evict=false. The only caller will be the buf_flush_page_cleaner()
thread.
IORequest::is_LRU(): Remove. The only case of evicting pages on
write completion will be when we are writing out pages of the
temporary tablespace. Those pages are not in buf_pool.flush_list,
only in buf_pool.LRU.
buf_page_t::flush(): Remove the parameter evict.
buf_page_t::write_complete(): Change the parameter "bool temporary"
to "bool persistent" and add a parameter for a state() value that has
already been read.
Reviewed by: Debarun Banerjee
The log_sys.lsn_lock is a very contended resource with a small
critical section in log_sys.append_prepare(). On many processor
microarchitectures, replacing the system call based log_sys.lsn_lock
with a pure spin lock would fare worse during high concurrency workloads,
wasting a significant amount of CPU cycles in the spin loop.
On other microarchitectures, we would see a significant amount of time
being spent in native_queued_spin_lock_slowpath() in the Linux kernel,
plus context switching between user and kernel address space. This was
pointed out by Steve Shaw from Intel Corporation.
Depending on the workload and the hardware implementation, it may be
useful to use a pure spin lock in log_sys.append_prepare().
We will introduce a parameter. The statement
SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=50;
would enable a spin lock that will execute that many MY_RELAX_CPU()
operations (such as the x86 PAUSE instruction) between successive
attempts of acquiring the spin lock. The use of a system call based
log_sys.lsn_lock (which is the default setting) can be enabled by
SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=0;
This patch will also introduce #ifdef LOG_LATCH_DEBUG
(part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON) for more accurate
tracking of log_sys.latch ownership and reorganize the fields of
log_sys to improve the locality of reference and to reduce the
chances of false sharing.
When a spin lock is being used, it will be maintained in the
most significant bit of log_sys.buf_free. This is useful, because that is
one of the fields that is covered by the lock. For IA-32 or AMD64, we
implement the spin lock specially via log_t::lsn_lock_bts(), employing the
i386 LOCK BTS instruction. A straightforward std::atomic::fetch_or() would
translate into an inefficient loop around LOCK CMPXCHG.
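A portable sketch of keeping the lock in the most significant bit of
buf_free; on IA-32 and AMD64, log_t::lsn_lock_bts() replaces the
fetch_or() below with a single LOCK BTS instruction:

    #include <atomic>
    #include <cstddef>

    constexpr size_t LOCK_BIT= size_t{1} << (sizeof(size_t) * 8 - 1);

    inline void cpu_relax() {}  // stand-in for MY_RELAX_CPU() (x86 PAUSE)

    void lock_buf_free(std::atomic<size_t> &buf_free,
                       unsigned spin_wait_delay)
    {
      while (buf_free.fetch_or(LOCK_BIT, std::memory_order_acquire) &
             LOCK_BIT)
        for (unsigned i= 0; i < spin_wait_delay; i++)
          cpu_relax();
    }

    void unlock_buf_free(std::atomic<size_t> &buf_free)
    {
      buf_free.fetch_and(~LOCK_BIT, std::memory_order_release);
    }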
mtr_t::spin_wait_delay: The value of innodb_log_spin_wait_delay.
mtr_t::finisher: Pointer to the currently used mtr_t::finish_write()
implementation. This allows us to avoid introducing conditional branches.
We no longer invoke log_sys.is_pmem() at the mini-transaction level;
that is done in log_write_up_to() instead.
mtr_t::finisher_update(): Update finisher when spin_wait_delay is
changed from or to 0 (the spin lock is changed to log_sys.lsn_lock or
vice versa).
Let us skip the recently added test main.mysql-interactive if
an instrumented ncurses library is not available.
In InnoDB, let us work around an uninstrumented libnuma, by
declaring that the objects returned by numa_get_mems_allowed()
are initialized.
buf_flush_page_cleaner(): Remove a loop that had originally been added
in commit 9d1466522e (MDEV-32029) and made
redundant by commit 5b53342a6a (MDEV-32588).
Starting with commit d34479dc66 (MDEV-33053)
this loop would cause a significant performance regression in workloads
where buf_pool.need_LRU_eviction() constantly holds in
buf_flush_page_cleaner().
Thanks to Steve Shaw of Intel for noticing this.
Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
Ever since commit 412ee0330c
or commit a440d6ed3a
InnoDB should generally not abort when failing to open or create files.
In Datafile::open_or_create() we had failed to set the flag
to avoid abort() on failure, but everywhere else we were setting it.
We may still call abort() via os_file_handle_error().
Reviewed by: Vladislav Vaintroub
buf_read_ahead_linear(): If buf_pool.watch_is_sentinel(*bpage),
do not attempt to read the page frame because the pointer would be null
for the elements of buf_pool.watch[].
Hitting this bug requires the use of a non-default value of
innodb_change_buffering.
A mix of path separators looks odd.
InnoDB: Loading buffer pool(s) from C:\xampp\mysql\data/ib_buffer_pool
This was changed in cf552f5886
Both forward slashes and backward slashes work on Windows. We do not
use \\?\ names.
So let us make the path look consistent, so that it does not look
like a bug. In this case, normalize the path separator to \ when
constructing the filename.
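The idea, as a minimal sketch (the helper name is hypothetical, not
the function touched by this fix):

    #include <algorithm>
    #include <string>

    std::string normalize_separators(std::string path)
    {
    #ifdef _WIN32
      std::replace(path.begin(), path.end(), '/', '\\');
    #endif
      return path;
    }
    // "C:\xampp\mysql\data/ib_buffer_pool" becomes
    // "C:\xampp\mysql\data\ib_buffer_pool"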
Thanks to GitHub user @celestinoxp for the report.
Closes: https://github.com/ApacheFriends/xampp-build/issues/33
Reviewed by: Marko Mäkelä and Vladislav Vaintroub
In commit a55b951e60 (MDEV-26827)
an error was introduced in a rarely executed code path of the
buf_flush_page_cleaner() thread. As a result, the function
buf_flush_LRU() could be invoked while not holding buf_pool.mutex.
Reviewed by: Debarun Banerjee
buf_flush_LRU(): Display a warning if no pages could be evicted and
no writes initiated.
buf_pool_t::need_LRU_eviction(): Renamed from buf_pool_t::ran_out().
Check if the amount of free pages is smaller than innodb_lru_scan_depth
instead of checking if it is 0.
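Roughly, the new predicate looks like the following sketch; the member
names are illustrative, not the exact buf_pool_t fields:

    #include <cstddef>

    struct buf_pool_sketch
    {
      size_t n_free;           // number of blocks in buf_pool.free
      size_t LRU_scan_depth;   // innodb_lru_scan_depth

      bool need_LRU_eviction() const
      {
        // formerly ran_out(): return n_free == 0;
        return n_free < LRU_scan_depth;
      }
    };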
buf_flush_page_cleaner(): For the final LRU flush after a checkpoint
flush, use a "budget" of innodb_io_capacity_max, like we do in the
case when we are not in "furious" checkpoint flushing.
Co-developed by: Debarun Banerjee
Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
- InnoDB fails to find the space id in page 0 of
the tablespace. In that case, InnoDB can use the
doublewrite buffer to recover page 0 and write it
back into the file.
- buf_dblwr_t::init_or_load_pages(): Loads only the pages
which are valid (page LSN >= checkpoint LSN). To do that,
InnoDB has to open the redo log before the system
tablespace and read the latest checkpoint information.
recv_dblwr_t::find_first_page():
1) Iterate the doublewrite buffer pages and find the 0th page
2) Read the tablespace flags, space id from the 0th page.
3) Read the 1st, 2nd and 3rd pages from the tablespace file and
compare their space id with the space id that is stored
in the doublewrite buffer.
4) If they match, then we can write into the file.
5) Return the tablespace which matches the pages from the file.
SysTablespace::read_lsn_and_check_flags(): Remove the
retry logic for validating the first page. After
restoring the first page from doublewrite buffer,
assign tablespace flags by reading the first page.
recv_recovery_read_max_checkpoint(): Reads the maximum
checkpoint information from the log file.
recv_recovery_from_checkpoint_start(): Avoid reading
the checkpoint header information from the log file.
Datafile::validate_first_page(): Throw an error in case
the first page validation fails.
When innodb_undo_log_truncate=ON causes an InnoDB undo tablespace
to be truncated, we must guarantee that the undo tablespace will
be rebuilt atomically: After mtr_t::commit_shrink() has durably
written the mini-transaction that rebuilds the undo tablespace,
we must not write any old pages to the tablespace.
To guarantee this, in trx_purge_truncate_history() we used to
traverse the entire buf_pool.flush_list in order to acquire
exclusive latches on all pages for the undo tablespace that
reside in the buffer pool, so that those pages cannot be written
and will be evicted during mtr_t::commit_shrink(). But, this
traversal may interfere with the page writing activity of
buf_flush_page_cleaner(). It would be better to lazily discard
the old pages of the truncated undo tablespace.
fil_space_t::is_being_truncated, fil_space_t::clear_stopping(): Remove.
fil_space_t::create_lsn: A new field, identifying the LSN of the
latest rebuild of a tablespace.
buf_page_t::flush(), buf_flush_try_neighbors(): Evict pages whose
FIL_PAGE_LSN is below fil_space_t::create_lsn.
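The lazy-discard rule, as a simplified sketch; the names stand in for
the real buf_page_t::flush() logic:

    #include <cstdint>

    // create_lsn is updated by mtr_t::commit_shrink().
    struct space_sketch { uint64_t create_lsn; };

    enum class page_action { WRITE, EVICT };

    // A page whose FIL_PAGE_LSN predates the latest rebuild of its
    // tablespace must never be written back; it is simply evicted.
    page_action flush_or_evict(uint64_t page_lsn, const space_sketch &space)
    {
      return page_lsn < space.create_lsn
        ? page_action::EVICT : page_action::WRITE;
    }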
mtr_t::commit_shrink(): Update fil_space_t::create_lsn and
fil_space_t::size right before the log is durably written and the
tablespace file is being truncated.
fsp_page_create(), trx_purge_truncate_history(): Simplify the logic.
Reviewed by: Thirunarayanan Balathandayuthapani, Vladislav Lesin
Performance tested by: Axel Schwenke
Correctness tested by: Matthias Leich
buf_flush_page_cleaner(): A continue or break inside DBUG_EXECUTE_IF
actually is a no-op. Use an explicit call to _db_keyword_() to
actually avoid advancing the checkpoint.
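Why the continue is a no-op, shown with a simplified stand-in for the
macro (the real DBUG_EXECUTE_IF expansion differs in detail):

    // Simplified stand-in for the real DBUG_EXECUTE_IF expansion:
    #define EXECUTE_IF(cond, body) do { if (cond) { body } } while (0)

    void example(bool keyword_set)
    {
      for (int i= 0; i < 10; i++)
      {
        EXECUTE_IF(keyword_set, continue;);
        // "continue" only terminates the macro's own do { } while (0),
        // so the rest of this for-loop body still executes.
        // Hence the fix tests the keyword explicitly, e.g.
        //   if (_db_keyword_(...)) continue;
      }
    }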
buf_flush_list_now_set(): Invoke os_aio_wait_until_no_pending_writes()
to ensure that the page write to the system tablespace is completed.
Errors output as notes are weird, and when a user
cannot do anything about them, why output them at all?
Point taken: removed these, and left a positive message
on initialization (from Marko).
Thanks to Elena Stepanova.
buf_flush_page_cleaner(): Pass pct_lwm=srv_max_dirty_pages_pct_lwm
(innodb_max_dirty_pages_pct_lwm) to
page_cleaner_flush_pages_recommendation() unless the dirty page ratio
of the buffer pool is below that. Starting with
commit d4265fbde5 we used to always
pass pct_lwm=0.0, which was not intended.
Reviewed by: Vladislav Vaintroub
The linear read-ahead (enabled by nonzero innodb_read_ahead_threshold)
works best if index leaf pages or undo log pages have been allocated
on adjacent page numbers. The read-ahead is assumed not to be helpful
in other types of page accesses, such as non-leaf index pages.
buf_page_get_low(): Do not invoke buf_page_t::set_accessed(),
buf_page_make_young_if_needed(), or buf_read_ahead_linear().
We will invoke them in those callers of buf_page_get_gen() or
buf_page_get() where it makes sense: the access is not
one-time-on-startup and the page is not going to be freed soon.
btr_copy_blob_prefix(), btr_pcur_move_to_next_page(),
trx_undo_get_prev_rec_from_prev_page(),
trx_undo_get_first_rec(), btr_cur_t::search_leaf(),
btr_cur_t::open_leaf(): Invoke buf_read_ahead_linear().
We will not invoke linear read-ahead in functions that would
essentially allocate or free pages, because pages that are
freshly allocated are expected to be initialized by buf_page_create()
and not read from the data file. Likewise, freeing pages should
not involve accessing any sibling pages, except for freeing
singly-linked lists of BLOB pages.
We will not invoke read-ahead in btr_cur_t::pessimistic_search_leaf()
or in a pessimistic operation of btr_cur_t::open_leaf(), because
it is assumed that pessimistic operations should be preceded by
optimistic operations, which should already have invoked read-ahead.
buf_page_make_young_if_needed(): Invoke also buf_page_t::set_accessed()
and return the result.
btr_cur_nonleaf_make_young(): Like buf_page_make_young_if_needed(),
but do not invoke buf_page_t::set_accessed().
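A conceptual standalone sketch of the caller-side pattern; all names
are simplified stand-ins, and the polarity of the return value (true on
the first access to the page) is an assumption for illustration:

    struct page_sketch { bool accessed= false; };

    inline bool make_young_if_needed(page_sketch &page) // also set_accessed()
    {
      const bool first_access= !page.accessed;
      page.accessed= true;
      return first_access;
    }

    inline void read_ahead_linear(const page_sketch&) {}

    void after_page_fetch(page_sketch &page)
    {
      // Only in callers where read-ahead makes sense, such as
      // btr_cur_t::search_leaf() or btr_pcur_move_to_next_page():
      if (make_young_if_needed(page))
        read_ahead_linear(page);
    }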
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
buf_page_get_zip(): Do not wait for the page latch while holding hash_lock.
If the latch is not available, ensure that any concurrent
buf_pool_t::corrupted_evict() will be able to acquire the hash_lock,
and then retry the lookup. If the page was corrupted, we will finally
"goto must_read_page", retry the read once more, and then report an error.
Reviewed by: Thirunarayanan Balathandayuthapani
Eventfds have a simpler interface and are one file
descriptor rather than two.
Reuse the pattern of accepting socket connections
by testing for abort after a poll returns. This way
the same event descriptor can be used for the quit
and debugging triggers.
Also correct the registration of the memory pressure file
descriptors.
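A standalone sketch of the pattern (not the tpool code itself): a
single eventfd serves as the quit/debug trigger, and the abort flag is
tested only after poll() returns:

    #include <poll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>
    #include <atomic>

    std::atomic<bool> aborting{false};
    int event_fd= -1;

    void init()          { event_fd= eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); }
    void request_abort() { aborting= true; eventfd_write(event_fd, 1); }

    void event_loop(int data_fd)
    {
      pollfd fds[2]= {{data_fd, POLLIN, 0}, {event_fd, POLLIN, 0}};
      for (;;)
      {
        if (poll(fds, 2, -1) < 0)
          continue;                   // e.g. EINTR
        if (aborting.load())          // test for abort after poll() returns
          break;
        if (fds[1].revents & POLLIN)
        {
          eventfd_t v;
          eventfd_read(event_fd, &v); // drain the debugging trigger
        }
        if (fds[0].revents & POLLIN)
        { /* handle data */ }
      }
      close(event_fd);
    }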
When the constant OS_AIO_N_PENDING_IOS_PER_THREAD is changed from 256 to 1
and the server is run with the minimum parameters
innodb_read_io_threads=1 and innodb_write_io_threads=2, two hangs
were observed.
tpool::cache<T>::put(T*): Ensure that get() in io_slots::acquire()
will be woken up when the cache previously was empty.
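The wakeup rule, illustrated with a simplified stand-in for
tpool::cache<T>, not the actual implementation:

    #include <condition_variable>
    #include <mutex>
    #include <vector>

    template <typename T> class cache_sketch
    {
      std::mutex m;
      std::condition_variable not_empty;
      std::vector<T*> items;
    public:
      T *get()                      // blocks while the cache is empty
      {
        std::unique_lock<std::mutex> lk{m};
        not_empty.wait(lk, [this] { return !items.empty(); });
        T *t= items.back();
        items.pop_back();
        return t;
      }
      void put(T *t)
      {
        std::lock_guard<std::mutex> lk{m};
        const bool was_empty= items.empty();
        items.push_back(t);
        if (was_empty)
          not_empty.notify_one();   // the wakeup that the fix guarantees
      }
    };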
buf_pool_t::io_buf_t::reserve(): Schedule a possibly partial doublewrite
batch so that os_aio_wait_until_no_pending_writes() has a chance of
returning. Add a Boolean parameter and pass wait_for_reads=false inside
buf_page_decrypt_after_read(), because those calls will be executed
inside a read completion callback, and therefore
os_aio_wait_until_no_pending_reads() would block indefinitely.
The log_sys.lsn_lock that was introduced in
commit a635c40648
had better be located in the same cache line as log_sys.latch,
so that log_t::append_prepare() only needs to modify the first two
cache lines where log_sys is stored.
log_t::lsn_lock: On Linux, change the type from pthread_mutex_t to
something that may be as small as 32 bits, to pack more data members
in the same cache line. On Microsoft Windows, CRITICAL_SECTION works
better.
log_t::check_flush_or_checkpoint_: Renamed to need_checkpoint.
There is no need to pause all writer threads in log_free_check() when
we only need to write log_sys.buf to ib_logfile0. That will be done in
mtr_t::commit().
log_t::append_prepare_wait(): Make the member function non-static
to simplify the call interface, and add a parameter for the LSN.
log_t::append_prepare(): Invoke append_prepare_wait() at most once.
Only set_check_for_checkpoint() if a log checkpoint needs to
be written. If the log buffer needs to be written, we will take care
of it ourselves later in our caller. This will reduce interference
with log_free_check() in other threads.
mtr_t::commit(): Call log_write_up_to() if needed.
log_t::get_write_target(): Return a log_write_up_to() target
to mtr_t::commit().
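The resulting control flow at the end of mtr_t::commit(), as a
standalone sketch with simplified stand-ins; the condition inside
get_write_target() is only a plausible shape, not the actual code:

    #include <cstdint>
    typedef uint64_t lsn_t;

    struct log_sketch
    {
      lsn_t lsn= 0;        // end of the log appended to the log buffer
      lsn_t write_lsn= 0;  // end of the log written to ib_logfile0
      lsn_t buf_limit= 0;  // corresponds to max_buf_free

      lsn_t get_write_target() const
      { return lsn - write_lsn >= buf_limit ? lsn : 0; }
    };

    inline void write_up_to(log_sketch &log, lsn_t target)
    { log.write_lsn= target; }

    void mtr_commit_tail(log_sketch &log)
    {
      // Instead of making every thread pause in log_free_check(), the
      // committing mini-transaction writes out the log buffer itself
      // when needed.
      if (const lsn_t target= log.get_write_target())
        write_up_to(log, target);
    }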
buf_flush_ahead(): If we are in furious flushing, call
log_sys.set_check_for_checkpoint() so that all writers will wait
in log_free_check() until the checkpoint is done. Otherwise,
the test innodb.insert_into_empty could occasionally report
an error "Crash recovery is broken".
log_check_margins(): Replaced by log_free_check().
log_flush_margin(): Removed. This is part of mtr_t::commit()
and other operations that write log.
log_t::create(), log_t::attach(): Guarantee that buf_free < max_buf_free
will always hold on PMEM, to satisfy an assumption of
log_t::get_write_target().
log_write_up_to(): Assert lsn!=0. Such calls are not incorrect, but it
is cheaper to test that single unlikely condition in mtr_t::commit()
rather than test several conditions in log_write_up_to().
innodb_drop_database(), unlock_and_close_files(): Check the LSN before
calling log_write_up_to().
ha_innobase::commit_inplace_alter_table(): Remove redundant calls to
log_write_up_to() after calling unlock_and_close_files().
Reviewed by: Vladislav Vaintroub
Stress tested by: Matthias Leich
Performance tested by: Steve Shaw