Commit graph

213 commits

Marko Mäkelä
b0ff791618 MDEV-24054 Assertion in_LRU_list failed in buf_flush_try_neighbors()
buf_flush_try_neighbors(): Before invoking buf_page_t::ready_for_flush(),
check that the freshly looked up buf_pool.page_hash entry actually is
a buffer page and not a buf_pool.watch[] sentinel for purge buffering.

This race condition was introduced in MDEV-15053
(commit b1ab211dee).
It is rather hard to hit this bug, because
buf_flush_check_neighbors() already checked the condition.
The problem exists if buf_pool.watch_set() was invoked for
a page in the range after the check in buf_flush_check_neighbor()
had been finished.
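
A minimal sketch of the added check, using hypothetical simplified
types rather than the actual InnoDB declarations:

#include <cassert>

enum page_state { NOT_USED, IN_FILE };

struct hash_entry
{
  page_state state;
  unsigned long long oldest_modification;

  // A watch[] sentinel occupies a page_hash cell without being a real
  // buffer page; it merely marks a page ID for purge buffering.
  bool is_watch_sentinel() const { return state == NOT_USED; }

  bool ready_for_flush() const
  {
    assert(state == IN_FILE); // would fail on a watch sentinel
    return oldest_modification != 0;
  }
};

// The fix: re-check the freshly looked up entry before invoking
// ready_for_flush() on it.
bool may_flush(const hash_entry *e)
{
  return e && !e->is_watch_sentinel() && e->ready_for_flush();
}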
2020-10-30 19:06:50 +02:00
Marko Mäkelä
c27e53f459 MDEV-23855: Use normal mutex for log_sys.mutex, log_sys.flush_order_mutex
With an unreasonably small innodb_log_file_size, the page cleaner
thread would frequently acquire log_sys.flush_order_mutex and spend
a significant portion of CPU time spinning on that mutex when
determining the checkpoint LSN.
2020-10-26 17:53:55 +02:00
Marko Mäkelä
a5a2ef079c MDEV-23855: Implement asynchronous doublewrite
Synchronous writes and calls to fdatasync(), fsync() or
FlushFileBuffers() would ruin performance. So, let us
submit asynchronous writes for the doublewrite buffer.
We submit a single request for the likely case that the
two doublewrite buffers are contiguous in the system tablespace.
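
A sketch of the coalescing idea, using synchronous pwrite() and
hypothetical buffers for brevity; the real code submits asynchronous
requests through the InnoDB I/O layer:

#include <sys/types.h>
#include <unistd.h>

// One request when the buffers are adjacent both in memory and in the
// file; otherwise two. Returns 0 on success, -1 on error.
int write_dblwr(int fd, const char *buf1, off_t off1, size_t len1,
                const char *buf2, off_t off2, size_t len2)
{
  if (buf1 + len1 == buf2 && off1 + (off_t) len1 == off2)
    return pwrite(fd, buf1, len1 + len2, off1) < 0 ? -1 : 0;
  if (pwrite(fd, buf1, len1, off1) < 0)
    return -1;
  return pwrite(fd, buf2, len2, off2) < 0 ? -1 : 0;
}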

buf_dblwr_t::flush_buffered_writes_completed(): The completion callback
of buf_dblwr_t::flush_buffered_writes().

os_aio_wait_until_no_pending_writes(): Also wait for doublewrite batches.

buf_dblwr_t::element::space: Remove. We can simply use
element::request.node->space instead.

Reviewed by: Vladislav Vaintroub
2020-10-26 17:53:55 +02:00
Marko Mäkelä
8cb01c51fb MDEV-16264 fixup: Clean up asynchronous I/O
os_aio_userdata_t: Remove. It was basically duplicating IORequest.

buf_page_write_complete(): Take only IORequest as a parameter.

os_aio_func(), pfs_os_aio_func(): Replaced with os_aio() that has
no redundant parameters. There is only one caller, so there is no
point in passing __FILE__, __LINE__ as parameters.
2020-10-26 17:53:54 +02:00
Marko Mäkelä
118e258aaa MDEV-23855: Shrink fil_space_t
Merge n_pending_ios, n_pending_ops to std::atomic<uint32_t> n_pending.
Change some more fil_space_t members to uint32_t to reduce
the memory footprint.
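
A sketch of the counter merge, with hypothetical bit assignments; the
point is that both counts live in one atomic word:

#include <atomic>
#include <cstdint>

// Pack both counts into one 32-bit word (assuming each stays below
// 2^16), so that a single atomic load observes both consistently.
class pending_counter
{
  std::atomic<uint32_t> word{0}; // low 16 bits: I/Os; high 16 bits: ops
public:
  void begin_io() { word.fetch_add(1, std::memory_order_relaxed); }
  void end_io() { word.fetch_sub(1, std::memory_order_relaxed); }
  void begin_op()
  { word.fetch_add(uint32_t{1} << 16, std::memory_order_relaxed); }
  void end_op()
  { word.fetch_sub(uint32_t{1} << 16, std::memory_order_relaxed); }
  bool any_pending() const
  { return word.load(std::memory_order_acquire) != 0; }
};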

fil_space_t::add(), fil_ibd_create(): Attach the already opened
handle to the tablespace, and enforce the fil_system.n_open limit.

dict_boot(): Initialize fil_system.max_assigned_id.

srv_boot(): Call srv_thread_pool_init() before anything else,
so that files should be opened in the correct mode on Windows.

fil_ibd_create(): Create the file in OS_FILE_AIO mode, just like
fil_node_open_file_low() does it.

dict_table_t::is_accessible(): Replaces fil_table_accessible().

Reviewed by: Vladislav Vaintroub
2020-10-26 17:53:54 +02:00
Marko Mäkelä
45ed9dd957 MDEV-23855: Remove fil_system.LRU and reduce fil_system.mutex contention
Also fixes MDEV-23929: innodb_flush_neighbors is not being ignored
for system tablespace on SSD

When the maximum configured number of files is exceeded, InnoDB will
close data files. We used to maintain a fil_system.LRU list and
a counter fil_node_t::n_pending to achieve this, at the huge cost
of multiple fil_system.mutex operations per I/O operation.

fil_node_open_file_low(): Implement a FIFO replacement policy:
The last opened file will be moved to the end of fil_system.space_list,
and files will be closed from the start of the list. However, we will
not move tablespaces in fil_system.space_list while
i_s_tablespaces_encryption_fill_table() is executing
(producing output for INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION)
because it may cause information of some tablespaces to go missing.
We also avoid this in mariabackup --backup because datafiles_iter_next()
assumes that the ordering is not changed.
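
A sketch of the FIFO policy with hypothetical simplified types:

#include <list>
#include <string>

struct file_node { std::string name; bool is_open = false; };

// The most recently opened file moves to the end of the list; when
// the limit is exceeded, files are closed starting from the front.
void note_opened(std::list<file_node> &space_list,
                 std::list<file_node>::iterator opened,
                 unsigned &n_open, unsigned limit)
{
  opened->is_open = true;
  ++n_open;
  space_list.splice(space_list.end(), space_list, opened);
  for (auto i = space_list.begin();
       n_open > limit && i != space_list.end(); ++i)
    if (i->is_open && i != opened)
    {
      i->is_open = false; // close the least recently opened file
      --n_open;
    }
}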

IORequest: Fold more parameters to IORequest::type.

fil_space_t::io(): Replaces fil_io().

fil_space_t::flush(): Replaces fil_flush().

OS_AIO_IBUF: Remove. We will always issue synchronous reads of the
change buffer pages in buf_read_page_low().

We will always ignore some errors for background reads.

This should reduce fil_system.mutex contention a little.

fil_node_t::complete_write(): Replaces fil_node_t::complete_io().
On both read and write completion, fil_space_t::release_for_io()
will have to be called.

fil_space_t::io(): Do not acquire fil_system.mutex in the normal
code path.

xb_delta_open_matching_space(): Do not try to open the system tablespace
which was already opened. This fixes a file sharing violation in
mariabackup --prepare --incremental.

Reviewed by: Vladislav Vaintroub
2020-10-26 17:09:01 +02:00
Marko Mäkelä
3a9a3be1c6 MDEV-23855: Improve InnoDB log checkpoint performance
After MDEV-15053, MDEV-22871, and MDEV-23399 shifted the scalability
bottleneck, log checkpoints became the new bottleneck.

If innodb_io_capacity is set low or innodb_max_dirty_pct_lwm is
set high and the workload fits in the buffer pool, the page cleaner
thread will perform very little flushing. When we reach the capacity
of the circular redo log file ib_logfile0 and must initiate a checkpoint,
some 'furious flushing' will be necessary. (If innodb_flush_sync=OFF,
then flushing would continue at the innodb_io_capacity rate, and
writers would be throttled.)

We have the best chance of advancing the checkpoint LSN immediately
after a page flush batch has been completed. Hence, it is best to
perform checkpoints after every batch in the page cleaner thread,
attempting to run once per second.

By initiating high-priority flushing in the page cleaner as early
as possible, we aim to make the throughput more stable.

The function buf_flush_wait_flushed() used to sleep for 10ms, hoping
that the page cleaner thread would do something during that time.
The observed end result was that a large number of threads that call
log_free_check() would end up sleeping while nothing useful is happening.

We will revise the design so that in the default innodb_flush_sync=ON
mode, buf_flush_wait_flushed() will wake up the page cleaner thread
to perform the necessary flushing, and it will wait for a signal from
the page cleaner thread.
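
A minimal sketch of this wake-and-wait hand-shake, with hypothetical
names standing in for buf_flush_sync_lsn and the page cleaner signalling:

#include <condition_variable>
#include <cstdint>
#include <mutex>

std::mutex m;
std::condition_variable do_flush, done_flush;
uint64_t sync_lsn = 0;    // flush target published for the page cleaner
uint64_t flushed_lsn = 0; // progress reported by the page cleaner

// Publish a target, wake the page cleaner, and sleep on a condition
// variable instead of polling with a fixed 10ms sleep.
void wait_flushed(uint64_t target)
{
  std::unique_lock<std::mutex> lk(m);
  if (target > sync_lsn)
    sync_lsn = target;
  do_flush.notify_one();
  done_flush.wait(lk, [&] { return flushed_lsn >= target; });
}

// Invoked by the page cleaner at the end of each batch.
void report_flushed(uint64_t lsn)
{
  std::lock_guard<std::mutex> lk(m);
  flushed_lsn = lsn;
  done_flush.notify_all(); // resume waiters whose target was reached
}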

If innodb_io_capacity is set to a low value (causing the page cleaner to
throttle its work), a write workload would initially perform well, until
the capacity of the circular ib_logfile0 is reached and log_free_check()
will trigger checkpoints. At that point, the extra waiting in
buf_flush_wait_flushed() will start reducing throughput.

The page cleaner thread will also initiate log checkpoints after each
buf_flush_lists() call, because that is the best point of time for
the checkpoint LSN to advance by the maximum amount.

Even in 'furious flushing' mode we invoke buf_flush_lists() with
innodb_io_capacity_max pages at a time, and at the start of each
batch (in the log_flush() callback function that runs in a separate
task) we will invoke os_aio_wait_until_no_pending_writes(). This
tweak allows the checkpoint to advance in smaller steps and
significantly reduces the maximum latency. On an Intel Optane 960
NVMe SSD on Linux, it reduced from 4.6 seconds to 74 milliseconds.
On Microsoft Windows with a slower SSD, it reduced from more than
180 seconds to 0.6 seconds.

We will make innodb_adaptive_flushing=OFF simply flush innodb_io_capacity
per second whenever the dirty proportion of buffer pool pages exceeds
innodb_max_dirty_pages_pct_lwm. For innodb_adaptive_flushing=ON we try
to make page_cleaner_flush_pages_recommendation() more consistent and
predictable: if we are below innodb_adaptive_flushing_lwm, let us flush
pages according to the return value of af_get_pct_for_dirty().

innodb_max_dirty_pages_pct_lwm: Revert the change of the default value
that was made in MDEV-23399. The value innodb_max_dirty_pages_pct_lwm=0
guarantees that a shutdown of an idle server will be fast. Users might
be surprised if normal shutdown suddenly became slower when upgrading
within a GA release series.

innodb_checkpoint_usec: Remove. The master task will no longer perform
periodic log checkpoints. It is the duty of the page cleaner thread.

log_sys.max_modified_age: Remove. The current span of the
buf_pool.flush_list expressed in LSN only matters for adaptive
flushing (outside the 'furious flushing' condition).
For the correctness of checkpoints, the only thing that matters is
the checkpoint age (log_sys.lsn - log_sys.last_checkpoint_lsn).
This run-time constant was also reported as log_max_modified_age_sync.
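
A sketch of the checkpoint age computation with hypothetical
simplified types:

#include <cstdint>

struct log_state
{
  uint64_t lsn;                 // current log sequence number
  uint64_t last_checkpoint_lsn; // LSN of the latest checkpoint
  uint64_t max_checkpoint_age;  // capacity-derived hard limit

  // The only quantity that matters for checkpoint correctness.
  uint64_t checkpoint_age() const { return lsn - last_checkpoint_lsn; }

  bool need_furious_flushing() const
  { return checkpoint_age() >= max_checkpoint_age; }
};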

log_sys.max_checkpoint_age_async: Remove. This does not serve any
purpose, because the checkpoints will now be triggered by the page
cleaner thread. We will retain the log_sys.max_checkpoint_age limit
for engaging 'furious flushing'.

page_cleaner.slot: Remove. It turns out that
page_cleaner_slot.flush_list_time was duplicating
page_cleaner.slot.flush_time and page_cleaner.slot.flush_list_pass
was duplicating page_cleaner.flush_pass.
Likewise, there were some redundant monitor counters, because the
page cleaner thread no longer performs any buf_pool.LRU flushing, and
because there only is one buf_flush_page_cleaner thread.

buf_flush_sync_lsn: Protect writes by buf_pool.flush_list_mutex.

buf_pool_t::get_oldest_modification(): Add a parameter to specify the
return value when no persistent data pages are dirty. Require the
caller to hold buf_pool.flush_list_mutex.

log_buf_pool_get_oldest_modification(): Take the fall-back LSN
as a parameter. All callers will also invoke log_sys.get_lsn().

log_preflush_pool_modified_pages(): Replaced with buf_flush_wait_flushed().

buf_flush_wait_flushed(): Implement two limits. If not enough buffer pool
has been flushed, signal the page cleaner (unless innodb_flush_sync=OFF)
and wait for the page cleaner to complete. If the page cleaner
thread is not running (which can be the case during shutdown),
initiate the flush and wait for it directly.

buf_flush_ahead(): If innodb_flush_sync=ON (the default),
submit a new buf_flush_sync_lsn target for the page cleaner
but do not wait for the flushing to finish.

log_get_capacity(), log_get_max_modified_age_async(): Remove, to make
it easier to see that af_get_pct_for_lsn() is not acquiring any mutexes.

page_cleaner_flush_pages_recommendation(): Protect all access to
buf_pool.flush_list with buf_pool.flush_list_mutex. Previously there
were some race conditions in the calculation.

buf_flush_sync_for_checkpoint(): New function to process
buf_flush_sync_lsn in the page cleaner thread. At the end of
each batch, we try to wake up any blocked buf_flush_wait_flushed().
If everything up to buf_flush_sync_lsn has been flushed, we will
reset buf_flush_sync_lsn=0. The page cleaner thread will keep
'furious flushing' until the limit is reached. Any threads that
are waiting in buf_flush_wait_flushed() will be able to resume
as soon as their own limit has been satisfied.

buf_flush_page_cleaner: Prioritize buf_flush_sync_lsn and do not
sleep as long as it is set. Do not update any page_cleaner statistics
for this special mode of operation. In the normal mode
(buf_flush_sync_lsn is not set for innodb_flush_sync=ON),
try to wake up once per second. No longer check whether
srv_inc_activity_count() has been called. After each batch,
try to perform a log checkpoint, because the best chances for
the checkpoint LSN to advance by the maximum amount are upon
completing a flushing batch.

log_t: Move buf_free, max_buf_free possibly to the same cache line
with log_sys.mutex.

log_margin_checkpoint_age(): Simplify the logic, and replace
a 0.1-second sleep with a call to buf_flush_wait_flushed() to
initiate flushing. Moved to the same compilation unit
with the only caller.

log_close(): Clean up the calculations. (Should be no functional
change.) Return whether flush-ahead is needed. Moved to the same
compilation unit with the only caller.

mtr_t::finish_write(): Return whether flush-ahead is needed.

mtr_t::commit(): Invoke buf_flush_ahead() when needed. Let us avoid
external calls in mtr_t::commit() and make the logic easier to follow
by having related code in a single compilation unit. Also, we will
invoke srv_stats.log_write_requests.inc() only once per
mini-transaction commit, while not holding mutexes.

log_checkpoint_margin(): Only care about log_sys.max_checkpoint_age.
Upon reaching log_sys.max_checkpoint_age where we must wait to prevent
the log from getting corrupted, let us wait for at most 1MiB of LSN
at a time, before rechecking the condition. This should allow writers
to proceed even if the redo log capacity has been reached and
'furious flushing' is in progress. We no longer care about
log_sys.max_modified_age_sync or log_sys.max_modified_age_async.
The log_sys.max_modified_age_sync could be a relic from the time when
there was a srv_master_thread that wrote dirty pages to data files.
Also, we no longer have any log_sys.max_checkpoint_age_async limit,
because log checkpoints will now be triggered by the page cleaner
thread upon completing buf_flush_lists().
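
A sketch of the revised wait loop, with hypothetical callbacks standing
in for the flushing and checkpointing machinery:

#include <cstdint>
#include <functional>

// Instead of blocking until the whole backlog is flushed, wait for at
// most 1MiB of LSN at a time and recheck, so that writers can proceed
// while 'furious flushing' is in progress.
void checkpoint_margin(const std::function<uint64_t()> &checkpoint_age,
                       const std::function<void(uint64_t)> &wait_lsn_flushed,
                       uint64_t max_age)
{
  while (checkpoint_age() >= max_age)
    wait_lsn_flushed(uint64_t{1} << 20); // at most 1MiB of LSN per step
}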

log_set_capacity(): Simplify the calculations of the limit
(no functional change).

log_checkpoint_low(): Split from log_checkpoint(). Moved to the
same compilation unit with the caller.

log_make_checkpoint(): Only wait for everything to be flushed until
the current LSN.

create_log_file(): After checkpoint, invoke log_write_up_to()
to ensure that the FILE_CHECKPOINT record has been written.
This avoids ut_ad(!srv_log_file_created) in create_log_file_rename().

srv_start(): Do not call recv_recovery_from_checkpoint_start()
if the log has just been created. Set fil_system.space_id_reuse_warned
before dict_boot() has been executed, and clear it after recovery
has finished.

dict_boot(): Initialize fil_system.max_assigned_id.

srv_check_activity(): Remove. The activity count is counting transaction
commits and therefore mostly interesting for the purge of history.

BtrBulk::insert(): Do not explicitly wake up the page cleaner,
but do invoke srv_inc_activity_count(), because that counter is
still being used in buf_load_throttle_if_needed() for some
heuristics. (It might be cleaner to execute buf_load() in the
page cleaner thread!)

Reviewed by: Vladislav Vaintroub
2020-10-26 17:09:01 +02:00
Marko Mäkelä
bd67cb9284 MDEV-23399 fixup: Assertion bpage->in_file() failed
buf_flush_remove_pages(), buf_flush_dirty_pages(): Because
buf_page_t::state() is protected by buf_pool.mutex, which we
are not holding, the state may be BUF_BLOCK_REMOVE_HASH when
the page is being relocated. Let us relax these assertions
similar to buf_flush_validate_low().

The other in_file() assertions in buf0flu.cc look valid.
2020-10-26 17:09:01 +02:00
Marko Mäkelä
5999d5120e MDEV-23399 fixup: Avoid crash on Mariabackup shutdown
innodb_preshutdown(): Terminate the encryption threads before
the page cleaner thread can be shut down.

innodb_shutdown(): Always wait for the encryption threads and
page cleaner to shut down.

srv_shutdown_all_bg_threads(): Wait for the encryption threads and
the page cleaner to shut down. (After an aborted startup,
innodb_shutdown() would not be called.)

row_get_background_drop_list_len_low(): Remove.

os_thread_count: Remove. Alternatively, at the end of
srv_shutdown_all_bg_threads() we could try to wait longer
for the count to reach 0. On some platforms, an assertion
os_thread_count==0 could fail even after a small delay,
even though in the core dump all threads would have exited.

srv_shutdown_threads(): Renamed from srv_shutdown_all_bg_threads().
Do not wait for the page cleaner to shut down, because the later
innodb_shutdown(), which may invoke
logs_empty_and_mark_files_at_shutdown(), assumes that it exists.
2020-10-26 15:05:41 +02:00
Marko Mäkelä
7cffb5f6e8 MDEV-23399: Performance regression with write workloads
The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.

The configuration parameters will be changed as follows:

innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)

Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
The parameter innodb_lru_scan_depth (srv_LRU_scan_depth) is used as follows:
 * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
 * As a buf_pool.free limit in buf_LRU_list_batch() for terminating
   the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().

The status variables will be changed as follows:

innodb_buffer_pool_pages_flushed: This also includes the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably,
updated one by one in buf_flush_page() to give more real-time
statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call

When moving to a single buffer pool instance in MDEV-15058, we missed
some opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.

Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.

The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.

The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a unless log_warnings>2.

Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are currently exclusively locked.
The MDEV-13670 message "page_cleaner: 1000ms intended loop took"
will be removed, because by design, buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.

We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.

To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.

For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.

mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().

buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.

buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.

buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.

buf_pool_t::done_flush_list: Condition variable for !n_flush_list.

buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.
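
A minimal sketch of such a timed wait, with hypothetical names:

#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable do_flush_list;
bool work_pending = false;

// Sleep until woken, but wake at least once per second, because the
// innodb_adaptive_flushing calculations depend on fixed time intervals.
void page_cleaner_wait()
{
  std::unique_lock<std::mutex> lk(m);
  do_flush_list.wait_for(lk, std::chrono::seconds(1),
                         [] { return work_pending; });
}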

buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.

buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().

flush_counters_t::unzip_LRU_evicted: Remove.

IORequest: Make more members const. FIXME: m_fil_node should be removed.

buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).

page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.

pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().

recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().

recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.

srv_started_redo: Replaces srv_start_state.

SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().

buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().

buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
function, which was their only caller, and remove 2 unnecessary
buf_pool.mutex release/re-acquisition that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.

buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.

buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.
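
A sketch of the two skip conditions, with hypothetical simplified types:

#include <cstdint>
#include <shared_mutex>

struct page_slot
{
  std::shared_mutex lock; // stands in for buf_block_t::lock
  uint64_t newest_lsn;    // the FIL_PAGE_LSN that the write would stamp
};

// Never initiate a log write from here: skip the page if its latest
// modification is not yet durable, or if the latch is not free.
bool try_flush(page_slot &p, uint64_t flushed_to_disk_lsn)
{
  if (p.newest_lsn > flushed_to_disk_lsn)
    return false; // redo log not durably written yet
  if (!p.lock.try_lock_shared())
    return false; // exclusively latched by a writer
  // ... submit the asynchronous write here ...
  p.lock.unlock_shared(); // in InnoDB, released on I/O completion
  return true;
}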

buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().

buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.

buf_flush_check_neighbor(): Take id.fold() as a parameter.

buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.

buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.

buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_do_LRU_batch(): Return the number of pages flushed.

buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.

buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in the buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.

buf_LRU_check_size_of_non_data_objects(): Simplify the code.

buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.

buf_page_create(): Take a pre-allocated page as a parameter.

buf_pool_t::free_block(): Free a pre-allocated block.

recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.
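
A minimal sketch of this lock-ordering rule, with hypothetical
stand-ins for the mutexes and the recovery steps:

#include <mutex>

std::mutex log_mutex;  // stands in for log_sys.mutex
std::mutex recv_mutex; // stands in for recv_sys.mutex

// Allocating a block may initiate a page flush and hence a log flush,
// which acquires log_mutex.
static void *allocate_block()
{
  std::lock_guard<std::mutex> g(log_mutex);
  return new char[16384];
}

static void recover_one_page()
{
  void *block = allocate_block(); // correct: recv_mutex is NOT held here
  std::lock_guard<std::mutex> g(recv_mutex);
  // ... apply buffered redo log records to the block ...
  delete[] static_cast<char *>(block);
}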

BtrBulk::logFreeCheck(): Skip a redundant condition.

row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.

sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.

Reviewed by: Vladislav Vaintroub
2020-10-15 17:04:56 +03:00
Marko Mäkelä
46b1f50098 MDEV-23399: Remove buf_pool.flush_rbt
Normally, buf_pool.flush_list must be sorted by
buf_page_t::oldest_modification, so that log_checkpoint()
can choose MIN(oldest_modification) as the checkpoint LSN.

During recovery, buf_pool.flush_rbt used to guarantee the
ordering. However, we can allow the buf_pool.flush_list to
be in an arbitrary order during recovery, and simply ensure
that it is in the correct order by the time a log checkpoint
needs to be executed.
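
A sketch of the invariant, with a plain list of oldest_modification
values standing in for buf_pool.flush_list:

#include <cstdint>
#include <list>

// Assuming newly dirtied pages are inserted at the front, the smallest
// oldest_modification is at the back; an empty list means everything
// up to the current LSN is clean.
uint64_t checkpoint_lsn(const std::list<uint64_t> &flush_list,
                        uint64_t current_lsn)
{
  return flush_list.empty() ? current_lsn : flush_list.back();
}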

recv_sys_t::apply(): To keep it simple, we will always flush the
buffer pool at the end of each batch.

Note that log_checkpoint() will invoke recv_sys_t::apply() in case
a checkpoint is initiated during the last batch of recovery,
when we already allow writes to data pages and the redo log.

Reviewed by: Vladislav Vaintroub
2020-10-15 10:33:23 +03:00
Marko Mäkelä
b535a79044 MDEV-23399: Remove recv_writer_thread
Recovery works just fine without a separate thread whose only
task is to tell the page cleaner thread to do its job.

recv_sys_t::apply(): Flush the buffer pool at the end of each batch.

Reviewed by: Vladislav Vaintroub
2020-10-15 10:33:20 +03:00
Marko Mäkelä
fa70c1462a MDEV-23399 preparation: Remove buf_pool.zip_clean
The debug data structure may have been useful during the development of
ROW_FORMAT=COMPRESSED page frames. Let us simplify code by removing it.
2020-10-15 09:49:20 +03:00
Marko Mäkelä
d312d64146 MDEV-23909 innodb_flush_neighbors=2 is treated like innodb_flush_neighbors=0
In MDEV-15053 (commit b1ab211dee)
we inadvertently removed a check whether innodb_flush_neighbors is 0,
and thus started treating only the value 1 in a special way.

buf_flush_check_neighbors(): Add the parameter contiguous,
which can be set to skip the check for non-contiguous page number ranges.

Reviewed by: Thirunarayanan Balathandayuthapani
2020-10-08 11:13:47 +03:00
Marko Mäkelä
a9550c47e4 MDEV-16264 fixup: Remove unused code and data
LATCH_ID_OS_AIO_READ_MUTEX,
LATCH_ID_OS_AIO_WRITE_MUTEX,
LATCH_ID_OS_AIO_LOG_MUTEX,
LATCH_ID_OS_AIO_IBUF_MUTEX,
LATCH_ID_OS_AIO_SYNC_MUTEX: Remove. The tpool is not instrumented.

lock_set_timeout_event(): Remove.

srv_sys_mutex_key, srv_sys_t::mutex, SYNC_THREADS: Remove.

srv_slot_t::suspended: Remove. We only ever assigned this data member
true, so it is redundant.

ib_wqueue_wait(), ib_wqueue_timedwait(): Remove.

os_thread_join(): Remove.

os_thread_create(), os_thread_exit(): Remove redundant parameters.

These were missed in commit 5e62b6a5e0.
2020-09-30 14:28:11 +03:00
Marko Mäkelä
50a11f396a Merge 10.4 into 10.5 2020-08-01 14:42:51 +03:00
Marko Mäkelä
9216114ce7 Merge 10.3 into 10.4 2020-07-31 18:09:08 +03:00
Marko Mäkelä
66ec3a770f Merge 10.2 into 10.3 2020-07-31 13:51:28 +03:00
Thirunarayanan Balathandayuthapani
fe39d02f51 MDEV-20638 Remove the deadcode from srv_master_thread() and srv_active_wake_master_thread_low()
- Due to commit fe95cb2e40 (MDEV-16125),
the InnoDB master thread no longer needs to call srv_resume_thread(),
and therefore there is no need to wake up the thread.
Because of that patch, the following dead code can be removed.

srv_check_activity(): Make the parameter in/out, and return the
recent activity value.

innobase_active_small(): Removed

srv_active_wake_master_thread(): Removed

srv_wake_master_thread(): Removed

srv_active_wake_master_thread_low(): Removed

Simplify srv_master_thread(), remove the switch cases, and add an assertion.

Replace srv_wake_master_thread() with srv_inc_activity_count()

INNOBASE_WAKE_INTERVAL: Removed
2020-07-23 16:23:20 +05:30
Thirunarayanan Balathandayuthapani
92014bd1c6 MDEV-23252 Assertion failure 'req_type.is_dblwr_recover() || err == DB_SUCCESS' for page_compressed tables
- This issue was caused by a5584b13d1
(MDEV-15528), which added os_file_punch_hole() to fil_io()
but failed to handle a failure of os_file_punch_hole(). InnoDB should
handle the DB_IO_NO_PUNCH_HOLE error and silently transform it to
DB_SUCCESS, and should set the punch-hole flag correctly when the
tablespace is loaded.

fil_node_t::read_page0(): Set the punch-hole flag when the tablespace is loaded.

fil_io(): Handle the DB_IO_NO_PUNCH_HOLE error

buf_flush_free_pages(): Check the punch-hole condition earlier, using the
tablespace punch-hole flag.
2020-07-22 16:10:56 +05:30
Marko Mäkelä
4d4865de6f Merge 10.4 into 10.5 2020-07-20 15:55:59 +03:00
Marko Mäkelä
4b959bd8df Merge 10.3 into 10.4 2020-07-20 15:34:59 +03:00
Marko Mäkelä
acc58fd835 Merge 10.2 into 10.3 2020-07-20 15:11:59 +03:00
Marko Mäkelä
ca9276e37e Merge 10.1 into 10.2 2020-07-20 14:53:24 +03:00
Marko Mäkelä
57ec42bc32 MDEV-23190 InnoDB data file extension is not crash-safe
When InnoDB is extending a data file, it is updating the FSP_SIZE
field in the first page of the data file.

In commit 8451e09073 (MDEV-11556)
we removed a work-around for this bug and made recovery stricter,
by making it track changes to FSP_SIZE via redo log records, and
extend the data files before any changes are being applied to them.

It turns out that the function fsp_fill_free_list() is not crash-safe
with respect to this when it is initializing the change buffer bitmap
page (page 1, or generally, N*innodb_page_size+1). It uses a separate
mini-transaction that is committed (and will be written to the redo
log file) before the mini-transaction that actually extended the data
file. Hence, recovery can observe a reference to a page that is
beyond the current end of the data file.

fsp_fill_free_list(): Initialize the change buffer bitmap page in
the same mini-transaction.

The rest of the changes are fixing a bug that the use of the separate
mini-transaction was attempting to work around. Namely, we must ensure
that no other thread will access the change buffer bitmap page before
our mini-transaction has been committed and all page latches have been
released.

That is, for read-ahead as well as neighbour flushing, we must avoid
accessing pages that might not yet be durably part of the tablespace.

fil_space_t::committed_size: The size of the tablespace
as persisted by mtr_commit().

fil_space_t::max_page_number_for_io(): Limit the highest page
number for I/O batches to committed_size.

MTR_MEMO_SPACE_X_LOCK: Replaces MTR_MEMO_X_LOCK for fil_space_t::latch.

mtr_x_space_lock(): Replaces mtr_x_lock() for fil_space_t::latch.

mtr_memo_slot_release_func(): When releasing MTR_MEMO_SPACE_X_LOCK,
copy space->size to space->committed_size. In this way, read-ahead
or flushing will never be invoked on pages that do not yet exist
according to FSP_SIZE.
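
A sketch of the committed_size idea, with hypothetical simplified types:

#include <algorithm>
#include <atomic>
#include <cstdint>

struct tablespace
{
  uint32_t size = 0;                       // FSP_SIZE, updated under latch
  std::atomic<uint32_t> committed_size{0}; // size as persisted by commit

  // Called when the mini-transaction holding the X-latch commits.
  void on_commit()
  { committed_size.store(size, std::memory_order_release); }

  // Read-ahead and neighbour flushing never go past the committed size.
  uint32_t max_page_number_for_io(uint32_t limit) const
  {
    const uint32_t cs = committed_size.load(std::memory_order_acquire);
    return cs ? std::min(limit, cs - 1) : 0;
  }
};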
2020-07-20 14:48:56 +03:00
Marko Mäkelä
1813d92d0c Merge 10.4 into 10.5 2020-07-02 09:41:44 +03:00
Marko Mäkelä
f347b3e0e6 Merge 10.3 into 10.4 2020-07-02 07:39:33 +03:00
Marko Mäkelä
1df1a63924 Merge 10.2 into 10.3 2020-07-02 06:17:51 +03:00
Marko Mäkelä
c36834c832 MDEV-20377: Make WITH_MSAN more usable
MemorySanitizer (clang -fsanitize=memory) requires that all code
be compiled with instrumentation enabled. The only exception is the
C runtime library. Failure to use instrumented libraries will cause
bogus messages about memory being uninitialized.

In WITH_MSAN builds, we must avoid calling getservbyname(),
because even though it is a standard library function, it is
not instrumented, not even in clang 10.

Note: Before MariaDB Server 10.5, ./mtr will typically fail
due to the old PCRE library, which was updated in MDEV-14024.

The following cmake options were tested on 10.5
in commit 94d0bb4dbe:

cmake \
-DCMAKE_C_FLAGS='-march=native -O2' \
-DCMAKE_CXX_FLAGS='-stdlib=libc++ -march=native -O2' \
-DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=Debug \
-DWITH_INNODB_{BZIP2,LZ4,LZMA,LZO,SNAPPY}=OFF \
-DPLUGIN_{ARCHIVE,TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT,SPIDER}=NO \
-DWITH_SAFEMALLOC=OFF \
-DWITH_{ZLIB,SSL,PCRE}=bundled \
-DHAVE_LIBAIO_H=0 \
-DWITH_MSAN=ON

MEM_MAKE_DEFINED(): An alias for VALGRIND_MAKE_MEM_DEFINED()
and __msan_unpoison().

MEM_GET_VBITS(), MEM_SET_VBITS(): Aliases for
VALGRIND_GET_VBITS(), VALGRIND_SET_VBITS(), __msan_copy_shadow().

InnoDB: Replace the UNIV_MEM_ macros with corresponding MEM_ macros.

ut_crc32_8_hw(), ut_crc32_64_low_hw(): Use the compiler built-in
functions instead of inline assembler when building WITH_MSAN.
This will require at least -msse4.2 when building for IA-32 or AMD64.
The inline assembler would not be instrumented, and would thus cause
bogus failures.
2020-07-01 17:23:00 +03:00
Marko Mäkelä
d2c593c2a6 MDEV-22877 Avoid unnecessary buf_pool.page_hash S-latch acquisition
MDEV-15053 did not remove all unnecessary buf_pool.page_hash S-latch
acquisition. There are code paths where we are holding buf_pool.mutex
(which will sufficiently protect buf_pool.page_hash against changes)
and unnecessarily acquire the latch. Many invocations of
buf_page_hash_get_locked() can be replaced with the much simpler
buf_pool.page_hash_get_low().

In the worst case the thread that is holding buf_pool.mutex will become
a victim of MDEV-22871, suffering from a spurious reader-reader conflict
with another thread that genuinely needs to acquire a buf_pool.page_hash
S-latch.

In many places, we were also evaluating page_id_t::fold() while holding
buf_pool.mutex. Low-level functions such as buf_pool.page_hash_get_low()
must get the page_id_t::fold() as a parameter.
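
A sketch of passing the precomputed fold, with hypothetical
simplified types:

#include <cstdint>

struct page_id
{
  uint32_t space, page_no;
  uint64_t fold() const { return (uint64_t{space} << 32) | page_no; }
};

// Stub standing in for buf_pool.page_hash_get_low(), which expects the
// precomputed fold instead of recomputing it under buf_pool.mutex.
static void *page_hash_get_low(const page_id &, uint64_t)
{ return nullptr; }

static void *lookup(const page_id &id)
{
  const uint64_t fold = id.fold(); // computed before entering the mutex
  // mutex_enter(&buf_pool.mutex);
  void *page = page_hash_get_low(id, fold);
  // mutex_exit(&buf_pool.mutex);
  return page;
}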

buf_buddy_relocate(): Defer the hash_lock acquisition to the critical
section that starts by calling buf_page_t::can_relocate().
2020-06-12 16:22:03 +03:00
Thirunarayanan Balathandayuthapani
c92f7e287f MDEV-8139 Fix Scrubbing
fil_space_t::freed_ranges: Store ranges of freed page numbers.

fil_space_t::last_freed_lsn: Store the most recent LSN of
freeing a page.

fil_space_t::freed_mutex: Protects freed_ranges, last_freed_lsn.

fil_space_create(): Initialize the freed_range mutex.

fil_space_free_low(): Frees the freed_range mutex.

range_set: Ranges of page numbers.
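
A sketch of such a range set, with hypothetical simplified semantics
(merging of adjacent ranges omitted):

#include <cstdint>
#include <iterator>
#include <map>

// Disjoint [first, last] ranges keyed by the first page number.
class range_set
{
  std::map<uint32_t, uint32_t> ranges;
public:
  void add(uint32_t first, uint32_t last) { ranges[first] = last; }

  bool contains(uint32_t page) const
  {
    auto it = ranges.upper_bound(page);
    return it != ranges.begin() && page <= std::prev(it)->second;
  }

  // Remove one page, splitting the covering range if necessary.
  void remove(uint32_t page)
  {
    auto it = ranges.upper_bound(page);
    if (it == ranges.begin() || page > std::prev(it)->second)
      return;
    --it;
    const uint32_t first = it->first, last = it->second;
    ranges.erase(it);
    if (page > first) ranges[first] = page - 1;
    if (page < last) ranges[page + 1] = last;
  }
};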

buf_page_create(): Remove the page from freed_ranges when the page
is being reused.

btr_free_root(): Remove the PAGE_INDEX_ID invalidation. Because
btr_free_root() and dict_drop_index_tree() are executed in
the same atomic mini-transaction, there is no need to
invalidate the root page.

buf_release_freed_page(): Split from buf_flush_freed_page().
Skip any I/O.

buf_flush_freed_pages(): Get the freed ranges from the tablespace and
write punch holes or zeroes for the freed ranges.

buf_flush_try_neighbors(): Handles the flushing of freed ranges.

mtr_t::freed_pages: Variable to store the list of freed pages.

mtr_t::add_freed_pages(): To add freed pages.

mtr_t::clear_freed_pages(): To clear the freed pages.

mtr_t::m_freed_in_system_tablespace: Variable to indicate whether a page
has been freed in the system tablespace.

mtr_t::m_trim_pages: Variable to indicate whether the space has been trimmed.

mtr_t::commit(): Add the freed pages, update the last freed LSN
in the tablespace, and clear the tablespace freed ranges if the space
has been trimmed.

file_name_t::freed_pages: Store the freed pages during recovery.

file_name_t::add_freed_page(), file_name_t::remove_freed_page():
Add and remove a freed page during recovery.

store_freed_or_init_rec(): Store or remove freed pages when
encountering a FREE_PAGE or INIT_PAGE redo log record.

recv_init_crash_recovery_spaces(): Add freed pages encountered
during recovery to the respective tablespace.
2020-06-12 09:17:51 +05:30
Marko Mäkelä
1b01833a4b MDEV-15053 follow-up to reduce buf_pool.mutex contention
buf_LRU_make_block_young(): Merge with buf_page_make_young().

buf_pool_check_no_pending_io(): Remove. Replaced with
buf_pool.any_io_pending() and buf_pool.io_pending(),
which do not unnecessarily acquire buf_pool.mutex.

buf_pool_t::init_flush[]: Use atomic access, so that
buf_flush_wait_LRU_batch_end() can avoid acquiring buf_pool.mutex.

buf_pool_t::try_LRU_scan: Declare as bool.
2020-06-08 20:58:44 +03:00
Marko Mäkelä
6877ef9a7c Merge 10.4 into 10.5 2020-06-05 20:36:43 +03:00
Marko Mäkelä
68d9d512e9 Merge 10.3 into 10.4 2020-06-05 18:05:22 +03:00
Marko Mäkelä
680463a8d9 Merge 10.2 into 10.3 2020-06-05 16:51:26 +03:00
Marko Mäkelä
efc70da5fd MDEV-22769 Shutdown hang or crash due to XA breaking locks
The background drop table queue in InnoDB is a work-around for
cases where the SQL layer is requesting DDL on tables on which
transactional locks exist.

One such case are XA transactions. Our test case exploits the
fact that the recovery of XA PREPARE transactions will
only resurrect InnoDB table locks, but not MDL that should
block any concurrent DDL.

srv_shutdown_t: Introduce the srv_shutdown_state=SRV_SHUTDOWN_INITIATED
for the initial part of shutdown, to wait for the background drop
table queue to be emptied.

srv_shutdown_bg_undo_sources(): Assign
srv_shutdown_state=SRV_SHUTDOWN_INITIATED
before waiting for the background drop table queue to be emptied.

row_drop_tables_for_mysql_in_background(): On slow shutdown, if
no active transactions exist (excluding ones that are in
XA PREPARE state), skip any tables on which locks exist.

row_drop_table_for_mysql(): Do not unnecessarily attempt to
drop InnoDB persistent statistics for tables that have
already been added to the background drop table queue.

row_mysql_close(): Relax an assertion, and free all memory
even if innodb_force_recovery=2 would prevent the background
drop table queue from being emptied.
2020-06-05 15:22:46 +03:00
Marko Mäkelä
b1ab211dee MDEV-15053 Reduce buf_pool_t::mutex contention
User-visible changes: The INFORMATION_SCHEMA views INNODB_BUFFER_PAGE
and INNODB_BUFFER_PAGE_LRU will report a dummy value FLUSH_TYPE=0
and will no longer report the PAGE_STATE value READY_FOR_USE.

We will remove some fields from buf_page_t and move much code to
member functions of buf_pool_t and buf_page_t, so that the access
rules of data members can be enforced consistently.

Evicting or adding pages in buf_pool.LRU will remain covered by
buf_pool.mutex.

Evicting or adding pages in buf_pool.page_hash will remain
covered by both buf_pool.mutex and the buf_pool.page_hash X-latch.

After this fix, buf_pool.page_hash lookups can entirely
avoid acquiring buf_pool.mutex, only relying on
buf_pool.hash_lock_get() S-latch.

Similarly, buf_flush_check_neighbors() will rely solely on
buf_pool.mutex, with no buf_pool.page_hash latch at all.

The buf_pool.mutex is rather contended in I/O heavy benchmarks,
especially when the workload does not fit in the buffer pool.

The first attempt to alleviate the contention was the
buf_pool_t::mutex split in
commit 4ed7082eef
which introduced buf_block_t::mutex, which we are now removing.

Later, multiple instances of buf_pool_t were introduced
in commit c18084f71b
and recently removed by us in
commit 1a6f708ec5 (MDEV-15058).

UNIV_BUF_DEBUG: Remove. This option to enable some buffer pool
related debugging in otherwise non-debug builds has not been used
for years. Instead, we have been using UNIV_DEBUG, which is enabled
in CMAKE_BUILD_TYPE=Debug.

buf_block_t::mutex, buf_pool_t::zip_mutex: Remove. We can mainly rely on
std::atomic and the buf_pool.page_hash latches, and in some cases
depend on buf_pool.mutex or buf_pool.flush_list_mutex just like before.
We must always release buf_block_t::lock before invoking
unfix() or io_unfix(), to prevent a glitch where a block that was
added to the buf_pool.free list would appear X-latched. See
commit c5883debd6 for how this glitch
was finally caught in a debug environment.

We move some buf_pool_t::page_hash specific code from the
ha and hash modules to buf_pool, for improved readability.

buf_pool_t::close(): Assert that all blocks are clean, except
on aborted startup or crash-like shutdown.

buf_pool_t::validate(): No longer attempt to validate
n_flush[] against the number of BUF_IO_WRITE fixed blocks,
because buf_page_t::flush_type no longer exists.

buf_pool_t::watch_set(): Replaces buf_pool_watch_set().
Reduce mutex contention by separating the buf_pool.watch[]
allocation and the insert into buf_pool.page_hash.

buf_pool_t::page_hash_lock<bool exclusive>(): Acquire a
buf_pool.page_hash latch.
Replaces and extends buf_page_hash_lock_s_confirm()
and buf_page_hash_lock_x_confirm().

buf_pool_t::READ_AHEAD_PAGES: Renamed from BUF_READ_AHEAD_PAGES.

buf_pool_t::curr_size, old_size, read_ahead_area, n_pend_reads:
Use Atomic_counter.

buf_pool_t::running_out(): Replaces buf_LRU_buf_pool_running_out().

buf_pool_t::LRU_remove(): Remove a block from the LRU list
and return its predecessor. Incorporates buf_LRU_adjust_hp(),
which was removed.

buf_page_get_gen(): Remove a redundant call of fsp_is_system_temporary(),
for mode == BUF_GET_IF_IN_POOL_OR_WATCH, which is only used by
BTR_DELETE_OP (purge), which is never invoked on temporary tables.

buf_free_from_unzip_LRU_list_batch(): Avoid redundant assignments.

buf_LRU_free_from_unzip_LRU_list(): Simplify the loop condition.

buf_LRU_free_page(): Clarify the function comment.

buf_flush_check_neighbor(), buf_flush_check_neighbors():
Rewrite the construction of the page hash range. We will hold
the buf_pool.mutex for up to buf_pool.read_ahead_area (at most 64)
consecutive lookups of buf_pool.page_hash.

buf_flush_page_and_try_neighbors(): Remove.
Merge to its only callers, and remove redundant operations in
buf_flush_LRU_list_batch().

buf_read_ahead_random(), buf_read_ahead_linear(): Rewrite.
Do not acquire buf_pool.mutex, and iterate directly with page_id_t.

ut_2_power_up(): Remove. my_round_up_to_next_power() is inlined
and avoids any loops.

fil_page_get_prev(), fil_page_get_next(), fil_addr_is_null(): Remove.

buf_flush_page(): Add a fil_space_t* parameter. Minimize the
buf_pool.mutex hold time. buf_pool.n_flush[] is no longer updated
atomically with the io_fix, and we will protect most buf_block_t
fields with buf_block_t::lock. The function
buf_flush_write_block_low() is removed and merged here.

buf_page_init_for_read(): Use static linkage. Initialize the newly
allocated block and acquire the exclusive buf_block_t::lock while not
holding any mutex.

IORequest::IORequest(): Remove the body. We only need to invoke
set_punch_hole() in buf_flush_page() and nowhere else.

buf_page_t::flush_type: Remove. Replaced by IORequest::flush_type.
This field is only used during a fil_io() call.
That function already takes IORequest as a parameter, so we had
better introduce IORequest::flush_type for this rarely changing field.

buf_block_t::init(): Replaces buf_page_init().

buf_page_t::init(): Replaces buf_page_init_low().

buf_block_t::initialise(): Initialise many fields, but
keep the buf_page_t::state(). Both buf_pool_t::validate() and
buf_page_optimistic_get() require that buf_page_t::in_file()
be protected atomically with buf_page_t::in_page_hash
and buf_page_t::in_LRU_list.

buf_page_optimistic_get(): Now that buf_block_t::mutex
no longer exists, we must check buf_page_t::io_fix()
after acquiring the buf_pool.page_hash lock, to detect
whether buf_page_init_for_read() has been initiated.
We will also check the io_fix() before acquiring hash_lock
in order to avoid unnecessary computation.
The field buf_block_t::modify_clock (protected by buf_block_t::lock)
allows buf_page_optimistic_get() to validate the block.

buf_page_t::real_size: Remove. It was only used while flushing
pages of page_compressed tables.

buf_page_encrypt(): Add an output parameter that allows us to eliminate
buf_page_t::real_size. Replace a condition with a debug assertion.

buf_page_should_punch_hole(): Remove.

buf_dblwr_t::add_to_batch(): Replaces buf_dblwr_add_to_batch().
Add the parameter size (to replace buf_page_t::real_size).

buf_dblwr_t::write_single_page(): Replaces buf_dblwr_write_single_page().
Add the parameter size (to replace buf_page_t::real_size).

fil_system_t::detach(): Replaces fil_space_detach().
Ensure that fil_validate() will not be violated even if
fil_system.mutex is released and reacquired.

fil_node_t::complete_io(): Renamed from fil_node_complete_io().

fil_node_t::close_to_free(): Replaces fil_node_close_to_free().
Avoid invoking fil_node_t::close() because fil_system.n_open
has already been decremented in fil_space_t::detach().

BUF_BLOCK_READY_FOR_USE: Remove. Directly use BUF_BLOCK_MEMORY.

BUF_BLOCK_ZIP_DIRTY: Remove. Directly use BUF_BLOCK_ZIP_PAGE,
and distinguish dirty pages by buf_page_t::oldest_modification().

BUF_BLOCK_POOL_WATCH: Remove. Use BUF_BLOCK_NOT_USED instead.
This state was only being used for buf_page_t that are in
buf_pool.watch.

buf_pool_t::watch[]: Remove pointer indirection.

buf_page_t::in_flush_list: Remove. It was set if and only if
buf_page_t::oldest_modification() is nonzero.

buf_page_decrypt_after_read(), buf_corrupt_page_release(),
buf_page_check_corrupt(): Change the const fil_space_t* parameter
to const fil_node_t& so that we can report the correct file name.

buf_page_monitor(): Declare as an ATTRIBUTE_COLD global function.

buf_page_io_complete(): Split to buf_page_read_complete() and
buf_page_write_complete().

buf_dblwr_t::in_use: Remove.

buf_dblwr_t::buf_block_array: Add IORequest::flush_t.

buf_dblwr_sync_datafiles(): Remove. It was a useless wrapper of
os_aio_wait_until_no_pending_writes().

buf_flush_write_complete(): Declare static, not global.
Add the parameter IORequest::flush_t.

buf_flush_freed_page(): Simplify the code.

recv_sys_t::flush_lru: Renamed from flush_type and changed to bool.

fil_read(), fil_write(): Replaced with direct use of fil_io().

fil_buffering_disabled(): Remove. Check srv_file_flush_method directly.

fil_mutex_enter_and_prepare_for_io(): Return the resolved
fil_space_t* to avoid a duplicated lookup in the caller.

fil_report_invalid_page_access(): Clean up the parameters.

fil_io(): Return fil_io_t, which comprises fil_node_t and error code.
Always invoke fil_space_t::acquire_for_io() and let either the
sync=true caller or fil_aio_callback() invoke
fil_space_t::release_for_io().

fil_aio_callback(): Rewrite to replace buf_page_io_complete().

fil_check_pending_operations(): Remove a parameter, and remove some
redundant lookups.

fil_node_close_to_free(): Wait for n_pending==0. Because we no longer
do an extra lookup of the tablespace between fil_io() and the
completion of the operation, we must give fil_node_t::complete_io() a
chance to decrement the counter.

fil_close_tablespace(): Remove unused parameter trx, and document
that this is only invoked during the error handling of IMPORT TABLESPACE.

row_import_discard_changes(): Merged with the only caller,
row_import_cleanup(). Do not lock up the data dictionary while
invoking fil_close_tablespace().

logs_empty_and_mark_files_at_shutdown(): Do not invoke
fil_close_all_files(), to avoid a !needs_flush assertion failure
on fil_node_t::close().

innodb_shutdown(): Invoke os_aio_free() before fil_close_all_files().

fil_close_all_files(): Invoke fil_flush_file_spaces()
to ensure proper durability.

thread_pool::unbind(): Fix a crash that would occur on Windows
after srv_thread_pool->disable_aio() and os_file_close().
This fix was submitted by Vladislav Vaintroub.

Thanks to Matthias Leich and Axel Schwenke for extensive testing,
Vladislav Vaintroub for helpful comments, and Eugene Kosov for a review.
2020-06-05 12:35:46 +03:00
Marko Mäkelä
eba2d10ac5 MDEV-22721 Remove bloat caused by InnoDB logger class
Introduce a new ATTRIBUTE_NOINLINE to
ib::logger member functions, and add UNIV_UNLIKELY hints to callers.

Also, remove some crash reporting output. If needed, the
information will be available using debugging tools.

Furthermore, remove some fts_enable_diag_print output that included
indexed words in raw form. The code seemed to assume that words are
NUL-terminated byte strings. It is not clear whether a NUL terminator
is always guaranteed to be present. Also, UCS2 or UTF-16 strings would
typically contain many NUL bytes.
2020-06-04 10:24:10 +03:00
Marko Mäkelä
701efbb25b Merge 10.4 into 10.5 2020-06-03 09:45:39 +03:00
Marko Mäkelä
8059148154 Merge 10.3 into 10.4 2020-06-03 07:32:09 +03:00
Marko Mäkelä
8300f639a1 Merge 10.2 into 10.3 2020-06-02 10:25:11 +03:00
Marko Mäkelä
0d6d63e150 MDEV-22027 Assertion oldest_lsn >= log_sys.last_checkpoint_lsn failed
log_buf_pool_get_oldest_modification(): Acquire
log_sys_t::flush_order_mutex in order to prevent a race condition
that was introduced in
commit 1a6f708ec5 (MDEV-15058).

Before that change, log_buf_pool_get_oldest_modification()
was protected by both log_sys.mutex and log_sys.flush_order_mutex
like it was supposed to be ever since
commit a52c4820a3 (MySQL 5.5.10).

buf_pool_t::get_oldest_modification(): Replaces
buf_pool_get_oldest_modification(), to emphasize that
log_sys.flush_order_mutex must be acquired by the caller if needed.

log_close(): Invoke log_buf_pool_get_oldest_modification()
in order to ensure a clean shutdown.

The scenario of the race condition is as follows:

1. The buffer pool is clean (no writes are pending).
2. mtr_add_dirtied_pages_to_flush_list() releases log_sys.mutex.
3. log_buf_pool_get_oldest_modification() observes that the
buffer pool is clean and returns log_sys.lsn.
4. log_checkpoint() completes, writing a wrong checkpoint header
according to which everything up to log_sys.lsn was clean.
5. mtr_add_dirtied_pages_to_flush_list() completes the execution
of mtr_memo_note_modifications(), releases the page latches and
the flush_order_mutex.
6. On a subsequent log_checkpoint(), the assertion could fail
if the page modifications had not been flushed yet.

The failing assertion (which is valid) was added in MySQL 5.7
mysql/mysql-server@5c6c6ec693
and merged to MariaDB Server 10.2.2 in
commit fec844aca8.
2020-06-02 08:17:10 +03:00
Thirunarayanan Balathandayuthapani
02f68552a4 MDEV-22650 Dirty compressed page checksum validation fails
Problem:
=======
  While evicting an uncompressed page from the buffer pool, InnoDB writes
the checksum for the compressed page in buf_LRU_free_page().
So, while flushing the compressed page, checksum validation fails
when the innodb_checksum_algorithm variable has been changed to strict_none.

Solution:
========
- Calculate the checksum only when flushing the page. Remove the
checksum write from buf_LRU_free_page().
2020-06-01 14:34:16 +05:30
Marko Mäkelä
64488a6f2d Cleanup: Remove global functions or redundant parameters
fil_flush_file_spaces(): Remove the constant parameter.

buf_flush_start(), buf_flush_end(): Use static linkage.

fil_page_get_prev(), fil_page_get_next(), fil_addr_is_null(): Remove.
2020-05-05 07:06:37 +03:00
Marko Mäkelä
a786f50de5 MDEV-21962 Allocate buf_pool statically
Thanks to MDEV-15058, there is only one InnoDB buffer pool.
Allocating buf_pool statically removes one level of pointer indirection
and makes code more readable, and removes the awkward initialization of
some buf_pool members.

While doing this, we will also declare some buf_pool_t data members
private and replace some functions with member functions. This is
mostly affecting buffer pool resizing.

This is not aiming to be a complete rewrite of buf_pool_t to
a proper class. Most of the buffer pool interface, such as
buf_page_get_gen(), will remain in the C programming style
for now.

buf_pool_t::withdrawing: Replaces buf_pool_withdrawing.
buf_pool_t::withdraw_clock_: Replaces buf_withdraw_clock.

buf_pool_t::create(): Replaces buf_pool_init().
buf_pool_t::close(): Replaces buf_pool_free().

buf_pool_t::will_be_withdrawn(): Replaces buf_block_will_be_withdrawn(),
buf_frame_will_be_withdrawn().

buf_pool_t::clear_hash_index(): Replaces buf_pool_clear_hash_index().
buf_pool_t::get_n_pages(): Replaces buf_pool_get_n_pages().
buf_pool_t::validate(): Replaces buf_validate().
buf_pool_t::print(): Replaces buf_print().
buf_pool_t::block_from_ahi(): Replaces buf_block_from_ahi().
buf_pool_t::is_block_field(): Replaces buf_pointer_is_block_field().
buf_pool_t::is_block_mutex(): Replaces buf_pool_is_block_mutex().
buf_pool_t::is_block_lock(): Replaces buf_pool_is_block_lock().
buf_pool_t::is_obsolete(): Replaces buf_pool_is_obsolete().
buf_pool_t::io_buf: Make default-constructible.
buf_pool_t::io_buf::create(): Delayed 'constructor'
buf_pool_t::io_buf::close(): Early 'destructor'

HazardPointer: Make default-constructible. Define all member functions
inline, also for derived classes.
2020-03-18 22:32:40 +02:00
Marko Mäkelä
fbe662a503 MDEV-15058: Remove buf_pool_get_dirty_pages_count()
Starting with commit 1a6f708ec5
the function buf_pool_get_dirty_pages_count() is only used
in a debug check. It was dead code for non-debug builds.

buf_flush_dirty_pages(): Perform the debug check inline,
and replace the assertion
	ut_ad(first || buf_pool_get_dirty_pages_count(id) == 0);
with another one that is executed while holding the mutexes:
	ut_ad(id != bpage->id.space());
2020-03-13 10:09:15 +02:00
Marko Mäkelä
574d8b2940 MDEV-21907: Fix most clang -Wconversion in InnoDB
Declare innodb_purge_threads as 4-byte integer (UINT)
instead of 4-or-8-byte (ULONG) and adjust the documentation string.
2020-03-11 08:29:48 +02:00
Marko Mäkelä
e2e2f89303 MDEV-15528: Minor cleanup
buf_flush_freed_page(): Reformat in the common style, and
simplify some code. Prefer to request all information from
smaller data structures (buf_page_t) than from fil_space_t
or the global variable srv_immediate_scrub_data_uncompressed.

SysTablespace::open_or_create(): Assert that the temporary
tablespace will not be created in page_compressed format, so that
buf_flush_freed_page() can avoid checking that on every call.

IORequest: Remove duplicated constructors, and do not explicitly
declare a default constructor.
2020-03-10 09:53:23 +01:00
Thirunarayanan Balathandayuthapani
a35b4ae898 MDEV-15528 Punch holes when pages are freed
When an InnoDB data file page is freed, its contents become garbage,
and any storage allocated in the data file is wasted. During flushing,
InnoDB initializes the page with zeros if scrubbing is enabled. If the
tablespace is compressed, then InnoDB should punch a hole; otherwise, it
should ignore the flushing of the freed page.

buf_page_t:
- Replace the variables file_page_was_freed and init_on_flush in buf_page_t
with a status enum variable.
- Change all debug assertions on file_page_was_freed to DBUG_ASSERT
on buf_page_t::status.

Removed buf_page_set_file_page_was_freed() and
buf_page_reset_file_page_was_freed().

buf_page_free(): Newly added function which takes an X-lock on the page
before marking the status as FREED, so that the InnoDB flush handler can
avoid a concurrent flush of the freed page. Also, while flushing the page,
InnoDB makes sure that the redo log which does the freeing of the page has
also been written to disk. Currently, this function only marks the page
as FREED if it is in the buffer pool.

buf_flush_freed_page(): Newly added function which writes zeros
asynchronously if innodb_immediate_scrub_data_uncompressed is enabled,
and punches a hole in the file synchronously if page_compressed is enabled.
Reset the io_fix to NORMAL. Release the block from the flush list and the
associated mutex before writing zeros or punching a hole in the file.

buf_flush_page(): Remove the unnecessary use of the temporary
variable "flush".

fil_io(): Introduce a new parameter called punch_hole. It allows fil_io()
to punch a hole in the file at the given offset.

buf_page_create(): Let the callers assign buf_page_t::status.
Every caller should eventually invoke mtr_t::init().

fsp_page_create(): Remove the unused mtr_t parameter.

In all other callers of buf_page_create() except fsp_page_create(),
before invoking mtr_t::init(), invoke
mtr_t::sx_latch_at_savepoint() or mtr_t::x_latch_at_savepoint().

mtr_t::init(): Initialize buf_page_t::status also for the temporary
tablespace (when redo logging is disabled), to avoid assertion failures.
2020-03-10 10:51:08 +05:30
Marko Mäkelä
a4ab54d70f MDEV-14425 Cleanup: Use std::atomic for some log_sys members
Some fields were protected by log_sys.mutex, which adds quite some
overhead for readers. Some readers were submitting dirty reads.

log_t::lsn: Declare private and atomic. Add wrappers get_lsn()
and set_lsn() that will use relaxed memory access. Many accesses
to log_sys.lsn are still protected by log_sys.mutex; we avoid the
mutex for some readers.
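
A minimal sketch of the wrapper, with hypothetical simplified members:

#include <atomic>
#include <cstdint>

// The LSN is private and atomic; readers that do not hold
// log_sys.mutex use relaxed loads instead of dirty reads of a plain
// 64-bit field.
class log
{
  std::atomic<uint64_t> lsn{0};
public:
  uint64_t get_lsn() const { return lsn.load(std::memory_order_relaxed); }
  void set_lsn(uint64_t new_lsn)
  { lsn.store(new_lsn, std::memory_order_relaxed); }
};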

log_t::flushed_to_disk_lsn: Declare private and atomic, and move
to the same cache line with log_t::lsn.

log_t::buf_free: Declare as size_t, and move to the same cache line
with log_t::lsn.

log_t::check_flush_or_checkpoint_: Declare private and atomic,
and move to the same cache line with log_t::lsn.

log_get_lsn(): Define as an alias of log_sys.get_lsn().

log_get_lsn_nowait(), log_peek_lsn(): Remove.

log_get_flush_lsn(): Define as an alias of log_sys.get_flush_lsn().

log_t::initiate_write(): Replaces log_buffer_sync_in_background().
2020-03-05 16:21:31 +02:00