The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.
The configuration parameters will be changed as follows:
innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)
Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows:
* When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
* As a buf_pool.free limit in buf_LRU_list_batch() for terminating
the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().
The status variables will be changed as follows:
innodb_buffer_pool_pages_flushed: This includes also the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably,
updated one by one in buf_flush_page() to give more real-time
statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.
The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call
When moving to a single buffer pool instance in MDEV-15058, we missed
some opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.
Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.
The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.
The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a unless log_warnings>2.
Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are are currently exclusively locked.
The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message
will be removed, because by design, the buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.
We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.
To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.
For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.
mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().
buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.
buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.
buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.
buf_pool_t::done_flush_list: Condition variable for !n_flush_list.
buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.
buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.
buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().
flush_counters_t::unzip_LRU_evicted: Remove.
IORequest: Make more members const. FIXME: m_fil_node should be removed.
buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).
page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.
pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().
recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().
recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.
srv_started_redo: Replaces srv_start_state.
SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().
buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().
buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
function, which was their only caller, and remove 2 unnecessary
buf_pool.mutex release/re-acquisition that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.
buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.
buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.
buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().
buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.
buf_flush_check_neighbor(): Take id.fold() as a parameter.
buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.
buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.
buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_do_LRU_batch(): Return the number of pages flushed.
buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.
buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.
buf_LRU_check_size_of_non_data_objects(): Simplify the code.
buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.
buf_page_create(): Take a pre-allocated page as a parameter.
buf_pool_t::free_block(): Free a pre-allocated block.
recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.
BtrBulk::logFreeCheck(): Skip a redundant condition.
row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.
sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.
Reviewed by: Vladislav Vaintroub
Recovery works just fine without a separate thread whose only
task is to tell the page cleaner thread to do its job.
recv_sys_t::apply(): Flush the buffer pool at the end of each batch.
Reviewed by: Vladislav Vaintroub
We can simply use C++11 std::atomic for avoiding undefined behaviour
related to concurrent stores to a shared variable. On most if not all
ISAs, std::memory_order_relaxed loads and stores will not really
differ from non-atomic loads or stores.
LATCH_ID_OS_AIO_READ_MUTEX,
LATCH_ID_OS_AIO_WRITE_MUTEX,
LATCH_ID_OS_AIO_LOG_MUTEX,
LATCH_ID_OS_AIO_IBUF_MUTEX,
LATCH_ID_OS_AIO_SYNC_MUTEX: Remove. The tpool is not instrumented.
lock_set_timeout_event(): Remove.
srv_sys_mutex_key, srv_sys_t::mutex, SYNC_THREADS: Remove.
srv_slot_t::suspended: Remove. We only ever assigned this data member
true, so it is redundant.
ib_wqueue_wait(), ib_wqueue_timedwait(): Remove.
os_thread_join(): Remove.
os_thread_create(), os_thread_exit(): Remove redundant parameters.
These were missed in commit 5e62b6a5e0.
It turns out that we must check for DISCARD TABLESPACE both
when the table is being rebuilt and when the AUTO_INCREMENT
value of the table is being added.
This was caught by the test innodb.alter_missing_tablespace.
Somehow I failed to run all tests. Sorry!
The statement ALTER TABLE...DISCARD TABLESPACE is problematic,
because its designed purpose is to break the referential integrity
of the data dictionary and make a table point to nowhere.
ha_innobase::commit_inplace_alter_table(): Check whether the
table has been discarded. (This is a bit late to check it, right
before committing the change.) Previously, we performed this check
only in a specific branch of the function commit_set_autoinc().
Note: We intentionally allow non-rebuilding ALTER TABLE even if
the tablespace has been discarded, to remain compatible with MySQL.
(See the various tests with "wl5522" in the name, such as
innodb.innodb-wl5522.)
The test case would crash starting with 10.3 only, but it does not hurt
to minimize the code and test difference between 10.2 and 10.3.
Since commit 8ccb3caafb it should be
more efficient to use page_id_t rather than two separate variables
for tablespace identifier and page number.
lock_rec_fold(): Replaced with page_id_t::fold().
lock_rec_hash(): Replaced with lock_sys.hash(page_id).
lock_rec_expl_exist_on_page(), lock_rec_get_first_on_page_addr(),
lock_rec_get_first_on_page(): Replaced with lock_sys.get_first().
build_template_field(): Initialize templ->rec_field_is_prefix
also for indexes on virtual columns. This was caught on 10.5 by
MemorySanitizer as use-of-uninitialized-value in
row_search_with_covering_prefix() when running the test
main.fast_prefix_index_fetch_innodb.
commit de942c9f61 (MDEV-15983)
introduced a race condition that we inadequately fixed in
commit 93b69825ad (MDEV-16169).
Because fil_space_t::release() or fil_space_t::acquire() are
not protected by fil_system.mutex like their predecessors,
it is possible that stop_new_ops was set between the time
a thread checked fil_space_t::is_stopping() and invoked
fil_space_t::acquire().
In an execution trace, this happened in fil_system_t::keyrotate_next(),
causing an assertion failure in fil_delete_tablespace()
in the other thread that seeked to stop new operations.
We fix this bug by merging the flag fil_space_t::stop_new_ops
and the reference count fil_space_t::n_pending_ops into a
single word that is only being accessed by atomic memory operations.
fil_space_t::set_stopping(): Accessor for changing the state of
the former stop_new_ops flag.
fil_space_t::acquire(): Return whether the acquisition succeeded.
It would fail between set_stopping(true) and set_stopping(false).
Add a proper error handling of innobase_get_computed_value results in
row_upd_store_row/row_upd_store_v_row.
Also add an assertion in row_vers_build_clust_v_col to fail during row
purge.
Add one more assertion in row_sel_sec_rec_is_for_clust_rec for possible
future catches.
The problem was in improper error handling behavior in
`row_upd_build_difference_binary`:
`innobase_free_row_for_vcol` wasn't called.
To eliminate this problem in all potential places, a refactoring has been
made:
* class ib_vcol_row is added. It owns VCOL_STORAGE and heap and maintains
it in RAII manner
* all innobase_allocate_row_for_vcol/innobase_free_row_for_vcol pairs are
substituted
with ib_vcol_row usage
* row_merge_buf_add is only left untouched because it doesn't own vheap
passed as an argument
* innobase_allocate_row_for_vcol does not allocate VCOL_STORAGE anymore and
accepts it as an argument -- this reduces a number of memory allocations
* move rec_printer out of `#ifndef DBUG_OFF` and mark it cold
In trx_free() we used to declare the entire trx_t unaccessible
and then declare that some data members are accessible.
This involves a race condition with other threads that may concurrently
access the data members that must remain accessible.
One type of error is "AddressSanitizer: unknown-crash", whose
exact cause we have not determined.
Another type of error (reported in MDEV-23472) is "use-after-poison",
where the reported shadow bytes would in fact be 00, indicating that
the memory was no longer poisoned. The poison-access-unpoison race
condition was confirmed by "rr replay".
We eliminate the race condition by invoking MEM_NOACCESS on each
individual data member of trx_t before freeing the memory to the pool.
The memory would not be unpoisoned until the pool is freed
or the memory is being reused for another allocation.
trx_t::free(): Replaces trx_free().
trx_t::active_commit_ordered: Changed to bool, so that MEM_NOACCESS
can be invoked. Removed some accessor functions.
Pool: Remove all MEM_ instrumentation.
TrxFactory: Move the MEM_ instrumentation from Pool.
TrxFactory::debug(): Removed. Moved to trx_t::free(). Because
the memory was already marked unaccessible in trx_t::free(), the
Factory::debug() call in Pool::putl() would be unable to access it.
trx_allocate_for_background(): Replaces trx_create_low().
trx_t::free(): Perform all consistency checks while avoiding
duplication, and declare most data members unaccessible.
Since commit 1509363970 (MDEV-23484)
the rollback of InnoDB transactions is no longer protected by
dict_operation_lock. Removing that protection revealed a race
condition between transaction rollback and the rollback of an
online table-rebuilding operation (OPTIMIZE TABLE, or any online
ALTER TABLE that is rebuilding the table).
row_undo_mod_clust(): Re-check dict_index_is_online_ddl() after
acquiring index->lock, similar to how row_undo_ins_remove_clust_rec()
is doing it. Because innobase_online_rebuild_log_free() is holding
exclusive index->lock while invoking row_log_free(), this re-check
will ensure that row_log_table_low() will not be invoked when
index->online_log=NULL.
A different race condition is possible between the rollback of a
recovered transaction and the start of online secondary index creation.
Because prepare_inplace_alter_table_dict() is not acquiring an InnoDB
table lock in this case, and because recovered transactions are not
covered by metadata locks (MDL), the dict_table_t::indexes could be
modified by prepare_inplace_alter_table_dict() while the rollback of
a recovered transaction is being executed. Normal transactions would
be covered by MDL, and during prepare_inplace_alter_table_dict() we
do hold MDL_EXCLUSIVE, that is, an online ALTER TABLE operation may
not execute concurrently with other transactions that have accessed
the table.
row_undo(): To prevent a race condition with
prepare_inplace_alter_table_dict(), acquire dict_operation_lock
for all recovered transactions. Before MDEV-23484 we used to acquire
it for all transactions, not only recovered ones.
Note: row_merge_drop_indexes() would not invoke
dict_index_remove_from_cache() while transactional locks
exist on the table, or while any thread is holding an open table handle.
OK, it does that for FULLTEXT INDEX, but ADD FULLTEXT INDEX is not
supported as an online operation, and therefore
prepare_inplace_alter_table_dict() would acquire a table S lock,
which cannot succeed as long as recovered transactions on the table
exist, because they would hold a conflicting IX lock on the table.
In commit fe39d02f51 (MDEV-20638)
we removed some wake-up signaling of the master thread that should
have been there, to ensure a steady log checkpointing workload.
Common sense suggests that the commit omitted some necessary calls
to srv_inc_activity_count(). But, an attempt to add the call to
trx_flush_log_if_needed_low() as well as to reinstate the function
innobase_active_small() did not restore the performance for the
case where sync_binlog=1 is set.
Therefore, we will revert the entire commit in MariaDB Server 10.2.
In MariaDB Server 10.5, adding a srv_inc_activity_count() call to
trx_flush_log_if_needed_low() did restore the performance, so we
will not revert MDEV-20638 across all versions.
Regretfully, the parameter innodb_log_checksums was introduced
in MySQL 5.7.9 (the first GA release of that series) by
mysql/mysql-server@af0acedd88
which partly replaced a parameter that had been introduced in 5.7.8
mysql/mysql-server@22ba38218e
as innodb_log_checksum_algorithm.
Given that the CRC-32C operations are accelerated on many processor
implementations (AMD64 with SSE4.2; since MDEV-22669 also on IA-32
with SSE4.2, POWER 8 and later, ARMv8 with some extensions)
and by lookup tables when only generic SISD instructions are available,
there should be no valid reason to disable checksums.
In MariaDB 10.5.2, as a preparation for MDEV-12353, MDEV-19543 deprecated
and ignored the parameter innodb_log_checksums altogether. This should
imply that after a clean shutdown with innodb_log_checksums=OFF one
cannot upgrade to MariaDB Server 10.5 at all.
Due to these problems, let us deprecate the parameter innodb_log_checksums
and honor it only during server startup.
The command SET GLOBAL innodb_log_checksums will always set the
parameter to ON.
The usage message for the innodb_compression_algorithm system variable
did not list snappy, which was added as an optional compression algorithm
in MariaDB 10.1.3 and might actually work since
commit 90c52e5291 (MDEV-12615)
in MariaDB 10.1.24.
Unfortunately, we will include also unavailable compression algorithms
in the list, because ENUM parameters allow numeric values, and we do
not want innodb_compression_algorithm=3 to change meaning depending on
the way how the source code was compiled.
innobase_pk_order_preserved(): Treat an added AUTO_INCREMENT
column in the same way as an added existing column.
In either case, the column values are not guaranteed to
be constant, and thus the ordering may change if such a column
is added before any existing PRIMARY KEY columns.
prepare_inplace_alter_table_dict(): Initialize
dict_table_t::persistent_autoinc before invoking
innobase_pk_order_preserved().
Final added to:
- All reasonable classes inhereted from Field
- All classes inhereted from Protocol
- Almost all Handler classes
- Some important Item classes
The stripped size of mariadbd is just 4K smaller, but several object files
showed notable improvements in common execution paths.
- Checked field.o and item_sum.o
Other things:
- Added 'override' to a few class functions touched by this patch.
- Removed 'virtual' from a new class functions that had/got 'override'
- Changed Protocol_discard to inherit from Protocol instad of Protocol_text
In commit fd9ca2a742 (MDEV-23295)
we added a debug assertion, which caught a similar bug.
prepare_inplace_alter_table_dict(): If we had promised that
ALGORITHM=INPLACE or ALGORITHM=NOCOPY is supported, we must
preserve the ROW_FORMAT.