The test case was setting aria_sort_buffer_size to MAX_ULONGLONG-1
which was not handled gracefully by my_malloc() or safemalloc().
Fixed by ensuring that the malloc functions returns 0 if the size
is too big.
I also added some protection to Aria repair:
- Limit sort_buffer_size to 16G (after that a bigger sort buffer will
not help that much anyway)
- Limit sort_buffer_size also according to sort file size. This will
help by not allocating less memory if someone sets the buffer size too
high.
There were multiple problems here
* wsrep_trx_fragment_size should not be set when wsrep is disabled or provider is not loaded
* wsrep_trx_fragment_unit should not be set when wsrep is disabled or provider is not loaded
* wsrep_debug has no effect if wsrep is disabled or provider is not loaded
* wsrep_start_position should not be set when wsrep is disabled or provider is not loaded any other value than default
* wsrep_start_position should be changed only when we are joiner or initialized
* wsrep_start_position should be allowed to set only a value that exits, thus
we need to add error handling to wsrep_sst_complete
Actual assertion mentioned on MDEV seems to be already fixed but
setting seqno to -2 will trigger a different assertion
mysqld: /home/jan/mysql/10.4-bugs/wsrep-lib/src/server_state.cpp:702: void wsrep::server_state::sst_received(wsrep::client_service&, int): Assertion `state_ == s_joiner || state_ == s_initialized' failed.
Fixed this by not allowing user to set seqno < -1 (-1 is special
seqno meaning undefined and seqno is initialized to it). MariaDB
releases 10.2 and 10.3 already do not allow to set seqno < -1.
The parameter innodb_idle_flush_pct that was introduced in
MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since
the InnoDB changes from MySQL 5.7.9 were applied in
commit 2e814d4702.
Let us declare the parameter as MARIADB_REMOVED_OPTION.
For earlier versions, commit ea9cd97f85
declared the parameter deprecated.
MDEV-23855 broke the handling of innodb_flush_sync=OFF.
That parameter is supposed to limit the page write rate
in case the log capacity is being exceeded and log checkpoints
are needed.
With this fix, the following should pass:
./mtr --mysqld=--loose-innodb-flush-sync=0
One of our best regression tests for page flushing is
encryption.innochecksum. With innodb_page_size=16k and
innodb_flush_sync=OFF it would likely hang without this fix.
log_sys.last_checkpoint_lsn: Declare as Atomic_relaxed<lsn_t>
so that we are allowed to read the value while not holding
log_sys.mutex.
buf_flush_wait_flushed(): Let the page cleaner perform the flushing
also if innodb_flush_sync=OFF. After the page cleaner has
completed, perform a checkpoint if it is needed, because
buf_flush_sync_for_checkpoint() will not be run if
innodb_flush_sync=OFF.
buf_flush_ahead(): Simplify the condition. We do not really care
whether buf_flush_page_cleaner() is running.
buf_flush_page_cleaner(): Evaluate innodb_flush_sync at the low
level. If innodb_flush_sync=OFF, rate-limit the batches to
innodb_io_capacity_max pages per second.
Reviewed by: Vladislav Vaintroub
session_track_system_variables and max_relay_log_size.
lock LOCK_global_system_variables around the get_one_variable() call
in the Session_sysvars_tracker::store_variable().
problem:
========
mysqltest: In included file "./include/assert.inc":
included from mysql-test/suite/sys_vars/t/rpl_init_slave_func.test at line 69:
Assertion text: '@@global.max_connections = @start_max_connections'
Assertion result: '0'
mysqltest: In included file "./include/assert.inc":
included from mysql-test/suite/sys_vars/t/rpl_init_slave_func.test at line 86:
Assertion text: '@@global.max_connections = @start_max_connections + 1'
Assertion result: '0'
Analysis:
=========
A slave SQL thread sets its Running state to Yes very early in its
initialisation, before the majority of initialisation actions, including
executing the init_slave command, are done. Thus the testcase has a race
condition where the initial replication setup might finish executing later
than the testcase SET GLOBAL init_slave, making the testcase see its effect
where it checks for its absence.
Fix:
===
Include 'sync_slave_sql_with_master.inc' at the beginning of the test to
ensure that slave applier has completed the execution of 'init_slave' command
and proceeded to event application. Replace the apparently needless RESET
MASTER / RESET SLAVE etc.
Patch is based on:
b91e2e6f90
Author: laurynas-biveinis
The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.
The configuration parameters will be changed as follows:
innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)
Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows:
* When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
* As a buf_pool.free limit in buf_LRU_list_batch() for terminating
the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().
The status variables will be changed as follows:
innodb_buffer_pool_pages_flushed: This includes also the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably,
updated one by one in buf_flush_page() to give more real-time
statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.
The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call
When moving to a single buffer pool instance in MDEV-15058, we missed
some opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.
Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.
The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.
The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a unless log_warnings>2.
Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are are currently exclusively locked.
The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message
will be removed, because by design, the buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.
We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.
To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.
For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.
mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().
buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.
buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.
buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.
buf_pool_t::done_flush_list: Condition variable for !n_flush_list.
buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.
buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.
buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().
flush_counters_t::unzip_LRU_evicted: Remove.
IORequest: Make more members const. FIXME: m_fil_node should be removed.
buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).
page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.
pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().
recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().
recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.
srv_started_redo: Replaces srv_start_state.
SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().
buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().
buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
function, which was their only caller, and remove 2 unnecessary
buf_pool.mutex release/re-acquisition that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.
buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.
buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.
buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().
buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.
buf_flush_check_neighbor(): Take id.fold() as a parameter.
buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.
buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.
buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_do_LRU_batch(): Return the number of pages flushed.
buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.
buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.
buf_LRU_check_size_of_non_data_objects(): Simplify the code.
buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.
buf_page_create(): Take a pre-allocated page as a parameter.
buf_pool_t::free_block(): Free a pre-allocated block.
recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.
BtrBulk::logFreeCheck(): Skip a redundant condition.
row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.
sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.
Reviewed by: Vladislav Vaintroub
Changes to be committed:
modified: mysql-test/suite/sys_vars/r/wsrep_cluster_address_basic.result
modified: mysql-test/suite/sys_vars/t/wsrep_cluster_address_basic.test
The setting innodb_lock_schedule_algorithm=VATS that was introduced
in MDEV-11039 (commit 021212b525)
causes conflicting exclusive locks to be incorrectly granted to
two transactions. Specifically, in lock_rec_insert_by_trx_age()
the predicate !lock_rec_has_to_wait_in_queue(in_lock) would hold even
though an active transaction is already holding an exclusive lock.
This was observed between two DELETE of the same clustered index record.
The HASH_DELETE invocation in lock_rec_enqueue_waiting() may be related.
Due to lack of progress in diagnosing the problem, we will deprecate the
option and issue a warning that using it may corrupt data. The unsafe
option was enabled between
commit 0c15d1a6ff (MariaDB 10.2.3)
and the parent of
commit 1cc1d0429d (MariaDB 10.2.17, 10.3.9).
Example of the failure:
http://buildbot.askmonty.org/buildbot/builders/bld-p9-rhel7/builds/4417/steps/mtr/logs/stdio
```
main.mysqld--help 'unix' w17 [ fail ]
Test ended at 2020-06-20 18:51:45
CURRENT_TEST: main.mysqld--help
--- /opt/buildbot-slave/bld-p9-rhel7/build/mysql-test/main/mysqld--help.result 2020-06-20 16:06:49.903604179 +0300
+++ /opt/buildbot-slave/bld-p9-rhel7/build/mysql-test/main/mysqld--help.reject 2020-06-20 18:51:44.886766820 +0300
@@ -1797,10 +1797,10 @@
sync-relay-log-info 10000
sysdate-is-now FALSE
system-versioning-alter-history ERROR
-table-cache 421
+table-cache 2000
table-definition-cache 400
-table-open-cache 421
-table-open-cache-instances 1
+table-open-cache 2000
+table-open-cache-instances 8
tc-heuristic-recover OFF
tcp-keepalive-interval 0
tcp-keepalive-probes 0
mysqltest: Result length mismatch
```
mtr: table_open_cache_basic autosized:
Lets assume that >400 are available and that
we can set the result back to the start value.
All of these system variables are autosized and can
generate MTR output differences.
Closes#1527
The parameters innodb_thread_concurrency and innodb_commit_concurrency
were useful years ago when both computing resources and the implementation
of some shared data structures were limited. MySQL 5.0 or 5.1 had trouble
scaling beyond 8 concurrent connections. Most of the scalability bottlenecks
have been removed since then, and the transactions per second delivered
by MariaDB Server 10.5 should not dramatically drop upon exceeding the
'optimal' number of connections.
Hence, enabling any concurrency throttling for InnoDB actually makes
things worse. We have seen many customers mistakenly setting this to a
small value like 16 or 64 and then complaining the server was slow.
Ignoring the parameters allows us to remove some normally unused code
and data structures, which could slightly improve performance.
innodb_thread_concurrency, innodb_commit_concurrency,
innodb_replication_delay, innodb_concurrency_tickets,
innodb_thread_sleep_delay, innodb_adaptive_max_sleep_delay:
Deprecate and ignore; hard-wire to 0.
The column INFORMATION_SCHEMA.INNODB_TRX.trx_concurrency_tickets
will always report 0.
Error state is not stored in check_and_do_in_subquery_rewrites() when there is
illegal combination of optimizer switches. So all the functions eventually
return false. Thus the assetion failure.
The rw_lock_s_lock() calls for the buf_pool.page_hash became a
clear bottleneck after MDEV-15053 reduced the contention on
buf_pool.mutex. We will replace that use of rw_lock_t with a
special implementation that is optimized for memory bus traffic.
The hash_table_locks instrumentation will be removed.
buf_pool_t::page_hash: Use a special implementation whose API is
compatible with hash_table_t, and store the custom rw-locks
directly in buf_pool.page_hash.array, intentionally sharing
cache lines with the hash table pointers.
rw_lock: A low-level rw-lock implementation based on std::atomic<uint32_t>
where read_trylock() becomes a simple fetch_add(1).
buf_pool_t::page_hash_latch: The special of rw_lock for the page_hash.
buf_pool_t::page_hash_latch::read_lock(): Assert that buf_pool.mutex
is not being held by the caller.
buf_pool_t::page_hash_latch::write_lock() may be called while not holding
buf_pool.mutex. buf_pool_t::watch_set() is such a caller.
buf_pool_t::page_hash_latch::read_lock_wait(),
page_hash_latch::write_lock_wait(): The spin loops.
These will obey the global parameters innodb_sync_spin_loops and
innodb_sync_spin_wait_delay.
buf_pool_t::freed_page_hash: A singly linked list of copies of
buf_pool.page_hash that ever existed. The fact that we never
free any buf_pool.page_hash.array guarantees that all
page_hash_latch that ever existed will remain valid until shutdown.
buf_pool_t::resize_hash(): Replaces buf_pool_resize_hash().
Prepend a shallow copy of the old page_hash to freed_page_hash.
buf_pool_t::page_hash_table::n_cells: Declare as Atomic_relaxed.
buf_pool_t::page_hash_table::lock(): Explain what prevents a
race condition with buf_pool_t::resize_hash().
For DECIMAL[(M[,D])] datatype max_sort_length was not being honoured which was leading to buffer
overflow while making the sort key. The fix to this problem would be to create sort keys for decimals
with atmost max_sort_key bytes
Important:
The minimum value of max_sort_length has been raised to 8 (previously was 4),
so fixed size datatypes like DOUBLE and BIGINIT are not truncated for
lower values of max_sort_length.