Commit graph

347 commits

Author SHA1 Message Date
Marko Mäkelä
101da87228 Merge 10.5 into 10.6 2021-06-23 19:36:45 +03:00
Marko Mäkelä
22b62edaed MDEV-25113: Make page flushing faster
buf_page_write_complete(): Reduce the buf_pool.mutex hold time,
and do not acquire buf_pool.flush_list_mutex at all.
Instead, mark blocks clean by setting oldest_modification to 1.
Dirty pages of temporary tables will be identified by the special
value 2 instead of the previous special value 1.
(By design of the ib_logfile0 format, actual LSN values smaller
than 2048 are not possible.)

buf_LRU_free_page(), buf_pool_t::get_oldest_modification()
and many other functions will remove the garbage (clean blocks)
from buf_pool.flush_list while holding buf_pool.flush_list_mutex.

buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaced with non-atomic variables, protected by buf_pool.mutex,
to avoid unnecessary synchronization when modifying the counts.

export_vars: Remove unnecessary indirection for
innodb_pages_created, innodb_pages_read, innodb_pages_written.
2021-06-23 19:06:52 +03:00
Marko Mäkelä
f778a5d5e2 MDEV-25854: Remove garbage tables after restoring a backup
In commit 1c5ae99194 (MDEV-25666)
we had changed Mariabackup so that it would no longer skip files
whose names start with #sql. This turned out to be wrong.
Because operations on such named files are not protected by any
locks in the server, it is not safe to copy them.

Not copying the files may make the InnoDB data dictionary
inconsistent with the file system. So, we must do something
in InnoDB to adjust for that.

If InnoDB is being started up without the redo log (ib_logfile0)
or with a zero-length log file, we will assume that the server
was restored from a backup, and adjust things as follows:

dict_check_sys_tables(), fil_ibd_open(): Do not complain about
missing #sql files if they would be dropped a little later.

dict_stats_update_if_needed(): Never add #sql tables to
the recomputing queue. This avoids a potential race condition when
dropping the garbage tables.

drop_garbage_tables_after_restore(): Try to drop any garbage tables.

innodb_ddl_recovery_done(): Invoke drop_garbage_tables_after_restore()
if srv_start_after_restore (a new flag) was set and we are not in
read-only mode (innodb_read_only=ON or innodb_force_recovery>3).

The tests and dbug_mariabackup_event() instrumentation
were developed by Vladislav Vaintroub, who also reviewed this.
2021-06-17 13:46:16 +03:00
Krunal Bauskar
102ff4208d MDEV-25882: Statistics used to track b-tree (non-adaptive) searches
should be updated only when adaptive hashing is turned-on

Currently, btr_cur_n_non_sea is used to track the search that missed
adaptive hash index. adaptive hash index is turned off by default
but the said variable is updated always though the value of it makes sense
only when an adaptive index is enabled. It is meant to check how many
searches didn't go through an adaptive hash index.

Given a global variable that is updated on each search path it causes
a contention with a multi-threaded workload.

Patch moves the said variables inside a loop that is now updated
only when the adaptive hash index is enabled and that in theory should
also, reduce the update frequency of the said variable as the majority of
the request should be serviced through the adaptive hash index.

Variables (btr_cur_n_non_sea and btr_cur_n_sea) are also converted to
use distributed counter to avoid contention.

User visible changes:

This also means that user will now see
Innodb_adaptive_hash_non_hash_searches (viewed as part of show status)
only if code is compiled with DWITH_INNODB_AHI=ON (default) and it will
be updated only if innodb_adaptive_hash_index=1 else it reported as 0.
2021-06-11 12:32:03 +03:00
Marko Mäkelä
e538cb095f Merge 10.5 into 10.6 2021-03-27 18:03:03 +02:00
Marko Mäkelä
80459bcbd4 Merge 10.4 into 10.5 2021-03-27 17:37:42 +02:00
Marko Mäkelä
7ae37ff74f Merge 10.3 into 10.4 2021-03-27 17:12:28 +02:00
Marko Mäkelä
3157fa182a Merge 10.2 into 10.3 2021-03-27 16:11:26 +02:00
Marko Mäkelä
0f8caadc96 MDEV-22653: Remove the useless parameter innodb_simulate_comp_failures
The debug parameter innodb_simulate_comp_failures injected compression
failures for ROW_FORMAT=COMPRESSED tables, breaking the pre-existing
logic that I had implemented in the InnoDB Plugin for MySQL 5.1 to prevent
compressed page overflows. A much better check is already achieved by
defining UNIV_ZIP_COPY at the compilation time.
(Only UNIV_ZIP_DEBUG is part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON.)
2021-03-22 18:12:44 +02:00
Marko Mäkelä
1055cf967f Merge 10.5 into 10.6 2021-03-20 18:40:07 +02:00
Marko Mäkelä
e7ddf46632 MDEV-25211 Remove useless counter Innodb_buffered_aio_submitted
In commit 412533b4a7 (MDEV-18582),
one of the counters that was ported from XtraDB is useless.
Innodb_buffered_aio_submitted would be 0 or 1, depending on whether
is_linux_native_aio_supported() was executed to the point where
it would be incremented.

Let us remove this counter, because it has no practical value.
Even if its value were 1, io_setup() can still fail and we may
end up with innodb_use_native_aio=0.
2021-03-20 13:47:05 +02:00
Krunal Bauskar
8d4e3ec2f7 MDEV-21212: buf_page_get_gen -> buf_pool->stat.n_page_gets++ is a cpu waste
n_page_gets is a global counter that is updated on each page access.
This also means it is updated pretty often and with a multi-core machine
it easily boils up to be the hottest counter as also reported by perf.

Using existing distributed counter framework help ease the contention
and improve the performance.

Patch also tend to increase the slot of the distributed counter from original
64 to 128 given that is new normal for next-generation machines.

The original idea and patch came from Daniel Black which is now ported
to 10.6 with some improvement and adjustment.
2021-03-12 12:56:23 +02:00
Marko Mäkelä
7a4fbb55b0 MDEV-25105 Remove innodb_checksum_algorithm values none,innodb,...
Historically, InnoDB supported a buggy page checksum algorithm that did not
compute a checksum over the full page. Later, well before MySQL 4.1
introduced .ibd files and the innodb_file_per_table option, the algorithm
was corrected and the first 4 bytes of each page were redefined to be
a checksum.

The original checksum was so slow that an option to disable page checksum
was introduced for benchmarketing purposes.

The Intel Nehalem microarchitecture introduced the SSE4.2 instruction set
extension, which includes instructions for faster computation of CRC-32C.
In MySQL 5.6 (and MariaDB 10.0), innodb_checksum_algorithm=crc32 was
implemented to make of that. As that option was changed to be the default
in MySQL 5.7, a bug was found on big-endian platforms and some work-around
code was added to weaken that checksum further. MariaDB disables that
work-around by default since MDEV-17958.

Later, SIMD-accelerated CRC-32C has been implemented in MariaDB for POWER
and ARM and also for IA-32/AMD64, making use of carry-less multiplication
where available.

Long story short, innodb_checksum_algorithm=crc32 is faster and more secure
than the pre-MySQL 5.6 checksum, called innodb_checksum_algorithm=innodb.
It should have removed any need to use innodb_checksum_algorithm=none.

The setting innodb_checksum_algorithm=crc32 is the default in
MySQL 5.7 and MariaDB Server 10.2, 10.3, 10.4. In MariaDB 10.5,
MDEV-19534 made innodb_checksum_algorithm=full_crc32 the default.
It is even faster and more secure.

The default settings in MariaDB do allow old data files to be read,
no matter if a worse checksum algorithm had been used.
(Unfortunately, before innodb_checksum_algorithm=full_crc32,
the data files did not identify which checksum algorithm is being used.)

The non-default settings innodb_checksum_algorithm=strict_crc32 or
innodb_checksum_algorithm=strict_full_crc32 would only allow CRC-32C
checksums. The incompatibility with old data files is why they are
not the default.

The newest server not to support innodb_checksum_algorithm=crc32
were MySQL 5.5 and MariaDB 5.5. Both have reached their end of life.
A valid reason for using innodb_checksum_algorithm=innodb could have
been the ability to downgrade. If it is really needed, data files
can be converted with an older version of the innochecksum utility.

Because there is no good reason to allow data files to be written
with insecure checksums, we will reject those option values:

    innodb_checksum_algorithm=none
    innodb_checksum_algorithm=innodb
    innodb_checksum_algorithm=strict_none
    innodb_checksum_algorithm=strict_innodb

Furthermore, the following innochecksum options will be removed,
because only strict crc32 will be supported:

    innochecksum --strict-check=crc32
    innochecksum -C crc32
    innochecksum --write=crc32
    innochecksum -w crc32

If a user wishes to convert a data file to use a different checksum
(so that it might be used with the no-longer-supported
MySQL 5.5 or MariaDB 5.5, which do not support IMPORT TABLESPACE
nor system tablespace format changes that were made in MariaDB 10.3),
then the innochecksum tool from MariaDB 10.2, 10.3, 10.4, 10.5 or
MySQL 5.7 can be used.

Reviewed by: Thirunarayanan Balathandayuthapani
2021-03-11 12:46:18 +02:00
Marko Mäkelä
584e52118c MDEV-20612 fixup: Make comments refer to lock_sys.latch 2021-02-17 12:18:03 +02:00
Sergei Golubchik
00a313ecf3 Merge branch 'bb-10.3-release' into bb-10.4-release
Note, the fix for "MDEV-23328 Server hang due to Galera lock conflict resolution"
was null-merged. 10.4 version of the fix is coming up separately
2021-02-12 17:44:22 +01:00
Marko Mäkelä
5f46385764 MDEV-24731 Excessive mutex contention in DeadlockChecker::check_and_resolve()
The DeadlockChecker expects to be able to freeze the waits-for graph.
Hence, it is best executed somewhere where we are not holding any
additional mutexes.

lock_wait(): Defer the deadlock check to this function, instead
of executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting().

DeadlockChecker::trx_rollback(): Merge with the only caller,
check_and_resolve().

LockMutexGuard: RAII accessor for lock_sys.mutex.

lock_sys.deadlocks: Replaces lock_deadlock_found.

trx_t: Clean up some comments.
2021-02-04 16:38:07 +02:00
Sergei Golubchik
60ea09eae6 Merge branch '10.2' into 10.3 2021-02-01 13:49:33 +01:00
Vladislav Vaintroub
d8373fea5f MDEV-24685 - remove IO thread states output from SHOW ENGINE INNODB STATUS
There are no IO threads anymore.
2021-01-29 18:02:14 +02:00
Marko Mäkelä
e71e613353 MDEV-24671: Replace lock_wait_timeout_task with mysql_cond_timedwait()
lock_wait(): Replaces lock_wait_suspend_thread(). Wait for the lock to
be granted or the transaction to be killed using mysql_cond_timedwait()
or mysql_cond_wait().

lock_wait_end(): Replaces que_thr_end_lock_wait() and
lock_wait_release_thread_if_suspended().

lock_wait_timeout_task: Remove. The operating system kernel will
resume the mysql_cond_timedwait() in lock_wait(). An added benefit
is that innodb_lock_wait_timeout no longer has a 'jitter' of 1 second,
which was caused by this wake-up task waking up only once per second,
and then waking up any threads for which the timeout (which was only
measured in seconds) was exceeded.

innobase_kill_query(): Set trx->error_state=DB_INTERRUPTED,
so that a call trx_is_interrupted(trx) in lock_wait() can be avoided.

We will protect things more consistently with lock_sys.wait_mutex,
which will be moved below lock_sys.mutex in the latching order.

trx_lock_t::cond: Condition variable for !wait_lock, used with
lock_sys.wait_mutex.

srv_slot_t: Remove. Replaced by trx_lock_t::cond,

lock_grant_after_reset(): Merged to to lock_grant().

lock_rec_get_index_name(): Remove.

lock_sys_t: Introduce wait_pending, wait_count, wait_time, wait_time_max
that are protected by wait_mutex.

trx_lock_t::que_state: Remove.

que_thr_state_t: Remove QUE_THR_COMMAND_WAIT, QUE_THR_LOCK_WAIT.

que_thr_t: Remove is_active, start_running(), stop_no_error().

que_fork_t::n_active_thrs, trx_lock_t::n_active_thrs: Remove.
2021-01-27 15:45:39 +02:00
Vladislav Vaintroub
c310f4c381 MDEV-24685 - remove IO thread states output from SHOW ENGINE INNODB STATUS
There are no IO threads anymore.
2021-01-27 14:17:43 +01:00
Marko Mäkelä
6d05a95c65 Merge 10.5 into 10.6 2021-01-14 16:13:31 +02:00
Marko Mäkelä
e4205fba7c MDEV-24536 innodb_idle_flush_pct has no effect
The parameter innodb_idle_flush_pct that was introduced in
MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since
the InnoDB changes from MySQL 5.7.9 were applied in
commit 2e814d4702.

Let us declare the parameter as MARIADB_REMOVED_OPTION.
For earlier versions, commit ea9cd97f85
declared the parameter deprecated.
2021-01-13 19:11:31 +02:00
Marko Mäkelä
ea9cd97f85 MDEV-24536 innodb_idle_flush_pct has no effect
The parameter innodb_idle_flush_pct that was introduced in
MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since
the InnoDB changes from MySQL 5.7.9 were applied in
commit 2e814d4702.

Let us declare the parameter as deprecated and having no effect.
2021-01-13 18:55:56 +02:00
Marko Mäkelä
666565c7f0 Merge 10.5 into 10.6 2021-01-11 17:32:08 +02:00
Marko Mäkelä
fae87e0c74 MDEV-24544 innodb_buffer_pool_wait_free is not protected by mutex
buf_LRU_get_free_block(): Increment the counter after acquiring
buf_pool.mutex.

buf_pool.stat.LRU_waits: Replaces export_vars.innodb_buffer_pool_wait_free
and srv_stats.buf_pool_wait_free.

Reads of the counter will remain dirty, not acquiring the mutex.
2021-01-07 11:18:13 +02:00
Marko Mäkelä
92abdcca5a Merge 10.5 into 10.6 2021-01-07 09:08:09 +02:00
Marko Mäkelä
a993310593 MDEV-24537 innodb_max_dirty_pages_pct_lwm=0 lost its special meaning
In commit 3a9a3be1c6 (MDEV-23855)
some previous logic was replaced with the condition
dirty_pct < srv_max_dirty_pages_pct_lwm, which caused
the default value of the parameter innodb_max_dirty_pages_pct_lwm=0
to lose its special meaning: 'refer to innodb_max_dirty_pages_pct instead'.

This implicit special meaning was visible in the function
af_get_pct_for_dirty(), which was removed in
commit f0c295e2de (MDEV-24369).

page_cleaner_flush_pages_recommendation(): Restore the special
meaning that was removed in MDEV-24369.

buf_flush_page_cleaner(): If srv_max_dirty_pages_pct_lwm==0.0,
refer to srv_max_buf_pool_modified_pct. This fixes the observed
performance regression due to excessive page flushing.

buf_pool_t::page_cleaner_wakeup(): Revise the wakeup condition.

innodb_init(): Do initialize srv_max_io_capacity in Mariabackup.
It was previously constantly 0, which caused mariadb-backup --prepare
to hang in buf_flush_sync(), making no progress.
2021-01-06 13:53:14 +02:00
Marko Mäkelä
ff5d306e29 MDEV-21452: Replace ib_mutex_t with mysql_mutex_t
SHOW ENGINE INNODB MUTEX functionality is completely removed,
as are the InnoDB latching order checks.

We will enforce innodb_fatal_semaphore_wait_threshold
only for dict_sys.mutex and lock_sys.mutex.

dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex.

lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex.

FIXME: srv_sys should be removed altogether; it is duplicating tpool
functionality.

fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must
not hold fil_system.mutex.

fil_close_all_files(): To prevent SAFE_MUTEX warnings for
fil_space_destroy_crypt_data(), we must not hold fil_system.mutex
while invoking fil_space_free_low() on a detached tablespace.
2020-12-15 17:56:18 +02:00
Marko Mäkelä
db006a9a43 MDEV-21452: Remove os_event_t, MUTEX_EVENT, TTASEventMutex, sync_array
We will default to MUTEXTYPE=sys (using OSTrackMutex) for those
ib_mutex_t that have not been replaced yet.

The view INFORMATION_SCHEMA.INNODB_SYS_SEMAPHORE_WAITS is removed.

The parameter innodb_sync_array_size is removed.

FIXME: innodb_fatal_semaphore_wait_threshold will no longer be enforced.
We should enforce it for lock_sys.mutex and dict_sys.mutex somehow!

innodb_sync_debug=ON might still cover ib_mutex_t.
2020-12-15 17:56:17 +02:00
Marko Mäkelä
38fd7b7d91 MDEV-21452: Replace all direct use of os_event_t
Let us replace os_event_t with mysql_cond_t, and replace the
necessary ib_mutex_t with mysql_mutex_t so that they can be
used with condition variables.

Also, let us replace polling (os_thread_sleep() or timed waits)
with plain mysql_cond_wait() wherever possible.

Furthermore, we will use the lightweight srw_mutex for trx_t::mutex,
to hopefully reduce contention on lock_sys.mutex.

FIXME: Add test coverage of
mariabackup --backup --kill-long-queries-timeout
2020-12-15 17:56:17 +02:00
Marko Mäkelä
f24b738318 MDEV-24313 (2 of 2): Silently ignored innodb_use_native_aio=1
In commit 5e62b6a5e0 (MDEV-16264)
the logic of os_aio_init() was changed so that it will never fail,
but instead automatically disable innodb_use_native_aio (which is
enabled by default) if the io_setup() system call would fail due
to resource limits being exceeded. This is questionable, especially
because falling back to simulated AIO may lead to significantly
reduced performance.

srv_n_file_io_threads, srv_n_read_io_threads, srv_n_write_io_threads:
Change the data type from ulong to uint.

os_aio_init(): Remove the parameters, and actually return an error code.

thread_pool::configure_aio(): Do not silently fall back to simulated AIO.

Reviewed by: Vladislav Vaintroub
2020-12-14 15:27:03 +02:00
Marko Mäkelä
83591a23d6 MDEV-24350 buf_dblwr unnecessarily uses memory-intensive srv_stats counters
The counters in srv_stats use std::atomic and multiple cache lines per
counter. This is an overkill in a case where a critical section already
exists in the code. A regular variable will work just fine, with much
smaller memory bus impact.
2020-12-04 17:52:23 +02:00
Marko Mäkelä
657fcdf430 MDEV-24280 InnoDB triggers too many independent periodic tasks
A side effect of MDEV-16264 is that a large number of threads will
be created at server startup, to be destroyed after a minute or two.

One source of such thread creation is srv_start_periodic_timer().
InnoDB is creating 3 periodic tasks: srv_master_callback (1Hz)
srv_error_monitor_task (1Hz), and srv_monitor_task (0.2Hz).

It appears that we can merge srv_error_monitor_task and srv_monitor_task
and have them invoked 4 times per minute (every 15 seconds). This will
affect our ability to enforce innodb_fatal_semaphore_wait_threshold and
some computations around BUF_LRU_STAT_N_INTERVAL.

We could remove srv_master_callback along with the DROP TABLE queue
at some point of time in the future. We must keep it independent
of the innodb_fatal_semaphore_wait_threshold detection, because
the background DROP TABLE queue could get stuck due to dict_sys
being locked by another thread. For now, srv_master_callback
must be invoked once per second, so that
innodb_flush_log_at_timeout=1 can work.

BUF_LRU_STAT_N_INTERVAL: Reduce the precision and extend the time
from 50*1 second to 4*15 seconds.

srv_error_monitor_timer: Remove.

MAX_MUTEX_NOWAIT: Increase from 20*1 second to 2*15 seconds.

srv_refresh_innodb_monitor_stats(): Avoid a repeated call to time(NULL).
Change the interval to less than 60 seconds.

srv_monitor(): Renamed from srv_monitor_task.

srv_monitor_task(): Renamed from srv_error_monitor_task().
Invoked only once in 15 seconds. Invoke also srv_monitor().
Increase the fatal_cnt threshold from 10*1 second to 1*15 seconds.

sync_array_print_long_waits_low(): Invoke time(NULL) only once.
Remove a bogus message about printouts for 30 seconds. Those
printouts were effectively already disabled in MDEV-16264
(commit 5e62b6a5e0).
2020-11-25 16:54:00 +02:00
Marko Mäkelä
3a9a3be1c6 MDEV-23855: Improve InnoDB log checkpoint performance
After MDEV-15053, MDEV-22871, MDEV-23399 shifted the scalability
bottleneck, log checkpoints became a new bottleneck.

If innodb_io_capacity is set low or innodb_max_dirty_pct_lwm is
set high and the workload fits in the buffer pool, the page cleaner
thread will perform very little flushing. When we reach the capacity
of the circular redo log file ib_logfile0 and must initiate a checkpoint,
some 'furious flushing' will be necessary. (If innodb_flush_sync=OFF,
then flushing would continue at the innodb_io_capacity rate, and
writers would be throttled.)

We have the best chance of advancing the checkpoint LSN immediately
after a page flush batch has been completed. Hence, it is best to
perform checkpoints after every batch in the page cleaner thread,
attempting to run once per second.

By initiating high-priority flushing in the page cleaner as early
as possible, we aim to make the throughput more stable.

The function buf_flush_wait_flushed() used to sleep for 10ms, hoping
that the page cleaner thread would do something during that time.
The observed end result was that a large number of threads that call
log_free_check() would end up sleeping while nothing useful is happening.

We will revise the design so that in the default innodb_flush_sync=ON
mode, buf_flush_wait_flushed() will wake up the page cleaner thread
to perform the necessary flushing, and it will wait for a signal from
the page cleaner thread.

If innodb_io_capacity is set to a low value (causing the page cleaner to
throttle its work), a write workload would initially perform well, until
the capacity of the circular ib_logfile0 is reached and log_free_check()
will trigger checkpoints. At that point, the extra waiting in
buf_flush_wait_flushed() will start reducing throughput.

The page cleaner thread will also initiate log checkpoints after each
buf_flush_lists() call, because that is the best point of time for
the checkpoint LSN to advance by the maximum amount.

Even in 'furious flushing' mode we invoke buf_flush_lists() with
innodb_io_capacity_max pages at a time, and at the start of each
batch (in the log_flush() callback function that runs in a separate
task) we will invoke os_aio_wait_until_no_pending_writes(). This
tweak allows the checkpoint to advance in smaller steps and
significantly reduces the maximum latency. On an Intel Optane 960
NVMe SSD on Linux, it reduced from 4.6 seconds to 74 milliseconds.
On Microsoft Windows with a slower SSD, it reduced from more than
180 seconds to 0.6 seconds.

We will make innodb_adaptive_flushing=OFF simply flush innodb_io_capacity
per second whenever the dirty proportion of buffer pool pages exceeds
innodb_max_dirty_pages_pct_lwm. For innodb_adaptive_flushing=ON we try
to make page_cleaner_flush_pages_recommendation() more consistent and
predictable: if we are below innodb_adaptive_flushing_lwm, let us flush
pages according to the return value of af_get_pct_for_dirty().

innodb_max_dirty_pages_pct_lwm: Revert the change of the default value
that was made in MDEV-23399. The value innodb_max_dirty_pages_pct_lwm=0
guarantees that a shutdown of an idle server will be fast. Users might
be surprised if normal shutdown suddenly became slower when upgrading
within a GA release series.

innodb_checkpoint_usec: Remove. The master task will no longer perform
periodic log checkpoints. It is the duty of the page cleaner thread.

log_sys.max_modified_age: Remove. The current span of the
buf_pool.flush_list expressed in LSN only matters for adaptive
flushing (outside the 'furious flushing' condition).
For the correctness of checkpoints, the only thing that matters is
the checkpoint age (log_sys.lsn - log_sys.last_checkpoint_lsn).
This run-time constant was also reported as log_max_modified_age_sync.

log_sys.max_checkpoint_age_async: Remove. This does not serve any
purpose, because the checkpoints will now be triggered by the page
cleaner thread. We will retain the log_sys.max_checkpoint_age limit
for engaging 'furious flushing'.

page_cleaner.slot: Remove. It turns out that
page_cleaner_slot.flush_list_time was duplicating
page_cleaner.slot.flush_time and page_cleaner.slot.flush_list_pass
was duplicating page_cleaner.flush_pass.
Likewise, there were some redundant monitor counters, because the
page cleaner thread no longer performs any buf_pool.LRU flushing, and
because there only is one buf_flush_page_cleaner thread.

buf_flush_sync_lsn: Protect writes by buf_pool.flush_list_mutex.

buf_pool_t::get_oldest_modification(): Add a parameter to specify the
return value when no persistent data pages are dirty. Require the
caller to hold buf_pool.flush_list_mutex.

log_buf_pool_get_oldest_modification(): Take the fall-back LSN
as a parameter. All callers will also invoke log_sys.get_lsn().

log_preflush_pool_modified_pages(): Replaced with buf_flush_wait_flushed().

buf_flush_wait_flushed(): Implement two limits. If not enough buffer pool
has been flushed, signal the page cleaner (unless innodb_flush_sync=OFF)
and wait for the page cleaner to complete. If the page cleaner
thread is not running (which can be the case durign shutdown),
initiate the flush and wait for it directly.

buf_flush_ahead(): If innodb_flush_sync=ON (the default),
submit a new buf_flush_sync_lsn target for the page cleaner
but do not wait for the flushing to finish.

log_get_capacity(), log_get_max_modified_age_async(): Remove, to make
it easier to see that af_get_pct_for_lsn() is not acquiring any mutexes.

page_cleaner_flush_pages_recommendation(): Protect all access to
buf_pool.flush_list with buf_pool.flush_list_mutex. Previously there
were some race conditions in the calculation.

buf_flush_sync_for_checkpoint(): New function to process
buf_flush_sync_lsn in the page cleaner thread. At the end of
each batch, we try to wake up any blocked buf_flush_wait_flushed().
If everything up to buf_flush_sync_lsn has been flushed, we will
reset buf_flush_sync_lsn=0. The page cleaner thread will keep
'furious flushing' until the limit is reached. Any threads that
are waiting in buf_flush_wait_flushed() will be able to resume
as soon as their own limit has been satisfied.

buf_flush_page_cleaner: Prioritize buf_flush_sync_lsn and do not
sleep as long as it is set. Do not update any page_cleaner statistics
for this special mode of operation. In the normal mode
(buf_flush_sync_lsn is not set for innodb_flush_sync=ON),
try to wake up once per second. No longer check whether
srv_inc_activity_count() has been called. After each batch,
try to perform a log checkpoint, because the best chances for
the checkpoint LSN to advance by the maximum amount are upon
completing a flushing batch.

log_t: Move buf_free, max_buf_free possibly to the same cache line
with log_sys.mutex.

log_margin_checkpoint_age(): Simplify the logic, and replace
a 0.1-second sleep with a call to buf_flush_wait_flushed() to
initiate flushing. Moved to the same compilation unit
with the only caller.

log_close(): Clean up the calculations. (Should be no functional
change.) Return whether flush-ahead is needed. Moved to the same
compilation unit with the only caller.

mtr_t::finish_write(): Return whether flush-ahead is needed.

mtr_t::commit(): Invoke buf_flush_ahead() when needed. Let us avoid
external calls in mtr_t::commit() and make the logic easier to follow
by having related code in a single compilation unit. Also, we will
invoke srv_stats.log_write_requests.inc() only once per
mini-transaction commit, while not holding mutexes.

log_checkpoint_margin(): Only care about log_sys.max_checkpoint_age.
Upon reaching log_sys.max_checkpoint_age where we must wait to prevent
the log from getting corrupted, let us wait for at most 1MiB of LSN
at a time, before rechecking the condition. This should allow writers
to proceed even if the redo log capacity has been reached and
'furious flushing' is in progress. We no longer care about
log_sys.max_modified_age_sync or log_sys.max_modified_age_async.
The log_sys.max_modified_age_sync could be a relic from the time when
there was a srv_master_thread that wrote dirty pages to data files.
Also, we no longer have any log_sys.max_checkpoint_age_async limit,
because log checkpoints will now be triggered by the page cleaner
thread upon completing buf_flush_lists().

log_set_capacity(): Simplify the calculations of the limit
(no functional change).

log_checkpoint_low(): Split from log_checkpoint(). Moved to the
same compilation unit with the caller.

log_make_checkpoint(): Only wait for everything to be flushed until
the current LSN.

create_log_file(): After checkpoint, invoke log_write_up_to()
to ensure that the FILE_CHECKPOINT record has been written.
This avoids ut_ad(!srv_log_file_created) in create_log_file_rename().

srv_start(): Do not call recv_recovery_from_checkpoint_start()
if the log has just been created. Set fil_system.space_id_reuse_warned
before dict_boot() has been executed, and clear it after recovery
has finished.

dict_boot(): Initialize fil_system.max_assigned_id.

srv_check_activity(): Remove. The activity count is counting transaction
commits and therefore mostly interesting for the purge of history.

BtrBulk::insert(): Do not explicitly wake up the page cleaner,
but do invoke srv_inc_activity_count(), because that counter is
still being used in buf_load_throttle_if_needed() for some
heuristics. (It might be cleaner to execute buf_load() in the
page cleaner thread!)

Reviewed by: Vladislav Vaintroub
2020-10-26 17:09:01 +02:00
Marko Mäkelä
61161d51d7 Cleanup: Remove export_vars.innodb_num_open_files 2020-10-15 17:06:16 +03:00
Marko Mäkelä
7cffb5f6e8 MDEV-23399: Performance regression with write workloads
The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.

The configuration parameters will be changed as follows:

innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)

Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows:
 * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
 * As a buf_pool.free limit in buf_LRU_list_batch() for terminating
   the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().

The status variables will be changed as follows:

innodb_buffer_pool_pages_flushed: This includes also the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably,
updated one by one in buf_flush_page() to give more real-time
statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call

When moving to a single buffer pool instance in MDEV-15058, we missed
some opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.

Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.

The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.

The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a unless log_warnings>2.

Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are are currently exclusively locked.
The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message
will be removed, because by design, the buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.

We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.

To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.

For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.

mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().

buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.

buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.

buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.

buf_pool_t::done_flush_list: Condition variable for !n_flush_list.

buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.

buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.

buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().

flush_counters_t::unzip_LRU_evicted: Remove.

IORequest: Make more members const. FIXME: m_fil_node should be removed.

buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).

page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.

pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().

recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().

recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.

srv_started_redo: Replaces srv_start_state.

SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().

buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().

buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
function, which was their only caller, and remove 2 unnecessary
buf_pool.mutex release/re-acquisition that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.

buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.

buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.

buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().

buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.

buf_flush_check_neighbor(): Take id.fold() as a parameter.

buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.

buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.

buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_do_LRU_batch(): Return the number of pages flushed.

buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.

buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.

buf_LRU_check_size_of_non_data_objects(): Simplify the code.

buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.

buf_page_create(): Take a pre-allocated page as a parameter.

buf_pool_t::free_block(): Free a pre-allocated block.

recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.

BtrBulk::logFreeCheck(): Skip a redundant condition.

row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.

sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.

Reviewed by: Vladislav Vaintroub
2020-10-15 17:04:56 +03:00
Marko Mäkelä
b535a79044 MDEV-23399: Remove recv_writer_thread
Recovery works just fine without a separate thread whose only
task is to tell the page cleaner thread to do its job.

recv_sys_t::apply(): Flush the buffer pool at the end of each batch.

Reviewed by: Vladislav Vaintroub
2020-10-15 10:33:20 +03:00
Marko Mäkelä
b15224c71a MDEV-16264 fixup: Remove unused declarations
Remove stale references to srv_sys.sys_threads and srv_sys.mutex
that were removed in
commit 5e62b6a5e0.
2020-10-01 14:35:42 +03:00
Marko Mäkelä
a9550c47e4 MDEV-16264 fixup: Remove unused code and data
LATCH_ID_OS_AIO_READ_MUTEX,
LATCH_ID_OS_AIO_WRITE_MUTEX,
LATCH_ID_OS_AIO_LOG_MUTEX,
LATCH_ID_OS_AIO_IBUF_MUTEX,
LATCH_ID_OS_AIO_SYNC_MUTEX: Remove. The tpool is not instrumented.

lock_set_timeout_event(): Remove.

srv_sys_mutex_key, srv_sys_t::mutex, SYNC_THREADS: Remove.

srv_slot_t::suspended: Remove. We only ever assigned this data member
true, so it is redundant.

ib_wqueue_wait(), ib_wqueue_timedwait(): Remove.

os_thread_join(): Remove.

os_thread_create(), os_thread_exit(): Remove redundant parameters.

These were missed in commit 5e62b6a5e0.
2020-09-30 14:28:11 +03:00
Marko Mäkelä
8bc4ebed67 MDEV-21534 fixup: Remove traces of removed log_sys.write_mutex
This was missed in commit 30ea63b7d2.
2020-09-30 13:48:20 +03:00
Marko Mäkelä
363da42190 Cleanup: Remove unnecessary declarations 2020-09-30 13:37:15 +03:00
Marko Mäkelä
2fa9f8c53a Merge 10.3 into 10.4 2020-08-20 11:01:47 +03:00
Marko Mäkelä
de0e7cd72a Merge 10.2 into 10.3 2020-08-20 09:12:16 +03:00
Marko Mäkelä
309302a3da MDEV-23475 InnoDB performance regression for write-heavy workloads
In commit fe39d02f51 (MDEV-20638)
we removed some wake-up signaling of the master thread that should
have been there, to ensure a steady log checkpointing workload.

Common sense suggests that the commit omitted some necessary calls
to srv_inc_activity_count(). But, an attempt to add the call to
trx_flush_log_if_needed_low() as well as to reinstate the function
innobase_active_small() did not restore the performance for the
case where sync_binlog=1 is set.

Therefore, we will revert the entire commit in MariaDB Server 10.2.
In MariaDB Server 10.5, adding a srv_inc_activity_count() call to
trx_flush_log_if_needed_low() did restore the performance, so we
will not revert MDEV-20638 across all versions.
2020-08-19 11:18:56 +03:00
Marko Mäkelä
bbd70fcc43 MDEV-23379 Deprecate&ignore InnoDB concurrency throttling parameters
The parameters innodb_thread_concurrency and innodb_commit_concurrency
were useful years ago when both computing resources and the implementation
of some shared data structures were limited. MySQL 5.0 or 5.1 had trouble
scaling beyond 8 concurrent connections. Most of the scalability bottlenecks
have been removed since then, and the transactions per second delivered
by MariaDB Server 10.5 should not dramatically drop upon exceeding the
'optimal' number of connections.

Hence, enabling any concurrency throttling for InnoDB actually makes
things worse. We have seen many customers mistakenly setting this to a
small value like 16 or 64 and then complaining the server was slow.

Ignoring the parameters allows us to remove some normally unused code
and data structures, which could slightly improve performance.

innodb_thread_concurrency, innodb_commit_concurrency,
innodb_replication_delay, innodb_concurrency_tickets,
innodb_thread_sleep_delay, innodb_adaptive_max_sleep_delay:
Deprecate and ignore; hard-wire to 0.

The column INFORMATION_SCHEMA.INNODB_TRX.trx_concurrency_tickets
will always report 0.
2020-08-04 06:59:29 +03:00
Marko Mäkelä
50a11f396a Merge 10.4 into 10.5 2020-08-01 14:42:51 +03:00
Marko Mäkelä
9216114ce7 Merge 10.3 into 10.4 2020-07-31 18:09:08 +03:00
Marko Mäkelä
66ec3a770f Merge 10.2 into 10.3 2020-07-31 13:51:28 +03:00
Thirunarayanan Balathandayuthapani
fe39d02f51 MDEV-20638 Remove the deadcode from srv_master_thread() and srv_active_wake_master_thread_low()
- Due to commit fe95cb2e40 (MDEV-16125),
InnoDB master thread does not need to call srv_resume_thread()
and therefore there is no need to wake up the thread.
Due to the above patch, InnoDB should remove the following dead code.

srv_check_activity(): Makes the parameter as in,out and returns the
recent activity value

innobase_active_small(): Removed

srv_active_wake_master_thread(): Removed

srv_wake_master_thread(): Removed

srv_active_wake_master_thread_low(): Removed

Simplify srv_master_thread() and remove switch cases, added the assert.

Replace srv_wake_master_thread() with srv_inc_activity_count()

INNOBASE_WAKE_INTERVAL: Removed
2020-07-23 16:23:20 +05:30
Marko Mäkelä
5155a300fa MDEV-22871: Reduce InnoDB buf_pool.page_hash contention
The rw_lock_s_lock() calls for the buf_pool.page_hash became a
clear bottleneck after MDEV-15053 reduced the contention on
buf_pool.mutex. We will replace that use of rw_lock_t with a
special implementation that is optimized for memory bus traffic.

The hash_table_locks instrumentation will be removed.

buf_pool_t::page_hash: Use a special implementation whose API is
compatible with hash_table_t, and store the custom rw-locks
directly in buf_pool.page_hash.array, intentionally sharing
cache lines with the hash table pointers.

rw_lock: A low-level rw-lock implementation based on std::atomic<uint32_t>
where read_trylock() becomes a simple fetch_add(1).

buf_pool_t::page_hash_latch: The special of rw_lock for the page_hash.

buf_pool_t::page_hash_latch::read_lock(): Assert that buf_pool.mutex
is not being held by the caller.

buf_pool_t::page_hash_latch::write_lock() may be called while not holding
buf_pool.mutex. buf_pool_t::watch_set() is such a caller.

buf_pool_t::page_hash_latch::read_lock_wait(),
page_hash_latch::write_lock_wait(): The spin loops.
These will obey the global parameters innodb_sync_spin_loops and
innodb_sync_spin_wait_delay.

buf_pool_t::freed_page_hash: A singly linked list of copies of
buf_pool.page_hash that ever existed. The fact that we never
free any buf_pool.page_hash.array guarantees that all
page_hash_latch that ever existed will remain valid until shutdown.

buf_pool_t::resize_hash(): Replaces buf_pool_resize_hash().
Prepend a shallow copy of the old page_hash to freed_page_hash.

buf_pool_t::page_hash_table::n_cells: Declare as Atomic_relaxed.

buf_pool_t::page_hash_table::lock(): Explain what prevents a
race condition with buf_pool_t::resize_hash().
2020-06-18 14:16:01 +03:00