mariadb

mirror of https://github.com/MariaDB/server.git synced 2026-05-14 19:07:15 +02:00

Author	SHA1	Message	Date
Marko Mäkelä	101da87228	Merge 10.5 into 10.6	2021-06-23 19:36:45 +03:00
Marko Mäkelä	22b62edaed	MDEV-25113: Make page flushing faster buf_page_write_complete(): Reduce the buf_pool.mutex hold time, and do not acquire buf_pool.flush_list_mutex at all. Instead, mark blocks clean by setting oldest_modification to 1. Dirty pages of temporary tables will be identified by the special value 2 instead of the previous special value 1. (By design of the ib_logfile0 format, actual LSN values smaller than 2048 are not possible.) buf_LRU_free_page(), buf_pool_t::get_oldest_modification() and many other functions will remove the garbage (clean blocks) from buf_pool.flush_list while holding buf_pool.flush_list_mutex. buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list: Replaced with non-atomic variables, protected by buf_pool.mutex, to avoid unnecessary synchronization when modifying the counts. export_vars: Remove unnecessary indirection for innodb_pages_created, innodb_pages_read, innodb_pages_written.	2021-06-23 19:06:52 +03:00
Marko Mäkelä	f778a5d5e2	MDEV-25854: Remove garbage tables after restoring a backup In commit `1c5ae99194` (MDEV-25666) we had changed Mariabackup so that it would no longer skip files whose names start with #sql. This turned out to be wrong. Because operations on such named files are not protected by any locks in the server, it is not safe to copy them. Not copying the files may make the InnoDB data dictionary inconsistent with the file system. So, we must do something in InnoDB to adjust for that. If InnoDB is being started up without the redo log (ib_logfile0) or with a zero-length log file, we will assume that the server was restored from a backup, and adjust things as follows: dict_check_sys_tables(), fil_ibd_open(): Do not complain about missing #sql files if they would be dropped a little later. dict_stats_update_if_needed(): Never add #sql tables to the recomputing queue. This avoids a potential race condition when dropping the garbage tables. drop_garbage_tables_after_restore(): Try to drop any garbage tables. innodb_ddl_recovery_done(): Invoke drop_garbage_tables_after_restore() if srv_start_after_restore (a new flag) was set and we are not in read-only mode (innodb_read_only=ON or innodb_force_recovery>3). The tests and dbug_mariabackup_event() instrumentation were developed by Vladislav Vaintroub, who also reviewed this.	2021-06-17 13:46:16 +03:00
Krunal Bauskar	102ff4208d	MDEV-25882: Statistics used to track b-tree (non-adaptive) searches should be updated only when adaptive hashing is turned-on Currently, btr_cur_n_non_sea is used to track the search that missed adaptive hash index. adaptive hash index is turned off by default but the said variable is updated always though the value of it makes sense only when an adaptive index is enabled. It is meant to check how many searches didn't go through an adaptive hash index. Given a global variable that is updated on each search path it causes a contention with a multi-threaded workload. Patch moves the said variables inside a loop that is now updated only when the adaptive hash index is enabled and that in theory should also, reduce the update frequency of the said variable as the majority of the request should be serviced through the adaptive hash index. Variables (btr_cur_n_non_sea and btr_cur_n_sea) are also converted to use distributed counter to avoid contention. User visible changes: This also means that user will now see Innodb_adaptive_hash_non_hash_searches (viewed as part of show status) only if code is compiled with DWITH_INNODB_AHI=ON (default) and it will be updated only if innodb_adaptive_hash_index=1 else it reported as 0.	2021-06-11 12:32:03 +03:00
Marko Mäkelä	e538cb095f	Merge 10.5 into 10.6	2021-03-27 18:03:03 +02:00
Marko Mäkelä	80459bcbd4	Merge 10.4 into 10.5	2021-03-27 17:37:42 +02:00
Marko Mäkelä	7ae37ff74f	Merge 10.3 into 10.4	2021-03-27 17:12:28 +02:00
Marko Mäkelä	3157fa182a	Merge 10.2 into 10.3	2021-03-27 16:11:26 +02:00
Marko Mäkelä	0f8caadc96	MDEV-22653: Remove the useless parameter innodb_simulate_comp_failures The debug parameter innodb_simulate_comp_failures injected compression failures for ROW_FORMAT=COMPRESSED tables, breaking the pre-existing logic that I had implemented in the InnoDB Plugin for MySQL 5.1 to prevent compressed page overflows. A much better check is already achieved by defining UNIV_ZIP_COPY at the compilation time. (Only UNIV_ZIP_DEBUG is part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON.)	2021-03-22 18:12:44 +02:00
Marko Mäkelä	1055cf967f	Merge 10.5 into 10.6	2021-03-20 18:40:07 +02:00
Marko Mäkelä	e7ddf46632	MDEV-25211 Remove useless counter Innodb_buffered_aio_submitted In commit `412533b4a7` (MDEV-18582), one of the counters that was ported from XtraDB is useless. Innodb_buffered_aio_submitted would be 0 or 1, depending on whether is_linux_native_aio_supported() was executed to the point where it would be incremented. Let us remove this counter, because it has no practical value. Even if its value were 1, io_setup() can still fail and we may end up with innodb_use_native_aio=0.	2021-03-20 13:47:05 +02:00
Krunal Bauskar	8d4e3ec2f7	MDEV-21212: buf_page_get_gen -> buf_pool->stat.n_page_gets++ is a cpu waste n_page_gets is a global counter that is updated on each page access. This also means it is updated pretty often and with a multi-core machine it easily boils up to be the hottest counter as also reported by perf. Using existing distributed counter framework help ease the contention and improve the performance. Patch also tend to increase the slot of the distributed counter from original 64 to 128 given that is new normal for next-generation machines. The original idea and patch came from Daniel Black which is now ported to 10.6 with some improvement and adjustment.	2021-03-12 12:56:23 +02:00
Marko Mäkelä	7a4fbb55b0	MDEV-25105 Remove innodb_checksum_algorithm values none,innodb,... Historically, InnoDB supported a buggy page checksum algorithm that did not compute a checksum over the full page. Later, well before MySQL 4.1 introduced .ibd files and the innodb_file_per_table option, the algorithm was corrected and the first 4 bytes of each page were redefined to be a checksum. The original checksum was so slow that an option to disable page checksum was introduced for benchmarketing purposes. The Intel Nehalem microarchitecture introduced the SSE4.2 instruction set extension, which includes instructions for faster computation of CRC-32C. In MySQL 5.6 (and MariaDB 10.0), innodb_checksum_algorithm=crc32 was implemented to make of that. As that option was changed to be the default in MySQL 5.7, a bug was found on big-endian platforms and some work-around code was added to weaken that checksum further. MariaDB disables that work-around by default since MDEV-17958. Later, SIMD-accelerated CRC-32C has been implemented in MariaDB for POWER and ARM and also for IA-32/AMD64, making use of carry-less multiplication where available. Long story short, innodb_checksum_algorithm=crc32 is faster and more secure than the pre-MySQL 5.6 checksum, called innodb_checksum_algorithm=innodb. It should have removed any need to use innodb_checksum_algorithm=none. The setting innodb_checksum_algorithm=crc32 is the default in MySQL 5.7 and MariaDB Server 10.2, 10.3, 10.4. In MariaDB 10.5, MDEV-19534 made innodb_checksum_algorithm=full_crc32 the default. It is even faster and more secure. The default settings in MariaDB do allow old data files to be read, no matter if a worse checksum algorithm had been used. (Unfortunately, before innodb_checksum_algorithm=full_crc32, the data files did not identify which checksum algorithm is being used.) The non-default settings innodb_checksum_algorithm=strict_crc32 or innodb_checksum_algorithm=strict_full_crc32 would only allow CRC-32C checksums. The incompatibility with old data files is why they are not the default. The newest server not to support innodb_checksum_algorithm=crc32 were MySQL 5.5 and MariaDB 5.5. Both have reached their end of life. A valid reason for using innodb_checksum_algorithm=innodb could have been the ability to downgrade. If it is really needed, data files can be converted with an older version of the innochecksum utility. Because there is no good reason to allow data files to be written with insecure checksums, we will reject those option values: innodb_checksum_algorithm=none innodb_checksum_algorithm=innodb innodb_checksum_algorithm=strict_none innodb_checksum_algorithm=strict_innodb Furthermore, the following innochecksum options will be removed, because only strict crc32 will be supported: innochecksum --strict-check=crc32 innochecksum -C crc32 innochecksum --write=crc32 innochecksum -w crc32 If a user wishes to convert a data file to use a different checksum (so that it might be used with the no-longer-supported MySQL 5.5 or MariaDB 5.5, which do not support IMPORT TABLESPACE nor system tablespace format changes that were made in MariaDB 10.3), then the innochecksum tool from MariaDB 10.2, 10.3, 10.4, 10.5 or MySQL 5.7 can be used. Reviewed by: Thirunarayanan Balathandayuthapani	2021-03-11 12:46:18 +02:00
Marko Mäkelä	584e52118c	MDEV-20612 fixup: Make comments refer to lock_sys.latch	2021-02-17 12:18:03 +02:00
Sergei Golubchik	00a313ecf3	Merge branch 'bb-10.3-release' into bb-10.4-release Note, the fix for "MDEV-23328 Server hang due to Galera lock conflict resolution" was null-merged. 10.4 version of the fix is coming up separately	2021-02-12 17:44:22 +01:00
Marko Mäkelä	5f46385764	MDEV-24731 Excessive mutex contention in DeadlockChecker::check_and_resolve() The DeadlockChecker expects to be able to freeze the waits-for graph. Hence, it is best executed somewhere where we are not holding any additional mutexes. lock_wait(): Defer the deadlock check to this function, instead of executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting(). DeadlockChecker::trx_rollback(): Merge with the only caller, check_and_resolve(). LockMutexGuard: RAII accessor for lock_sys.mutex. lock_sys.deadlocks: Replaces lock_deadlock_found. trx_t: Clean up some comments.	2021-02-04 16:38:07 +02:00
Sergei Golubchik	60ea09eae6	Merge branch '10.2' into 10.3	2021-02-01 13:49:33 +01:00
Vladislav Vaintroub	d8373fea5f	MDEV-24685 - remove IO thread states output from SHOW ENGINE INNODB STATUS There are no IO threads anymore.	2021-01-29 18:02:14 +02:00
Marko Mäkelä	e71e613353	MDEV-24671: Replace lock_wait_timeout_task with mysql_cond_timedwait() lock_wait(): Replaces lock_wait_suspend_thread(). Wait for the lock to be granted or the transaction to be killed using mysql_cond_timedwait() or mysql_cond_wait(). lock_wait_end(): Replaces que_thr_end_lock_wait() and lock_wait_release_thread_if_suspended(). lock_wait_timeout_task: Remove. The operating system kernel will resume the mysql_cond_timedwait() in lock_wait(). An added benefit is that innodb_lock_wait_timeout no longer has a 'jitter' of 1 second, which was caused by this wake-up task waking up only once per second, and then waking up any threads for which the timeout (which was only measured in seconds) was exceeded. innobase_kill_query(): Set trx->error_state=DB_INTERRUPTED, so that a call trx_is_interrupted(trx) in lock_wait() can be avoided. We will protect things more consistently with lock_sys.wait_mutex, which will be moved below lock_sys.mutex in the latching order. trx_lock_t::cond: Condition variable for !wait_lock, used with lock_sys.wait_mutex. srv_slot_t: Remove. Replaced by trx_lock_t::cond, lock_grant_after_reset(): Merged to to lock_grant(). lock_rec_get_index_name(): Remove. lock_sys_t: Introduce wait_pending, wait_count, wait_time, wait_time_max that are protected by wait_mutex. trx_lock_t::que_state: Remove. que_thr_state_t: Remove QUE_THR_COMMAND_WAIT, QUE_THR_LOCK_WAIT. que_thr_t: Remove is_active, start_running(), stop_no_error(). que_fork_t::n_active_thrs, trx_lock_t::n_active_thrs: Remove.	2021-01-27 15:45:39 +02:00
Vladislav Vaintroub	c310f4c381	MDEV-24685 - remove IO thread states output from SHOW ENGINE INNODB STATUS There are no IO threads anymore.	2021-01-27 14:17:43 +01:00
Marko Mäkelä	6d05a95c65	Merge 10.5 into 10.6	2021-01-14 16:13:31 +02:00
Marko Mäkelä	e4205fba7c	MDEV-24536 innodb_idle_flush_pct has no effect The parameter innodb_idle_flush_pct that was introduced in MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since the InnoDB changes from MySQL 5.7.9 were applied in commit `2e814d4702`. Let us declare the parameter as MARIADB_REMOVED_OPTION. For earlier versions, commit `ea9cd97f85` declared the parameter deprecated.	2021-01-13 19:11:31 +02:00
Marko Mäkelä	ea9cd97f85	MDEV-24536 innodb_idle_flush_pct has no effect The parameter innodb_idle_flush_pct that was introduced in MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since the InnoDB changes from MySQL 5.7.9 were applied in commit `2e814d4702`. Let us declare the parameter as deprecated and having no effect.	2021-01-13 18:55:56 +02:00
Marko Mäkelä	666565c7f0	Merge 10.5 into 10.6	2021-01-11 17:32:08 +02:00
Marko Mäkelä	fae87e0c74	MDEV-24544 innodb_buffer_pool_wait_free is not protected by mutex buf_LRU_get_free_block(): Increment the counter after acquiring buf_pool.mutex. buf_pool.stat.LRU_waits: Replaces export_vars.innodb_buffer_pool_wait_free and srv_stats.buf_pool_wait_free. Reads of the counter will remain dirty, not acquiring the mutex.	2021-01-07 11:18:13 +02:00
Marko Mäkelä	92abdcca5a	Merge 10.5 into 10.6	2021-01-07 09:08:09 +02:00
Marko Mäkelä	a993310593	MDEV-24537 innodb_max_dirty_pages_pct_lwm=0 lost its special meaning In commit `3a9a3be1c6` (MDEV-23855) some previous logic was replaced with the condition dirty_pct < srv_max_dirty_pages_pct_lwm, which caused the default value of the parameter innodb_max_dirty_pages_pct_lwm=0 to lose its special meaning: 'refer to innodb_max_dirty_pages_pct instead'. This implicit special meaning was visible in the function af_get_pct_for_dirty(), which was removed in commit `f0c295e2de` (MDEV-24369). page_cleaner_flush_pages_recommendation(): Restore the special meaning that was removed in MDEV-24369. buf_flush_page_cleaner(): If srv_max_dirty_pages_pct_lwm==0.0, refer to srv_max_buf_pool_modified_pct. This fixes the observed performance regression due to excessive page flushing. buf_pool_t::page_cleaner_wakeup(): Revise the wakeup condition. innodb_init(): Do initialize srv_max_io_capacity in Mariabackup. It was previously constantly 0, which caused mariadb-backup --prepare to hang in buf_flush_sync(), making no progress.	2021-01-06 13:53:14 +02:00
Marko Mäkelä	ff5d306e29	MDEV-21452: Replace ib_mutex_t with mysql_mutex_t SHOW ENGINE INNODB MUTEX functionality is completely removed, as are the InnoDB latching order checks. We will enforce innodb_fatal_semaphore_wait_threshold only for dict_sys.mutex and lock_sys.mutex. dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex. lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex. FIXME: srv_sys should be removed altogether; it is duplicating tpool functionality. fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must not hold fil_system.mutex. fil_close_all_files(): To prevent SAFE_MUTEX warnings for fil_space_destroy_crypt_data(), we must not hold fil_system.mutex while invoking fil_space_free_low() on a detached tablespace.	2020-12-15 17:56:18 +02:00
Marko Mäkelä	db006a9a43	MDEV-21452: Remove os_event_t, MUTEX_EVENT, TTASEventMutex, sync_array We will default to MUTEXTYPE=sys (using OSTrackMutex) for those ib_mutex_t that have not been replaced yet. The view INFORMATION_SCHEMA.INNODB_SYS_SEMAPHORE_WAITS is removed. The parameter innodb_sync_array_size is removed. FIXME: innodb_fatal_semaphore_wait_threshold will no longer be enforced. We should enforce it for lock_sys.mutex and dict_sys.mutex somehow! innodb_sync_debug=ON might still cover ib_mutex_t.	2020-12-15 17:56:17 +02:00
Marko Mäkelä	38fd7b7d91	MDEV-21452: Replace all direct use of os_event_t Let us replace os_event_t with mysql_cond_t, and replace the necessary ib_mutex_t with mysql_mutex_t so that they can be used with condition variables. Also, let us replace polling (os_thread_sleep() or timed waits) with plain mysql_cond_wait() wherever possible. Furthermore, we will use the lightweight srw_mutex for trx_t::mutex, to hopefully reduce contention on lock_sys.mutex. FIXME: Add test coverage of mariabackup --backup --kill-long-queries-timeout	2020-12-15 17:56:17 +02:00
Marko Mäkelä	f24b738318	MDEV-24313 (2 of 2): Silently ignored innodb_use_native_aio=1 In commit `5e62b6a5e0` (MDEV-16264) the logic of os_aio_init() was changed so that it will never fail, but instead automatically disable innodb_use_native_aio (which is enabled by default) if the io_setup() system call would fail due to resource limits being exceeded. This is questionable, especially because falling back to simulated AIO may lead to significantly reduced performance. srv_n_file_io_threads, srv_n_read_io_threads, srv_n_write_io_threads: Change the data type from ulong to uint. os_aio_init(): Remove the parameters, and actually return an error code. thread_pool::configure_aio(): Do not silently fall back to simulated AIO. Reviewed by: Vladislav Vaintroub	2020-12-14 15:27:03 +02:00
Marko Mäkelä	83591a23d6	MDEV-24350 buf_dblwr unnecessarily uses memory-intensive srv_stats counters The counters in srv_stats use std::atomic and multiple cache lines per counter. This is an overkill in a case where a critical section already exists in the code. A regular variable will work just fine, with much smaller memory bus impact.	2020-12-04 17:52:23 +02:00
Marko Mäkelä	657fcdf430	MDEV-24280 InnoDB triggers too many independent periodic tasks A side effect of MDEV-16264 is that a large number of threads will be created at server startup, to be destroyed after a minute or two. One source of such thread creation is srv_start_periodic_timer(). InnoDB is creating 3 periodic tasks: srv_master_callback (1Hz) srv_error_monitor_task (1Hz), and srv_monitor_task (0.2Hz). It appears that we can merge srv_error_monitor_task and srv_monitor_task and have them invoked 4 times per minute (every 15 seconds). This will affect our ability to enforce innodb_fatal_semaphore_wait_threshold and some computations around BUF_LRU_STAT_N_INTERVAL. We could remove srv_master_callback along with the DROP TABLE queue at some point of time in the future. We must keep it independent of the innodb_fatal_semaphore_wait_threshold detection, because the background DROP TABLE queue could get stuck due to dict_sys being locked by another thread. For now, srv_master_callback must be invoked once per second, so that innodb_flush_log_at_timeout=1 can work. BUF_LRU_STAT_N_INTERVAL: Reduce the precision and extend the time from 501 second to 415 seconds. srv_error_monitor_timer: Remove. MAX_MUTEX_NOWAIT: Increase from 201 second to 215 seconds. srv_refresh_innodb_monitor_stats(): Avoid a repeated call to time(NULL). Change the interval to less than 60 seconds. srv_monitor(): Renamed from srv_monitor_task. srv_monitor_task(): Renamed from srv_error_monitor_task(). Invoked only once in 15 seconds. Invoke also srv_monitor(). Increase the fatal_cnt threshold from 101 second to 115 seconds. sync_array_print_long_waits_low(): Invoke time(NULL) only once. Remove a bogus message about printouts for 30 seconds. Those printouts were effectively already disabled in MDEV-16264 (commit `5e62b6a5e0`).	2020-11-25 16:54:00 +02:00
Marko Mäkelä	3a9a3be1c6	MDEV-23855: Improve InnoDB log checkpoint performance After MDEV-15053, MDEV-22871, MDEV-23399 shifted the scalability bottleneck, log checkpoints became a new bottleneck. If innodb_io_capacity is set low or innodb_max_dirty_pct_lwm is set high and the workload fits in the buffer pool, the page cleaner thread will perform very little flushing. When we reach the capacity of the circular redo log file ib_logfile0 and must initiate a checkpoint, some 'furious flushing' will be necessary. (If innodb_flush_sync=OFF, then flushing would continue at the innodb_io_capacity rate, and writers would be throttled.) We have the best chance of advancing the checkpoint LSN immediately after a page flush batch has been completed. Hence, it is best to perform checkpoints after every batch in the page cleaner thread, attempting to run once per second. By initiating high-priority flushing in the page cleaner as early as possible, we aim to make the throughput more stable. The function buf_flush_wait_flushed() used to sleep for 10ms, hoping that the page cleaner thread would do something during that time. The observed end result was that a large number of threads that call log_free_check() would end up sleeping while nothing useful is happening. We will revise the design so that in the default innodb_flush_sync=ON mode, buf_flush_wait_flushed() will wake up the page cleaner thread to perform the necessary flushing, and it will wait for a signal from the page cleaner thread. If innodb_io_capacity is set to a low value (causing the page cleaner to throttle its work), a write workload would initially perform well, until the capacity of the circular ib_logfile0 is reached and log_free_check() will trigger checkpoints. At that point, the extra waiting in buf_flush_wait_flushed() will start reducing throughput. The page cleaner thread will also initiate log checkpoints after each buf_flush_lists() call, because that is the best point of time for the checkpoint LSN to advance by the maximum amount. Even in 'furious flushing' mode we invoke buf_flush_lists() with innodb_io_capacity_max pages at a time, and at the start of each batch (in the log_flush() callback function that runs in a separate task) we will invoke os_aio_wait_until_no_pending_writes(). This tweak allows the checkpoint to advance in smaller steps and significantly reduces the maximum latency. On an Intel Optane 960 NVMe SSD on Linux, it reduced from 4.6 seconds to 74 milliseconds. On Microsoft Windows with a slower SSD, it reduced from more than 180 seconds to 0.6 seconds. We will make innodb_adaptive_flushing=OFF simply flush innodb_io_capacity per second whenever the dirty proportion of buffer pool pages exceeds innodb_max_dirty_pages_pct_lwm. For innodb_adaptive_flushing=ON we try to make page_cleaner_flush_pages_recommendation() more consistent and predictable: if we are below innodb_adaptive_flushing_lwm, let us flush pages according to the return value of af_get_pct_for_dirty(). innodb_max_dirty_pages_pct_lwm: Revert the change of the default value that was made in MDEV-23399. The value innodb_max_dirty_pages_pct_lwm=0 guarantees that a shutdown of an idle server will be fast. Users might be surprised if normal shutdown suddenly became slower when upgrading within a GA release series. innodb_checkpoint_usec: Remove. The master task will no longer perform periodic log checkpoints. It is the duty of the page cleaner thread. log_sys.max_modified_age: Remove. The current span of the buf_pool.flush_list expressed in LSN only matters for adaptive flushing (outside the 'furious flushing' condition). For the correctness of checkpoints, the only thing that matters is the checkpoint age (log_sys.lsn - log_sys.last_checkpoint_lsn). This run-time constant was also reported as log_max_modified_age_sync. log_sys.max_checkpoint_age_async: Remove. This does not serve any purpose, because the checkpoints will now be triggered by the page cleaner thread. We will retain the log_sys.max_checkpoint_age limit for engaging 'furious flushing'. page_cleaner.slot: Remove. It turns out that page_cleaner_slot.flush_list_time was duplicating page_cleaner.slot.flush_time and page_cleaner.slot.flush_list_pass was duplicating page_cleaner.flush_pass. Likewise, there were some redundant monitor counters, because the page cleaner thread no longer performs any buf_pool.LRU flushing, and because there only is one buf_flush_page_cleaner thread. buf_flush_sync_lsn: Protect writes by buf_pool.flush_list_mutex. buf_pool_t::get_oldest_modification(): Add a parameter to specify the return value when no persistent data pages are dirty. Require the caller to hold buf_pool.flush_list_mutex. log_buf_pool_get_oldest_modification(): Take the fall-back LSN as a parameter. All callers will also invoke log_sys.get_lsn(). log_preflush_pool_modified_pages(): Replaced with buf_flush_wait_flushed(). buf_flush_wait_flushed(): Implement two limits. If not enough buffer pool has been flushed, signal the page cleaner (unless innodb_flush_sync=OFF) and wait for the page cleaner to complete. If the page cleaner thread is not running (which can be the case durign shutdown), initiate the flush and wait for it directly. buf_flush_ahead(): If innodb_flush_sync=ON (the default), submit a new buf_flush_sync_lsn target for the page cleaner but do not wait for the flushing to finish. log_get_capacity(), log_get_max_modified_age_async(): Remove, to make it easier to see that af_get_pct_for_lsn() is not acquiring any mutexes. page_cleaner_flush_pages_recommendation(): Protect all access to buf_pool.flush_list with buf_pool.flush_list_mutex. Previously there were some race conditions in the calculation. buf_flush_sync_for_checkpoint(): New function to process buf_flush_sync_lsn in the page cleaner thread. At the end of each batch, we try to wake up any blocked buf_flush_wait_flushed(). If everything up to buf_flush_sync_lsn has been flushed, we will reset buf_flush_sync_lsn=0. The page cleaner thread will keep 'furious flushing' until the limit is reached. Any threads that are waiting in buf_flush_wait_flushed() will be able to resume as soon as their own limit has been satisfied. buf_flush_page_cleaner: Prioritize buf_flush_sync_lsn and do not sleep as long as it is set. Do not update any page_cleaner statistics for this special mode of operation. In the normal mode (buf_flush_sync_lsn is not set for innodb_flush_sync=ON), try to wake up once per second. No longer check whether srv_inc_activity_count() has been called. After each batch, try to perform a log checkpoint, because the best chances for the checkpoint LSN to advance by the maximum amount are upon completing a flushing batch. log_t: Move buf_free, max_buf_free possibly to the same cache line with log_sys.mutex. log_margin_checkpoint_age(): Simplify the logic, and replace a 0.1-second sleep with a call to buf_flush_wait_flushed() to initiate flushing. Moved to the same compilation unit with the only caller. log_close(): Clean up the calculations. (Should be no functional change.) Return whether flush-ahead is needed. Moved to the same compilation unit with the only caller. mtr_t::finish_write(): Return whether flush-ahead is needed. mtr_t::commit(): Invoke buf_flush_ahead() when needed. Let us avoid external calls in mtr_t::commit() and make the logic easier to follow by having related code in a single compilation unit. Also, we will invoke srv_stats.log_write_requests.inc() only once per mini-transaction commit, while not holding mutexes. log_checkpoint_margin(): Only care about log_sys.max_checkpoint_age. Upon reaching log_sys.max_checkpoint_age where we must wait to prevent the log from getting corrupted, let us wait for at most 1MiB of LSN at a time, before rechecking the condition. This should allow writers to proceed even if the redo log capacity has been reached and 'furious flushing' is in progress. We no longer care about log_sys.max_modified_age_sync or log_sys.max_modified_age_async. The log_sys.max_modified_age_sync could be a relic from the time when there was a srv_master_thread that wrote dirty pages to data files. Also, we no longer have any log_sys.max_checkpoint_age_async limit, because log checkpoints will now be triggered by the page cleaner thread upon completing buf_flush_lists(). log_set_capacity(): Simplify the calculations of the limit (no functional change). log_checkpoint_low(): Split from log_checkpoint(). Moved to the same compilation unit with the caller. log_make_checkpoint(): Only wait for everything to be flushed until the current LSN. create_log_file(): After checkpoint, invoke log_write_up_to() to ensure that the FILE_CHECKPOINT record has been written. This avoids ut_ad(!srv_log_file_created) in create_log_file_rename(). srv_start(): Do not call recv_recovery_from_checkpoint_start() if the log has just been created. Set fil_system.space_id_reuse_warned before dict_boot() has been executed, and clear it after recovery has finished. dict_boot(): Initialize fil_system.max_assigned_id. srv_check_activity(): Remove. The activity count is counting transaction commits and therefore mostly interesting for the purge of history. BtrBulk::insert(): Do not explicitly wake up the page cleaner, but do invoke srv_inc_activity_count(), because that counter is still being used in buf_load_throttle_if_needed() for some heuristics. (It might be cleaner to execute buf_load() in the page cleaner thread!) Reviewed by: Vladislav Vaintroub	2020-10-26 17:09:01 +02:00
Marko Mäkelä	61161d51d7	Cleanup: Remove export_vars.innodb_num_open_files	2020-10-15 17:06:16 +03:00
Marko Mäkelä	7cffb5f6e8	MDEV-23399: Performance regression with write workloads The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted the performance bottleneck to the page flushing. The configuration parameters will be changed as follows: innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction) innodb_lru_scan_depth=1536 (old: 1024) innodb_max_dirty_pages_pct=90 (old: 75) innodb_max_dirty_pages_pct_lwm=75 (old: 0) Note: The parameter innodb_lru_scan_depth will only affect LRU eviction of buffer pool pages when a new page is being allocated. The page cleaner thread will no longer evict any pages. It used to guarantee that some pages will remain free in the buffer pool. Now, we perform that eviction 'on demand' in buf_LRU_get_free_block(). The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows: * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks() * As a buf_pool.free limit in buf_LRU_list_batch() for terminating the flushing that is initiated e.g., by buf_LRU_get_free_block() The parameter also used to serve as an initial limit for unzip_LRU eviction (evicting uncompressed page frames while retaining ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit of 100 or unlimited for invoking buf_LRU_scan_and_free_block(). The status variables will be changed as follows: innodb_buffer_pool_pages_flushed: This includes also the count of innodb_buffer_pool_pages_LRU_flushed and should work reliably, updated one by one in buf_flush_page() to give more real-time statistics. The function buf_flush_stats(), which we are removing, was not called in every code path. For both counters, we will use regular variables that are incremented in a critical section of buf_pool.mutex. Note that show_innodb_vars() directly links to the variables, and reads of the counters will not be protected by buf_pool.mutex, so you cannot get a consistent snapshot of both variables. The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed, because the page cleaner no longer deals with writing or evicting least recently used pages, and because the single-page writes have been removed: * buffer_LRU_batch_flush_avg_time_slot * buffer_LRU_batch_flush_avg_time_thread * buffer_LRU_batch_flush_avg_time_est * buffer_LRU_batch_flush_avg_pass * buffer_LRU_single_flush_scanned * buffer_LRU_single_flush_num_scan * buffer_LRU_single_flush_scanned_per_call When moving to a single buffer pool instance in MDEV-15058, we missed some opportunity to simplify the buf_flush_page_cleaner thread. It was unnecessarily using a mutex and some complex data structures, even though we always have a single page cleaner thread. Furthermore, the buf_flush_page_cleaner thread had separate 'recovery' and 'shutdown' modes where it was waiting to be triggered by some other thread, adding unnecessary latency and potential for hangs in relatively rarely executed startup or shutdown code. The page cleaner was also running two kinds of batches in an interleaved fashion: "LRU flush" (writing out some least recently used pages and evicting them on write completion) and the normal batches that aim to increase the MIN(oldest_modification) in the buffer pool, to help the log checkpoint advance. The buf_pool.flush_list flushing was being blocked by buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN of a page is ahead of log_sys.get_flushed_lsn(), that is, what has been persistently written to the redo log, we would trigger a log flush and then resume the page flushing. This would unnecessarily limit the performance of the page cleaner thread and trigger the infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms. The settings might not be optimal" that were suppressed in commit `d1ab89037a` unless log_warnings>2. Our revised algorithm will make log_sys.get_flushed_lsn() advance at the start of buf_flush_lists(), and then execute a 'best effort' to write out all pages. The flush batches will skip pages that were modified since the log was written, or are are currently exclusively locked. The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message will be removed, because by design, the buf_flush_page_cleaner() should not be blocked during a batch for extended periods of time. We will remove the single-page flushing altogether. Related to this, the debug parameter innodb_doublewrite_batch_size will be removed, because all of the doublewrite buffer will be used for flushing batches. If a page needs to be evicted from the buffer pool and all 100 least recently used pages in the buffer pool have unflushed changes, buf_LRU_get_free_block() will execute buf_flush_lists() to write out and evict innodb_lru_flush_size pages. At most one thread will execute buf_flush_lists() in buf_LRU_get_free_block(); other threads will wait for that LRU flushing batch to finish. To improve concurrency, we will replace the InnoDB ib_mutex_t and os_event_t native mutexes and condition variables in this area of code. Most notably, this means that the buffer pool mutex (buf_pool.mutex) is no longer instrumented via any InnoDB interfaces. It will continue to be instrumented via PERFORMANCE_SCHEMA. For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical sections of buf_pool.flush_list_mutex should be shorter than those for buf_pool.mutex, because in the worst case, they cover a linear scan of buf_pool.flush_list, while the worst case of a critical section of buf_pool.mutex covers a linear scan of the potentially much longer buf_pool.LRU list. mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable with SAFE_MUTEX. Some InnoDB debug assertions need this predicate instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner(). buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list: Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[]. The number of active flush operations. buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA and SAFE_MUTEX instrumentation. buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU. buf_pool_t::done_flush_list: Condition variable for !n_flush_list. buf_pool_t::do_flush_list: Condition variable to wake up the buf_flush_page_cleaner when a log checkpoint needs to be written or the server is being shut down. Replaces buf_flush_event. We will keep using timed waits (the page cleaner thread will wake _at least_ once per second), because the calculations for innodb_adaptive_flushing depend on fixed time intervals. buf_dblwr: Allocate statically, and move all code to member functions. Use a native mutex and condition variable. Remove code to deal with single-page flushing. buf_dblwr_check_block(): Make the check debug-only. We were spending a significant amount of execution time in page_simple_validate_new(). flush_counters_t::unzip_LRU_evicted: Remove. IORequest: Make more members const. FIXME: m_fil_node should be removed. buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex (which we are removing). page_cleaner_slot_t, page_cleaner_t: Remove many redundant members. pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot(). recv_writer_thread: Remove. Recovery works just fine without it, if we simply invoke buf_flush_sync() at the end of each batch in recv_sys_t::apply(). recv_recovery_from_checkpoint_finish(): Remove. We can simply call recv_sys.debug_free() directly. srv_started_redo: Replaces srv_start_state. SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown() can communicate with the normal page cleaner loop via the new function flush_buffer_pool(). buf_flush_remove(): Assert that the calling thread is holding buf_pool.flush_list_mutex. This removes unnecessary mutex operations from buf_flush_remove_pages() and buf_flush_dirty_pages(), which replace buf_LRU_flush_or_remove_pages(). buf_flush_lists(): Renamed from buf_flush_batch(), with simplified interface. Return the number of flushed pages. Clarified comments and renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this function, which was their only caller, and remove 2 unnecessary buf_pool.mutex release/re-acquisition that we used to perform around the buf_flush_batch() call. At the start, if not all log has been durably written, wait for a background task to do it, or start a new task to do it. This allows the log write to run concurrently with our page flushing batch. Any pages that were skipped due to too recent FIL_PAGE_LSN or due to them being latched by a writer should be flushed during the next batch, unless there are further modifications to those pages. It is possible that a page that we must flush due to small oldest_modification also carries a recent FIL_PAGE_LSN or is being constantly modified. In the worst case, all writers would then end up waiting in log_free_check() to allow the flushing and the checkpoint to complete. buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_flush_space(): Auxiliary function to look up a tablespace for page flushing. buf_flush_page(): Defer the computation of space->full_crc32(). Never call log_write_up_to(), but instead skip persistent pages whose latest modification (FIL_PAGE_LSN) is newer than the redo log. Also skip pages on which we cannot acquire a shared latch without waiting. buf_flush_try_neighbors(): Do not bother checking buf_fix_count because buf_flush_page() will no longer wait for the page latch. Take the tablespace as a parameter, and only execute this function when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold(). buf_flush_relocate_on_flush_list(): Declare as cold, and push down a condition from the callers. buf_flush_check_neighbor(): Take id.fold() as a parameter. buf_flush_sync(): Ensure that the buf_pool.flush_list is empty, because the flushing batch will skip pages whose modifications have not yet been written to the log or were latched for modification. buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables. buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize the counters, and report n->evicted. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_do_LRU_batch(): Return the number of pages flushed. buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if adaptive hash index entries are pointing to the block. buf_LRU_get_free_block(): Do not wake up the page cleaner, because it will no longer perform any useful work for us, and we do not want it to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0) writes out and evicts at most innodb_lru_flush_size pages. (The function buf_do_LRU_batch() may complete after writing fewer pages if more than innodb_lru_scan_depth pages end up in buf_pool.free list.) Eliminate some mutex release-acquire cycles, and wait for the LRU flush batch to complete before rescanning. buf_LRU_check_size_of_non_data_objects(): Simplify the code. buf_page_write_complete(): Remove the parameter evict, and always evict pages that were part of an LRU flush. buf_page_create(): Take a pre-allocated page as a parameter. buf_pool_t::free_block(): Free a pre-allocated block. recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block while not holding recv_sys.mutex. During page allocation, we may initiate a page flush, which in turn may initiate a log flush, which would require acquiring log_sys.mutex, which should always be acquired before recv_sys.mutex in order to avoid deadlocks. Therefore, we must not be holding recv_sys.mutex while allocating a buffer pool block. BtrBulk::logFreeCheck(): Skip a redundant condition. row_undo_step(): Do not invoke srv_inc_activity_count() for every row that is being rolled back. It should suffice to invoke the function in trx_flush_log_if_needed() during trx_t::commit_in_memory() when the rollback completes. sync_check_enable(): Remove. We will enable innodb_sync_debug from the very beginning. Reviewed by: Vladislav Vaintroub	2020-10-15 17:04:56 +03:00
Marko Mäkelä	b535a79044	MDEV-23399: Remove recv_writer_thread Recovery works just fine without a separate thread whose only task is to tell the page cleaner thread to do its job. recv_sys_t::apply(): Flush the buffer pool at the end of each batch. Reviewed by: Vladislav Vaintroub	2020-10-15 10:33:20 +03:00
Marko Mäkelä	b15224c71a	MDEV-16264 fixup: Remove unused declarations Remove stale references to srv_sys.sys_threads and srv_sys.mutex that were removed in commit `5e62b6a5e0`.	2020-10-01 14:35:42 +03:00
Marko Mäkelä	a9550c47e4	MDEV-16264 fixup: Remove unused code and data LATCH_ID_OS_AIO_READ_MUTEX, LATCH_ID_OS_AIO_WRITE_MUTEX, LATCH_ID_OS_AIO_LOG_MUTEX, LATCH_ID_OS_AIO_IBUF_MUTEX, LATCH_ID_OS_AIO_SYNC_MUTEX: Remove. The tpool is not instrumented. lock_set_timeout_event(): Remove. srv_sys_mutex_key, srv_sys_t::mutex, SYNC_THREADS: Remove. srv_slot_t::suspended: Remove. We only ever assigned this data member true, so it is redundant. ib_wqueue_wait(), ib_wqueue_timedwait(): Remove. os_thread_join(): Remove. os_thread_create(), os_thread_exit(): Remove redundant parameters. These were missed in commit `5e62b6a5e0`.	2020-09-30 14:28:11 +03:00
Marko Mäkelä	8bc4ebed67	MDEV-21534 fixup: Remove traces of removed log_sys.write_mutex This was missed in commit `30ea63b7d2`.	2020-09-30 13:48:20 +03:00
Marko Mäkelä	363da42190	Cleanup: Remove unnecessary declarations	2020-09-30 13:37:15 +03:00
Marko Mäkelä	2fa9f8c53a	Merge 10.3 into 10.4	2020-08-20 11:01:47 +03:00
Marko Mäkelä	de0e7cd72a	Merge 10.2 into 10.3	2020-08-20 09:12:16 +03:00
Marko Mäkelä	309302a3da	MDEV-23475 InnoDB performance regression for write-heavy workloads In commit `fe39d02f51` (MDEV-20638) we removed some wake-up signaling of the master thread that should have been there, to ensure a steady log checkpointing workload. Common sense suggests that the commit omitted some necessary calls to srv_inc_activity_count(). But, an attempt to add the call to trx_flush_log_if_needed_low() as well as to reinstate the function innobase_active_small() did not restore the performance for the case where sync_binlog=1 is set. Therefore, we will revert the entire commit in MariaDB Server 10.2. In MariaDB Server 10.5, adding a srv_inc_activity_count() call to trx_flush_log_if_needed_low() did restore the performance, so we will not revert MDEV-20638 across all versions.	2020-08-19 11:18:56 +03:00
Marko Mäkelä	bbd70fcc43	MDEV-23379 Deprecate&ignore InnoDB concurrency throttling parameters The parameters innodb_thread_concurrency and innodb_commit_concurrency were useful years ago when both computing resources and the implementation of some shared data structures were limited. MySQL 5.0 or 5.1 had trouble scaling beyond 8 concurrent connections. Most of the scalability bottlenecks have been removed since then, and the transactions per second delivered by MariaDB Server 10.5 should not dramatically drop upon exceeding the 'optimal' number of connections. Hence, enabling any concurrency throttling for InnoDB actually makes things worse. We have seen many customers mistakenly setting this to a small value like 16 or 64 and then complaining the server was slow. Ignoring the parameters allows us to remove some normally unused code and data structures, which could slightly improve performance. innodb_thread_concurrency, innodb_commit_concurrency, innodb_replication_delay, innodb_concurrency_tickets, innodb_thread_sleep_delay, innodb_adaptive_max_sleep_delay: Deprecate and ignore; hard-wire to 0. The column INFORMATION_SCHEMA.INNODB_TRX.trx_concurrency_tickets will always report 0.	2020-08-04 06:59:29 +03:00
Marko Mäkelä	50a11f396a	Merge 10.4 into 10.5	2020-08-01 14:42:51 +03:00
Marko Mäkelä	9216114ce7	Merge 10.3 into 10.4	2020-07-31 18:09:08 +03:00
Marko Mäkelä	66ec3a770f	Merge 10.2 into 10.3	2020-07-31 13:51:28 +03:00
Thirunarayanan Balathandayuthapani	fe39d02f51	MDEV-20638 Remove the deadcode from srv_master_thread() and srv_active_wake_master_thread_low() - Due to commit `fe95cb2e40` (MDEV-16125), InnoDB master thread does not need to call srv_resume_thread() and therefore there is no need to wake up the thread. Due to the above patch, InnoDB should remove the following dead code. srv_check_activity(): Makes the parameter as in,out and returns the recent activity value innobase_active_small(): Removed srv_active_wake_master_thread(): Removed srv_wake_master_thread(): Removed srv_active_wake_master_thread_low(): Removed Simplify srv_master_thread() and remove switch cases, added the assert. Replace srv_wake_master_thread() with srv_inc_activity_count() INNOBASE_WAKE_INTERVAL: Removed	2020-07-23 16:23:20 +05:30
Marko Mäkelä	5155a300fa	MDEV-22871: Reduce InnoDB buf_pool.page_hash contention The rw_lock_s_lock() calls for the buf_pool.page_hash became a clear bottleneck after MDEV-15053 reduced the contention on buf_pool.mutex. We will replace that use of rw_lock_t with a special implementation that is optimized for memory bus traffic. The hash_table_locks instrumentation will be removed. buf_pool_t::page_hash: Use a special implementation whose API is compatible with hash_table_t, and store the custom rw-locks directly in buf_pool.page_hash.array, intentionally sharing cache lines with the hash table pointers. rw_lock: A low-level rw-lock implementation based on std::atomic<uint32_t> where read_trylock() becomes a simple fetch_add(1). buf_pool_t::page_hash_latch: The special of rw_lock for the page_hash. buf_pool_t::page_hash_latch::read_lock(): Assert that buf_pool.mutex is not being held by the caller. buf_pool_t::page_hash_latch::write_lock() may be called while not holding buf_pool.mutex. buf_pool_t::watch_set() is such a caller. buf_pool_t::page_hash_latch::read_lock_wait(), page_hash_latch::write_lock_wait(): The spin loops. These will obey the global parameters innodb_sync_spin_loops and innodb_sync_spin_wait_delay. buf_pool_t::freed_page_hash: A singly linked list of copies of buf_pool.page_hash that ever existed. The fact that we never free any buf_pool.page_hash.array guarantees that all page_hash_latch that ever existed will remain valid until shutdown. buf_pool_t::resize_hash(): Replaces buf_pool_resize_hash(). Prepend a shallow copy of the old page_hash to freed_page_hash. buf_pool_t::page_hash_table::n_cells: Declare as Atomic_relaxed. buf_pool_t::page_hash_table::lock(): Explain what prevents a race condition with buf_pool_t::resize_hash().	2020-06-18 14:16:01 +03:00

1 2 3 4 5 ...

347 commits