mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-17 04:22:27 +01:00

Author	SHA1	Message	Date
Marko Mäkelä	37492960f3	Merge 10.5 into 10.6	2023-05-19 12:24:58 +03:00
Marko Mäkelä	e0084b9d31	MDEV-31234 InnoDB does not free UNDO after the fix of MDEV-30671 trx_purge_truncate_history(): Only call trx_purge_truncate_rseg_history() if the rollback segment is safe to process. This will avoid leaking undo log pages that are not yet ready to be processed. This fixes a regression that was introduced in commit `0de3be8cfd` (MDEV-30671). trx_sys_t::any_active_transactions(): Separately count XA PREPARE transactions. srv_purge_should_exit(): Terminate slow shutdown if the history size does not change and XA PREPARE transactions exist in the system. This will avoid a hang of the test innodb.recovery_shutdown. Tested by: Matthias Leich	2023-05-19 12:19:26 +03:00
Marko Mäkelä	4105017a58	MDEV-30357 Performance regression in locking reads from secondary indexes lock_sec_rec_some_has_impl(): Remove a harmful condition that caused the performance regression and should not have been added in commit `b6e41e3872` in the first place. Locking transactions that have not modified any persistent tables can carry the transaction identifier 0. trx_t::max_inactive_id: A cache for trx_sys_t::find_same_or_older(). The value is not reset on transaction commit so that previous results can be reused for subsequent transactions. The smallest active transaction ID can only increase over time, not decrease. trx_sys_t::find_same_or_older(): Remember the maximum previous id for which rw_trx_hash.iterate() returned false, to avoid redundant iterations. lock_sec_rec_read_check_and_lock(): Add an early return in case we are already holding a covering table lock. lock_rec_convert_impl_to_expl(): Add a template parameter to avoid a redundant run-time check on whether the index is secondary. lock_rec_convert_impl_to_expl_for_trx(): Move some code from lock_rec_convert_impl_to_expl(), to reduce code duplication due to the added template parameter. Reviewed by: Vladislav Lesin Tested by: Matthias Leich	2023-03-16 16:00:45 +02:00
Marko Mäkelä	959ad2f30f	MDEV-29612 ReadViewBase::snapshot() misses an optimization ReadViewBase::snapshot(): In case m_low_limit_no==m_low_limit_id and m_ids would include everything between that and m_up_limit_id, set all fields to m_up_limit_id and clear m_ids, to speed up changes_visible() and append(). rw_trx_hash_t::debug_iterator(): Add an assertion.	2022-10-06 13:14:16 +03:00
Marko Mäkelä	0b47c126e3	MDEV-13542: Crashing on corrupted page is unhelpful The approach to handling corruption that was chosen by Oracle in commit `177d8b0c12` is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table. We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table. This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace. We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted. Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them. buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests. btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure. btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages(). btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get(). FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE. dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls. trx_rseg_t::get(mtr_t): Replaces trx_rsegf_get(). trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed. trx_sys_create_sys_pages(): Merged with trx_sysf_create(). dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function. dir_pathname(): Replaces os_file_make_new_pathname(). row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record. btr_decryption_failed(): Report decryption failures dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove. dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted. dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once. dict_table_is_corrupted(): Remove. dict_index_t::is_btree(): Determine if the index is a valid B-tree. BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove. UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger assertion failures, but error codes being returned. buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict(). fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide. btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED. opt_search_plan_for_table(): Simplify the code. row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist. row_undo_mod_upd_exist_sec(): Simplify the code. row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql(). fut_get_ptr(): Replace with buf_page_get_gen() calls. buf_page_get_gen(): Return nullptr and err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data. buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr. fseg_page_is_allocated(): Replaces fseg_page_is_free(). fts_drop_common_tables(): Return an error if the transaction was rolled back. fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already. fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors. Clean up mtr_t::page_lock() buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch. buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch. mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi(). recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery. recv_recover_page(): Return whether the operation succeeded. recv_sys_t::recover_low(): Simplify the logic. Check for recovery error. Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.	2022-06-06 14:03:22 +03:00
Marko Mäkelä	05d049bdbe	Merge 10.5 into 10.6	2022-05-25 14:39:42 +03:00
Marko Mäkelä	ea40c75c27	Merge 10.4 into 10.5	2022-05-25 14:24:51 +03:00
Marko Mäkelä	99c8aed00d	MDEV-28601 InnoDB history list length was reverted to 32 bits srv_do_purge(): In commit `edde1f6e0d` when the de-facto 32-bit trx_sys_t::history_size() was replaced with 32-bit trx_sys.rseg_history_len, some more variables were changed from ulint (size_t) to uint32_t. The history list length is the number of committed transactions whose undo logs are waiting to be purged. Each TRX_RSEG_HISTORY list is storing the number of entries in a 32-bit field and each transaction will occupy at least one undo log page. It is thinkable that the length of each TRX_RSEG_HISTORY list may approach the maximum representable number. The number cannot be exceeded, because the rollback segment header is allocated from the same tablespace as the undo log header pages it is pointing to, and because the page numbers of a tablespace are stored in 32 bits. In any case, it is possible that the total number of unpurged committed transactions cannot be represented in 32 but 39 bits (corresponding to 128 rollback segments and undo tablespaces).	2022-05-25 14:06:04 +03:00
Marko Mäkelä	b6e41e3872	MDEV-28445 Secondary index locking invokes costly trx_sys.get_min_trx_id() lock_sec_rec_read_check_and_lock(): Remove a redundant check for trx_sys.get_min_trx_id(). This would be checked in lock_sec_rec_some_has_impl() anyway. Also, test the cheapest conditions first. lock_sec_rec_some_has_impl(): Replace trx_sys.get_min_trx_id() with trx_sys.find_same_or_older() that is much easier to evaluate. Inspired by mysql/mysql-server@0a0c0db97e	2022-04-29 12:42:29 +03:00
Marko Mäkelä	2aed566d22	Cleanup: alignas(CPU_LEVEL1_DCACHE_LINESIZE) Let us replace all use of MY_ALIGNED in InnoDB with C++11 alignas. CACHE_LINE_SIZE: Replaced with CPU_LEVEL1_DCACHE_LINESIZE.	2022-04-14 10:40:26 +03:00
Marko Mäkelä	8074ab5784	MDEV-28313: Shrink rw_trx_hash_element_t::mutex The element mutex is unnecessarily large. The PERFORMANCE_SCHEMA instrumentation was not even enabled.	2022-04-14 10:10:44 +03:00
Marko Mäkelä	0cd2e6c614	MDEV-28313: InnoDB transactions are not aligned at cache lines trx_lock_t: Remove byte pad[256] and use alignas(CPU_LEVEL1_DCACHE_LINESIZE) instead. trx_t: Declare n_ref (the first member) aligned at cache line. Pool: Assert that the sizes are multiples of CPU_LEVEL1_DCACHE_LINESIZE, and invoke an aligned allocator.	2022-04-14 10:06:34 +03:00
Marko Mäkelä	aaef2e1d8c	MDEV-27058: Reduce the size of buf_block_t and buf_page_t buf_page_t::frame: Moved from buf_block_t::frame. All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED pages will have frame=nullptr, while all 'fat' buf_block_t will have a non-null frame pointing to aligned innodb_page_size bytes. This eliminates the need for separate states for BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE. buf_page_t:🔒 Moved from buf_block_t::lock. That is, all block descriptors will have a page latch. The IO_PIN state that was used for discarding or creating the uncompressed page frame of a ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix and page X-latch. page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status of buf_page_t with a single std::atomic<uint32_t>. All modifications will use store(), fetch_add(), fetch_sub(). This space was previously wasted to alignment on 64-bit systems. We will use the following encoding that combines a state (partly read-fix or write-fix) and a buffer-fix count: buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED) buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY) buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH) buf_page_t::FREED=3 + fix: pages marked as freed in the file buf_page_t::UNFIXED=1U<<29 + fix: normal pages buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite) buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched) buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched) buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite) buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch. buf_page_t::read_complete(): Renamed from buf_page_read_complete(). Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch. buf_page_t::can_relocate(): If the page latch is being held or waited for, or the block is buffer-fixed or io-fixed, return false. (The condition on the page latch is new.) Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we will acquire the page latch before fix(), and unfix() before unlocking. buf_page_t::flush(): Replaces buf_flush_page(). Optimize the handling of FREED pages. buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held by the caller. buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates. buf_page_get_low(): Ignore guesses that are read-fixed because they may not yet be registered in buf_pool.page_hash and buf_pool.LRU. buf_page_optimistic_get(): Acquire latch before buffer-fixing. buf_page_make_young(): Leave read-fixed blocks alone, because they might not be registered in buf_pool.LRU yet. recv_sys_t::recover_deferred(), recv_sys_t::recover_low(): Possibly fix MDEV-26326, by holding a page X-latch instead of only buffer-fixing the page.	2021-11-18 17:47:19 +02:00
Oleksandr Byelkin	7841a7eb09	Merge branch '10.3' into 10.4	2021-07-31 22:59:58 +02:00
Marko Mäkelä	f50eb0d398	Merge 10.2 into 10.3	2021-07-27 10:47:17 +03:00
Marko Mäkelä	6e12ebd4a7	MDEV-25062: Reduce trx_rseg_t::mutex contention redo_rseg_mutex, noredo_rseg_mutex: Remove the PERFORMANCE_SCHEMA keys. The rollback segment mutex will be uninstrumented. trx_sys_t: Remove pointer indirection for rseg_array, temp_rseg. Align each element to the cache line. trx_sys_t::rseg_id(): Replaces trx_rseg_t::id. trx_rseg_t::ref: Replaces needs_purge, trx_ref_count, skip_allocation in a single std::atomic<uint32_t>. trx_rseg_t::latch: Replaces trx_rseg_t::mutex. trx_rseg_t::history_size: Replaces trx_sys_t::rseg_history_len trx_sys_t::history_size_approx(): Replaces trx_sys.rseg_history_len in those places where the exact count does not matter. We must not acquire any trx_rseg_t::latch while holding index page latches, because normally the trx_rseg_t::latch is acquired before any page latches. trx_sys_t::history_exists(): Replaces trx_sys.rseg_history_len!=0 with an approximation. We remove some unnecessary trx_rseg_t::latch acquisition around trx_undo_set_state_at_prepare() and trx_undo_set_state_at_finish(). Those operations will only access fields that remain constant after trx_rseg_t::init().	2021-06-23 13:42:11 +03:00
Marko Mäkelä	f09d33f521	Merge 10.5 into 10.6	2021-05-18 11:13:45 +03:00
Marko Mäkelä	7b51d11cca	MDEV-25594: Improve debug checks trx_t::will_lock: Changed the type to bool. trx_t::is_autocommit_non_locking(): Replaces trx_is_autocommit_non_locking(). trx_is_ac_nl_ro(): Remove (replaced with equivalent assertion expressions). assert_trx_nonlocking_or_in_list(): Remove. Replaced with at least as strict checks in each place. check_trx_state(): Moved to a static function; partially replaced with individual debug assertions implementing equivalent or stricter checks.	2021-05-18 09:27:59 +03:00
Marko Mäkelä	4930f9c94b	Merge 10.5 into 10.6	2021-04-21 11:45:00 +03:00
Marko Mäkelä	80ed136e6d	Merge 10.4 into 10.5	2021-04-21 09:01:01 +03:00
Marko Mäkelä	a0588d54a2	Merge 10.3 into 10.4	2021-04-21 07:58:42 +03:00
Marko Mäkelä	75c01f39b1	Merge 10.2 into 10.3	2021-04-21 07:25:48 +03:00
Marko Mäkelä	00528a0445	Merge 10.5 into 10.6	2021-03-19 13:35:18 +02:00
Marko Mäkelä	be881ec457	Merge 10.4 into 10.5	2021-03-19 13:09:21 +02:00
Marko Mäkelä	b01d8e1a33	MDEV-20612: Replace lock_sys.mutex with lock_sys.latch For now, we will acquire the lock_sys.latch only in exclusive mode, that is, use it as a mutex. This is preparation for the next commit where we will introduce a less intrusive alternative, combining a shared lock_sys.latch with dict_table_t::lock_mutex or a mutex embedded in lock_sys.rec_hash, lock_sys.prdt_hash, or lock_sys.prdt_page_hash.	2021-02-11 14:52:10 +02:00
Marko Mäkelä	487fbc2e15	MDEV-21452 fixup: Introduce trx_t::mutex_is_owner() When we replaced trx_t::mutex with srw_mutex in commit `38fd7b7d91` we lost the SAFE_MUTEX instrumentation. Let us introduce a replacement and restore the assertions.	2021-02-05 16:37:06 +02:00
Marko Mäkelä	46234f03c8	Merge 10.5 into 10.6	2021-01-25 12:56:30 +02:00
Marko Mäkelä	961c7938bb	Merge 10.4 into 10.5	2021-01-25 12:44:24 +02:00
Marko Mäkelä	3467f63764	Merge 10.3 into 10.4	2021-01-25 11:02:07 +02:00
Marko Mäkelä	0e10d7ea14	MDEV-22351 InnoDB may recover wrong information after RESET MASTER Ever since commit `947efe17ed` InnoDB no longer writes binlog position in one place. It will not at all be written to the TRX_SYS page, and instead it will be written to the undo log header page that changes the transaction state. trx_rseg_mem_restore(): Recover the information from the latest written page.	2021-01-22 16:44:17 +02:00
Marko Mäkelä	ff5d306e29	MDEV-21452: Replace ib_mutex_t with mysql_mutex_t SHOW ENGINE INNODB MUTEX functionality is completely removed, as are the InnoDB latching order checks. We will enforce innodb_fatal_semaphore_wait_threshold only for dict_sys.mutex and lock_sys.mutex. dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex. lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex. FIXME: srv_sys should be removed altogether; it is duplicating tpool functionality. fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must not hold fil_system.mutex. fil_close_all_files(): To prevent SAFE_MUTEX warnings for fil_space_destroy_crypt_data(), we must not hold fil_system.mutex while invoking fil_space_free_low() on a detached tablespace.	2020-12-15 17:56:18 +02:00
Marko Mäkelä	38fd7b7d91	MDEV-21452: Replace all direct use of os_event_t Let us replace os_event_t with mysql_cond_t, and replace the necessary ib_mutex_t with mysql_mutex_t so that they can be used with condition variables. Also, let us replace polling (os_thread_sleep() or timed waits) with plain mysql_cond_wait() wherever possible. Furthermore, we will use the lightweight srw_mutex for trx_t::mutex, to hopefully reduce contention on lock_sys.mutex. FIXME: Add test coverage of mariabackup --backup --kill-long-queries-timeout	2020-12-15 17:56:17 +02:00
Marko Mäkelä	ac028ec5d8	MDEV-24142: Remove the LatchDebug interface to rw-locks The latching order checks for rw-locks have not caught many bugs in the past few years and they are greatly complicating the code. Last time the debug checks were useful was in commit `59caf2c3c1` (MDEV-13485). The B-tree hang MDEV-14637 was not caught by LatchDebug, because the granularity of the checks is not sufficient to distinguish the levels of non-leaf B-tree pages. The interface was already made dead code by the grandparent commit `03ca6495df`.	2020-12-03 15:27:50 +02:00
Marko Mäkelä	814bc21305	Cleanup: Use Atomic_relaxed for trx_t::state For reading trx_t::state we can avoid acquiring trx_t::mutex. Atomic load and store should be similar to normal load and store on most instruction set architectures. The atomicity of the operation would merely prohibit the compiler from reordering some operations.	2020-11-24 15:44:55 +02:00
Marko Mäkelä	7b2bb67113	Merge 10.3 into 10.4	2020-10-29 13:38:38 +02:00
Marko Mäkelä	a8de8f261d	Merge 10.2 into 10.3	2020-10-28 10:01:50 +02:00
Marko Mäkelä	45ed9dd957	MDEV-23855: Remove fil_system.LRU and reduce fil_system.mutex contention Also fixes MDEV-23929: innodb_flush_neighbors is not being ignored for system tablespace on SSD When the maximum configured number of file is exceeded, InnoDB will close data files. We used to maintain a fil_system.LRU list and a counter fil_node_t::n_pending to achieve this, at the huge cost of multiple fil_system.mutex operations per I/O operation. fil_node_open_file_low(): Implement a FIFO replacement policy: The last opened file will be moved to the end of fil_system.space_list, and files will be closed from the start of the list. However, we will not move tablespaces in fil_system.space_list while i_s_tablespaces_encryption_fill_table() is executing (producing output for INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION) because it may cause information of some tablespaces to go missing. We also avoid this in mariabackup --backup because datafiles_iter_next() assumes that the ordering is not changed. IORequest: Fold more parameters to IORequest::type. fil_space_t::io(): Replaces fil_io(). fil_space_t::flush(): Replaces fil_flush(). OS_AIO_IBUF: Remove. We will always issue synchronous reads of the change buffer pages in buf_read_page_low(). We will always ignore some errors for background reads. This should reduce fil_system.mutex contention a little. fil_node_t::complete_write(): Replaces fil_node_t::complete_io(). On both read and write completion, fil_space_t::release_for_io() will have to be called. fil_space_t::io(): Do not acquire fil_system.mutex in the normal code path. xb_delta_open_matching_space(): Do not try to open the system tablespace which was already opened. This fixes a file sharing violation in mariabackup --prepare --incremental. Reviewed by: Vladislav Vaintroub	2020-10-26 17:09:01 +02:00
Marko Mäkelä	7cffb5f6e8	MDEV-23399: Performance regression with write workloads The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted the performance bottleneck to the page flushing. The configuration parameters will be changed as follows: innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction) innodb_lru_scan_depth=1536 (old: 1024) innodb_max_dirty_pages_pct=90 (old: 75) innodb_max_dirty_pages_pct_lwm=75 (old: 0) Note: The parameter innodb_lru_scan_depth will only affect LRU eviction of buffer pool pages when a new page is being allocated. The page cleaner thread will no longer evict any pages. It used to guarantee that some pages will remain free in the buffer pool. Now, we perform that eviction 'on demand' in buf_LRU_get_free_block(). The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows: * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks() * As a buf_pool.free limit in buf_LRU_list_batch() for terminating the flushing that is initiated e.g., by buf_LRU_get_free_block() The parameter also used to serve as an initial limit for unzip_LRU eviction (evicting uncompressed page frames while retaining ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit of 100 or unlimited for invoking buf_LRU_scan_and_free_block(). The status variables will be changed as follows: innodb_buffer_pool_pages_flushed: This includes also the count of innodb_buffer_pool_pages_LRU_flushed and should work reliably, updated one by one in buf_flush_page() to give more real-time statistics. The function buf_flush_stats(), which we are removing, was not called in every code path. For both counters, we will use regular variables that are incremented in a critical section of buf_pool.mutex. Note that show_innodb_vars() directly links to the variables, and reads of the counters will not be protected by buf_pool.mutex, so you cannot get a consistent snapshot of both variables. The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed, because the page cleaner no longer deals with writing or evicting least recently used pages, and because the single-page writes have been removed: * buffer_LRU_batch_flush_avg_time_slot * buffer_LRU_batch_flush_avg_time_thread * buffer_LRU_batch_flush_avg_time_est * buffer_LRU_batch_flush_avg_pass * buffer_LRU_single_flush_scanned * buffer_LRU_single_flush_num_scan * buffer_LRU_single_flush_scanned_per_call When moving to a single buffer pool instance in MDEV-15058, we missed some opportunity to simplify the buf_flush_page_cleaner thread. It was unnecessarily using a mutex and some complex data structures, even though we always have a single page cleaner thread. Furthermore, the buf_flush_page_cleaner thread had separate 'recovery' and 'shutdown' modes where it was waiting to be triggered by some other thread, adding unnecessary latency and potential for hangs in relatively rarely executed startup or shutdown code. The page cleaner was also running two kinds of batches in an interleaved fashion: "LRU flush" (writing out some least recently used pages and evicting them on write completion) and the normal batches that aim to increase the MIN(oldest_modification) in the buffer pool, to help the log checkpoint advance. The buf_pool.flush_list flushing was being blocked by buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN of a page is ahead of log_sys.get_flushed_lsn(), that is, what has been persistently written to the redo log, we would trigger a log flush and then resume the page flushing. This would unnecessarily limit the performance of the page cleaner thread and trigger the infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms. The settings might not be optimal" that were suppressed in commit `d1ab89037a` unless log_warnings>2. Our revised algorithm will make log_sys.get_flushed_lsn() advance at the start of buf_flush_lists(), and then execute a 'best effort' to write out all pages. The flush batches will skip pages that were modified since the log was written, or are are currently exclusively locked. The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message will be removed, because by design, the buf_flush_page_cleaner() should not be blocked during a batch for extended periods of time. We will remove the single-page flushing altogether. Related to this, the debug parameter innodb_doublewrite_batch_size will be removed, because all of the doublewrite buffer will be used for flushing batches. If a page needs to be evicted from the buffer pool and all 100 least recently used pages in the buffer pool have unflushed changes, buf_LRU_get_free_block() will execute buf_flush_lists() to write out and evict innodb_lru_flush_size pages. At most one thread will execute buf_flush_lists() in buf_LRU_get_free_block(); other threads will wait for that LRU flushing batch to finish. To improve concurrency, we will replace the InnoDB ib_mutex_t and os_event_t native mutexes and condition variables in this area of code. Most notably, this means that the buffer pool mutex (buf_pool.mutex) is no longer instrumented via any InnoDB interfaces. It will continue to be instrumented via PERFORMANCE_SCHEMA. For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical sections of buf_pool.flush_list_mutex should be shorter than those for buf_pool.mutex, because in the worst case, they cover a linear scan of buf_pool.flush_list, while the worst case of a critical section of buf_pool.mutex covers a linear scan of the potentially much longer buf_pool.LRU list. mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable with SAFE_MUTEX. Some InnoDB debug assertions need this predicate instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner(). buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list: Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[]. The number of active flush operations. buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA and SAFE_MUTEX instrumentation. buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU. buf_pool_t::done_flush_list: Condition variable for !n_flush_list. buf_pool_t::do_flush_list: Condition variable to wake up the buf_flush_page_cleaner when a log checkpoint needs to be written or the server is being shut down. Replaces buf_flush_event. We will keep using timed waits (the page cleaner thread will wake _at least_ once per second), because the calculations for innodb_adaptive_flushing depend on fixed time intervals. buf_dblwr: Allocate statically, and move all code to member functions. Use a native mutex and condition variable. Remove code to deal with single-page flushing. buf_dblwr_check_block(): Make the check debug-only. We were spending a significant amount of execution time in page_simple_validate_new(). flush_counters_t::unzip_LRU_evicted: Remove. IORequest: Make more members const. FIXME: m_fil_node should be removed. buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex (which we are removing). page_cleaner_slot_t, page_cleaner_t: Remove many redundant members. pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot(). recv_writer_thread: Remove. Recovery works just fine without it, if we simply invoke buf_flush_sync() at the end of each batch in recv_sys_t::apply(). recv_recovery_from_checkpoint_finish(): Remove. We can simply call recv_sys.debug_free() directly. srv_started_redo: Replaces srv_start_state. SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown() can communicate with the normal page cleaner loop via the new function flush_buffer_pool(). buf_flush_remove(): Assert that the calling thread is holding buf_pool.flush_list_mutex. This removes unnecessary mutex operations from buf_flush_remove_pages() and buf_flush_dirty_pages(), which replace buf_LRU_flush_or_remove_pages(). buf_flush_lists(): Renamed from buf_flush_batch(), with simplified interface. Return the number of flushed pages. Clarified comments and renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this function, which was their only caller, and remove 2 unnecessary buf_pool.mutex release/re-acquisition that we used to perform around the buf_flush_batch() call. At the start, if not all log has been durably written, wait for a background task to do it, or start a new task to do it. This allows the log write to run concurrently with our page flushing batch. Any pages that were skipped due to too recent FIL_PAGE_LSN or due to them being latched by a writer should be flushed during the next batch, unless there are further modifications to those pages. It is possible that a page that we must flush due to small oldest_modification also carries a recent FIL_PAGE_LSN or is being constantly modified. In the worst case, all writers would then end up waiting in log_free_check() to allow the flushing and the checkpoint to complete. buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_flush_space(): Auxiliary function to look up a tablespace for page flushing. buf_flush_page(): Defer the computation of space->full_crc32(). Never call log_write_up_to(), but instead skip persistent pages whose latest modification (FIL_PAGE_LSN) is newer than the redo log. Also skip pages on which we cannot acquire a shared latch without waiting. buf_flush_try_neighbors(): Do not bother checking buf_fix_count because buf_flush_page() will no longer wait for the page latch. Take the tablespace as a parameter, and only execute this function when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold(). buf_flush_relocate_on_flush_list(): Declare as cold, and push down a condition from the callers. buf_flush_check_neighbor(): Take id.fold() as a parameter. buf_flush_sync(): Ensure that the buf_pool.flush_list is empty, because the flushing batch will skip pages whose modifications have not yet been written to the log or were latched for modification. buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables. buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize the counters, and report n->evicted. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_do_LRU_batch(): Return the number of pages flushed. buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if adaptive hash index entries are pointing to the block. buf_LRU_get_free_block(): Do not wake up the page cleaner, because it will no longer perform any useful work for us, and we do not want it to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0) writes out and evicts at most innodb_lru_flush_size pages. (The function buf_do_LRU_batch() may complete after writing fewer pages if more than innodb_lru_scan_depth pages end up in buf_pool.free list.) Eliminate some mutex release-acquire cycles, and wait for the LRU flush batch to complete before rescanning. buf_LRU_check_size_of_non_data_objects(): Simplify the code. buf_page_write_complete(): Remove the parameter evict, and always evict pages that were part of an LRU flush. buf_page_create(): Take a pre-allocated page as a parameter. buf_pool_t::free_block(): Free a pre-allocated block. recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block while not holding recv_sys.mutex. During page allocation, we may initiate a page flush, which in turn may initiate a log flush, which would require acquiring log_sys.mutex, which should always be acquired before recv_sys.mutex in order to avoid deadlocks. Therefore, we must not be holding recv_sys.mutex while allocating a buffer pool block. BtrBulk::logFreeCheck(): Skip a redundant condition. row_undo_step(): Do not invoke srv_inc_activity_count() for every row that is being rolled back. It should suffice to invoke the function in trx_flush_log_if_needed() during trx_t::commit_in_memory() when the rollback completes. sync_check_enable(): Remove. We will enable innodb_sync_debug from the very beginning. Reviewed by: Vladislav Vaintroub	2020-10-15 17:04:56 +03:00
Marko Mäkelä	0b4ed0b7fb	Merge 10.4 into 10.5	2020-08-21 20:32:04 +03:00
Marko Mäkelä	aa6cb7ed03	Merge 10.3 into 10.4	2020-08-21 19:57:22 +03:00
Marko Mäkelä	c277bcd591	Merge 10.2 into 10.3	2020-08-21 19:18:34 +03:00
Eugene Kosov	e9c389c334	MDEV-22701 InnoDB: encapsulate trx_sys.mutex and trx_sys.trx_list into a separate class thread_safe_trx_ilist_t: almost generic one UT_LIST was replaced with ilist<t> innobase_kill_query: wrong comment removed.	2020-06-23 19:11:57 +03:00
Sergey Vojtovich	ccdfcedf10	MDEV-22693 - InnoDB: get rid of casts for rw_trx_hash iterator Less error prone, stricter type control.	2020-05-30 10:24:11 +04:00
Sergey Vojtovich	03dcdad251	MDEV-22697 - InnoDB: remove trx->no trx->no is duplicated by trx->rw_trx_hash_element->no for no good reason. The latter is preferred for performance reasons: allows to avoid taking trx->rw_trx_hash_element->mutex when snapshotting.	2020-05-26 17:11:31 +04:00
Sergey Vojtovich	50b0ce44a2	MDEV-22593 - InnoDB: don't take trx_sys.mutex in ReadView::open() This was the last abuse of trx_sys.mutex, which is now exclusively protecting trx_sys.trx_list. This global acquisition was also potential scalability bottleneck for oltp_read_write benchmark. Although it didn't expose itself as such due to larger scalability issues. Replaced trx_sys.mutex based synchronisation between ReadView creator thread and purge coordinator thread performing latest view clone with ReadView::m_mutex. It also allowed to simplify tri-state view m_state down to boolean m_open flag. For performance reasons trx->read_view.close() is left as atomic relaxed store, so that we don't have to waste resources for otherwise meaningless mutex acquisition.	2020-05-26 17:11:20 +04:00
Marko Mäkelä	3826178da8	Fix the Windows non-debug build warning C4390: ';': empty controlled statement found; is this the intent?	2019-11-28 19:53:00 +02:00
Marko Mäkelä	4beace3316	MDEV-21171 InnoDB is unnecessarily resetting FIL_PAGE_TYPE for full_crc32 files Before commit `c0f47a4a58` (MDEV-12026) introduced the innodb_checksum_algorithm=full_crc32 format, it was impossible to tell if InnoDB data files contained garbage in the FIL_PAGE_TYPE header field (and possibly other fields). This is because before commit `3926673ce7` in MySQL 5.1.48, InnoDB would write uninitialized data to some fields, and because there was no way to tell with which InnoDB version a data file was created. If fil_space_t::full_crc32() holds, the data file cannot contain uninitialized garbage or invalid FIL_PAGE_TYPE, and thus fil_block_check_type() should not be invoked to correct anything.	2019-11-28 16:22:53 +02:00
Marko Mäkelä	2b7aa60b7e	Use constexpr for constants on data pages	2019-11-13 14:34:52 +02:00
Sergei Golubchik	244f0e6dd8	Merge branch '10.3' into 10.4	2019-09-06 11:53:10 +02:00
Marko Mäkelä	2c9e75ccfe	MDEV-15326 after-merge fixes trx_t::is_recovered: Revert most of the changes that were made by the merge of MDEV-15326 from 10.2. The trx_sys.rw_trx_hash and the recovery of transactions at startup is quite different in 10.3. trx_free_at_shutdown(): Avoid excessive mutex protection. Reading fields that can only be modified by the current thread (owning the transaction) can be done outside mutex. trx_t::commit_state(): Restore a tighter assertion. trx_rollback_recovered(): Clarify why there is no potential race condition with other transactions. lock_trx_release_locks(): Merge with trx_t::release_locks(), and avoid holding lock_sys.mutex unnecessarily long. rw_trx_hash_t::find(): Remove redundant code, and avoid starving the committer by checking trx_t::state before trx_t::reference().	2019-09-05 15:58:31 +03:00

1 2 3 4

156 commits