mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-17 20:42:30 +01:00

Author	SHA1	Message	Date
Marko Mäkelä	851c56771e	Merge 10.5 into 10.6	2023-01-23 13:15:41 +02:00
Thirunarayanan Balathandayuthapani	647a7232ff	MDEV-30438 innodb.undo_truncate,4k fails when innodb-immediate-scrub-data-uncompressed is enabled - InnoDB fails to clear the freed ranges during truncation of innodb undo log tablespace. During shutdown, InnoDB flushes the freed page ranges and throws the out of bound error. mtr_t::commit_shrink(): clear the freed ranges while doing undo tablespace truncation	2023-01-23 09:55:49 +05:30
Marko Mäkelä	f9cac8d2cb	MDEV-30400 Assertion height == btr_page_get_level(...) on INSERT This also fixes part of MDEV-29835 Partial server freeze which is caused by violations of the latching order that was defined in https://dev.mysql.com/worklog/task/?id=6326 (WL#6326: InnoDB: fix index->lock contention). Unless the current thread is holding an exclusive dict_index_t::lock, it must acquire page latches in a strict parent-to-child, left-to-right order. Not all cases are fixed yet. Failure to follow the correct latching order will cause deadlocks of threads due to lock order inversion. As part of these changes, the BTR_MODIFY_TREE mode is modified so that an Update latch (U a.k.a. SX) will be acquired on the root page, and eXclusive latches (X) will be acquired on all pages leading to the leaf page, as well as any left and right siblings of the pages along the path. The test innodb.innodb_wl6326 will be removed, because at the time the DEBUG_SYNC point is hit, the thread is actually holding several page latches that will be blocking a concurrent SELECT statement. We also remove double bookkeeping that was caused due to excessive information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo store information of latched pages, and ensure that mtr_memo_slot_t::object is never a null pointer. The tree_blocks[] and tree_savepoints[] were redundant. mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo. This avoids many redundant entries in mtr_t::m_memo, as well as redundant calls to buf_page_get_gen() for blocks that had already been looked up in a mini-transaction. btr_get_latched_root(): Return a pointer to an already latched root page. This replaces btr_root_block_get() in cases where the mini-transaction has already latched the root page. btr_page_get_parent(): Fetch a parent page that was already latched in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched(). If needed, upgrade the root page U latch to X. This avoids bloating mtr_t::m_memo as well as redundant buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for B-tree defragmentation, we will invoke btr_cur_search_to_nth_level(). btr_cur_search_to_nth_level(): This will only be used for non-leaf (level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be removed altogether, or retained for the case of CHECK TABLE without QUICK. btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level() for searches to level=0 (the leaf level). btr_cur_t::pessimistic_search_leaf(): Implement the new BTR_MODIFY_TREE latching logic in the case that page splits or merges will be needed. The parent pages (and their siblings) should already be latched on the first dive to the leaf and be present in mtr_t::m_memo; there should be no need for BTR_CONT_MODIFY_TREE. This pre-latching almost suffices; MDEV-29835 will have to revise it and remove work-arounds where mtr_t::get_already_latched() fails to find a block. rtr_search_to_nth_level(): A SPATIAL INDEX version of btr_search_to_nth_level() that can search to any level (including the leaf level). rtr_search_leaf(), rtr_insert_leaf(): Wrappers for rtr_search_to_nth_level(). rtr_search(): Replaces rtr_pcur_open(). rtr_cur_restore_position(): Remove an unused constant parameter. btr_pcur_open_on_user_rec(): Remove the constant parameter mode=PAGE_CUR_GE. btr_cur_latch_leaves(): Update a pre-existing mtr_t::m_memo entry for the current leaf page. row_ins_clust_index_entry_low(): Use a new mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC. btr_cur_t::open_leaf(): Some clean-up. mtr_t::lock_register(): Register a page latch on a buffer-fixed block. BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove. BTR_CONT_MODIFY_TREE: Note that this is only used by rtr_search_to_nth_level(). btr_pcur_optimistic_latch_leaves(): Replaces btr_cur_optimistic_latch_leaves(). ibuf_delete_rec(): Acquire ibuf.index->lock.u_lock() in order to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV). Tested by: Matthias Leich	2023-01-19 17:19:18 +02:00
Marko Mäkelä	67dc8af2a7	MDEV-30289: Implement small_vector for mtr_t::m_memo To avoid heap memory allocation overhead for mtr_t::m_memo, we will allocate a small number of elements statically in mtr_t::m_memo::small. Only if that preallocated data is insufficient, we will invoke my_alloc() or my_realloc() for more storage. The implementation of the data structure is inspired by llvm::SmallVector.	2023-01-19 16:10:29 +02:00
Marko Mäkelä	7fa5cce305	MDEV-30289: Remove the pointer indirection for mtr_t::m_memo	2023-01-19 16:10:18 +02:00
Marko Mäkelä	a8a5c8a1b8	Merge 10.5 into 10.6	2022-12-13 16:58:58 +02:00
Marko Mäkelä	1862273c43	MDEV-30209 Race condition between fil_node_open_file() and renaming files mtr_t::commit_file(): Protect the rename operation with fil_system.mutex like we used to do before commit `2e43af69e3` in order to prevent fil_node_open_file() from running concurrently. In other words, fil_system.mutex will protect the consistency of fil_node_t::name and the file name in the file system. This race condition should be very hard to trigger. We would need a low value of innodb_open_files or table_cache limit so that fil_space_t::try_to_close() will be invoked frequently. Simultaneously with a RENAME operation, something (such as a write of a data page) would have to try to open the file.	2022-12-12 11:41:12 +02:00
Marko Mäkelä	05cd10b74c	MDEV-29603 fixup: GCC -Wunused-but-set-variable	2022-11-24 16:46:34 +02:00
Marko Mäkelä	24fe53477c	MDEV-29603 btr_cur_open_at_index_side() is missing some consistency checks btr_cur_t: Zero-initialize all fields in the default constructor. btr_cur_t::index: Remove; it duplicated page_cur.index. Many functions: Remove arguments that were duplicating page_cur_t::index and page_cur_t::block. page_cur_open_level(), btr_pcur_open_level(): Replaces btr_cur_open_at_index_side() for dict_stats_analyze_index(). At the end, release all latches except the dict_index_t::lock and the buf_page_t::lock on the requested page. dict_stats_analyze_index(): Rely on mtr_t::rollback_to_savepoint() to release all uninteresting page latches. btr_search_guess_on_hash(): Simplify the logic, and invoke mtr_t::rollback_to_savepoint(). We will use plain C++ std::vector<mtr_memo_slot_t> for mtr_t::m_memo. In this way, we can avoid setting mtr_memo_slot_t::object to nullptr and instead just remove garbage from m_memo. mtr_t::rollback_to_savepoint(): Shrink the vector. We will be needing this in dict_stats_analyze_index(), where we will release page latches and only retain the index->lock in mtr_t::m_memo. mtr_t::release_last_page(): Release the last acquired page latch. Replaces btr_leaf_page_release(). mtr_t::release(const buf_block_t&): Release a single page latch. Used in btr_pcur_move_backward_from_page(). mtr_t::memo_release(): Replaced with mtr_t::release(). mtr_t::upgrade_buffer_fix(): Acquire a latch for a buffer-fixed page. This replaces the double bookkeeping in btr_cur_t::open_leaf(). Reviewed by: Vladislav Lesin	2022-11-17 08:19:01 +02:00
Marko Mäkelä	ae6ebafd81	Merge 10.5 into 10.6	2022-11-14 15:44:55 +02:00
Marko Mäkelä	e0e096faaa	MDEV-29982 Improve the InnoDB log overwrite error message The InnoDB write-ahead log ib_logfile0 is of fixed size, specified by innodb_log_file_size. If the tail of the log manages to overwrite the head (latest checkpoint) of the log, crash recovery will be broken. Let us clarify the messages about this, including adding a message on the completion of a log checkpoint that notes that the dangerous situation is over. To reproduce the dangerous scenario, we will introduce the debug injection label ib_log_checkpoint_avoid_hard, which will avoid log checkpoints even harder than the previous ib_log_checkpoint_avoid. log_t::overwrite_warned: The first known dangerous log sequence number. Set in log_close() and cleared in log_write_checkpoint_info(), which will output a "Crash recovery was broken" message.	2022-11-14 12:18:03 +02:00
Marko Mäkelä	da21f3f428	Merge 10.5 into 10.6	2022-11-10 17:30:15 +02:00
Marko Mäkelä	7ee612c912	MDEV-21174 fixup: Remove mtr_t::release_page() mtr_t::release_page(): Remove. The function became unused in commit `56f6dab1d0` when the call was replaced with a call to mtr_t::memo_release().	2022-11-10 12:50:44 +02:00
Marko Mäkelä	0fbcb0a2b8	MDEV-29383 Assertion mysql_mutex_assert_owner(&log_sys.flush_order_mutex) failed in mtr_t::commit() In commit `0b47c126e3` (MDEV-13542) a few calls to mtr_t::memo_push() were moved before a write latch on the page was acquired. This introduced a race condition: 1. is_block_dirtied() returned false to mtr_t::memo_push() 2. buf_page_t::write_complete() was executed, the block marked clean, and a page latch released 3. The page latch was acquired by the caller of mtr_t::memo_push(), and mtr_t::m_made_dirty was not set even though the block is in a clean state. The impact of this race condition is that crash recovery and backups may fail. btr_cur_latch_leaves(), btr_store_big_rec_extern_fields(), btr_free_externally_stored_field(), trx_purge_free_segment(): Acquire the page latch before invoking mtr_t::memo_push(). This fixes the regression caused by MDEV-13542. Side note: It would suffice to set mtr_t::m_made_dirty at the time we set the MTR_MEMO_MODIFY flag for a block. Currently that flag is unnecessarily set if a mini-transaction acquires a page latch on a page that is in a clean state, and will not actually modify the block. This may cause unnecessary acquisitions of log_sys.flush_order_mutex on mtr_t::commit(). mtr_t::free(): If the block had been exclusively latched in this mini-transaction, set the m_made_dirty flag so that the flush order mutex will be acquired during mtr_t::commit(). This should have been part of commit `4179f93d28` (MDEV-18976). It was necessary to change mtr_t::free() so that WriteOPT_PAGE_CHECKSUM::operator() would be able to avoid writing checksums for freed pages.	2022-08-26 11:41:43 +03:00
Marko Mäkelä	76bb671e42	Merge 10.5 into 10.6	2022-08-25 16:02:44 +03:00
Marko Mäkelä	01f9c81237	MDEV-29336: Potential deadlock in btr_page_alloc_low() with the AHI The index root page contains the fields BTR_SEG_TOP and BTR_SEG_LEAF which keep track of allocated pages in the index tree. These fields are normally protected by an Update latch, so that concurrent read access to other parts of the page will be possible. When the index root page is already exclusively latched in the mini-transaction, we must not try to acquire a lower-grade Update latch. In fact, when the root page is already X or U latched in the mini-transaction, there is no point to acquire another latch. Moreover, after a U latch was acquired on top of an X-latch, mtr_t::defer_drop_ahi() would trigger an assertion failure or lock corruption in block->page.lock.u_x_upgrade() because X locks already exist on the block. This problem may have been introduced in commit `03ca6495df` (MDEV-24142). btr_page_alloc_low(), btr_page_free(): Initially buffer-fix the root page. If it is already U or X latched, release the buffer-fix. Else, upgrade the buffer-fix to a U latch. mtr_t::u_lock_register(): Upgrade a buffer-fix to U latch. mtr_t::have_u_or_x_latch(): Check if U or X latches are already registered in the mini-transaction.	2022-08-23 08:47:49 +03:00
Marko Mäkelä	558f1eff64	MDEV-29115 mariabackup.mdev-14447 started failing in a new way The test mariabackup.mdev-14447 is inserting relatively much data while concurrently backing up the data. The test often fails on CI systems, possibly due to an inherent race condition between the producer (server) and consumer (backup) that would be solved if the backup was being produced by the server (MDEV-14992). The written data volume was increased somewhat by commit `4179f93d28` (MDEV-18976). Let us trim the log volume by not writing PAGE_CHECKSUM records or row-level undo log records, and make the test cover what was intended to cover by creating the table in the system tablespace.	2022-08-04 09:30:53 +03:00
Marko Mäkelä	2e43af69e3	MDEV-28870 InnoDB: Missing FILE_CREATE, FILE_DELETE or FILE_MODIFY before FILE_CHECKPOINT There was a race condition between log_checkpoint_low() and deleting or renaming data files. The scenario is as follows: 1. The buffer pool does not contain dirty pages. 2. A FILE_DELETE or FILE_RENAME record is written. 3. The checkpoint LSN will be moved ahead of the write of the record. 4. The server is killed before the file is actually renamed or deleted. We will prevent this race condition by ensuring that a log checkpoint cannot occur between the durable write and the file system operation: 1. Durably write the FILE_DELETE or FILE_RENAME record. 2. Perform the file system operation. 3. Allow any log checkpoint to proceed. mtr_t::commit_file(): Implement the DELETE or RENAME logic. fil_delete_tablespace(): Delegate some of the logic to mtr_t::commit_file(). fil_space_t::rename(): Delegate some logic to mtr_t::commit_file(). Remove the debug injection point fil_rename_tablespace_failure_2 because we do test RENAME failures without any debug injection. fil_name_write_rename_low(), fil_name_write_rename(): Remove. Tested by Matthias Leich	2022-06-21 16:59:21 +03:00
Marko Mäkelä	4179f93d28	MDEV-18976 Implement OPT_PAGE_CHECKSUM log record for improved validation We will introduce an optional log record OPT_PAGE_CHECKSUM for recording page checksums, so that more inconsistencies on crash recovery may be caught. mtr_t::page_checksum(const buf_page_t&): Write OPT_PAGE_CHECKSUM (currently not for ROW_FORMAT=COMPRESSED pages). mtr_t::do_write(): Write OPT_PAGE_CHECKSUM records for all pages (currently, in debug builds only). mtr_t::is_logged(): Return whether log should be written. mtr_t::set_log_mode_sub(const mtr_t&): Set the logging mode of a sub-minitransaction when another mini-transaction is holding latches on some modified pages. When creating or freeing BLOB pages, we may only write OPT_PAGE_CHECKSUM records in the main mini-transaction, after all changes have been written to the log. MTR_LOG_SUB: Log mode for a sub-mini-transaction. mtr_t::free(): Define non-inline, and invoke MarkFreed. MarkFreed: For any matching page in the mini-transaction log, change the first entry to say MTR_MEMO_PAGE_X_MODIFY and any subsequent entries to MTR_MEMO_PAGE_X_FIX. FindModified: Simplify a condition. MTR_MEMO_MODIFY can only be set if MTR_MEMO_PAGE_X_FIX or MTR_MEMO_PAGE_SX_FIX are set. FindBlockX: Consider also MTR_MEMO_PAGE_X_MODIFY. recv_sys_t::parse(): Store OPT_PAGE_CHECKSUM records. log_phys_t::apply(): Validate OPT_PAGE_CHECKSUM records. log_phys_t::page_checksum(): Validate an OPT_PAGE_CHECKSUM record. Tested by: Matthias Leich	2022-06-06 14:05:01 +03:00
Marko Mäkelä	0b47c126e3	MDEV-13542: Crashing on corrupted page is unhelpful The approach to handling corruption that was chosen by Oracle in commit `177d8b0c12` is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table. We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table. This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace. We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted. Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them. buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests. btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure. btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages(). btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get(). FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE. dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls. trx_rseg_t::get(mtr_t): Replaces trx_rsegf_get(). trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed. trx_sys_create_sys_pages(): Merged with trx_sysf_create(). dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function. dir_pathname(): Replaces os_file_make_new_pathname(). row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record. btr_decryption_failed(): Report decryption failures dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove. dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted. dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once. dict_table_is_corrupted(): Remove. dict_index_t::is_btree(): Determine if the index is a valid B-tree. BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove. UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger assertion failures, but error codes being returned. buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict(). fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide. btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED. opt_search_plan_for_table(): Simplify the code. row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist. row_undo_mod_upd_exist_sec(): Simplify the code. row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql(). fut_get_ptr(): Replace with buf_page_get_gen() calls. buf_page_get_gen(): Return nullptr and err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data. buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr. fseg_page_is_allocated(): Replaces fseg_page_is_free(). fts_drop_common_tables(): Return an error if the transaction was rolled back. fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already. fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors. Clean up mtr_t::page_lock() buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch. buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch. mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi(). recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery. recv_recover_page(): Return whether the operation succeeded. recv_sys_t::recover_low(): Simplify the logic. Check for recovery error. Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.	2022-06-06 14:03:22 +03:00
Marko Mäkelä	2f8d0af883	Merge 10.5 into 10.6	2022-06-02 17:39:13 +03:00
Marko Mäkelä	22f935d6da	MDEV-28731 Race condition on log checkpoint mtr_t::modify(): Set the m_made_dirty flag if needed, so that buf_pool_t::insert_into_flush_list() will be invoked while holding log_sys.flush_order_mutex. This is something that was should have been part of commit `b212f1dac2` (MDEV-22107).	2022-06-02 17:33:03 +03:00
Marko Mäkelä	5294695ebd	Clean up mtr_t mtr_t::is_empty(): Replaces mtr_t::get_log() and mtr_t::get_memo(). mtr_t::get_log_size(): Replaces mtr_t::get_log(). mtr_t::print(): Remove, unused function. ReleaseBlocks::ReleaseBlocks(): Remove an unused parameter.	2022-06-02 17:18:00 +03:00
Marko Mäkelä	b242c3141f	Merge 10.5 into 10.6	2022-03-29 16:16:21 +03:00
Vlad Lesin	dcb2968f90	MDEV-27557 InnoDB unnecessarily commits mtr during secondary index search to preserve clustered index latching order New function to release latches till savepoint was added in mtr_t. As there is no longer need to limit MDEV-20605 fix usage for locking reads only, the limitation is removed.	2022-03-25 09:22:36 +03:00
Marko Mäkelä	cce994057b	Merge 10.5 into 10.6	2022-02-09 15:49:50 +02:00
Marko Mäkelä	fd101daa84	MDEV-27716 mtr_t::commit() acquires log_sys.mutex when writing no log mtr_t::is_block_dirtied(), mtr_t::memo_push(): Never set m_made_dirty for pages of the temporary tablespace. Ever since commit `5eb539555b` we never add those pages to buf_pool.flush_list. mtr_t::commit(): Implement part of mtr_t::prepare_write() here, and avoid acquiring log_sys.mutex if no log is written. During IMPORT TABLESPACE fixup, we do not write log, but we must add pages to buf_pool.flush_list and for that, be prepared to acquire log_sys.flush_order_mutex. mtr_t::do_write(): Replaces mtr_t::prepare_write().	2022-02-09 15:10:10 +02:00
Marko Mäkelä	0261eac57f	Merge 10.5 into 10.6	2022-01-12 12:33:19 +02:00
Marko Mäkelä	017d1b867b	MDEV-27476 heap-use-after-free in buf_pool_t::is_block_field() mtr_t::modify(): Remove a debug assertion that had been added in commit `05fa4558e0` (MDEV-22110). The function buf_pool_t::is_uncompressed() is only safe to invoke while holding a buf_pool.page_hash latch so that buf_pool_t::resize() cannot concurrently invoke free() on any chunks.	2022-01-12 12:29:16 +02:00
Marko Mäkelä	9436c778c3	MDEV-27058 fixup: Fix MemorySanitizer, and GCC 4.8.5 ICE on ARMv8 buf_LRU_scan_and_free_block(): It turns out that even with -fno-expensive-optimizations, GCC 4.8.5 may fail to split an instruction. For the non-embedded server, -O1 would fail and -Og would seem to work, while the embedded server build seems to require -O0. buf_block_init(): Correct the MemorySanitizer instrumentation. buf_page_get_low(): Do not read dirty data from read-fixed blocks. These data races were identified by MemorySanitizer. If a read-fixed block is being accessed, we must acquire and release a page latch, so that the read-fix (and the exclusive page latch) will be released and it will be safe to read the page frame contents if needed, even before acquiring the final page latch. We do that in buf_read_ahead_linear() and for the allow_ibuf_merge check. mtr_t::page_lock(): Assert that the block is not read-fixed.	2021-11-20 13:57:23 +02:00
Marko Mäkelä	aaef2e1d8c	MDEV-27058: Reduce the size of buf_block_t and buf_page_t buf_page_t::frame: Moved from buf_block_t::frame. All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED pages will have frame=nullptr, while all 'fat' buf_block_t will have a non-null frame pointing to aligned innodb_page_size bytes. This eliminates the need for separate states for BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE. buf_page_t:🔒 Moved from buf_block_t::lock. That is, all block descriptors will have a page latch. The IO_PIN state that was used for discarding or creating the uncompressed page frame of a ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix and page X-latch. page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status of buf_page_t with a single std::atomic<uint32_t>. All modifications will use store(), fetch_add(), fetch_sub(). This space was previously wasted to alignment on 64-bit systems. We will use the following encoding that combines a state (partly read-fix or write-fix) and a buffer-fix count: buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED) buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY) buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH) buf_page_t::FREED=3 + fix: pages marked as freed in the file buf_page_t::UNFIXED=1U<<29 + fix: normal pages buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite) buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched) buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched) buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite) buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch. buf_page_t::read_complete(): Renamed from buf_page_read_complete(). Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch. buf_page_t::can_relocate(): If the page latch is being held or waited for, or the block is buffer-fixed or io-fixed, return false. (The condition on the page latch is new.) Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we will acquire the page latch before fix(), and unfix() before unlocking. buf_page_t::flush(): Replaces buf_flush_page(). Optimize the handling of FREED pages. buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held by the caller. buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates. buf_page_get_low(): Ignore guesses that are read-fixed because they may not yet be registered in buf_pool.page_hash and buf_pool.LRU. buf_page_optimistic_get(): Acquire latch before buffer-fixing. buf_page_make_young(): Leave read-fixed blocks alone, because they might not be registered in buf_pool.LRU yet. recv_sys_t::recover_deferred(), recv_sys_t::recover_low(): Possibly fix MDEV-26326, by holding a page X-latch instead of only buffer-fixing the page.	2021-11-18 17:47:19 +02:00
Marko Mäkelä	993b8edf3f	MDEV-26933 InnoDB fails to detect page number mismatch mtr_t::page_lock(): Validate the page number. ibuf_tree_root_get(): Remove assertions that became redundant. The assertions in btr_validate_level() are kind of redundant as well, but because they are ut_a(), they are also present in release builds, while the ones in mtr_t::page_lock() are only present in debug builds. btr_cur_position(): Do not duplicate an assertion that is part of page_cur_position(). dict_load_tablespace(): Introduce a new option DICT_ERR_IGNORE_TABLESPACE that will suppress loading a tablespace when a table is going to be dropped.	2021-11-01 13:35:47 +02:00
Marko Mäkelä	d95361107c	Merge 10.5 into 10.6	2021-09-24 14:38:52 +03:00
Marko Mäkelä	f5794e1dc6	MDEV-26445 innodb_undo_log_truncate is unnecessarily slow trx_purge_truncate_history(): Do not force a write of the undo tablespace that is being truncated. Instead, prevent page writes by acquiring an exclusive latch on all dirty pages of the tablespace. fseg_create(): Relax an assertion that could fail if a dirty undo page is being initialized during undo tablespace truncation (and trx_purge_truncate_history() already acquired an exclusive latch on it). fsp_page_create(): If we are truncating a tablespace, try to reuse a page that we may have already latched exclusively (because it was in buf_pool.flush_list). To some extent, this helps the test innodb.undo_truncate,16k to avoid running out of buffer pool. mtr_t::commit_shrink(): Mark as clean all pages that are outside the new bounds of the tablespace, and only add the newly reinitialized pages to the buf_pool.flush_list. buf_page_create(): Do not unnecessarily invoke change buffer merge on undo tablespaces. buf_page_t::clear_oldest_modification(bool temporary): Move some assertions to the caller buf_page_write_complete(). innodb.undo_truncate: Use a bigger innodb_buffer_pool_size=24M. On my system, it would otherwise hang 1 out of 1547 attempts (on the 40th repeat of innodb.undo_truncate,16k). Other page sizes were not affected.	2021-09-24 08:24:03 +03:00
Marko Mäkelä	f5fddae3cb	MDEV-26450: Corruption due to innodb_undo_log_truncate At least since commit `055a3334ad` (MDEV-13564) the undo log truncation in InnoDB did not work correctly. The main issue is that during the execution of trx_purge_truncate_history() some pages of the newly truncated undo tablespace could be discarded. This is improved from commit `1cb218c37c` which was applied to earlier-version branches. fsp_try_extend_data_file(): Apply the peculiar rounding of fil_space_t::size_in_header only to the system tablespace, whose size can be expressed in megabytes in a configuration parameter. Other files may freely grow by a number of pages. fseg_alloc_free_page_low(): Do allow the extension of undo tablespaces, and mention the file name in the error message. mtr_t::commit_shrink(): Implement crash-safe shrinking of a tablespace: (1) durably write the log (2) release the page latches of the rebuilt tablespace (3) release the mutexes (4) truncate the file (5) release the tablespace latch This is refactored from trx_purge_truncate_history(). log_write_and_flush_prepare(), log_write_and_flush(): New functions to durably write log during mtr_t::commit_shrink().	2021-09-24 08:22:19 +03:00
Vladislav Vaintroub	e7f4daf88c	merge 10.5 to 10.6	2021-07-16 22:12:09 +02:00
Vladislav Vaintroub	fc2ec25733	MDEV-26166 replace log_write_up_to(LSN_MAX,...) with log_buffer_flush_to_disk() Also, remove comparison lsn > flush/write lsn, prior to calling log_write_up_to. The checks and early returns are part of this function.	2021-07-16 18:44:58 +02:00
Marko Mäkelä	101da87228	Merge 10.5 into 10.6	2021-06-23 19:36:45 +03:00
Marko Mäkelä	6441bc614a	MDEV-25113: Introduce a page cleaner mode before 'furious flush' MDEV-23855 changed the way how the page cleaner is signaled by user threads. If a threshold is exceeded, a mini-transaction commit would invoke buf_flush_ahead() in order to initiate page flushing before all writers would eventually grind to halt in log_free_check(), waiting for the checkpoint age to reduce. However, buf_flush_ahead() would always initiate 'furious flushing', making the buf_flush_page_cleaner thread write innodb_io_capacity_max pages per batch, and sleeping no time between batches, until the limit LSN is reached. Because this could saturate the I/O subsystem, system throughput could significantly reduce during these 'furious flushing' spikes. With this change, we introduce a gentler version of flush-ahead, which would write innodb_io_capacity_max pages per second until the 'soft limit' is reached. buf_flush_ahead(): Add a parameter to specify whether furious flushing is requested. buf_flush_async_lsn: Similar to buf_flush_sync_lsn, a limit for the less intrusive flushing. buf_flush_page_cleaner(): Keep working until buf_flush_async_lsn has been reached. log_close(): Suppress a warning message in the event that a new log is being created during startup, when old logs did not exist. Return what type of page cleaning will be needed. mtr_t::finish_write(): Also when m_log.is_small(), invoke log_close(). Return what type of page cleaning will be needed. mtr_t::commit(): Invoke buf_flush_ahead() based on the return value of mtr_t::finish_write().	2021-06-23 19:06:52 +03:00
Marko Mäkelä	f619d79bda	MDEV-25491: Race condition between DROP TABLE and purge of SYS_INDEXES record btr_free_if_exists(): Always use the BUF_GET_POSSIBLY_FREED mode when accessing pages, because due to MDEV-24589 the function fil_space_t::set_stopping(true) can be called at any time during the execution of this function. mtr_t::m_freeing_tree: New data member for debugging purposes. buf_page_get_low(): Assert that the BUF_GET mode is not being used anywhere during the execution of btr_free_if_exists(). In all code related to freeing or allocating pages, we will add some robustness, by making more use of BUF_GET_POSSIBLY_FREED and by reporting an error instead of crashing in some cases of corruption.	2021-04-28 17:24:44 +03:00
Marko Mäkelä	a1542f8a57	MDEV-24643: Assertion failed in rw_lock::update_unlock() mtr_defer_drop_ahi(): Upgrade the U lock to X lock and downgrade it back to U lock in case the adaptive hash index needs to be dropped. This regression was introduced in commit `03ca6495df` (MDEV-24142).	2021-02-12 17:44:58 +02:00
Marko Mäkelä	394fc71f4f	MDEV-24569: Merge 10.5 into 10.6 fseg_page_is_free(): Because MDEV-24167 changed fil_space_t::latch to a simple non-recursive rw-lock, we must avoid acquiring a shared latch if the current thread already holds an exclusive latch. This affects the test innodb.innodb_bug59733, which is exercising the change buffer. fil_space_t::is_owner(): Make available in non-debug builds.	2021-01-14 16:35:05 +02:00
Marko Mäkelä	ad436d9221	Merge 10.5 into 10.6	2020-12-21 19:52:49 +02:00
Marko Mäkelä	e87a8efd32	MDEV-24455 Assertion `!m_freed_space' failed in mtr_t::start In commit `0c23e32d27` (MDEV-24445) we forgot to keep m_freed_space in sync with m_freed_pages in one case.	2020-12-21 10:48:51 +02:00
Marko Mäkelä	30dc4287ec	Merge 10.5 into 10.6	2020-12-19 12:23:06 +02:00
Marko Mäkelä	0c23e32d27	MDEV-24445 Using innodb_undo_tablespaces corrupts system tablespace In the rewrite of MDEV-8139 (based on MDEV-15528), we introduced a wrong assumption that any persistent tablespace that is not an .ibd file is the system tablespace. This assumption is broken when innodb_undo_tablespaces (files undo001, undo002, ...) are being used. By default, we have innodb_undo_tablespaces=0 (the persistent undo log is being stored in the system tablespace). In MDEV-15528 and MDEV-8139 we rewrote the page scrubbing logic so that it will follow the tried-and-true write-ahead logging protocol, first writing FREE_PAGE records and then in the page flushing, zerofilling or hole-punching freed pages. Unfortunately, the implementation included a wrong assumption that that anything that is not in an .ibd file must be the system tablespace. This wrong assumption would cause overwrites of valid data pages in the system tablespace. mtr_t::m_freed_in_system_tablespace: Remove. mtr_t::m_freed_space: The tablespace associated with m_freed_pages. buf_page_free(): Take the tablespace and page number as a parameter, instead of taking a page identifier.	2020-12-18 19:23:34 +02:00
Marko Mäkelä	ca821692af	Merge 10.5 into 10.6	2020-12-09 14:36:38 +02:00
Marko Mäkelä	aa0e380568	MDEV-24348 InnoDB shutdown hang with innodb_flush_sync=0 This hang was caused by MDEV-23855, and we failed to fix it in MDEV-24109 (commit `4cbfdeca84`). When buf_flush_ahead() is invoked soon before server shutdown and the non-default setting innodb_flush_sync=OFF is in effect and the buffer pool contains dirty pages of temporary tables, the page cleaner thread may remain in an infinite loop without completing its work, thus causing the shutdown to hang. buf_flush_page_cleaner(): If the buffer pool contains no unmodified persistent pages, ensure that buf_flush_sync_lsn= 0 will be assigned, so that shutdown will proceed. The test case is not deterministic. On my system, it reproduced the hang with 95% probability when running multiple instances of the test in parallel, and 4% when running single-threaded. Thanks to Eugene Kosov for debugging and testing this.	2020-12-04 14:11:48 +02:00
Marko Mäkelä	03ca6495df	MDEV-24142: Replace InnoDB rw_lock_t with sux_lock InnoDB buffer pool block and index tree latches depend on a special kind of read-update-write lock that allows reentrant (recursive) acquisition of the 'update' and 'write' locks as well as an upgrade from 'update' lock to 'write' lock. The 'update' lock allows any number of reader locks from other threads, but no concurrent 'update' or 'write' lock. If there were no requirement to support an upgrade from 'update' to 'write', we could compose the lock out of two srw_lock (implemented as any type of native rw-lock, such as SRWLOCK on Microsoft Windows). Removing this requirement is very difficult, so in commit f7e7f487d4b06695f91f6fbeb0396b9d87fc7bbf we implemented an 'update' mode to our srw_lock. Re-entrant or recursive locking is mostly needed when writing or freeing BLOB pages, but also in crash recovery or when merging buffered changes to an index page. The re-entrancy allows us to attach a previously acquired page to a sub-mini-transaction that will be committed before whatever else is holding the page latch. The SUX lock supports Shared ('read'), Update, and eXclusive ('write') locking modes. The S latches are not re-entrant, but a single S latch may be acquired even if the thread already holds an U latch. The idea of the U latch is to allow a write of something that concurrent readers do not care about (such as the contents of BTR_SEG_LEAF, BTR_SEG_TOP and other page allocation metadata structures, or the MDEV-6076 PAGE_ROOT_AUTO_INC). (The PAGE_ROOT_AUTO_INC field is only updated when a dict_table_t for the table exists, and only read when a dict_table_t for the table is being added to dict_sys.) block_lock::u_lock_try(bool for_io=true) is used in buf_flush_page() to allow concurrent readers but no concurrent modifications while the page is being written to the data file. That latch will be released by buf_page_write_complete() in a different thread. Hence, we use the special lock owner value FOR_IO. The index_lock::u_lock() improves concurrency on operations that involve non-leaf index pages. The interface has been cleaned up a little. We will use x_lock_recursive() instead of x_lock() when we know that a lock is already held by the current thread. Similarly, a lock upgrade from U to X is only allowed via u_x_upgrade() or x_lock_upgraded() but not via x_lock(). We will disable the LatchDebug and sync_array interfaces to InnoDB rw-locks. The SEMAPHORES section of SHOW ENGINE INNODB STATUS output will no longer include any information about InnoDB rw-locks, only TTASEventMutex (cmake -DMUTEXTYPE=event) waits. This will make a part of the 'innotop' script dead code. The block_lock buf_block_t::lock will not be covered by any PERFORMANCE_SCHEMA instrumentation. SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES will no longer output source code file names or line numbers. The dict_index_t::lock will be identified by index and table names, which should be much more useful. PERFORMANCE_SCHEMA is lumping information about all dict_index_t::lock together as event_name='wait/synch/sxlock/innodb/index_tree_rw_lock'. buf_page_free(): Remove the file,line parameters. The sux_lock will not store such diagnostic information. buf_block_dbg_add_level(): Define as empty macro, to be removed in a subsequent commit. Unless the build was configured with cmake -DPLUGIN_PERFSCHEMA=NO the index_lock dict_index_t::lock will be instrumented via PERFORMANCE_SCHEMA. Similar to commit `1669c8890c` we will distinguish lock waits by registering shared_lock,exclusive_lock events instead of try_shared_lock,try_exclusive_lock. Actual 'try' operations will not be instrumented at all. rw_lock_list: Remove. After MDEV-24167, this only covered buf_block_t::lock and dict_index_t::lock. We will output their information by traversing buf_pool or dict_sys.	2020-12-03 15:19:49 +02:00
Marko Mäkelä	a13fac9eee	Merge 10.5 into 10.6	2020-12-03 08:12:47 +02:00

1 2 3 4 5

221 commits