mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-31 11:01:52 +01:00

Author	SHA1	Message	Date
Marko Mäkelä	431200090e	MDEV-22867 Assertion instant.n_core_fields == n_core_fields failed This is a race condition where a table on which a 10.3-style instant ADD COLUMN is emptied during the execution of ALTER TABLE ... DROP COLUMN ..., DROP INDEX ..., ALGORITHM=NOCOPY. In commit `2c4844c9e7` the function instant_metadata_lock() would prevent this race condition. But, it would also hold a page latch on the leftmost leaf page of clustered index for the duration of a possible DROP INDEX operation. The race could be fixed by restoring the function instant_metadata_lock() that was removed in commit `ea37b14409` but it would be more future-proof to prevent the dict_index_t::clear_instant_add() call from being issued at all. We at some point support DROP COLUMN ..., ADD INDEX ..., ALGORITHM=NOCOPY and that would spend a non-trivial amount of execution time in ha_innobase::inplace_alter(), making a server hang possible. Currently this is not supported and our added test case will notice when the support is introduced. dict_index_t::must_avoid_clear_instant_add(): Determine if a call to clear_instant_add() must be avoided. btr_discard_only_page_on_level(): Preserve the metadata record if must_avoid_clear_instant_add() holds. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Do not remove the metadata record even if the table becomes empty but must_avoid_clear_instant_add() holds. btr_pcur_store_position(): Relax a debug assertion. This is joint work with Thirunarayanan Balathandayuthapani.	2020-06-12 20:33:39 +03:00
Thirunarayanan Balathandayuthapani	c92f7e287f	MDEV-8139 Fix Scrubbing fil_space_t::freed_ranges: Store ranges of freed page numbers. fil_space_t::last_freed_lsn: Store the most recent LSN of freeing a page. fil_space_t::freed_mutex: Protects freed_ranges, last_freed_lsn. fil_space_create(): Initialize the freed_range mutex. fil_space_free_low(): Frees the freed_range mutex. range_set: Ranges of page numbers. buf_page_create(): Removes the page from freed_ranges when page is being reused. btr_free_root(): Remove the PAGE_INDEX_ID invalidation. Because btr_free_root() and dict_drop_index_tree() are executed in the same atomic mini-transaction, there is no need to invalidate the root page. buf_release_freed_page(): Split from buf_flush_freed_page(). Skip any I/O buf_flush_freed_pages(): Get the freed ranges from tablespace and Write punch-hole or zeroes of the freed ranges. buf_flush_try_neighbors(): Handles the flushing of freed ranges. mtr_t::freed_pages: Variable to store the list of freed pages. mtr_t::add_freed_pages(): To add freed pages. mtr_t::clear_freed_pages(): To clear the freed pages. mtr_t::m_freed_in_system_tablespace: Variable to indicate whether page has been freed in system tablespace. mtr_t::m_trim_pages: Variable to indicate whether the space has been trimmed. mtr_t::commit(): Add the freed page and update the last freed lsn in the tablespace and clear the tablespace freed range if space is trimmed. file_name_t::freed_pages: Store the freed pages during recovery. file_name_t::add_freed_page(), file_name_t::remove_freed_page(): To add and remove freed page during recovery. store_freed_or_init_rec(): Store or remove the freed pages while encountering FREE_PAGE or INIT_PAGE redo log record. recv_init_crash_recovery_spaces(): Add the freed page encountered during recovery to respective tablespace.	2020-06-12 09:17:51 +05:30
Marko Mäkelä	17a7bafec0	MDEV-22110 preparation: Remove mtr_memo_contains macros Let us invoke the debug member functions of mtr_t directly. mtr_t::memo_contains(): Change the parameter type to const rw_lock_t&. This function cannot be invoked on buf_block_t::lock. The function mtr_t::memo_contains_flagged() is intended to be invoked on buf_block_t* or rw_lock_t*, and it along with mtr_t::memo_contains_page_flagged() are the way to check whether a buffer pool page has been latched within a mini-transaction.	2020-06-10 07:50:09 +03:00
Marko Mäkelä	b1ab211dee	MDEV-15053 Reduce buf_pool_t::mutex contention User-visible changes: The INFORMATION_SCHEMA views INNODB_BUFFER_PAGE and INNODB_BUFFER_PAGE_LRU will report a dummy value FLUSH_TYPE=0 and will no longer report the PAGE_STATE value READY_FOR_USE. We will remove some fields from buf_page_t and move much code to member functions of buf_pool_t and buf_page_t, so that the access rules of data members can be enforced consistently. Evicting or adding pages in buf_pool.LRU will remain covered by buf_pool.mutex. Evicting or adding pages in buf_pool.page_hash will remain covered by both buf_pool.mutex and the buf_pool.page_hash X-latch. After this fix, buf_pool.page_hash lookups can entirely avoid acquiring buf_pool.mutex, only relying on buf_pool.hash_lock_get() S-latch. Similarly, buf_flush_check_neighbors() can will rely solely on buf_pool.mutex, no buf_pool.page_hash latch at all. The buf_pool.mutex is rather contended in I/O heavy benchmarks, especially when the workload does not fit in the buffer pool. The first attempt to alleviate the contention was the buf_pool_t::mutex split in commit `4ed7082eef` which introduced buf_block_t::mutex, which we are now removing. Later, multiple instances of buf_pool_t were introduced in commit `c18084f71b` and recently removed by us in commit `1a6f708ec5` (MDEV-15058). UNIV_BUF_DEBUG: Remove. This option to enable some buffer pool related debugging in otherwise non-debug builds has not been used for years. Instead, we have been using UNIV_DEBUG, which is enabled in CMAKE_BUILD_TYPE=Debug. buf_block_t::mutex, buf_pool_t::zip_mutex: Remove. We can mainly rely on std::atomic and the buf_pool.page_hash latches, and in some cases depend on buf_pool.mutex or buf_pool.flush_list_mutex just like before. We must always release buf_block_t::lock before invoking unfix() or io_unfix(), to prevent a glitch where a block that was added to the buf_pool.free list would apper X-latched. See commit `c5883debd6` how this glitch was finally caught in a debug environment. We move some buf_pool_t::page_hash specific code from the ha and hash modules to buf_pool, for improved readability. buf_pool_t::close(): Assert that all blocks are clean, except on aborted startup or crash-like shutdown. buf_pool_t::validate(): No longer attempt to validate n_flush[] against the number of BUF_IO_WRITE fixed blocks, because buf_page_t::flush_type no longer exists. buf_pool_t::watch_set(): Replaces buf_pool_watch_set(). Reduce mutex contention by separating the buf_pool.watch[] allocation and the insert into buf_pool.page_hash. buf_pool_t::page_hash_lock<bool exclusive>(): Acquire a buf_pool.page_hash latch. Replaces and extends buf_page_hash_lock_s_confirm() and buf_page_hash_lock_x_confirm(). buf_pool_t::READ_AHEAD_PAGES: Renamed from BUF_READ_AHEAD_PAGES. buf_pool_t::curr_size, old_size, read_ahead_area, n_pend_reads: Use Atomic_counter. buf_pool_t::running_out(): Replaces buf_LRU_buf_pool_running_out(). buf_pool_t::LRU_remove(): Remove a block from the LRU list and return its predecessor. Incorporates buf_LRU_adjust_hp(), which was removed. buf_page_get_gen(): Remove a redundant call of fsp_is_system_temporary(), for mode == BUF_GET_IF_IN_POOL_OR_WATCH, which is only used by BTR_DELETE_OP (purge), which is never invoked on temporary tables. buf_free_from_unzip_LRU_list_batch(): Avoid redundant assignments. buf_LRU_free_from_unzip_LRU_list(): Simplify the loop condition. buf_LRU_free_page(): Clarify the function comment. buf_flush_check_neighbor(), buf_flush_check_neighbors(): Rewrite the construction of the page hash range. We will hold the buf_pool.mutex for up to buf_pool.read_ahead_area (at most 64) consecutive lookups of buf_pool.page_hash. buf_flush_page_and_try_neighbors(): Remove. Merge to its only callers, and remove redundant operations in buf_flush_LRU_list_batch(). buf_read_ahead_random(), buf_read_ahead_linear(): Rewrite. Do not acquire buf_pool.mutex, and iterate directly with page_id_t. ut_2_power_up(): Remove. my_round_up_to_next_power() is inlined and avoids any loops. fil_page_get_prev(), fil_page_get_next(), fil_addr_is_null(): Remove. buf_flush_page(): Add a fil_space_t* parameter. Minimize the buf_pool.mutex hold time. buf_pool.n_flush[] is no longer updated atomically with the io_fix, and we will protect most buf_block_t fields with buf_block_t::lock. The function buf_flush_write_block_low() is removed and merged here. buf_page_init_for_read(): Use static linkage. Initialize the newly allocated block and acquire the exclusive buf_block_t::lock while not holding any mutex. IORequest::IORequest(): Remove the body. We only need to invoke set_punch_hole() in buf_flush_page() and nowhere else. buf_page_t::flush_type: Remove. Replaced by IORequest::flush_type. This field is only used during a fil_io() call. That function already takes IORequest as a parameter, so we had better introduce for the rarely changing field. buf_block_t::init(): Replaces buf_page_init(). buf_page_t::init(): Replaces buf_page_init_low(). buf_block_t::initialise(): Initialise many fields, but keep the buf_page_t::state(). Both buf_pool_t::validate() and buf_page_optimistic_get() requires that buf_page_t::in_file() be protected atomically with buf_page_t::in_page_hash and buf_page_t::in_LRU_list. buf_page_optimistic_get(): Now that buf_block_t::mutex no longer exists, we must check buf_page_t::io_fix() after acquiring the buf_pool.page_hash lock, to detect whether buf_page_init_for_read() has been initiated. We will also check the io_fix() before acquiring hash_lock in order to avoid unnecessary computation. The field buf_block_t::modify_clock (protected by buf_block_t::lock) allows buf_page_optimistic_get() to validate the block. buf_page_t::real_size: Remove. It was only used while flushing pages of page_compressed tables. buf_page_encrypt(): Add an output parameter that allows us ot eliminate buf_page_t::real_size. Replace a condition with debug assertion. buf_page_should_punch_hole(): Remove. buf_dblwr_t::add_to_batch(): Replaces buf_dblwr_add_to_batch(). Add the parameter size (to replace buf_page_t::real_size). buf_dblwr_t::write_single_page(): Replaces buf_dblwr_write_single_page(). Add the parameter size (to replace buf_page_t::real_size). fil_system_t::detach(): Replaces fil_space_detach(). Ensure that fil_validate() will not be violated even if fil_system.mutex is released and reacquired. fil_node_t::complete_io(): Renamed from fil_node_complete_io(). fil_node_t::close_to_free(): Replaces fil_node_close_to_free(). Avoid invoking fil_node_t::close() because fil_system.n_open has already been decremented in fil_space_t::detach(). BUF_BLOCK_READY_FOR_USE: Remove. Directly use BUF_BLOCK_MEMORY. BUF_BLOCK_ZIP_DIRTY: Remove. Directly use BUF_BLOCK_ZIP_PAGE, and distinguish dirty pages by buf_page_t::oldest_modification(). BUF_BLOCK_POOL_WATCH: Remove. Use BUF_BLOCK_NOT_USED instead. This state was only being used for buf_page_t that are in buf_pool.watch. buf_pool_t::watch[]: Remove pointer indirection. buf_page_t::in_flush_list: Remove. It was set if and only if buf_page_t::oldest_modification() is nonzero. buf_page_decrypt_after_read(), buf_corrupt_page_release(), buf_page_check_corrupt(): Change the const fil_space_t* parameter to const fil_node_t& so that we can report the correct file name. buf_page_monitor(): Declare as an ATTRIBUTE_COLD global function. buf_page_io_complete(): Split to buf_page_read_complete() and buf_page_write_complete(). buf_dblwr_t::in_use: Remove. buf_dblwr_t::buf_block_array: Add IORequest::flush_t. buf_dblwr_sync_datafiles(): Remove. It was a useless wrapper of os_aio_wait_until_no_pending_writes(). buf_flush_write_complete(): Declare static, not global. Add the parameter IORequest::flush_t. buf_flush_freed_page(): Simplify the code. recv_sys_t::flush_lru: Renamed from flush_type and changed to bool. fil_read(), fil_write(): Replaced with direct use of fil_io(). fil_buffering_disabled(): Remove. Check srv_file_flush_method directly. fil_mutex_enter_and_prepare_for_io(): Return the resolved fil_space_t* to avoid a duplicated lookup in the caller. fil_report_invalid_page_access(): Clean up the parameters. fil_io(): Return fil_io_t, which comprises fil_node_t and error code. Always invoke fil_space_t::acquire_for_io() and let either the sync=true caller or fil_aio_callback() invoke fil_space_t::release_for_io(). fil_aio_callback(): Rewrite to replace buf_page_io_complete(). fil_check_pending_operations(): Remove a parameter, and remove some redundant lookups. fil_node_close_to_free(): Wait for n_pending==0. Because we no longer do an extra lookup of the tablespace between fil_io() and the completion of the operation, we must give fil_node_t::complete_io() a chance to decrement the counter. fil_close_tablespace(): Remove unused parameter trx, and document that this is only invoked during the error handling of IMPORT TABLESPACE. row_import_discard_changes(): Merged with the only caller, row_import_cleanup(). Do not lock up the data dictionary while invoking fil_close_tablespace(). logs_empty_and_mark_files_at_shutdown(): Do not invoke fil_close_all_files(), to avoid a !needs_flush assertion failure on fil_node_t::close(). innodb_shutdown(): Invoke os_aio_free() before fil_close_all_files(). fil_close_all_files(): Invoke fil_flush_file_spaces() to ensure proper durability. thread_pool::unbind(): Fix a crash that would occur on Windows after srv_thread_pool->disable_aio() and os_file_close(). This fix was submitted by Vladislav Vaintroub. Thanks to Matthias Leich and Axel Schwenke for extensive testing, Vladislav Vaintroub for helpful comments, and Eugene Kosov for a review.	2020-06-05 12:35:46 +03:00
Marko Mäkelä	58d2d82022	MDEV-22710 Assertion ...status != buf_page_t::FREED in ibuf_remove_free_page() The buf_page_free() call that was introduced in MDEV-15528 was performed too early in fseg_free_page(), tripping a debug check in ibuf_remove_free_page(). In all other callers, we can (and will) invoke buf_page_free() right after fseg_free_page(), but in ibuf_remove_free_page() we will defer that call to the end of the mini-transaction. (That call was already present.)	2020-06-03 14:13:16 +03:00
Marko Mäkelä	23047d3ed4	Merge 10.4 into 10.5	2020-05-18 17:30:02 +03:00
Marko Mäkelä	9e6e43551f	Merge 10.3 into 10.4 We will expose some more std::atomic internals in Atomic_counter, so that dict_index_t::lock will support the default assignment operator.	2020-05-16 07:39:15 +03:00
Marko Mäkelä	3d0bb2b7f1	Merge 10.2 into 10.3	2020-05-15 19:11:57 +03:00
Marko Mäkelä	ad6171b91c	MDEV-22456 Dropping the adaptive hash index may cause DDL to lock up InnoDB If the InnoDB buffer pool contains many pages for a table or index that is being dropped or rebuilt, and if many of such pages are pointed to by the adaptive hash index, dropping the adaptive hash index may consume a lot of time. The time-consuming operation of dropping the adaptive hash index entries is being executed while the InnoDB data dictionary cache dict_sys is exclusively locked. It is not actually necessary to drop all adaptive hash index entries at the time a table or index is being dropped or rebuilt. We can let the LRU replacement policy of the buffer pool take care of this gradually. For this to work, we must detach the dict_table_t and dict_index_t objects from the main dict_sys cache, and once the last adaptive hash index entry for the detached table is removed (when the garbage page is evicted from the buffer pool) we can free the dict_table_t and dict_index_t object. Related to this, in MDEV-16283, we made ALTER TABLE...DISCARD TABLESPACE skip both the buffer pool eviction and the drop of the adaptive hash index. We shifted the burden to ALTER TABLE...IMPORT TABLESPACE or DROP TABLE. We can remove the eviction from DROP TABLE. We must retain the eviction in the ALTER TABLE...IMPORT TABLESPACE code path, so that in case the discarded table is being re-imported with the same tablespace identifier, the fresh data from the imported tablespace will replace any stale pages in the buffer pool. rpl.rpl_failed_drop_tbl_binlog: Remove the test. DROP TABLE can no longer be interrupted inside InnoDB. fseg_free_page(), fseg_free_step(), fseg_free_step_not_header(), fseg_free_page_low(), fseg_free_extent(): Remove the parameter that specifies whether the adaptive hash index should be dropped. btr_search_lazy_free(): Lazily free an index when the last reference to it is dropped from the adaptive hash index. buf_pool_clear_hash_index(): Declare static, and move to the same compilation unit with the bulk of the adaptive hash index code. dict_index_t::clone(), dict_index_t::clone_if_needed(): Clone an index that is being rebuilt while adaptive hash index entries exist. The original index will be inserted into dict_table_t::freed_indexes and dict_index_t::set_freed() will be called. dict_index_t::set_freed(), dict_index_t::freed(): Note that or check whether the index has been freed. We will use the impossible page number 1 to denote this condition. dict_index_t::n_ahi_pages(): Replaces btr_search_info_get_ref_count(). dict_index_t::detach_columns(): Move the assignment n_fields=0 to ha_innobase_inplace_ctx::clear_added_indexes(). We must have access to the columns when freeing the adaptive hash index. Note: dict_table_t::v_cols[] will remain valid. If virtual columns are dropped or added, the table definition will be reloaded in ha_innobase::commit_inplace_alter_table(). buf_page_mtr_lock(): Drop a stale adaptive hash index if needed. We will also reduce the number of btr_get_search_latch() calls and enclose some more code inside #ifdef BTR_CUR_HASH_ADAPT in order to benefit cmake -DWITH_INNODB_AHI=OFF.	2020-05-15 17:23:08 +03:00
Marko Mäkelä	7bcaa541aa	Merge 10.4 into 10.5	2020-05-05 21:16:22 +03:00
Marko Mäkelä	2c3c851d2c	Merge 10.3 into 10.4	2020-05-05 20:33:10 +03:00
Oleksandr Byelkin	7fb73ed143	Merge branch '10.2' into 10.3	2020-05-04 16:47:11 +02:00
Daniel Black	ba2061da52	MDEV-21595: innodb offset_t rename to rec_offs thanks to: perl -i -pe 's/\boffset_t\b/rec_offs/g' $(git grep -lw offset_t storage/innobase)	2020-04-29 12:02:47 +03:00
Marko Mäkelä	0870b75af7	MDEV-22126 Rename confusing constant mtr_t::OPT The template parameter mtr_t::OPT refers to optional, not optimized. Also the default parameter mtr_t::NORMAL refers to optimized writes. The name MAYBE_NOP would be more descriptive, conveying the idea that a write to a durable page might not actually have any effect.	2020-04-03 08:50:46 +03:00
Marko Mäkelä	f224525204	MDEV-21907: InnoDB: Enable -Wconversion on clang and GCC The -Wconversion in GCC seems to be stricter than in clang. GCC at least since version 4.4.7 issues truncation warnings for assignments to bitfields, while clang 10 appears to only issue warnings when the sizes in bytes rounded to the nearest integer powers of 2 are different. Before GCC 10.0.0, -Wconversion required more casts and would not allow some operations, such as x<<=1 or x+=1 on a data type that is narrower than int. GCC 5 (but not GCC 4, GCC 6, or any later version) is complaining about x\|=y even when x and y are compatible types that are narrower than int. Hence, we must rewrite some x\|=y as x=static_cast<byte>(x\|y) or similar, or we must disable -Wconversion. In GCC 6 and later, the warning for assigning wider to bitfields that are narrower than 8, 16, or 32 bits can be suppressed by applying a bitwise & with the exact bitmask of the bitfield. For older GCC, we must disable -Wconversion for GCC 4 or 5 in such cases. The bitwise negation operator appears to promote short integers to a wider type, and hence we must add explicit truncation casts around them. Microsoft Visual C does not allow a static_cast to truncate a constant, such as static_cast<byte>(1) truncating int. Hence, we will use the constructor-style cast byte(~1) for such cases. This has been tested at least with GCC 4.8.5, 5.4.0, 7.4.0, 9.2.1, 10.0.0, clang 9.0.1, 10.0.0, and MSVC 14.22.27905 (Microsoft Visual Studio 2019) on 64-bit and 32-bit targets (IA-32, AMD64, POWER 8, POWER 9, ARMv8).	2020-03-12 19:46:41 +02:00
Marko Mäkelä	8be3794b42	MDEV-21924 Clean up InnoDB GIS record comparison The extension of the record comparison functions for SPATIAL INDEX in mysql/mysql-server@b66ad511b6 was suboptimal for multiple reasons: Some functions used unnecessary temporary variables of the int type, instead of the more appropriate size_t, causing type mismatch. Many functions unnecessarily required rec_get_offsets() to be computed, or a parameter for length, although the size of the minimum bounding rectangle (MBR) is hard-coded as SPDIMS * 2 * sizeof(double), or 32 bytes. In InnoDB SPATIAL INDEX records, there always is a 32-byte key followed by either a 4-byte child page number or the PRIMARY KEY value. The length parameters were not properly validated. The function cmp_geometry_field() was making an incorrect attempt at checking that the lengths are at least sizeof(double) (8 bytes), even though the function is accessing up to 32 bytes in both MBR. Functions that are called from only one compilation unit are defined in another compilation unit, making the code harder to follow and potentially slower to execute. cmp_dtuple_rec_with_gis(): FIXME: Correct the debug assertion and possibly the function TABLE_SHARE::init_from_binary_frm_image() or related code, which causes an unexpected length of DATA_MBR_LEN + 2 bytes to be passed to this function.	2020-03-12 18:13:53 +02:00
Marko Mäkelä	574d8b2940	MDEV-21907: Fix most clang -Wconversion in InnoDB Declare innodb_purge_threads as 4-byte integer (UINT) instead of 4-or-8-byte (ULONG) and adjust the documentation string.	2020-03-11 08:29:48 +02:00
Marko Mäkelä	96901d9545	Cleanup: Remove dict_ind_redundant There is no reason for the dummy index object dict_ind_redundant to exist any more. It was only being passed to btr_create(). btr_create(): If !index, assume that a ROW_FORMAT=REDUNDANT table is being created. We could pass ibuf.index, dict_sys.sys_tables->indexes.start and so on, if those objects had been initialized before the function btr_create() is called.	2020-02-20 22:00:43 +02:00
Marko Mäkelä	23de5b8f07	MDEV-21725 Optimize btr_page_reorganize_low() redo logging btr_page_reorganize_low(): Log only the changed data in the page. TODO: Do not copy the entire changed payload to the redo log. Emit a combination of MEMMOVE and WRITE records to reduce the log volume.	2020-02-18 10:54:28 +02:00
Marko Mäkelä	fc87698048	MDEV-12353: Write less log for BLOB pages fsp_page_create(): Always initialize the page. The logic to avoid initialization was made redundant and should have been removed in mysql/mysql-server@ce0a1e85e2 (MySQL 5.7.5). btr_store_big_rec_extern_fields(): Remove the redundant initialization of FIL_PAGE_PREV and FIL_PAGE_NEXT. An INIT_PAGE record will have been written already. Only write the ROW_FORMAT=COMPRESSED page payload from FIL_PAGE_DATA onwards. We were unnecessarily writing from FIL_PAGE_TYPE onwards, which caused an assertion failure on recovery: recv_sys_t::alloc(size_t): Assertion 'len <= srv_page_size' failed when running the following tests: ./mtr --no-reorder innodb_zip.blob,4k innodb_zip.bug56680,4k	2020-02-17 10:13:32 +02:00
Marko Mäkelä	f8a9f90667	MDEV-12353: Remove support for crash-upgrade We tighten some assertions regarding dict_index_t::is_dummy and crash recovery, now that redo log processing will no longer create dummy objects.	2020-02-13 19:13:45 +02:00
Marko Mäkelä	7ae21b18a6	MDEV-12353: Change the redo log encoding log_t::FORMAT_10_5: physical redo log format tag log_phys_t: Buffered records in the physical format. The log record bytes will follow the last data field, making use of alignment padding that would otherwise be wasted. If there are multiple records for the same page, also those may be appended to an existing log_phys_t object if the memory is available. In the physical format, the first byte of a record identifies the record and its length (up to 15 bytes). For longer records, the immediately following bytes will encode the remaining length in a variable-length encoding. Usually, a variable-length-encoded page identifier will follow, followed by optional payload, whose length is included in the initially encoded total record length. When a mini-transaction is updating multiple fields in a page, it can avoid repeating the tablespace identifier and page number by setting the same_page flag (most significant bit) in the first byte of the log record. The byte offset of the record will be relative to where the previous record for that page ended. Until MDEV-14425 introduces a separate file-level log for redo log checkpoints and file operations, we will write the file-level records in the page-level redo log file. The record FILE_CHECKPOINT (which replaces MLOG_CHECKPOINT) will be removed in MDEV-14425, and one sequential scan of the page recovery log will suffice. Compared to MLOG_FILE_CREATE2, FILE_CREATE will not include any flags. If the information is needed, it can be parsed from WRITE records that modify FSP_SPACE_FLAGS. MLOG_ZIP_WRITE_STRING: Remove. The record was only introduced temporarily as part of this work, before being replaced with WRITE (along with MLOG_WRITE_STRING, MLOG_1BYTE, MLOG_nBYTES). mtr_buf_t::empty(): Check if the buffer is empty. mtr_t::m_n_log_recs: Remove. It suffices to check if m_log is empty. mtr_t::m_last, mtr_t::m_last_offset: End of the latest m_log record, for the same_page encoding. page_recv_t::last_offset: Reflects mtr_t::m_last_offset. Valid values for last_offset during recovery should be 0 or above 8. (The first 8 bytes of a page are the checksum and the page number, and neither are ever updated directly by log records.) Internally, the special value 1 indicates that the same_page form will not be allowed for the subsequent record. mtr_t::page_create(): Take the block descriptor as parameter, so that it can be compared to mtr_t::m_last. The INIT_INDEX_PAGE record will always followed by a subtype byte, because same_page records must be longer than 1 byte. trx_undo_page_init(): Combine the writes in WRITE record. trx_undo_header_create(): Write 4 bytes using a special MEMSET record that includes 1 bytes of length and 2 bytes of payload. flst_write_addr(): Define as a static function. Combine the writes. flst_zero_both(): Replaces two flst_zero_addr() calls. flst_init(): Do not inline the function. fsp_free_seg_inode(): Zerofill the whole inode. fsp_apply_init_file_page(): Initialize FIL_PAGE_PREV,FIL_PAGE_NEXT to FIL_NULL when using the physical format. btr_create(): Assert !page_has_siblings() because fsp_apply_init_file_page() must have been invoked. fil_ibd_create(): Do not write FILE_MODIFY after FILE_CREATE. fil_names_dirty_and_write(): Remove the parameter mtr. Write the records using a separate mini-transaction object, because any FILE_ records must be at the start of a mini-transaction log. recv_recover_page(): Add a fil_space_t* parameter. After applying log to the a ROW_FORMAT=COMPRESSED page, invoke buf_zip_decompress() to restore the uncompressed page. buf_page_io_complete(): Remove the temporary hack to discard the uncompressed page of a ROW_FORMAT=COMPRESSED page. page_zip_write_header(): Remove. Use mtr_t::write() or mtr_t::memset() instead, and update the compressed page frame separately. trx_undo_header_add_space_for_xid(): Remove. trx_undo_seg_create(): Perform the changes that were previously made by trx_undo_header_add_space_for_xid(). btr_reset_instant(): New function: Reset the table to MariaDB 10.2 or 10.3 format when rolling back an instant ALTER TABLE operation. page_rec_find_owner_rec(): Merge with the only callers. page_cur_insert_rec_low(): Combine writes by using a local buffer. MEMMOVE data from the preceding record whenever feasible (copying at least 3 bytes). page_cur_insert_rec_zip(): Combine writes to page header fields. PageBulk::insertPage(): Issue MEMMOVE records to copy a matching part from the preceding record. PageBulk::finishPage(): Combine the writes to the page header and to the sparse page directory slots. mtr_t::write(): Only log the least significant (last) bytes of multi-byte fields that actually differ. For updating FSP_SIZE, we must always write all 4 bytes to the redo log, so that the fil_space_set_recv_size() logic in recv_sys_t::parse() will work. mtr_t::memcpy(), mtr_t::zmemcpy(): Take a pointer argument instead of a numeric offset to the page frame. Only log the last bytes of multi-byte fields that actually differ. In fil_space_crypt_t::write_page0(), we must log also any unchanged bytes, so that recovery will recognize the record and invoke fil_crypt_parse(). Future work: MDEV-21724 Optimize page_cur_insert_rec_low() redo logging MDEV-21725 Optimize btr_page_reorganize_low() redo logging MDEV-21727 Optimize redo logging for ROW_FORMAT=COMPRESSED	2020-02-13 19:12:17 +02:00
Marko Mäkelä	2e7a084283	MDEV-21174: Remove mlog_write_initial_log_record_fast() Pass buf_block_t* to all functions that write redo log. Specifically, replace the parameters page,page_zip with buf_block_t* block in page_zip_ functions.	2020-02-13 18:19:15 +02:00
Marko Mäkelä	2a77b2a510	MDEV-12353: Replace MLOG_LIST__DELETE and MLOG_*REC_DELETE No longer write the following redo log records: MLOG_COMP_LIST_END_DELETE, MLOG_LIST_END_DELETE, MLOG_COMP_LIST_START_DELETE, MLOG_LIST_START_DELETE, MLOG_REC_DELETE,MLOG_COMP_REC_DELETE. Each individual deleted record will be logged separately using physical log records. page_dir_slot_set_n_owned(), page_zip_rec_set_owned(), page_zip_dir_delete(), page_zip_clear_rec(): Add the parameter mtr, and write redo log. page_dir_slot_set_rec(): Remove. Replaced with lower-level operations that write redo log when necessary. page_rec_set_n_owned(): Replaces rec_set_n_owned_old(), rec_set_n_owned_new(). rec_set_heap_no(): Replaces rec_set_heap_no_old(), rec_set_heap_no_new(). page_mem_free(), page_dir_split_slot(), page_dir_balance_slot(): Add the parameter mtr. page_dir_set_n_slots(): Merge with the caller page_dir_split_slot(). page_dir_slot_set_rec(): Merge with the callers page_dir_split_slot() and page_dir_balance_slot(). page_cur_insert_rec_low(), page_cur_insert_rec_zip(): Suppress the logging of lower-level operations. page_cur_delete_rec_write_log(): Remove. page_cur_delete_rec(): Do not tolerate mtr=NULL. rec_convert_dtuple_to_rec_old(), rec_convert_dtuple_to_rec_comp(): Replace rec_set_heap_no_old() and rec_set_heap_no_new() with direct access that does not involve redo logging. mtr_t::memcpy(): Do allow non-redo-logged writes to uncompressed pages of ROW_FORMAT=COMPRESSED pages. buf_page_io_complete(): Evict the uncompressed page of a ROW_FORMAT=COMPRESSED page after recovery. Because we no longer write logical log records for deleting index records, but instead write physical records that may refer directly to the compressed page frame of a ROW_FORMAT=COMPRESSED page, and because on recovery we will only apply the changes to the ROW_FORMAT=COMPRESSED page, the uncompressed page frame can be stale until page_zip_decompress() is executed. recv_parse_or_apply_log_rec_body(): After applying MLOG_ZIP_WRITE_STRING, ensure that the FIL_PAGE_TYPE of the uncompressed page matches the compressed page, because buf_flush_init_for_writing() assumes that field to be valid. mlog_init_t::mark_ibuf_exist(): Invoke page_zip_decompress(), because the uncompressed page after buf_page_create() is not necessarily up to date. buf_LRU_block_remove_hashed(): Bypass a page_zip_validate() check during redo log apply. recv_apply_hashed_log_recs(): Invoke mlog_init.mark_ibuf_exist() also for the last batch, to ensure that page_zip_decompress() will be called for freshly initialized pages.	2020-02-13 18:19:14 +02:00
Marko Mäkelä	d00185c40d	MDEV-12353: Replace MLOG_PAGE_CREATE_RTREE, MLOG_PAGE_COMP_CREATE_RTREE page_create(): Create normal B-tree pages. Callers that create R-tree pages will set FIL_PAGE_TYPE and reset the split sequence number afterwards. The creation of ROW_FORMAT=COMPRESSED pages is unaffected; they will be logged as compressed page images. page_create_low(): Take const buf_block_t* as a parameter. Let the callers invoke buf_block_modify_clock_inc().	2020-02-13 18:19:14 +02:00
Marko Mäkelä	db5cdc3195	MDEV-12353: Replace MLOG_PAGE_REORGANIZE, MLOG_COMP_PAGE_REORGANIZE Log page reorganize as a series of insert operations. This will make the redo log volume proportional to the page payload size. btr_page_reorganize_low(): Add template <bool recovery=false> btr_page_reorganize_block(): Remove the parameter 'bool recovery'	2020-02-13 18:19:14 +02:00
Marko Mäkelä	acd265b69b	MDEV-12353: Exclusively use page_zip_reorganize() for ROW_FORMAT=COMPRESSED page_zip_reorganize(): Restore the page on failure. In callers, omit now-redundant calls to page_zip_decompress(). btr_page_reorganize_low(): Define in static scope only, and remove the z_level parameter. Assert that ROW_FORMAT is not COMPRESSED. btr_page_reorganize_block(), btr_page_reorganize(): Invoke page_zip_reorganize() for ROW_FORMAT=COMPRESSED.	2020-02-13 18:19:14 +02:00
Marko Mäkelä	5bea43f5e0	MDEV-12353: Deprecate and ignore innodb_log_compressed_pages page_zip_compress_write_log_no_data(): Remove. We no longer write the MLOG_ZIP_PAGE_COMPRESS_NO_DATA record. Instead, we will write MLOG_ZIP_PAGE_COMPRESS records.	2020-02-13 18:19:13 +02:00
Marko Mäkelä	1a6f708ec5	MDEV-15058: Deprecate and ignore innodb_buffer_pool_instances Our benchmarking efforts indicate that the reasons for splitting the buf_pool in commit `c18084f71b` have mostly gone away, possibly as a result of mysql/mysql-server@ce6109ebfd or similar work. Only in one write-heavy benchmark where the working set size is ten times the buffer pool size, the buf_pool->mutex would be less contended with 4 buffer pool instances than with 1 instance, in buf_page_io_complete(). That contention could be alleviated further by making more use of std::atomic and by splitting buf_pool_t::mutex further (MDEV-15053). We will deprecate and ignore the following parameters: innodb_buffer_pool_instances innodb_page_cleaners There will be only one buffer pool and one page cleaner task. In a number of INFORMATION_SCHEMA views, columns that indicated the buffer pool instance will be removed: information_schema.innodb_buffer_page.pool_id information_schema.innodb_buffer_page_lru.pool_id information_schema.innodb_buffer_pool_stats.pool_id information_schema.innodb_cmpmem.buffer_pool_instance information_schema.innodb_cmpmem_reset.buffer_pool_instance	2020-02-12 14:45:21 +02:00
Marko Mäkelä	fc2f2fa853	MDEV-19747: Deprecate and ignore innodb_log_optimize_ddl During native table rebuild or index creation, InnoDB used to skip redo logging and write MLOG_INDEX_LOAD records to inform crash recovery and Mariabackup of the gaps in redo log. This is fragile and prohibits some optimizations, such as skipping the doublewrite buffer for newly (re)initialized pages (MDEV-19738). row_merge_write_redo(): Remove. We do not write MLOG_INDEX_LOAD records any more. Instead, we write full redo log. FlushObserver: Remove. fseg_free_page_func(): Remove the parameter log. Redo logging cannot be disabled. fil_space_t::redo_skipped_count: Remove. We cannot remove buf_block_t::skip_flush_check, because PageBulk will temporarily generate invalid B-tree pages in the buffer pool.	2020-02-11 18:44:26 +02:00
Eugene Kosov	700e010309	fix aligned memcpy()-like functions usage I found that memcpy_aligned was used incorrectly at redo log and decided to put assertions in aligned functions. And found even more incorrect cases. Given the amount discovered of bugs, I left assertions to prevent future bugs. my_assume_aligned(): instead of MY_ASSUME_ALIGNED macro	2020-01-23 00:12:43 +08:00
Marko Mäkelä	28c89b7151	Merge 10.4 into 10.5	2019-12-16 07:47:17 +02:00
Marko Mäkelä	745fd4b39f	MDEV-21174: Remove some mlog_write_initial_log_record_fast() Pass buf_block_t* to more functions that write redo log. page_zip_write_node_ptr(), page_zip_write_blob_ptr(), page_zip_compress_write_log_no_data(): Take buf_block_t* as parameter, and do not tolerate mtr=NULL. page_zip_compress(): Do not tolerate mtr=NULL. page_zip_dir_insert(): Take page_cur_t* as parameter. mlog_write_initial_log_record(): Remove. This function was unused. RecIterator::remove(): Remove the redundant page_zip parameter. PageConverter::m_page_zip_ptr: Remove.	2019-12-13 18:15:51 +02:00
Marko Mäkelä	2b5a269cb4	MDEV-21174: Clean up record insertion page_cur_insert_rec_low(): Take page_cur_t* as a parameter, and do not tolerate mtr=NULL. page_cur_insert_rec_zip(): Do not tolerate mtr=NULL.	2019-12-13 18:15:51 +02:00
Marko Mäkelä	8fa759a576	Merge 10.3 into 10.4 We disable the MDEV-21189 test galera.galera_partition because it times out.	2019-12-13 17:30:37 +02:00
Marko Mäkelä	3466b47b0d	Merge 10.2 into 10.3	2019-12-13 10:08:57 +02:00
Eugene Kosov	f0aa073f2b	MDEV-20950 Reduce size of record offsets offset_t: this is a type which represents one record offset. It's unsigned short int. a lot of functions: replace ulint with offset_t btr_pcur_restore_position_func(), page_validate(), row_ins_scan_sec_index_for_duplicate(), row_upd_clust_rec_by_insert_inherit_func(), row_vers_impl_x_locked_low(), trx_undo_prev_version_build(): allocate record offsets on the stack instead of waiting for rec_get_offsets() to allocate it from mem_heap_t. So, reducing memory allocations. RECORD_OFFSET, INDEX_OFFSET: now it's less convenient to store pointers in offset_t* array. One pointer occupies now several offset_t. And those constant are start indexes into array to places where to store pointer values REC_OFFS_HEADER_SIZE: adjusted for the new reality REC_OFFS_NORMAL_SIZE: increase size from 100 to 300 which means less heap allocations. And sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) now is 600 bytes which is smaller than previous 800 bytes. REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality rem0rec.h, rem0rec.ic, rem0rec.cc: various arguments, return values and local variables types were changed to fix numerous integer conversions issues. enum field_type_t: offset types concept was introduces which replaces old offset flags stuff. Like in earlier version, 2 upper bits are used to store offset type. And this enum represents those types. REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed get_type(), set_type(), get_value(), combine(): these are convenience functions to work with offsets and it's types rec_offs_base()[0]: still uses an old scheme with flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL rec_offs_base()[i]: these have type offset_t now. Two upper bits contains type.	2019-12-13 00:26:50 +07:00
Marko Mäkelä	bb45941685	MDEV-21205 Assertion failure in btr_sec_min_rec_mark In commit `af5947f433` the function btr_discard_page() is invoking btr_set_min_rec_mark() with the wrong buf_block_t* object. node_ptr is on merge_block, not block. btr_discard_page(): Remove the variables merge_page, page, and always refer to block->frame or merge_block->frame instead. Also, limit the scope of node_ptr and avoid duplicated conditions. btr_set_min_rec_mark(): Add a template parameter, so that the caller can specify whether the page is supposed to have a left sibling. Otherwise, the assertion (which was introduced in the same commit) would fail in btr_discard_page().	2019-12-04 10:51:38 +02:00
Marko Mäkelä	af5947f433	MDEV-21174: Replace mlog_write_string() with mtr_t::memcpy() mtr_t::memcpy(): Replaces mlog_write_string(), mlog_log_string(). The buf_block_t is passed a parameter, so that mlog_write_initial_log_record_low() can be used instead of mlog_write_initial_log_record_fast(). fil_space_crypt_t::write_page0(): Remove the fil_space_t* parameter.	2019-12-03 11:05:19 +02:00
Marko Mäkelä	87839258f8	MDEV-21174: Replace mlog_memset() with mtr_t::memset() Passing buf_block_t helps us avoid calling mlog_write_initial_log_record_fast() and page_get_page_no(), and allows us to implement more debug checks, such as that on ROW_FORMAT=COMPRESSED index pages, only the page header may be modified by MLOG_MEMSET records. fseg_n_reserved_pages(): Add a buf_block_t parameter.	2019-12-03 11:05:19 +02:00
Marko Mäkelä	caea64df18	Cleanup: Remove some page_get_page_no() calls Refer to buf_page_t::id instead of parsing the tablespace identifier or page number from the buffer pool page.	2019-12-03 11:05:19 +02:00
Marko Mäkelä	56f6dab1d0	MDEV-21174: Replace mlog_write_ulint() with mtr_t::write() mtr_t::write(): Replaces mlog_write_ulint(), mlog_write_ull(). Optimize away writes if the page contents does not change, except when a dummy write has been explicitly requested. Because the member function template takes a block descriptor as a parameter, it is possible to introduce better consistency checks. Due to this, the code for handling file-based lists, undo logs and user transactions was refactored to pass around buf_block_t.	2019-12-03 11:05:18 +02:00
Marko Mäkelä	cd92c6c83d	MDEV-12353 preparation: Do not write MLOG_REC_MIN_MARK btr_set_min_rec_mark(): Write MLOG_1BYTE instead of MLOG_REC_MIN_MARK or MLOG_COMP_REC_MIN_MARK. On ROW_FORMAT=COMPRESSED pages, the minimum record flag is not stored at all. The flag is computed for the uncompressed page by page_zip_decompress(). Hence, nothing needs to be logged for ROW_FORMAT=COMPRESSED tables for this operation. To facilitate crash-upgrade and hot backup from older versions, we will retain the code to parse and apply the old log record types MLOG_REC_MIN_MARK and MLOG_COMP_REC_MIN_MARK.	2019-12-03 11:05:18 +02:00
Marko Mäkelä	bf2cc46798	MDEV-21133: Remove buf_frame_copy()	2019-12-03 11:05:18 +02:00
Marko Mäkelä	a6e8a7df82	Cleanup: flst_read_addr(), fil_addr_t fil_addr_t: Use exactly sized data types. flst_read_addr(): Remove the unused parameter mtr. page_offset(): Return uint16_t.	2019-11-28 11:44:40 +02:00
Marko Mäkelä	25e2a556de	MDEV-21133 Optimize access to InnoDB page header fields Introduce memcpy_aligned<N>(), memcmp_aligned<N>(), memset_aligned<N>() and use them for accessing InnoDB page header fields that are known to be aligned. MY_ASSUME_ALIGNED(): Wrapper for the GCC/clang __builtin_assume_aligned(). Nothing similar seems to exist in Microsoft Visual Studio, and the C++20 std::assume_aligned is not available to us yet. Explicitly specified alignment guarantees allow compilers to generate faster code on platforms with strict alignment rules, instead of emitting calls to potentially unaligned memcpy(), memcmp(), or memset().	2019-11-26 10:15:03 +02:00
Marko Mäkelä	ae90f8431b	Merge 10.4 into 10.5	2019-11-14 14:49:20 +02:00
Marko Mäkelä	89ae01fd00	Merge 10.3 into 10.4	2019-11-14 13:23:36 +02:00
Marko Mäkelä	3d4a801533	MDEV-12353 preparation: Replace mtr_x_lock() and friends Apart from page latches (buf_block_t::lock), mini-transactions are keeping track of at most one dict_index_t::lock and fil_space_t::latch at a time, and in a rare case, purge_sys.latch. Let us introduce interfaces for acquiring an index latch or a tablespace latch. In a later version, we may want to introduce mtr_t members for holding a latched dict_index_t* and fil_space_t, and replace the remaining use of mtr_t::m_memo with std::set<buf_block_t> or with a map<buf_block_t,byte> pointing to log records.	2019-11-14 11:40:33 +02:00
Marko Mäkelä	c99470b366	Merge 10.4 into 10.5	2019-11-13 20:38:14 +02:00

1 2 3 4 5

203 commits