mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-16 12:02:42 +01:00

Author	SHA1	Message	Date
Thirunarayanan Balathandayuthapani	be0dfcdb99	MDEV-34200 InnoDB tries to write to read-only system tablespace in buf_dblwr_t::init_or_load_pages() - InnoDB fails to set the TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED flag in transaction system header page while recreating the undo log tablespaces buf_dblwr_t::init_or_load_pages(): Tries to reset the space id and try to write into doublewrite buffer even when read_only mode is enabled. In srv_all_undo_tablespaces_open(), InnoDB should try to open the extra unused undo tablespaces instead of trying to creating it.	2024-05-22 13:16:10 +05:30
Marko Mäkelä	9d20853c74	Merge 10.6 into 10.11	2024-01-18 19:22:23 +02:00
Marko Mäkelä	3a96eba25f	Merge 10.5 into 10.6	2024-01-17 13:35:05 +02:00
Thirunarayanan Balathandayuthapani	caad34df54	MDEV-32968 InnoDB fails to restore tablespace first page from doublewrite buffer when page is empty - InnoDB fails to find the space id from the page0 of the tablespace. In that case, InnoDB can use doublewrite buffer to recover the page0 and write into the file. - buf_dblwr_t::init_or_load_pages(): Loads only the pages which are valid.(page lsn >= checkpoint). To do that, InnoDB has to open the redo log before system tablespace, read the latest checkpoint information. recv_dblwr_t::find_first_page(): 1) Iterate the doublewrite buffer pages and find the 0th page 2) Read the tablespace flags, space id from the 0th page. 3) Read the 1st, 2nd and 3rd page from tablespace file and compare the space id with the space id which is stored in doublewrite buffer. 4) If it matches then we can write into the file. 5) Return space which matches the pages from the file. SysTablespace::read_lsn_and_check_flags(): Remove the retry logic for validating the first page. After restoring the first page from doublewrite buffer, assign tablespace flags by reading the first page. recv_recovery_read_max_checkpoint(): Reads the maximum checkpoint information from log file recv_recovery_from_checkpoint_start(): Avoid reading the checkpoint header information from log file Datafile::validate_first_page(): Throw error in case of first page validation fails.	2024-01-15 14:08:27 +05:30
Oleksandr Byelkin	04d9a46c41	Merge branch '10.6' into 10.10	2023-11-08 16:23:30 +01:00
Marko Mäkelä	39e3ca8bd2	MDEV-31826 InnoDB may fail to recover after being killed in fil_delete_tablespace() InnoDB was violating the write-ahead-logging protocol when a file was being deleted, like this: 1. fil_delete_tablespace() set the fil_space_t::STOPPING flag 2. The buf_flush_page_cleaner() thread discards some changed pages for this tablespace advances the log checkpoint a little. 3. The server process is killed before fil_delete_tablespace() wrote a FILE_DELETE record. 4. Recovery will try to apply log to pages of the tablespace, because there was no FILE_DELETE record. This will fail, because some pages that had been modified since the latest checkpoint had not been written by the page cleaner. Page writes must not be stopped before a FILE_DELETE record has been durably written. fil_space_t::drop(): Replaces fil_space_t::check_pending_operations(). Add the parameter detached_handle, and return a tablespace pointer if this thread was the first one to stop I/O on the tablespace. mtr_t::commit_file(): Remove the parameter detached_handle, and move some handling to fil_space_t::drop(). fil_space_t: STOPPING_READS, STOPPING_WRITES: Separate flags for STOPPING. We want to stop reads (and encryption) before stopping page writes. fil_space_t::is_stopping_writes(), fil_space_t::get_for_write(): Special accessors for the write path. fil_space_t::flush_low(): Ignore the STOPPING_READS flag and only stop if STOPPING_WRITES is set, to avoid an infinite loop in fil_flush_file_spaces(), which was occasionally repeated by running the test encryption.create_or_replace. Reviewed by: Vladislav Lesin Tested by: Matthias Leich	2023-10-26 15:07:59 +03:00
Marko Mäkelä	c92d06748a	Merge 10.6 into 10.10	2023-10-19 14:35:31 +03:00
Marko Mäkelä	6991b1c47c	Merge 10.5 into 10.6	2023-10-19 13:50:00 +03:00
Thirunarayanan Balathandayuthapani	3da5d047b8	MDEV-31851 After crash recovery, undo tablespace fails to open Problem: ======== - InnoDB fails to open undo tablespace when page0 is corrupted and fails to throw error. Solution: ========= - InnoDB throws DB_CORRUPTION error when InnoDB encounters page0 corruption of undo tablespace. - InnoDB restores the page0 of undo tablespace from doublewrite buffer if it encounters page corruption - Moved Datafile::restore_from_doublewrite() to recv_dblwr_t::restore_first_page(). So that undo tablespace and system tablespace can use this function instead of duplicating the code srv_undo_tablespace_open(): Returns 0 if file doesn't exist or ULINT_UNDEFINED if page0 is corrupted.	2023-10-17 18:41:21 +05:30
Marko Mäkelä	d5e15424d8	Merge 10.6 into 10.10 The MDEV-29693 conflict resolution is from Monty, as well as is a bug fix where ANALYZE TABLE wrongly built histograms for single-column PRIMARY KEY. Also includes a fix for safe_malloc error reporting. Other things: - Copied main.log_slow from 10.4 to avoid mtr issue Disabled test: - spider/bugfix.mdev_27239 because we started to get +Error 1429 Unable to connect to foreign data source: localhost -Error 1158 Got an error reading communication packets - main.delayed - Bug#54332 Deadlock with two connections doing LOCK TABLE+INSERT DELAYED This part is disabled for now as it fails randomly with different warnings/errors (no corruption).	2023-10-14 13:36:11 +03:00
Marko Mäkelä	6cc88c3db1	Clean up buf0buf.inl Let us move some #include directives from buf0buf.inl to the compilation units where they are really used.	2023-08-21 15:51:10 +03:00
Marko Mäkelä	2f9e264781	MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress The progress reporting of InnoDB crash recovery was rather intermittent. Nothing was reported during the single-threaded log record parsing, which could consume minutes when parsing a large log. During log application, there only was progress reporting in background threads that would be invoked on data page read completion. The progress reporting here will be detailed like this: InnoDB: Starting crash recovery from checkpoint LSN=503549688 InnoDB: Parsed redo log up to LSN=1990840177; to recover: 124806 pages InnoDB: Parsed redo log up to LSN=2729777071; to recover: 186123 pages InnoDB: Parsed redo log up to LSN=3488599173; to recover: 248397 pages InnoDB: Parsed redo log up to LSN=4177856618; to recover: 306469 pages InnoDB: Multi-batch recovery needed at LSN 4189599815 InnoDB: End of log at LSN=4483551634 InnoDB: To recover: LSN 4189599815/4483551634; 307490 pages InnoDB: To recover: LSN 4189599815/4483551634; 197159 pages InnoDB: To recover: LSN 4189599815/4483551634; 67623 pages InnoDB: Parsed redo log up to LSN=4353924218; to recover: 102083 pages ... InnoDB: log sequence number 4483551634 ... The previous messages "Starting a batch to recover" or "Starting a final batch to recover" will be replaced by "To recover: ... pages" messages. If a batch lasts longer than 15 seconds, then there will be progress reports every 15 seconds, showing the number of remaining pages. For the non-final batch, the "To recover:" message includes two end LSN: that of the batch, and of the recovered log. This is the primary measure of progress. The batch will end once the number of pages to recover reaches 0. If recovery is possible in a single batch, the output will look like this, with a shorter "To recover:" message that counts only the remaining pages: InnoDB: Starting crash recovery from checkpoint LSN=503549688 InnoDB: Parsed redo log up to LSN=1998701027; to recover: 125560 pages InnoDB: Parsed redo log up to LSN=2734136874; to recover: 186446 pages InnoDB: Parsed redo log up to LSN=3499505504; to recover: 249378 pages InnoDB: Parsed redo log up to LSN=4183247844; to recover: 306964 pages InnoDB: End of log at LSN=4483551634 ... InnoDB: To recover: 331797 pages ... InnoDB: log sequence number 4483551634 ... We will also speed up recovery by improving the memory management and implementing multi-threaded recovery of data pages that will not need to be read into the buffer pool ("fake read"). Log application in the "fake read" threads will be protected by an atomic being_recovered field and exclusive buf_page_t::latch. Recovery will reserve for data pages two thirds of the buffer pool, or 256 pages, whichever is smaller. Previously, we could only use at most one third of the buffer pool for buffered log records. This would typically mean that with large buffer pools, recovery unnecessary consisted of multiple batches. If recovery runs out of memory, it will "roll back" or "rewind" the current mini-transaction. The recv_sys.lsn and recv_sys.pages will correspond to the "out of memory LSN", at the end of the previous complete mini-transaction. If recovery runs out of memory while executing the final recovery batch, we can simply invoke recv_sys.apply(false) to make room, and resume parsing. If recovery runs out of memory before the final batch, we will scan the redo log to the end (recv_sys.scanned_lsn) and check for any missing or inconsistent files. If recv_init_crash_recovery_spaces() does not report any potentially missing tablespaces, we can make use of the already stored recv_sys.pages and only rewind to the "out of memory LSN". Else, we must keep parsing and invoking recv_validate_tablespace() until an error has been found or everything has been resolved, and ultimatily rewind to to the checkpoint LSN. recv_sys_t::pages_it: A cached iterator to recv_sys.pages recv_sys_t::parse_mtr(): Remove an ATTRIBUTE_NOINLINE that would prevent tail call optimization in recv_sys_t::parse_pmem(). recv_sys_t::parse(), recv_sys_t::parse_mtr(), recv_sys_t::parse_pmem(): Add template<bool store> parameter. Redo log record parsing (store=false) is better specialized from store=true (with bool if_exists) so that we can avoid some conditional branches in frequently invoked low-level code. recv_sys_t::is_memory_exhausted(): Remove. The special parse() status GOT_OOM will report out-of-memory situation at the low level. recv_sys_t::rewind(), page_recv_t::recs_t::rewind(): Remove all log starting with a specific LSN. recv_scan_log(): Separate some code for only parsing, not storing log. In rewound_lsn, remember the LSN at which last_phase=false recovery ran out of memory. This is where the next call to recv_scan_log() will resume storing the log. This replaces recv_sys.last_stored_lsn. recv_sys_t::parse(): Evaluate the template parameter store in a few more cases, to allow dead code to be eliminated at compile time. recv_sys_t::scanned_lsn: The end of the log found by recv_scan_log(). The special value 1 means that recv_sys has been initialized but no log has been parsed. IORequest::write_complete(), IORequest::read_complete(): Replaces fil_aio_callback(). read_io_callback(), write_io_callback(): Replaces io_callback(). IORequest::fake_read_complete(), fake_io_callback(), os_fake_read(): Process a "fake read" request for concurrent recovery. recv_sys_t::apply_batch(): Choose a number of successive pages for a recovery batch. recv_sys_t::erase(recv_sys_t::map::iterator): Remove log records for a page whose recovery is not in progress. Log application threads will not invoke this; they will only set being_recovered=-1 to indicate that the entry is no longer needed. recv_sys_t::garbage_collect(): Remove all being_recovered=-1 entries. recv_sys_t::wait_for_pool(): Wait for some space to become available in the buffer pool. mlog_init_t::mark_ibuf_exist(): Avoid calls to recv_sys::recover_low() via ibuf_page_exists() and buf_page_get_low(). Such calls would lead to double locking of recv_sys.mutex, which depending on implementation could cause a deadlock. We will use lower-level calls to look up index pages. buf_LRU_block_remove_hashed(): Disable consistency checks for freed ROW_FORMAT=COMPRESSED pages. Their contents could be uninitialized garbage. This fixes an occasional failure of the test innodb.innodb_bulk_create_index_debug. Tested by: Matthias Leich	2023-05-19 15:15:38 +03:00
Marko Mäkelä	c15c8ef3e3	Merge 10.6 into 10.8	2023-04-26 13:58:40 +03:00
Marko Mäkelä	818d5e4814	Merge 10.5 into 10.6	2023-04-25 13:10:33 +03:00
Marko Mäkelä	50f3b7d164	MDEV-31124 Innodb_data_written miscounts doublewrites When commit `a5a2ef079c` implemented asynchronous doublewrite, the writes via the doublewrite buffer started to be counted incorrectly, without multiplying them by innodb_page_size. srv_export_innodb_status(): Correctly count the Innodb_data_written. buf_dblwr_t: Remove submitted(), because it is close to written() and only Innodb_data_written was interested in it. According to its name, it should count completed and not submitted writes. Tested by: Axel Schwenke	2023-04-25 12:17:06 +03:00
Marko Mäkelä	1d1e0ab2cc	Merge 10.6 into 10.8	2023-04-12 15:50:08 +03:00
Marko Mäkelä	a091d6ac4e	MDEV-26827 fixup: Do not duplicate io_slots::pending_io_count() os_aio_pending_reads_approx(), os_aio_pending_reads(): Replaces buf_pool.n_pend_reads. os_aio_pending_writes(): Replaces buf_dblwr.pending_writes(). buf_dblwr_t::write_cond, buf_dblwr_t::writes_pending: Remove.	2023-04-12 13:49:57 +03:00
Marko Mäkelä	dd2fe81122	Merge 10.6 into 10.8	2023-03-29 15:16:42 +03:00
Marko Mäkelä	07460c31e3	MDEV-30900 Crash on macOS due to zero-initialized buf_dblwr.write_cond buf_dblwr_t::init(), buf_dblwr_t::close(): Cover also write_cond, which was added in commit `a55b951e60` without explicit initialization. On GNU/Linux, PTHREAD_COND_INITIALIZER is a zero-initializer. That is why the default zero initialization happened to work on that platform.	2023-03-24 07:59:27 +02:00
Marko Mäkelä	acf46b7b36	Merge 10.6 into 10.8	2023-03-16 18:11:37 +02:00
Marko Mäkelä	a55b951e60	MDEV-26827 Make page flushing even faster For more convenient monitoring of something that could greatly affect the volume of page writes, we add the status variable Innodb_buffer_pool_pages_split that was previously only available via information_schema.innodb_metrics as "innodb_page_splits". This was suggested by Axel Schwenke. buf_flush_page_count: Replaced with buf_pool.stat.n_pages_written. We protect buf_pool.stat (except n_page_gets) with buf_pool.mutex and remove unnecessary export_vars indirection. buf_pool.flush_list_bytes: Moved from buf_pool.stat.flush_list_bytes. Protected by buf_pool.flush_list_mutex. buf_pool_t::page_cleaner_status: Replaces buf_pool_t::n_flush_LRU_, buf_pool_t::n_flush_list_, and buf_pool_t::page_cleaner_is_idle. Protected by buf_pool.flush_list_mutex. We will exclusively broadcast buf_pool.done_flush_list by the buf_flush_page_cleaner thread, and only wait for it when communicating with buf_flush_page_cleaner. There is no need to keep a count of pending writes by the buf_pool.flush_list processing. A single flag suffices for that. Waits for page write completion can be performed by simply waiting on block->page.lock, or by invoking buf_dblwr.wait_for_page_writes(). buf_LRU_block_free_non_file_page(): Broadcast buf_pool.done_free and set buf_pool.try_LRU_scan when freeing a page. This would be executed also as part of buf_page_write_complete(). buf_page_write_complete(): Do not broadcast buf_pool.done_flush_list, and do not acquire buf_pool.mutex unless buf_pool.LRU eviction is needed. Let buf_dblwr count all writes to persistent pages and broadcast a condition variable when no outstanding writes remain. buf_flush_page_cleaner(): Prioritize LRU flushing and eviction right after "furious flushing" (lsn_limit). Simplify the conditions and reduce the hold time of buf_pool.flush_list_mutex. Refuse to shut down or sleep if buf_pool.ran_out(), that is, LRU eviction is needed. buf_pool_t::page_cleaner_wakeup(): Add the optional parameter for_LRU. buf_LRU_get_free_block(): Protect buf_lru_free_blocks_error_printed with buf_pool.mutex. Invoke buf_pool.page_cleaner_wakeup(true) to to ensure that buf_flush_page_cleaner() will process the LRU flush request. buf_do_LRU_batch(), buf_flush_list(), buf_flush_list_space(): Update buf_pool.stat.n_pages_written when submitting writes (while holding buf_pool.mutex), not when completing them. buf_page_t::flush(), buf_flush_discard_page(): Require that the page U-latch be acquired upfront, and remove buf_page_t::ready_for_flush(). buf_pool_t::delete_from_flush_list(): Remove the parameter "bool clear". buf_flush_page(): Count pending page writes via buf_dblwr. buf_flush_try_neighbors(): Take the block of page_id as a parameter. If the tablespace is dropped before our page has been written out, release the page U-latch. buf_pool_invalidate(): Let the caller ensure that there are no outstanding writes. buf_flush_wait_batch_end(false), buf_flush_wait_batch_end_acquiring_mutex(false): Replaced with buf_dblwr.wait_for_page_writes(). buf_flush_wait_LRU_batch_end(): Replaces buf_flush_wait_batch_end(true). buf_flush_list(): Remove some broadcast of buf_pool.done_flush_list. buf_flush_buffer_pool(): Invoke also buf_dblwr.wait_for_page_writes(). buf_pool_t::io_pending(), buf_pool_t::n_flush_list(): Remove. Outstanding writes are reflected by buf_dblwr.pending_writes(). buf_dblwr_t::init(): New function, to initialize the mutex and the condition variables, but not the backing store. buf_dblwr_t::is_created(): Replaces buf_dblwr_t::is_initialised(). buf_dblwr_t::pending_writes(), buf_dblwr_t::writes_pending: Keeps track of writes of persistent data pages. buf_flush_LRU(): Allow calls while LRU flushing may be in progress in another thread. Tested by Matthias Leich (correctness) and Axel Schwenke (performance)	2023-03-16 17:19:58 +02:00
Marko Mäkelä	0751bfbcaf	Merge 10.7 into 10.8	2022-11-30 12:12:07 +02:00
Marko Mäkelä	b7ae4d442a	Merge 10.6 into 10.7	2022-11-30 12:09:01 +02:00
Marko Mäkelä	15ab2e122d	MDEV-30132 Crash after recovery, with InnoDB: Tried to read ... os_file_read(): Merged with os_file_read_no_error_handling(). Crashing on a partial page read is as unhelpful as crashing on a corrupted page read (commit `0b47c126e3`). Report the file name if it is available via IORequest.	2022-11-30 10:54:03 +02:00
Marko Mäkelä	57d4a242da	Merge 10.7 into 10.8	2022-06-06 16:22:09 +03:00
Marko Mäkelä	7e39470e33	Merge 10.6 into 10.7	2022-06-06 14:56:20 +03:00
Marko Mäkelä	0b47c126e3	MDEV-13542: Crashing on corrupted page is unhelpful The approach to handling corruption that was chosen by Oracle in commit `177d8b0c12` is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table. We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table. This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace. We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted. Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them. buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests. btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure. btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages(). btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get(). FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE. dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls. trx_rseg_t::get(mtr_t): Replaces trx_rsegf_get(). trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed. trx_sys_create_sys_pages(): Merged with trx_sysf_create(). dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function. dir_pathname(): Replaces os_file_make_new_pathname(). row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record. btr_decryption_failed(): Report decryption failures dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove. dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted. dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once. dict_table_is_corrupted(): Remove. dict_index_t::is_btree(): Determine if the index is a valid B-tree. BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove. UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger assertion failures, but error codes being returned. buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict(). fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide. btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED. opt_search_plan_for_table(): Simplify the code. row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist. row_undo_mod_upd_exist_sec(): Simplify the code. row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql(). fut_get_ptr(): Replace with buf_page_get_gen() calls. buf_page_get_gen(): Return nullptr and err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data. buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr. fseg_page_is_allocated(): Replaces fseg_page_is_free(). fts_drop_common_tables(): Return an error if the transaction was rolled back. fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already. fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors. Clean up mtr_t::page_lock() buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch. buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch. mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi(). recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery. recv_recover_page(): Return whether the operation succeeded. recv_sys_t::recover_low(): Simplify the logic. Check for recovery error. Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.	2022-06-06 14:03:22 +03:00
Marko Mäkelä	685d958e38	MDEV-14425 Improve the redo log for concurrency The InnoDB redo log used to be formatted in blocks of 512 bytes. The log blocks were encrypted and the checksum was calculated while holding log_sys.mutex, creating a serious scalability bottleneck. We remove the fixed-size redo log block structure altogether and essentially turn every mini-transaction into a log block of its own. This allows encryption and checksum calculations to be performed on local mtr_t::m_log buffers, before acquiring log_sys.mutex. The mutex only protects a memcpy() of the data to the shared log_sys.buf, as well as the padding of the log, in case the to-be-written part of the log would not end in a block boundary of the underlying storage. For now, the "padding" consists of writing a single NUL byte, to allow recovery and mariadb-backup to detect the end of the circular log faster. Like the previous implementation, we will overwrite the last log block over and over again, until it has been completely filled. It would be possible to write only up to the last completed block (if no more recent write was requested), or to write dummy FILE_CHECKPOINT records to fill the incomplete block, by invoking the currently disabled function log_pad(). This would require adjustments to some logic around log checkpoints, page flushing, and shutdown. An upgrade after a crash of any previous version is not supported. Logically empty log files from a previous version will be upgraded. An attempt to start up InnoDB without a valid ib_logfile0 will be refused. Previously, the redo log used to be created automatically if it was missing. Only with with innodb_force_recovery=6, it is possible to start InnoDB in read-only mode even if the log file does not exist. This allows the contents of a possibly corrupted database to be dumped. Because a prepared backup from an earlier version of mariadb-backup will create a 0-sized log file, we will allow an upgrade from such log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system tablespace looks valid. The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced with 64-byte log checkpoint blocks at 0x1000 and 0x2000. The start of log records will move from 0x800 to 0x3000. This allows us to use 4096-byte aligned blocks for all I/O in a future revision. We extend the MDEV-12353 redo log record format as follows. (1) Empty mini-transactions or extra NUL bytes will not be allowed. (2) The end-of-minitransaction marker (a NUL byte) will be replaced with a 1-bit sequence number, which will be toggled each time when the circular log file wraps back to the beginning. (3) After the sequence bit, a CRC-32C checksum of all data (excluding the sequence bit) will written. (4) If the log is encrypted, 8 bytes will be written before the checksum and included in it. This is part of the initialization vector (IV) of encrypted log data. (5) File names, page numbers, and checkpoint information will not be encrypted. Only the payload bytes of page-level log will be encrypted. The tablespace ID and page number will form part of the IV. (6) For padding, arbitrary-length FILE_CHECKPOINT records may be written, with all-zero payload, and with the normal end marker and checksum. The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON. In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup will require a valid log file. When resizing the log, we will create a logically empty ib_logfile101 at the current LSN and use an atomic rename to replace ib_logfile0 with it. See the test innodb.log_file_size. Because there is no mandatory padding in the log file, we are able to create a dummy log file as of an arbitrary log sequence number. See the test mariabackup.huge_lsn. The parameter innodb_log_write_ahead_size and the INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed. The minimum value of innodb_log_buffer_size will be increased to 2MiB (because log_sys.buf will replace recv_sys.buf) and the increment adjusted to 4096 bytes (the maximum log block size). The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed: os_log_fsyncs os_log_pending_fsyncs log_pending_log_flushes log_pending_checkpoint_writes The following status variables will be removed: Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs) Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design) log_sys.get_block_size(): Return the physical block size of the log file. This is only implemented on Linux and Microsoft Windows for now, and for the power-of-2 block sizes between 64 and 4096 bytes (the minimum and maximum size of a checkpoint block). If the block size is anything else, the traditional 512-byte size will be used via normal file system buffering. If the file system buffers can be bypassed, a message like the following will be issued: InnoDB: File system buffers for log disabled (block size=512 bytes) InnoDB: File system buffers for log disabled (block size=4096 bytes) This has been tested on Linux and Microsoft Windows with both sizes. On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC. Tests in 3 different environments where the log is stored in a device with a physical block size of 512 bytes are yielding better throughput without O_DIRECT. This could be due to the fact that in the event the last log block is being overwritten (if multiple transactions would become durable at the same time, and each of will write a small number of bytes to the last log block), it should be faster to re-copy data from log_sys.buf or log_sys.flush_buf to the kernel buffer, to be finally written at fdatasync() time. The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for data files. This option will enable O_DIRECT on the log file on Linux. It may be unsafe to use when the storage device does not support FUA (Force Unit Access) mode. When the server is compiled WITH_PMEM=ON, we will use memory-mapped I/O for the log file if the log resides on a "mount -o dax" device. We will identify PMEM in a start-up message: InnoDB: log sequence number 0 (memory-mapped); transaction id 3 On Linux, we will also invoke mmap() on any ib_logfile0 that resides in /dev/shm, effectively treating the log file as persistent memory. This should speed up "./mtr --mem" and increase the test coverage of PMEM on non-PMEM hardware. It also allows users to estimate how much the performance would be improved by installing persistent memory. On other tmpfs file systems such as /run, we will not use mmap(). mariadb-backup: Eliminated several variables. We will refer directly to recv_sys and log_sys. backup_wait_for_lsn(): Detect non-progress of xtrabackup_copy_logfile(). In this new log format with arbitrary-sized blocks, we can only detect log file overrun indirectly, by observing that the scanned log sequence number is not advancing. xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit, because we are not allowed to modify the server's log file, and our memory mapping is read-only. trx_flush_log_if_needed_low(): Do not use the callback on pmem. Using neither flush_lock nor write_lock around PMEM writes seems to yield the best performance. The pmem_persist() calls may still be somewhat slower than the pwrite() and fdatasync() based interface (PMEM mounted without -o dax). recv_sys_t::buf: Remove. We will use log_sys.buf for parsing. recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE. recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn. recv_sys_t, log_sys_t: Removed many data members. recv_sys.lsn: Renamed from recv_sys.recovered_lsn. recv_sys.offset: Renamed from recv_sys.recovered_offset. log_sys.buf_size: Replaces srv_log_buffer_size. recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset] when the buffer is being allocated from the memory heap. recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is backed by ib_logfile0. The pointer will wrap from recv_sys.len (log_sys.file_size) to log_sys.START_OFFSET. For the record that wraps around, we may copy file name or record payload data to the auxiliary buffer decrypt_buf in order to have a contiguous block of memory. The maximum size of a record is less than innodb_page_size bytes. recv_sys_t::parse(): Take the smart pointer as a template parameter. Do not temporarily add a trailing NUL byte to FILE_ records, because we are not supposed to modify the memory-mapped log file. (It is attached in read-write mode already during recovery.) recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse(). recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be returned on PMEM, use recv_ring to wrap around the buffer to the start. mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free on PMEM, because it has no meaning on the mmap-based log. log_sys.write_to_buf: Count writes to log_sys.buf. Replaces srv_stats.log_write_requests and export_vars.innodb_log_write_requests. Protected by log_sys.mutex. Updated consistently in log_close(). Previously, mtr_t::commit() conditionally updated the count, which was inconsistent. log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf, for writing to log_sys.log (the ib_logfile0). Replaces srv_stats.log_writes and export_vars.innodb_log_writes. Protected by log_sys.mutex. log_sys.waits: Count waits in append_prepare(). Replaces srv_stats.log_waits and export_vars.innodb_log_waits. recv_recover_page(): Do not unnecessarily acquire log_sys.flush_order_mutex. We are inserting the blocks in arbitary order anyway, to be adjusted in recv_sys.apply(true). We will change the definition of flush_lock and write_lock to avoid potential false sharing. Depending on sizeof(log_sys) and CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could share a cache line with each other or with the last data members of log_sys. Thanks to Matthias Leich for providing https://rr-project.org traces for various failures during the development, and to Thirunarayanan Balathandayuthapani for his help in debugging some of the recovery code. And thanks to the developers of the rr debugger for a tool without which extensive changes to InnoDB would be very challenging to get right. Thanks to Vladislav Vaintroub for useful feedback and to him, Axel Schwenke and Krunal Bauskar for testing the performance.	2022-01-21 16:03:47 +02:00
Marko Mäkelä	7dfaded962	Merge 10.6 into 10.7	2022-01-04 09:55:58 +02:00
Marko Mäkelä	3f5726768f	Merge 10.5 into 10.6	2022-01-04 09:26:38 +02:00
Marko Mäkelä	4c3ad24413	MDEV-27416 InnoDB hang in buf_flush_wait_flushed(), on log checkpoint InnoDB could sometimes hang when triggering a log checkpoint. This is due to commit `7b1252c03d` (MDEV-24278), which introduced an untimed wait to buf_flush_page_cleaner(). The hang was noticed by occasional failures of IMPORT TABLESPACE tests, such as innodb.innodb-wl5522, which would (unnecessarily) invoke log_make_checkpoint() from row_import_cleanup(). The reason of the hang was that buf_flush_page_cleaner() would enter untimed sleep despite buf_flush_sync_lsn being set. The exact failure scenario is unclear, because buf_flush_sync_lsn should actually be protected by buf_pool.flush_list_mutex. We prevent the hang by invoking buf_pool.page_cleaner_set_idle(false) whenever we are setting buf_flush_sync_lsn and signaling buf_pool.do_flush_list. The bulk of these changes was originally developed as a preparation for MDEV-26827, to invoke buf_flush_list() from fewer threads, and tested on 10.6 by Matthias Leich. This fix was tested by running 100 repetitions of 100 concurrent instances of the test innodb.innodb-wl5522 on a RelWithDebInfo build, using ext4fs and innodb_flush_method=O_DIRECT on a SATA SSD with 4096-byte block size. During the test, the call to log_make_checkpoint() in row_import_cleanup() was present. buf_flush_list(): Make static. buf_flush_wait(): Wait for buf_pool.get_oldest_modification() to reach a target, by work done in the buf_flush_page_cleaner. If buf_flush_sync_lsn is going to be set, we will invoke buf_pool.page_cleaner_set_idle(false). buf_flush_ahead(): If buf_flush_sync_lsn or buf_flush_async_lsn is going to be set and the page cleaner woken up, we will invoke buf_pool.page_cleaner_set_idle(false). buf_flush_wait_flushed(): Invoke buf_flush_wait(). buf_flush_sync(): Invoke recv_sys.apply() at the start in case crash recovery is active. Invoke buf_flush_wait(). buf_flush_sync_batch(): A lower-level variant of buf_flush_sync() that is only called by recv_sys_t::apply(). buf_flush_sync_for_checkpoint(): Do not trigger log apply or checkpoint during recovery. buf_dblwr_t::create(): Only initiate a buffer pool flush, not a checkpoint. row_import_cleanup(): Do not unnecessarily invoke log_make_checkpoint(). Invoking buf_flush_list_space() before starting to generate redo log for the imported tablespace should suffice. srv_prepare_to_delete_redo_log_file(): Set recv_sys.recovery_on in order to prevent buf_flush_sync_for_checkpoint() from initiating a checkpoint while the log is inaccessible. Remove a wait loop that is already part of buf_flush_sync(). Do not invoke fil_names_clear() if the log is being upgraded, because the FILE_MODIFY record is specific to the latest format. create_log_file(): Clear recv_sys.recovery_on only after calling log_make_checkpoint(), to prevent buf_flush_page_cleaner from invoking a checkpoint. innodb_shutdown(): Simplify the logic in mariadb-backup --prepare. os_aio_wait_until_no_pending_writes(): Update the function comment. Apart from row_quiesce_table_start() during FLUSH TABLES...FOR EXPORT, this is being called by buf_flush_list_space(), which is invoked by ALTER TABLE...IMPORT TABLESPACE as well as some encryption operations.	2022-01-04 07:40:31 +02:00
Marko Mäkelä	7e8a13d9d7	Merge 10.6 into 10.7	2021-11-19 17:45:52 +02:00
Marko Mäkelä	aaef2e1d8c	MDEV-27058: Reduce the size of buf_block_t and buf_page_t buf_page_t::frame: Moved from buf_block_t::frame. All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED pages will have frame=nullptr, while all 'fat' buf_block_t will have a non-null frame pointing to aligned innodb_page_size bytes. This eliminates the need for separate states for BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE. buf_page_t:🔒 Moved from buf_block_t::lock. That is, all block descriptors will have a page latch. The IO_PIN state that was used for discarding or creating the uncompressed page frame of a ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix and page X-latch. page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status of buf_page_t with a single std::atomic<uint32_t>. All modifications will use store(), fetch_add(), fetch_sub(). This space was previously wasted to alignment on 64-bit systems. We will use the following encoding that combines a state (partly read-fix or write-fix) and a buffer-fix count: buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED) buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY) buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH) buf_page_t::FREED=3 + fix: pages marked as freed in the file buf_page_t::UNFIXED=1U<<29 + fix: normal pages buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite) buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched) buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched) buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite) buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch. buf_page_t::read_complete(): Renamed from buf_page_read_complete(). Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch. buf_page_t::can_relocate(): If the page latch is being held or waited for, or the block is buffer-fixed or io-fixed, return false. (The condition on the page latch is new.) Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we will acquire the page latch before fix(), and unfix() before unlocking. buf_page_t::flush(): Replaces buf_flush_page(). Optimize the handling of FREED pages. buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held by the caller. buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates. buf_page_get_low(): Ignore guesses that are read-fixed because they may not yet be registered in buf_pool.page_hash and buf_pool.LRU. buf_page_optimistic_get(): Acquire latch before buffer-fixing. buf_page_make_young(): Leave read-fixed blocks alone, because they might not be registered in buf_pool.LRU yet. recv_sys_t::recover_deferred(), recv_sys_t::recover_low(): Possibly fix MDEV-26326, by holding a page X-latch instead of only buffer-fixing the page.	2021-11-18 17:47:19 +02:00
Marko Mäkelä	db915f7387	MDEV-27058: Move buf_page_t::slot to IORequest::slot MDEV-23855 and MDEV-23399 already moved some transient data fields from buffer pool page descriptors to IORequest, but the write buffer of PAGE_COMPRESSED or ENCRYPTED tables was missed. Since is only needed during asynchronous page write requests, it belongs to IORequest.	2021-11-18 17:44:33 +02:00
Marko Mäkelä	ca501ffb04	MDEV-26195: Use a 32-bit data type for some tablespace fields In the InnoDB data files, we allocate 32 bits for tablespace identifiers and page numbers as well as tablespace flags. But, in main memory data structures we allocate 32 or 64 bits, depending on the register width of the processor. Let us always use 32-bit fields to eliminate a mismatch and reduce the memory footprint on 64-bit systems.	2021-07-22 11:22:47 +03:00
Vladislav Vaintroub	e7f4daf88c	merge 10.5 to 10.6	2021-07-16 22:12:09 +02:00
Vladislav Vaintroub	fc2ec25733	MDEV-26166 replace log_write_up_to(LSN_MAX,...) with log_buffer_flush_to_disk() Also, remove comparison lsn > flush/write lsn, prior to calling log_write_up_to. The checks and early returns are part of this function.	2021-07-16 18:44:58 +02:00
Marko Mäkelä	30edd5549d	MDEV-26029: Sparse files are inefficient on thinly provisioned storage The MariaDB implementation of page_compressed tables for InnoDB used sparse files. In the worst case, in the data file, every data page will consist of some data followed by a hole. This may be extremely inefficient in some file systems. If the underlying storage device is thinly provisioned (can compress data on the fly), it would be good to write regular files (with sequences of NUL bytes at the end of each page_compressed block) and let the storage device take care of compressing the data. For reads, sparse file regions and regions containing NUL bytes will be indistinguishable. my_test_if_disable_punch_hole(): A new predicate for detecting thinly provisioned storage. (Not implemented yet.) innodb_atomic_writes: Correct the comment. buf_flush_page(): Support all values of fil_node_t::punch_hole. On a thinly provisioned storage device, we will always write NUL-padded innodb_page_size bytes also for page_compressed tables. buf_flush_freed_pages(): Remove a redundant condition. fil_space_t::atomic_write_supported: Remove. (This was duplicating fil_node_t::atomic_write.) fil_space_t::punch_hole: Remove. (Duplicated fil_node_t::punch_hole.) fil_node_t: Remove magic_n, and consolidate flags into bitfields. For punch_hole we introduce a third value that indicates a thinly provisioned storage device. fil_node_t::find_metadata(): Detect all attributes of the file.	2021-06-29 15:18:22 +03:00
Marko Mäkelä	a8350cfb5e	Merge 10.5 into 10.6	2021-06-24 21:56:44 +03:00
Marko Mäkelä	e329dc8d86	MDEV-25948 fixup: Demote a warning to a note buf_dblwr_t::recover(): Issue a note, not a warning, about pages whose FIL_PAGE_LSN is in the future. This was supposed to be part of commit `762bcb81b5` (MDEV-25948) but had been accidentally omitted.	2021-06-24 18:51:05 +03:00
Marko Mäkelä	101da87228	Merge 10.5 into 10.6	2021-06-23 19:36:45 +03:00
Marko Mäkelä	762bcb81b5	MDEV-25948 Remove log_flush_task Vladislav Vaintroub suggested that invoking log_flush_up_to() for every page could perform better than invoking a log write between buf_pool.flush_list batches, like we started doing in commit `3a9a3be1c6` (MDEV-23855). This could depend on the sequence in which pages are being modified. The buf_pool.flush_list is ordered by oldest_modification, while the FIL_PAGE_LSN of the pages is theoretically independent of that. In the pathological case, we will wait for a log write before writing each individual page. It turns out that we can defer the call to log_flush_up_to() until just before submitting the page write. If the doublewrite buffer is being used, we can submit a write batch of "future" pages to the doublewrite buffer, and only wait for the log write right before we are writing an already doublewritten page. The next doublewrite batch will not be initiated before the last page write from the current batch has completed. When a future version introduces asynchronous writes if the log, we could initiate a write at the start of a flushing batch, to reduce waiting further.	2021-06-23 19:06:52 +03:00
Marko Mäkelä	6dfd44c828	MDEV-25954: Trim os_aio_wait_until_no_pending_writes() It turns out that we had some unnecessary waits for no outstanding write requests to exist. They were basically working around a bug that was fixed in MDEV-25953. On write completion callback, blocks will be marked clean. So, it is sufficient to consult buf_pool.flush_list to determine which writes have not been completed yet. On FLUSH TABLES...FOR EXPORT we must still wait for all pending asynchronous writes to complete, because buf_flush_file_space() would merely guarantee that writes will have been initiated.	2021-06-23 19:06:49 +03:00
Marko Mäkelä	cf552f5886	MDEV-25312 Replace fil_space_t::name with fil_space_t::name() A consistency check for fil_space_t::name is causing recovery failures in MDEV-25180 (Atomic ALTER TABLE). So, we'd better remove that field altogether. fil_space_t::name was more or less a copy of dict_table_t::name (except for some special cases), and it was not being used for anything useful. There used to be a name_hash, but it had been removed already in commit `a75dbfd718` (MDEV-12266). We will also remove os_normalize_path(), OS_PATH_SEPARATOR, OS_PATH_SEPATOR_ALT. On Microsoft Windows, we will treat \ and / roughly in the same way. The intention is that for per-table tablespaces, the filenames will always follow the pattern prefix/databasename/tablename.ibd. (Any \ in the prefix must not be converted.) ut_basename_noext(): Remove (unused function). read_link_file(): Replaces RemoteDatafile::read_link_file(). We will ensure that the last two path component separators are forward slashes (converting up to 2 trailing backslashes on Microsoft Windows), so that everywhere else we can assume that data file names end in "/databasename/tablename.ibd". Note: On Microsoft Windows, path names that start with \\?\ must not contain / as path component separators. Previously, such paths did work in the DATA DIRECTORY argument of InnoDB tables. Reviewed by: Vladislav Vaintroub	2021-04-07 18:01:13 +03:00
Marko Mäkelä	7a4fbb55b0	MDEV-25105 Remove innodb_checksum_algorithm values none,innodb,... Historically, InnoDB supported a buggy page checksum algorithm that did not compute a checksum over the full page. Later, well before MySQL 4.1 introduced .ibd files and the innodb_file_per_table option, the algorithm was corrected and the first 4 bytes of each page were redefined to be a checksum. The original checksum was so slow that an option to disable page checksum was introduced for benchmarketing purposes. The Intel Nehalem microarchitecture introduced the SSE4.2 instruction set extension, which includes instructions for faster computation of CRC-32C. In MySQL 5.6 (and MariaDB 10.0), innodb_checksum_algorithm=crc32 was implemented to make of that. As that option was changed to be the default in MySQL 5.7, a bug was found on big-endian platforms and some work-around code was added to weaken that checksum further. MariaDB disables that work-around by default since MDEV-17958. Later, SIMD-accelerated CRC-32C has been implemented in MariaDB for POWER and ARM and also for IA-32/AMD64, making use of carry-less multiplication where available. Long story short, innodb_checksum_algorithm=crc32 is faster and more secure than the pre-MySQL 5.6 checksum, called innodb_checksum_algorithm=innodb. It should have removed any need to use innodb_checksum_algorithm=none. The setting innodb_checksum_algorithm=crc32 is the default in MySQL 5.7 and MariaDB Server 10.2, 10.3, 10.4. In MariaDB 10.5, MDEV-19534 made innodb_checksum_algorithm=full_crc32 the default. It is even faster and more secure. The default settings in MariaDB do allow old data files to be read, no matter if a worse checksum algorithm had been used. (Unfortunately, before innodb_checksum_algorithm=full_crc32, the data files did not identify which checksum algorithm is being used.) The non-default settings innodb_checksum_algorithm=strict_crc32 or innodb_checksum_algorithm=strict_full_crc32 would only allow CRC-32C checksums. The incompatibility with old data files is why they are not the default. The newest server not to support innodb_checksum_algorithm=crc32 were MySQL 5.5 and MariaDB 5.5. Both have reached their end of life. A valid reason for using innodb_checksum_algorithm=innodb could have been the ability to downgrade. If it is really needed, data files can be converted with an older version of the innochecksum utility. Because there is no good reason to allow data files to be written with insecure checksums, we will reject those option values: innodb_checksum_algorithm=none innodb_checksum_algorithm=innodb innodb_checksum_algorithm=strict_none innodb_checksum_algorithm=strict_innodb Furthermore, the following innochecksum options will be removed, because only strict crc32 will be supported: innochecksum --strict-check=crc32 innochecksum -C crc32 innochecksum --write=crc32 innochecksum -w crc32 If a user wishes to convert a data file to use a different checksum (so that it might be used with the no-longer-supported MySQL 5.5 or MariaDB 5.5, which do not support IMPORT TABLESPACE nor system tablespace format changes that were made in MariaDB 10.3), then the innochecksum tool from MariaDB 10.2, 10.3, 10.4, 10.5 or MySQL 5.7 can be used. Reviewed by: Thirunarayanan Balathandayuthapani	2021-03-11 12:46:18 +02:00
Marko Mäkelä	520c76bfb4	Merge 10.5 into 10.6	2021-02-07 12:32:33 +02:00
Marko Mäkelä	4f4a4cf9eb	MDEV-23399 fixup: Use plain pthread_cond The condition variables that were introduced in commit `7cffb5f6e8` (MDEV-23399) are never instrumented with PERFORMANCE_SCHEMA. Let us avoid the storage overhead and dead code.	2021-02-07 12:19:24 +02:00
Marko Mäkelä	ff5d306e29	MDEV-21452: Replace ib_mutex_t with mysql_mutex_t SHOW ENGINE INNODB MUTEX functionality is completely removed, as are the InnoDB latching order checks. We will enforce innodb_fatal_semaphore_wait_threshold only for dict_sys.mutex and lock_sys.mutex. dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex. lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex. FIXME: srv_sys should be removed altogether; it is duplicating tpool functionality. fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must not hold fil_system.mutex. fil_close_all_files(): To prevent SAFE_MUTEX warnings for fil_space_destroy_crypt_data(), we must not hold fil_system.mutex while invoking fil_space_free_low() on a detached tablespace.	2020-12-15 17:56:18 +02:00
Marko Mäkelä	9ecd766526	Merge 10.5 into 10.6	2020-12-14 18:02:40 +02:00
Marko Mäkelä	8677c14e65	MDEV-24391 heap-use-after-free in fil_space_t::flush_low() We observed a race condition that involved two threads executing fil_flush_file_spaces() and one thread executing fil_delete_tablespace(). After one of the fil_flush_file_spaces() observed that space.needs_flush_not_stopping() is set and was releasing the fil_system.mutex, the other fil_flush_file_spaces() would complete the execution of fil_space_t::flush_low() on the same tablespace. Then, fil_delete_tablespace() would destroy the object, because the value of fil_space_t::n_pending did not prevent that. Finally, the fil_flush_file_spaces() would resume execution and invoke fil_space_t::flush_low() on the freed object. This race condition was introduced in commit `118e258aaa` of MDEV-23855. fil_space_t::flush(): Add a template parameter that indicates whether the caller is holding a reference to prevent the tablespace from being freed. buf_dblwr_t::flush_buffered_writes_completed(), row_quiesce_table_start(): Acquire a reference for the duration of the fil_space_t::flush_low() operation. It should be impossible for the object to be freed in these code paths, but we want to satisfy the debug assertions. fil_space_t::flush_low(): Do not increment or decrement the reference count, but instead assert that the caller is holding a reference. fil_space_extend_must_retry(), fil_flush_file_spaces(): Acquire a reference before releasing fil_system.mutex. This is what will fix the race condition.	2020-12-11 09:05:26 +02:00

1 2 3 4 5

201 commits