mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-04-10 01:05:48 +02:00

Author	SHA1	Message	Date
Marko Mäkelä	990b010b09	MDEV-35438 Annotate InnoDB I/O functions with noexcept Most InnoDB functions do not throw any exceptions, not even indirectly std::bad_alloc, which could be thrown by a C++ memory allocation function. Let us annotate many functions with noexcept in order to reduce the code footprint related to exception handling. Reviewed by: Thirunarayanan Balathandayuthapani	2025-01-09 07:43:24 +02:00
Marko Mäkelä	69e20cab28	Merge 10.5 into 10.6	2024-12-11 14:46:43 +02:00
Daniel Black	807e4f320f	Change my_umask{,_dir} to mode_t and remove os_innodb_umask os_innodb_umask was of the incorrect type resulting in warnings in clang-19. The correct type is mode_t. As os_innodb_umask was set during innnodb_init from my_umask, corrected the type there along with its companion my_umask_dir. Because of this, the defaults mask values in innodb never had an effect. The resulting change allow found signed differences in my_create{,_nosymlink}, open_nosymlinks: mysys/my_create.c:47:20: error: operand of ?: changes signedness from ‘int’ to ‘mode_t’ {aka ‘unsigned int’} due to unsignedness of other operand [-Werror=sign-compare] 47 \| CreateFlags ? CreateFlags : my_umask); Ref: clang-19 warnings: [55/123] Building CXX object storage/innobase/CMakeFiles/innobase.dir/os/os0file.cc.o storage/innobase/os/os0file.cc:1075:46: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32] 1075 \| file = open(name, create_flag \| O_CLOEXEC, os_innodb_umask); \| ~~~~ ^~~~~~~~~~~~~~~ storage/innobase/os/os0file.cc:1249:46: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32] 1249 \| file = open(name, create_flag \| O_CLOEXEC, os_innodb_umask); \| ~~~~ ^~~~~~~~~~~~~~~ storage/innobase/os/os0file.cc:1381:45: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32] 1381 \| file = open(name, create_flag \| O_CLOEXEC, os_innodb_umask); \| ~~~~ ^~~~~~~~~~~~~~~	2024-12-11 17:21:01 +11:00
Oleksandr Byelkin	f00711bba2	Merge branch '10.5' into 10.6	2024-10-29 14:20:03 +01:00
Marko Mäkelä	decdd4bf49	MDEV-29015/MDEV-29260/MDEV-34938: os_file_get_size() WSL work-around When MariaDB Server is run in a container under Windows Subsystem for Linux, the fstat(2) system calls that InnoDB invokes in os_file_set_size() or os_file_get_size() are causing a failure in case the file had been renamed in the past while the file handle was open. This affects at least ALTER TABLE and OPTIMIZE TABLE. os_file_get_size(): Invoke lseek(2) instead of fstat(2). We do not mind if the file pointer is moving to the end of the file, because InnoDB exclusively invokes positioned reads and writes, or in some rare cases, appends to an existing file. os_file_set_size(): Invoke os_file_get_size() instead of fstat(2). Define the POSIX and Windows versions separately. Formerly, the Windows version was called os_file_change_size_win32(). fil_node_t::read_page0(): Use os_file_get_size() to determine the size, and do not crash on error. fil_node_t::read_metadata(): Remove the non-Windows stat* parameter and always invoke fstat(2) outside Windows, but do tolerate errors. Because fstat(2) is more likely to fail than lseek(2), and this is not time critical code, we can afford the extra lseek(2) system call. Reviewed by: Vladislav Vaintroub	2024-10-24 16:08:56 +03:00
Vladislav Vaintroub	e8db5c8760	MDEV-35171 OS_FILE_NORMAL and OS_FILE_AIO are misleading Removed 'purpose' parameter from os_file_create() and related functions. Always use FILE_FLAG_OVERLAPPED when opening Windows files. No performance regression was measured, nor there is any measurable improvement.	2024-10-21 15:31:32 +02:00
Marko Mäkelä	fa8a46eb68	MDEV-33613 InnoDB may still hang when temporarily running out of buffer pool By design, InnoDB has always hung when permanently running out of buffer pool, for example when several threads are waiting to allocate a block, and all of the buffer pool is buffer-fixed by the active threads. The hang that we are fixing here occurs when the buffer pool is only temporarily running out and the situation could be rescued by writing out some dirty pages or evicting some clean pages. buf_LRU_get_free_block(): Simplify the way how we wait for the buf_flush_page_cleaner thread. This fixes occasional hangs of the test encryption.innochecksum that were introduced by commit `a55b951e60` (MDEV-26827). To play it safe, we use a timed wait when waiting for the buf_flush_page_cleaner() thread to perform its job. Should that thread get stuck, we will invoke buf_pool.LRU_warn() in order to display a message that pages could not be freed, and keep trying to wake up the buf_flush_page_cleaner() thread. The INFORMATION_SCHEMA.INNODB_METRICS counters buffer_LRU_single_flush_failure_count and buffer_LRU_get_free_waits will be removed. The latter is represented by buffer_pool_wait_free. Also removed will be the message "InnoDB: Difficult to find free blocks in the buffer pool" because in `d34479dc66` we introduced a more precise message "InnoDB: Could not free any blocks in the buffer pool" in the buf_flush_page_cleaner thread. buf_pool_t::LRU_warn(): Issue the warning message that we could not free any blocks in the buffer pool. This may also be invoked by buf_LRU_get_free_block() if buf_flush_page_cleaner() appears to be stuck. buf_pool_t::n_flush_dec(): Remove. buf_pool_t::n_flush_dec_holding_mutex(): Rename to n_flush_dec(). buf_flush_LRU_list_batch(): Increment the eviction counter for blocks of temporary, discarded or dropped tablespaces. buf_flush_LRU(): Make static, and remove the constant parameter evict=false. The only caller will be the buf_flush_page_cleaner() thread. IORequest::is_LRU(): Remove. The only case of evicting pages on write completion will be when we are writing out pages of the temporary tablespace. Those pages are not in buf_pool.flush_list, only in buf_pool.LRU. buf_page_t::flush(): Remove the parameter evict. buf_page_t::write_complete(): Change the parameter "bool temporary" to "bool persistent" and add a parameter for an already read state(). Reviewed by: Debarun Banerjee	2024-03-22 14:17:39 +02:00
Marko Mäkelä	a6290a5bc5	MDEV-33095 innodb_flush_method=O_DIRECT creates excessive errors on Solaris The directio(3C) function on Solaris is supported on NFS and UFS while the majority of users should be on ZFS, which is a copy-on-write file system that implements transparent compression and therefore cannot support unbuffered I/O. Let us remove the call to directio() and simply treat innodb_flush_method=O_DIRECT in the same way as the previous default value innodb_flush_method=fsync on Solaris. Also, let us remove some dead code around calls to os_file_set_nocache() on platforms where fcntl(2) is not usable with O_DIRECT. On IBM AIX, O_DIRECT is not documented for fcntl(2), only for open(2).	2024-01-19 15:34:33 +11:00
Marko Mäkelä	72928e640e	MDEV-27593: Crashing on I/O error is unhelpful buf_page_t::write_complete(), buf_page_write_complete(), IORequest::write_complete(): Add a parameter for passing an error code. If an error occurred, we will release the io-fix, buffer-fix and page latch but not reset the oldest_modification field. The block would remain in buf_pool.LRU and possibly buf_pool.flush_list, to be written again later, by buf_flush_page_cleaner(). If all page writes start consistently failing, all write threads should eventually hang in log_free_check() because the log checkpoint cannot be advanced to make room in the circular write-ahead-log ib_logfile0. IORequest::read_complete(): Add a parameter for passing an error code. If a read operation fails, we report the error and discard the page, just like we would do if the page checksum was not validated or the page could not be decrypted. This only affects asynchronous reads, due to linear or random read-ahead or crash recovery. When buf_page_get_low() invokes buf_read_page(), that will be a synchronous read, not involving this code. This was tested by randomly injecting errors in write_io_callback() and read_io_callback(), like this: if (!ut_rnd_interval(100)) cb->m_err= 42;	2023-08-01 14:39:29 +03:00
Marko Mäkelä	f2c17cc9d9	MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress This is a 10.6 port of commit `2f9e264781` from MariaDB Server 10.9 that is missing some optimization due to a more complex redo log format and recovery logic (which was simplified in commit `685d958e38`). The progress reporting of InnoDB crash recovery was rather intermittent. Nothing was reported during the single-threaded log record parsing, which could consume minutes when parsing a large log. During log application, there only was progress reporting in background threads that would be invoked on data page read completion. The progress reporting here will be detailed like this: InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799 InnoDB: Read redo log up to LSN=1963895808 InnoDB: Multi-batch recovery needed at LSN 2534560930 InnoDB: Read redo log up to LSN=3312233472 InnoDB: Read redo log up to LSN=1599646720 InnoDB: Read redo log up to LSN=2160831488 InnoDB: To recover: LSN 2806789376/2806819840; 195082 pages InnoDB: To recover: LSN 2806789376/2806819840; 63507 pages InnoDB: Read redo log up to LSN=3195776000 InnoDB: Read redo log up to LSN=3687099392 InnoDB: Read redo log up to LSN=4165315584 InnoDB: To recover: LSN 4374395699/4374440960; 241454 pages InnoDB: To recover: LSN 4374395699/4374440960; 123701 pages InnoDB: Read redo log up to LSN=4508724224 InnoDB: Read redo log up to LSN=5094550528 InnoDB: To recover: 205230 pages The previous messages "Starting a batch to recover" or "Starting a final batch to recover" will be replaced by "To recover: ... pages" messages. If a batch lasts longer than 15 seconds, then there will be progress reports every 15 seconds, showing the number of remaining pages. For the non-final batch, the "To recover:" message includes two end LSN: that of the batch, and of the recovered log. This is the primary measure of progress. The batch will end once the number of pages to recover reaches 0. If recovery is possible in a single batch, the output will look like this, with a shorter "To recover:" message that counts only the remaining pages: InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799 InnoDB: Read redo log up to LSN=1984539648 InnoDB: Read redo log up to LSN=2710875136 InnoDB: Read redo log up to LSN=3358895104 InnoDB: Read redo log up to LSN=3965299712 InnoDB: Read redo log up to LSN=4557417472 InnoDB: Read redo log up to LSN=5219527680 InnoDB: To recover: 450915 pages We will also speed up recovery by improving the memory management and implementing multi-threaded recovery of data pages that will not need to be read into the buffer pool ("fake read"). Log application in the "fake read" threads will be protected by an atomic being_recovered field and exclusive buf_page_t::lock. Recovery will reserve for data pages two thirds of the buffer pool, or 256 pages, whichever is smaller. Previously, we could only use at most one third of the buffer pool for buffered log records. This would typically mean that with large buffer pools, recovery unnecessary consisted of multiple batches. If recovery runs out of memory, it will "roll back" or "rewind" the current mini-transaction. The recv_sys.recovered_lsn and recv_sys.pages will correspond to the "out of memory LSN", at the end of the previous complete mini-transaction. If recovery runs out of memory while executing the final recovery batch, we can simply invoke recv_sys.apply(false) to make room, and resume parsing. If recovery runs out of memory before the final batch, we will scan the redo log to the end and check for any missing or inconsistent files. In this version of the patch, we will throw away any previously buffered recv_sys.pages and rescan the log from the checkpoint onwards. recv_sys_t::pages_it: A cached iterator to recv_sys.pages. recv_sys_t::is_memory_exhausted(): Remove. We will have out-of-memory handling deep inside recv_sys_t::parse(). recv_sys_t::rewind(), page_recv_t::recs_t::rewind(): Remove all log starting with a specific LSN. IORequest::write_complete(), IORequest::read_complete(): Replaces fil_aio_callback(). read_io_callback(), write_io_callback(): Replaces io_callback(). IORequest::fake_read_complete(), fake_io_callback(), os_fake_read(): Process a "fake read" request for concurrent recovery. recv_sys_t::apply_batch(): Choose a number of successive pages for a recovery batch. recv_sys_t::erase(recv_sys_t::map::iterator): Remove log records for a page whose recovery is not in progress. Log application threads will not invoke this; they will only set being_recovered=-1 to indicate that the entry is no longer needed. recv_sys_t::garbage_collect(): Remove all being_recovered=-1 entries. recv_sys_t::wait_for_pool(): Wait for some space to become available in the buffer pool. mlog_init_t::mark_ibuf_exist(): Avoid calls to recv_sys::recover_low() via ibuf_page_exists() and buf_page_get_low(). Such calls would lead to double locking of recv_sys.mutex, which depending on implementation could cause a deadlock. We will use lower-level calls to look up index pages. buf_LRU_block_remove_hashed(): Disable consistency checks for freed ROW_FORMAT=COMPRESSED pages. Their contents could be uninitialized garbage. This fixes an occasional failure of the test innodb.innodb_bulk_create_index_debug. Tested by: Matthias Leich	2023-05-19 15:20:07 +03:00
Oleksandr Byelkin	1d74927c58	Merge branch '10.4' into 10.5	2023-04-24 12:43:47 +02:00
Marko Mäkelä	0976afec88	MDEV-31114 Assertion !...is_waiting() failed in os_aio_wait_until_no_pending_writes() os_aio_wait_until_no_pending_reads(), os_aio_wait_until_pending_writes(): Add a Boolean parameter to indicate whether the wait should be declared in the thread pool. buf_flush_wait(): The callers have already declared a wait, so let us avoid doing that again, just call os_aio_wait_until_pending_writes(false). buf_flush_wait_flushed(): Do not declare a wait in the rare case that the buf_flush_page_cleaner thread has been shut down already. buf_flush_page_cleaner(), buf_flush_buffer_pool(): In the code that runs during shutdown, do not declare waits. buf_flush_buffer_pool(): Remove a debug assertion that might fail. What really matters here is buf_pool.flush_list.count==0. buf_read_recv_pages(), srv_prepare_to_delete_redo_log_file(): Do not declare waits during InnoDB startup.	2023-04-24 09:57:58 +03:00
Alexander Barkov	9f98a2acd7	MDEV-30968 mariadb-backup does not copy Aria logs if aria_log_dir_path is used - `mariadb-backup --backup` was fixed to fetch the value of the @@aria_log_dir_path server variable and copy aria_log* files from @@aria_log_dir_path directory to the backup directory. Absolute and relative (to --datadir) paths are supported. Before this change aria_log* files were copied to the backup only if they were in the default location in @@datadir. - `mariadb-backup --copy-back` now understands a new my.cnf and command line parameter --aria-log-dir-path. `mariadb-backup --copy-back` in the main loop in copy_back() (when copying back from the backup directory to --datadir) was fixed to ignore all aria_log* files. A new function copy_back_aria_logs() was added. It consists of a separate loop copying back aria_log* files from the backup directory to the directory specified in --aria-log-dir-path. Absolute and relative (to --datadir) paths are supported. If --aria-log-dir-path is not specified, aria_log* files are copied to --datadir by default. - The function is_absolute_path() was fixed to understand MTR style paths on Windows with forward slashes, e.g. --aria-log-dir-path=D:/Buildbot/amd64-windows/build/mysql-test/var/...	2023-04-21 19:08:35 +04:00
Marko Mäkelä	a091d6ac4e	MDEV-26827 fixup: Do not duplicate io_slots::pending_io_count() os_aio_pending_reads_approx(), os_aio_pending_reads(): Replaces buf_pool.n_pend_reads. os_aio_pending_writes(): Replaces buf_dblwr.pending_writes(). buf_dblwr_t::write_cond, buf_dblwr_t::writes_pending: Remove.	2023-04-12 13:49:57 +03:00
Marko Mäkelä	5bada1246d	Merge 10.5 into 10.6	2023-04-11 16:15:19 +03:00
Oleksandr Byelkin	ac5a534a4c	Merge remote-tracking branch '10.4' into 10.5	2023-03-31 21:32:41 +02:00
Vlad Lesin	4c226c1850	MDEV-29050 mariabackup issues error messages during InnoDB tablespaces export on partial backup preparing The solution is to suppress error messages for missing tablespaces if mariabackup is launched with "--prepare --export" options. "mariabackup --prepare --export" invokes itself with --mysqld parameter. If the parameter is set, then it starts server to feed "FLUSH TABLES ... FOR EXPORT;" queries for exported tablespaces. This is "normal" server start, that's why new srv_operation value is introduced. Reviewed by Marko Makela.	2023-03-27 20:15:10 +03:00
Vicențiu Ciorbaru	08c852026d	Apply clang-tidy to remove empty constructors / destructors This patch is the result of running run-clang-tidy -fix -header-filter=.* -checks='-,modernize-use-equals-default' . Code style changes have been done on top. The result of this change leads to the following improvements: 1. Binary size reduction. For a -DBUILD_CONFIG=mysql_release build, the binary size is reduced by ~400kb. * A raw -DCMAKE_BUILD_TYPE=Release reduces the binary size by ~1.4kb. 2. Compiler can better understand the intent of the code, thus it leads to more optimization possibilities. Additionally it enabled detecting unused variables that had an empty default constructor but not marked so explicitly. Particular change required following this patch in sql/opt_range.cc result_keys, an unused template class Bitmap now correctly issues unused variable warnings. Setting Bitmap template class constructor to default allows the compiler to identify that there are no side-effects when instantiating the class. Previously the compiler could not issue the warning as it assumed Bitmap class (being a template) would not be performing a NO-OP for its default constructor. This prevented the "unused variable warning".	2023-02-09 16:09:08 +02:00
Marko Mäkelä	15ab2e122d	MDEV-30132 Crash after recovery, with InnoDB: Tried to read ... os_file_read(): Merged with os_file_read_no_error_handling(). Crashing on a partial page read is as unhelpful as crashing on a corrupted page read (commit `0b47c126e3`). Report the file name if it is available via IORequest.	2022-11-30 10:54:03 +02:00
Marko Mäkelä	30914389fe	Merge 10.5 into 10.6	2022-07-27 17:52:37 +03:00
Marko Mäkelä	098c0f2634	Merge 10.4 into 10.5	2022-07-27 17:17:24 +03:00
Marko Mäkelä	e5c4f4e590	Merge 10.3 into 10.4	2022-07-27 14:25:36 +03:00
Marko Mäkelä	0ee1082bd2	MDEV-28495 InnoDB corruption due to lack of file locking Starting with commit `da094188f6` (MDEV-24393), MariaDB will no longer acquire advisory file locks on InnoDB data files by default, because it would create a large number of entries in Linux /proc/locks. The motivation for acquiring the file locks is to prevent accidental concurrent startup of multiple server processes on the same data files. Such mistake still turns out to be relatively common, based on corruption bug reports from the community. To prevent corruption due to concurrent startup attempts, the Aria storage engine would unconditionally acquire an advisory lock on one of its log files. Solution: InnoDB will always lock its system tablespace files. (Ever since commit `685d958e38` the InnoDB log file will not necessarily be open while the server is running, because it can be accessed via memory-mapped I/O.) If more protection is desired, then the option --external-locking can be used. The mandatory advisory lock also fixes intermittent failures of some crash recovery tests. It turns out that when the mtr test harness kills and restarts the server, it will not actually ensure that the old process has terminated before starting the new one.	2022-07-27 14:15:14 +03:00
Marko Mäkelä	0b47c126e3	MDEV-13542: Crashing on corrupted page is unhelpful The approach to handling corruption that was chosen by Oracle in commit `177d8b0c12` is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table. We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table. This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace. We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted. Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them. buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests. btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure. btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages(). btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get(). FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE. dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls. trx_rseg_t::get(mtr_t): Replaces trx_rsegf_get(). trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed. trx_sys_create_sys_pages(): Merged with trx_sysf_create(). dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function. dir_pathname(): Replaces os_file_make_new_pathname(). row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record. btr_decryption_failed(): Report decryption failures dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove. dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted. dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once. dict_table_is_corrupted(): Remove. dict_index_t::is_btree(): Determine if the index is a valid B-tree. BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove. UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger assertion failures, but error codes being returned. buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict(). fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide. btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED. opt_search_plan_for_table(): Simplify the code. row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist. row_undo_mod_upd_exist_sec(): Simplify the code. row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql(). fut_get_ptr(): Replace with buf_page_get_gen() calls. buf_page_get_gen(): Return nullptr and err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data. buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr. fseg_page_is_allocated(): Replaces fseg_page_is_free(). fts_drop_common_tables(): Return an error if the transaction was rolled back. fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already. fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors. Clean up mtr_t::page_lock() buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch. buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch. mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi(). recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery. recv_recover_page(): Return whether the operation succeeded. recv_sys_t::recover_low(): Simplify the logic. Check for recovery error. Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.	2022-06-06 14:03:22 +03:00
Oleksandr Byelkin	f5c5f8e41e	Merge branch '10.5' into 10.6	2022-02-03 17:01:31 +01:00
Oleksandr Byelkin	cf63eecef4	Merge branch '10.4' into 10.5	2022-02-01 20:33:04 +01:00
Oleksandr Byelkin	a576a1cea5	Merge branch '10.3' into 10.4	2022-01-30 09:46:52 +01:00
Oleksandr Byelkin	41a163ac5c	Merge branch '10.2' into 10.3	2022-01-29 15:41:05 +01:00
Vladislav Vaintroub	47e18af906	MDEV-27494 Rename .ic files to .inl	2022-01-17 16:41:51 +01:00
Marko Mäkelä	3f5726768f	Merge 10.5 into 10.6	2022-01-04 09:26:38 +02:00
Marko Mäkelä	4c3ad24413	MDEV-27416 InnoDB hang in buf_flush_wait_flushed(), on log checkpoint InnoDB could sometimes hang when triggering a log checkpoint. This is due to commit `7b1252c03d` (MDEV-24278), which introduced an untimed wait to buf_flush_page_cleaner(). The hang was noticed by occasional failures of IMPORT TABLESPACE tests, such as innodb.innodb-wl5522, which would (unnecessarily) invoke log_make_checkpoint() from row_import_cleanup(). The reason of the hang was that buf_flush_page_cleaner() would enter untimed sleep despite buf_flush_sync_lsn being set. The exact failure scenario is unclear, because buf_flush_sync_lsn should actually be protected by buf_pool.flush_list_mutex. We prevent the hang by invoking buf_pool.page_cleaner_set_idle(false) whenever we are setting buf_flush_sync_lsn and signaling buf_pool.do_flush_list. The bulk of these changes was originally developed as a preparation for MDEV-26827, to invoke buf_flush_list() from fewer threads, and tested on 10.6 by Matthias Leich. This fix was tested by running 100 repetitions of 100 concurrent instances of the test innodb.innodb-wl5522 on a RelWithDebInfo build, using ext4fs and innodb_flush_method=O_DIRECT on a SATA SSD with 4096-byte block size. During the test, the call to log_make_checkpoint() in row_import_cleanup() was present. buf_flush_list(): Make static. buf_flush_wait(): Wait for buf_pool.get_oldest_modification() to reach a target, by work done in the buf_flush_page_cleaner. If buf_flush_sync_lsn is going to be set, we will invoke buf_pool.page_cleaner_set_idle(false). buf_flush_ahead(): If buf_flush_sync_lsn or buf_flush_async_lsn is going to be set and the page cleaner woken up, we will invoke buf_pool.page_cleaner_set_idle(false). buf_flush_wait_flushed(): Invoke buf_flush_wait(). buf_flush_sync(): Invoke recv_sys.apply() at the start in case crash recovery is active. Invoke buf_flush_wait(). buf_flush_sync_batch(): A lower-level variant of buf_flush_sync() that is only called by recv_sys_t::apply(). buf_flush_sync_for_checkpoint(): Do not trigger log apply or checkpoint during recovery. buf_dblwr_t::create(): Only initiate a buffer pool flush, not a checkpoint. row_import_cleanup(): Do not unnecessarily invoke log_make_checkpoint(). Invoking buf_flush_list_space() before starting to generate redo log for the imported tablespace should suffice. srv_prepare_to_delete_redo_log_file(): Set recv_sys.recovery_on in order to prevent buf_flush_sync_for_checkpoint() from initiating a checkpoint while the log is inaccessible. Remove a wait loop that is already part of buf_flush_sync(). Do not invoke fil_names_clear() if the log is being upgraded, because the FILE_MODIFY record is specific to the latest format. create_log_file(): Clear recv_sys.recovery_on only after calling log_make_checkpoint(), to prevent buf_flush_page_cleaner from invoking a checkpoint. innodb_shutdown(): Simplify the logic in mariadb-backup --prepare. os_aio_wait_until_no_pending_writes(): Update the function comment. Apart from row_quiesce_table_start() during FLUSH TABLES...FOR EXPORT, this is being called by buf_flush_list_space(), which is invoked by ALTER TABLE...IMPORT TABLESPACE as well as some encryption operations.	2022-01-04 07:40:31 +02:00
Marko Mäkelä	db915f7387	MDEV-27058: Move buf_page_t::slot to IORequest::slot MDEV-23855 and MDEV-23399 already moved some transient data fields from buffer pool page descriptors to IORequest, but the write buffer of PAGE_COMPRESSED or ENCRYPTED tables was missed. Since is only needed during asynchronous page write requests, it belongs to IORequest.	2021-11-18 17:44:33 +02:00
Eugene Kosov	d089b51d66	TSAN: unprotected global variable WARNING: ThreadSanitizer: data race (pid=1510842) Write of size 8 at 0x0000067b1e98 by main thread: #0 os_file_pwrite(IORequest const&, int, unsigned char const, unsigned long, unsigned long, dberr_t) /storage/innobase/os/os0file.cc:2928:2 (mariadbd+0x234c5ac) #1 os_file_write_func(IORequest const&, char const, int, void const, unsigned long, unsigned long) /storage/innobase/os/os0file.cc:2963:20 (mariadbd+0x234c019) #2 file_os_io::write(char const, unsigned long, st_::span<unsigned char const>) /storage/innobase/log/log0log.cc:320:10 (mariadbd+0x22eaa50) #3 log_file_t::write(unsigned long, st_::span<unsigned char const>) /storage/innobase/log/log0log.cc:434:18 (mariadbd+0x22eb1d8) #4 log_t::file::write(unsigned long, st_::span<unsigned char>) /storage/innobase/log/log0log.cc:496:29 (mariadbd+0x22ebb55) #5 log_write_buf(unsigned char, unsigned long, unsigned long, unsigned long, unsigned long) /storage/innobase/log/log0log.cc:614:14 (mariadbd+0x22f1b51) #6 log_write(bool) /storage/innobase/log/log0log.cc:755:2 (mariadbd+0x22ed2ec) #7 log_write_up_to(unsigned long, bool, bool, completion_callback const) /storage/innobase/log/log0log.cc:817:5 (mariadbd+0x22eca44) #8 log_checkpoint_low(unsigned long, unsigned long) /storage/innobase/buf/buf0flu.cc:1734:5 (mariadbd+0x20d37c1) #9 log_checkpoint() /storage/innobase/buf/buf0flu.cc:1787:10 (mariadbd+0x20cd155) #10 buf_flush_wait_flushed(unsigned long) /storage/innobase/buf/buf0flu.cc:1867:5 (mariadbd+0x20ccf8f) #11 log_make_checkpoint() /storage/innobase/buf/buf0flu.cc:1793:3 (mariadbd+0x20cc4c9) #12 buf_dblwr_t::create() /storage/innobase/buf/buf0dblwr.cc:216:3 (mariadbd+0x209076a) #13 srv_start(bool) /storage/innobase/srv/srv0start.cc:1685:20 (mariadbd+0x256b4aa) #14 innodb_init(void) /storage/innobase/handler/ha_innodb.cc:4188:8 (mariadbd+0x1ed40da) #15 ha_initialize_handlerton(st_plugin_int) /sql/handler.cc:659:31 (mariadbd+0xf7c2b6) #16 plugin_initialize(st_mem_root, st_plugin_int, int, char*, bool) /sql/sql_plugin.cc:1463:9 (mariadbd+0x160fedb) #17 plugin_init(int, char, int) /sql/sql_plugin.cc:1756:15 (mariadbd+0x160f53f) #18 init_server_components() /sql/mysqld.cc:5043:7 (mariadbd+0xd71462) #19 mysqld_main(int, char) /sql/mysqld.cc:5655:7 (mariadbd+0xd6ae87) #20 main /sql/main.cc:34:10 (mariadbd+0xd661c8) Previous write of size 8 at 0x0000067b1e98 by thread T3: #0 os_file_pwrite(IORequest const&, int, unsigned char const, unsigned long, unsigned long, dberr_t) /storage/innobase/os/os0file.cc:2928:2 (mariadbd+0x234c5ac) #1 os_file_write_func(IORequest const&, char const, int, void const, unsigned long, unsigned long) /storage/innobase/os/os0file.cc:2963:20 (mariadbd+0x234c019) #2 file_os_io::write(char const, unsigned long, st_::span<unsigned char const>) /storage/innobase/log/log0log.cc:320:10 (mariadbd+0x22eaa50) #3 log_file_t::write(unsigned long, st_::span<unsigned char const>) /storage/innobase/log/log0log.cc:434:18 (mariadbd+0x22eb1d8) #4 log_t::file::write(unsigned long, st_::span<unsigned char>) /storage/innobase/log/log0log.cc:496:29 (mariadbd+0x22ebb55) #5 log_write_checkpoint_info(unsigned long) /storage/innobase/log/log0log.cc:911:14 (mariadbd+0x22edd4e) #6 log_checkpoint_low(unsigned long, unsigned long) /storage/innobase/buf/buf0flu.cc:1755:3 (mariadbd+0x20d3a3d) #7 buf_flush_sync_for_checkpoint(unsigned long) /storage/innobase/buf/buf0flu.cc:1947:7 (mariadbd+0x20d4163) #8 buf_flush_page_cleaner() /storage/innobase/buf/buf0flu.cc:2186:9 (mariadbd+0x20cdab1) #9 void std::__invoke_impl<void, void ()()>(std::__invoke_other, void (&&)()) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (mariadbd+0x20c3aaa) #10 std::__invoke_result<void ()()>::type std::__invoke<void ()()>(void (&&)()) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (mariadbd+0x20c39bd) #11 void std:🧵:_Invoker<std::tuple<void ()()> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (mariadbd+0x20c3965) #12 std:🧵:_Invoker<std::tuple<void ()()> >::operator()() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (mariadbd+0x20c3905) #13 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<void (*)()> > >::_M_run() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (mariadbd+0x20c37f9) #14 <null> <null> (libstdc++.so.6+0xd230f) Location is global 'os_n_file_writes' of size 8 at 0x0000067b1e98 (mariadbd+0x67b1e98) Make variable atomic.	2021-09-09 04:08:50 +06:00
Eugene Kosov	7f50edb215	TSAN: data race on a global counter WARNING: ThreadSanitizer: data race (pid=1503350) Write of size 8 at 0x0000067b1f20 by thread T3: #0 os_file_sync_posix(int) /storage/innobase/os/os0file.cc:895:5 (mariadbd+0x23493f6) #1 os_file_flush_func(int) /storage/innobase/os/os0file.cc:983:8 (mariadbd+0x2349204) #2 file_os_io::flush() /storage/innobase/log/log0log.cc:326:10 (mariadbd+0x22eaaa9) #3 log_file_t::flush() /storage/innobase/log/log0log.cc:440:18 (mariadbd+0x22eb2d0) #4 log_t::file::flush() /storage/innobase/log/log0log.cc:507:29 (mariadbd+0x22ebe69) #5 log_write_flush_to_disk_low(unsigned long) /storage/innobase/log/log0log.cc:629:17 (mariadbd+0x22ed3f3) #6 log_write_up_to(unsigned long, bool, bool, completion_callback const) /storage/innobase/log/log0log.cc:829:3 (mariadbd+0x22ecb04) #7 log_checkpoint_low(unsigned long, unsigned long) /storage/innobase/buf/buf0flu.cc:1734:5 (mariadbd+0x20d37f1) #8 buf_flush_sync_for_checkpoint(unsigned long) /storage/innobase/buf/buf0flu.cc:1947:7 (mariadbd+0x20d4193) #9 buf_flush_page_cleaner() /storage/innobase/buf/buf0flu.cc:2186:9 (mariadbd+0x20cdad7) #10 void std::__invoke_impl<void, void ()()>(std::__invoke_other, void (&&)()) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (mariadbd+0x20c3aaa) #11 std::__invoke_result<void ()()>::type std::__invoke<void ()()>(void (&&)()) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (mariadbd+0x20c39bd) #12 void std:🧵:_Invoker<std::tuple<void ()()> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (mariadbd+0x20c3965) #13 std:🧵:_Invoker<std::tuple<void ()()> >::operator()() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (mariadbd+0x20c3905) #14 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<void ()()> > >::_M_run() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (mariadbd+0x20c37f9) #15 <null> <null> (libstdc++.so.6+0xd230f) Previous write of size 8 at 0x0000067b1f20 by main thread: #0 os_file_sync_posix(int) /storage/innobase/os/os0file.cc:895:5 (mariadbd+0x23493f6) #1 os_file_flush_func(int) /storage/innobase/os/os0file.cc:983:8 (mariadbd+0x2349204) #2 fil_space_t::flush_low() /storage/innobase/fil/fil0fil.cc:504:5 (mariadbd+0x205cad5) #3 fil_flush_file_spaces() /storage/innobase/fil/fil0fil.cc:2947:13 (mariadbd+0x206523f) #4 log_checkpoint() /storage/innobase/buf/buf0flu.cc:1777:5 (mariadbd+0x20cd069) #5 buf_flush_wait_flushed(unsigned long) /storage/innobase/buf/buf0flu.cc:1867:5 (mariadbd+0x20ccf95) #6 log_make_checkpoint() /storage/innobase/buf/buf0flu.cc:1793:3 (mariadbd+0x20cc4c9) #7 buf_dblwr_t::create() /storage/innobase/buf/buf0dblwr.cc:216:3 (mariadbd+0x209076a) #8 srv_start(bool) /storage/innobase/srv/srv0start.cc:1685:20 (mariadbd+0x256b514) #9 innodb_init(void) /storage/innobase/handler/ha_innodb.cc:4188:8 (mariadbd+0x1ed406a) #10 ha_initialize_handlerton(st_plugin_int) /sql/handler.cc:659:31 (mariadbd+0xf7c246) #11 plugin_initialize(st_mem_root, st_plugin_int, int, char*, bool) /sql/sql_plugin.cc:1463:9 (mariadbd+0x160fe6b) #12 plugin_init(int, char, int) /sql/sql_plugin.cc:1756:15 (mariadbd+0x160f4cf) #13 init_server_components() /sql/mysqld.cc:5043:7 (mariadbd+0xd713f2) #14 mysqld_main(int, char) /sql/mysqld.cc:5655:7 (mariadbd+0xd6ae17) #15 main /sql/main.cc:34:10 (mariadbd+0xd66158) This is a correct report by TSAN for an obvious case: unprotected global counter. Fix it by making counter std::atomic.	2021-09-09 04:08:50 +06:00
Marko Mäkelä	eb2f2c1e5f	MDEV-26547 fixup: Wait for read completion buf_load(): Wait for the submitted reads to finish before updating innodb_buffer_pool_load_status.	2021-09-07 08:55:08 +03:00
Marko Mäkelä	101da87228	Merge 10.5 into 10.6	2021-06-23 19:36:45 +03:00
Marko Mäkelä	6dfd44c828	MDEV-25954: Trim os_aio_wait_until_no_pending_writes() It turns out that we had some unnecessary waits for no outstanding write requests to exist. They were basically working around a bug that was fixed in MDEV-25953. On write completion callback, blocks will be marked clean. So, it is sufficient to consult buf_pool.flush_list to determine which writes have not been completed yet. On FLUSH TABLES...FOR EXPORT we must still wait for all pending asynchronous writes to complete, because buf_flush_file_space() would merely guarantee that writes will have been initiated.	2021-06-23 19:06:49 +03:00
Marko Mäkelä	5d495fc44b	Merge 10.5 into 10.6	2021-05-19 09:15:54 +03:00
Marko Mäkelä	db8fb40824	Merge 10.4 into 10.5	2021-05-19 08:39:39 +03:00
Marko Mäkelä	08b6fd9395	MDEV-25710: Dead code os_file_opendir() in the server The functions fil_file_readdir_next_file(), os_file_opendir(), os_file_closedir() became dead code in the server in MariaDB 10.4.0 with commit `09af00cbde` (the removal of the crash recovery logic for the TRUNCATE TABLE implementation that was replaced in MDEV-13564). os_file_opendir(), os_file_closedir(): Define as macros.	2021-05-18 12:13:18 +03:00
Marko Mäkelä	cf552f5886	MDEV-25312 Replace fil_space_t::name with fil_space_t::name() A consistency check for fil_space_t::name is causing recovery failures in MDEV-25180 (Atomic ALTER TABLE). So, we'd better remove that field altogether. fil_space_t::name was more or less a copy of dict_table_t::name (except for some special cases), and it was not being used for anything useful. There used to be a name_hash, but it had been removed already in commit `a75dbfd718` (MDEV-12266). We will also remove os_normalize_path(), OS_PATH_SEPARATOR, OS_PATH_SEPATOR_ALT. On Microsoft Windows, we will treat \ and / roughly in the same way. The intention is that for per-table tablespaces, the filenames will always follow the pattern prefix/databasename/tablename.ibd. (Any \ in the prefix must not be converted.) ut_basename_noext(): Remove (unused function). read_link_file(): Replaces RemoteDatafile::read_link_file(). We will ensure that the last two path component separators are forward slashes (converting up to 2 trailing backslashes on Microsoft Windows), so that everywhere else we can assume that data file names end in "/databasename/tablename.ibd". Note: On Microsoft Windows, path names that start with \\?\ must not contain / as path component separators. Previously, such paths did work in the DATA DIRECTORY argument of InnoDB tables. Reviewed by: Vladislav Vaintroub	2021-04-07 18:01:13 +03:00
Marko Mäkelä	00528a0445	Merge 10.5 into 10.6	2021-03-19 13:35:18 +02:00
Marko Mäkelä	be881ec457	Merge 10.4 into 10.5	2021-03-19 13:09:21 +02:00
Marko Mäkelä	44d70c01f0	Merge 10.3 into 10.4	2021-03-19 11:42:44 +02:00
Marko Mäkelä	19052b6deb	Merge 10.2 into 10.3	2021-03-18 12:34:48 +02:00
Marko Mäkelä	14a8b700f3	Cleanup: Remove unused OS_DATA_TEMP_FILE This had been originally added in mysql/mysql-server@192bb153b6 with the motivation to disable O_DIRECT for the dedicated tablespace for temporary tables. In MariaDB Server, commit `5eb539555b` (MDEV-12227) should be a better solution. The code became orphaned later in mysql/mysql-server@c61244c0e6 and it had been applied to MariaDB Server 10.2.2 in commit `2e814d4702` and commit `fec844aca8`. Thanks to Vladislav Vaintroub for spotting this.	2021-03-18 12:24:35 +02:00
Marko Mäkelä	6268bdadf7	Merge 10.5 into 10.6	2021-01-04 10:52:32 +02:00
Marko Mäkelä	b6be78d4e5	Merge 10.4 into 10.5	2020-12-23 15:07:36 +02:00
Marko Mäkelä	3b7dbdf080	MDEV-24449 cleanup: Remove a timeout recv_sys_t::apply(): At the end of the last batch, wait for pending reads to complete (read_slots->wait()), instead of waiting for some time, and assert that buf_pool.n_pend_reads==0 after that wait. io_callback(): Do not invoke read_slots->release() before the callback function has returned, to ensure the correct operation of recv_sys_t::apply().	2020-12-22 07:42:03 +02:00
Marko Mäkelä	f24b738318	MDEV-24313 (2 of 2): Silently ignored innodb_use_native_aio=1 In commit `5e62b6a5e0` (MDEV-16264) the logic of os_aio_init() was changed so that it will never fail, but instead automatically disable innodb_use_native_aio (which is enabled by default) if the io_setup() system call would fail due to resource limits being exceeded. This is questionable, especially because falling back to simulated AIO may lead to significantly reduced performance. srv_n_file_io_threads, srv_n_read_io_threads, srv_n_write_io_threads: Change the data type from ulong to uint. os_aio_init(): Remove the parameters, and actually return an error code. thread_pool::configure_aio(): Do not silently fall back to simulated AIO. Reviewed by: Vladislav Vaintroub	2020-12-14 15:27:03 +02:00

1 2 3 4 5

205 commits