mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-30 18:41:56 +01:00

Author	SHA1	Message	Date
Marko Mäkelä	0b47c126e3	MDEV-13542: Crashing on corrupted page is unhelpful The approach to handling corruption that was chosen by Oracle in commit `177d8b0c12` is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table. We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table. This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace. We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted. Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them. buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests. btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure. btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages(). btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get(). FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE. dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls. trx_rseg_t::get(mtr_t): Replaces trx_rsegf_get(). trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed. trx_sys_create_sys_pages(): Merged with trx_sysf_create(). dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function. dir_pathname(): Replaces os_file_make_new_pathname(). row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record. btr_decryption_failed(): Report decryption failures dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove. dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted. dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once. dict_table_is_corrupted(): Remove. dict_index_t::is_btree(): Determine if the index is a valid B-tree. BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove. UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger assertion failures, but error codes being returned. buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict(). fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide. btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED. opt_search_plan_for_table(): Simplify the code. row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist. row_undo_mod_upd_exist_sec(): Simplify the code. row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql(). fut_get_ptr(): Replace with buf_page_get_gen() calls. buf_page_get_gen(): Return nullptr and err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data. buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr. fseg_page_is_allocated(): Replaces fseg_page_is_free(). fts_drop_common_tables(): Return an error if the transaction was rolled back. fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already. fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors. Clean up mtr_t::page_lock() buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch. buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch. mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi(). recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery. recv_recover_page(): Return whether the operation succeeded. recv_sys_t::recover_low(): Simplify the logic. Check for recovery error. Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.	2022-06-06 14:03:22 +03:00
Marko Mäkelä	1b03db11d2	MDEV-15528 fixup: Remove some dead code btr_page_split_and_insert(): Declare all parameters nonnull. btr_pessimistic_scrub() was removed in commit `a5584b13d1` (MDEV-15528).	2022-06-06 09:52:11 +03:00
Sergei Golubchik	3bc98a4ec4	Merge branch '10.5' into 10.6	2022-05-10 14:01:23 +02:00
Sergei Golubchik	ef781162ff	Merge branch '10.4' into 10.5	2022-05-09 22:04:06 +02:00
Sergei Golubchik	a70a1cf3f4	Merge branch '10.3' into 10.4	2022-05-08 23:03:08 +02:00
Marko Mäkelä	57a9626fe4	Merge 10.5 into 10.6	2022-05-06 11:11:04 +03:00
Marko Mäkelä	26d46234e8	MDEV-28478: INSERT into SPATIAL INDEX in TEMPORARY table writes log This is based on commit `20ae4816bb` with some adjustments for MDEV-12353. row_ins_sec_index_entry_low(): If a separate mini-transaction is needed to adjust the minimum bounding rectangle (MBR) in the parent page, we must disable redo logging if the table is a temporary table. For temporary tables, no log is supposed to be written, because the temporary tablespace will be reinitialized on server restart. rtr_update_mbr_field(), rtr_merge_and_update_mbr(): Changed the return type to void and removed unreachable code. In older versions, these used to return a different value for temporary tables. page_id_t: Add constexpr to most member functions. mtr_t::log_write(): Catch log writes to invalid tablespaces so that the test case would crash without the fix to row_ins_sec_index_entry_low().	2022-05-06 10:12:31 +03:00
Marko Mäkelä	c844a5881a	MDEV-21452 fixup: Remove an unused variable	2022-05-03 13:31:59 +03:00
Marko Mäkelä	0806592ac8	MDEV-28422 Page split breaks a gap lock btr_insert_into_right_sibling(): Inherit any gap lock from the left sibling to the right sibling before inserting the record to the right sibling and updating the node pointer(s). lock_update_node_pointer(): Update locks in case a node pointer will move. Based on mysql/mysql-server@c7d93c274f	2022-04-27 13:38:08 +03:00
Marko Mäkelä	2ca1123464	MDEV-26217 Failing assertion: list.count > 0 in ut_list_remove or Assertion `lock->trx == this' failed in dberr_t trx_t::drop_table This follows up the previous fix in commit `c3c53926c4` (MDEV-26554). ha_innobase::delete_table(): Work around the insufficient metadata locking (MDL) during DML operations by acquiring exclusive InnoDB table locks on all child tables. Previously, this was only done on TRUNCATE and ALTER. ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke lock_update_delete() during change buffer operations. The revised trx_t::commit(std::vector<pfs_os_file_t>&) will hold exclusive lock_sys.latch while invoking fil_delete_tablespace(), which in turn may invoke ibuf_delete_rec(). dict_index_t::has_locking(): A new predicate, replacing the dummy !dict_table_is_locking_disabled(index->table). Used for skipping lock operations during ibuf_delete_rec(). trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks and remove the table from the cache while holding exclusive lock_sys.latch. trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds. trx_t::commit(): Reset dict_operation before invoking commit_in_memory() via commit_persist(). lock_release_on_drop(): Release locks while lock_sys.latch is exclusively locked. lock_table(): Add a parameter for a pointer to the table. We must not dereference the table before a lock_sys.latch has been acquired. If the pointer to the table does not match the table at that point, the table is invalid and DB_DEADLOCK will be returned. row_ins_foreign_check_on_constraint(): Improve the checks. Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed before commit `c5fd9aa562` (MDEV-25919). row_upd_check_references_constraints(), wsrep_row_upd_check_foreign_constraints(): Simplify checks.	2022-04-26 18:09:03 +03:00
Marko Mäkelä	8f8ba75855	MDEV-27234: Data dictionary recovery was not READ COMMITTED This also fixes MDEV-20198: Instant ALTER TABLE is not crash safe InnoDB dictionary recovery wrongly used the READ UNCOMMITTED isolation level, causing some mismatch. For example, if a table was renamed or replaced in a transaction, according to READ UNCOMMITTED the table might not exist at all. We implement READ COMMITTED isolation level for accessing the dictionary tables SYS_TABLES, SYS_COLUMNS, SYS_INDEXES, SYS_FIELDS, SYS_VIRTUAL, SYS_FOREIGN, SYS_FOREIGN_COLS. For most of these tables, no secondary index exists. For the secondary indexes (on SYS_TABLES.ID, SYS_FOREIGN.FOR_NAME, SYS_FOREIGN.REF_NAME), we will always look up the primary key in the clustered index and check if the record actually is a committed version. dict_check_sys_tables(): Recover tablespaces also from delete-marked committed records, so that if a matching .ibd file exists, it will be removed by fil_delete_tablespace() when the committed delete-marked SYS_INDEXES record of the clustered index is purged in row_purge_remove_clust_if_poss_low(). fil_ibd_open(): Change the Boolean parameter "validate" to a ternary one, to suppress error messages when the file might not exist. It is possible that a .ibd file was deleted and the server shut down before the SYS_INDEXES and SYS_TABLES records were purged. Hence, if dict_check_sys_tables() finds a committed delete-marked record, we must not complain if the tablespace file is not found. On Windows, we msut treat ERROR_PATH_NOT_FOUND (directory not found) in the same way as ERROR_FILE_NOT_FOUND. This fixes a few failures where a previous test successfully executed DROP DATABASE (and deleted all files and the directory), but a committed delete-marked SYS_TABLES record had not been purged before server restart. dict_getnext_system_low(): Do not filter out delete-marked records. dict_startscan_system(), dict_getnext_system(): Do filter out delete-marked records, for accessing the INFORMATION_SCHEMA tables. dict_sys_tables_rec_read(): Return the DB_TRX_ID of the committed version of the record. This is needed in dict_load_table_low(). dict_load_foreign_cols(), dict_load_foreign(): Add a parameter for the current transaction identifier. In some DDL operations, the FOREIGN KEY constraints are being loaded from the data dictionary before the DDL transaction has been committed. For SYS_FOREIGN and SYS_FOREIGN_COLS, we must implement the special case of READ COMMITTED that the changes of the uncommitted current transaction are visible. dict_load_foreign(): Validate the table name. We could find a SYS_FOREIGN.ID via a committed delete-marked secondary index record that does not match the REF_NAME or FOR_NAME of the secondary index record. dict_load_index_low(): Optionally take the table as a parameter, so that table->def_trx_id can be updated in case of a committed delete-marked SYS_INDEXES record corresponding to DROP INDEX, but not corresponding to an index stub of ADD INDEX. dict_load_indexes(): Do not update table->def_trx_id in case of delete-marked records. rec_is_metadata(), rec_offs_make_valid(), rec_get_offsets_func(), row_build_low(): Relax some assertions. We may now have !index->is_instant() even if a metadata record is present in the index. Previously, the recovery of instant ADD/DROP COLUMN assumed that READ UNCOMMITTED of the data dictionary will be performed. Now, we will have a READ COMMITTED copy of the data dictionary cache, and a READ UNCOMMITTED copy of the metadata record. btr_page_reorganize_low(): Correctly update the FIL_PAGE_TYPE when rolling back an instant ADD/DROP COLUMN operation. row_rec_to_index_entry_impl(): Relax some assertions, and disallow accessing "extra" fields. This fixes the recovery of a crash during an instant ADD COLUMN after a successful instant DROP COLUMN, in the test innodb.instant_alter_crash. Tested by: Matthias Leich	2022-03-28 08:37:51 +03:00
Marko Mäkelä	aaef2e1d8c	MDEV-27058: Reduce the size of buf_block_t and buf_page_t buf_page_t::frame: Moved from buf_block_t::frame. All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED pages will have frame=nullptr, while all 'fat' buf_block_t will have a non-null frame pointing to aligned innodb_page_size bytes. This eliminates the need for separate states for BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE. buf_page_t:🔒 Moved from buf_block_t::lock. That is, all block descriptors will have a page latch. The IO_PIN state that was used for discarding or creating the uncompressed page frame of a ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix and page X-latch. page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status of buf_page_t with a single std::atomic<uint32_t>. All modifications will use store(), fetch_add(), fetch_sub(). This space was previously wasted to alignment on 64-bit systems. We will use the following encoding that combines a state (partly read-fix or write-fix) and a buffer-fix count: buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED) buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY) buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH) buf_page_t::FREED=3 + fix: pages marked as freed in the file buf_page_t::UNFIXED=1U<<29 + fix: normal pages buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite) buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched) buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched) buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite) buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch. buf_page_t::read_complete(): Renamed from buf_page_read_complete(). Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch. buf_page_t::can_relocate(): If the page latch is being held or waited for, or the block is buffer-fixed or io-fixed, return false. (The condition on the page latch is new.) Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we will acquire the page latch before fix(), and unfix() before unlocking. buf_page_t::flush(): Replaces buf_flush_page(). Optimize the handling of FREED pages. buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held by the caller. buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates. buf_page_get_low(): Ignore guesses that are read-fixed because they may not yet be registered in buf_pool.page_hash and buf_pool.LRU. buf_page_optimistic_get(): Acquire latch before buffer-fixing. buf_page_make_young(): Leave read-fixed blocks alone, because they might not be registered in buf_pool.LRU yet. recv_sys_t::recover_deferred(), recv_sys_t::recover_low(): Possibly fix MDEV-26326, by holding a page X-latch instead of only buffer-fixing the page.	2021-11-18 17:47:19 +02:00
Marko Mäkelä	1f02280904	MDEV-26769 InnoDB does not support hardware lock elision This implements memory transaction support for: * Intel Restricted Transactional Memory (RTM), also known as TSX-NI (Transactional Synchronization Extensions New Instructions) * POWER v2.09 Hardware Trace Monitor (HTM) on GNU/Linux transactional_lock_guard, transactional_shared_lock_guard: RAII lock guards that try to elide the lock acquisition when transactional memory is available. buf_pool.page_hash: Try to elide latches whenever feasible. Related to the InnoDB change buffer and ROW_FORMAT=COMPRESSED tables, this is not always possible. In buf_page_get_low(), memory transactions only work reasonably well for validating a guessed block address. TMLockGuard, TMLockTrxGuard, TMLockMutexGuard: RAII lock guards that try to elide lock_sys.latch and related latches.	2021-10-22 12:38:45 +03:00
Marko Mäkelä	1e9c922fa7	MDEV-26623 Possible race condition between statistics and bulk insert When computing statistics, let us play it safe and check whether an insert into an empty table is in progress, once we have acquired the root page latch. If yes, let us pretend that the table is empty, just like MVCC reads would do. It is unlikely that this race condition could lead into any crashes before MDEV-24621, because the index tree structure should be protected by the page latches. But, at least we can avoid some busy work and return earlier. As part of this, some code that is only used for statistics calculation is being moved into static functions in that compilation unit.	2021-09-17 14:40:41 +03:00
Marko Mäkelä	1a4a7dddb6	Cleanup: Make btr_root_block_get() more robust btr_root_block_get(): Check for index->page == FIL_NULL. btr_root_get(): Declare static. Other callers can invoke btr_root_block_get() directly. btr_get_size(): Remove conditions that are checked in btr_root_block_get().	2021-09-17 14:40:41 +03:00
Marko Mäkelä	f3fcf5f45c	Merge 10.5 to 10.6	2021-08-19 12:25:00 +03:00
Marko Mäkelä	4a25957274	Merge 10.4 into 10.5	2021-08-18 18:22:35 +03:00
Marko Mäkelä	f84e28c119	Merge 10.3 into 10.4	2021-08-18 16:51:52 +03:00
Marko Mäkelä	cd65845a0e	Merge 10.2 into 10.3 MDEV-18734 FIXME: vcol.partition triggers ASAN heap-use-after-free	2021-08-18 12:26:58 +03:00
Eugene Kosov	890f2ad769	MDEV-20931 ALTER...IMPORT can crash the server Main idea: don't log-and-crash but propogate error to the upper layers of stack to handle it and show to a user.	2021-08-17 20:28:42 +06:00
Marko Mäkelä	1bd681c8b3	MDEV-25506 (3 of 3): Do not delete .ibd files before commit This is a complete rewrite of DROP TABLE, also as part of other DDL, such as ALTER TABLE, CREATE TABLE...SELECT, TRUNCATE TABLE. The background DROP TABLE queue hack is removed. If a transaction needs to drop and create a table by the same name (like TRUNCATE TABLE does), it must first rename the table to an internal #sql-ib name. No committed version of the data dictionary will include any #sql-ib tables, because whenever a transaction renames a table to a #sql-ib name, it will also drop that table. Either the rename will be rolled back, or the drop will be committed. Data files will be unlinked after the transaction has been committed and a FILE_RENAME record has been durably written. The file will actually be deleted when the detached file handle returned by fil_delete_tablespace() will be closed, after the latches have been released. It is possible that a purge of the delete of the SYS_INDEXES record for the clustered index will execute fil_delete_tablespace() concurrently with the DDL transaction. In that case, the thread that arrives later will wait for the other thread to finish. HTON_TRUNCATE_REQUIRES_EXCLUSIVE_USE: A new handler flag. ha_innobase::truncate() now requires that all other references to the table be released in advance. This was implemented by Monty. ha_innobase::delete_table(): If CREATE TABLE..SELECT is detected, we will "hijack" the current transaction, drop the table in the current transaction and commit the current transaction. This essentially fixes MDEV-21602. There is a FIXME comment about making the check less failure-prone. ha_innobase::truncate(), ha_innobase::delete_table(): Implement a fast path for temporary tables. We will no longer allow temporary tables to use the adaptive hash index. dict_table_t::mdl_name: The original table name for the purpose of acquiring MDL in purge, to prevent a race condition between a DDL transaction that is dropping a table, and purge processing undo log records of DML that had executed before the DDL operation. For #sql-backup- tables during ALTER TABLE...ALGORITHM=COPY, the dict_table_t::mdl_name will differ from dict_table_t::name. dict_table_t::parse_name(): Use mdl_name instead of name. dict_table_rename_in_cache(): Update mdl_name. For the internal FTS_ tables of FULLTEXT INDEX, purge would acquire MDL on the FTS_ table name, but not on the main table, and therefore it would be able to run concurrently with a DDL transaction that is dropping the table. Previously, the DROP TABLE queue hack prevented a race between purge and DDL. For now, we introduce purge_sys.stop_FTS() to prevent purge from opening any table, while a DDL transaction that may drop FTS_ tables is in progress. The function fts_lock_table(), which will be invoked before the dictionary is locked, will wait for purge to release any table handles. trx_t::drop_table_statistics(): Drop statistics for the table. This replaces dict_stats_drop_index(). We will drop or rename persistent statistics atomically as part of DDL transactions. On lock conflict for dropping statistics, we will fail instantly with DB_LOCK_WAIT_TIMEOUT, because we will be holding the exclusive data dictionary latch. trx_t::commit_cleanup(): Separated from trx_t::commit_in_memory(). Relax an assertion around fts_commit() and allow DB_LOCK_WAIT_TIMEOUT in addition to DB_DUPLICATE_KEY. The call to fts_commit() is entirely misplaced here and may obviously break the consistency of transactions that affect FULLTEXT INDEX. It needs to be fixed separately. dict_table_t::n_foreign_key_checks_running: Remove (MDEV-21175). The counter was a work-around for missing meta-data locking (MDL) on the SQL layer, and not really needed in MariaDB. ER_TABLE_IN_FK_CHECK: Replaced with ER_UNUSED_28. HA_ERR_TABLE_IN_FK_CHECK: Remove. row_ins_check_foreign_constraints(): Do not acquire dict_sys.latch either. The SQL-layer MDL will protect us. This was reviewed by Thirunarayanan Balathandayuthapani and tested by Matthias Leich.	2021-06-09 17:06:07 +03:00
Marko Mäkelä	a7d68e7a0f	MDEV-25791: Remove UNIV_INTERN Back in 2006 or 2007, when MySQL AB and Innobase Oy existed as separately controlled entities (Innobase had been acquired by Oracle Corporation), MySQL 5.1 introduced a storage engine plugin interface and Oracle made use of it by distributing a separate InnoDB Plugin, which would contain some more bug fixes and improvements, compared to the version of InnoDB that was statically linked with the mysqld server that was distributed by MySQL AB. The built-in InnoDB would export global symbols, which would clash with the symbols of the dynamic InnoDB Plugin (which was supposed to override the built-in one when present). The solution to this problem was to declare all global symbols with UNIV_INTERN, so that they would get the GCC function attribute that specifies hidden visibility. Later, in MariaDB Server, something based on Percona XtraDB (a fork of MySQL InnoDB) became the statically linked implementation, and something closer to MySQL InnoDB was available as a dynamic plugin. Starting with version 10.2, MariaDB Server includes only one InnoDB implementation, and hence any reason to have the UNIV_INTERN definition was lost. btr_get_size_and_reserved(): Move to the same compilation unit with the only caller. innodb_set_buf_pool_size(): Remove. Modify innobase_buffer_pool_size directly. fil_crypt_calculate_checksum(): Merge to the only caller. ha_innobase::innobase_reset_autoinc(): Merge to the only caller. thd_query_start_micro(): Remove. Call thd_start_utime() directly.	2021-05-27 13:28:08 +03:00
Marko Mäkelä	2ceadb3903	MDEV-25506 (2 of 3): Kill during DDL leaves orphan .ibd file dict_drop_index_tree(): Even if SYS_INDEXES.PAGE contains the special value FIL_NULL, the tablespace identified by SYS_INDEXES.SPACE may exist and may need to be dropped. This would definitely be the case if the server had been killed right after a FILE_CREATE record was persistently written during CREATE TABLE, but before the transaction was committed. btr_free_if_exists(): Simplify the interface, to avoid repeated tablespace lookup. One more scenario is known to be broken: If the server is killed during DROP TABLE (or table-rebuilding ALTER TABLE) right after a FILE_DELETE record has been persistently written but before the file was deleted, then we could end up recovering no tablespace at all, and failing to delete the file, in either of fil_name_process() or dict_drop_index_tree(). Thanks to Elena Stepanova for providing "rr replay" and data directories of these scenarios.	2021-05-06 16:34:10 +03:00
Marko Mäkelä	f619d79bda	MDEV-25491: Race condition between DROP TABLE and purge of SYS_INDEXES record btr_free_if_exists(): Always use the BUF_GET_POSSIBLY_FREED mode when accessing pages, because due to MDEV-24589 the function fil_space_t::set_stopping(true) can be called at any time during the execution of this function. mtr_t::m_freeing_tree: New data member for debugging purposes. buf_page_get_low(): Assert that the BUF_GET mode is not being used anywhere during the execution of btr_free_if_exists(). In all code related to freeing or allocating pages, we will add some robustness, by making more use of BUF_GET_POSSIBLY_FREED and by reporting an error instead of crashing in some cases of corruption.	2021-04-28 17:24:44 +03:00
Marko Mäkelä	d2e2d32933	Merge 10.5 into 10.6	2021-04-14 12:32:27 +03:00
Marko Mäkelä	6c3e860cbf	Merge 10.4 into 10.5	2021-04-14 11:35:39 +03:00
Marko Mäkelä	5008171b05	Merge 10.3 into 10.4	2021-04-14 10:33:59 +03:00
Marko Mäkelä	b8c8692fd9	MDEV-24620 ASAN heap-buffer-overflow in btr_pcur_restore_position() Between btr_pcur_store_position() and btr_pcur_restore_position() it is possible that purge empties a table and enlarges index->n_core_fields and index->n_core_null_bytes. Therefore, we must cache index->n_core_fields in btr_pcur_t::old_n_core_fields so that btr_pcur_t::old_rec can be parsed correctly. Unfortunately, this is a huge change, because we will replace "bool leaf" parameters with "ulint n_core" (passing index->n_core_fields, or 0 for non-leaf pages). For special cases where we know that index->is_instant() cannot hold, we may also pass index->n_fields.	2021-04-13 10:28:13 +03:00
Marko Mäkelä	a43ff483fa	Merge 10.5 into 10.6	2021-03-11 20:20:07 +02:00
Marko Mäkelä	a4b7232b2c	Merge 10.4 into 10.5	2021-03-11 20:09:34 +02:00
Thirunarayanan Balathandayuthapani	06df0b0dce	MDEV-25107 Check TABLE miscalutates the length of column - This is caused by merge commit `a26e7a3726`. InnoDB fails to fetch the next index field when there is a externally stored column length check involved.	2021-03-11 20:52:54 +05:30
Marko Mäkelä	d346763479	Merge 10.5 into 10.6	2021-03-08 10:51:31 +02:00
Marko Mäkelä	a5d3c1c819	Merge 10.4 into 10.5	2021-03-08 10:16:20 +02:00
Marko Mäkelä	a26e7a3726	Merge 10.3 into 10.4	2021-03-08 09:39:54 +02:00
Vicențiu Ciorbaru	e9b8b76f47	Merge branch '10.2' into 10.3	2021-03-04 16:04:30 +02:00
Thirunarayanan Balathandayuthapani	b044898b97	MDEV-24748 extern column check missing in btr_index_rec_validate() In btr_index_rec_validate(), externally stored column check is missing while matching the length of the field with the length of the field data stored in record. Fetch the length of the externally stored part and compare it with the fixed field length.	2021-03-03 17:20:43 +05:30
Marko Mäkelä	b08448de64	MDEV-20612: Partition lock_sys.latch We replace the old lock_sys.mutex (which was renamed to lock_sys.latch) with a combination of a global lock_sys.latch and table or page hash lock mutexes. The global lock_sys.latch can be acquired in exclusive mode, or it can be acquired in shared mode and another mutex will be acquired to protect the locks for a particular page or a table. This is inspired by mysql/mysql-server@1d259b87a6 but the optimization of lock_release() will be done in the next commit. Also, we will interleave mutexes with the hash table elements, similar to how buf_pool.page_hash was optimized in commit `5155a300fa` (MDEV-22871). dict_table_t::autoinc_trx: Use Atomic_relaxed. dict_table_t::autoinc_mutex: Use srw_mutex in order to reduce the memory footprint. On 64-bit Linux or OpenBSD, both this and the new dict_table_t::lock_mutex should be 32 bits and be stored in the same 64-bit word. On Microsoft Windows, the underlying SRWLOCK is 32 or 64 bits, and on other systems, sizeof(pthread_mutex_t) can be much larger. ib_lock_t::trx_locks, trx_lock_t::trx_locks: Document the new rules. Writers must assert lock_sys.is_writer() \|\| trx->mutex_is_owner(). LockGuard: A RAII wrapper for acquiring a page hash table lock. LockGGuard: Like LockGuard, but when Galera Write-Set Replication is enabled, we must acquire all shards, for updating arbitrary trx_locks. LockMultiGuard: A RAII wrapper for acquiring two page hash table locks. lock_rec_create_wsrep(), lock_table_create_wsrep(): Special Galera conflict resolution in non-inlined functions in order to keep the common code paths shorter. lock_sys_t::prdt_page_free_from_discard(): Refactored from lock_prdt_page_free_from_discard() and lock_rec_free_all_from_discard_page(). trx_t::commit_tables(): Replaces trx_update_mod_tables_timestamp(). lock_release(): Let trx_t::commit_tables() invalidate the query cache for those tables that were actually modified by the transaction. Merge lock_check_dict_lock() to lock_release(). We must never release lock_sys.latch while holding any lock_sys_t::hash_latch. Failure to do that could lead to memory corruption if the buffer pool is resized between the time lock_sys.latch is released and the hash_latch is released.	2021-02-12 17:44:32 +02:00
Marko Mäkelä	b01d8e1a33	MDEV-20612: Replace lock_sys.mutex with lock_sys.latch For now, we will acquire the lock_sys.latch only in exclusive mode, that is, use it as a mutex. This is preparation for the next commit where we will introduce a less intrusive alternative, combining a shared lock_sys.latch with dict_table_t::lock_mutex or a mutex embedded in lock_sys.rec_hash, lock_sys.prdt_hash, or lock_sys.prdt_page_hash.	2021-02-11 14:52:10 +02:00
Marko Mäkelä	2e64513fba	MDEV-20612 preparation: Fewer calls to buf_page_t::id()	2021-02-11 12:48:07 +02:00
Thirunarayanan Balathandayuthapani	a2fbbba2e3	MDEV-24832 Root page AHI removal fails during rollback of bulk insert This failure is caused by commit `43ca6059ca` (MDEV-24720). InnoDB fails to remove the ahi entries during rollback of bulk insert operation. InnoDB should remove the AHI entries of root page before reinitialising it. Reviewed-by: Marko Mäkelä	2021-02-10 15:27:25 +05:30
Thirunarayanan Balathandayuthapani	43ca6059ca	MDEV-24720 AHI removal during rollback of bulk insert InnoDB fails to remove the ahi entries during rollback of bulk insert operation. InnoDB throws the error when validates the ahi hash tables. InnoDB should remove the ahi entries while freeing the segment only during bulk index rollback operation. Reviewed-by: Marko Mäkelä	2021-02-02 19:24:05 +05:30
Marko Mäkelä	3cef4f8f0f	MDEV-515 Reduce InnoDB undo logging for insert into empty table We implement an idea that was suggested by Michael 'Monty' Widenius in October 2017: When InnoDB is inserting into an empty table or partition, we can write a single undo log record TRX_UNDO_EMPTY, which will cause ROLLBACK to clear the table. For this to work, the insert into an empty table or partition must be covered by an exclusive table lock that will be held until the transaction has been committed or rolled back, or the INSERT operation has been rolled back (and the table is empty again), in lock_table_x_unlock(). Clustered index records that are covered by the TRX_UNDO_EMPTY record will carry DB_TRX_ID=0 and DB_ROLL_PTR=1<<55, and thus they cannot be distinguished from what MDEV-12288 leaves behind after purging the history of row-logged operations. Concurrent non-locking reads must be adjusted: If the read view was created before the INSERT into an empty table, then we must continue to imagine that the table is empty, and not try to read any records. If the read view was created after the INSERT was committed, then all records must be visible normally. To implement this, we introduce the field dict_table_t::bulk_trx_id. This special handling only applies to the very first INSERT statement of a transaction for the empty table or partition. If a subsequent statement in the transaction is modifying the initially empty table again, we must enable row-level undo logging, so that we will be able to roll back to the start of the statement in case of an error (such as duplicate key). INSERT IGNORE will continue to use row-level logging and locking, because implementing it would require the ability to roll back the latest row. Since the undo log that we write only allows us to roll back the entire statement, we cannot support INSERT IGNORE. We will introduce a handler::extra() parameter HA_EXTRA_IGNORE_INSERT to indicate to storage engines that INSERT IGNORE is being executed. In many test cases, we add an extra record to the table, so that during the 'interesting' part of the test, row-level locking and logging will be used. Replicas will continue to use row-level logging and locking until MDEV-24622 has been addressed. Likewise, this optimization will be disabled in Galera cluster until MDEV-24623 enables it. dict_table_t::bulk_trx_id: The latest active or committed transaction that initiated an insert into an empty table or partition. Protected by exclusive table lock and a clustered index leaf page latch. ins_node_t::bulk_insert: Whether bulk insert was initiated. trx_t::mod_tables: Use C++11 style accessors (emplace instead of insert). Unlike earlier, this collection will cover also temporary tables. trx_mod_table_time_t: Add start_bulk_insert(), end_bulk_insert(), is_bulk_insert(), was_bulk_insert(). trx_undo_report_row_operation(): Before accessing any undo log pages, invoke trx->mod_tables.emplace() in order to determine whether undo logging was disabled, or whether this is the first INSERT and we are supposed to write a TRX_UNDO_EMPTY record. row_ins_clust_index_entry_low(): If we are inserting into an empty clustered index leaf page, set the ins_node_t::bulk_insert flag for the subsequent trx_undo_report_row_operation() call. lock_rec_insert_check_and_lock(), lock_prdt_insert_check_and_lock(): Remove the redundant parameter 'flags' that can be checked in the caller. btr_cur_ins_lock_and_undo(): Simplify the logic. Correctly write DB_TRX_ID,DB_ROLL_PTR after invoking trx_undo_report_row_operation(). trx_mark_sql_stat_end(), ha_innobase::extra(HA_EXTRA_IGNORE_INSERT), ha_innobase::external_lock(): Invoke trx_t::end_bulk_insert() so that the next statement will not be covered by table-level undo logging. ReadView::changes_visible(trx_id_t) const: New accessor for the case where the trx_id_t is not read from a potentially corrupted index page but directly from the memory. In this case, we can skip a sanity check. row_sel(), row_sel_try_search_shortcut(), row_search_mvcc(): row_sel_try_search_shortcut_for_mysql(), row_merge_read_clustered_index(): Check dict_table_t::bulk_trx_id. row_sel_clust_sees(): Replaces lock_clust_rec_cons_read_sees(). lock_sec_rec_cons_read_sees(): Replaced with lower-level code. btr_root_page_init(): Refactored from btr_create(). dict_index_t::clear(), dict_table_t::clear(): Empty an index or table, for the ROLLBACK of an INSERT operation. ROW_T_EMPTY, ROW_OP_EMPTY: Note a concurrent ROLLBACK of an INSERT into an empty table. This is joint work with Thirunarayanan Balathandayuthapani, who created a working prototype. Thanks to Matthias Leich for extensive testing.	2021-01-25 18:41:27 +02:00
Marko Mäkelä	666565c7f0	Merge 10.5 into 10.6	2021-01-11 17:32:08 +02:00
Marko Mäkelä	8de233af81	Merge 10.4 into 10.5	2021-01-11 16:29:51 +02:00
Marko Mäkelä	d67e17bb81	MDEV-24512 Assertion failed in rec_is_metadata() in btr_discard_only_page_on_level() btr_discard_only_page_on_level(): Attempt to read the MDEV-15562 metadata record from the leaf page, not the root page. In the root, the leftmost (in this case, the only) node pointer would look like a metadata record. This corruption bug was introduced in commit `0e5a4ac253` (MDEV-15562). The scenario is rare: a column was dropped instantly or the order of columns was changed instantly, and then the table became empty in such a way that in the last step, the root page had one child page. Normally, a non-leaf B-tree page would always contain at least 2 children.	2021-01-02 16:11:55 +02:00
Marko Mäkelä	30dc4287ec	Merge 10.5 into 10.6	2020-12-19 12:23:06 +02:00
Marko Mäkelä	0c23e32d27	MDEV-24445 Using innodb_undo_tablespaces corrupts system tablespace In the rewrite of MDEV-8139 (based on MDEV-15528), we introduced a wrong assumption that any persistent tablespace that is not an .ibd file is the system tablespace. This assumption is broken when innodb_undo_tablespaces (files undo001, undo002, ...) are being used. By default, we have innodb_undo_tablespaces=0 (the persistent undo log is being stored in the system tablespace). In MDEV-15528 and MDEV-8139 we rewrote the page scrubbing logic so that it will follow the tried-and-true write-ahead logging protocol, first writing FREE_PAGE records and then in the page flushing, zerofilling or hole-punching freed pages. Unfortunately, the implementation included a wrong assumption that that anything that is not in an .ibd file must be the system tablespace. This wrong assumption would cause overwrites of valid data pages in the system tablespace. mtr_t::m_freed_in_system_tablespace: Remove. mtr_t::m_freed_space: The tablespace associated with m_freed_pages. buf_page_free(): Take the tablespace and page number as a parameter, instead of taking a page identifier.	2020-12-18 19:23:34 +02:00
Marko Mäkelä	ff5d306e29	MDEV-21452: Replace ib_mutex_t with mysql_mutex_t SHOW ENGINE INNODB MUTEX functionality is completely removed, as are the InnoDB latching order checks. We will enforce innodb_fatal_semaphore_wait_threshold only for dict_sys.mutex and lock_sys.mutex. dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex. lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex. FIXME: srv_sys should be removed altogether; it is duplicating tpool functionality. fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must not hold fil_system.mutex. fil_close_all_files(): To prevent SAFE_MUTEX warnings for fil_space_destroy_crypt_data(), we must not hold fil_system.mutex while invoking fil_space_free_low() on a detached tablespace.	2020-12-15 17:56:18 +02:00
Marko Mäkelä	38fd7b7d91	MDEV-21452: Replace all direct use of os_event_t Let us replace os_event_t with mysql_cond_t, and replace the necessary ib_mutex_t with mysql_mutex_t so that they can be used with condition variables. Also, let us replace polling (os_thread_sleep() or timed waits) with plain mysql_cond_wait() wherever possible. Furthermore, we will use the lightweight srw_mutex for trx_t::mutex, to hopefully reduce contention on lock_sys.mutex. FIXME: Add test coverage of mariabackup --backup --kill-long-queries-timeout	2020-12-15 17:56:17 +02:00
Marko Mäkelä	9702be2c73	MDEV-24142: Remove __FILE__,__LINE__ related to buf_block_t::lock	2020-12-03 15:28:53 +02:00

1 2 3 4 5 ...

277 commits