mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-17 04:22:27 +01:00

Author	SHA1	Message	Date
Marko Mäkelä	25e2a556de	MDEV-21133 Optimize access to InnoDB page header fields Introduce memcpy_aligned<N>(), memcmp_aligned<N>(), memset_aligned<N>() and use them for accessing InnoDB page header fields that are known to be aligned. MY_ASSUME_ALIGNED(): Wrapper for the GCC/clang __builtin_assume_aligned(). Nothing similar seems to exist in Microsoft Visual Studio, and the C++20 std::assume_aligned is not available to us yet. Explicitly specified alignment guarantees allow compilers to generate faster code on platforms with strict alignment rules, instead of emitting calls to potentially unaligned memcpy(), memcmp(), or memset().	2019-11-26 10:15:03 +02:00
Vladislav Vaintroub	5e62b6a5e0	MDEV-16264 Use threadpool for InnoDB background work. Almost all threads are gone. - The "ticking" threads, which sleep for a while and then do some work (srv_monitor_thread, srv_error_monitor_thread, srv_master_thread), were replaced with timers. Some timers are periodic, e.g. the "master" timer. - The btr_defragment_thread is also replaced by a timer, which reschedules itself when the current defragment "item" needs throttling. - The buf_resize_thread and buf_dump_threads are replaced with tasks. Ditto for the page cleaner workers. - The purge worker threads are now tasks as well, and the purge cleaner coordinator is a combination of a task and a timer. - All AIO is outsourced to tpool; InnoDB just calls thread_pool::submit_io() and provides the callback. - The srv_slot_t was removed. The innodb_debug_sync used in purge is currently not working and needs reimplementation.	2019-11-15 18:09:30 +01:00
Marko Mäkelä	55c75b6bb3	Merge 10.3 into 10.4	2019-10-12 06:50:12 +03:00
Marko Mäkelä	8e3d85e112	Merge 10.2 into 10.3	2019-10-12 06:34:09 +03:00
Marko Mäkelä	966d97b5f9	Merge 10.1 into 10.2	2019-10-11 18:38:18 +03:00
Marko Mäkelä	c0c003beb4	MDEV-20805 follow-up: Catch writes of bogus pages buf_flush_init_for_writing(): Assert that FIL_PAGE_TYPE is set except when creating a new data file with a dummy first page. buf_dblwr_create(): Ensure that FIL_PAGE_TYPE on all pages will be initialized. Reset buf_dblwr_being_created at the end.	2019-10-11 15:32:04 +03:00
Marko Mäkelä	09afd3da1a	Merge 10.3 into 10.4	2019-10-10 21:30:40 +03:00
Marko Mäkelä	7f84e3ad75	Merge 10.2 into 10.3	2019-10-10 20:38:44 +03:00
Marko Mäkelä	6fde0073bf	Rename log_make_checkpoint_at() to log_make_checkpoint() The function was always called with lsn=LSN_MAX. Remove that redundant parameter. Spotted by Thirunarayanan Balathandayuthapani.	2019-10-09 18:47:14 +03:00
Marko Mäkelä	e82fe21e3a	Merge 10.2 into 10.3	2019-07-02 17:46:22 +03:00
Thirunarayanan Balathandayuthapani	723a4b1d78	MDEV-17228 Encrypted temporary tables are not encrypted - Introduce a new boolean variable, innodb_encrypt_temporary_tables, which decides whether to encrypt the temporary tablespace. - Encrypt the temporary tablespace based on the full checksum format. - Introduce a new counter to track encrypted and decrypted temporary tablespace pages. - Issue a warning if temporary table creation conflicts with the value of innodb_encrypt_temporary_tables. - Add a new test case that reads and writes pages from/to the temporary tablespace.	2019-06-28 19:07:59 +05:30
Marko Mäkelä	5d2619b693	MDEV-19584 Allocate recv_sys statically There is only one InnoDB crash recovery subsystem. Allocating recv_sys statically removes one level of pointer indirection and makes code more readable, and removes the awkward initialization of recv_sys->dblwr. recv_sys_t::create(): Replaces recv_sys_init(). recv_sys_t::debug_free(): Replaces recv_sys_debug_free(). recv_sys_t::close(): Replaces recv_sys_close(). recv_sys_t::add(): Replaces recv_add_to_hash_table(). recv_sys_t::empty(): Replaces recv_sys_empty_hash().	2019-05-24 16:19:38 +03:00
Oleksandr Byelkin	c07325f932	Merge branch '10.3' into 10.4	2019-05-19 20:55:37 +02:00
Marko Mäkelä	be85d3e61b	Merge 10.2 into 10.3	2019-05-14 17:18:46 +03:00
Marko Mäkelä	26a14ee130	Merge 10.1 into 10.2	2019-05-13 17:54:04 +03:00
Oleksandr Byelkin	c51f85f882	Merge branch '10.2' into 10.3	2019-05-12 17:20:23 +02:00
Vicențiu Ciorbaru	c0ac0b8860	Update FSF address	2019-05-11 19:25:02 +03:00
Marko Mäkelä	b2f3755c8e	Merge 10.1 into 10.2	2019-05-10 08:02:21 +03:00
Thirunarayanan Balathandayuthapani	3e8cab51cb	MDEV-13893 encryption.innodb-redo-badkey failed in buildbot with page cannot be decrypted buf_dblwr_process(): Remove the useless warning that a copy of a page in the doublewrite buffer is corrupted. We already report an error if a corrupted page cannot be recovered from the doublewrite buffer. Note: In MariaDB 10.1, the original bug reported in MDEV-13893 could still be easily repeatable. In MariaDB 10.2.24, MDEV-12699 should have reduced the probability considerably.	2019-05-10 07:57:01 +03:00
Marko Mäkelä	d3dcec5d65	Merge 10.3 into 10.4	2019-05-05 15:06:44 +03:00
Marko Mäkelä	b6f4cccd19	Merge 10.2 into 10.3	2019-05-03 20:14:09 +03:00
Marko Mäkelä	3db94d2403	MDEV-19346: Remove dummy InnoDB log checkpoints log_checkpoint(), log_make_checkpoint_at(): Remove the parameter write_always. It seems that the primary purpose of this parameter was to ensure in the function recv_reset_logs() that both checkpoint header pages will be overwritten, when the function is called from the never-enabled function recv_recovery_from_archive_start(). create_log_files(): Merge recv_reset_logs() to its only caller. Debug instrumentation: Prefer to flush the redo log, instead of triggering a redo log checkpoint. page_header_set_field(): Disable a debug assertion that will always fail due to MDEV-19344, now that we no longer initiate a redo log checkpoint before an injected crash. In recv_reset_logs() there used to be two calls to log_make_checkpoint_at(). The apparent purpose of this was to ensure that both InnoDB redo log checkpoint header pages will be initialized or overwritten. The second call was removed (without any explanation) in MySQL 5.6.3: mysql/mysql-server@4ca37968da In MySQL 5.6.8 WL#6494, starting with mysql/mysql-server@00a0ba8ad9 the function recv_reset_logs() was not only invoked during InnoDB data file initialization, but also during a regular startup when the redo log is being resized. mysql/mysql-server@45e9167983 in MySQL 5.7.2 removed the UNIV_LOG_ARCHIVE code, but still did not remove the parameter write_always.	2019-05-03 20:02:11 +03:00
Marko Mäkelä	6b6fa3cdb1	MDEV-18644: Support full_crc32 for page_compressed This is a follow-up task to MDEV-12026, which introduced innodb_checksum_algorithm=full_crc32 and a simpler page format. MDEV-12026 did not enable full_crc32 for page_compressed tables, which we will be doing now. This is joint work with Thirunarayanan Balathandayuthapani. For innodb_checksum_algorithm=full_crc32 we change the page_compressed format as follows: FIL_PAGE_TYPE: The most significant bit will be set to indicate page_compressed format. The least significant bits will contain the compressed page size, rounded up to a multiple of 256 bytes. The checksum will be stored in the last 4 bytes of the page (whether it is the full page or a page_compressed page whose size is determined by FIL_PAGE_TYPE), covering all preceding bytes of the page. If encryption is used, then the page will be encrypted between compression and computing the checksum. For page_compressed, FIL_PAGE_LSN will not be repeated at the end of the page. FSP_SPACE_FLAGS (already implemented as part of MDEV-12026): We will store the innodb_compression_algorithm that may be used to compress pages. Previously, the choice of algorithm was written to each compressed data page separately, and one would be unable to know in advance which compression algorithm(s) are used. fil_space_t::full_crc32_page_compressed_len(): Determine if the page_compressed algorithm of the tablespace needs to know the exact length of the compressed data. If yes, we will reserve and write an extra byte for this right before the checksum. buf_page_is_compressed(): Determine if a page uses page_compressed (in any innodb_checksum_algorithm). fil_page_decompress(): Pass also fil_space_t::flags so that the format can be determined. buf_page_is_zeroes(): Check if a page is full of zero bytes. buf_page_full_crc32_is_corrupted(): Renamed from buf_encrypted_full_crc32_page_is_corrupted(). 
For full_crc32, we always simply validate the checksum to the page contents, while the physical page size is explicitly specified by an unencrypted part of the page header. buf_page_full_crc32_size(): Determine the size of a full_crc32 page. buf_dblwr_check_page_lsn(): Make this a debug-only function, because it involves potentially costly lookups of fil_space_t. create_table_info_t::check_table_options(), ha_innobase::check_if_supported_inplace_alter(): Do allow the creation of SPATIAL INDEX with full_crc32 also when page_compressed is used. commit_cache_norebuild(): Preserve the compression algorithm when updating the page_compression_level. dict_tf_to_fsp_flags(): Set the flags for page compression algorithm. FIXME: Maybe there should be a table option page_compression_algorithm and a session variable to back it?	2019-03-18 14:08:43 +02:00
Thirunarayanan Balathandayuthapani	c0f47a4a58	MDEV-12026: Implement innodb_checksum_algorithm=full_crc32 MariaDB data-at-rest encryption (innodb_encrypt_tables) had repurposed the same unused data field that was repurposed in MySQL 5.7 (and MariaDB 10.2) for the Split Sequence Number (SSN) field of SPATIAL INDEX. Because of this, MariaDB was unable to support encryption on SPATIAL INDEX pages. Furthermore, InnoDB page checksums skipped some bytes, and there are multiple variations and checksum algorithms. By default, InnoDB accepts all variations of all algorithms that ever existed. This unnecessarily weakens the page checksums. We hereby introduce two more innodb_checksum_algorithm variants (full_crc32, strict_full_crc32) that are special in a way: When either setting is active, newly created data files will carry a flag (fil_space_t::full_crc32()) that indicates that all pages of the file will use a full CRC-32C checksum over the entire page contents (excluding the bytes where the checksum is stored, at the very end of the page). Such files will always use that checksum, no matter what the parameter innodb_checksum_algorithm is assigned to. For old files, the old checksum algorithms will continue to be used. The value strict_full_crc32 will be equivalent to strict_crc32 and the value full_crc32 will be equivalent to crc32. ROW_FORMAT=COMPRESSED tables will only use the old format. These tables do not support new features, such as larger innodb_page_size or instant ADD/DROP COLUMN. They may be deprecated in the future. We do not want an unnecessary file format change for them. The new full_crc32() format also cleans up the MariaDB tablespace flags. We will reserve flags to store the page_compressed compression algorithm, and to store the compressed payload length, so that checksum can be computed over the compressed (and possibly encrypted) stream and can be validated without decrypting or decompressing the page. 
In the full_crc32 format, there no longer are separate before-encryption and after-encryption checksums for pages. The single checksum is computed on the page contents that is written to the file. We do not make the new algorithm the default for two reasons. First, MariaDB 10.4.2 was a beta release, and the default values of parameters should not change after beta. Second, we did not yet implement the full_crc32 format for page_compressed pages. This will be fixed in MDEV-18644. This is joint work with Marko Mäkelä.	2019-02-19 18:50:19 +02:00
Marko Mäkelä	0a1c3477bf	MDEV-18493 Remove page_size_t MySQL 5.7 introduced the class page_size_t and increased the size of buffer pool page descriptors by introducing this object to them. Maybe the intention of this exercise was to prepare for a future where the buffer pool could accommodate multiple page sizes. But that future never arrived, not even in MySQL 8.0. It is much easier to manage a pool of a single page size, and typically all storage devices of an InnoDB instance benefit from using the same page size. Let us remove page_size_t from MariaDB Server. This will make it easier to remove support for ROW_FORMAT=COMPRESSED (or make it a compile-time option) in the future, just by removing various occurrences of zip_size.	2019-02-07 12:21:35 +02:00
Marko Mäkelä	b5763ecd01	Merge 10.3 into 10.4	2018-12-18 11:33:53 +02:00
Marko Mäkelä	45531949ae	Merge 10.2 into 10.3	2018-12-18 09:15:41 +02:00
Marko Mäkelä	7d245083a4	Merge 10.1 into 10.2	2018-12-17 20:15:38 +02:00
Marko Mäkelä	8c43f96388	Follow-up to MDEV-12112: corruption in encrypted table may be overlooked The initial fix only covered a part of Mariabackup. This fix hardens InnoDB and XtraDB in a similar way, in order to reduce the probability of mistaking a corrupted encrypted page for a valid unencrypted one. This is based on work by Thirunarayanan Balathandayuthapani. fil_space_verify_crypt_checksum(): Assert that key_version!=0. Let the callers guarantee that. Now that we have this assertion, we also know that buf_page_is_zeroes() cannot hold. Also, remove all diagnostic output and related parameters, and let the relevant callers emit such messages. Last but not least, validate the post-encryption checksum according to the innodb_checksum_algorithm (only accepting one checksum for the strict variants), and no longer try to validate the page as if it was unencrypted. buf_page_is_zeroes(): Move to the compilation unit of the only callers, and declare static. xb_fil_cur_read(), buf_page_check_corrupt(): Add a condition before calling fil_space_verify_crypt_checksum(). This is a non-functional change. buf_dblwr_process(): Validate the page only as encrypted or unencrypted, but not both.	2018-12-17 19:33:44 +02:00
Marko Mäkelä	dde2ca4aa1	Merge 10.3 into 10.4	2018-11-19 20:22:33 +02:00
Marko Mäkelä	fd58bb71e2	Merge 10.2 into 10.3	2018-11-19 18:45:53 +02:00
Marko Mäkelä	ff88e4bb8a	Remove many redundant #include from InnoDB	2018-11-19 11:42:14 +02:00
Marko Mäkelä	074c684099	Merge 10.3 into 10.4	2018-11-06 16:24:16 +02:00
Marko Mäkelä	df563e0c03	Merge 10.2 into 10.3 main.derived_cond_pushdown: Move all 10.3 tests to the end, trim trailing white space, and add an "End of 10.3 tests" marker. Add --sorted_result to tests where the ordering is not deterministic. main.win_percentile: Add --sorted_result to tests where the ordering is no longer deterministic.	2018-11-06 09:40:39 +02:00
Marko Mäkelä	b3009059d0	Minor cleanup	2018-10-29 12:05:39 +02:00
Marko Mäkelä	09af00cbde	MDEV-13564: Remove old crash-upgrade logic in 10.4 Stop supporting the additional trunc.log files that were introduced via MySQL 5.7 to MariaDB Server 10.2 and 10.3. DB_TABLESPACE_TRUNCATED: Remove. purge_sys.truncate: A new structure to track undo tablespace file truncation. srv_start(): Remove the call to buf_pool_invalidate(). It is no longer necessary, given that we no longer access things in ways that violate the ARIES protocol. This call was originally added for innodb_file_format, and it may later have been necessary for the proper function of the MySQL 5.7 TRUNCATE recovery, which we are now removing. trx_purge_cleanse_purge_queue(): Take the undo tablespace as a parameter. trx_purge_truncate_history(): Rewrite everything mostly in a single function, replacing references to undo::Truncate. recv_apply_hashed_log_recs(): If any redo log is to be applied, and if the log_sys.log.subformat indicates that separately logged truncate may have been used, refuse to proceed except if innodb_force_recovery is set. We will still refuse crash-upgrade if TRUNCATE TABLE was logged. Undo tablespace truncation would only be logged in undotrunc.log files, which we are no longer checking for.	2018-09-11 21:32:15 +03:00
Marko Mäkelä	5a1868b58d	MDEV-13564 Mariabackup does not work with TRUNCATE This is a merge from 10.2, but the 10.2 version of this will not be pushed into 10.2 yet, because the 10.2 version would include backports of MDEV-14717 and MDEV-14585, which would introduce a crash recovery regression: Tables could be lost on table-rebuilding DDL operations, such as ALTER TABLE, OPTIMIZE TABLE or this new backup-friendly TRUNCATE TABLE. The test innodb.truncate_crash occasionally loses the table due to the following bug: MDEV-17158 log_write_up_to() sometimes fails	2018-09-07 22:15:06 +03:00
Marko Mäkelä	055a3334ad	MDEV-13564 Mariabackup does not work with TRUNCATE Implement undo tablespace truncation via normal redo logging. Implement TRUNCATE TABLE as a combination of RENAME to #sql-ib name, CREATE, and DROP. Note: Orphan #sql-ib.ibd may be left behind if MariaDB Server 10.2 is killed before the DROP operation is committed. If MariaDB Server 10.2 is killed during TRUNCATE, it is also possible that the old table was renamed to #sql-ib.ibd but the data dictionary will refer to the table using the original name. In MariaDB Server 10.3, RENAME inside InnoDB is transactional, and #sql-* tables will be dropped on startup. So, this new TRUNCATE will be fully crash-safe in 10.3. ha_mroonga::wrapper_truncate(): Pass table options to the underlying storage engine, now that ha_innobase::truncate() will need them. rpl_slave_state::truncate_state_table(): Before truncating mysql.gtid_slave_pos, evict any cached table handles from the table definition cache, so that there will be no stale references to the old table after truncating. == TRUNCATE TABLE == WL#6501 in MySQL 5.7 introduced separate log files for implementing atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB undo and redo log. Some convoluted logic was added to the InnoDB crash recovery, and some extra synchronization (including a redo log checkpoint) was introduced to make this work. This synchronization has caused performance problems and race conditions, and the extra log files cannot be copied or applied by external backup programs. In order to support crash-upgrade from MariaDB 10.2, we will keep the logic for parsing and applying the extra log files, but we will no longer generate those files in TRUNCATE TABLE. A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE (with full redo and undo logging and proper rollback). This will be implemented in MDEV-14717. ha_innobase::truncate(): Invoke RENAME, create(), delete_table(). 
Because RENAME cannot be fully rolled back before MariaDB 10.3 due to missing undo logging, add some explicit rename-back in case the operation fails. ha_innobase::delete(): Introduce a variant that takes sqlcom as a parameter. In TRUNCATE TABLE, we do not want to touch any FOREIGN KEY constraints. ha_innobase::create(): Add the parameters file_per_table, trx. In TRUNCATE, the new table must be created in the same transaction that renames the old table. create_table_info_t::create_table_info_t(): Add the parameters file_per_table, trx. row_drop_table_for_mysql(): Replace a bool parameter with sqlcom. row_drop_table_after_create_fail(): New function, wrapping row_drop_table_for_mysql(). dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(), fil_prepare_for_truncate(), fil_reinit_space_header_for_table(), row_truncate_table_for_mysql(), TruncateLogger, row_truncate_prepare(), row_truncate_rollback(), row_truncate_complete(), row_truncate_fts(), row_truncate_update_system_tables(), row_truncate_foreign_key_checks(), row_truncate_sanity_checks(): Remove. row_upd_check_references_constraints(): Remove a check for TRUNCATE, now that the table is no longer truncated in place. The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some race-condition like scenarios. The test innodb-innodb.truncate does not use any synchronization. We add a redo log subformat to indicate backup-friendly format. MariaDB 10.4 will remove support for the old TRUNCATE logging, so crash-upgrade from old 10.2 or 10.3 to 10.4 will involve limitations. == Undo tablespace truncation == MySQL 5.7 implements undo tablespace truncation. It is only possible when innodb_undo_tablespaces is set to at least 2. The logging is implemented similar to the WL#6501 TRUNCATE, that is, using separate log files and a redo log checkpoint. We can simply implement undo tablespace truncation within a single mini-transaction that reinitializes the undo log tablespace file. 
Unfortunately, due to the redo log format of some operations, currently, the total redo log written by undo tablespace truncation will be more than the combined size of the truncated undo tablespace. It should be acceptable to have a little more than 1 megabyte of log in a single mini-transaction. This will be fixed in MDEV-17138 in MariaDB Server 10.4. recv_sys_t: Add truncated_undo_spaces[] to remember for which undo tablespaces a MLOG_FILE_CREATE2 record was seen. namespace undo: Remove some unnecessary declarations. fil_space_t::is_being_truncated: Document that this flag now only applies to undo tablespaces. Remove some references. fil_space_t::is_stopping(): Do not refer to is_being_truncated. This check is for tablespaces of tables. Potentially used tablespaces are never truncated any more. buf_dblwr_process(): Suppress the out-of-bounds warning for undo tablespaces. fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero page number (new size of the tablespace in pages) to inform crash recovery that the undo tablespace size has been reduced. fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2 can be written for undo tablespaces (without .ibd file suffix) for a nonzero page number. os_file_truncate(): Add the parameter allow_shrink=false so that undo tablespaces can actually be shrunk using this function. fil_name_parse(): For undo tablespace truncation, buffer MLOG_FILE_CREATE2 in truncated_undo_spaces[]. recv_read_in_area(): Avoid reading pages for which no redo log records remain buffered, after recv_addr_trim() removed them. trx_rseg_header_create(): Add a FIXME comment that we could write much less redo log. trx_undo_truncate_tablespace(): Reinitialize the undo tablespace in a single mini-transaction, which will be flushed to the redo log before the file size is trimmed. recv_addr_trim(): Discard any redo logs for pages that were logged after the new end of a file, before the truncation LSN. 
If the rec_list becomes empty, reduce n_addrs. After removing any affected records, actually truncate the file. recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before applying any log records. The undo tablespace files must be open at this point. buf_flush_or_remove_pages(), buf_flush_dirty_pages(), buf_LRU_flush_or_remove_pages(): Add a parameter for specifying the number of the first page to flush or remove (default 0). trx_purge_initiate_truncate(): Remove the log checkpoints, the extra logging, and some unnecessary crash points. Merge the code from trx_undo_truncate_tablespace(). First, flush all to-be-discarded pages (beyond the new end of the file), then trim the space->size to make the page allocation deterministic. At the only remaining crash injection point, flush the redo log, so that the recovery can be tested.	2018-09-07 22:10:02 +03:00
Marko Mäkelä	0121d5a790	Merge 10.2 into 10.3	2018-06-18 15:43:59 +03:00
Marko Mäkelä	2ca904f0ca	MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller to detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. Also, validate the page checksum before decryption, and reduce the scope of some variables. fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused). AbstractCallback::operator()(): Remove the parameter 'offset'. The check for it in FetchIndexRootPages::operator() was basically redundant and dead code since the previous refactoring.	2018-06-14 14:23:01 +03:00
Marko Mäkelä	f5eb37129f	MDEV-13103 Deal with page_compressed page corruption fil_page_decompress(): Replaces fil_decompress_page(). Allow the caller to detect errors. Remove duplicated code. Use the "safe" instead of "fast" variants of decompression routines. fil_page_compress(): Replaces fil_compress_page(). The length of the input buffer always was srv_page_size (innodb_page_size). Remove printouts, and remove the fil_space_t* parameter. buf_tmp_buffer_t::reserved: Make private; the accessors acquire() and release() will use atomic memory access. buf_pool_reserve_tmp_slot(): Make static. Remove the second parameter. Do not acquire any mutex. Remove the allocation of the buffers. buf_tmp_reserve_crypt_buf(), buf_tmp_reserve_compression_buf(): Refactored away from buf_pool_reserve_tmp_slot(). buf_page_decrypt_after_read(): Make static, and simplify the logic. Use the encryption buffer also for decompressing. buf_page_io_complete(), buf_dblwr_process(): Check more failures. fil_space_encrypt(): Simplify the debug checks. fil_space_t::printed_compression_failure: Remove. fil_get_compression_alg_name(): Remove. fil_iterate(): Allocate a buffer for compression and decompression only once, instead of allocating and freeing it for every page that uses compression, during IMPORT TABLESPACE. fil_node_get_space_id(), fil_page_is_index_page(), fil_page_is_lzo_compressed(): Remove (unused code).	2018-06-14 13:46:07 +03:00
Marko Mäkelä	9ed2b2b2b8	Do not divide or multiply by srv_page_size Instead, shift by srv_page_size_shift.	2018-04-28 20:52:22 +03:00
Marko Mäkelä	a90100d756	Replace univ_page_size and UNIV_PAGE_SIZE Try to use one variable (srv_page_size) for innodb_page_size. Also, replace UNIV_PAGE_SIZE_SHIFT with srv_page_size_shift.	2018-04-28 20:45:45 +03:00
Marko Mäkelä	4cad42392a	MDEV-12266: Change dict_table_t::space to fil_space_t* InnoDB always keeps all tablespaces in the fil_system cache. The fil_system.LRU is only for closing file handles; the fil_space_t and fil_node_t for all data files will remain in main memory. Between startup and shutdown, they can only be created and removed by DDL statements. Therefore, we can let dict_table_t::space point directly to the fil_space_t. dict_table_t::space_id: A numeric tablespace ID for the corner cases where we do not have a tablespace. The most prominent examples are ALTER TABLE...DISCARD TABLESPACE or a missing or corrupted file. There are a few functional differences; most notably: (1) DROP TABLE will delete matching .ibd and .cfg files, even if they were not attached to the data dictionary. (2) Some error messages will report file names instead of numeric IDs. There still are many functions that use numeric tablespace IDs instead of fil_space_t, and many functions could be converted to fil_space_t member functions. Also, Tablespace and Datafile should be merged with fil_space_t and fil_node_t. page_id_t and buf_page_get_gen() could use fil_space_t& instead of a numeric ID, and after moving to a single buffer pool (MDEV-15058), buf_pool_t::page_hash could be moved to fil_space_t::page_hash. FilSpace: Remove. Only a few calls to fil_space_acquire() will remain, and gradually they should be removed. mtr_t::set_named_space_id(ulint): Renamed from set_named_space(), to prevent accidental calls to this slower function. Very few callers remain. fseg_create(), fsp_reserve_free_extents(): Take fil_space_t as a parameter instead of a space_id. fil_space_t::rename(): Wrapper for fil_rename_tablespace_check(), fil_name_write_rename(), fil_rename_tablespace(). Mariabackup passes the parameter log=false; InnoDB passes log=true. dict_mem_table_create(): Take fil_space_t* instead of space_id as parameter. 
dict_process_sys_tables_rec_and_mtr_commit(): Replace the parameter 'status' with 'bool cached'. dict_get_and_save_data_dir_path(): Avoid copying the fil_node_t::name. fil_ibd_open(): Return the tablespace. fil_space_t::set_imported(): Replaces fil_space_set_imported(). truncate_t: Change many member function parameters to fil_space_t*, and remove page_size parameters. row_truncate_prepare(): Merge to its only caller. row_drop_table_from_cache(): Assert that the table is persistent. dict_create_sys_indexes_tuple(): Write SYS_INDEXES.SPACE=FIL_NULL if the tablespace has been discarded. row_import_update_discarded_flag(): Remove a constant parameter.	2018-03-29 22:02:05 +03:00
Marko Mäkelä	7fb03d7abf	Merge bb-10.2-ext into 10.3	2018-03-13 08:15:06 +02:00
Marko Mäkelä	112df06996	MDEV-15529 IMPORT TABLESPACE unnecessarily uses the doublewrite buffer fil_space_t::atomic_write_supported: Always set this flag for TEMPORARY TABLESPACE and during IMPORT TABLESPACE. The page writes during these operations are by definition not crash-safe because they are not written to the redo log. fil_space_t::use_doublewrite(): Determine if doublewrite should be used. buf_dblwr_update(): Add assertions, and let the caller check whether doublewrite buffering is desired. buf_flush_write_block_low(): Disable the doublewrite buffer for the temporary tablespace and for IMPORT TABLESPACE. fil_space_set_imported(), fil_node_open_file(), fil_space_create(): Initialize or revise the space->atomic_write_supported flag. buf_page_io_complete(), buf_flush_write_complete(): Add the parameter dblwr, to indicate whether doublewrite was used for writes. buf_dblwr_sync_datafiles(): Remove an unnecessary flush of persistent tablespaces when flushing temporary tablespaces. (Move the call to buf_dblwr_flush_buffered_writes().)	2018-03-10 11:54:34 +02:00
Marko Mäkelä	a4948dafcd	MDEV-11369 Instant ADD COLUMN for InnoDB For InnoDB tables, adding, dropping and reordering columns has required a rebuild of the table and all its indexes. Since MySQL 5.6 (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing concurrent modification of the tables. This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously, with only minor changes performed to the table structure. The counter innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS is incremented whenever a table rebuild operation is converted into an instant ADD COLUMN operation. ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN. Some usability limitations will be addressed in subsequent work: MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY and ALGORITHM=INSTANT MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE The format of the clustered index (PRIMARY KEY) is changed as follows: (1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT, and a new field PAGE_INSTANT will contain the original number of fields in the clustered index ('core' fields). If instant ADD COLUMN has not been used or the table becomes empty, or the very first instant ADD COLUMN operation is rolled back, the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset to 0 and FIL_PAGE_INDEX. (2) A special 'default row' record is inserted into the leftmost leaf, between the page infimum and the first user record. This record is distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the same format as records that contain values for the instantly added columns. This 'default row' always has the same number of fields as the clustered index according to the table definition. The values of 'core' fields are to be ignored. For other fields, the 'default row' will contain the default values as they were during the ALTER TABLE statement. 
(If the column default values are changed later, those values will only be stored in the .frm file. The 'default row' will contain the original evaluated values, which must be the same for every row.) The 'default row' must be completely hidden from higher-level access routines. Assertions have been added to ensure that no 'default row' is ever present in the adaptive hash index or in locked records. The 'default row' is never delete-marked. (3) In clustered index leaf page records, the number of fields must reside between the number of 'core' fields (dict_index_t::n_core_fields introduced in this work) and dict_index_t::n_fields. If the number of fields is less than dict_index_t::n_fields, the missing fields are replaced with the column value of the 'default row'. Note: The number of fields in the record may shrink if some of the last instantly added columns are updated to the value that is in the 'default row'. The function btr_cur_trim() implements this 'compression' on update and rollback; dtuple::trim() implements it on insert. (4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new status value REC_STATUS_COLUMNS_ADDED will indicate the presence of a new record header that will encode n_fields-n_core_fields-1 in 1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header always explicitly encodes the number of fields.) We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for covering the insert of the 'default row' record when instant ADD COLUMN is used for the first time. Subsequent instant ADD COLUMN can use TRX_UNDO_UPD_EXIST_REC. This is joint work with Vin Chen (陈福荣) from Tencent. The design that was discussed in April 2017 would not have allowed import or export of data files, because instead of the 'default row' it would have introduced a data dictionary table. The test rpl.rpl_alter_instant is exactly as contributed in pull request #408. The test innodb.instant_alter is based on a contributed test. 
The redo log record format changes for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT are as contributed. (With this change present, crash recovery from MariaDB 10.3.1 will fail in spectacular ways!) Also the semantics of higher-level redo log records that modify the PAGE_INSTANT field is changed. The redo log format version identifier was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1. Everything else has been rewritten by me. Thanks to Elena Stepanova, the code has been tested extensively. When rolling back an instant ADD COLUMN operation, we must empty the PAGE_FREE list after deleting or shortening the 'default row' record, by calling either btr_page_empty() or btr_page_reorganize(). We must know the size of each entry in the PAGE_FREE list. If rollback left a freed copy of the 'default row' in the PAGE_FREE list, we would be unable to determine its size (if it is in ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC) because it would contain more fields than the rolled-back definition of the clustered index. UNIV_SQL_DEFAULT: A new special constant that designates an instantly added column that is not present in the clustered index record. len_is_stored(): Check if a length is an actual length. There are two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL. dict_col_t::def_val: The 'default row' value of the column. If the column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT. dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(), instant_value(). dict_col_t::remove_instant(): Remove the 'instant ADD' status of a column. dict_col_t::name(const dict_table_t& table): Replaces dict_table_get_col_name(). dict_index_t::n_core_fields: The original number of fields. For secondary indexes and if instant ADD COLUMN has not been used, this will be equal to dict_index_t::n_fields. dict_index_t::n_core_null_bytes: Number of bytes needed to represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable). 
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that n_core_null_bytes was not initialized yet from the clustered index root page. dict_index_t: Add the accessors is_instant(), is_clust(), get_n_nullable(), instant_field_value(). dict_index_t::instant_add_field(): Adjust clustered index metadata for instant ADD COLUMN. dict_index_t::remove_instant(): Remove the 'instant ADD' status of a clustered index when the table becomes empty, or the very first instant ADD COLUMN operation is rolled back. dict_table_t: Add the accessors is_instant(), is_temporary(), supports_instant(). dict_table_t::instant_add_column(): Adjust metadata for instant ADD COLUMN. dict_table_t::rollback_instant(): Adjust metadata on the rollback of instant ADD COLUMN. prepare_inplace_alter_table_dict(): First create the ctx->new_table, and only then decide if the table really needs to be rebuilt. We must split the creation of table or index metadata from the creation of the dictionary table records and the creation of the data. In this way, we can transform a table-rebuilding operation into an instant ADD COLUMN operation. Dictionary objects will only be added to cache when table rebuilding or index creation is needed. The ctx->instant_table will never be added to cache. dict_table_t::add_to_cache(): Modified and renamed from dict_table_add_to_cache(). Do not modify the table metadata. Let the callers invoke dict_table_add_system_columns() and if needed, set can_be_evicted. dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the system columns (which will now exist in the dict_table_t object already at this point). dict_create_table_step(): Expect the callers to invoke dict_table_add_system_columns(). pars_create_table(): Before creating the table creation execution graph, invoke dict_table_add_system_columns(). row_create_table_for_mysql(): Expect all callers to invoke dict_table_add_system_columns(). create_index_dict(): Replaces row_merge_create_index_graph(). 
innodb_update_n_cols(): Renamed from innobase_update_n_virtual(). Call my_error() if an error occurs. btr_cur_instant_init(), btr_cur_instant_init_low(), btr_cur_instant_root_init(): Load additional metadata from the clustered index and set dict_index_t::n_core_null_bytes. This is invoked when table metadata is first loaded into the data dictionary. dict_boot(): Initialize n_core_null_bytes for the four hard-coded dictionary tables. dict_create_index_step(): Initialize n_core_null_bytes. This is executed as part of CREATE TABLE. dict_index_build_internal_clust(): Initialize n_core_null_bytes to NO_CORE_NULL_BYTES if table->supports_instant(). row_create_index_for_mysql(): Initialize n_core_null_bytes for CREATE TEMPORARY TABLE. commit_cache_norebuild(): Call the code to rename or enlarge columns in the cache only if instant ADD COLUMN is not being used. (Instant ADD COLUMN would copy all column metadata from instant_table to old_table, including the names and lengths.) PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields. This is repurposing the 16-bit field PAGE_DIRECTION, of which only the least significant 3 bits were used. The original byte containing PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B. page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT. page_ptr_get_direction(), page_get_direction(), page_ptr_set_direction(): Accessors for PAGE_DIRECTION. page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION. page_direction_increment(): Increment PAGE_N_DIRECTION and set PAGE_DIRECTION. rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes, and assume that heap_no is always set. Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records, even if the record contains fewer fields. rec_offs_make_valid(): Add the parameter 'leaf'. rec_copy_prefix_to_dtuple(): Assert that the tuple is only built on the core fields. 
Instant ADD COLUMN only applies to the clustered index, and we should never build a search key that has more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR. All these columns are always present. dict_index_build_data_tuple(): Remove assertions that would be duplicated in rec_copy_prefix_to_dtuple(). rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose number of fields is between n_core_fields and n_fields. cmp_rec_rec_with_match(): Implement the comparison between two MIN_REC_FLAG records. trx_t::in_rollback: Make the field available in non-debug builds. trx_start_for_ddl_low(): Remove dangerous error-tolerance. A dictionary transaction must be flagged as such before it has generated any undo log records. This is because trx_undo_assign_undo() will mark the transaction as a dictionary transaction in the undo log header right before the very first undo log record is being written. btr_index_rec_validate(): Account for instant ADD COLUMN. row_undo_ins_remove_clust_rec(): On the rollback of an insert into SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the last column from the table and the clustered index. row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(), trx_undo_update_rec_get_update(): Handle the 'default row' as a special case. dtuple_t::trim(index): Omit a redundant suffix of an index tuple right before insert or update. After instant ADD COLUMN, if the last fields of a clustered index tuple match the 'default row', there is no need to store them. While trimming the entry, we must hold a page latch, so that the table cannot be emptied and the 'default row' be deleted. btr_cur_optimistic_update(), btr_cur_pessimistic_update(), row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low(): Invoke dtuple_t::trim() if needed. row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling row_ins_clust_index_entry_low(). 
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number of fields to be between n_core_fields and n_fields. Do not support infimum,supremum. They are never supposed to be stored in dtuple_t, because page creation nowadays uses a lower-level method for initializing them. rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the number of fields. btr_cur_trim(): In an update, trim the index entry as needed. For the 'default row', handle rollback specially. For user records, omit fields that match the 'default row'. btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete(): Skip locking and adaptive hash index for the 'default row'. row_log_table_apply_convert_mrec(): Replace 'default row' values if needed. In the temporary file that is applied by row_log_table_apply(), we must identify whether the records contain the extra header for instantly added columns. For now, we will allocate an additional byte for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table has been subject to instant ADD COLUMN. The ROW_T_DELETE records are fine, as they will be converted and will only contain 'core' columns (PRIMARY KEY and some system columns) that are converted from dtuple_t. rec_get_converted_size_temp(), rec_init_offsets_temp(), rec_convert_dtuple_to_temp(): Add the parameter 'status'. REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED: An info_bits constant for distinguishing the 'default row' record. rec_comp_status_t: An enum of the status bit values. rec_leaf_format: An enum that replaces the bool parameter of rec_init_offsets_comp_ordinary().	2017-10-06 09:50:10 +03:00
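The MDEV-11369 message above describes PAGE_INSTANT as a 13-bit field sharing a 16-bit slot with PAGE_DIRECTION, which only ever used the least significant 3 bits. A minimal sketch of that kind of bit packing follows, assuming the 13-bit value occupies the upper bits of the slot. The function and constant names here are hypothetical and are not the InnoDB accessors page_get_instant(), page_set_instant() or page_get_direction() themselves.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants for this sketch (assumed layout:
// upper 13 bits = PAGE_INSTANT, lower 3 bits = PAGE_DIRECTION).
constexpr unsigned DIRECTION_BITS = 3;
constexpr uint16_t DIRECTION_MASK = (1U << DIRECTION_BITS) - 1;  // 0x0007
constexpr uint16_t MAX_INSTANT = 0xFFFF >> DIRECTION_BITS;       // 13 bits

// Read the 13-bit value (the number of 'core' fields) from the 16-bit slot.
inline uint16_t get_instant(uint16_t field) {
  return static_cast<uint16_t>(field >> DIRECTION_BITS);
}

// Read the 3-bit direction from the same slot.
inline uint8_t get_direction(uint16_t field) {
  return static_cast<uint8_t>(field & DIRECTION_MASK);
}

// Store n_core_fields while preserving the direction bits.
inline uint16_t set_instant(uint16_t field, uint16_t n_core_fields) {
  assert(n_core_fields <= MAX_INSTANT);
  return static_cast<uint16_t>((n_core_fields << DIRECTION_BITS) |
                               (field & DIRECTION_MASK));
}

// Store the direction while preserving the 13-bit value.
inline uint16_t set_direction(uint16_t field, uint8_t direction) {
  assert(direction <= DIRECTION_MASK);
  return static_cast<uint16_t>((field & ~DIRECTION_MASK) | direction);
}
```

With this layout, updating either value leaves the other untouched, which is why code that previously read the whole 16-bit field as the direction would need a separate byte-level accessor such as the PAGE_DIRECTION_B constant mentioned in the message.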
Marko Mäkelä	70db1e3b8a	Merge 10.1 into 10.2	2017-09-06 19:28:51 +03:00
Marko Mäkelä	cd694d76ce	Merge 10.0 into 10.1	2017-09-06 15:32:56 +03:00
Marko Mäkelä	6b45355e6b	MDEV-13103 Assertion `flags & BUF_PAGE_PRINT_NO_CRASH' failed in buf_page_print buf_page_print(): Remove the parameter 'flags', and when a server abort is intended, perform that in the caller. In this way, page corruption reports due to different reasons can be distinguished better. This is non-functional code refactoring that does not fix any page corruption issues. The change is only made to avoid falsely grouping together unrelated causes of page corruption.	2017-09-06 14:01:15 +03:00

100 commits