multiple file tablespace
Problem:
=======
- innochecksum was incorrectly interpreting doublewrite buffer
pages as index pages, causing confusion about stale tables
in the system tablespace.
- innochecksum fails to parse the multi-file system tablespace
Solution:
========
1. Checksum rewriting is skipped for doublewrite buffer
pages.
2. Introduced the option --tablespace-flags which can be used
to initialize page size. This option can handle the ibdata2,
ibdata3 etc without parsing ibdata1.
In commit bea4adcb5a (MDEV-35225)
we inadvertently introduced a race condition. Another thread
may invoke buf_page_t::write_complete() between the time
log_sort_flush_list() inserted the block to the list for sorting,
and the time it would apply the sorted list back to buf_pool.flush_list.
In this case, log_sort_flush_list() would neither add the block to
buf_pool.flush_list nor clear the buf_page_t::oldest_modification_
so that it would correctly indicate whether the block is in the list.
log_sort_flush_list(): Simplify the logic, and always add the entire
sorted list to the buf_pool.flush_list, even if some blocks had been
written back during the time we were copying or sorting.
This fixes an anomaly where a subsequent
buf_pool_t::insert_into_flush_list() would end up incrementing
buf_pool.flush_list.count by one too many.
Thanks to Daniel Black for providing an "rr replay" trace of a
failure.
mtr_t::encrypt(): Handle the special case that the type and length
are at the very end of a m_log snippet, followed by another snippet
that contains the rest of the payload.
It should be noted that in log_decrypt_mtr(), the mini-transaction
consists of a single contiguous memory area. Furthermore, for the
initial log record in mtr_t::encrypt() at least the page identifier
will always be included in the initial m_log snippet.
Tested by: Alice Sherepa
row_vers_impl_x_locked_low(): If a secondary index record points to
a clustered index record that carries the current transaction identifier,
then there cannot possibly be any implicit locks to that secondary index
record, because those would have been checked before the current
transaction got the implicit lock (modified the clustered index record)
in the first place.
This fix will avoid unnecessary access to the undo log and possibly BLOB
pages, which may already have been freed in a purge operation.
buf_page_get_zip(): Assert that the page is not marked as freed
in the tablespace. This assertion could fire in a scenario like the
test case when the table is created in ROW_FORMAT=COMPRESSED.
This is a 10.6 version of commit be0e3b2f0d
without a test case.
Previously, InnoDB used the partition separator "#p#" on Windows,
and "#P#" elsewhere. Now it uses "#P#" uniformly.
However, there is no automated upgrade procedure in place yet that
would fix an old data dictionary.
This patch fixes that by accepting both #P# and #p# as partition
separators.
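The tolerant matching described above can be sketched as follows. This is only an illustration, not the InnoDB code: the helper name find_partition_separator is hypothetical, and it assumes that the separator is matched as the three bytes "#P#" or "#p#" anywhere in the name.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not an InnoDB function): return the offset of the
// first partition separator in a table name, accepting both "#P#" and the
// legacy Windows spelling "#p#". Returns npos if there is no separator.
inline std::size_t find_partition_separator(std::string_view name)
{
  for (std::size_t i = 0; i + 3 <= name.size(); i++)
    if (name[i] == '#' && (name[i + 1] == 'P' || name[i + 1] == 'p') &&
        name[i + 2] == '#')
      return i;
  return std::string_view::npos;
}
```

With such a helper, a name written by an old Windows data dictionary ("test/t1#p#p0") and one written elsewhere ("test/t1#P#p0") would resolve to the same separator offset.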
If lower_case_table_names=0 is used (in a case-sensitive directory,
see https://learn.microsoft.com/en-us/windows/wsl/case-sensitivity),
InnoDB still handles Windows as a special case, and forces lowercasing
in its dictionary.
To fix this, remove the #ifdef places that handle Windows specially
with regard to filename casing.
ibuf_upgrade_needed(): Adjust the check for the case that the
InnoDB system tablespace had been created without
innodb_checksum_algorithm=full_crc32 and innodb_encrypt_tables
was enabled at the time the change buffer was upgraded.
When there is a column length mismatch in the InnoDB
statistics tables (innodb_table_stats or innodb_index_stats),
each subsequent access of the statistics tables reports an error
message and falls back to transient statistics.
This change makes it easier for users to understand and
resolve the issue when the statistics tables have been
modified or corrupted.
Starting with commit 4369a382a1 the
calls to release_thd() would occasionally attempt InnoDB shutdown,
causing occasional assertion failures when mariadbd or mariadb-backup
is shutting down.
It turns out that thd_set_ha_data() has some side effects, other than
just assigning the pointer that a subsequent thd_get_ha_data() would return.
mtr_t::trx: New public const data member. If it is nullptr,
per-connection statistics will not be updated. The transaction is
not necessarily in active state. We may merely use it as an "anchor"
for buffering updates of buf_pool.stat.n_page_gets in
trx_t::pages_accessed.
As part of this, we try to create mtr_t less often, reusing one object
in multiple places. Some read operations will
invoke mtr_t::rollback_to_savepoint() to release their own page latches
within a larger mini-transaction.
Reviewed by: Vladislav Lesin
Tested by: Saahil Alam
Let us remove the thread-local variable mariadb_stats and introduce
trx_t::pages_accessed, trx_t::active_handler_stats for more
efficiently maintaining some statistics inside InnoDB.
buf_pool.stat.n_page_gets: Reimplemented as Atomic_counter<ulint>.
This will no longer track some accesses in the background where
!current_thd() || !thd_to_trx(current_thd).
trx_t::free(), trx_t::commit_cleanup(): Apply pages_accessed
to buf_pool.stat.n_page_gets.
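The idea of buffering page accesses per transaction and applying them to the shared counter later can be sketched like this. All names below (global_page_gets, transaction, note_page_access, apply_pages_accessed) are illustrative stand-ins and not the InnoDB definitions; the sketch assumes that the local count is only touched by the thread that owns the transaction.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Illustrative stand-in for a shared statistics counter.
inline std::atomic<std::size_t> global_page_gets{0};

// Illustrative stand-in for a transaction object.
struct transaction
{
  std::size_t pages_accessed = 0;  // touched only by the owning thread

  // Cheap, non-atomic bookkeeping on every page access.
  void note_page_access() { pages_accessed++; }

  // Fold the buffered count into the shared counter with a single atomic
  // read-modify-write, then reset the local count.
  void apply_pages_accessed()
  {
    if (pages_accessed)
    {
      global_page_gets.fetch_add(pages_accessed, std::memory_order_relaxed);
      pages_accessed = 0;
    }
  }
};
```

The design trade-off is that the shared counter lags behind until the buffered count is applied, in exchange for avoiding an atomic operation on every page access.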
buf_read_ahead_report(): Report a completed read-ahead batch.
ha_innobase::estimate_rows_upper_bound(): Do not bother updating
trx_t::op_info around some quick arithmetic.
ha_innobase::records_in_range(): Do invoke mariadb_set_stats.
This will change some ANALYZE FORMAT=JSON SELECT results of the test
main.rowid_filter_innodb.
Reviewed by: Vladislav Lesin
Tested by: Saahil Alam
buf_pool_t::page_guess(): Avoid a memory transaction, because
we would be checking several conditions inside it. Synchronize
with buf_page_t::init() in order to avoid false guesses.
buf_page_t::init(): Release-store the state after storing id_,
in order to properly synchronize-with buf_pool_t::page_guess().
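The release/acquire pairing can be illustrated with a simplified descriptor. This is a sketch under stated assumptions: page_descriptor, its fields and guess_matches() are hypothetical and do not mirror the real buf_page_t layout, and init() is assumed to be called at most once per descriptor while readers may run concurrently.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplified, hypothetical stand-in for a block descriptor.
struct page_descriptor
{
  static constexpr uint32_t NOT_USED = 0, IN_USE = 1;

  uint64_t id = 0;                       // plain field, written before state
  std::atomic<uint32_t> state{NOT_USED};

  // Writer: store the identifier first, then publish it with a
  // release store of the state.
  void init(uint64_t new_id)
  {
    id = new_id;
    state.store(IN_USE, std::memory_order_release);
  }

  // Reader: an acquire load that observes IN_USE also makes the earlier
  // write of id visible. The id is not read at all while the state is
  // still NOT_USED.
  bool guess_matches(uint64_t expected_id) const
  {
    return state.load(std::memory_order_acquire) == IN_USE &&
           id == expected_id;
  }
};
```

If the state were stored before the identifier, or with relaxed ordering, a reader could observe IN_USE together with a stale identifier, which is the kind of false guess that the ordering is meant to rule out.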
buf_pool_t::page_hash_table::append(): Define non-inline.
buf_pool_t::page_hash_table::replace(),
buf_pool_t::page_hash_table::remove(): Move the inline definition
to the compilation unit of the only caller, to declutter the header.
buf_page_create_low(), buf_page_init_for_read(): Do not initialize
the block descriptor before holding a latch on buf_pool.page_hash,
in order to avoid false positive matches in buf_pool_t::page_guess().
trx_undo_report_row_operation(): Check the cheaper and less likely
condition first.
purge_sys_t::is_purgeable(), purge_sys_t::batch_cleanup(): Do not
attempt to use a memory transaction around the rather complex
data structure ReadViewBase.
Also, remove several redundant TRANSACTIONAL_TARGET.
Some of the remaining ones will be made redundant by
commit 9c8bdc6c15 (MDEV-35049).
Reviewed by: Vladislav Lesin
Instrumented debug mode triggering of instant_insert_fail
failed to free partition memory like the que_eval_sql function
whose failure it intended to emulate.
Corrected by freeing the partition information when this internal
debug mode instrumented condition occurred.
row_log_apply_ops(), row_log_table_apply_ops(): Instead of adding
an offset to a potentially null pointer, subtract the offset from a
never-null pointer and then compare to the potentially null pointer.
Also, instead of adding a negative (wrapped-around) pointer offset,
subtract a positive pointer offset.
Reviewed by: Daniel Black
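The pointer comparison technique from the row_log_apply_ops() change above can be sketched in isolation. The helper ends_at() is hypothetical and only demonstrates the pattern; it assumes that end points into, or one past the end of, a buffer that is at least size bytes long.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: does a record of `size` bytes starting at `rec`
// end exactly at `end`? `rec` may be null; `end` must never be null.
//
// Writing `rec + size == end` would add a nonzero offset to a null
// pointer when rec is null, which is undefined behaviour. Subtracting
// from the never-null `end` keeps the arithmetic inside the buffer.
inline bool ends_at(const unsigned char *rec, std::size_t size,
                    const unsigned char *end)
{
  assert(end != nullptr);
  return rec == end - size;
}
```

Both forms give the same answer for valid inputs; the difference is only that the second form never performs arithmetic on a pointer that might be null.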
Problem:
=======
- InnoDB statistics calculation for a table is done every
10 seconds by default in the background thread dict_stats_thread().
- Doing multiple ALTER TABLE..ALGORITHM=COPY causes
dict_stats_thread() to lag behind, so the calculation of statistics
for the newly created intermediate table gets delayed.
Fix:
====
- Statistics calculation for the newly created intermediate table is made
independent of the background thread. After copying is completed,
statistics for the new table are calculated as part of
ALTER TABLE ... ALGORITHM=COPY.
dict_stats_rename_table(): Rename the table statistics from
intermediate table to new table
alter_stats_rebuild(): Remove the table name from the warning,
because this warning can be printed for an intermediate table as well.
Alter table using copy algorithm now calls alter_stats_rebuild()
under a shared MDL lock on a temporary #sql-alter- table,
differing from its previous use only during ALGORITHM=INPLACE
operations on user-visible tables.
dict_stats_schema_check(): Added a separate check for table
readability before checking for tablespace existence.
This could lead to earlier detection of whether persistent statistics
storage exists, and to an earlier fallback to transient statistics.
This is a cherry-pick of MySQL commit cfe5f287ae99d004e8532a30003a7e8e77d379e3.
Modified srv_start to call fil_crypt_threads_init() only
when srv_read_only_mode is not set.
Modified encryption.innodb-read-only to capture number of
encryption threads created for both scenarios when
server is not read only as well as when server is read only.
Let us access some data members of THD directly, instead of invoking
non-inline accessor functions. Note: my_thread_id will be used instead
of the potentially narrower ulong data type.
Also, let us remove some functions from sql_class.cc that were only
being used by InnoDB or RocksDB, for no reason. RocksDB always had
access to the internals of THD.
Reviewed by: Sergei Golubchik
Tested by: Saahil Alam
Instead of using DBUG_EXECUTE_IF fault injection, let us construct
a minimal corrupted log file that will produce an OPT_PAGE_CHECKSUM
mismatch without depending on CMAKE_BUILD_TYPE=Debug.
BsonGet_String and JsonGet_String with a NULL argument
push a warning with an empty string, which is the default
contents of g->Message.
In push_warning in the server, there is a debug assertion
that the string doesn't end in \n. This looks at the byte before
the terminating NUL of the string, which in this case is
before the start of the buffer. This results in a UBSAN error,
as it is a pointer overflow/underflow.
Correct this by using "Argument is NULL" as the warning
message.
Also corrected the JsonGet_String to error if the Value
failed to allocate a buffer in its constructor.
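The hazard with an empty message can be shown with a small sketch. The helper ends_with_newline() is hypothetical and is not the server's assertion; it only illustrates why an empty string needs a length check before the last byte is inspected.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: report whether a C string ends in '\n'.
// For an empty string, msg[strlen(msg) - 1] would refer to a byte
// before the start of the buffer, so the length is checked first.
inline bool ends_with_newline(const char *msg)
{
  const std::size_t len = std::strlen(msg);
  return len != 0 && msg[len - 1] == '\n';
}
```

The fix described above takes the other route: it makes sure that the message is never empty, so that an unguarded check of the last byte stays inside the buffer.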
The function ibuf_remove_free_page() was waiting for ibuf_mutex
while holding ibuf.index->lock. This constitutes a lock order
inversion and may cause InnoDB to hang when innodb_change_buffering
is enabled and ibuf_merge_or_delete_for_page() is being executed
concurrently.
In fact, there is no need for ibuf_remove_free_page() to reacquire
ibuf_mutex if we make ibuf.seg_size and ibuf.free_list_len
protected by the ibuf.index->lock as well as the root page latch rather
than by ibuf_mutex.
ibuf.seg_size, ibuf.free_list_len: Instead of ibuf_mutex, let the
ibuf.index->lock and the root page latch protect these, like ibuf.empty.
ibuf_init_at_db_start(): Acquire the root page latch before updating
ibuf.seg_size. (The ibuf.index would be created later.)
ibuf_data_enough_free_for_insert(), ibuf_data_too_much_free():
Assert also ibuf.index->lock.have_u_or_x().
ibuf_remove_free_page(): Acquire the ibuf.index->lock and the root page
latch before accessing ibuf.free_list_len. Simplify how the
root page latch is released and reacquired. Acquire and release
ibuf_mutex only once.
ibuf_free_excess_pages(), ibuf_insert_low(): Acquire also ibuf.index->lock
before reading ibuf.free_list_len.
ibuf_print(): Acquire ibuf.index->lock before reading
ibuf.free_list_len and ibuf.seg_size.
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
Follow up to MDEV-34388 82d7419e06
Relax limit on specific files only.
clang-20 + CMAKE_BUILD_TYPE=Debug:
options/cf_options.cc:0:0: stack frame size (17624) exceeds limit (16384) in function '__cxx_global_var_init.33'
options/db_options.cc:0:0: stack frame size (34328) exceeds limit (32768) in function '__cxx_global_var_init.45'
Reviewer: Jimmy Hu <jimmy.hu@mariadb.com>
Under a Debug build this becomes an error due to -Werror.
Resolved this by changing the grn_mecab_chunk_size_threshold
to a ptrdiff_t along with chunked_tokenize_utf8's string_bytes
argument so there is no need to cast.
Reviewer: Jimmy Hu <jimmy.hu@mariadb.com>
Problem:
=======
When InnoDB encountered a corrupted page during crash recovery,
the server would abort due to improper handling of page locks
and space references. The recovery process was not properly
cleaning up resources when corruption was detected,
leading to inconsistent state and server termination.
Solution:
=========
recover_low(): Move page lock recursive acquisition
after deferred/non-deferred page creation logic to
ensure consistent locking behavior for both code paths.
Ensure proper block recursive unlock for non-deferred tablespaces.
recv_recover_page(): Simplify corrupted page cleanup by
removing redundant space reference handling.
maria_open(): Always initialize open_mode, and remove the
redundant local variable try_open_mode that
commit 24821e9585 had introduced.
Also, optimize away any checks for s3 when WITH_S3_STORAGE_ENGINE
is not defined.
mtr --view-protocol uses a separate thread to execute SELECT
statements, so SET GLOBAL spider_same_server_link is needed.
The other issue has been filed as MDEV-37568.
When spider tries to find a partition matching a name passed from the
sql layer, it constructs the partition name with NORMAL_PART_NAME.
However, the name passed from the sql layer could be constructed with
other types of name, such as TEMP_PART_NAME, which is a longer string.
Spider does handle TEMP_PART_NAME in other places of
spider_get_partition_info, but overall it is not able to handle
partition changes involving redistributing data to partitions which
can result in TEMP_PART_NAME. That is a more involved issue. In this
patch, we simply follow the existing intended logic and fix the MSAN
complaint.
Fixed the following issues:
- aria_read_index() and aria_read_data(), used by mariabackup, checked
the wrong status from maria_page_crc_check().
- Both functions did infinite retries if crc did not match.
- Wrong usage of ma_check_if_zero() in maria_page_crc_check()
Author: Thirunarayanan Balathandayuthapani <thiru@mariadb.com>
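One way to avoid retrying forever on a checksum mismatch is to give the read loop a fixed attempt budget. The sketch below is an assumption about the general shape of such a loop, not the actual aria_read_index() or aria_read_data() code; read_result, read_with_retries() and the max_attempts parameter are all hypothetical, and the commit message does not state which limit is used.

```cpp
#include <cassert>
#include <functional>

// Hypothetical outcome of reading and verifying one page.
enum class read_result { ok, crc_mismatch };

// Hypothetical sketch: call read_page() until it succeeds or the attempt
// budget is exhausted. Returns true on success, false if every attempt
// reported a checksum mismatch.
inline bool read_with_retries(const std::function<read_result()> &read_page,
                              unsigned max_attempts)
{
  for (unsigned attempt = 0; attempt < max_attempts; attempt++)
    if (read_page() == read_result::ok)
      return true;
  return false;
}
```

A caller would treat a false return as a corrupted page and report an error instead of looping again.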
- Removed duplicate words, like "the the" and "to to"
- Removed duplicate lines (one double sort line found in mysql.cc)
- Fixed some typos found while searching for duplicate words.
Command used to find duplicate words:
egrep -rI "\s([a-zA-Z]+)\s+\1\s" | grep -v param
Thanks to Artjoms Rimdjonoks for the command and pointing out the
spelling errors.
- data files will be opened in readonly mode for repair if --quick
is used.
- Added information about check progress if --verbose is used.
- Added new option --keys-active= as a simpler version of keys-used.
Internal changes:
- Store open file mode in share->index_mode and share->data_mode instead
of in share->mode.
- Removed the unneeded 'mode' argument from maria_clone_internal()
These changes were done as part of fixing
MDEV-36858 MariaDB MyISAM secondary indexes silently break for
tables > 10B rows
Changes done in myisamchk:
- Tables that are checked are opened in readonly mode if --force is not
used.
- *.MYD files will be opened in readonly mode for repair if --quick
is used.
- Added information about check progress if --verbose is used.
- Output information about repaired/checked rows every 10000 rows instead
of every 1000 rows. Note that this also affects aria_chk
- Store open file mode in share->index_mode and share->data_mode instead
of in share->mode.
- Added new option --keys-active= as a simpler version of keys-used.
- Changed output for "myisamchk -dvv" to get nicer output for tables
with 10 billion rows.