row_log_apply_ops(), row_log_table_apply_ops(): Instead of adding
an offset to a potentially null pointer, subtract the offset from a
never-null pointer and then compare to the potentially null pointer.
Also, instead of adding a negative (wrapped-around) pointer offset,
subtract a positive pointer offset.
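A minimal sketch of the pattern described above; the names are illustrative, not the actual row_log code. `end` is a never-null one-past-the-end pointer, while `rec` may be null, so the arithmetic is moved to the never-null side:

```cpp
#include <cassert>
#include <cstddef>

// Computing rec + size is undefined behavior when rec is null (or when
// the addition would overflow), so instead the offset is subtracted
// from the never-null end pointer and the result is compared to rec.
// Likewise, a positive offset is subtracted rather than adding a
// negative (wrapped-around) one.
static bool fits(const unsigned char *rec, const unsigned char *end,
                 size_t size)
{
  // was (sketch): return rec + size <= end;  // UB when rec is null
  return end - size >= rec;
}
```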
Reviewed by: Daniel Black
Problem:
=======
- InnoDB statistics calculation for a table is done by default
every 10 seconds in the background thread dict_stats_thread()
- Doing multiple ALTER TABLE..ALGORITHM=COPY operations causes
dict_stats_thread() to lag behind, so the calculation of stats
for the newly created intermediate table gets delayed
Fix:
====
- Stats calculation for the newly created intermediate table is made
independent of the background thread. After the copy completes,
stats for the new table are calculated as part of ALTER TABLE ... ALGORITHM=COPY.
dict_stats_rename_table(): Rename the table statistics from
intermediate table to new table
alter_stats_rebuild(): Remove the table name from the warning,
because this warning can be printed for the intermediate table as well.
Alter table using copy algorithm now calls alter_stats_rebuild()
under a shared MDL lock on a temporary #sql-alter- table,
differing from its previous use only during ALGORITHM=INPLACE
operations on user-visible tables.
dict_stats_schema_check(): Added a separate check for table
readability before checking for tablespace existence.
This allows the existence of the persistent statistics storage
to be detected earlier, falling back to transient statistics.
This is a cherry-pick fix of mysql commit@cfe5f287ae99d004e8532a30003a7e8e77d379e3
Modified srv_start to call fil_crypt_threads_init() only
when srv_read_only_mode is not set.
Modified encryption.innodb-read-only to capture number of
encryption threads created for both scenarios when
server is not read only as well as when server is read only.
Let us access some data members of THD directly, instead of invoking
non-inline accessor functions. Note: my_thread_id will be used instead
of the potentially narrower ulong data type.
Also, let us remove some functions from sql_class.cc that were only
being used by InnoDB or RocksDB, for no reason. RocksDB always had
access to the internals of THD.
Reviewed by: Sergei Golubchik
Tested by: Saahil Alam
Instead of using DBUG_EXECUTE_IF fault injection, let us construct
a minimal corrupted log file that will produce an OPT_PAGE_CHECKSUM
mismatch without depending on CMAKE_BUILD_TYPE=Debug.
BsonGet_String and JsonGet_String with a NULL argument
push an empty string warning which is the default contents
of g->Message.
In push_warning in the server, there is a debug assertion
that the string does not end in \n. This check looks at the byte
before the terminating null of the string, which in this case is
before the buffer. This results in a UBSAN error, as it is a
pointer overflow/underflow.
Correct by adding an "Argument is NULL" as the warning
message.
Also corrected the JsonGet_String to error if the Value
failed to allocate a buffer in its constructor.
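A hedged sketch of the assertion pattern involved (illustrative code, not the actual push_warning source): checking the byte before the terminating null underflows for an empty string, which is why supplying a non-empty message such as "Argument is NULL" avoids the problem. A safe form tests the length first:

```cpp
#include <cassert>
#include <cstring>

// For msg == "", the expression msg[strlen(msg) - 1] reads one byte
// before the buffer: a pointer underflow that UBSan reports. Testing
// the length first avoids forming the out-of-bounds pointer at all.
static bool ends_in_newline(const char *msg)
{
  size_t len = std::strlen(msg);
  return len != 0 && msg[len - 1] == '\n';  // no msg[-1] access for ""
}
```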
The function ibuf_remove_free_page() was waiting for ibuf_mutex
while holding ibuf.index->lock. This constitutes a lock order
inversion and may cause InnoDB to hang when innodb_change_buffering
is enabled and ibuf_merge_or_delete_for_page() is being executed
concurrently.
In fact, there is no need for ibuf_remove_free_page() to reacquire
ibuf_mutex if we make ibuf.seg_size and ibuf.free_list_len
protected by the ibuf.index->lock as well as the root page latch rather
than by ibuf_mutex.
ibuf.seg_size, ibuf.free_list_len: Instead of ibuf_mutex, let the
ibuf.index->lock and the root page latch protect these, like ibuf.empty.
ibuf_init_at_db_start(): Acquire the root page latch before updating
ibuf.seg_size. (The ibuf.index would be created later.)
ibuf_data_enough_free_for_insert(), ibuf_data_too_much_free():
Assert also ibuf.index->lock.have_u_or_x().
ibuf_remove_free_page(): Acquire the ibuf.index->lock and the root page
latch before accessing ibuf.free_list_len. Simplify how the
root page latch is released and reacquired. Acquire and release
ibuf_mutex only once.
ibuf_free_excess_pages(), ibuf_insert_low(): Acquire also ibuf.index->lock
before reading ibuf.free_list_len.
ibuf_print(): Acquire ibuf.index->lock before reading
ibuf.free_list_len and ibuf.seg_size.
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
Follow up to MDEV-34388 82d7419e06
Relax limit on specific files only.
clang-20 + CMAKE_BUILD_TYPE=Debug:
options/cf_options.cc:0:0: stack frame size (17624) exceeds limit (16384) in function '__cxx_global_var_init.33'
options/db_options.cc:0:0: stack frame size (34328) exceeds limit (32768) in function '__cxx_global_var_init.45'
Reviewer: Jimmy Hu <jimmy.hu@mariadb.com>
Under a Debug build this becomes a -Werror failure.
Resolved this by changing grn_mecab_chunk_size_threshold
to a ptrdiff_t along with chunked_tokenize_utf8's string_bytes
argument, so there is no need to cast.
Reviewer: Jimmy Hu <jimmy.hu@mariadb.com>
Problem:
=======
When InnoDB encounters a corrupted page during crash recovery,
the server would abort due to improper handling of page locks
and space references. The recovery process was not properly
cleaning up resources when corruption was detected,
leading to an inconsistent state and server termination.
Solution:
=========
recover_low(): Move the recursive page lock acquisition
after the deferred/non-deferred page creation logic to
ensure consistent locking behavior for both code paths.
Ensure a proper recursive block unlock for non-deferred tablespaces.
recv_recover_page(): Simplify corrupted page cleanup by
removing redundant space reference handling.
maria_open(): Always initialize open_mode, and remove the
redundant local variable try_open_mode that
commit 24821e9585 had introduced.
Also, optimize away any checks for s3 when WITH_S3_STORAGE_ENGINE
is not defined.
mtr --view-protocol executes SELECT statements in a separate
thread, thus SET GLOBAL spider_same_server_link is needed.
The other issue is opened as MDEV-37568
When spider tries to find a partition matching a name passed from the
sql layer, it constructs the partition name with NORMAL_PART_NAME.
However, the name passed from the sql layer could be constructed with
other types of names, such as TEMP_PART_NAME, which is a longer string.
Spider does handle TEMP_PART_NAME in other places of
spider_get_partition_info, but overall it is not able to handle
partition changes involving redistributing data to partitions, which
can result in a TEMP_PART_NAME. That is a more involved issue. In this
patch, we simply follow the existing intended logic and fix the MSAN
complaint.
Fixed the following issues:
- aria_read_index() and aria_read_data(), used by mariabackup, checked
the wrong status from maria_page_crc_check().
- Both functions retried infinitely if the CRC did not match.
- Wrong usage of ma_check_if_zero() in maria_page_crc_check()
Author: Thirunarayanan Balathandayuthapani <thiru@mariadb.com>
- Removed duplicate words, like "the the" and "to to"
- Removed duplicate lines (one double sort line found in mysql.cc)
- Fixed some typos found while searching for duplicate words.
Command used to find duplicate words:
egrep -rI "\s([a-zA-Z]+)\s+\1\s" | grep -v param
Thanks to Artjoms Rimdjonoks for the command and pointing out the
spelling errors.
- data files will be opened in readonly mode for repair if --quick
is used.
- Added information about check progress if --verbose is used.
- Added new option --keys-active= as a simpler version of keys-used.
Internal changes:
- Store open file mode in share->index_mode and share->data_mode instead
of in share->mode.
- Removed not needed 'mode' argument from maria_clone_internal()
These changes were done as part of fixing
MDEV-36858 MariaDB MyISAM secondary indexes silently break for
tables > 10B rows
Changes done in myisamchk:
- Tables that are checked are opened in readonly mode if --force is not
used.
- *.MYD files will be opened in readonly mode for repair if --quick
is used.
- Added information about check progress if --verbose is used.
- Output information about repaired/checked rows every 10000 rows instead
of every 1000 rows. Note that this also affects aria_chk
- Store open file mode in share->index_mode and share->data_mode instead
of in share->mode.
- Added new option --keys-active= as a simpler version of keys-used.
- Changed output for "myisamchk -dvv" to get nicer output for tables
with 10 billion rows.
This was caused by a wrong handling of bitmaps in
copy_not_changed_fields() that did not work on big endian machines.
This bug caused recovery of Aria files to fail on big endian machines
like s390x or Sparc.
This issue was noticed by the bulk_insert_crash.test on the
s390x builder.
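Byte-wise bitmap addressing is the usual endian-neutral remedy for this class of bug. A hedged sketch (not the actual copy_not_changed_fields() code):

```cpp
#include <cassert>

// Bit i lives in byte i/8 regardless of machine word order, so this
// layout is identical on little-endian x86 and big-endian s390x or
// SPARC. Word-wide (e.g. 32-bit) access to the same memory is not.
static bool bit_is_set(const unsigned char *map, unsigned i)
{
  return (map[i >> 3] >> (i & 7)) & 1;
}

static void set_bit(unsigned char *map, unsigned i)
{
  map[i >> 3] |= static_cast<unsigned char>(1u << (i & 7));
}
```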
The function row_purge_reset_trx_id() that had been introduced in
commit 3c09f148f3 (MDEV-12288)
introduces some extra buffer pool and redo log activity that will
cause a significant performance regression under some workloads.
This is currently the most significant performance issue, after
commit acd071f599 (MDEV-21923)
fixed the InnoDB LSN allocation and MDEV-19749 the MDL bottleneck in 12.1.
The purpose of row_purge_reset_trx_id() was to ensure that we can
easily identify records for which no history exists. If DB_TRX_ID
is 0, we could avoid looking up the transaction to see if the
history is accessible or the record is implicitly locked.
To avoid trx_sys_t::find() for stale DB_TRX_ID values, we can refer
to trx_t::max_inactive_id, which was introduced in
commit 4105017a58 (MDEV-30357).
Instead of comparing DB_TRX_ID to 0, we may compare it to this
cached value. The cache would be updated by
trx_sys_t::find_same_or_older(), which is invoked for some operations
on secondary indexes.
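The caching idea can be sketched as follows; the names and the atomic update scheme are illustrative assumptions, not InnoDB's actual implementation:

```cpp
#include <atomic>
#include <cstdint>

// Any trx id at or below max_inactive_id is known to have no active
// transaction, so the expensive lookup (trx_sys.find() in the commit)
// can be skipped. The bound only ever grows.
struct TrxSysSketch {
  std::atomic<uint64_t> max_inactive_id{0};

  // Stand-in for trx_sys_t::find_same_or_older(): when the lookup
  // proves the id inactive, remember it as a lower bound.
  bool find(uint64_t id, uint64_t oldest_active_id) {
    if (id < oldest_active_id) {
      uint64_t prev = max_inactive_id.load();
      while (prev < id &&
             !max_inactive_id.compare_exchange_weak(prev, id)) {}
      return false;  // definitely not active
    }
    return true;     // may be active: a full check would be needed
  }

  // Fast path replacing the old "DB_TRX_ID == 0" test.
  bool may_be_active(uint64_t id) const {
    return id > max_inactive_id.load();
  }
};
```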
row_purge_reset_trx_id(): Remove. We will no longer reset the
DB_TRX_ID to 0 after an INSERT. We will retain a single undo log
for all operations, though. Before MDEV-12288, there had been
separate insert_undo and update_undo logs.
row_check_index(): No longer warn
"InnoDB: Clustered index record with stale history in table".
lock_rec_queue_validate(), lock_rec_convert_impl_to_expl(),
row_vers_impl_x_locked_low(): Instead of comparing the DB_TRX_ID
to 0, compare it to trx_t::max_inactive_id.
In dict0load.cc we will not spend any effort to avoid extra
trx_sys.find() calls for stale DB_TRX_ID in dictionary tables.
This code does not currently use trx_t objects, and therefore
we cannot easily access trx_t::max_inactive_id. Loading table
definitions into the InnoDB data dictionary cache (dict_sys)
should be a very rare operation.
Reviewed by: Vladislav Lesin
log_t::append_prepare_wait(): Relax the debug assertion in case
log_overwrite_warning() has been called. In this case, the
contents of log_sys.buf (and the ib_logfile0) is basically
unrecoverable garbage, and it does not matter which write was
last persisted.
This assertion would easily fail in the 11.4 branch in the test
encryption.innochecksum after merging MDEV-36024.
At least in some ATTRIBUTE_COLD code, mach_read_from_8() could be
invoked as a function call that could be as simple as wrapping
one or two instructions. Let us declare __attribute__((always_inline))
on those memory accessor functions that operate on 1, 2, 4, or 8 bytes
and are therefore likely to translate into few instructions, such as
mov;bswap or movbe on x86.
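A hedged sketch of such an accessor (not the actual mach_read_from_8 definition): a big-endian 8-byte load that GCC and Clang recognize and compile down to a single mov+bswap (or movbe) on x86, so forcing inlining keeps even cold callers at one or two instructions instead of a call.

```cpp
#include <cstdint>

// Assemble 8 bytes in big-endian order. Modern compilers fold this
// loop into a single byte-swapped load; always_inline guarantees the
// accessor never becomes an out-of-line call in cold code.
static inline
#if defined(__GNUC__)
__attribute__((always_inline))
#endif
uint64_t read_be64(const unsigned char *b)
{
  uint64_t x = 0;
  for (int i = 0; i < 8; i++)
    x = (x << 8) | b[i];
  return x;
}
```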
The innodb_encrypt_log=ON subformat of FORMAT_10_8 is inefficient,
because a new encryption or decryption context is being set up for
every log record payload snippet.
An in-place conversion between the old and new innodb_encrypt_log=ON
format is technically possible. No such conversion has been
implemented, though. There is some overhead with respect to the
unencrypted format (innodb_encrypt_log=OFF): At the end of each
mini-transaction, right before the CRC-32C, additional 8 bytes will be
reserved for a nonce (really, log_sys.get_flushed_lsn()), which forms
a part of an initialization vector.
log_t::FORMAT_ENC_11: The new format identifier, a UTF-8 encoding of
🗝 U+1F5DD OLD KEY (encryption). In this format, everything except the
types and lengths of log records will be encrypted. Thus, unlike in
FORMAT_10_8, also page identifiers and FILE_ records will be encrypted.
The initialization vector (IV) consists of the 8-byte nonce as well as
the type and length byte(s) of the first record of the mini-transaction.
Page identifiers will no longer form any part of the IV.
The old log_t::FORMAT_ENC_10_8 (innodb_encrypt_log=ON) will be supported
both by mariadb-backup and by crash recovery. Downgrade from the new
format will only be possible if the new server has been running or
restarted with innodb_encrypt_log=OFF. If innodb_encrypt_log=ON,
only the new log_t::FORMAT_ENC_11 will be written.
log_t::is_recoverable(): A new predicate, which holds for all 3
formats.
recv_sys_t::tmp_buf: A heap-allocated buffer for decrypting a
mini-transaction, or for making the wrap-around of a memory-mapped
log file contiguous.
recv_sys_t::start_lsn: The start of the mini-transaction.
Updated at the start of parse_tail().
log_decrypt_mtr(): Decrypt a mini-transaction in recv_sys.tmp_buf.
Theoretically, when reading the log via pread() rather than a read-only
memory mapping, we could modify the contents of log_sys.buf in place.
If we did that, we would have to re-read the last log block into
log_sys.buf before resuming writes, because otherwise that block could be
re-written as a mix of old decrypted data and new encrypted data, which
would cause a subsequent recovery failure unless the log checkpoint had
been advanced beyond this point.
log_decrypt_legacy(): Decrypt a log_t::FORMAT_ENC_10_8 record snippet
on stack. Replaces recv_buf::copy_if_needed().
recv_sys_t::get_backup_parser(): Return a recv_sys_t::parser, that is,
a pointer to an instantiation of parse_mmap or parse_mtr for the current
log format.
recv_sys_t::parse_mtr(), recv_sys_t::parse_mmap(): Add a parameter
template<uint32_t> for the current log_sys.format.
log_parse_start(): Validate the CRC-32C of a mini-transaction.
This has been split from the recv_sys_t::parse() template to
reduce code duplication. These two are the lowest-level functions
that will be instantiated for both recv_buf and recv_ring.
recv_sys_t::parse(): Split into ::log_parse_start() and parse_tail().
Add a parameter template<uint32_t format> to specialize for
log_sys.format at compilation time.
recv_sys_t::parse_tail(): Operate on pointers to contiguous
mini-transaction data. Use a parameter template<bool ENC_10_8>
for special handling of the old innodb_encrypt_log=ON format.
The former recv_buf::get_buf() is being inlined here.
Much of the logic is split into non-inline functions, to avoid
duplicating a lot of code for every template expansion.
log_crypt: Encrypt or decrypt a mini-transaction in place in the
new innodb_encrypt_log=ON format. We will use temporary buffers
so that encryption_ctx_update() can be invoked on integer multiples
of MY_AES_BLOCK_SIZE, except for the last bytes of the encrypted
payload, which will be encrypted or decrypted in place thanks to
ENCRYPTION_FLAG_NOPAD.
log_crypt::append(): Invoke encryption_ctx_update() in MY_AES_BLOCK_SIZE
(16-byte) blocks and scatter/gather shorter data blocks as needed.
log_crypt::finish(): Handle the last (possibly incomplete) block as a
special case, with ENCRYPTION_FLAG_NOPAD.
mtr_t::parse_length(): Parse the length of a log record.
mtr_t::encrypt(): Use log_crypt instead of the old log_encrypt_buf().
recv_buf::crc32c(): Add a parameter for the initial CRC-32C value.
recv_sys_t::rewind(): Operate on pointers to the start of the
mini-transaction and to the first skipped record.
recv_sys_t::trim(): Declare as ATTRIBUTE_COLD so that this rarely
invoked function will not be expanded inline in parse_tail().
recv_sys_t::parse_init(): Handle INIT_PAGE or FREE_PAGE while scanning
to the end of the log.
recv_sys_t::parse_page0(): Handle WRITE to FSP_SPACE_SIZE and
FSP_SPACE_FLAGS.
recv_sys_t::parse_store_if_exists(), recv_sys_t::parse_store(),
recv_sys_t::parse_oom(): Handle page-level log records.
mlog_decode_varint_length(): Make use of __builtin_clz() to avoid a loop
when possible.
mlog_decode_varint(): Define only on const byte*, as
ATTRIBUTE_NOINLINE static because it is a rather large function.
recv_buf::decode_varint(): Trivial wrapper for mlog_decode_varint().
recv_ring::decode_varint(): Special implementation.
log_page_modify(): Note that a page will be modified in recovery.
Split from recv_sys_t::parse_tail().
log_parse_file(): Handle non-page log records.
log_record_corrupted(), log_unknown(), log_page_id_corrupted():
Common error reporting functions.
mtr_t::get_log_size(): Remove.
mtr_t::crc32c(): New function: compute CRC-32C and determine the size,
including the sequence byte and the CRC-32C.
mtr_t::encrypt(): Return the size, similar to crc32c().
mtr_t::log_file_op(): Return the size written.
fil_name_write(): Remove. Let us invoke mtr_t::log_file_op() directly.
fil_names_clear(): Keep track of the available size without
invoking mtr_t::get_log_size().
mtr_buf_t::m_size: Remove.
mtr_buf_t::list_t: Use ilist instead of sized_ilist.
mtr_buf_t::for_each_block(): Remove. Let us allow iteration via
begin() and end(), without any lambda function objects.
However, rocksdb.bulk_load_unsorted_rev and rocksdb.bulk_load_unsorted
succeed under non-debug builds, and because they were slow (87 seconds)
there is a --big-test criterion for these tests.
m_charset_codec is uninitialized when calling m_make_unpack_info_func.
In the cases where m_make_unpack_info_func is one of:
* Rdb_key_def::make_unpack_unknown_varchar
* Rdb_key_def::make_unpack_unknown
* Rdb_key_def::dummy_make_unpack_info
the m_charset_codec that forms the first argument to this function
is unused.
In these limited cases we initialize the m_charset_codec member,
as its only use is to pass through to the m_make_unpack_info_func.
Ultimately MemorySanitizer shouldn't error on this, as all
three of these functions clearly have the attribute
__unused__ on their first argument, where the m_charset_codec is
passed.
buf_pool_t::shrink(): When relocating a buffer page, invalidate
the page identifier of the original page so that buf_pool_t::page_guess()
will not accidentally match it.
Before commit b6923420f3 (MDEV-29445)
introduced buf_pool_t::page_guess(), the validity of block descriptor
pointers was checked by buf_pool_t::is_uncompressed(const buf_block_t*).
Therefore, any block descriptors that used to be part of a larger buffer
pool would not be accessed at all.
This race condition is very hard to reproduce. To reproduce it,
an optimistic btr_pcur_t::restore_position() or similar will have to
be invoked on a block that has been relocated by buf_pool_t::shrink()
and that had not meanwhile been replaced with another page with a
different identifier.
Reviewed by: Vladislav Lesin
The pmem_cvap() function currently uses the '.arch armv8.2-a' directive
for the 'dc cvap' instruction. This causes the build errors below when
compiling for ARMv9 systems. Update the '.arch' directive to 'armv9.4-a'
to ensure compatibility with ARMv9 architectures.
{standard input}: Assembler messages:
{standard input}:169: Error: selected processor does not support `retaa'
{standard input}:286: Error: selected processor does not support `retaa'
make[2]: *** [storage/innobase/CMakeFiles/innobase_embedded.dir/build.make:
1644: storage/innobase/CMakeFiles/innobase_embedded.dir/sync/cache.cc.o]
Error 1
Signed-off-by: Ruiqiang Hao <Ruiqiang.Hao@windriver.com>
Problem:
=======
- During the copy algorithm, InnoDB fails to detect the duplicate
key error for a unique hash key blob index. A unique HASH index is
treated as a virtual index inside InnoDB.
When a table has a unique hash key, the server searches on
the hash key before doing any insert operation and
finds the duplicate value in check_duplicate_long_entry_key().
Bulk insert performs all the inserts together when the copy of the
intermediate table is finished. This leads to the duplicate key
error going undetected while building the index.
Solution:
========
- Avoid the bulk insert operation when the table has a unique
hash key blob index.
dict_table_t::can_bulk_insert(): Check whether the table
is eligible for a bulk insert operation during the ALTER copy algorithm.
Check whether any virtual column name starts with DB_ROW_HASH_ to
know whether a blob column has a unique index on it.
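A hedged sketch of the eligibility test (illustrative signature, not the actual dict_table_t member): bulk insert is refused when any virtual column name carries the DB_ROW_HASH_ prefix that marks a unique hash (long blob) index, so duplicates get checked row by row instead.

```cpp
#include <string>
#include <vector>

// Return false when any virtual column name starts with DB_ROW_HASH_,
// i.e. when a blob column has a unique hash index on it.
static bool can_bulk_insert(const std::vector<std::string> &vcol_names)
{
  for (const std::string &n : vcol_names)
    if (n.rfind("DB_ROW_HASH_", 0) == 0)  // starts with the prefix
      return false;
  return true;
}
```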
srv_printf_innodb_monitor(): After acquiring a latch,
abort the iteration if innodb_adaptive_hash_index=OFF.
If the adaptive hash index was disabled in a concurrently
executing thread, btr_search_sys_t::partition::clear() would have
freed part->heap, leading to us dereferencing a null pointer.
Reviewed by: Thirunarayanan Balathandayuthapani
Tested by: Saahil Alam
Problem:
=======
- InnoDB modifies the PAGE_ROOT_AUTO_INC value on the clustered index
root page. But before committing the mini-transaction with the
PAGE_ROOT_AUTO_INC changes, InnoDB performs the bulk insert operation,
calculates the page checksum and stores it as a part of the redo log
in the mini-transaction. During recovery, InnoDB fails to validate
the page checksum.
Solution:
========
- Avoid writing the persistent auto increment value before doing
the bulk insert operation.
- For the bulk insert operation, the persistent auto increment value
is written via btr_write_autoinc() while applying the buffered
insert operation.