This was caused by a wrong handling of bitmaps in
copy_not_changed_fields() that did not work on big endian machines.
This bug caused recovery of Aria files to fail on big endian machines
like s390x or Sparc.
This issue was noticed by the bulk_insert_crash.test on the
s390x builder.
The function row_purge_reset_trx_id() that had been introduced in
commit 3c09f148f3 (MDEV-12288)
introduces some extra buffer pool and redo log activity that will
cause a significant performance regression under some workloads.
This is currently the most significant performance issue, after
commit acd071f599 (MDEV-21923)
fixed the InnoDB LSN allocation and MDEV-19749 the MDL bottleneck in 12.1.
The purpose of row_purge_reset_trx_id() was to ensure that we can
easily identify records for which no history exists. If DB_TRX_ID
is 0, we could avoid looking up the transaction to see if the
history is accessible or the record is implicitly locked.
To avoid trx_sys_t::find() for stale DB_TRX_ID values, we can refer
to trx_t::max_inactive_id, which was introduced in
commit 4105017a58 (MDEV-30357).
Instead of comparing DB_TRX_ID to 0, we may compare it to this
cached value. The cache would be updated by
trx_sys_t::find_same_or_older(), which is invoked for some operations
on secondary indexes.
row_purge_reset_trx_id(): Remove. We will no longer reset the
DB_TRX_ID to 0 after an INSERT. We will retain a single undo log
for all operations, though. Before MDEV-12288, there had been
separate insert_undo and update_undo logs.
row_check_index(): No longer warn
"InnoDB: Clustered index record with stale history in table".
lock_rec_queue_validate(), lock_rec_convert_impl_to_expl(),
row_vers_impl_x_locked_low(): Instead of comparing the DB_TRX_ID
to 0, compare it to trx_t::max_inactive_id.
In dict0load.cc we will not spend any effort to avoid extra
trx_sys.find() calls for stale DB_TRX_ID in dictionary tables.
This code does not currently use trx_t objects, and therefore
we cannot easily access trx_t::max_inactive_id. Loading table
definitions into the InnoDB data dictionary cache (dict_sys)
should be a very rare operation.
Reviewed by: Vladislav Lesin
log_t::append_prepare_wait(): Relax the debug assertion in case
log_overwrite_warning() has been called. In this case, the
contents of log_sys.buf (and the ib_logfile0) is basically
unrecoverable garbage, and it does not matter which write was
last persisted.
This assertion would easily fail in the 11.4 branch in the test
encryption.innochecksum after merging MDEV-36024.
At least in some ATTRIBUTE_COLD code, mach_read_from_8() could be
invoked as a function call that could be as simple as wrapping
one or two instructions. Let us declare __attribute__((always_inline))
on those memory accessor functions that operate on 1, 2, 4, or 8 bytes
and are therefore likely to translate into few instructions, such as
mov;bswap or movbe on x86.
The innodb_encrypt_log=ON subformat of FORMAT_10_8 is inefficient,
because a new encryption or decryption context is being set up for
every log record payload snippet.
An in-place conversion between the old and new innodb_encrypt_log=ON
format is technically possible. No such conversion has been
implemented, though. There is some overhead with respect to the
unencrypted format (innodb_encrypt_log=OFF): At the end of each
mini-transaction, right before the CRC-32C, additional 8 bytes will be
reserved for a nonce (really, log_sys.get_flushed_lsn()), which forms
a part of an initialization vector.
log_t::FORMAT_ENC_11: The new format identifier, a UTF-8 encoding of
🗝 U+1F5DD OLD KEY (encryption). In this format, everything except the
types and lengths of log records will be encrypted. Thus, unlike in
FORMAT_10_8, also page identifiers and FILE_ records will be encrypted.
The initialization vector (IV) consists of the 8-byte nonce as well as
the type and length byte(s) of the first record of the mini-transaction.
Page identifiers will no longer form any part of the IV.
The old log_t::FORMAT_ENC_10_8 (innodb_encrypt_log=ON) will be supported
both by mariadb-backup and by crash recovery. Downgrade from the new
format will only be possible if the new server has been running or
restarted with innodb_encrypt_log=OFF. If innodb_encrypt_log=ON,
only the new log_t::FORMAT_ENC_11 will be written.
log_t::is_recoverable(): A new predicate, which holds for all 3
formats.
recv_sys_t::tmp_buf: A heap-allocated buffer for decrypting a
mini-transaction, or for making the wrap-around of a memory-mapped
log file contiguous.
recv_sys_t::start_lsn: The start of the mini-transaction.
Updated at the start of parse_tail().
log_decrypt_mtr(): Decrypt a mini-transaction in recv_sys.tmp_buf.
Theoretically, when reading the log via pread() rather than a read-only
memory mapping, we could modify the contents of log_sys.buf in place.
If we did that, we would have to re-read the last log block into
log_sys.buf before resuming writes, because otherwise that block could be
re-written as a mix of old decrypted data and new encrypted data, which
would cause a subsequent recovery failure unless the log checkpoint had
been advanced beyond this point.
log_decrypt_legacy(): Decrypt a log_t::FORMAT_ENC_10_8 record snippet
on stack. Replaces recv_buf::copy_if_needed().
recv_sys_t::get_backup_parser(): Return a recv_sys_t::parser, that is,
a pointer to an instantiation of parse_mmap or parse_mtr for the current
log format.
recv_sys_t::parse_mtr(), recv_sys_t::parse_mmap(): Add a parameter
template<uint32_t> for the current log_sys.format.
log_parse_start(): Validate the CRC-32C of a mini-transaction.
This has been split from the recv_sys_t::parse() template to
reduce code duplication. These two are the lowest-level functions
that will be instantiated for both recv_buf and recv_ring.
recv_sys_t::parse(): Split into ::log_parse_start() and parse_tail().
Add a parameter template<uint32_t format> to specialize for
log_sys.format at compilation time.
recv_sys_t::parse_tail(): Operate on pointers to contiguous
mini-transaction data. Use a parameter template<bool ENC_10_8>
for special handling of the old innodb_encrypt_log=ON format.
The former recv_buf::get_buf() is being inlined here.
Much of the logic is split into non-inline functions, to avoid
duplicating a lot of code for every template expansion.
log_crypt: Encrypt or decrypt a mini-transaction in place in the
new innodb_encrypt_log=ON format. We will use temporary buffers
so that encryption_ctx_update() can be invoked on integer multiples
of MY_AES_BLOCK_SIZE, except for the last bytes of the encrypted
payload, which will be encrypted or decrypted in place thanks to
ENCRYPTION_FLAG_NOPAD.
log_crypt::append(): Invoke encryption_ctx_update() in MY_AES_BLOCK_SIZE
(16-byte) blocks and scatter/gather shorter data blocks as needed.
log_crypt::finish(), Handle the last (possibly incomplete) block as a
special case, with ENCRYPTION_FLAG_NOPAD.
mtr_t::parse_length(): Parse the length of a log record.
mtr_t::encrypt(): Use log_crypt instead of the old log_encrypt_buf().
recv_buf::crc32c(): Add a parameter for the initial CRC-32C value.
recv_sys_t::rewind(): Operate on pointers to the start of the
mini-transaction and to the first skipped record.
recv_sys_t::trim(): Declare as ATTRIBUTE_COLD so that this rarely
invoked function will not be expanded inline in parse_tail().
recv_sys_t::parse_init(): Handle INIT_PAGE or FREE_PAGE while scanning
to the end of the log.
recv_sys_t::parse_page0(): Handle WRITE to FSP_SPACE_SIZE and
FSP_SPACE_FLAGS.
recv_sys_t::parse_store_if_exists(), recv_sys_t::parse_store(),
recv_sys_t::parse_oom(): Handle page-level log records.
mlog_decode_varint_length(): Make use of __builtin_clz() to avoid a loop
when possible.
mlog_decode_varint(): Define only on const byte*, as
ATTRIBUTE_NOINLINE static because it is a rather large function.
recv_buf::decode_varint(): Trivial wrapper for mlog_decode_varint().
recv_ring::decode_varint(): Special implementation.
log_page_modify(): Note that a page will be modified in recovery.
Split from recv_sys_t::parse_tail().
log_parse_file(): Handle non-page log records.
log_record_corrupted(), log_unknown(), log_page_id_corrupted():
Common error reporting functions.
mtr_t::get_log_size(): Remove.
mtr_t::crc32c(): New function: compute CRC-32C and determine the size,
including the sequence byte and the CRC-32C.
mtr_t::encrypt(): Return the size, similar to crc32c().
mtr_t::log_file_op(): Return the size written.
fil_name_write(): Remove. Let us invoke mtr_t::log_file_op() directly.
fil_names_clear(): Keep track of the available size without
invoking mtr_t::get_log_size().
mtr_buf_t::m_size: Remove.
mtr_buf_t::list_t: Use ilist instead of sized_ilist.
mtr_buf_t::for_each_block(): Remove. Let us allow iteration via
begin() and end(), without any lambda function objects.
But rocksdb.bulk_load_unsorted_rev and rocksdb.bulk_load_unsorted
succeed under non-debug builds, and because it was slow at 87 seconds)
there is a --big-test criteria for these tests.
m_charset_codec is uninitalized when calling m_make_unpack_info_func.
In the cases where m_make_unpack_info_func is one of:
* Rdb_key_def::make_unpack_unknown_varchar
* Rdb_key_def::make_unpack_unknown
* Rdb_key_def::dummy_make_unpack_info
the m_charset_coded that forms the first argument to this function
is unused.
In these limited cases we initialize the m_charset_codec member
as the only use is to pass though to the m_make_unpack_info_func
Ultimately MemorySanitizer shouldn't error on this as all
of these 3 functions clearly have the attribute
__unused__ on their first argument where the m_charset_coded is
passed.
buf_pool_t::shrink(): When relocating a buffer page, invalidate
the page identifier of the original page so that buf_pool_t::page_guess()
will not accidentally match it.
Before commit b6923420f3 (MDEV-29445)
introduced buf_pool_t::page_guess(), the validity of block descriptor
pointers was checked by buf_pool_t::is_uncompressed(const buf_block_t*).
Therefore, any block descriptors that used to be part of a larger buffer
pool would not be accessed at all.
This race condition is very hard to reproduce. To reproduce it,
an optimistic btr_pcur_t::restore_position() or similar will have to
be invoked on a block that has been relocated by buf_pool_t::shrink()
and that had not meanwhile been replaced with another page with a
different identifier.
Reviewed by: Vladislav Lesin
The pmem_cvap() function currently uses the '.arch armv8.2-a' directive
for the 'dc cvap' instruction. This will cause build errors below when
compiling for ARMv9 systems. Update the '.arch' directive to 'armv9.4-a'
to ensure compatibility with ARMv9 architectures.
{standard input}: Assembler messages:
{standard input}:169: Error: selected processor does not support `retaa'
{standard input}:286: Error: selected processor does not support `retaa'
make[2]: *** [storage/innobase/CMakeFiles/innobase_embedded.dir/build.make:
1644: storage/innobase/CMakeFiles/innobase_embedded.dir/sync/cache.cc.o]
Error 1
Signed-off-by: Ruiqiang Hao <Ruiqiang.Hao@windriver.com>
Problem:
=======
- During copy algorithm, InnoDB fails to detect the duplicate
key error for unique hash key blob index. Unique HASH index
treated as virtual index inside InnoDB.
When table does unique hash key , server does search on
the hash key before doing any insert operation and
finds the duplicate value in check_duplicate_long_entry_key().
Bulk insert does all the insert together when copy of
intermediate table is finished. This leads to undetection of
duplicate key error while building the index.
Solution:
========
- Avoid bulk insert operation when table does have unique
hash key blob index.
dict_table_t::can_bulk_insert(): To check whether the table
is eligible for bulk insert operation during alter copy algorithm.
Check whether any virtual column name starts with DB_ROW_HASH_ to
know whether blob column has unique index on it.
srv_printf_innodb_monitor(): After acquiring a latch,
abort the iteration if innodb_adaptive_hash_index=OFF.
If the adaptive hash index was disabled in a concurrently
executing thread, btr_search_sys_t::partition::clear() would have
freed part->heap, leading to us dereferencing a null pointer.
Reviewed by: Thirunarayanan Balathandayuthapani
Tested by: Saahil Alam
srv_printf_innodb_monitor(): After acquiring a latch,
abort the iteration if innodb_adaptive_hash_index=OFF.
If the adaptive hash index was disabled in a concurrently
executing thread, btr_search_sys_t::partition::clear() would have
freed part->heap, leading to us dereferencing a null pointer.
Reviewed by: Thirunarayanan Balathandayuthapani
Tested by: Saahil Alam
Problem:
=======
- InnoDB modifies the PAGE_ROOT_AUTO_INC value on clustered index
root page. But before committing the PAGE_ROOT_AUTO_INC changes
mini-transaction, InnoDB does bulk insert operation and
calculates the page checksum and store as a part of redo log in
mini-transaction. During recovery, InnoDB fails to validate the
page checksum.
Solution:
========
- Avoid writing the persistent auto increment value before doing
bulk insert operation.
- For bulk insert operation, persistent auto increment value
is written via btr_write_autoinc while applying the buffered
insert operation.
during row insert
DB_FOREIGN_DUPLICATE_KEY in row_ins_duplicate_error_in_clust was
motivated by handling the cascade changes during versioned operations.
It was found though, that certain row_update_for_mysql calls could
return DB_FOREIGN_DUPLICATE_KEY, even if there's no foreign relations.
Change DB_FOREIGN_DUPLICATE_KEY to DB_DUPLICATE_KEY in
row_ins_duplicate_error_in_clust.
It will be later converted to DB_FOREIGN_DUPLICATE_KEY in
row_ins_check_foreign_constraint if needed.
Additionally, ha_delete_row should return neither. Ensure it by an
assertion.
We had a protection against it, by allowing versioned delete if:
trx->id != table->vers_start_id()
For replace this check fails: replace calls ha_delete_row(record[2]), but
table->vers_start_id() returns the value from record[0], which is irrelevant.
The same problem hits Field::is_max, which may have checked the wrong record.
Fix:
* Refactor Field::is_max to optionally accept a pointer as an argument.
* Refactor vers_start_id and vers_end_id to always accept a pointer to the
record. there is a difference with is_max is that is_max accepts the pointer to
the
field data, rather than to the record.
Method val_int() would be too effortful to refactor to accept the argument, so
instead the value in record is fetched directly, like it is done in
Field_longlong.
storage/connect/tabxml.cpp:1616:46: warning: ‘*this.XMLCOL::Long’ may be used uninitialized [-Wmaybe-uninitialized]
1616 | Valbuf = (char*)PlugSubAlloc(g, NULL, n * (Long + 1));
In this case we are overriding the class 3 lines earlier. Add
some constructs to preserve the value of Long as the old class
being replaced with a new subclass.
storage/connect/filter.cpp:1594:13: warning: ‘*this.FILTERCMP::FILTERX.FILTERX::FILTER.FILTER::Opc’ is used uninitialized [-Wuninitialized]
1594 | Bt = OpBmp(g, Opc);
The construction of FILTERCMP has an Opc(ode) and this should be passed
rather than relying on the uninitialized value of the parent class.
Also save its value in the class.
Clang processes the "int x=x" code from UNINIT_VAR
literally resulting in an uninitialized read and write.
This is something we want to avoid. Gcc does the same
without emitting warnings.
As the UNINIT_VAR was around avoiding compiler false detection,
and clang doesn't false detect, is default action is a
noop.
Static analysers (examined Infer and SonarQube) are
clang based and have the same detection.
Using a __clang__ instead of WITH_UBSAN would acheived
a better result, however reviewer wanted to keep WITH_UBSAN
only.
LINT_INIT_STRUCT is no longer required, even a gcc-4.8.5
doesn't warn with this construct removed which matches
the comment that it was fixed in gcc ~4.7.
mysql.cc - all paths in com_go populate buff before use.
json: Item_func_json_merge::val_str
LINT_INIT(js2) unneeded as usage in the previous statements
it is explicitly initialized to NULL.
Item_func_json_contains_path::val_bool n_found is guarded
by an uninitialized read by mode_one and from
gcc-13.3.0 in Ubuntu 24.04 this is detected. As the only
remaining use of LINIT_INIT this usage has been applied
with the expanded macro with the unused _lint define removed.
The LINT_INIT macro is removed.
_ma_ck_delete - org_key only valid under share->now_transactional
likewise with _ma_ck_write_btree_with_log
connect engine never used anything that FORCE_INIT_OF_VARS
would change.
Reviewer: Monty
When compiling with debug and valgrind on Ubuntu 22.04 with gcc 11.4.0,
the used framsize was 16400 bytes because of the code:
completion_callback callbacks[1000];
Reducing the array to 950 fixes the issue
buf_pool_t::shrink(): If we run out of pages to evict from buf_pool.LRU,
abort the operation. Also, do not leak the spare block that we may have
allocated.
buf_pool_t::shrink(): When relocating a dirty page of the temporary
tablespace, reset the oldest_modification() on the discarded block,
like we do for persistent pages in buf_flush_relocate_on_flush_list().
buf_pool_t::resize(): Add debug assertions to catch this error earlier.
This bug does not seem to affect non-debug builds.
Reviewed by: Thirunarayanan Balathandayuthapani
Server-level UNIQUE constraints (namely, WITHOUT OVERLAPS and USING HASH)
only worked with InnoDB in REPEATABLE READ isolation mode, when the
constraint was checked first and then the row was inserted or updated.
Gap locks prevented race conditions when a concurrent connection
could've also checked the constraint and inserted/updated a row
at the same time.
In READ COMMITTED there are no gap locks. To avoid race conditions,
we now check the constraint *after* the row operation. This is
enabled by the HA_CHECK_UNIQUE_AFTER_WRITE table flag that InnoDB
sets in the READ COMMITTED transactions.
Checking the constraint after the row operation is more complex.
First, the constraint will see the current (inserted/updated) row,
and needs to skip it. Second, IGNORE operations become tricky,
as we need to revert the insert/update and continue statement execution.
write_row() (INSERT IGNORE) is reverted with delete_row(). Conveniently
it deletes the current row, that is, the last inserted row.
update_row(a,b) (UPDATE IGNORE) is reverted with a reversed update,
update_row(b,a). Conveniently, it updates the current row too.
Except in InnoDB when the PK is updated - in this case InnoDB internally
performs delete+insert, but does not move the cursor, so the "current"
row is the deleted one and the reverse update doesn't work.
This combination now throws an "unsupported" error and will
be fixed in MDEV-37233
The test innodb.undo_truncate occasionally demonstrates a race condition
where scrubbing is writing zeroes to a freed undo page, and
innodb_undo_log_truncate=ON truncating the same tablespace. The
truncation is an exception to the rule that InnoDB tablespace file sizes
can only grow, never shrink.
The fields fil_space_t::size and fil_node_t::size are protected by
fil_system.mutex, which used to be a highly contended resource. We
do not want to revert back to acquiring the mutex in fil_space_t::io()
because that would introduce an obvious scalability bottleneck.
fil_space_t::flush_freed(): Do not try to scrub pages of the undo
tablespace in order to prevent a race condition between io()
and undo tablespace truncation.
fil_space_t::io(): Prevent a null pointer dereference when reporting
an out-of-bounds access to the non-first file of the system or
temporary tablespace. Do not invoke set_corrupted() after an
out-of-bounds asynchronous read.
Note: fil_space_t::flush_freed() may only invoke PUNCH_RANGE on
page_compressed tablespaces, never on an undo tablespace.
ha_innobase::store_lock(): Do not create a read view or start the
transaction if innodb_snapshot_isolation=OFF. This should save some
resources with the default settings.
ha_innobase::store_lock(): Set also trx->will_lock when starting
a transaction at SERIALIZABLE isolation level. This fixes up
commit 7fbbbc983f (MDEV-36330).
buf_page_get_low(): Do not expect a valid state of
buf_page_t::in_zip_hash for blocks that are not file pages.
This debug assertion had been misplaced in
commit aaef2e1d8c (MDEV-27058)
that removed the condition
block->page.state() == BUF_BLOCK_FILE_PAGE.