A comment in the test says:
# do not clean up - we do not know which of the three has been released
# so the --reap command may hang because the command that is being executed
# in that connection is still running/waiting
SET GLOBAL innodb_adaptive_hash_index_cells may be executed
while the server is running. This parameter will be effectively
multiplied by innodb_adaptive_hash_index_parts, because each partition will
contain its own hash table.
Previously, the number of hash table cells in the InnoDB adaptive hash index
depended on the initial innodb_buffer_pool_size and was insufficient
for some workloads, leading to excessively long hash bucket chains.
If innodb_adaptive_hash_index_cells is at its minimum and default value
16381 at startup, it will be derived from the innodb_buffer_pool_size,
for backward compatibility.
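For illustration, a hypothetical session might look like this (the value
is an example, not a recommendation):

  SET GLOBAL innodb_adaptive_hash_index_cells = 1048576;
  -- the effective total is roughly this value multiplied by
  -- innodb_adaptive_hash_index_parts, one hash table per partition
  SELECT @@innodb_adaptive_hash_index_cells,
         @@innodb_adaptive_hash_index_parts;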
When a query uses several window functions:
SELECT
WIN_FUNC1() OVER (ORDER BY 'const', col1),
WIN_FUNC2() OVER (ORDER BY col1 RANGE BETWEEN CURRENT ROW
AND 5 FOLLOWING)
compare_window_funcs_by_window_specs() tries to make window specs reuse
each other's ORDER BY lists. If the lists produce the same order (as
above), the window spec of WIN_FUNC2 will reuse the ORDER BY list of
WIN_FUNC1. However, WIN_FUNC2 has a RANGE-type window frame. It expects
an ORDER BY list with one element, which it uses to compute frame bounds.
Providing it with the ORDER BY list from WIN_FUNC1 ('const', col1) caused
an assertion failure.
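A concrete query of this shape (a hypothetical example; any RANGE frame
combined with a constant-prefixed ORDER BY in another window should do)
would be:

  SELECT
    ROW_NUMBER() OVER (ORDER BY 'const', col1),
    SUM(col1) OVER (ORDER BY col1 RANGE BETWEEN CURRENT ROW
                                            AND 5 FOLLOWING)
  FROM t1;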
The fix is to:
Use the original ORDER BY list when constructing RANGE-type frames
Fix an apparent typo bug in compare_window_funcs_by_window_specs(): the
assignment
win_spec1->save_order_list= win_spec2->order_list;
saved the order list from the wrong spec. Instead, take the one from
win_spec1.
Let us remove the thread-local variable mariadb_stats and introduce
trx_t::pages_accessed, trx_t::active_handler_stats for more
efficiently maintaining some statistics inside InnoDB.
buf_pool.stat.n_page_gets: Reimplemented as Atomic_counter<ulint>.
This will no longer track some accesses in the background where
!current_thd() || !thd_to_trx(current_thd).
trx_t::free(), trx_t::commit_cleanup(): Apply pages_accessed
to buf_pool.stat.n_page_gets.
buf_read_ahead_report(): Report a completed read-ahead batch.
ha_innobase::estimate_rows_upper_bound(): Do not bother updating
trx_t::op_info around some quick arithmetic.
ha_innobase::records_in_range(): Do invoke mariadb_set_stats.
This will change some ANALYZE FORMAT=JSON SELECT results of the test
main.rowid_filter_innodb.
Reviewed by: Vladislav Lesin
Tested by: Saahil Alam
The problem was that wsrep was disconnected and new slave threads tried
to connect to the cluster but failed, as we were in a disconnected state.
Allow changing wsrep_slave_threads only when wsrep is enabled
and we are connected to a cluster. In other cases, report an
error and issue a warning.
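A sketch of the new behavior (the exact error text is an assumption):

  -- while the node is disconnected from the cluster:
  SET GLOBAL wsrep_slave_threads = 8;
  -- now fails with an error instead of spawning applier threads
  -- that cannot connect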
Tests on clang-20/21 had the tests below overrunning the stack.
check_stack_overrun() checked the stack earlier in the function with a
2*STACK_MIN_SIZE margin, but execution within the processing goes deeper
than where check_stack_overrun() was called.
Raising STACK_MIN_SIZE to 44k was sufficient (40k wasn't).
execution_constants was also tested; however, the tests mentioned here
are bigger.
Perfschema tests:
* perfschema.statement_program_nesting_event_check
* perfschema.statement_program_nested
* perfschema.max_program_zero
A small increase to the test thread-stack-size on statement_program_lost_inst
allows this test to continue to pass.
The problem was that for partitioned tables the base table storage engine
is DB_TYPE_PARTITION_DB, which naturally differs from DB_TYPE_INNODB,
so the operation was not allowed in Galera.
Fixed by requesting the implementing storage engine for partitioned
tables, i.e. table->file->partition_ht(), or, if that does not exist,
using the base table storage engine. The resulting storage engine
type is then used to decide whether the operation is allowed when
wsrep_mode=DISALLOW_LOCAL_GTID. Operations on the InnoDB
storage engine, i.e. DB_TYPE_INNODB, should be allowed.
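A minimal sketch (hypothetical table name): with this fix, the operation
below resolves the implementing engine to InnoDB and is therefore allowed
under wsrep_mode=DISALLOW_LOCAL_GTID:

  SET GLOBAL wsrep_mode = 'DISALLOW_LOCAL_GTID';
  CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB
    PARTITION BY HASH (a) PARTITIONS 4;
  INSERT INTO t1 VALUES (1);  -- partition_ht() resolves to InnoDB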
Problem:
=======
- InnoDB statistics calculation for a table is done every 10 seconds
by default in the background thread dict_stats_thread()
- Doing multiple ALTER TABLE..ALGORITHM=COPY causes
dict_stats_thread() to lag behind; therefore the calculation of stats
for the newly created intermediate table gets delayed
Fix:
====
- Stats calculation for the newly created intermediate table is made
independent of the background thread. After copying completes,
stats for the new table are calculated as part of
ALTER TABLE ... ALGORITHM=COPY.
dict_stats_rename_table(): Rename the table statistics from the
intermediate table to the new table.
alter_stats_rebuild(): Remove the table name from the warning,
because this warning can be printed for the intermediate table as well.
Alter table using copy algorithm now calls alter_stats_rebuild()
under a shared MDL lock on a temporary #sql-alter- table,
differing from its previous use only during ALGORITHM=INPLACE
operations on user-visible tables.
dict_stats_schema_check(): Added a separate check for table
readability before checking for tablespace existence.
This allows detecting the existence of the persistent statistics
storage earlier and falling back to transient statistics.
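A sketch of the user-visible effect (hypothetical table name;
mysql.innodb_table_stats is the persistent statistics storage):

  ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=COPY;
  -- stats for t1 are recalculated as part of the ALTER itself,
  -- without waiting for the background dict_stats_thread()
  SELECT * FROM mysql.innodb_table_stats WHERE table_name = 't1';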
This is a cherry-pick fix of mysql commit@cfe5f287ae99d004e8532a30003a7e8e77d379e3
Modified srv_start to call fil_crypt_threads_init() only
when srv_read_only_mode is not set.
Modified encryption.innodb-read-only to capture the number of
encryption threads created in both scenarios: when the server is
not read-only and when it is.
Timestamp-versioned row deletion was exposed to a collision problem: if
the current timestamp hadn't changed, a sequence of row delete+insert
could get a duplicate-key error. The row delete would find another,
conflicting history row and return an error.
This is true both for REPLACE and DELETE statements; however, in REPLACE
the "optimized" path is usually taken, especially in the tests. There, a
single versioned row update is substituted for delete+insert. In the end,
both paths end up as ha_update_row + ha_write_row.
The solution is to handle a history collision explicitly.
From the design perspective, the user shouldn't experience loss of
history rows, unless there's a technical limitation.
In contrast, trxid-based changes should never generate history for the
same transaction; see MDEV-15427.
If two operations on the same row happen so quickly that they get the
same timestamp, the history row shouldn't be lost. We can still write a
history row, though it'll have row_start == row_end.
We cannot store more than one such historical row, as this would violate
the unique constraint on row_end. So we will have to physically delete
the row if the history row is already available.
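A hypothetical repro sketch with timestamp-based system versioning,
assuming both REPLACE statements execute within the same timestamp:

  CREATE TABLE t1 (a INT PRIMARY KEY) WITH SYSTEM VERSIONING;
  INSERT INTO t1 VALUES (1);
  REPLACE INTO t1 VALUES (1);
  REPLACE INTO t1 VALUES (1);  -- could fail with a duplicate-key error
                               -- on the history row's row_end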
In this commit:
1. Improve TABLE::delete_row to handle the history collision: if the
update results in a duplicate-key error, delete the row for real.
2. Use TABLE::delete_row in the non-optimized path of REPLACE, where the
system-versioned case now belongs entirely.
We had a protection against it, by allowing versioned delete only if:
trx->id != table->vers_start_id()
For REPLACE this check fails: REPLACE calls ha_delete_row(record[2]), but
table->vers_start_id() returns the value from record[0], which is
irrelevant. The same problem hits Field::is_max, which may have checked
the wrong record.
Fix:
* Refactor Field::is_max to optionally accept a pointer as an argument.
* Refactor vers_start_id and vers_end_id to always accept a pointer to
the record. The difference from is_max is that is_max accepts a pointer
to the field data, rather than to the record.
Refactoring val_int() to accept the argument would be too much effort,
so instead the value in the record is fetched directly, as is done in
Field_longlong.
It appears that some error conditions don't store error information in
the Diagnostics_area. For example, when the table_def::compatible_with()
check fails, the error message is stored in Relay_log_info instead.
This results in seemingly identical votes, and a zero error buffer size
breaks wsrep-lib logic, as it relies on the error buffer size to decide
whether voting took place.
To account for this, first try to obtain error info from the
Diagnostics_area, then fall back to Relay_log_info. If that fails, use
some "random" data to distinguish this condition from success in
production.
Instead of using DBUG_EXECUTE_IF fault injection, let us construct
a minimal corrupted log file that will produce an OPT_PAGE_CHECKSUM
mismatch without depending on CMAKE_BUILD_TYPE=Debug.
The issue was that unpack_vcol_info_from_frm() wrongly linked the used
sequence tables into tables->internal_tables when more than one sequence
table was used.
Other things:
- Fixed internal_table_exists() to take db into account.
(This makes the code easier to read; since we were comparing
pointers, the old code also worked.)
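A minimal sketch of the affected pattern (hypothetical names), where one
table definition uses more than one sequence:

  CREATE SEQUENCE s1;
  CREATE SEQUENCE s2;
  CREATE TABLE t1 (
    a INT DEFAULT (NEXT VALUE FOR s1),
    b INT DEFAULT (NEXT VALUE FOR s2)
  );
  INSERT INTO t1 VALUES (DEFAULT, DEFAULT);
  -- both sequence tables must be linked into internal_tables correctly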
Problem:
=======
When InnoDB encounters a corrupted page during crash recovery, the
server would abort due to improper handling of page locks
and space references. The recovery process was not properly
cleaning up resources when corruption was detected,
leading to an inconsistent state and server termination.
Solution:
=========
recover_low(): Move page lock recursive acquisition
after deferred/non-deferred page creation logic to
ensure consistent locking behavior for both code paths.
Ensure proper block recursive unlock for non-deferred tablespaces
recv_recover_page(): Simplify corrupted page cleanup by
removing redundant space reference handling.
After 633417308f (MDEV-37312), lookup_handler is locked with F_WRLCK
because it may be used for deleting rows.
lookup_handler is locked with F_WRLCK after prune_partitions(), but the
main handler is locked before that, and it might expect all partitions
to be in the read list, non-pruned.
Let's prepare the lookup handler before prune_partitions().
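A hypothetical shape of the affected workload (assuming a partitioned
table with a long unique key, where REPLACE may use the lookup handler
to delete a duplicate row):

  CREATE TABLE t1 (a INT, b TEXT, UNIQUE KEY (a, b)) ENGINE=InnoDB
    PARTITION BY HASH (a) PARTITIONS 4;
  REPLACE INTO t1 VALUES (1, 'x');
  REPLACE INTO t1 VALUES (1, 'x');  -- the second REPLACE deletes the
                                    -- conflicting row via lookup_handler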
Fixed the following issues:
- aria_read_index() and aria_read_data(), used by mariabackup, checked
the wrong status from maria_page_crc_check().
- Both functions retried infinitely if the CRC did not match.
- Wrong usage of ma_check_if_zero() in maria_page_crc_check()
Author: Thirunarayanan Balathandayuthapani <thiru@mariadb.com>
- Removed duplicate words, like "the the" and "to to"
- Removed duplicate lines (one double sort line found in mysql.cc)
- Fixed some typos found while searching for duplicate words.
Command used to find duplicate words:
egrep -rI "\s([a-zA-Z]+)\s+\1\s" | grep -v param
Thanks to Artjoms Rimdjonoks for the command and pointing out the
spelling errors.
Ensure that Annotate_rows is always written directly after the GTID
information, before any table_map events.
Before this patch, the following problems existed when mixing
transactional and non-transactional tables in the same statement:
- Annotate_rows could be written after row events or in the next GTID
event.
- See rpl_row_mixing_engines
- Annotate_rows was not always written to the binary log when a
transactional table was rolled back due to an error but a
non-transactional table was updated.
- See sp_trans_log, binlog_row_mix_innodb_myisam
Fixed by writing the Annotate_rows event into the non-transactional
cache if non-transactional tables are used; otherwise, write the
event into the transactional cache.
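For example (a minimal sketch with hypothetical table names), a single
statement touching both engines could previously emit Annotate_rows in
the wrong place:

  CREATE TABLE t_innodb (a INT) ENGINE=InnoDB;
  CREATE TABLE t_myisam (a INT) ENGINE=MyISAM;
  SET SESSION binlog_annotate_row_events = 1;
  INSERT INTO t_myisam SELECT a FROM t_innodb;
  -- the Annotate_rows event now goes into the non-transactional cache,
  -- so it is binlogged right after the GTID event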
These changes were done as part of fixing
MDEV-36858 MariaDB MyISAM secondary indexes silently break for
tables > 10B rows
Changes done in myisamchk:
- Tables that are checked are opened in readonly mode if --force is not
used.
- *.MYD files will be opened in readonly mode for repair if --quick
is used.
- Added information about check progress if --verbose is used.
- Output information about repaired/checked rows every 10000 rows instead
of every 1000 rows. Note that this also affects aria_chk
- Store open file mode in share->index_mode and share->data_mode instead
of in share->mode.
- Added new option --keys-active= as a simpler version of keys-used.
- Changed output for "myisamchk -dvv" to get nicer output for tables
with 10 billion rows.
This was caused by a wrong handling of bitmaps in
copy_not_changed_fields() that did not work on big endian machines.
This bug caused recovery of Aria files to fail on big endian machines
like s390x or Sparc.
This issue was noticed by the bulk_insert_crash.test on the
s390x builder.
The function row_purge_reset_trx_id(), which had been introduced in
commit 3c09f148f3 (MDEV-12288),
causes some extra buffer pool and redo log activity, leading to a
significant performance regression under some workloads.
This is currently the most significant performance issue, after
commit acd071f599 (MDEV-21923)
fixed the InnoDB LSN allocation and MDEV-19749 the MDL bottleneck in 12.1.
The purpose of row_purge_reset_trx_id() was to ensure that we can
easily identify records for which no history exists. If DB_TRX_ID
is 0, we could avoid looking up the transaction to see if the
history is accessible or the record is implicitly locked.
To avoid trx_sys_t::find() for stale DB_TRX_ID values, we can refer
to trx_t::max_inactive_id, which was introduced in
commit 4105017a58 (MDEV-30357).
Instead of comparing DB_TRX_ID to 0, we may compare it to this
cached value. The cache would be updated by
trx_sys_t::find_same_or_older(), which is invoked for some operations
on secondary indexes.
row_purge_reset_trx_id(): Remove. We will no longer reset the
DB_TRX_ID to 0 after an INSERT. We will retain a single undo log
for all operations, though. Before MDEV-12288, there had been
separate insert_undo and update_undo logs.
row_check_index(): No longer warn
"InnoDB: Clustered index record with stale history in table".
lock_rec_queue_validate(), lock_rec_convert_impl_to_expl(),
row_vers_impl_x_locked_low(): Instead of comparing the DB_TRX_ID
to 0, compare it to trx_t::max_inactive_id.
In dict0load.cc we will not spend any effort to avoid extra
trx_sys.find() calls for stale DB_TRX_ID in dictionary tables.
This code does not currently use trx_t objects, and therefore
we cannot easily access trx_t::max_inactive_id. Loading table
definitions into the InnoDB data dictionary cache (dict_sys)
should be a very rare operation.
Reviewed by: Vladislav Lesin
The innodb_encrypt_log=ON subformat of FORMAT_10_8 is inefficient,
because a new encryption or decryption context is being set up for
every log record payload snippet.
An in-place conversion between the old and new innodb_encrypt_log=ON
format is technically possible. No such conversion has been
implemented, though. There is some overhead with respect to the
unencrypted format (innodb_encrypt_log=OFF): At the end of each
mini-transaction, right before the CRC-32C, additional 8 bytes will be
reserved for a nonce (really, log_sys.get_flushed_lsn()), which forms
a part of an initialization vector.
log_t::FORMAT_ENC_11: The new format identifier, a UTF-8 encoding of
🗝 U+1F5DD OLD KEY (encryption). In this format, everything except the
types and lengths of log records will be encrypted. Thus, unlike in
FORMAT_10_8, also page identifiers and FILE_ records will be encrypted.
The initialization vector (IV) consists of the 8-byte nonce as well as
the type and length byte(s) of the first record of the mini-transaction.
Page identifiers will no longer form any part of the IV.
The old log_t::FORMAT_ENC_10_8 (innodb_encrypt_log=ON) will be supported
both by mariadb-backup and by crash recovery. Downgrade from the new
format will only be possible if the new server has been running or
restarted with innodb_encrypt_log=OFF. If innodb_encrypt_log=ON,
only the new log_t::FORMAT_ENC_11 will be written.
log_t::is_recoverable(): A new predicate, which holds for all 3
formats.
recv_sys_t::tmp_buf: A heap-allocated buffer for decrypting a
mini-transaction, or for making the wrap-around of a memory-mapped
log file contiguous.
recv_sys_t::start_lsn: The start of the mini-transaction.
Updated at the start of parse_tail().
log_decrypt_mtr(): Decrypt a mini-transaction in recv_sys.tmp_buf.
Theoretically, when reading the log via pread() rather than a read-only
memory mapping, we could modify the contents of log_sys.buf in place.
If we did that, we would have to re-read the last log block into
log_sys.buf before resuming writes, because otherwise that block could be
re-written as a mix of old decrypted data and new encrypted data, which
would cause a subsequent recovery failure unless the log checkpoint had
been advanced beyond this point.
log_decrypt_legacy(): Decrypt a log_t::FORMAT_ENC_10_8 record snippet
on stack. Replaces recv_buf::copy_if_needed().
recv_sys_t::get_backup_parser(): Return a recv_sys_t::parser, that is,
a pointer to an instantiation of parse_mmap or parse_mtr for the current
log format.
recv_sys_t::parse_mtr(), recv_sys_t::parse_mmap(): Add a parameter
template<uint32_t> for the current log_sys.format.
log_parse_start(): Validate the CRC-32C of a mini-transaction.
This has been split from the recv_sys_t::parse() template to
reduce code duplication. These two are the lowest-level functions
that will be instantiated for both recv_buf and recv_ring.
recv_sys_t::parse(): Split into ::log_parse_start() and parse_tail().
Add a parameter template<uint32_t format> to specialize for
log_sys.format at compilation time.
recv_sys_t::parse_tail(): Operate on pointers to contiguous
mini-transaction data. Use a parameter template<bool ENC_10_8>
for special handling of the old innodb_encrypt_log=ON format.
The former recv_buf::get_buf() is being inlined here.
Much of the logic is split into non-inline functions, to avoid
duplicating a lot of code for every template expansion.
log_crypt: Encrypt or decrypt a mini-transaction in place in the
new innodb_encrypt_log=ON format. We will use temporary buffers
so that encryption_ctx_update() can be invoked on integer multiples
of MY_AES_BLOCK_SIZE, except for the last bytes of the encrypted
payload, which will be encrypted or decrypted in place thanks to
ENCRYPTION_FLAG_NOPAD.
log_crypt::append(): Invoke encryption_ctx_update() in MY_AES_BLOCK_SIZE
(16-byte) blocks and scatter/gather shorter data blocks as needed.
log_crypt::finish(): Handle the last (possibly incomplete) block as a
special case, with ENCRYPTION_FLAG_NOPAD.
mtr_t::parse_length(): Parse the length of a log record.
mtr_t::encrypt(): Use log_crypt instead of the old log_encrypt_buf().
recv_buf::crc32c(): Add a parameter for the initial CRC-32C value.
recv_sys_t::rewind(): Operate on pointers to the start of the
mini-transaction and to the first skipped record.
recv_sys_t::trim(): Declare as ATTRIBUTE_COLD so that this rarely
invoked function will not be expanded inline in parse_tail().
recv_sys_t::parse_init(): Handle INIT_PAGE or FREE_PAGE while scanning
to the end of the log.
recv_sys_t::parse_page0(): Handle WRITE to FSP_SPACE_SIZE and
FSP_SPACE_FLAGS.
recv_sys_t::parse_store_if_exists(), recv_sys_t::parse_store(),
recv_sys_t::parse_oom(): Handle page-level log records.
mlog_decode_varint_length(): Make use of __builtin_clz() to avoid a loop
when possible.
mlog_decode_varint(): Define only on const byte*, as
ATTRIBUTE_NOINLINE static because it is a rather large function.
recv_buf::decode_varint(): Trivial wrapper for mlog_decode_varint().
recv_ring::decode_varint(): Special implementation.
log_page_modify(): Note that a page will be modified in recovery.
Split from recv_sys_t::parse_tail().
log_parse_file(): Handle non-page log records.
log_record_corrupted(), log_unknown(), log_page_id_corrupted():
Common error reporting functions.
Replication can stop with an error if a Heartbeat log event is sent to a
replica during rotation. There are two bugs at play:
1. Prior to MDEV-30128 (added in 11.0), there is a bug when checking
legacy events. When the replica rotates its relay logs, it
initializes its Format_description_log_event with binlog version 3
(this is hard-coded). So immediately after rotation (and until a
new Format_descriptor with binlog_format 4 is sent from the
master), the IO thread is expecting binlog_format 3 (i.e. it will
call queue_old_event() for incoming events). This invalidates any
events that are sent with an event type higher than 14. In theory,
we wouldn't expect any events to be sent in between a rotate and
the next format descriptor log event, but if a long enough period
of time passes in between, the primary will generate and send a
Heartbeat event (of type 27). In that case, the slave will see the
heartbeat event of type 27, see that it is higher than 14, and fail
with an error mentioning 'Found invalid event in binary log', with
the expected log coordinates of the new log (which are
optimistically populated from the Rotate log event, not the new
event).
2. In all versions of MariaDB (11.0+), there is a bug when checking
the state of a Heartbeat log event, in that it doesn't consider a
rotated binary log. The check is meant to ensure that the
heartbeat provided by the master (i.e. the state of the master) is
greater than or equal to the state of the slave. In other words,
it checks that the slave isn't ahead of the master. However, if
the filename provided by the master heartbeat event is different
than the filename saved for the slave's state, the check always
fails. This is broken, because when the master rotates its logs,
the new binary log file will have a different filename (i.e. an
incremented index counter suffix). For example, if the master
rotates its binary logs from master-bin.000002 to
master-bin.000003, master-bin.000003 is ahead of
master-bin.000002, but the slave will see a difference between the
filenames and fail the check.
To fix the first problem, this patch disallows passing a heartbeat
event into queue_old_event (which is the source of the error, as it
tries to parse a heartbeat log event). This function (queue_old_event)
was removed with MDEV-30128, so bypassing it for heartbeat events is
not consequential (and it is already also done for
Format_description_events, which are not supported in old binlog file
versions). Note that backporting all of MDEV-30128 was also considered,
but this is less risky for GA.
To fix the second problem, we simply ignore heartbeat events on the
slave if the filenames don't match. This is because during rotation,
it can appear that the slave is ahead of the master, which breaks the
validity of the check (i.e. the check is to ensure the master is
ahead of the slave).
Additionally note that this patch restores a heartbeat check that was
incorrectly removed in 780db8e252
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
Ever since commit 685d958e38
(MDEV-14425), mariadb-backup --backup has had some trouble keeping up
with write workloads of the mariadbd server.
Debarun Banerjee found out that mariadb-backup --backup was
copying the log in the wrong way and not pausing when it made
sense to do so. This change includes his fix as well as some
dead code removal from xtrabackup_copy_mmap_logfile().
Some earlier changes to the default behaviour of mariadb-backup --backup
will be reverted, by making the configuration parameters OFF by default.
These parameters were basically working around this bug:
* commit 652f33e0a4 (MDEV-30000)
introduced --innodb-log-checkpoint-now and made it ON by default.
Making the server execute a log checkpoint can be really I/O intensive.
* commit 6acada713a (MDEV-34062)
introduced --innodb-log-file-mmap and made it ON by default on
Linux and FreeBSD. There are no documented semantics what should
happen to a memory mapping when there are concurrent pwrite(2)
operations by other processes. While it appears to work, it is safer
to default to clearly documented semantics.
xtrabackup_copy_logfile(): Add a parameter early_exit.
Always read a log snippet to the start of recv_sys.buf and assign
recv_sys.len to the read length. We used to shift recv_sys.buf
with memmove(). However, on recv_sys_t::PREMATURE_EOF we cannot know
which part of the mini-transaction was correctly read, because that
part of the ib_logfile0 may be concurrently modified by the server.
So, we will reread everything from the start of the mini-transaction.
xtrabackup_backup_func(): Invoke xtrabackup_copy_logfile(true),
allowing it to stop on every recv_sys_t::PREMATURE_EOF.
This will also avoid repeated "Retry" messages when there is no
more redo log to copy.
get_current_lsn(): Execute FLUSH ENGINE LOGS to ensure that
InnoDB will complete any buffered writes to the ib_logfile0
and ensure that everything up to the current LSN has been
written.
backup_wait_for_commit_lsn(): Wait for as much as is really needed.
This avoids an extra 5-second wait at the end of the backup.
xtrabackup_copy_mmap_logfile(): Remove some dead code, and add
debug assertions to demonstrate that the parser can only return
recv_sys_t::OK or recv_sys_t::GOT_EOF.