log_mmap(): If the MAP_SYNC|MAP_SHARED_VALIDATE operation (PMEM)
failed and the path is not in /dev/shm (which we treat as PMEM),
proceed to try regular MAP_SHARED read-only mapping. This allows
somewhat more efficient crash recovery, basically with an I/O
buffer that is not limited by innodb_log_buffer_size.
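The fallback order described above can be sketched roughly as follows. This is a minimal illustration for Linux, not the actual log_mmap() code: the names log_map_result and map_log_sketch are hypothetical, and the /dev/shm special case is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical result type for this sketch.
struct log_map_result
{
  void *ptr;  // MAP_FAILED if both attempts failed
  bool pmem;  // true only if the MAP_SYNC mapping succeeded
};

// Try a persistent-memory style mapping first; if that is refused,
// fall back to a regular read-only MAP_SHARED mapping.
inline log_map_result map_log_sketch(int fd, std::size_t size)
{
  assert(fd >= 0);
#if defined(MAP_SYNC) && defined(MAP_SHARED_VALIDATE)
  void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
  if (p != MAP_FAILED)
    return {p, true};
#endif
  // Read-only mapping: sufficient for parsing the log during recovery.
  return {mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0), false};
}
```

On file systems without DAX support the first mmap() call is expected to fail, so the caller ends up with the read-only mapping and pmem == false.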
Reviewed by: Thirunarayanan Balathandayuthapani
Skip all-zero pages in the index file.
They can happen normally if the ma_checkpoint_background
thread flushes some later page first (e.g. page 50 before page 48).
Also:
* don't do alloca() in a loop
* correct the check in ma_crypt_index_post_read_hook(),
the page can be completely full
* compilation failure in ma_open.c:1289:
comparison is always false due to limited range of data type
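The all-zero page check mentioned above could look roughly like this. It is only a sketch: page_is_all_zero and count_pages_to_process are hypothetical names, and the real code works on Aria index pages rather than on vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if every byte of the page is zero.
inline bool page_is_all_zero(const std::uint8_t *page, std::size_t size)
{
  assert(page != nullptr || size == 0);
  return std::all_of(page, page + size,
                     [](std::uint8_t b) { return b == 0; });
}

// Counts the pages a reader would actually process, skipping pages
// that have not been written yet and are therefore still all zero.
inline std::size_t count_pages_to_process(
    const std::vector<std::vector<std::uint8_t>> &pages)
{
  std::size_t n = 0;
  for (const auto &p : pages)
    if (!page_is_all_zero(p.data(), p.size()))
      ++n;
  return n;
}
```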
The backup of encrypted Aria tables was not supported.
Added support for this. One complication is that the page checksum is
for the unencrypted page. To be able to verify the checksum I have to
temporarily decrypt the page.
In the backup we store the encrypted pages.
Other things:
- Fixed some (not critical) memory leaks in mariabackup
The design of "binlog group commit" involves carrying some state across
transaction boundaries. This includes trx_t::commit_lsn, which keeps track
of how much write-ahead log needs to be written. Unfortunately, this
field was not reset in a commit where a log write was elided. That would
cause an unnecessary wait in a subsequent read-only transaction that
happened to reuse the same transaction object.
trx_deregister_from_2pc(): Reset trx->commit_lsn so that
an earlier write that was executed in the same client connection
will not result in an unnecessary wait during a subsequent read
operation.
trx_commit_complete_for_mysql(): Unless we are inside a binlog
group commit, reset trx->commit_lsn.
unlock_and_close_files(): Reset trx->commit_lsn after durably
writing the log, and remove a redundant log write call from some
callers.
trx_t::rollback_finish(): Clear commit_lsn, because a rolled-back
transaction will not need to be durably written.
trx_t::clear_and_free(): Wrapper function to suppress a debug check
in trx_t::free().
Also, remove some redundant ut_ad(!trx->will_lock) that will be checked
in trx_t::free().
Reviewed by: Vladislav Vaintroub
Ever since mysql/mysql-server@377774689b
was applied in commit 2e814d4702d71a04388386a9f591d14a35980bfed
the data member dict_table_t::fk_max_recusive_level is never being
read, only initialized as 0. Let us follow the lead of
mysql/mysql-server@b22ac19f10
and remove this useless field.
During undo tablespace truncation, pages with LSNs older than the
tablespace creation LSN may still exist in the buffer pool and get
submitted to the doublewrite buffer. When mtr_t::commit_shrink() is
invoked shortly after doublewrite batch submission,
this can lead to out-of-bounds write errors.
Fix:
===
buf_dblwr_t::flush_buffered_writes_completed() : skip doublewrite
processing for pages where the page LSN is older
than the tablespace creation LSN. Such pages belong to the old
tablespace before truncation and should not be written through the
doublewrite buffer.
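The LSN comparison described above can be illustrated with a small filter. This is a sketch under assumptions: page_sketch and pages_for_doublewrite are hypothetical names, and the real check lives inside buf_dblwr_t::flush_buffered_writes_completed().

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of a page queued for doublewrite.
struct page_sketch
{
  std::uint32_t page_no;
  std::uint64_t lsn;
};

// Returns the page numbers that should still go through the doublewrite
// buffer: pages whose LSN is older than the tablespace creation LSN
// belong to the tablespace as it was before truncation and are skipped.
inline std::vector<std::uint32_t> pages_for_doublewrite(
    const std::vector<page_sketch> &batch, std::uint64_t create_lsn)
{
  std::vector<std::uint32_t> out;
  for (const page_sketch &p : batch)
    if (p.lsn >= create_lsn)
      out.push_back(p.page_no);
  assert(out.size() <= batch.size());
  return out;
}
```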
buf_dblwr_t::create(): Create the doublewrite buffer in a single
atomic mini-transaction. Do not write any log records for
initializing any doublewrite buffer pages, in order to avoid
recovery failure with innodb_log_archive=ON starting from the
very beginning.
The mtr.commit() in buf_dblwr_t::create() was observed to
comprise 295 mtr_t::m_memo entries: 1 entry for the
fil_system.sys_space and the rest split between page 5 (TRX_SYS)
and page 0 (allocation metadata). We are nowhere near the
sux_lock::RECURSIVE_MAX limit of 65535 per page descriptor.
Reviewed by: Thirunarayanan Balathandayuthapani
Tested by: Saahil Alam
columns
Issue:
- Purge thread attempts to purge a secondary index record that is not
delete-marked.
Root Cause:
- When a secondary index includes a virtual column whose v_pos is
greater than the number of fields in the clustered index record, the
virtual column is incorrectly skipped while reading from the undo
record.
- This leads the purge logic to incorrectly assume it is safe to purge
the secondary index record.
- The code also confuses the nth virtual column with the nth stored
column when writing ordering columns at the end of the undo record.
Fix:
- In trx_undo_update_rec_get_update(): Skip a virtual column only
when v_pos == FIL_NULL, not when v_pos is greater than the number
of fields.
- In trx_undo_page_report_modify(): Ensure ordering columns are
written based on the correct stored-column positions, without
confusing them with virtual-column positions.
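The change to the skip condition can be contrasted in a few lines. This is a sketch only: the function names are hypothetical, the "old" predicate is modeled on the description above rather than copied from the source, and FIL_NULL_SKETCH stands in for FIL_NULL (0xFFFFFFFF).

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t FIL_NULL_SKETCH = 0xFFFFFFFFU;

// Old behaviour as described: a virtual column was skipped whenever its
// v_pos exceeded the number of fields in the clustered index record.
inline bool skip_virtual_column_old(std::uint32_t v_pos,
                                    std::uint32_t n_clust_fields)
{
  return v_pos > n_clust_fields;
}

// Fixed behaviour: skip only when v_pos is the FIL_NULL marker.
inline bool skip_virtual_column_fixed(std::uint32_t v_pos)
{
  return v_pos == FIL_NULL_SKETCH;
}

// Small demonstration: v_pos 7 with 3 clustered index fields.
inline void virtual_column_skip_demo()
{
  assert(skip_virtual_column_old(7, 3));     // wrongly skipped before
  assert(!skip_virtual_column_fixed(7));     // now read from the undo record
  assert(skip_virtual_column_fixed(FIL_NULL_SKETCH));
}
```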
Test was affected by incompletely closed preceding connections.
Make test agnostic to concurrent connections by querying
performance_schema.status_by_thread only for connections that it
uses.
Fix the log message about an unexpected table in the system tablespace,
as the current message can be misleading due to the still existing (but
already deprecated) system tables SYS_DATAFILES and SYS_TABLESPACES,
reported in
- MDEV-38412
Also add an informative message with the name of the unexpected table
in the system tablespace.
fulltext index between databases
Problem:
========
- When renaming/moving an InnoDB table with fulltext indexes
between databases, the server would crash because
InnoDB attempted to open auxiliary tables using
the old database name instead of the new one.
When creating an auxiliary index, InnoDB creates the temporary
fulltext sort index using the old table reference, which builds
the auxiliary table name with the old database name as its prefix.
The FTS Doc ID size optimization in row_merge_create_fts_sort_index()
used dict_table_get_n_rows() to decide between 4-byte
and 8-byte Doc IDs for memory optimization. But
dict_table_get_n_rows() returns estimated statistics that
may be stale or inaccurate, potentially leading to wrong
size decisions and data corruption if 4-byte
Doc IDs are chosen when 8-byte ones are actually needed.
Solution:
=========
fts_rename_aux_tables(): Iterate through all table indexes
and ensure all fulltext indexes are properly renamed
row_merge_create_fts_sort_index() : Use new_table instead
of old_table when creating temporary FTS sort indexes.
row_merge_build_indexes(): Refactored the memory optimization
logic that determines the Doc ID size for the
temporary FTS sort index.
- When adding FTS index for the first time
(DICT_TF2_FTS_ADD_DOC_ID), always use 8-byte Doc IDs
- For existing FTS tables or user-supplied Doc ID columns,
use fts_get_max_doc_id() approach to check actual maximum Doc ID
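The decision rule above can be sketched as a small function. The name doc_id_bytes is hypothetical, and the 32-bit threshold is an assumption made for illustration; the actual limit used by InnoDB may differ.

```cpp
#include <cassert>
#include <cstdint>

// Chooses the Doc ID width for the temporary FTS sort index.
// adding_doc_id_first_time models DICT_TF2_FTS_ADD_DOC_ID;
// max_doc_id models the value obtained via fts_get_max_doc_id().
inline unsigned doc_id_bytes(bool adding_doc_id_first_time,
                             std::uint64_t max_doc_id)
{
  if (adding_doc_id_first_time)
    return 8;  // no reliable maximum is known yet: be conservative
  // Assumed threshold: use 4 bytes only if the actual maximum fits.
  return max_doc_id <= UINT32_MAX ? 4 : 8;
}
```

The point of the sketch is that the input is the actual maximum Doc ID rather than an estimated row count, so a stale estimate cannot select a width that is too small.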
When running tests in environments without a network interface (e.g.
Podman container launched with `--network=none`), Spider was not able to
retrieve a hardware address to generate a node ID. This triggered a
warning in the server log, causing MTR to fail multiple Spider tests due
to unexpected warnings and output result mismatches with:
[Warning] mariadbd: Can't get hardware address with error 2
Fix this by logging Spider hardware address errors to server log only
by setting it as a NOTE. This does not pollute the client output.
When `my_gethwaddr` fails, the code zeroes out the address buffer,
resulting in a `spider_unique_id` formatted like `-000000000000-PID-`,
which is fully valid and emitting warnings was a bit overkill to begin
with.
row_log_table_apply_ops(), row_log_apply_ops(): When switching
buffers, do not subtract addresses of unrelated buffers, but
simply set mrec_end relative to mrec. This makes the code more
readable and will avoid a runtime error from clang -fsanitize=undefined
depending on what kind of addresses are being returned by
row_log_block_allocate().
fil_space_t::free_page(): Turn a parameter into a template parameter,
and remove some duplicated code. This fixes an error that was flagged
by clang++-21 -fsanitize=memory in buf_page_create().
fseg_create(): Merge the parameter reserved_extent to n_reserved.
This fixes an error about n_reserved being uninitialized.
fseg_alloc_free_page_low(): Simplify a debug assertion.
flst_add_last(), flst_remove(): Reduce the scope of a conditionally
initialized variable.
Reviewed by: Thirunarayanan Balathandayuthapani
Problem:
=======
- Multiple user threads wait for all encryption threads to start
before returning control to the user. But in fil_crypt_thread(),
InnoDB signals that the thread has started after incrementing the
srv_n_fil_crypt_threads_started variable, using pthread_cond_signal(),
which is only guaranteed to wake one waiter. For multiple waiters,
pthread_cond_broadcast() is more appropriate as
it wakes all waiting threads.
Solution:
========
fil_crypt_thread(): Use pthread_cond_broadcast instead of
pthread_cond_signal(fil_crypt_cond) to wake multiple waiter
threads
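The difference matters whenever more than one thread waits on the same condition. The following sketch uses std::condition_variable::notify_all() as the C++ counterpart of pthread_cond_broadcast(); start_gate_sketch and released_waiters are hypothetical names and not part of InnoDB.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical gate that lets callers wait until n worker threads
// have reported that they started.
class start_gate_sketch
{
  std::mutex mutex_;
  std::condition_variable cond_;
  unsigned started_ = 0;

public:
  // Called by each worker once it is running.
  void thread_started()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ++started_;
    }
    cond_.notify_all();  // wake every waiter, not just one
  }

  // Blocks until at least n workers have started.
  void wait_until_started(unsigned n)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [&] { return started_ >= n; });
  }
};

// Starts n_waiters waiting threads and n_workers worker threads and
// returns how many waiters were released.
inline unsigned released_waiters(unsigned n_waiters, unsigned n_workers)
{
  start_gate_sketch gate;
  std::atomic<unsigned> released{0};
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < n_waiters; i++)
    threads.emplace_back([&] {
      gate.wait_until_started(n_workers);
      ++released;
    });
  for (unsigned i = 0; i < n_workers; i++)
    threads.emplace_back([&] { gate.thread_started(); });
  for (std::thread &t : threads)
    t.join();
  return released.load();
}
```

With notify_one() in place of notify_all(), a run with more waiters than workers could leave some waiters blocked, which is the situation the commit addresses.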
This fixes "external" XA commits initiated by syncing the binlog of a
slave with a master that has done XA commits.
Note that spider_internal_xa_rollback_by_xid already did not emit the
error in the same scenario.
This issue seems to be already fixed. However, to avoid future problems:
Wsrep_server_service::release_storage_service
Add assertion that storage service is not nullptr and
contains thd. In production binaries add guard to
not use nullptr.
Wsrep_server_service::release_high_priority_service
Add assertion that high_priority service is not nullptr and
contains thd. In production binaries add guard to
not use nullptr.
wsrep_is_BF_lock_timeout
Remove printing of record lock because its page might
not be latched leading to assertion in multi-master
testing.
page_zip_decompress_low(): Skip the 8 bytes after FIL_PAGE_TYPE
to be consistent with the rest of ROW_FORMAT=COMPRESSED page handling.
The 8 bytes after FIL_PAGE_TYPE may differ between the compressed
and uncompressed copy of a page when SET GLOBAL innodb_encrypt_tables
is being executed while data pages are being evicted and reloaded
into the buffer pool.
They were based on the maximum possible key tuple length, which can be
much larger than the real data size.
The return value is used by handler::keyread_time(), which is used
to estimate the cost of range access.
This could cause range access not to be picked, even if it uses
the clustered PK and reads about 8% of the table.
The fix is to add KEY::stat_storage_length (next to KEY::rec_per_key) and
have the storage engine fill it in handler::info(HA_STATUS_CONST).
Currently, only InnoDB fills this based on its internal statistics:
index->stat_index_size and ib_table->stat_n_rows.
Also changed:
- In handler::calculate_costs(), use ha_keyread_clustered_time() when
computing clustered PK read cost, not ha_keyread_time().
The fix is OFF by default and enabled by setting FIX_INDEX_LOOKUP_COST flag
in @@new_mode.
The reason for the crash was that two tables were updating Aria's
TRN->used_instances at the same time. This could happen when a
thread started a sub transaction with Aria tables, like reading a
stored procedure from the proc table, at the same time another table
was clearing the table list after committing a transaction involving
Aria tables. The timing window for this to happen is very small,
which is why we did not notice this issue for 5 years.
The fix was to change reset_thd_trn() to clear the table links directly,
instead of calling _ma_reset_trn_for_table() which removed the table
from the linked list, which included updating TRN->used_instances.
This bug could happen when maria_commit() or maria_rollback() were
called, but not in ha_maria::implicit_commit(), which already had
a fix for this problem.
Other things:
- Removed duplicate call to thd_set_ha_data(thd, maria_hton, trn)
in ha_maria::implicit_commit() and maria_commit() when TRN is null.
Issue:
Adding a SPATIAL INDEX triggers a debug assertion failure when the
mysql.innodb_table_stats schema is missing.
Fix:
dict_stats_update_transient_for_index() now exits early with empty
statistics for any index that is not a regular B-tree.
Problem:
After the re-design of `UPDATE` and `DELETE` in MDEV-28883, the call
to find the select handler is missing. This prevents the server from
handing over multi-update/multi-delete queries to storage engines
capable of executing such queries on their own (e.g., ColumnStore).
MDEV-32382 introduced a check in `find_select_handler_inner` function,
that blocked pushdown of queries involving CTEs, without checking if
the storage engine is capable of handling such queries.
Fix:
Add a call to find the select handler for the engine involved in the
multi-update/multi-delete query, allowing the storage engines to
execute such queries.
Fix the `find_select_handler_inner` function by allowing the storage
engine's create_select functions to decide the pushdown of queries
involving CTEs.
- use shared key for sequence update certification
- employ native replication's code to apply changes for sequences
which handles all corner cases properly
- fix the tests to allow more transactions using sequences to be
accepted
That way the sequence is always updated to the maximum value
independent of the order of updates, and shared certification keys
improve the acceptance ratio of concurrent transactions that
use sequences. This is reflected in the test changes.
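Why taking the maximum makes the result independent of the apply order can be shown in a few lines. This is a sketch assuming an ascending sequence; apply_sequence_updates is a hypothetical name and does not correspond to a function in the replication code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Applies replicated sequence values in the given order, always keeping
// the largest value seen so far (ascending sequence assumed).
inline std::int64_t apply_sequence_updates(
    std::int64_t current, std::initializer_list<std::int64_t> updates)
{
  for (std::int64_t v : updates)
    current = std::max(current, v);
  return current;
}
```

Because max() is commutative and associative, applying {10, 30, 20} and {30, 20, 10} leaves the sequence at the same value.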
The scenario of the bug is the following. Before killing the server some
transaction A starts undo log writing in some undo segment U of rseg R.
It writes its trx_id into the undo log header. Then new trx_id is assigned
to transaction B, but undo log hasn't been started yet. Then transaction
A commits and writes trx_no into its undo log header. Transaction B
starts writing undo log into the undo segment U. So we have the
following undo logs in the undo segments U:
... undo log 1...
... undo log 2...
...
undo log A, trx_id: L, trx_no: M, ...
undo log B, trx_id: N, trx_no: 0, ...
Where L < N < M.
Then server is killed.
On recovery the maximum trx_no is extracted from each rseg, and the
maximum trx_no among all rsegs plus one is considered as a new value
for server-wide transaction id/no counter.
For each undo segment of each rseg we read the last undo log header. If
the last undo log is committed, then we read trx_no from the header,
otherwise we treat trx_id as trx_no. The maximum trx_no from all undo
log segments of the current rseg is treated as the maximum trx_no of the
rseg.
For the above case the undo log of transaction B is not committed and
its trx_no is 0. So we read trx_id and treat it as trx_no. But N < M. If
U is the last modified undo segment in rseg R, and trx_(id/no) N is the
maximum trx_no among all rsegs, then there can be the case when after
recovery some transaction with trx_no_C, such as N < trx_no_C <= M, is
committed.
During purge we store the trx_no of the last parsed undo log of a
committed transaction in purge_sys.tail.trx_no. So if the last parsed
undo log is the undo log of transaction A (transaction B was rolled back
on recovery and its undo log was also removed from the undo segment U),
then purge_sys.tail.trx_no = M. Then if some other transaction C with
trx_no_C <= M is being committed and purged, we hit
"tail.trx_no <= last_trx_no" assertion failure in
purge_sys_t::choose_next_log(), because purge queue is min-heap of
(trx_no, trx_sys.rseg_array index) pairs, where the key is trx_no, and it
must not be that trx_no of the last parsed undo log of a committed
transaction is greater than the last trx_no of the rseg at the top of
the queue.
The fix is to read the trx_no of the previous to last undo log in undo
segment, if the last undo log in that undo segment is not committed, and
set trx_no=max(trx_id of the last undo log, trx_no of the previous to
last undo log) during recovery.
We can do this because we need to extract the maximum
value of trx_no or trx_id of the undo log segment, and the maximum value
is either trx_id of the last undo log or trx_no of the previous to
last undo log, because an undo segment can be assigned to only one
transaction at a time, and undo logs in the undo segment are ordered by
trx_id.
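The recovery rule can be sketched as follows. The struct and function names are hypothetical, and an undo segment is modeled as a plain vector of headers ordered by trx_id, with trx_no == 0 meaning "not committed".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified undo log header.
struct undo_log_hdr_sketch
{
  std::uint64_t trx_id;
  std::uint64_t trx_no;  // 0 if the transaction was not committed
};

// Old rule: look only at the last undo log of the segment.
inline std::uint64_t max_trx_no_old(
    const std::vector<undo_log_hdr_sketch> &logs)
{
  if (logs.empty())
    return 0;
  const undo_log_hdr_sketch &last = logs.back();
  return last.trx_no ? last.trx_no : last.trx_id;
}

// Fixed rule: if the last undo log is not committed, also consider the
// trx_no of the previous to last undo log.
inline std::uint64_t max_trx_no_fixed(
    const std::vector<undo_log_hdr_sketch> &logs)
{
  if (logs.empty())
    return 0;
  const undo_log_hdr_sketch &last = logs.back();
  if (last.trx_no)
    return last.trx_no;
  const std::uint64_t prev_no =
      logs.size() > 1 ? logs[logs.size() - 2].trx_no : 0;
  return std::max(last.trx_id, prev_no);
}
```

For the scenario above with L = 10, N = 20 and M = 30, the old rule yields 20 for the segment while the fixed rule yields 30, so no later transaction can be assigned a trx_no at or below M.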
Reviewed by Marko Mäkelä.
buf_do_flush_list_batch(): Release and reacquire buf_pool.flush_list_mutex
after every 32 iterations, similar to how buf_flush_LRU_list_batch()
releases buf_pool.mutex ever since
commit 27ff972be2 (MDEV-26827 fixup).
This regression was introduced in
commit 22b62edaed (MDEV-25113)
and made more prominent by the recent
commit a7f0d79f8c (MDEV-35155).
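The pattern of periodically yielding a mutex during a long batch can be sketched like this. flush_batch_sketch is a hypothetical name; the real function walks buf_pool.flush_list and does actual page I/O.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Processes n_pages items while holding the mutex, but releases and
// reacquires it after every 32 iterations so that other threads
// waiting for the mutex get a chance to run.
// n_releases counts how often the mutex was yielded.
inline std::size_t flush_batch_sketch(std::mutex &m, std::size_t n_pages,
                                      std::size_t &n_releases)
{
  std::unique_lock<std::mutex> lock(m);
  std::size_t flushed = 0;
  for (std::size_t i = 0; i < n_pages; i++)
  {
    ++flushed;  // stands in for flushing one page
    if ((i + 1) % 32 == 0)
    {
      lock.unlock();
      ++n_releases;
      lock.lock();
    }
  }
  assert(lock.owns_lock());
  return flushed;
}
```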
Reviewed by: Thirunarayanan Balathandayuthapani
Tested by: Saahil Alam
Tested by: Rahul Raj
Problem was that row_mysql_read_blob_ref can return NULL
in case when blob datatype is used in a key and its real
value is NULL. This NULL pointer is then used in memcpy
function in wsrep_store_key_val_for_row. However,
memcpy is defined so that argument 2 must not be NULL.
Fixed by adding conditions before memcpy functions so
that argument 2 is always non NULL.
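A guard of this kind could look roughly as follows. copy_key_part is a hypothetical helper used only for illustration; wsrep_store_key_val_for_row() itself is considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies len bytes of key data into dst, but never passes a NULL
// source pointer to memcpy(). Returns the number of bytes copied.
inline std::size_t copy_key_part(unsigned char *dst,
                                 const unsigned char *src, std::size_t len)
{
  assert(dst != nullptr);
  if (src == nullptr || len == 0)
    return 0;  // NULL blob value: nothing to copy
  std::memcpy(dst, src, len);
  return len;
}
```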
Additional fixes after review
- Removed unnecessary copying key data from one buffer to another.
Use original key data buffer as input and temporary buffer as output.
Extra output buffer is needed because strnxfrm might expand input buffer
contents.
- Removed unnecessary initialization of variables and moved
declarations to where they are first needed.
- Removed unnecessary initialization of the temporary buffer because
we already keep track of the actual filled length.
- Removed an unnecessary extra call to charset->strnxfrm
multiple file tablespace
Problem:
=======
- innochecksum was incorrectly interpreting doublewrite buffer
pages as index pages, causing confusion about stale tables
in the system tablespace.
- innochecksum fails to parse the multi-file system tablespace
Solution:
========
1. Rewriting the checksum of doublewrite buffer pages
is skipped.
2. Introduced the option --tablespace-flags which can be used
to initialize page size. This option can handle the ibdata2,
ibdata3 etc without parsing ibdata1.
This is a cherry-pick of commit 9f8716ab61