Commit graph

10,230 commits

Author SHA1 Message Date
Daniele Sciascia
648d2da8f2 MDEV-33540 Avoid writes to TRX_SYS page during mariabackup operations
Fix a scenario where `mariabackup --prepare` fails with assertion
`!m_modifications || !recv_no_log_write'  in `mtr_t::commit()`. This
happens if the prepare step of the backup encounters a data directory
which happens to store wsrep xid position in TRX SYS page (this is no
longer the case since 10.3.5). And since MDEV-17458,
`trx_rseg_array_init()` handles this case by copying the xid position
to rollback segments, before clearing the xid from TRX SYS page.
However, this step should be avoided when `trx_rseg_array_init()` is
invoked from mariabackup. The relevant code was surrounded by the
condition `srv_operation == SRV_OPERATION_NORMAL`. An additional check
ensures that we are not trying to copy a xid position which has
already zeroed.
2024-03-08 16:02:38 +01:00
mariadb-DebarunBanerjee
afe9632913 MDEV-33593 Auto increment deadlock error causes ASSERT in subsequent save point
The issue here is ha_innobase::get_auto_increment() could cause a
deadlock involving auto-increment lock and rollback the transaction
implicitly. For such cases, storage engines usually call
thd_mark_transaction_to_rollback() to inform SQL engine about it which
in turn takes appropriate actions and close the transaction. In innodb,
we call it while converting Innodb error code to MySQL.

However, since ::innobase_get_autoinc() returns void, we skip the call
for error code conversion and also miss marking the transaction for
rollback for deadlock error. We assert eventually while releasing a
savepoint as the transaction state is not active.

Since convert_error_code_to_mysql() is handling some generic error
handling part, like invoking the callback when needed, we should call
that function in ha_innobase::get_auto_increment() even if we don't
return the resulting mysql error code back.
2024-03-07 21:54:06 +05:30
Thirunarayanan Balathandayuthapani
6e5333fc8c MDEV-32445 InnoDB may corrupt its log before upgrading it on startup
Problem:
========
 During upgrade, InnoDB does write the redo log for adjusting
the tablespace size or tablespace flags even before the log
has upgraded to configured format. This could lead to data
inconsistent if any crash happened during upgrade process.

Fix:
===
srv_start(): Write the tablespace flags adjustment, increased
tablespace size redo log only after redo log upgradation.

log_write_low(), log_reserve_and_write_fast(): Check whether
the redo log is in physical format.
2024-03-06 15:01:26 +05:30
Thirunarayanan Balathandayuthapani
738da4918d MDEV-32346 Assertion failure sym_node->table != NULL in pars_retrieve_table_def on UPDATE
- During update operation, InnoDB should avoid the initializing
the FTS_DOC_ID of foreign table if the foreign table is discarded
2024-03-06 14:04:49 +05:30
Marko Mäkelä
0772ac1f16 MDEV-33508 Performance regression due to frequent scan of full buf_pool.flush_list
buf_flush_page_cleaner(): Remove a loop that had originally been added
in commit 9d1466522e (MDEV-32029) and made
redundant by commit 5b53342a6a (MDEV-32588).

Starting with commit d34479dc66 (MDEV-33053)
this loop would cause a significant performance regression in workloads
where buf_pool.need_LRU_eviction() constantly holds in
buf_flush_page_cleaner().

Thanks to Steve Shaw of Intel for noticing this.

Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
2024-02-28 12:48:38 +02:00
mariadb-DebarunBanerjee
969669767b MDEV-33011 mariabackup --backup: FATAL ERROR: ... Can't open datafile cool_down/t3
The root cause is the WAL logging of file operation when the actual
operation fails afterwards. It creates a situation with a log entry for
a operation that would always fail. I could simulate both the backup
scenario error and Innodb recovery failure exploiting the weakness.

We are following WAL for file rename operation and once logged the
operation must eventually complete successfully, or it is a major
catastrophe. Right now, we fail for rename and handle it as normal error
and it is the problem.

I created a patch to address RENAME operation to a non existing schema
where the destination schema directory is missing. The patch checks for
the missing schema before logging in an attempt to avoid the failure
after WAL log is written/flushed. I also checked that the schema cannot
be dropped or there cannot be any race with other rename to the same
file. This is protected by the MDL lock in SQL today.

The patch should this be a good improvement over the current situation
and solves the issue at hand.
2024-02-27 17:59:20 +05:30
Marko Mäkelä
71834ccb6c MDEV-24671 fixup: Remove srv_max_n_threads
The variable srv_max_n_threads lost its usefulness in
commit db006a9a43 (MDEV-21452)
and commit e71e613353 (MDEV-24671).
2024-02-27 11:14:28 +02:00
Thirunarayanan Balathandayuthapani
57cc8605eb MDEV-19044 Alter table corrupts while applying the modification log
Problem:
========
- InnoDB reads the length of the variable length field wrongly
while applying the modification log of instant table.

Solution:
========
rec_init_offsets_comp_ordinary(): For the temporary instant
file record, InnoDB should read the length of the variable length
field from the record itself.
2024-02-27 12:59:46 +05:30
Thirunarayanan Balathandayuthapani
e309e02447 MDEV-30655 IMPORT TABLESPACE fails with column count or index count mismatch
update_vcol_pos(): pass table id as table_id_t instead of ulint.
2024-02-23 13:42:46 +05:30
Thirunarayanan Balathandayuthapani
e66928ab28 MDEV-33462 Server aborts while altering an InnoDB statistics table
Problem:
=======
- When online alter of InnoDB statistics table happens,
any transaction which updates the statistics table
has to read the undo log and log the DML changes during
transaction commit. Applying undo log
(UndorecApplier::apply_undo_rec) requires a shared
lock on dictionary cache but dict_stats_save() already
holds write lock on dictionary cache. This leads to
abort of server during commit of statistics table changes.

Solution:
========
- Disallow LOCK=NONE operation for the InnoDB statistics table.
The reasoning is that statistics tables are typically
rather small, so any blocking would be rather short.
Writes to the statistics tables should be a rare operation.
2024-02-22 16:57:04 +05:30
Marko Mäkelä
042c3fc432 MDEV-24167 fixup: srw_lock_debug for SUX_LOCK_GENERIC
srw_lock_debug::have_rd(), srw_lock_debug::have_wr():
For SUX_LOCK_GENERIC (no futex based synchronization primitives),
we cannot check if the underlying srw_lock is held by us.

Thanks to Dmitry Shulga for pointing out this build failure.
2024-02-21 13:03:25 +02:00
Thirunarayanan Balathandayuthapani
903ae30069 MDEV-30655 IMPORT TABLESPACE fails with column count or index count mismatch
Problem:
========
Currently import operation fails with schema mismatch when
cfg file has hidden fts document id and hidden fts document index.

Fix:
====
To fix this issue, simply add the fts doc id column,
indexes in table definition and try to import the table.
In case of success:
1) update the fts document id in sys columns.
2) update the number of columns in sys tables.
3) insert the new fts index entry in sys indexes table
and sys fields.
4) Reload the table with new table definition
2024-02-20 19:48:25 +05:30
Marko Mäkelä
53c6c823dc MDEV-33464 Crash when innodb_max_undo_log_size is set to innodb_page_size*4294967296
purge_sys_t::truncating_tablespace(): Clamp the
innodb_max_undo_log_size to the maximum number of pages
before converting the result into a 32-bit unsigned integer.

This fixes up commit f8c88d905b (MDEV-33213).

In later major versions, we would use 32-bit unsigned integer here
due to commit ca501ffb04
and the code would crash also on 64-bit processors.

Reviewed by: Debarun Banerjee
2024-02-15 12:34:04 +02:00
Marko Mäkelä
691f923906 Merge 10.5 into 10.6 2024-02-13 20:42:59 +02:00
Marko Mäkelä
b770633e07 Merge 10.4 into 10.5 2024-02-13 14:25:21 +02:00
Marko Mäkelä
68d9deb69a MDEV-33332 SIGSEGV in buf_read_ahead_linear() when bpage is in buf_pool.watch
buf_read_ahead_linear(): If buf_pool.watch_is_sentinel(*bpage),
do not attempt to read the page frame because the pointer would be null
for the elements of buf_pool.watch[].

Hitting this bug requires the use of a non-default value of
innodb_change_buffering.
2024-02-13 14:10:44 +02:00
Marko Mäkelä
ca88eac835 MDEV-30528 CREATE FULLTEXT INDEX assertion failure WITH SYSTEM VERSIONING
ha_innobase::check_if_supported_inplace_alter(): Require ALGORITHM=COPY
when creating a FULLTEXT INDEX on a versioned table.

row_merge_buf_add(), row_merge_read_clustered_index(): Remove the parameter
or local variable history_fts that had been added in the attempt to fix
MDEV-25004.

Reviewed by: Thirunarayanan Balathandayuthapani
Tested by: Matthias Leich
2024-02-12 16:52:55 +01:00
Marko Mäkelä
81f3e97bc8 MDEV-33383: Corrupted red-black tree due to incorrect comparison
fts_doc_id_cmp(): Replaces several duplicated functions for
comparing two doc_id_t*. On IA-32, AMD64, ARMv7, ARMv8, RISC-V
this should make use of some conditional ALU instructions.
On POWER there will be conditional jumps. Unlike the original
functions, these will return the correct result even if the
difference of the two doc_id does not fit in the int data type.
We use static_assert() and offsetof() to check at compilation time
that this function is compatible with the rbt_create() calls.

fts_query_compare_rank(): As documented, return -1 and not 1
when the rank are equal and r1->doc_id < r2->doc_id. This will
affect the result of ha_innobase::ft_read().

fts_ptr2_cmp(), fts_ptr1_ptr2_cmp(): These replace
fts_trx_table_cmp(), fts_trx_table_id_cmp().
The fts_savepoint_t::tables will be sorted by dict_table_t*
rather than dict_table_t::id. There was no correctness bug in
the previous comparison predicates. We can avoid one level of
unnecessary pointer dereferencing in this way.
Actually, fts_savepoint_t is duplicating trx_t::mod_tables.
MDEV-33401 was filed about removing it.

The added unit test innodb_rbt-t covers both the previous buggy comparison
predicate and the revised fts_doc_id_cmp(), using keys which led to
finding the bug. Thanks to Shaohua Wang from Alibaba for providing the
example and the revised comparison predicate.

Reviewed by: Thirunarayanan Balathandayuthapani
2024-02-12 17:01:45 +02:00
Marko Mäkelä
47122a6112 MDEV-33383: Replace fts_doc_id_cmp, ib_vector_sort
fts_doc_ids_sort(): Sort an array of doc_id_t by C++11 std::sort().

fts_doc_id_cmp(), ib_vector_sort(): Remove. The comparison was
returning an incorrect result when the difference exceeded the int range.

Reviewed by: Thirunarayanan Balathandayuthapani
2024-02-12 17:01:17 +02:00
Marko Mäkelä
8ec12e0d6d Merge 10.4 into 10.5 2024-02-12 11:38:13 +02:00
Marko Mäkelä
466069b184 Merge 10.5 into 10.6 2024-02-08 10:38:53 +02:00
Marko Mäkelä
0381921e26 MDEV-33277 In-place upgrade causes invalid AUTO_INCREMENT values
MDEV-33308 CHECK TABLE is modifying .frm file even if --read-only

As noted in commit d0ef1aaf61,
MySQL as well as older versions of MariaDB server would during
ALTER TABLE ... IMPORT TABLESPACE write bogus values to the
PAGE_MAX_TRX_ID field to pages of the clustered index, instead of
letting that field remain 0.
In commit 8777458a6e this field
was repurposed for PAGE_ROOT_AUTO_INC in the clustered index root page.

To avoid trouble when upgrading from MySQL or older versions of MariaDB,
we will try to detect and correct bogus values of PAGE_ROOT_AUTO_INC
when opening a table for the first time from the SQL layer.

btr_read_autoinc_with_fallback(): Add the parameters to mysql_version,max
to indicate the TABLE_SHARE::mysql_version of the .frm file and the
maximum value allowed for the type of the AUTO_INCREMENT column.
In case the table was originally created in MySQL or an older version of
MariaDB, read also the maximum value of the AUTO_INCREMENT column from
the table and reset the PAGE_ROOT_AUTO_INC if it is above the limit.

dict_table_t::get_index(const dict_col_t &) const: Find an index that
starts with the specified column.

ha_innobase::check_for_upgrade(): Return HA_ADMIN_FAILED if InnoDB
needs upgrading but is in read-only mode. In this way, the call to
update_frm_version() will be skipped.

row_import_autoinc(): Adjust the AUTO_INCREMENT column at the end of
ALTER TABLE...IMPORT TABLESPACE. This refinement was suggested by
Debarun Banerjee.

The changes outside InnoDB were developed by Michael 'Monty' Widenius:

Added print_check_msg() service for easy reporting of check/repair messages
in ENGINE=Aria and ENGINE=InnoDB.
Fixed that CHECK TABLE do not update the .frm file under --read-only.
Added 'handler_flags' to HA_CHECK_OPT as a way for storage engines to
store state from handler::check_for_upgrade().

Reviewed by: Debarun Banerjee
2024-02-08 10:35:45 +02:00
Marko Mäkelä
85db534731 MDEV-33400 Adaptive hash index corruption after DISCARD TABLESPACE
row_discard_tablespace(): Do not invoke dict_index_t::clear_instant_alter()
because that would corrupt any adaptive hash index entries in the table.

row_import_for_mysql(): Invoke dict_index_t::clear_instant_alter()
after detaching any adaptive hash index entries.
2024-02-08 09:17:47 +01:00
Marko Mäkelä
b2654ba826 MDEV-32899 InnoDB is holding shared dict_sys.latch while waiting for FOREIGN KEY child table lock on DDL
lock_table_children(): A new function to lock all child tables of a table.
We will only hold dict_sys.latch while traversing
dict_table_t::referenced_set. To prevent a race condition with
std::set::erase() we will copy the pointers to the child tables to a
local vector. Once we have acquired MDL and references to all child tables,
we can safely release dict_sys.latch, wait for the locks, and finally
release the references.

dict_acquire_mdl_shared(): A new variant that takes mdl_context as a
parameter.

lock_table_for_trx(): Assert that we are not holding dict_sys.latch.

ha_innobase::truncate(): When foreign_key_checks=ON, assert that
no child tables exist (other than the current table).
In any case, we will invoke lock_table_children()
so that the child table metadata can be safely updated.
(It is possible that a child table is being created concurrently
with TRUNCATE TABLE.)

ha_innobase::delete_table(): Before and after acquiring exclusive
locks on the current table as well as all child tables,
check that FOREIGN KEY constraints will not be violated.
In this way, we can reject impossible DROP TABLE without having to
wait for locks first.

This fixes up commit 2ca1123464 (MDEV-26217)
and commit c3c53926c4 (MDEV-26554).
2024-02-08 14:22:35 +11:00
Marko Mäkelä
5f2dcd112b MDEV-24167 fixup: srw_lock_debug instrumentation
While the index_lock and block_lock include debug instrumentation
to keep track of shared lock holders, such instrumentation was
never part of the simpler srw_lock, and therefore some users of the
class implemented a limited form of bookkeeping.

srw_lock_debug encapsulates srw_lock and adds the data members
writer, readers_lock, and readers to keep track of the threads that
hold the exclusive latch or any shared latches. The debug checks
are available also with SUX_LOCK_GENERIC (in environments that do not
implement a futex-like system call).

dict_sys_t::latch: Use srw_lock_debug in debug builds.
This makes the debug fields latch_ex, latch_readers redundant.

fil_space_t::latch: Use srw_lock_debug in debug builds.
This makes the debug field latch_count redundant.
The field latch_owner must be preserved, because
fil_space_t::is_owner() is being used in all builds.

lock_sys_t::latch: Use srw_lock_debug in debug builds.
This makes the debug fields writer, readers redundant.

lock_sys_t::is_holder(): A new debug predicate to check if
the current thread is holding lock_sys.latch in any mode.

trx_rseg_t::latch: Use srw_lock_debug in debug builds.
2024-02-08 14:22:35 +11:00
mariadb-DebarunBanerjee
5e7047067e MDEV-33274 The test encryption.innodb-redo-nokeys often fails
If we fail to open a tablespace while looking for FILE_CHECKPOINT, we
set the corruption flag. Specifically, if encryption key is missing, we
would not be able to open an encrypted tablespace and the flag could be
set. We miss checking for this flag and report "Missing FILE_CHECKPOINT"

Address review comment to improve the test. Flush pages before starting
no-checkpoint block. It should improve the number of cases where the
test is skipped because some intermediate checkpoint is triggered.
2024-02-08 08:13:16 +05:30
mariadb-DebarunBanerjee
fb9da7f751 MDEV-33023 Crash in mariadb-backup --prepare --export after --prepare
mariadb-backup with --prepare option could result in empty redo log
file. When --prepare is followed by --prepare --export, we exit early
in srv_start function without opening the ibdata1 tablespace. Later
while trying to read rollback segment header page, we hit the debug
assert which claims that the system space should already have been
opened.

There are two assert cases here.

Issue-1: System tablespace object is not there in fil space hash i.e.
srv_sys_space.open_or_create() is not called.

Issue-2: The system tablespace data file ibdata1 is not opened i.e.
fil_system.sys_space->open() is not called.

Fix: For empty redo log and restore operation, open system tablespace
before returning.
2024-02-07 23:12:15 +05:30
Marko Mäkelä
8d54d173d7 Cleanup: Remove ut_format_name()
This follows up commit 383f77cd84
which simplified dict_table_schema_check().

Note: We can display quoted names like this:
my_snprintf(buf, sizeof buf, "%`.*s.%`s",
            int(t->name.dblen()), t->name.m_name, t->name.basename());
2024-02-07 13:56:31 +02:00
Marko Mäkelä
91a2192bf2 Merge 10.5 into 10.6 2024-02-07 13:51:03 +02:00
Daniel Black
e06b159f02 MDEV-33397: Innodb include OS error information when failing to write to iblogfileX
Without an OS error it could one of the many errors from write.
2024-02-07 17:27:35 +11:00
Oleksandr Byelkin
8e7314992f Merge branch '10.5' into mariadb-10.5.24 2024-02-06 18:29:14 +01:00
mariadb-DebarunBanerjee
66bb229e91 MDEV-18288 Transportable Tablespaces leave AUTO_INCREMENT in mismatched state, causing INSERT errors in newly imported tables when .cfg is not used.
During import, if cfg file is not specified, we don't update the autoinc
field in innodb dictionary object dict_table_t. The next insert tries to
insert from the starting position of auto increment and fails.

It can be observed that the issue is resolved once server is restarted
as the persistent value is read correctly from PAGE_ROOT_AUTO_INC from
index root page. The patch fixes the issue by reading the the auto
increment value directly from PAGE_ROOT_AUTO_INC during import if cfg
file is not specified.

Test Fix:

1. import_bugs.test: Embedded mode warning has absolute path. Regular
expression replacement in test.

2. full_crc32_import.test: Table level auto increment mismatch after
import. It was using the auto increment data from the table prior to
discard and import which is not right. This value has cached auto
increment value higher than the actual inserted value and value stored
in PAGE_ROOT_AUTO_INC. Updated the result file and added validation for
checking the maximum value of auto increment column.
2024-02-06 13:45:30 +05:30
Sergei Golubchik
3f6038bc51 Merge branch '10.5' into 10.6 2024-01-31 18:04:03 +01:00
Sergei Golubchik
01f6abd1d4 Merge branch '10.4' into 10.5 2024-01-31 17:32:53 +01:00
Thirunarayanan Balathandayuthapani
21f18bd9d7 MDEV-33341 innodb.undo_space_dblwr test case fails with Unknown Storage Engine InnoDB
Reason:
======
undo_space_dblwr test case fails if the first page of undo
tablespace is not flushed before restart the server. While
restarting the server, InnoDB fails to detect the first
page of undo tablespace from doublewrite buffer.

Fix:
===
Use "ib_log_checkpoint_avoid_hard" debug sync point
to avoid checkpoint and make sure to flush the
dirtied page before killing the server.

innodb_make_page_dirty(): Fails to set
srv_fil_make_page_dirty_debug variable.
2024-01-31 15:55:09 +05:30
Marko Mäkelä
b7d1f65b81 MDEV-12266 fixup: Remove dead code
Ever since commit 5e84ea9634
this "else if" branch was unreachable because the preceding
"if" condition covered it.
2024-01-30 13:10:53 +02:00
Marko Mäkelä
bc2849579b MDEV-33251 Redundant check on prebuilt::n_rows_fetched overflow
row_search_mvcc(): Revise an overflow check, disabling it on 64-bit
systems. The maximum number of consecutive record reads in a key range
scan should be limited by the maximum number of records per page
(less than 2^13) and the maximum number of pages per tablespace (2^32)
to less than 2^45. On 32-bit systems we can simplify the overflow check.

Reviewed by: Vladislav Lesin
2024-01-30 13:10:46 +02:00
Marko Mäkelä
495e7f1b3d MDEV-33053 fixup: Correct a condition before a message 2024-01-22 08:24:08 +02:00
Marko Mäkelä
5c243d4caf Merge 10.5 into 10.6 2024-01-22 08:20:08 +02:00
Daniel Black
0c23f84d8d MDEV-32983 cosmetic improvement on path separator near ib_buffer_pool
A mix of path separators looks odd.

  InnoDB: Loading buffer pool(s) from C:\xampp\mysql\data/ib_buffer_pool

This was changed in cf552f5886

Both forward slashes and backward slashes work on Windows. We do not
use \\?\ names.

So we improve the consistent look of it so it doesn't look like a bug.

Normalize, in this case, the path separator to \ for making the filename.

Reported thanks to Github user @celestinoxp.

Closes: https://github.com/ApacheFriends/xampp-build/issues/33
Reviewed by: Marko Mäkelä and Vladislav Vaintroub
2024-01-22 16:56:00 +11:00
Thirunarayanan Balathandayuthapani
7573fe8b07 MDEV-32968 InnoDB fails to restore tablespace first page from doublewrite buffer when page is empty
recv_dblwr_t::find_first_page(): Free the allocated memory
to read the first 3 pages from tablespace.

innodb.doublewrite: Added sleep to ensure page cleaner thread
wake up from my_cond_wait
2024-01-19 17:01:36 +05:30
Marko Mäkelä
21560bee9d Revert "MDEV-32899 InnoDB is holding shared dict_sys.latch while waiting for FOREIGN KEY child table lock on DDL"
This reverts commit 569da6a7ba,
commit 768a736174, and
commit ba6bf7ad9e
because of a regression that was filed as MDEV-33104.
2024-01-19 12:46:11 +02:00
Marko Mäkelä
7e65e3027e MDEV-33275 buf_flush_LRU(): mysql_mutex_assert_owner(&buf_pool.mutex) failed
In commit a55b951e60 (MDEV-26827)
an error was introduced in a rarely executed code path of the
buf_flush_page_cleaner() thread. As a result, the function
buf_flush_LRU() could be invoked while not holding buf_pool.mutex.

Reviewed by: Debarun Banerjee
2024-01-19 12:40:32 +02:00
Marko Mäkelä
d34479dc66 MDEV-33053 InnoDB LRU flushing does not run before running out of buffer pool
buf_flush_LRU(): Display a warning if no pages could be evicted and
no writes initiated.

buf_pool_t::need_LRU_eviction(): Renamed from buf_pool_t::ran_out().
Check if the amount of free pages is smaller than innodb_lru_scan_depth
instead of checking if it is 0.

buf_flush_page_cleaner(): For the final LRU flush after a checkpoint
flush, use a "budget" of innodb_io_capacity_max, like we do in the
case when we are not in "furious" checkpoint flushing.

Co-developed by: Debarun Banerjee
Reviewed by: Debarun Banerjee
Tested by: Matthias Leich
2024-01-19 12:40:16 +02:00
Daniel Black
82e8633420 innodb: IO Error message missing space
Noted by Susmeet Khaire - thanks.
2024-01-19 17:33:44 +11:00
Marko Mäkelä
a6290a5bc5 MDEV-33095 innodb_flush_method=O_DIRECT creates excessive errors on Solaris
The directio(3C) function on Solaris is supported on NFS and UFS
while the majority of users should be on ZFS, which is a copy-on-write
file system that implements transparent compression and therefore
cannot support unbuffered I/O.

Let us remove the call to directio() and simply treat
innodb_flush_method=O_DIRECT in the same way as the previous
default value innodb_flush_method=fsync on Solaris. Also, let us
remove some dead code around calls to os_file_set_nocache() on
platforms where fcntl(2) is not usable with O_DIRECT.

On IBM AIX, O_DIRECT is not documented for fcntl(2), only for open(2).
2024-01-19 15:34:33 +11:00
Marko Mäkelä
ee1407f74d MDEV-32268: GNU libc posix_fallocate() may be extremely slow
os_file_set_size(): Let us invoke the Linux system call fallocate(2)
directly, because the GNU libc posix_fallocate() implements a fallback
that writes to the file 1 byte every 4096 or fewer bytes. In one
environment, invoking fallocate() directly would lead to 4 times the
file growth rate during ALTER TABLE. Presumably, what happened was
that the NFS server used a smaller allocation block size than 4096 bytes
and therefore created a heavily fragmented sparse file when
posix_fallocate() was used. For example, extending a file by 4 MiB
would create 1,024 file fragments. When the file is actually being
written to with data, it would be "unsparsed".

The built-in EOPNOTSUPP fallback in os_file_set_size() writes a buffer
of 1 MiB of NUL bytes. This was always used on musl libc and other
Linux implementations of posix_fallocate().
2024-01-18 11:00:27 +02:00
Marko Mäkelä
f63045b119 MDEV-33213 fixup: GCC 5 -Wconversion 2024-01-18 10:14:21 +02:00
Marko Mäkelä
3a96eba25f Merge 10.5 into 10.6 2024-01-17 13:35:05 +02:00
Marko Mäkelä
f8c88d905b MDEV-33213 History list is not shrunk unless there is a pause in the workload
The parameter innodb_undo_log_truncate=ON enables a multi-phased logic:
1. Any "producers" (new starting transactions) are prohibited
from using the rollback segments that reside in the undo tablespace.
2. Any transactions that use any of the rollback segments must be
committed or aborted.
3. The purge of committed transaction history must process all the
rollback segments.
4. The undo tablespace is truncated and rebuilt.
5. The rollback segments are re-enabled for new transactions.

There was one flaw in this logic: The first step was not being invoked
as often as it could be, and therefore innodb_undo_log_truncate=ON
would have no chance to work during a heavy write workload.

Independent of innodb_undo_log_truncate, even after
commit 86767bcc0f
we are missing some chances to free processed undo log pages.
If we prohibited the creation of new transactions in one busy
rollback segment at a time, we would be eventually guaranteed
to be able to free such pages.

purge_sys_t::skipped_rseg: The current candidate rollback segment
for shrinking the history independent of innodb_undo_log_truncate.

purge_sys_t::iterator::free_history_rseg(): Renamed from
trx_purge_truncate_rseg_history(). Implement the logic
around purge_sys.m_skipped_rseg.

purge_sys_t::truncate_undo_space: Renamed from truncate.

purge_sys.truncate_undo_space.last: Changed the type to integer
to get rid of some pointer dereferencing and conditional branches.

purge_sys_t::truncating_tablespace(), purge_sys_t::undo_truncate_try():
Refactored from trx_purge_truncate_history().
Set purge_sys.truncate_undo_space.current if applicable,
or return an already set purge_sys.truncate_undo_space.current.

purge_coordinator_state::do_purge(): Invoke
purge_sys_t::truncating_tablespace() as part of the normal work loop,
to implement innodb_undo_log_truncate=ON as often as possible.

trx_purge_truncate_rseg_history(): Remove a redundant parameter.

trx_undo_truncate_start(): Replace dead code with a debug assertion.

Correctness tested by: Matthias Leich
Performance tested by: Axel Schwenke
Reviewed by: Debarun Banerjee
2024-01-17 11:14:24 +02:00