Commit graph

10127 commits

Author SHA1 Message Date
Thirunarayanan Balathandayuthapani
5bb31bc882 MDEV-22230 : Unexpected ER_ERROR_ON_RENAME upon DROP non-existing FOREIGN KEY
mysql_prepare_alter_table(): Alter table should check whether
foreign key exists when it expected to exists and
report the error in early stage

dict_foreign_parse_drop_constraints(): Don't throw error if the
foreign key constraints doesn't exist when if exists is given
in the statement.
2023-11-26 18:46:00 +05:30
Marko Mäkelä
64f44b22d9 MDEV-31574: Assertion failure on REPLACE on ROW_FORMAT=COMPRESSED table
btr_cur_update_in_place(): Update the DB_TRX_ID,DB_ROLL_PTR also on the
compressed copy of the page. In a test case, a server built with
cmake -DWITH_INNODB_EXTRA_DEBUG=ON would crash in page_zip_validate()
due to the inconsistency. In a normal debug build, a different assertion
would fail, depending on when the uncompressed page was restored from
the compressed page.

In MariaDB Server 10.5, this bug had already been fixed by
commit b3d02a1fcf (MDEV-12353).
2023-11-23 15:09:26 +02:00
Marko Mäkelä
d963584d4c Merge 10.5 into 10.6 2023-11-22 16:56:47 +02:00
Marko Mäkelä
78c9a12c8f MDEV-32861 InnoDB hangs when running out of I/O slots
When the constant OS_AIO_N_PENDING_IOS_PER_THREAD is changed from 256 to 1
and the server is run with the minimum parameters
innodb_read_io_threads=1 and innodb_write_io_threads=2, two hangs
were observed.

tpool::cache<T>::put(T*): Ensure that get() in io_slots::acquire()
will be woken up when the cache previously was empty.

buf_pool_t::io_buf_t::reserve(): Schedule a possibly partial doublewrite
batch so that os_aio_wait_until_no_pending_writes() has a chance of
returning. Add a Boolean parameter and pass wait_for_reads=false inside
buf_page_decrypt_after_read(), because those calls will be executed
inside a read completion callback, and therefore
os_aio_wait_until_no_pending_reads() would block indefinitely.
2023-11-22 16:54:41 +02:00
Marko Mäkelä
9c5600adde Merge 10.5 into 10.6 2023-11-21 09:33:06 +02:00
Marko Mäkelä
de31ca6a21 MDEV-32820 Race condition between trx_purge_free_segment() and trx_undo_create()
trx_purge_free_segment(): If fseg_free_step_not_header() needs to be
called multiple times, acquire an exclusive latch on the
rollback segment header page after restarting the mini-transaction
so that the rest of this function cannot execute concurrently
with trx_undo_create() on the same rollback segment.

This fixes a regression that was introduced in
commit c14a39431b (MDEV-30753).

Note: The buffer-fixes that we are holding across the mini-transaction
restart will prevent the pages from being evicted from the buffer pool.
They may be accessed by other threads or written back to data files
while we are not holding exclusive latches.

Reviewed by: Vladislav Lesin
2023-11-21 08:53:02 +02:00
Thirunarayanan Balathandayuthapani
84e0c027e0 MDEV-28613 LeakSanitizer caused by I_S query using LIMIT ROWS EXAMINED
Problem:
========
- InnoDB fails to free the allocated buffer of stored cursor
when information schema query is interrupted.

Solution:
=========
- In case of error handling, information schema query should free
the allocated buffer to store the cursor.
2023-11-21 11:13:43 +05:30
Marko Mäkelä
eb1f8b2919 MDEV-32027 Opening all .ibd files on InnoDB startup can be slow
dict_find_max_space_id(): Return SELECT MAX(SPACE) FROM SYS_TABLES.

dict_check_tablespaces_and_store_max_id(): In the normal case
(no encryption plugin has been loaded and the change buffer is empty),
invoke dict_find_max_space_id() and do not open any .ibd files.
If a std::set<uint32_t> has been specified, open the files whose
tablespace ID is mentioned. Else, open all data files that are identified
by SYS_TABLES records.

fil_ibd_open(): Remove a call to os_file_get_last_error() that can
report a misleading error, such as EINVAL inside my_realpath() that is
not an actual error. This could be invoked when a data file is found
but the FSP_SPACE_FLAGS are incorrect, such as is the case for
table test.td in
./mtr --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 innodb.table_flags

buf_load(): If any tablespaces could not be found, invoke
dict_check_tablespaces_and_store_max_id() on the missing tablespaces.

dict_load_tablespace(): Try to load the tablespace unless it was found
to be futile. This fixes failures related to FTS_*.ibd files for
FULLTEXT INDEX.

btr_cur_t::search_leaf(): Prevent a crash when the tablespace
does not exist. This was caught by the test innodb_fts.fts_concurrent_insert
when the change to dict_load_tablespaces() was not present.

We modify a few tests to ensure that tables will not be loaded at startup.
For some fault injection tests this means that the corrupted tables
will not be loaded, because dict_load_tablespace() would perform stricter
checks than dict_check_tablespaces_and_store_max_id().

Tested by: Matthias Leich
Reviewed by: Thirunarayanan Balathandayuthapani
2023-11-17 15:07:51 +02:00
Marko Mäkelä
9a545eb67c MDEV-26055: Correct the formula for adaptive flushing
This is a 10.5 backport of 10.6
commit d4265fbde5.

page_cleaner_flush_pages_recommendation(): If dirty_pct is
between innodb_max_dirty_pages_pct_lwm
and innodb_max_dirty_pages_pct,
scale the effort relative to how close we are to
innodb_max_dirty_pages_pct.

The previous formula was missing a multiplication by 100.
2023-11-16 17:45:37 +02:00
Marko Mäkelä
a3d0d5fc33 MDEV-26055: Improve adaptive flushing
This is a 10.5 backport from 10.6
commit 9593cccf28.

Adaptive flushing is enabled by setting innodb_max_dirty_pages_pct_lwm>0
(not default) and innodb_adaptive_flushing=ON (default).
There is also the parameter innodb_adaptive_flushing_lwm
(default: 10 per cent of the log capacity). It should enable some
adaptive flushing even when innodb_max_dirty_pages_pct_lwm=0.
That is not being changed here.

This idea was first presented by Inaam Rana several years ago,
and I discussed it with Jean-François Gagné at FOSDEM 2023.

buf_flush_page_cleaner(): When we are not near the log capacity limit
(neither buf_flush_async_lsn nor buf_flush_sync_lsn are set),
also try to move clean blocks from the buf_pool.LRU list to buf_pool.free
or initiate writes (but not the eviction) of dirty blocks, until
the remaining I/O capacity has been consumed.

buf_flush_LRU_list_batch(): Add the parameter bool evict, to specify
whether dirty least recently used pages (from buf_pool.LRU) should
be evicted immediately after they have been written out. Callers outside
buf_flush_page_cleaner() will pass evict=true, to retain the existing
behaviour.

buf_do_LRU_batch(): Add the parameter bool evict.
Return counts of evicted and flushed pages.

buf_flush_LRU(): Add the parameter bool evict.
Assume that the caller holds buf_pool.mutex and
will invoke buf_dblwr.flush_buffered_writes() afterwards.

buf_flush_list_holding_mutex(): A low-level variant of buf_flush_list()
whose caller must hold buf_pool.mutex and invoke
buf_dblwr.flush_buffered_writes() afterwards.

buf_flush_wait_batch_end_acquiring_mutex(): Remove. It is enough to have
buf_flush_wait_batch_end().

page_cleaner_flush_pages_recommendation(): Avoid some floating-point
arithmetics.

buf_flush_page(), buf_flush_check_neighbor(), buf_flush_check_neighbors(),
buf_flush_try_neighbors(): Rename the parameter "bool lru" to "bool evict".

buf_free_from_unzip_LRU_list_batch(): Remove the parameter.
Only actual page writes will contribute towards the limit.

buf_LRU_free_page(): Evict freed pages of temporary tables.

buf_pool.done_free: Broadcast whenever a block is freed
(and buf_pool.try_LRU_scan is set).

buf_pool_t::io_buf_t::reserve(): Retry indefinitely.
During the test encryption.innochecksum we easily run out of
these buffers for PAGE_COMPRESSED or ENCRYPTED pages.

Tested by Matthias Leich and Axel Schwenke
2023-11-16 17:45:18 +02:00
Marko Mäkelä
5a1f821b93 MDEV-31861 Empty INSERT crashes with innodb_force_recovery=6 or innodb_read_only=ON
ha_innobase::extra(): Do not invoke log_buffer_flush_to_disk()
if high_level_read_only holds.

log_buffer_flush_to_disk(): Remove an assertion that duplicates one
at the start of log_write_up_to().
2023-11-16 16:57:42 +02:00
Marko Mäkelä
ea6ca01397 MDEV-32757: rollback crash on corruption
trx_undo_free_page(): Detect a case of corrupted TRX_UNDO_PAGE_LIST.

trx_undo_truncate_end(): Stop attempts to truncate a corrupted log.

trx_t::commit_empty(): Add an error message of a corrupted log.

Reviewed by: Thirunarayanan Balathandayuthapani
2023-11-15 14:11:38 +02:00
Marko Mäkelä
5dbe7a8c9a Merge 10.5 into 10.6 2023-11-15 14:11:24 +02:00
Marko Mäkelä
52ca2e65af Merge 10.5 into 10.6 2023-11-15 14:10:21 +02:00
Marko Mäkelä
a0f02f7438 MDEV-32757 innodb_undo_log_truncate=ON is not crash safe
trx_purge_truncate_history(): Do not prematurely mark dirty pages
as clean. This will be done in mtr_t::commit_shrink() as part of
Shrink::operator()(mtr_memo_slot_t*). Also, register each dirty page
only once in the mini-transaction.

fsp_page_create(): Adjust and simplify the page creation during
undo tablespace truncation. We can directly reuse pages that are
already in buf_pool.page_hash.

This fixes a regression that was caused by
commit f5794e1dc6 (MDEV-26445).

Tested by: Matthias Leich
Reviewed by: Thirunarayanan Balathandayuthapani
2023-11-15 12:23:35 +02:00
Marko Mäkelä
c638051d80 MDEV-32798 innodb_fast_shutdown=0 hang after incomplete startup
innodb_preshutdown(): Only wait for active transactions to be terminated
if InnoDB was started and innodb_force_recovery=3 or larger does not
prevent a rollback.

This fixes the following:

./mtr --parallel=auto --mysqld=--innodb-fast-shutdown=0 \
innodb.log_file_size innodb.innodb_force_recovery \
innodb.read_only_recovery innodb.read_only_recover_committed \
mariabackup.apply-log-only-incr
2023-11-14 14:35:51 +02:00
Oleksandr Byelkin
4a824c0cf0 Merge branch '10.6' into mariadb-10.6.16 2023-11-14 08:56:16 +01:00
Oleksandr Byelkin
9f83a8822f Merge branch '10.5' into mariadb-10.5.23 2023-11-14 08:41:23 +01:00
Marko Mäkelä
dec4d0badc MDEV-32788: Debug build failure with SUX_LOCK_GENERIC
Fixes up commit 2027c482de
2023-11-13 14:35:33 +02:00
Aleksey Midenkov
e53e7cd134 MDEV-20545 Assertion col.vers_sys_end() in dict_index_t::vers_history_row
Index values for row_start/row_end was wrongly calculated for inplace
ALTER for some layout of virtual fields.

Possible impact

  1. history row is not detected upon build clustered index for
     inplace ALTER which may lead to duplicate key errors on
     auto-increment and FTS index add.
  2. foreign key constraint may falsely fail.
  3. after inplace ALTER before server restart trx-based system
     versioning can cause server crash or incorrect data written upon
     UPDATE.
2023-11-10 15:46:14 +03:00
Marko Mäkelä
e0c65784aa MDEV-32737 innodb.log_file_name fails on Assertion `after_apply || !(blocks).end in recv_sys_t::clear
recv_group_scan_log_recs(): Set the debug flag recv_sys.after_apply
after actually completing the log scan.

In the test, suppress some errors that may be reported when
the crash recovery of RENAME TABLE t1 TO t2 is preceded by
copying t2.ibd to t1.ibd.
2023-11-09 11:06:17 +02:00
Oleksandr Byelkin
b83c379420 Merge branch '10.5' into 10.6 2023-11-08 15:57:05 +01:00
Oleksandr Byelkin
6cfd2ba397 Merge branch '10.4' into 10.5 2023-11-08 12:59:00 +01:00
Thirunarayanan Balathandayuthapani
a44869d842 MDEV-31851 Doublewrite recovery fixup
recv_dblwr_t::find_page(): Tablespace flags validity should be
checked only for page 0.
2023-11-08 12:37:41 +01:00
Thirunarayanan Balathandayuthapani
b52b7b4129 MDEV-31851 Doublewrite recovery fixup
recv_dblwr_t::find_page(): Tablespace flags validity should be
checked only for page 0.
2023-11-06 19:18:33 +05:30
Marko Mäkelä
1fc2843eee MDEV-31826: File handle leak on failed IMPORT TABLESPACE
fil_space_t::drop(): If the caller is not interested in a
detached handle, close it immediately.
2023-11-04 16:04:21 +02:00
Kristian Nielsen
9fa718b1a1 Fix mariabackup InnoDB recovered binlog position on server upgrade
Before MariaDB 10.3.5, the binlog position was stored in the TRX_SYS page,
while after it is stored in rollback segments. There is code to read the
legacy position from TRX_SYS to handle upgrades. The problem was if the
legacy position happens to compare larger than the position found in
rollback segments; in this case, the old TRX_SYS position would incorrectly
be preferred over the newer position from rollback segments.

Fixed by always preferring a position from rollback segments over a legacy
position.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-11-03 09:13:51 +01:00
Kristian Nielsen
f8f5ed2280 Revert: MDEV-22351 InnoDB may recover wrong information after RESET MASTER
This commit can cause the wrong (old) binlog position to be recovered by
mariabackup --prepare. It implements that the value of the FIL_PAGE_LSN is
compared to determine which binlog position is the last one and should be
recoved. However, it is not guaranteed that the FIL_PAGE_LSN order matches the
commit order, as is assumed by the code. This is because the page LSN could be
modified by an unrelated update of the page after the commit.

In one example, the recovery first encountered this in trx_rseg_mem_restore():

  lsn=27282754  binlog position (./master-bin.000001, 472908)

and then later:

  lsn=27282699  binlog position (./master-bin.000001, 477164)

The last one 477164 is the correct position. However, because the LSN
encountered for the first one is higher, that position is recovered instead.
This results in too old binlog position, and a newly provisioned slave will
start replicating too early and get duplicate key error or similar.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-11-03 09:13:51 +01:00
Marko Mäkelä
bfab4ab000 MDEV-18867 fixup: Remove DBUG injection
In commit 75e82f71f1 the code to
rename internal tables for FULLTEXT INDEX that had been
created on Microsoft Windows using incompatible names
was removed. Let us also remove the related fault injection.
2023-11-02 15:27:52 +02:00
Thirunarayanan Balathandayuthapani
b4de67da45 MDEV-32638 MariaDB crashes with foreign_key_checks=0 when changing a column and adding a foreign key at the same time
Problem:
=======
 - InnoDB fails to find the foreign key index for the newly
added foreign key relation. This is caused by commit
5f09b53bdb (MDEV-31086).

FIX:
===
In check_col_is_in_fk_indexes(), while iterating through
the newly added foreign key relationship, InnoDB should
consider that foreign key relation may not have foreign index
when foreign key check is disabled.
2023-11-02 14:33:05 +05:30
Marko Mäkelä
0cc809f91b MDEV-31826: Memory leak on failed IMPORT TABLESPACE
fil_delete_tablespace(): Invoke fil_space_free_low() directly.
This fixes up commit 39e3ca8bd2
2023-10-31 12:48:20 +02:00
Marko Mäkelä
15ae97b1c2 MDEV-32578 row_merge_fts_doc_tokenize() handles parser plugin inconsistently
When mysql/mysql-server@0c954c2289
added a plugin interface for FULLTEXT INDEX tokenization to MySQL 5.7,
fts_tokenize_ctx::processed_len got a second meaning, which is only
partly implemented in row_merge_fts_doc_tokenize().

This inconsistency could cause a crash when using FULLTEXT...WITH PARSER.
A test case that would crash MySQL 8.0 when using an n-gram parser and
single-character words would fail to crash in MySQL 5.7, because the
buf_full condition in row_merge_fts_doc_tokenize() was not met.

This change is inspired by
mysql/mysql-server@38e9a0779a
that appeared in MySQL 5.7.44.
2023-10-27 13:13:49 +03:00
Marko Mäkelä
5b53342a6a MDEV-32588 InnoDB may hang when running out of buffer pool
buf_flush_LRU_list_batch(): Do not skip pages that are actually clean
but in buf_pool.flush_list due to the "lazy removal" optimization of
commit 22b62edaed, but try to evict them.
After acquiring buf_pool.flush_list_mutex, reread oldest_modification
to ensure that the block still remains in buf_pool.flush_list.

In addition to server hangs, this bug could also cause
InnoDB: Failing assertion: list.count > 0
in invocations of UT_LIST_REMOVE(flush_list, ...).

This fixes a regression that was caused by
commit a55b951e60
and possibly made more likely to hit due to
commit aa719b5010.
2023-10-26 15:10:53 +03:00
Marko Mäkelä
39e3ca8bd2 MDEV-31826 InnoDB may fail to recover after being killed in fil_delete_tablespace()
InnoDB was violating the write-ahead-logging protocol when a file
was being deleted, like this:

1. fil_delete_tablespace() set the fil_space_t::STOPPING flag
2. The buf_flush_page_cleaner() thread discards some changed pages for
this tablespace advances the log checkpoint a little.
3. The server process is killed before fil_delete_tablespace() wrote
a FILE_DELETE record.
4. Recovery will try to apply log to pages of the tablespace, because
there was no FILE_DELETE record. This will fail, because some pages
that had been modified since the latest checkpoint had not been written
by the page cleaner.

Page writes must not be stopped before a FILE_DELETE record has been
durably written.

fil_space_t::drop(): Replaces fil_space_t::check_pending_operations().
Add the parameter detached_handle, and return a tablespace pointer
if this thread was the first one to stop I/O on the tablespace.

mtr_t::commit_file(): Remove the parameter detached_handle, and
move some handling to fil_space_t::drop().

fil_space_t: STOPPING_READS, STOPPING_WRITES: Separate flags for STOPPING.
We want to stop reads (and encryption) before stopping page writes.

fil_space_t::is_stopping_writes(), fil_space_t::get_for_write():
Special accessors for the write path.

fil_space_t::flush_low(): Ignore the STOPPING_READS flag and only
stop if STOPPING_WRITES is set, to avoid an infinite loop in
fil_flush_file_spaces(), which was occasionally repeated by
running the test encryption.create_or_replace.

Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
2023-10-26 15:07:59 +03:00
Marko Mäkelä
2ba9702163 MDEV-32050: Boost innodb_purge_batch_size on slow shutdown
A slow shutdown using the previous default innodb_purge_batch_size=300
could be extremely slow, employing at most a few CPU cores on the average.
Let us use the maximum batch size in order to increase throughput.

Reviewed by: Vladislav Lesin
2023-10-25 10:21:49 +03:00
Marko Mäkelä
aa719b5010 MDEV-32050: Do not copy undo records in purge
Also, default to innodb_purge_batch_size=1000,
replacing the old default value of processing 300 undo log pages
in a batch. Axel Schwenke found this value to help reduce purge lag
without having a significant impact on workload throughput.

In purge, we can simply acquire a shared latch on the undo log page
(to avoid a race condition like the one that was fixed in
commit b102872ad5) and retain a buffer-fix
after releasing the latch. The buffer-fix will prevent the undo log
page from being evicted from the buffer pool. Concurrent modification
is prevented by design. Only the purge_coordinator_task
(or its accomplice purge_truncation_task) may free the undo log pages,
after any purge_worker_task have completed execution. Hence, we do not
have to worry about any overwriting or reuse of the undo log records.

trx_undo_rec_copy(): Remove. The only remaining caller would have been
trx_undo_get_undo_rec_low(), which is where the logic was merged.

purge_sys_t::m_initialized: Replaces heap.

purge_sys_t::pages: A cache of buffer-fixed pages that have been
looked up from buf_pool.page_hash.

purge_sys_t::get_page(): Return a buffer-fixed undo page, using the
pages cache.

trx_purge_t::batch_cleanup(): Renamed from clone_end_view().
Clear the pages cache and clone the end_view at the end of a batch.

purge_sys_t::n_pages_handled(): Return pages.size(). This determines
if innodb_purge_batch_size was exceeded.

purge_sys_t::rseg_get_next_history_log(): Replaces
trx_purge_rseg_get_next_history_log().

purge_sys_t::choose_next_log(): Replaces trx_purge_choose_next_log()
and trx_purge_read_undo_rec().

purge_sys_t::get_next_rec(): Replaces trx_purge_get_next_rec()
and trx_undo_get_next_rec().

purge_sys_t::fetch_next_rec(): Replaces trx_purge_fetch_next_rec()
and some use of trx_undo_get_first_rec().

trx_purge_attach_undo_recs(): Do not allow purge_sys.n_pages_handled()
exceed the innodb_purge_batch_size or ¾ of the buffer pool, whichever
is smaller.

Reviewed by: Vladislav Lesin
Tested by: Matthias Leich and Axel Schwenke
2023-10-25 10:19:17 +03:00
Marko Mäkelä
88733282fb MDEV-32050: Look up tables in the purge coordinator
The InnoDB table lookup in purge worker threads is a bottleneck that can
degrade a slow shutdown to utilize less than 2 threads. Let us fix that
bottleneck by constructing a local lookup table that does not require any
synchronization while the undo log records of the current batch
are being processed.

TRX_PURGE_TABLE_BUCKETS: The initial number of std::unordered_map
hash buckets used during a purge batch. This could avoid some
resizing and rehashing in trx_purge_attach_undo_recs().

purge_node_t::tables: A lookup table from table ID to an already
looked up and locked table. Replaces many fields.

trx_purge_attach_undo_recs(): Look up each table in the purge batch
only once.

trx_purge(): Close all tables and release MDL at the end of the batch.

trx_purge_table_open(), trx_purge_table_acquire(): Open a table in purge
and acquire a metadata lock on it. This replaces
dict_table_open_on_id<true>() and dict_acquire_mdl_shared().

purge_sys_t::close_and_reopen(): In case of an MDL conflict, close and
reopen all tables that are covered by the current purge batch.
It may be that some of the tables have been dropped meanwhile and can
be ignored. This replaces wait_SYS() and wait_FTS().

row_purge_parse_undo_rec(): Make purge_coordinator_task issue a
MDL warrant to any purge_worker_task which might need it
when innodb_purge_threads>1.

purge_node_t::end(): Clear the MDL warrant.

Reviewed by: Vladislav Lesin and Vladislav Vaintroub
2023-10-25 10:08:20 +03:00
Marko Mäkelä
d70a98ae06 MDEV-32050: Revert the throttling of MDEV-26356
purge_coordinator_state::do_purge(): Simply use all innodb_purge_threads,
no matter what the LSN age is. During shutdown with innodb_fast_shutdown=0
this code could degrade to using only 1 thread.

Also, restore periodical "InnoDB: to purge" messages that were
accidentally disabled in commit 80585c9d6f.

Reviewed by: Vladislav Lesin and Vladislav Vaintroub
2023-10-25 09:42:38 +03:00
Marko Mäkelä
2027c482de MDEV-32050: Hold exclusive purge_sys.rseg->latch longer
Let the purge_coordinator_task acquire purge_sys.rseg->latch
less frequently and hold it longer at a time. This may throttle
concurrent DML and prevent purge lag a little.

Remove an unnecessary std::this_thread::yield(), because the
trx_purge_attach_undo_recs() is supposed to terminate the scan
when running out of undo log records. Ultimately, this will
result in purge_coordinator_state::do_purge() and
purge_coordinator_callback() returning control to the thread pool.

Reviewed by: Vladislav Lesin and Vladislav Vaintroub
2023-10-25 09:38:49 +03:00
Marko Mäkelä
44689eb7d8 MDEV-32050: Improve srv_wake_purge_thread_if_not_active()
purge_sys_t::wake_if_not_active(): Replaces
srv_wake_purge_thread_if_not_active().

innodb_ddl_recovery_done(): Move the wakeup call to
srv_init_purge_tasks().

purge_coordinator_timer: Remove. The srv_master_callback() already
invokes purge_sys.wake_if_not_active() once per second.

Reviewed by: Vladislav Lesin and Vladislav Vaintroub
2023-10-25 09:38:21 +03:00
Marko Mäkelä
14685b10df MDEV-32050: Deprecate&ignore innodb_purge_rseg_truncate_frequency
The motivation of introducing the parameter
innodb_purge_rseg_truncate_frequency in
mysql/mysql-server@28bbd66ea5 and
mysql/mysql-server@8fc2120fed
seems to have been to avoid stalls due to freeing undo log pages
or truncating undo log tablespaces. In MariaDB Server,
innodb_undo_log_truncate=ON should be a much lighter operation
than in MySQL, because it will not involve any log checkpoint.

Another source of performance stalls should be
trx_purge_truncate_rseg_history(), which is shrinking the history list
by freeing the undo log pages whose undo records have been purged.
To alleviate that, we will introduce a purge_truncation_task that will
offload this from the purge_coordinator_task. In that way, the next
innodb_purge_batch_size pages may be parsed and purged while the pages
from the previous batch are being freed and the history list being shrunk.

The processing of innodb_undo_log_truncate=ON will still remain the
responsibility of the purge_coordinator_task.

purge_coordinator_state::count: Remove. We will ignore
innodb_purge_rseg_truncate_frequency, and act as if it had been
set to 1 (the maximum shrinking frequency).

purge_coordinator_state::do_purge(): Invoke an asynchronous task
purge_truncation_callback() to free the undo log pages.

purge_sys_t::iterator::free_history(): Free those undo log pages
that have been processed. This used to be a part of
trx_purge_truncate_history().

purge_sys_t::clone_end_view(): Take a new value of purge_sys.head
as a parameter, so that it will be updated while holding exclusive
purge_sys.latch. This is needed for race-free access to the field
in purge_truncation_callback().

Reviewed by: Vladislav Lesin
2023-10-25 09:11:58 +03:00
Marko Mäkelä
21bec97044 MDEV-32050: Clean up online ALTER
UndorecApplier::assign_rec(): Remove. We will pass the undo record to
UndorecApplier::apply_undo_rec(). There is no need to copy the
undo record, because nothing else can write to the undo log pages
that belong to an active or incomplete transaction.

trx_t::apply_log(): Buffer-fix the undo page across mini-transaction
boundary in order to avoid repeated page lookups.

Reviewed by: Vladislav Lesin
2023-10-25 08:27:27 +03:00
Marko Mäkelä
9bb5d9fe8b MDEV-32050: Clean up log parsing
purge_node_t, undo_node_t: Change the type of rec_type and cmpl_info
to byte, because this data is being extracted from a single byte.

UndoRecApplier: Change type and cmpl_info to be of type byte, and
move them next to the 16-bit offset field to minimize alignment bloat.

row_purge_parse_undo_rec(): Remove some redundant code. Purge will
be started by innodb_ddl_recovery_done(), at which point all
necessary subsystems will have been initialized.

trx_purge_rec_t::undo_rec: Point to const.

Reviewed by: Vladislav Lesin
2023-10-25 08:27:08 +03:00
Marko Mäkelä
ea42c4baac MDEV-32050 preparation: Simplify ROLLBACK
undo_node_t::state: Replaced with bool is_temp.

row_undo_rec_get(): Do not copy the undo log record.
The motivation of the copying was to not hold latches on the undo pages
and therefore to avoid deadlocks due to lock order inversion a.k.a.
latching order violation: It is not allowed to wait for an index page latch
while holding an undo page latch, because MVCC reads would first acquire
an index page latch and then an undo page latch. But, in rollback, we
do not actually need any latch on our own undo pages. The transaction
that is being rolled back is the exclusive owner of its undo log records.
They cannot be overwritten by other threads until the rollback is complete.
Therefore, a buffer fix will protect the undo log record just fine,
by preventing page eviction. We still must initially acquire a shared latch
on each undo page, to avoid a race condition like the one that was fixed in
commit b102872ad5.

row_undo_ins_parse_undo_rec(): The first two bytes of the undo log record
now are the pointer to the next record within the page, not a length.

Reviewed by: Vladislav Lesin
2023-10-25 08:26:34 +03:00
Marko Mäkelä
b78b77e77d MDEV-32530 Race condition in lock_wait_rpl_report()
After acquiring lock_sys.latch, always load trx->lock.wait_lock.
It could have changed by another thread that did lock_rec_move()
and released lock_sys.latch right before lock_sys.wr_lock_try()
succeeded.

This regression was introduced in
commit e039720bf3 (MDEV-32096).

Reviewed by: Vladislav Lesin
2023-10-24 14:33:14 +03:00
Alexander Barkov
df72c57d6f MDEV-30048 Prefix keys for CHAR work differently for MyISAM vs InnoDB
Also fixes: MDEV-30050 Inconsistent results of DISTINCT with NOPAD

Problem:

Key segments for CHAR columns where compared using strnncollsp()
for engines MyISAM and Aria.

This did not work correct in case if the engine applyied trailing
space compression.

Fix:

Replacing ha_compare_text() calls to new functions:

- ha_compare_char_varying()
- ha_compare_char_fixed()
- ha_compare_word()
- ha_compare_word_prefix()
- ha_compare_word_or_prefix()

The code branch corresponding to comparison of CHAR column keys
(HA_KEYTYPE_TEXT segment type) now uses ha_compare_char_fixed()
which calls strnncollsp_nchars().

This patch does not change the behavior for the rest of the code:
- comparison of VARCHAR/TEXT column keys
  (HA_KEYTYPE_VARTEXT1, HA_KEYTYPE_VARTEXT2 segments types)
- comparison in the fulltext code
2023-10-24 03:35:48 +04:00
Marko Mäkelä
b21f52ee73 Merge 10.5 into 10.6 2023-10-23 16:43:48 +03:00
Marko Mäkelä
b5e43a1d35 MDEV-32552 Write-ahead logging is broken for freed pages
buf_page_free(): Flag the freed page as modified if it is found in
the buffer pool.

buf_flush_page(): If the page has been freed, ensure that the log
for it has been durably written, before removing the page
from buf_pool.flush_list.

FindBlockX: Find also MTR_MEMO_PAGE_X_MODIFY in order to avoid an
occasional failure of innodb.innodb_defrag_concurrent, which involves
freeing and reallocating pages in the same mini-transaction.

This fixes a regression that was introduced in
commit a35b4ae898 (MDEV-15528).

This logic was tested by commenting out the $shutdown_timeout line
from a test and running the following:

./mtr --rr innodb.scrub
rr replay var/log/mysqld.1.rr/mariadbd-0

A breakpoint in the modified buf_flush_page() was hit, and the
FIL_PAGE_LSN of that page had been last modified during the
mtr_t::commit() of a mini-transaction where buf_page_free()
had been executed on that page.
2023-10-23 16:13:16 +03:00
Thirunarayanan Balathandayuthapani
7d89dcf1ae MDEV-32527 Server aborts during alter operation when table doesn't have foreign index
Problem:
========
InnoDB fails to find the foreign key index for the
foreign key relation in the table while iterating the
foreign key constraints during alter operation. This is
caused by commit 5f09b53bdb
(MDEV-31086).

Fix:
====
In check_col_is_in_fk_indexes(), while iterating through
the foreign key relationship, InnoDB should consider that
foreign key relation may not have foreign index when
foreign key check is disabled.
2023-10-20 15:23:22 +05:30
Marko Mäkelä
6991b1c47c Merge 10.5 into 10.6 2023-10-19 13:50:00 +03:00
Thirunarayanan Balathandayuthapani
85751ed81d MDEV-31851 After crash recovery, undo tablespace fails to open
srv_all_undo_tablespaces_open(): While opening the extra unused
undo tablespaces, InnoDB should use ULINT_UNDEFINED instead of
SRV_SPACE_ID_UPPER_BOUND.
2023-10-19 15:39:44 +05:30
Thirunarayanan Balathandayuthapani
dbba1bb1c3 MDEV-31851 After crash recovery, undo tablespace fails to open
recv_recovery_from_checkpoint_start(): InnoDB should add the
redo log block header + trailer size while checking the	log
sequence number in log file with log sequence number in the
system tablespace first page.
2023-10-19 13:12:10 +05:30
Marko Mäkelä
2d6dc65de5 MDEV-32144 fixup
In commit 384eb570a6 the debug check
was relaxed in trx_undo_header_create(), not in the intended function
trx_undo_write_xid().
2023-10-19 08:24:37 +03:00
Marko Mäkelä
cfd1788182 MDEV-32511: Race condition between checkpoint and page write
fil_aio_callback(): Invoke fil_node_t::complete_write() before
releasing any page latch, so that in case a log checkpoint is
executed roughly concurrently with the first write into a file
since the previous checkpoint, we will not miss a fdatasync()
or fsync() call to make the write durable.
2023-10-18 16:51:04 +03:00
Marko Mäkelä
bf7c6fc20b MDEV-32511 Assertion !os_aio_pending_writes() failed
In MemorySanitizer builds of 10.10 and 10.11, we would rather often
have the assertion fail in innodb_init() during mariadb-backup --prepare.
The assertion could also fail during InnoDB startup, but less often.

Before commit 685d958e38 in 10.8 the
log file cleanup after a successfully applied backup is different,
and the os_aio_pending_writes() assertion is in srv0start.cc.

IORequest::write_complete(): Invoke node->complete_write() before
releasing the page latch, so that a log checkpoint that is about to
execute concurrently will not miss a fdatasync() or fsync() on the
file, in case this was the first write since the last such call.

create_log_file(), srv_start(): Replace the debug assertion with
a debug check. For all intents and purposes, all writes could have
been completed but some write_io_callback() may not have invoked
io_slots::release() yet.
2023-10-18 16:33:11 +03:00
Daniel Black
e467e8d8c2 MDEV-30825 innodb_compression_algorithm=0 (none) increments Innodb_num_pages_page_compression_error
fil_page_compress_low returns 0 for both innodb_compression_algorithm=0
and where there is compression errors. On the two callers to this
function, don't increment the compression errors if the algorithm was
none.

Reviewed by: Marko Mäkelä
2023-10-18 19:18:50 +11:00
Thirunarayanan Balathandayuthapani
3da5d047b8 MDEV-31851 After crash recovery, undo tablespace fails to open
Problem:
========
- InnoDB fails to open undo tablespace when page0 is corrupted
and fails to throw error.

Solution:
=========
- InnoDB throws DB_CORRUPTION error when InnoDB encounters
page0 corruption of undo tablespace.

- InnoDB restores the page0 of undo tablespace from
doublewrite buffer if it encounters page corruption

- Moved Datafile::restore_from_doublewrite() to
recv_dblwr_t::restore_first_page(). So that undo
tablespace and system tablespace can use this function
instead of duplicating the code

srv_undo_tablespace_open(): Returns 0 if file doesn't exist
or ULINT_UNDEFINED if page0 is corrupted.
2023-10-17 18:41:21 +05:30
Thirunarayanan Balathandayuthapani
ee5cadd5c8 MDEV-28122 Optimize table crash while applying online log
- InnoDB fails to check the overflow buffer while applying
the operation to the table that was rebuilt. This is caused
by commit 3cef4f8f0f (MDEV-515).
2023-10-16 20:17:09 +05:30
Vlad Lesin
18fa00a54c MDEV-32272 lock_release_on_prepare_try() does not release lock if supremum bit is set along with other bits set in lock's bitmap
The error is caused by MDEV-30165 fix with the following commit:
d13a57ae81

There is logical error in lock_release_on_prepare_try():

        if (supremum_bit)
          lock_rec_unlock_supremum(*cell, lock);
        else
          lock_rec_dequeue_from_page(lock, false);

Because there can be other bits set in the lock's bitmap, and the lock
type can be suitable for releasing criteria, but the above logic
releases only supremum bit of the lock.

The fix is to release lock if it suits for releasing criteria and unlock
supremum if supremum is locked otherwise.

Tere is also the test for the case, which was reported by QA team. I
placed it in a separate files, because it requires debug build.

Reviewed by: Marko Mäkelä
2023-10-13 16:29:04 +03:00
Thirunarayanan Balathandayuthapani
cbad0bcd41 MDEV-31098 InnoDB Recovery doesn't display encryption message when no encryption configuration passed
- InnoDB fails to report the error when encryption configuration
wasn't passed. This patch addresses the issue by adding
the error while loading the tablespace and deferring the
tablespace creation.
2023-10-13 17:27:27 +05:30
Daniel Black
fbd11d5f29 MDEV-18200 MariaBackup full backup failed with InnoDB: Failing assertion: success
Review cleanups.
2023-10-13 09:48:57 +11:00
Daniel Black
c79ca7c7ad MDEV-18200 MariaBackup full backup failed with InnoDB: Failing assertion: success
There are many filesystem related errors that can occur with
MariaBackup. These already outputed to stderr with a good description of
the error. Many of these are permission or resource (file descriptor)
limits where the assertion and resulting core crash doesn't offer
developers anything more than the log message. To the user, assertions
and core crashes come across as poor error handling.

As such we return an error and handle this all the way up the stack.
2023-10-12 21:37:27 +11:00
Thirunarayanan Balathandayuthapani
4045ead9db MDEV-32337 Assertion `pos < table->n_def' failed in dict_table_get_nth_col
While checking for altered column in foreign key constraints,
InnoDB fails to ignore virtual columns. This issue caused
by commit 5f09b53bdb4e973e7c7ec2c53a24c98321223f98(MDEV-31086).
2023-10-12 14:49:27 +05:30
Thirunarayanan Balathandayuthapani
a2312b6fb2 MDEV-32017 Auto-increment no longer works for explicit FTS_DOC_ID
- InnoDB should avoid the sync commit operation when
there is nothing in fulltext cache. This is caused by
commit 1248fe7277 (MDEV-27582)
2023-10-12 14:48:43 +05:30
Marko Mäkelä
f9d471e2d5 Cleanup: Remove innobase_init_vc_templ()
This fixes up a merge of commit 4fb8f7d07a
with respect to commit ea37b14409.
2023-10-12 09:48:54 +03:00
Marko Mäkelä
3f1a256234 MDEV-31890: Remove COMPILE_FLAGS
The cmake configuration step is single-threaded and already consuming
too much time. We should not make it worse by adding invocations like
MY_CHECK_CXX_COMPILER_FLAG().

Let us prefer something that works on any supported version
of GCC (4.8.5 or later) or clang, as well as recent versions
of the Intel C compiler.

This replaces commit 1fde785315
2023-10-11 15:59:56 +03:00
Marko Mäkelä
625a150a86 Merge 10.5 into 10.6 2023-10-06 14:34:01 +03:00
Marko Mäkelä
6e9b421f77 MDEV-32364 Server crashes when starting server with high innodb_log_buffer_size
log_t::create(): Return whether the initialisation succeeded.
It may fail if too large an innodb_log_buffer_size is specified.
2023-10-06 14:16:01 +03:00
Vlad Lesin
96ae37abc5 MDEV-30658 lock_row_lock_current_waits counter in information_schema.innodb_metrics may become negative
MONITOR_OVLD_ROW_LOCK_CURRENT_WAIT monitor should has
MONITOR_DISPLAY_CURRENT flag set in its definition, as it shows the
current state and does not accumulate anything.

Reviewed by: Marko Mäkelä
2023-10-05 18:27:54 +03:00
Vladislav Vaintroub
e33e2fa949 MDEV-31095 tpool - restrict threadpool concurrency during bufferpool load
Add threadpool functionality to restrict concurrency during "batch"
periods (where tasks are added in rapid succession).
This will throttle thread creation more agressively than usual, while
keeping performance at least on-par.

One of these cases is bufferpool load, where async read IOs are executed
without any throttling. There can be as much as 650K read IOs for
loading 10GB buffer pool.

Another one is recovery, where "fake read" IOs are executed.

Why there are more threads than we expect?
Worker threads are not be recognized as idle, until they return to the
standby list, and to return to that list, they need to acquire
mutex currently held in the submit_task(). In those cases, submit_task()
has no worker to wake, and would create threads until default concurrency
level (2*ncpus) is satisfied. Only after that throttling would happen.
2023-10-04 17:44:02 +02:00
Daniel Black
ca66a2cbfa MDEV-18200 MariaBackup full backup failed with InnoDB: Failing assertion: success
There are many filesystem related errors that can occur with
MariaBackup. These already outputed to stderr with a good description of
the error. Many of these are permission or resource (file descriptor)
limits where the assertion and resulting core crash doesn't offer
developers anything more than the log message. To the user, assertions
and core crashes come across as poor error handling.

As such we return an error and handle this all the way up the stack.
2023-09-26 08:55:52 +10:00
Jan Lindström
076df87b4c MDEV-30217 : Assertion `mode_ == m_local || transaction_.is_streaming()' failed in int wsrep::client_state::bf_abort(wsrep::seqno)
Problem was that brute force (BF) thread requested conflicting lock
and was trying to kill victim transaction, but this victim
was also brute force thread. However, this victim was not actually
holding conflicting lock, instead both brute force transaction
and victim transaction were had insert intention locks.
We should not kill brute force victim transaction if requesting
lock does not need to wait.

Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
2023-09-25 16:38:55 +02:00
Yuchen Pei
6b343de8ef
Merge branch '10.4' into 10.5 2023-09-25 13:06:57 +10:00
Vladislav Vaintroub
1ee0d09a2b MDEV-32228 speedup opening tablespaces on Windows
is_file_on_ssd() is more expensive than it should be.
It caches the results by volume name, but still calls GetVolumePathName()
every time, which, as procmon shows, opens multiple directories in
filesystem hierarchy (db directory, datadir, and all ancestors)

The fix is to cache SSD status by volume serial ID, which is cheap to
retrieve with GetFileInformationByHandleEx()
2023-09-22 21:07:50 +02:00
Vlad Lesin
d13a57ae81 Merge 10.5 into 10.6. 2023-09-22 15:21:15 +03:00
Vlad Lesin
95730372bd MDEV-30165 X-lock on supremum for prepared transaction for RR
trx_t::set_skip_lock_inheritance() must be invoked at the very beginning
of lock_release_on_prepare(). Currently  trx_t::set_skip_lock_inheritance()
is invoked at the end of lock_release_on_prepare() when lock_sys and trx
are released, and there can be a case when locks on prepare are released,
but "not inherit gap locks" bit has not yet been set, and page split
inherits lock to supremum.

Also reset supremum bit and rebuild waiting queue when XA is prepared.

Reviewed by: Marko Mäkelä
2023-09-21 20:07:53 +03:00
Marko Mäkelä
52e7016248 Remove dead code
This fixes up commmit ed20e5b111
which fixed up the merge commit 202316a38f
2023-09-20 08:36:30 +03:00
Marko Mäkelä
60b039a864 Merge 10.5 into 10.6 2023-09-20 08:32:04 +03:00
Marko Mäkelä
d58f43f8b4 MDEV-21174 fixup: Remove unused ut_bit_set_nth()
This fixes up commit 56f6dab1d0
2023-09-19 18:02:56 +03:00
Thirunarayanan Balathandayuthapani
2fdacdcd69 MDEV-30802 Assertion `index->is_btree() || index->is_ibuf()' failed in btr_search_guess_on_hash
Problem:
=======
 - There is a race condition between purge and rollback of alter
operation. Alter rollback marks the index as corrupted.
At the same time, purge is working on the same index and leads
to assert failure. This is caused by
commit 7c0b9c6020 (MDEV-15250).

Solution:
=======
 - After MDEV-15250, InnoDB logs the operation only at the
end of transaction commit and applies the log in
ha_innobase::commit_inplace_alter_table() and
also via dml thread. So there is no need for purge to work
on uncommitted index.

The assertion would fail in the test innodb.innodb-index-online
when the following call is added to the start of the function
row_purge_remove_sec_if_poss_leaf():
if (!index->is_committed()) sleep(5);
2023-09-19 19:04:53 +05:30
Marko Mäkelä
8096139b3a Merge 10.5 into 10.6 2023-09-19 10:47:26 +03:00
Marko Mäkelä
6c05edfdcd Merge 10.4 into 10.5 2023-09-19 10:20:09 +03:00
Marko Mäkelä
76b688f100 MDEV-30024 InnoDB: tried to purge non-delete-marked of a virtual column prefix
row_vers_vc_matches_cluster(): Invoke dtype_get_at_most_n_mbchars()
to extract the correct number of bytes corresponding to the number
of characters in a virtual column prefix index, just like we do in
row_sel_sec_rec_is_for_clust_rec().

The test case would occasionally reproduce the failure when this
fix is not present.
2023-09-19 09:31:34 +03:00
Thirunarayanan Balathandayuthapani
85db6df412 MDEV-32151 InnoDB scrubbing doesn't write zero while freeing the page for temporary tablespace
- InnoDB fails to mark the page status as FREED during freeing
of page for temporary tablespace. This behaviour affects
scrubbing and doesn't write all zeroes in file even though
pages are freed.

mtr_t::free(): Mark the page as freed for temporary tablespace
also
2023-09-18 18:26:07 +05:30
Marko Mäkelä
6a470db552 Merge 10.5 into 10.6 2023-09-14 15:25:53 +03:00
Marko Mäkelä
81e60f1a0a MDEV-32163 Crash recovery fails after DROP TABLE in system tablespace
fseg_free_extent(): After fsp_free_extent() succeeded, properly
mark the affected pages as freed. We failed to write FREE_PAGE records.

This bug was revealed or caused by
commit e938d7c18f (MDEV-32028).
2023-09-14 15:17:27 +03:00
Marko Mäkelä
0f9acce3f2 Merge 10.5 into 10.6 2023-09-14 09:01:15 +03:00
Marko Mäkelä
cce76df5cc Fix cmake -DWITH_INNODB_AHI=OFF
This fixes up commit 6cc88c3db1

Thanks to Markus Mäkelä for reporting the build failure.
2023-09-14 08:58:41 +03:00
Marko Mäkelä
d20a4da23d MDEV-32150 InnoDB reports corruption on 32-bit platforms with ibd files sizes > 4GB
buf_read_page_low(): Use 64-bit arithmetics when computing the
file byte offset. In other calls to fil_space_t::io() the offset
was being computed correctly, for example by
buf_page_t::physical_offset().
2023-09-12 15:16:31 +03:00
Marko Mäkelä
736901b443 MDEV-30100 fixup: Remove a failing debug assertion
trx_purge_truncate_history(): Remove a debug assertion that
had originally been added in
commit 0de3be8cfd (MDEV-30671).
In trx_t::commit_empty() we do not have any efficient way to rewind
rseg.needs_purge to an accurate value that would satisfy this
debug assertion.

Note: No correctness property should be violated here. At the point
where the debug assertion was located, we had already established
that purge_sys.sees(rseg.needs_purge) holds, that is, it is safe
to remove everything from rseg.
2023-09-12 12:25:51 +03:00
Marko Mäkelä
3c840ae746 MDEV-26782 fixup: Remove dead code
trx_undo_reuse_cached(): Assert that this is being invoked on the
persistent rollback segment of the transaction, and remove dead code
that was handling cached temporary undo log. This was missed in
commit 51e62cb3b3 (MDEV-26782).
2023-09-12 12:03:35 +03:00
Thirunarayanan Balathandayuthapani
a03b8cd0a2 MDEV-32145 Disable read-ahead for temporary tablespace
- Lifetime of temporary tables is expected to be short, it would
seem to make sense to assume that all temporary tablespace pages
will remain in the buffer pool. It doesn't make sense to have
read-ahead for pages of temporary tablespace
2023-09-11 18:02:53 +05:30
Marko Mäkelä
cdd2fa7fc5 MDEV-32134 InnoDB hang in buf_flush_wait_LRU_batch_end()
buf_flush_page_cleaner(): Before finishing a batch, wake up any threads
that are waiting for buf_pool.done_flush_LRU.

This should fix a hung shutdown that we observed
after SET GLOBAL innodb_buffer_pool_size started was executed
to shrink the InnoDB buffer pool.
2023-09-11 14:54:50 +03:00
Marko Mäkelä
466d9f5ff3 MDEV-32103 InnoDB ALTER TABLE is not crash-safe
Starting with commit 4ff5311dec
log_write_up_to(trx->commit_lsn, true) in DDL operations could end up
being a no-op, because trx->commit_lsn would be 0.

trx_flush_log_if_needed(): Revert an incorrect attempt to ensure
that DDL operations are crash-safe.

trx_t::commit(std::vector<pfs_os_file_t> &), ha_innobase::rename_table():
Set trx_t::flush_log_later so that trx_t::commit_in_memory() will
retain trx_t::commit_lsn for the final durability call.

Tested by: Matthias Leich
2023-09-11 14:54:17 +03:00
Marko Mäkelä
4a8291fc5f MDEV-30531 Corrupt index(es) on busy table when using FOREIGN KEY
lock_wait(): Never return the transient error code DB_LOCK_WAIT.
In commit 78a04a4c22 (MDEV-29869)
some assignments assign trx->error_state = DB_SUCCESS were removed,
and it was possible that the field was left at its initial value
DB_LOCK_WAIT.

The test case for this is nondeterministic; without this fix, it
would only occasionally fail.

Reviewed by: Vladislav Lesin
2023-09-11 14:52:05 +03:00
Marko Mäkelä
e039720bf3 MDEV-32096 Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait
lock_sys_t::cancel(trx_t*): Remove, and merge to its only caller
innobase_kill_query().

innobase_kill_query(): Before reading trx->lock.wait_lock,
do acquire lock_sys.wait_mutex, like we did before
commit e71e613353 (MDEV-24671).
In this way, we should not miss a recently started lock wait
by the killee transaction.

lock_rec_lock(): Add a DEBUG_SYNC "lock_rec" for the test case.

lock_wait(): Invoke trx_is_interrupted() before entering the wait,
in case innobase_kill_query() was invoked some time earlier and
some longer-running operation did not check for interrupts.
As suggested by Vladislav Lesin, do not overwrite
trx->error_state==DB_INTERRUPTED with DB_SUCCESS.
This would avoid a call to trx_is_interrupted() when the test is
modified to use the DEBUG_SYNC point lock_wait_start instead of lock_rec.
Avoid some redundant loads of trx->lock.wait_lock; cache the value
in the local variable wait_lock.

Deadlock::check_and_resolve(): Take wait_lock as a parameter and
return wait_lock (or -1 or nullptr). We only need to reload
trx->lock.wait_lock if lock_sys.wait_mutex had been released
and reacquired.

trx_t::error_state: Correctly document the data member.

trx_lock_t::was_chosen_as_deadlock_victim: Clarify that other threads
may set the field (or flags in it) while holding lock_sys.wait_mutex.

Thanks to Johannes Baumgarten for reporting the problem and testing
the fix, as well as to Kristian Nielsen for suggesting the fix.

Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
2023-09-11 14:51:02 +03:00
Marko Mäkelä
0dd25f28f7 Merge 10.5 into 10.6 2023-09-11 14:46:39 +03:00
Marko Mäkelä
384eb570a6 MDEV-32144 Debug assertion failure w == MAYBE_NOP in mtr_t::memcpy()
trx_undo_write_trx_xid(): Silence the debug assertion by passing
a template parameter that causes us to not care that the contents of
the page did not actually change and no log record would be written.
This debug assertion could fail if XA PREPARE was executed multiple
times with the same XID.
2023-09-11 11:48:15 +03:00
Marko Mäkelä
f8f7d9de2c Merge 10.4 into 10.5 2023-09-11 11:29:31 +03:00
Marko Mäkelä
65c99207e0 MDEV-23841: Memory leak in innodb_monitor_validate()
innodb_monitor_validate(): Let item_val_str() allocate the memory
in THD, so that it will be available to innodb_monitor_update().
In this way, there is no need to allocate another buffer, and
no problem if the call to innodb_monitor_update() is skipped due
to an invalid value that is passed to another configuration parameter.

There are some other callers to st_mysql_sys_var::val_str()
that validate configuration parameters that are related to FULLTEXT INDEX,
but they will allocate memory by invoking thd_strmake().
2023-09-11 10:27:21 +03:00
Sergei Golubchik
fba4abf3b9 MDEV-32128 wrong table name in innodb's "row too big" errors 2023-09-08 19:15:33 +02:00
Marko Mäkelä
34c283ba1b MDEV-32132 DROP INDEX followed by CREATE INDEX may corrupt data
ibuf_set_bitmap_for_bulk_load(): Port a bug fix that was made as part of
commit 165564d3c3 (MDEV-30009)
in MariaDB Server 10.5.19.
2023-09-08 11:28:21 +03:00
Nayana Thorat
961b96a5e0 MDEV-29324 s390x patch srw_lock.cc
Fix debug mode build failure on s390x.
Replaced builtin_ttest by __builtin_tx_nesting_depth() > 0 as a s390x equivalent version of the expression.
2023-09-07 01:45:49 -07:00
Marko Mäkelä
b0a43818b4 Merge 10.5 into 10.6 2023-09-04 10:15:02 +03:00
Marko Mäkelä
59952b2625 Merge 10.4 into 10.5 2023-09-04 09:40:26 +03:00
Thirunarayanan Balathandayuthapani
d1fca0baab MDEV-32060 Server aborts when table doesn't have referenced index
- Server aborts when table doesn't have referenced index.
This is caused by 5f09b53bdb (MDEV-31086).
While iterating the foreign key constraints, we fail to
consider that InnoDB doesn't have referenced index for
it when foreign key check is disabled.
2023-09-01 17:54:07 +05:30
Marko Mäkelä
2325f8f339 Merge 10.5 into 10.6 2023-08-31 13:01:42 +03:00
Marko Mäkelä
2db5f1b298 MDEV-32049 Deadlock due to log_free_check() in trx_purge_truncate_history()
The function log_free_check() is not supposed to be invoked while
the caller is holding any InnoDB synchronization objects, such as
buffer page latches, tablespace latches, index tree latches, or
in this case, rseg->mutex (rseg->latch in 10.6 or later).

A hang was reported in 10.6 where several threads were waiting for
an rseg->latch that had been exclusively acquired in
trx_purge_truncate_history(), which invoked log_free_check() inside
trx_purge_truncate_rseg_history(). Because the threads that were
waiting for the rseg->latch were holding exclusive latches on some
index pages, log_free_check() was unable to advance the checkpoint
because those index pages could not be written out.

trx_purge_truncate_history(): Invoke log_free_check() before
acquiring the rseg->mutex and invoking trx_purge_free_segment().

trx_purge_free_segment(): Do not invoke log_free_check() in order
to avoid a deadlock.
2023-08-31 12:14:49 +03:00
Marko Mäkelä
9d1466522e MDEV-32029 Assertion failures in log_sort_flush_list upon crash recovery
In commit 0d175968d1 (MDEV-31354)
we only waited that no buf_pool.flush_list writes are in progress.
The buf_flush_page_cleaner() thread could still initiate page writes
from the buf_pool.LRU list while only holding buf_pool.mutex, not
buf_pool.flush_list_mutex. This is something that was changed in
commit a55b951e60 (MDEV-26827).

log_sort_flush_list(): Wait for the buf_flush_page_cleaner() thread to
be completely idle, including LRU flushing.

buf_flush_page_cleaner(): Always broadcast buf_pool.done_flush_list
when becoming idle, so that log_sort_flush_list() will be woken up.
Also, ensure that buf_pool.n_flush_inc() or
buf_pool.flush_list_set_active() has been invoked before any page
writes are initiated.

buf_flush_try_neighbors(): Release buf_pool.mutex here and not in the
callers, to avoid code duplication. Make innodb_flush_neighbors=ON
obey the innodb_io_capacity limit.
2023-08-30 14:40:13 +03:00
Marko Mäkelä
31ea201ecc MDEV-30986 Slow full index scan for I/O bound case
buf_page_init_for_read(): Test a condition before acquiring a latch,
not while holding it.

buf_read_ahead_linear(): Do not use a memory transaction, because it
could be too large, leading to frequent retries.
Release the hash_lock as early as possible.
2023-08-30 13:20:27 +03:00
Thirunarayanan Balathandayuthapani
e938d7c18f MDEV-32028 InnoDB scrubbing doesn't write zero while freeing the extent
Problem:
========
InnoDB fails to mark the page status as FREED during
freeing of an extent of a segment. This behaviour affects
scrubbing and doesn't write all zeroes in file even though
pages are freed.

Solution:
========
InnoDB should mark the page status as FREED before
reinitialize the extent descriptor entry.
2023-08-28 20:27:19 +05:30
Dmitry Shulga
1fde785315 MDEV-31890: Compilation failing on MacOS (unknown warning option -Wno-unused-but-set-variable)
For clang compiler the compiler's flag -Wno-unused-but-set-variable
was set based on compiler version. This approach could result in
false positive detection for presence of compiler option since
only first three groups of digits in compiler version taken into account
and it could lead to inaccuracy in determining of supported compiler's
features.

Correct way to detect options supported by a compiler is to use
the macros  MY_CHECK_CXX_COMPILER_FLAG and to check the result of
variable with prefix have_CXX__
So, to check whether compiler does support the option
 -Wno-unused-but-set-variable
the macros
 MY_CHECK_CXX_COMPILER_FLAG(-Wno-unused-but-set-variable)
should be called and the result variable
 have_CXX__Wno_unused_but_set_variable
be tested for assigned value.
2023-08-28 16:47:00 +07:00
Thirunarayanan Balathandayuthapani
c438284863 MDEV-31835 Remove unnecesary extra HA_EXTRA_IGNORE_INSERT call
- HA_EXTRA_IGNORE_INSERT call is being called for every inserted row,
and on partitioned tables on every row * every partition.
This leads to slowness during load..data operation

- Under bulk operation, multiple insert statement error handling
will end up emptying the table. This behaviour introduced by the
commit 8ea923f55b (MDEV-24818).
This makes the HA_EXTRA_IGNORE_INSERT call redundant. We can
use the same behavior for insert..ignore statement as well.

- Removed the extra call HA_EXTRA_IGNORE_INSERT as the solution
to improve the performance of load command.
2023-08-25 17:22:17 +05:30
Marko Mäkelä
08a549c33d Clean up buf_LRU_remove_hashed()
buf_LRU_block_remove_hashed(): Test for "not ROW_FORMAT=COMPRESSED" first,
because in that case we can assume that an uncompressed page exists.
This removes a condition from the likely code branch.
2023-08-25 13:44:59 +03:00
Marko Mäkelä
f7780a8eb8 MDEV-30100: Assertion purge_sys.tail.trx_no <= purge_sys.rseg->last_trx_no()
trx_t::commit_empty(): A special case of transaction "commit" when
the transaction was actually rolled back or the persistent undo log
is empty. In this case, we need to change the undo log header state to
TRX_UNDO_CACHED and move the undo log from rseg->undo_list to
rseg->undo_cached for fast reuse. Furthermore, unless this is the only
undo log record in the page, we will remove the record and rewind
TRX_UNDO_PAGE_START, TRX_UNDO_PAGE_FREE, TRX_UNDO_LAST_LOG.

We must also ensure that the system-wide transaction identifier
will be persisted up to this->id, so that there will not be warnings or
errors due to a PAGE_MAX_TRX_ID being too large. We might have modified
secondary index pages before being rolled back, and any changes of
PAGE_MAX_TRX_ID are never rolled back.

Even though it is not going to be written persistently anywhere,
we will invoke trx_sys.assign_new_trx_no(this), so that in the test
innodb.instant_alter everything will be purged as expected.

trx_t::write_serialisation_history(): Renamed from
trx_write_serialisation_history(). If there is no undo log,
invoke commit_empty().

trx_purge_add_undo_to_history(): Simplify an assertion and remove a
comment. This function will not be invoked on an empty undo log anymore.

trx_undo_header_create(): Add a debug assertion.

trx_undo_mem_create_at_db_start(): Remove a duplicated assignment.

Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
2023-08-25 13:41:54 +03:00
Marko Mäkelä
4ff5311dec MDEV-30100 preparation: Simplify InnoDB transaction commit further
trx_commit_complete_for_mysql(): Remove some conditions.
We will rely on trx_t::commit_lsn.

trx_t::must_flush_log_later: Remove. trx_commit_complete_for_mysql()
can simply check for trx_t::flush_log_later.

trx_t::commit_in_memory(): Set commit_lsn=0 if the log was written.

trx_flush_log_if_needed_low(): Renamed to trx_flush_log_if_needed().
Assert that innodb_flush_log_at_trx_commit!=0 was checked by
the caller and that the transaction is not in XA PREPARE state.
Unconditionally flush the log for data dictionary transactions,
to ensure the correct processing of ddl_recovery.log.

trx_write_serialisation_history(): Move some code from
trx_purge_add_undo_to_history().

trx_prepare(): Invoke log_write_up_to() directly if needed.

innobase_commit_ordered_2(): Simplify some conditions.
A read-write transaction will always carry nonzero trx_t::id.
Let us unconditionally reset mysql_log_file_name, flush_log_later
after trx_t::commit() was invoked.
2023-08-25 13:23:21 +03:00
Marko Mäkelä
f4bbea90f1 MDEV-30100 preparation: Simplify InnoDB transaction commit
trx_commit_cleanup(): Clean up any temporary undo log.
Replaces trx_undo_commit_cleanup() and trx_undo_seg_free().

trx_write_serialisation_history(): Commit the mini-transaction.
Do not touch temporary undo logs. Assume that a persistent rollback
segment has been assigned.

trx_serialise(): Merged into trx_write_serialisation_history().

trx_t::commit_low(): Correct some comments and assertions.

trx_t::commit_persist(): Only invoke commit_low() on a mini-transaction
if the persistent state needs to change.
2023-08-25 13:16:54 +03:00
Marko Mäkelä
eda75cadea Merge 10.5 into 10.6 2023-08-24 10:16:24 +03:00
Marko Mäkelä
aeb8eae5c8 Merge 10.4 into 10.5 2023-08-24 10:12:13 +03:00
Marko Mäkelä
02878f128e MDEV-31813 SET GLOBAL innodb_max_purge_lag_wait hangs if innodb_read_only
innodb_max_purge_lag_wait_update(): Return immediately if we are
in high_level_read_only mode.

srv_wake_purge_thread_if_not_active(): Relax a debug assertion.
If srv_read_only_mode holds, purge_sys.enabled() will not hold
and this function will do nothing.

trx_t::commit_in_memory(): Remove a redundant condition before
invoking srv_wake_purge_thread_if_not_active().
2023-08-24 10:08:51 +03:00
Marko Mäkelä
a60462d93e Remove bogus references to replaced Google contributions
In commit 03ca6495df and
commit ff5d306e29
we forgot to remove some Google copyright notices related to
a contribution of using atomic memory access in the old InnoDB
mutex_t and rw_lock_t implementation.

The copyright notices had been mostly added in
commit c6232c06fa
due to commit a1bb700fd2.

The following Google contributions remain:
* some logic related to the parameter innodb_io_capacity
* innodb_encrypt_tables, added in MariaDB Server 10.1
2023-08-21 15:51:16 +03:00
Marko Mäkelä
6cc88c3db1 Clean up buf0buf.inl
Let us move some #include directives from buf0buf.inl to
the compilation units where they are really used.
2023-08-21 15:51:10 +03:00
Marko Mäkelä
448c2077fb Merge 10.5 into 10.6 2023-08-21 15:50:31 +03:00
Marko Mäkelä
be5fd3ec35 Remove a stale comment
buf_LRU_block_remove_hashed(): Remove a comment that had been added
in mysql/mysql-server@aad1c7d0dd
and apparently referring to buf_LRU_invalidate_tablespace(),
which was later replaced with buf_LRU_flush_or_remove_pages() and
ultimately with buf_flush_remove_pages() and buf_flush_list_space().
All that code is covered by buf_pool.mutex. The note about releasing
the hash_lock for the buf_pool.page_hash slice would actually apply to
the last reference to hash_lock in buf_LRU_free_page(), for the
case zip=false (retaining a ROW_FORMAT=COMPRESSED page while
discarding the uncompressed one).
2023-08-21 13:28:12 +03:00
Marko Mäkelä
5a8a8fc953 MDEV-31928 Assertion xid ... < 128 failed in trx_undo_write_xid()
trx_undo_write_xid(): Correct an off-by-one error in a debug assertion.
2023-08-17 10:31:55 +03:00
Marko Mäkelä
518fe51988 MDEV-31254 InnoDB: Trying to read doublewrite buffer page
buf_read_page_low(): Remove an error message that could be triggered
by buf_read_ahead_linear() or buf_read_ahead_random().

This is a backport of commit c9eff1a144
from MariaDB Server 10.5.
2023-08-17 10:31:44 +03:00
Marko Mäkelä
44df6f35aa MDEV-31875 ROW_FORMAT=COMPRESSED table: InnoDB: ... Only 0 bytes read
buf_read_ahead_random(), buf_read_ahead_linear(): Avoid read-ahead
of the last page(s) of ROW_FORMAT=COMPRESSED tablespaces that use
a page size of 1024 or 2048 bytes. We invoke os_file_set_size() on
integer multiples of 4096 bytes in order to be compatible with
the requirements of innodb_flush_method=O_DIRECT regardless of the
physical block size of the underlying storage.

This change must be null-merged to MariaDB Server 10.5 and later.
There, out-of-bounds read-ahead should be handled gracefully
by simply discarding the buffer page that had been allocated.

Tested by: Matthias Leich
2023-08-17 10:31:28 +03:00
Kristian Nielsen
7c9837ce74 Merge 10.4 into 10.5
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-08-15 18:02:18 +02:00
Kristian Nielsen
805e0668c9 MDEV-31482: Lock wait timeout with INSERT-SELECT, autoinc, and statement-based replication
Remove the exception that InnoDB does not report auto-increment locks waits
to the parallel replication.

There was an assumption that these waits could not cause conflicts with
in-order parallel replication and thus need not be reported. However, this
assumption is wrong and it is possible to get conflicts that lead to hangs
for the duration of --innodb-lock-wait-timeout. This can be seen with three
transactions:

1. T1 is waiting for T3 on an autoinc lock
2. T2 is waiting for T1 to commit
3. T3 is waiting on a normal row lock held by T2

Here, T3 needs to be deadlock killed on the wait by T1.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-08-15 16:40:02 +02:00
Kristian Nielsen
18acbaf416 MDEV-31655: Parallel replication deadlock victim preference code errorneously removed
Restore code to make InnoDB choose the second transaction as a deadlock
victim if two transactions deadlock that need to commit in-order for
parallel replication. This code was erroneously removed when VATS was
implemented in InnoDB.

Also add a test case for InnoDB choosing the right deadlock victim.
Also fixes this bug, with testcase that reliably reproduces:

MDEV-28776: rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master

Reviewed-by: Marko Mäkelä <marko.makela@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-08-15 16:39:49 +02:00
Kristian Nielsen
900c4d6920 MDEV-31655: Parallel replication deadlock victim preference code errorneously removed
Restore code to make InnoDB choose the second transaction as a deadlock
victim if two transactions deadlock that need to commit in-order for
parallel replication. This code was erroneously removed when VATS was
implemented in InnoDB.

Also add a test case for InnoDB choosing the right deadlock victim.
Also fixes this bug, with testcase that reliably reproduces:

MDEV-28776: rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master

Note: This should be null-merged to 10.6, as a different fix is needed
there due to InnoDB locking code changes.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-08-15 16:35:30 +02:00
Kristian Nielsen
920789e9d4 MDEV-31482: Lock wait timeout with INSERT-SELECT, autoinc, and statement-based replication
Remove the exception that InnoDB does not report auto-increment locks waits
to the parallel replication.

There was an assumption that these waits could not cause conflicts with
in-order parallel replication and thus need not be reported. However, this
assumption is wrong and it is possible to get conflicts that lead to hangs
for the duration of --innodb-lock-wait-timeout. This can be seen with three
transactions:

1. T1 is waiting for T3 on an autoinc lock
2. T2 is waiting for T1 to commit
3. T3 is waiting on a normal row lock held by T2

Here, T3 needs to be deadlock killed on the wait by T1.

Note: This should be null-merged to 10.6, as a different fix is needed
there due to InnoDB lock code changes.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2023-08-15 16:34:09 +02:00
Marko Mäkelä
e9723c2cbb MDEV-31473 Wrong information about innodb_checksum_algorithm in information_schema.SYSTEM_VARIABLES
MYSQL_SYSVAR_ENUM(checksum_algorithm): Correct the documentation string.
Fixes up commit 7a4fbb55b0 (MDEV-25105).
2023-08-14 13:36:17 +03:00
Oleksandr Byelkin
d28d636f57 Merge branch '10.5' into 10.6 2023-08-08 13:20:58 +02:00
Oleksandr Byelkin
8852afe317 Merge branch '10.4' into 10.5 2023-08-08 11:24:42 +02:00
Thirunarayanan Balathandayuthapani
0ede90dd31 MDEV-31869 Server aborts when table does drop column
- InnoDB aborts when table is dropping the column. This is
caused by 5f09b53bdb (MDEV-31086).
While iterating the altered table fields, we fail to consider
the dropped columns.
2023-08-08 13:24:23 +05:30
Oleksandr Byelkin
5ea5291d97 Merge branch '10.5' into 10.6 2023-08-04 07:52:54 +02:00
Sergei Golubchik
ab1191c039 cleanup: key->key_create_info.check_for_duplicate_indexes -> key->old
mark old keys in the ALTER TABLE with the `old` flag, not with
the `key_create_info.check_for_duplicate_indexes`.

This allows to mark old foreign keys too.
2023-08-01 22:43:16 +02:00
Oleksandr Byelkin
6bf8483cac Merge branch '10.5' into 10.6 2023-08-01 15:08:52 +02:00
Marko Mäkelä
72928e640e MDEV-27593: Crashing on I/O error is unhelpful
buf_page_t::write_complete(), buf_page_write_complete(),
IORequest::write_complete(): Add a parameter for passing
an error code. If an error occurred, we will release the
io-fix, buffer-fix and page latch but not reset the
oldest_modification field. The block would remain in
buf_pool.LRU and possibly buf_pool.flush_list, to be written
again later, by buf_flush_page_cleaner(). If all page writes
start consistently failing, all write threads should eventually
hang in log_free_check() because the log checkpoint cannot
be advanced to make room in the circular write-ahead-log ib_logfile0.

IORequest::read_complete(): Add a parameter for passing
an error code. If a read operation fails, we report the error
and discard the page, just like we would do if the page checksum
was not validated or the page could not be decrypted.
This only affects asynchronous reads, due to linear or random read-ahead
or crash recovery. When buf_page_get_low() invokes buf_read_page(),
that will be a synchronous read, not involving this code.

This was tested by randomly injecting errors in
write_io_callback() and read_io_callback(), like this:

  if (!ut_rnd_interval(100))
    cb->m_err= 42;
2023-08-01 14:39:29 +03:00
Marko Mäkelä
96cfdb8710 MDEV-31816 fixup: Relax a debug assertion
buf_LRU_free_page(): The block may also be in the IBUF_EXIST state
when executing the test innodb.innodb_bulk_create_index_debug.
2023-08-01 13:22:16 +03:00
Oleksandr Byelkin
65405308a1 Merge branch '10.4' into 10.5 2023-08-01 11:52:13 +02:00
Marko Mäkelä
d794d3484b MDEV-31816 buf_LRU_free_page() does not preserve ROW_FORMAT=COMPRESSED block state
buf_LRU_free_page(): When we are discarding the uncompressed copy of a
ROW_FORMAT=COMPRESSED page, buf_page_t::can_relocate() must have ensured
that the block descriptor state is one of FREED, UNFIXED, REINIT.
Do not overwrite the state with UNFIXED. We do not want to write back
pages that were actually freed, and we want to avoid doublewrite for
pages that were (re)initialized by log records written since the latest
checkpoint. Last but not least, we do not want crashes like those that
commit dc1bd1802a (MDEV-31386)
was supposed to fix.

The test innodb_zip.wl5522_zip should typically cover all 3 states.

This bug is a regression due to
commit aaef2e1d8c (MDEV-27058).
2023-08-01 09:58:15 +03:00
Aleksey Midenkov
69b118a346 Revert "MDEV-30528 Assertion in dtype_get_at_most_n_mbchars"
This reverts commit add0c01bae

Duplicates must be avoided in FTS_DOC_ID_INDEX
2023-07-31 16:57:18 +03:00
Marko Mäkelä
0d175968d1 MDEV-31354 SIGSEGV in log_sort_flush_list() in InnoDB crash recovery
log_sort_flush_list(): Wait for any pending page writes to cease
before sorting the buf_pool.flush_list. Starting with
commit 22b62edaed (MDEV-25113),
it is possible that some buf_page_t::oldest_modification_
that we will be comparing in std::sort() will be updated from
some value >2 to 1 while we are holding buf_pool.flush_list_mutex.

To catch this type of trouble better in the future, we will clean
garbage (pages that have been written out) from buf_pool.flush_list
while constructing the array for sorting, and check with debug
assertions that all blocks that we are copying from the array to the
list will be dirty (requiring a writeback) while we sort and copy
the array back to buf_pool.flush_list.

This failure was observed by chance exactly once when running the test
innodb.recovery_memory. It was never reproduced in the same form
afterwards. Unrelated to this change, the test does occasionally
reproduce a failure to start up InnoDB due to a corrupted page
being read in recovery. The ticket MDEV-31791 was filed for that.

Tested by: Matthias Leich
2023-07-28 12:36:45 +03:00
Oleksandr Byelkin
7564be1352 Merge branch '10.4' into 10.5 2023-07-26 16:02:57 +02:00
Marko Mäkelä
b102872ad5 MDEV-31767 InnoDB tables are being flagged as corrupted on an I/O bound server
The main problem is that at ever since
commit aaef2e1d8c removed the
function buf_wait_for_read(), it is not safe to invoke
buf_page_get_low() with RW_NO_LATCH, that is, only buffer-fixing
the page. If a page read (or decryption or decompression) is in
progress, there would be a race condition when executing consistency
checks, and a page would wrongly be flagged as corrupted.

Furthermore, if the page is actually corrupted and the initial
access to it was with RW_NO_LATCH (only buffer-fixing), the
page read handler would likely end up in an infinite loop in
buf_pool_t::corrupted_evict(). It is not safe to invoke
mtr_t::upgrade_buffer_fix() on a block on which a page latch
was not initially acquired in buf_page_get_low().

btr_block_reget(): Remove the constant parameter rw_latch=RW_X_LATCH.

btr_block_get(): Assert that RW_NO_LATCH is not being used,
and change the parameter type of rw_latch.

btr_pcur_move_to_next_page(), innobase_table_is_empty(): Adjust for the
parameter type change of btr_block_get().

btr_root_block_get(): If mode==RW_NO_LATCH, do not check the integrity of
the page, because it is not safe to do so.

btr_page_alloc_low(), btr_page_free(): If the root page latch is not
previously held by the mini-transaction, invoke btr_root_block_get()
again with the proper latching mode.

btr_latch_prev(): Helper function to safely acquire a latch on a
preceding sibling page while holding a latch on a B-tree page.
To avoid deadlocks, we must not wait for the latch while holding
a latch on the current page, because another thread may be waiting
for our page latch when moving to the next page from our preceding
sibling page. If s_lock_try() or x_lock_try() on the preceding page fails,
we must release the current page latch, and wait for the latch on the
preceding page as well as the current page, in that order.
Page splits or merges will be prevented by the parent page latch
that we are holding.

btr_cur_t::search_leaf(): Make use of btr_latch_prev().

btr_cur_t::open_leaf(): Make use of btr_latch_prev(). Do not invoke
mtr_t::upgrade_buffer_fix() (when latch_mode == BTR_MODIFY_TREE),
because we will already have acquired all page latches upfront.

btr_cur_t::pessimistic_search_leaf(): Do acquire an exclusive index latch
before accessing the page. Make use of btr_latch_prev().
2023-07-25 11:40:58 +03:00
Marko Mäkelä
9bb5b25325 MDEV-31120 Duplicate entry allowed into a UNIQUE column
row_ins_sec_index_entry_low(): Correct a condition that was
inadvertently inverted
in commit 89ec4b53ac (MDEV-29603).

We are not supposed to buffer INSERT operations into unique indexes,
because duplicate key values would not be checked for. It is only
allowed when using unique_checks=0, and in that case the user is
supposed to guarantee that there are no duplicates.
2023-07-24 12:29:43 +03:00
Aleksey Midenkov
14cc7e7d6e MDEV-25644 UPDATE not working properly on transaction precise system versioned table
First UPDATE under START TRANSACTION does nothing (nstate= nstate),
but anyway generates history. Since update vector is empty we get into
(!uvect->n_fields) branch which only adds history row, but does not do
update. After that we get current row with wrong (old) row_start value
and because of that second UPDATE tries to insert history row again
because it sees trx->id != row_start which is the guard to avoid
inserting multiple trx_id-based history rows under same transaction
(because we have same trx_id and we get duplicate error and this bug
demostrates that). But this try anyway fails because PK is based on
row_end which is constant under same transaction, so PK didn't change.

The fix moves vers_make_update() to an earlier stage of
calc_row_difference(). Therefore it prepares update vector before
(!uvect->n_fields) check and never gets into that branch, hence no
need to handle versioning inside that condition anymore.

Now trx->id and row_start are equal after first UPDATE and we don't
try to insert second history row.

== Cleanups and improvements ==

ha_innobase::update_row():

vers_set_fields and vers_ins_row are cleaned up into direct condition
check. SQLCOM_ALTER_TABLE check now is not used as this is dead code,
assertion is done instead.

upd_node->is_delete is set in calc_row_difference() just to keep
versioning code as much in one place as possible. vers_make_delete()
is still located in row_update_for_mysql() as this is required for
ha_innodbase::delete_row() as well.

row_ins_duplicate_error_in_clust():

Restrict DB_FOREIGN_DUPLICATE_KEY to the better conditions.
VERSIONED_DELETE is used specifically to help lower stack to
understand what caused current insert. Related to MDEV-29813.
2023-07-20 18:22:31 +03:00
Aleksey Midenkov
add0c01bae MDEV-30528 Assertion in dtype_get_at_most_n_mbchars
1. Exclude merging history rows into fts index.

The check !history_fts && (index->type & DICT_FTS) was just incorrect
attempt to avoid history in fts index.

2. Don't check for duplicates for history rows.
2023-07-20 18:22:30 +03:00
Oleksandr Byelkin
f52954ef42 Merge commit '10.4' into 10.5 2023-07-20 11:54:52 +02:00
Vlad Lesin
090a84366a MDEV-29311 Server Status Innodb_row_lock_time% is reported in seconds
Before MDEV-24671, the wait time was derived from my_interval_timer() /
1000 (nanoseconds converted to microseconds, and not microseconds to
milliseconds like I must have assumed). The lock_sys.wait_time and
lock_sys.wait_time_max are already in milliseconds; we should not divide
them by 1000.

In MDEV-24738 the millisecond counts lock_sys.wait_time and
lock_sys.wait_time_max were changed to a 32-bit type. That would
overflow in 49.7 days. Keep using a 64-bit type for those millisecond
counters.

Reviewed by: Marko Mäkelä
2023-07-10 12:42:46 +03:00
Monty
99bd226059 MDEV-31558 Add InnoDB engine information to the slow query log
The new statistics is enabled by adding the "engine", "innodb" or "full"
option to --log-slow-verbosity

Example output:

 # Pages_accessed: 184  Pages_read: 95  Pages_updated: 0  Old_rows_read: 1
 # Pages_read_time: 17.0204  Engine_time: 248.1297

Page_read_time is time doing physical reads inside a storage engine.
(Writes cannot be tracked as these are usually done in the background).
Engine_time is the time spent inside the storage engine for the full
duration of the read/write/update calls. It uses the same code as
'analyze statement' for calculating the time spent.

The engine statistics is done with a generic interface that should be
easy for any engine to use. It can also easily be extended to provide
even more statistics.

Currently only InnoDB has counters for Pages_% and Undo_% status.
Engine_time works for all engines.

Implementation details:

class ha_handler_stats holds all engine stats.  This class is included
in handler and THD classes.
While a query is running, all statistics is updated in the handler. In
close_thread_tables() the statistics is added to the THD.

handler::handler_stats is a pointer to where statistics should be
collected. This is set to point to handler::active_handler_stats if
stats are requested. If not, it is set to 0.
handler_stats has also an element, 'active' that is 1 if stats are
requested. This is to allow engines to avoid doing any 'if's while
updating the statistics.

Cloned or partition tables have the pointer set to the base table if
status are requested.

There is a small performance impact when using --log-slow-verbosity=engine:
- All engine calls in 'select' will be timed.
- IO calls for InnoDB reads will be timed.
- Incrementation of counters are done on local variables and accesses
  are inline, so these should have very little impact.
- Statistics has to be reset for each statement for the THD and each
  used handler. This is only 40 bytes, which should be neglectable.
- For partition tables we have to loop over all partitions to update
  the handler_status as part of table_init(). Can be optimized in the
  future to only do this is log-slow-verbosity changes. For this to work
  we have to update handler_status for all opened partitions and
  also for all partitions opened in the future.

Other things:
- Added options 'engine' and 'full' to log-slow-verbosity.
- Some of the new files in the test suite comes from Percona server, which
  has similar status information.
- buf_page_optimistic_get(): Do not increment any counter, since we are
  only validating a pointer, not performing any buf_pool.page_hash lookup.
- Added THD argument to save_explain_data_intern().
- Switched arguments for save_explain_.*_data() to have
  always THD first (generates better code as other functions also have THD
  first).
2023-07-07 12:53:18 +03:00
Vlad Lesin
1bfd3cc457 MDEV-10962 Deadlock with 3 concurrent DELETEs by unique key
PROBLEM:
A deadlock was possible when a transaction tried to "upgrade" an already
held Record Lock to Next Key Lock.

SOLUTION:
This patch is based on observations that:
(1) a Next Key Lock is equivalent to Record Lock combined with Gap Lock
(2) a GAP Lock never has to wait for any other lock
In case we request a Next Key Lock, we check if we already own a Record
Lock of equal or stronger mode, and if so, then we change the requested
lock type to GAP Lock, which we either already have, or can be granted
immediately, as GAP locks don't conflict with any other lock types.
(We don't consider Insert Intention Locks a Gap Lock in above statements).

The reason of why we don't upgrage Record Lock to Next Key Lock is the
following.

Imagine a transaction which does something like this:

for each row {
    request lock in LOCK_X|LOCK_REC_NOT_GAP mode
    request lock in LOCK_S mode
}

If we upgraded lock from Record Lock to Next Key lock, there would be
created only two lock_t structs for each page, one for
LOCK_X|LOCK_REC_NOT_GAP mode and one for LOCK_S mode, and then used
their bitmaps to mark all records from the same page.

The situation would look like this:

request lock in LOCK_X|LOCK_REC_NOT_GAP mode on row 1:
// -> creates new lock_t for LOCK_X|LOCK_REC_NOT_GAP mode and sets bit for
// 1
request lock in LOCK_S mode on row 1:
// -> notices that we already have LOCK_X|LOCK_REC_NOT_GAP on the row 1,
// so it upgrades it to X
request lock in LOCK_X|LOCK_REC_NOT_GAP mode on row 2:
// -> creates a new lock_t for LOCK_X|LOCK_REC_NOT_GAP mode (because we
// don't have any after we've upgraded!) and sets bit for 2
request lock in LOCK_S mode on row 2:
// -> notices that we already have LOCK_X|LOCK_REC_NOT_GAP on the row 2,
// so it upgrades it to X
    ...etc...etc..

Each iteration of the loop creates a new lock_t struct, and in the end we
have a lot (one for each record!) of LOCK_X locks, each with single bit
set in the bitmap. Soon we run out of space for lock_t structs.

If we create LOCK_GAP instead of lock upgrading, the above scenario works
like the following:

// -> creates new lock_t for LOCK_X|LOCK_REC_NOT_GAP mode and sets bit for
// 1
request lock in LOCK_S mode on row 1:
// -> notices that we already have LOCK_X|LOCK_REC_NOT_GAP on the row 1,
// so it creates LOCK_S|LOCK_GAP only and sets bit for 1
request lock in LOCK_X|LOCK_REC_NOT_GAP mode on row 2:
// -> reuses the lock_t for LOCK_X|LOCK_REC_NOT_GAP by setting bit for 2
request lock in LOCK_S mode on row 2:
// -> notices that we already have LOCK_X|LOCK_REC_NOT_GAP on the row 2,
// so it reuses LOCK_S|LOCK_GAP setting bit for 2

In the end we have just two locks per page, one for each mode:
LOCK_X|LOCK_REC_NOT_GAP and LOCK_S|LOCK_GAP.
Another benefit of this solution is that it avoids not-entirely
const-correct, (and otherwise looking risky) "upgrading".

The fix was ported from
mysql/mysql-server@bfba840dfa
mysql/mysql-server@75cefdb1f7

Reviewed by: Marko Mäkelä
2023-07-06 15:06:10 +03:00
Marko Mäkelä
2855bc53bc Merge 10.5 into 10.6 2023-07-05 16:40:22 +03:00
Marko Mäkelä
bd7908e6ac MDEV-31568 InnoDB protection against dual processes accessing data insufficient
fil_node_open_file_low(): Always acquire an advisory lock on
the system tablespace. Originally, we already did this in
SysTablespace::open_file(), but SysTablespace::open_or_create()
would release those locks when it is closing the file handles.

This is a 10.5+ specific follow up to
commit 0ee1082bd2 (MDEV-28495).

Thanks to Daniel Black for verifying this bug.
2023-07-05 15:15:04 +03:00
Marko Mäkelä
5b62644e68 MDEV-31621 Remove ibuf_read_merge_pages() call from ibuf_insert_low()
When InnoDB attempts to buffer a change operation of a secondary index
leaf page (to insert, delete-mark or remove a record) and the
change buffer is too large, InnoDB used to trigger a change buffer merge
that could affect any tables. This could lead to huge variance in
system throughput and potentially unpredictable crashes, in case the
change buffer was corrupted and a crash occurred while attempting to
merge changes to a table that is not being accessed by the current
SQL statement.

ibuf_insert_low(): Simply return DB_STRONG_FAIL when the maximum size
of the change buffer is exceeded.

ibuf_contract_after_insert(): Remove.

ibuf_get_merge_page_nos_func(): Remove a constant parameter.
The function ibuf_contract() will be our only caller, during
shutdown with innodb_fast_shutdown=0.
2023-07-05 08:48:37 +03:00
Marko Mäkelä
cb364a78d6 MDEV-31619 dict_stats_persistent_storage_check() may show garbage during --bootstrap
dict_stats_persistent_storage_check(): Do not output errmsg
if opt_bootstrap holds, because the message buffer would likely
be uninitialized.
2023-07-04 15:24:57 +03:00
Marko Mäkelä
f7b8a2c953 MDEV-31607 ER_DUP_KEY in mysql.innodb_table_stats upon RENAME on sequence
ha_innobase::delete_table(): Also on DROP SEQUENCE, do try to drop any
persistent statistics. They should really not be created for
SEQUENCE objects (which internally are 1-row no-rollback tables),
but that is how happened to always work.
2023-07-03 16:47:58 +03:00
Marko Mäkelä
b8088487e4 MDEV-19216 Assertion ...SYS_FOREIGN failed in btr_node_ptr_max_size
btr_node_ptr_max_size(): Handle BINARY(0) and VARBINARY(0)
as special cases, similar to CHAR(0) and VARCHAR(0).
2023-07-03 16:09:18 +03:00
Marko Mäkelä
dc1bd1802a MDEV-31386 InnoDB: Failing assertion: page_type == i_s_page_type[page_type].type_value
i_s_innodb_buffer_page_get_info(): Correct a condition.
After crash recovery, there may be some buffer pool pages in FREED state,
containing garbage (invalid data page contents). Let us ignore such pages
in the INFORMATION_SCHEMA output.

The test innodb.innodb_defragment_fill_factor will be removed, because
the queries that it is invoking on information_schema.innodb_buffer_page
would start to fail. The defragmentation feature was removed in
commit 7ca89af6f8 in MariaDB Server 11.1.

Tested by: Matthias Leich
2023-07-03 14:39:29 +03:00
Marko Mäkelä
3d90143859 MDEV-31559 btr_search_hash_table_validate() does not check if CHECK TABLE is killed
btr_search_hash_table_validate(), btr_search_validate(): Add the
parameter THD for checking if the statement has been killed.
Any non-QUICK CHECK TABLE will validate the entire adaptive hash index
for all InnoDB tables, which may be extremely slow when running
multiple concurrent CHECK TABLE.
2023-06-30 17:07:21 +03:00
Marko Mäkelä
33877cfeae Fix WITH_UBSAN GCC -Wconversion 2023-06-28 17:07:00 +03:00
Vlad Lesin
687fd6bef5 MDEV-30648 btr_estimate_n_rows_in_range() accesses unfixed, unlatched page
The issue is caused by MDEV-30400 fix.

There are two cursors in btr_estimate_n_rows_in_range() - p1 and p2, but
both share the same mtr. Each cursor contains mtr savepoint for the
previously fetched block to release it then the current block is
fetched.

Before MDEV-30400 the block was released with
mtr_t::release_block_at_savepoint(), it just unfixed a block and
released its page patch. In MDEV-30400 it was replaced with
mtr_t::rollback_to_savepoint(), which does the same as the former
mtr_t::release_block_at_savepoint(ulint begin, ulint end) but also
erases the corresponding slots from mtr memo, what invalidates any
stored mtr's memo savepoints, greater or equal to "begin".

The idea of the fix is to get rid of savepoints at all in
btr_estimate_n_rows_in_range() and
btr_estimate_n_rows_in_range_on_level(). As
mtr_t::rollback_to_savepoint() erases elements from mtr_t::m_memo, we
know what element of mtr_t::m_memo can be deleted on the certain case,
so there is no need to store savepoints.

See also the following slides for details:
https://docs.google.com/presentation/d/1RFYBo7EUhM22ab3GOYctv3j_3yC0vHtBY9auObZec8U

Reviewed by: Marko Mäkelä
2023-06-28 11:00:02 +03:00
Thirunarayanan Balathandayuthapani
5f09b53bdb MDEV-31086 MODIFY COLUMN can break FK constraints, and lead to unrestorable dumps
- When foreign_key_check is disabled, allowing to modify the
column which is part of foreign key constraint can lead to
refusal of TRUNCATE TABLE, OPTIMIZE TABLE later. So it make
sense to block the column modify operation when foreign key
is involved irrespective of foreign_key_check variable.

Correct way to modify the charset of the column when fk is involved:

SET foreign_key_checks=OFF;
ALTER TABLE child DROP FOREIGN KEY fk, MODIFY m VARCHAR(200) CHARSET utf8mb4;
ALTER TABLE parent MODIFY m VARCHAR(200) CHARSET utf8mb4;
ALTER TABLE child ADD CONSTRAINT FOREIGN KEY (m) REFERENCES PARENT(m);
SET foreign_key_checks=ON;

fk_check_column_changes(): Remove the FOREIGN_KEY_CHECKS while
checking the column change for foreign key constraint. This
is the partial revert of commit 5f1f2fc0e4
and it changes the behaviour of copy alter algorithm

ha_innobase::prepare_inplace_alter_table(): Find the modified
column and check whether it is part of existing and newly
added foreign key constraint.
2023-06-27 16:58:22 +05:30
Marko Mäkelä
694ce0d08e Merge 10.5 into 10.6 2023-06-27 13:03:32 +03:00
Marko Mäkelä
84dbd0253d MDEV-31487: Recovery or backup failure after innodb_undo_log_truncate=ON
recv_sys_t::parse(): For undo tablespace truncation mini-transactions,
remember the start_lsn instead of the end LSN. This is what we expect
after commit 461402a564 (MDEV-30479).
2023-06-27 09:12:38 +03:00
Marko Mäkelä
493083833b Merge 10.5 into 10.6 2023-06-26 17:11:38 +03:00
Thirunarayanan Balathandayuthapani
bd076d4dff MDEV-31442 page_cleaner thread aborts while releasing the tablespace
- InnoDB shouldn't acquire the tablespace when it is being stopped
or closed
2023-06-16 14:58:48 +05:30
Thirunarayanan Balathandayuthapani
841e905f20 MDEV-31442 page_cleaner thread aborts while releasing the tablespace
After further I/O on a tablespace has been stopped
(for example due to DROP TABLE or an operation that
rebuilds a table), page cleaner thread tries to
flush the pending writes for the tablespace and
releases the tablespace reference even though it was not
acquired.

fil_space_t::flush(): Don't release the tablespace when it is
being stopped and closed

Thanks to Marko Mäkelä for suggesting this patch.
2023-06-09 18:15:33 +05:30
Thirunarayanan Balathandayuthapani
bf0a54df34 MDEV-31416 ASAN errors in dict_v_col_t::detach upon adding key to virtual column
- InnoDB throws ASAN error while adding the index on virtual column
of system versioned table. InnoDB wrongly assumes that virtual
column collation type changes, creates new column with different
character set. This leads to failure while detaching the column
from indexes.
2023-06-08 16:34:45 +05:30
Marko Mäkelä
80585c9d6f Merge 10.5 into 10.6 2023-06-08 10:42:56 +03:00
Marko Mäkelä
c25b496724 MDEV-31382 SET GLOBAL innodb_undo_log_truncate=ON has no effect on logically empty undo logs
innodb_undo_log_truncate_update(): A callback function. If
SET GLOBAL innodb_undo_log_truncate=ON, invoke
srv_wake_purge_thread_if_not_active().

srv_wake_purge_thread_if_not_active(): If innodb_undo_log_truncate=ON,
always wake up the purge subsystem.

srv_do_purge(): If the history is empty, invoke
trx_purge_truncate_history() in order to free undo log pages.

trx_purge_truncate_history(): If head.trx_no==0, consider the
cached undo logs to be free.

trx_purge(): Remove the parameter "bool truncate" and let the
caller invoke trx_purge_truncate_history() directly.

Reviewed by: Vladislav Lesin
2023-06-08 09:18:21 +03:00
Marko Mäkelä
3e40f9a7f3 MDEV-31355 innodb_undo_log_truncate=ON fails to wait for purge of enough transaction history
purge_sys_t::sees(): Wrapper for view.sees().

trx_purge_truncate_history(): Invoke purge_sys.sees() instead of
comparing to head.trx_no, to determine if undo pages can be safely freed.

The test innodb.cursor-restore-locking was adjusted by Vladislav Lesin,
as was the the debug instrumentation in row_purge_del_mark().

Reviewed by: Vladislav Lesin
2023-06-08 09:17:52 +03:00
Oleksandr Byelkin
04f0b955dd Merge branch '10.6' into 10.6.14 2023-06-07 19:59:52 +02:00
Marko Mäkelä
c93754d45e MDEV-31234 related cleanup
trx_purge_free_segment(), trx_purge_truncate_rseg_history():
Replace some unreachable code with debug assertions.
A buffer-fix does prevent pages from being evicted
from the buffer pool; see buf_page_t::can_relocate().

Tested by: Matthias Leich
2023-06-05 18:53:20 +02:00
Sergei Golubchik
a42a6fa99b Merge branch 'bb-10.5-release' into bb-10.6-release 2023-06-05 18:53:02 +02:00
Marko Mäkelä
cf37e44eec MDEV-31350: Hang in innodb.recovery_memory
buf_flush_page_cleaner(): Whenever buf_pool.ran_out(), invoke
buf_pool.get_oldest_modification(0) so that all clean blocks
will be removed from buf_pool.flush_list and buf_flush_LRU_list_batch()
will be able to evict some pages.

This fixes a regression that was likely caused by
commit a55b951e60 (MDEV-26827).
2023-06-03 11:12:32 +02:00
Marko Mäkelä
dd298873da MDEV-31309 Innodb_buffer_pool_read_requests is not updated correctly
srv_export_innodb_status(): Update
export_vars.innodb_buffer_pool_read_requests as it was done
before commit a55b951e60 (MDEV-26827).
If innodb_status_variables[] pointed to a sharded variable, it would
only access the first shard.
2023-06-03 11:12:27 +02:00
Marko Mäkelä
89eb6fa8a7 MDEV-31308 InnoDB monitor trx_rseg_history_len was accidentally disabled by default
innodb_counter_info[]: Revert a change that was accidentally made in
commit 204e7225dc
2023-06-03 11:12:21 +02:00
Marko Mäkelä
883333a74e MDEV-31158: Potential hang with ROW_FORMAT=COMPRESSED tables
btr_cur_need_opposite_intention(): Check also page_zip_available()
so that we will escalate to exclusive index latch when a non-leaf
page may have to be split further due to ROW_FORMAT=COMPRESSED page
overflow.

Tested by: Matthias Leich
2023-06-03 11:12:16 +02:00
Marko Mäkelä
459eb9a686 MDEV-29593 fixup: Avoid a leak if rseg.undo_cached is corrupted
trx_purge_truncate_rseg_history(): Avoid a leak similar to the one
that was fixed in MDEV-31324, in case a supposedly cached undo log
page is not found in the rseg.undo_cached list.
2023-06-03 11:12:11 +02:00
Marko Mäkelä
e89bd39c9b MDEV-31343 Another server hang with innodb_undo_log_truncate=ON
trx_purge_truncate_history(): While waiting for a write-fixed block
to become available, simply wait for an exclusive latch on it.
Also, simplify the iteration: first check for oldest_modification>2
(to ignore clean pages or pages belonging to the temporary tablespace)
and then compare the tablespace identifier.

Before releasing buf_pool.flush_list_mutex we will buffer-fix the block
of interest. In that way, buf_page_t::can_relocate() will not hold on
the block and it must remain in the buffer pool until we have acquired
an exclusive latch on it. If the block is still dirty, we will register
it with the tablespace truncation mini-transaction; else, we will simply
release the latch and buffer-fix and move to the next block.

This also reverts commit c4d7939989
because that fix should no longer be necessary; the wait for an
exclusive block latch should allow buf_pool_t::release_freed_page()
on the same block to proceed.

Tested by: Axel Schwenke, Matthias Leich
2023-06-03 11:12:03 +02:00
Marko Mäkelä
3b4b512d8e MDEV-31234 fixup: Allow innodb_undo_log_truncate=ON after upgrade
trx_purge_truncate_history(): Relax a condition that would prevent
undo log truncation if the undo log tablespaces were "contaminated"
by the bug that commit e0084b9d31 fixed.
That is, trx_purge_truncate_rseg_history() would have invoked
flst_remove() on TRX_RSEG_HISTORY but not reduced TRX_RSEG_HISTORY_SIZE.

To avoid any regression with normal operation, we implement this
fixup during slow shutdown only. The condition on the history list
being empty is necessary: without it, in the test
innodb.undo_truncate_recover there may be much fewer than the
expected 90,000 calls to row_purge() before the truncation.
That is, we would truncate the undo tablespace before actually having
processed all undo log records in it.

To truncate such "contaminated" or "bloated" undo log tablespaces
(when using innodb_undo_tablespaces=2 or more)
you can execute the following SQL:

BEGIN;INSERT mysql.innodb_table_stats VALUES('','',DEFAULT,0,0,0);ROLLBACK;
SET GLOBAL innodb_undo_log_truncate=ON, innodb_fast_shutdown=0;
SHUTDOWN;

The first line creates a dummy InnoDB transaction, to ensure that there
will be some history to be purged during shutdown and that the undo
tablespaces will be truncated.
2023-06-03 10:59:53 +02:00
Marko Mäkelä
48d6a5f61b MDEV-31234 fixup: Free some UNDO pages earlier
trx_purge_truncate_rseg_history(): Add a parameter to specify if
the entire rollback segment is safe to be freed. If not, we may
still be able to invoke trx_undo_truncate_start() and free some pages.
2023-06-03 10:59:47 +02:00
Marko Mäkelä
318012a80a MDEV-31234 InnoDB does not free UNDO after the fix of MDEV-30671
trx_purge_truncate_history(): Only call trx_purge_truncate_rseg_history()
if the rollback segment is safe to process. This will avoid leaking undo
log pages that are not yet ready to be processed. This fixes a regression
that was introduced in
commit 0de3be8cfd (MDEV-30671).

trx_sys_t::any_active_transactions(): Separately count XA PREPARE
transactions.

srv_purge_should_exit(): Terminate slow shutdown if the history size
does not change and XA PREPARE transactions exist in the system.
This will avoid a hang of the test innodb.recovery_shutdown.

Tested by: Matthias Leich
2023-06-03 10:59:42 +02:00
Marko Mäkelä
f569e06e03 MDEV-31385 Change buffer stale entries leads to corruption while reusing page
buf_page_free(): If buffered changes existed for the page, drop them.

Co-developed with Thirunarayanan Balathandayuthapani
2023-06-02 11:06:09 +03:00
Marko Mäkelä
8a86df37ef MDEV-31088 Server freeze due to innodb_change_buffering
A 3-thread deadlock has been frequently observed when using
innodb_change_buffering!=none and innodb_file_per_table=0:

(1) ibuf_merge_or_delete_for_page() holding an exclusive latch on the block
and waiting for an exclusive tablespace latch in fseg_page_is_allocated()
(2) btr_free_but_not_root() in fseg_free_step() waiting for an
exclusive tablespace latch
(3) fsp_alloc_free_page() holding the exclusive tablespace latch and waiting
for a latch on the block, which it is reallocating for something else

While this was reproduced using innodb_file_per_table=0, this hang should
be theoretically possible in .ibd files as well, when the recovery or
cleanup of a failed DROP INDEX or ADD INDEX is executing concurrently
with something that involves page allocation.

ibuf_merge_or_delete_for_page(): Avoid invoking fseg_page_is_allocated()
when block==nullptr. The call was redundant in this case, and it could
cause deadlocks due to latching order violation.

ibuf_read_merge_pages(): Acquire an exclusive tablespace latch
before invoking buf_page_get_gen(), which may cause
fseg_page_is_allocated() to be invoked in ibuf_merge_or_delete_for_page().

Note: This will not fix all latching order violations in this area!
Deadlocks involving ibuf_merge_or_delete_for_page(block!=nullptr) are
still possible if the caller is not acquiring an exclusive tablespace latch
upfront. This would be the case in any read operation that involves a
change buffer merge, such as SELECT, CHECK TABLE, or any DML operation that
cannot be buffered in the change buffer.
2023-06-02 10:44:34 +03:00
Marko Mäkelä
548a41c5ec Merge 10.5 into 10.6 2023-06-01 12:28:40 +03:00
Marko Mäkelä
bb9da13baf MDEV-31373 innodb_undo_log_truncate=ON recovery results in a corrupted undo log
recv_sys_t::apply(): When applying an undo log truncation operation,
invoke os_file_truncate() on space->recv_size, which must not be
less than the original truncated file size.

Alternatively, as pointed out by Thirunarayanan Balathandayuthapani,
we could assign space->size = t.pages, so that
fil_system_t::extend_to_recv_size() would extend the file back
to space->recv_size.
2023-06-01 12:11:18 +03:00
Marko Mäkelä
3aea77edeb MDEV-31347 fil_ibd_create() may hijack the file handle of an old file
fil_space_t::add(): If a file handle was passed, invoke
fil_node_t::find_metadata() before releasing fil_system.mutex.
The call was moved from fil_ibd_create().

This is a 10.5 version of commit e3b06156c6
from 10.6.
2023-06-01 09:41:17 +03:00
Thirunarayanan Balathandayuthapani
5919f7b675 MDEV-31264 Purge trying to access freed secondary index page
- InnoDB purge tries to access aborted secondary index and access
the freed secondary index root page.
2023-05-31 19:07:41 +05:30
Marko Mäkelä
e3b06156c6 MDEV-31347 fil_ibd_create() may hijack the file handle of an old file
fil_ibd_create(): Hold fil_system.mutex until fil_node_t::find_metadata()
has completed, so that node->handle cannot be closed by a concurrent
thread. This race condition was introduced
in commit 10dd290b4b (MDEV-17380).

Tested by: Matthias Leich
2023-05-31 15:25:07 +03:00
Marko Mäkelä
eb20e7c900 MDEV-31353 InnoDB recovery hangs after reporting corruption
recv_recover_page(): Remove some code which was added in
commit 0b47c126e3 with
no good reason and which would cause a hang after a corrupted
page was reported during crash recovery.

Tested by: Matthias Leich
2023-05-31 15:20:54 +03:00
Marko Mäkelä
a6c0a27696 MDEV-31362 recv_sys_t::apply(bool): Assertion `!last_batch || recovered_lsn == scanned_lsn' failed
recv_sys_t::apply(): Remove a bogus debug assertion that had been
added in commit f2c17cc9d9 (MDEV-29911).

It is perfectly normal that when the server was killed in the middle of
writing multiple redo log blocks, the recovery would end such that
recv_sys.scanned_lsn will point to the end of the last complete 512-byte
log block, but recv_sys.recovered_lsn will be less than that.

Also, correct the function comment of recv_sys_t::parse().
2023-05-30 17:21:49 +03:00
Marko Mäkelä
ce547cfc05 MDEV-31350: Hang in innodb.recovery_memory
buf_flush_page_cleaner(): Whenever buf_pool.ran_out(), invoke
buf_pool.get_oldest_modification(0) so that all clean blocks
will be removed from buf_pool.flush_list and buf_flush_LRU_list_batch()
will be able to evict some pages.

This fixes a regression that was likely caused by
commit a55b951e60 (MDEV-26827).
2023-05-26 16:40:07 +03:00
Marko Mäkelä
7b72fc0a57 MDEV-22739 !cursor->index->is_committed() in row0ins.cc
row_ins_sec_index_entry_by_modify(): When noticing a corrupted secondary
index on which CREATE INDEX is not in progress, return DB_CORRUPTION
instead of intentionally crashing the server.

Tested by: Matthias Leich
2023-05-26 16:40:02 +03:00
Marko Mäkelä
e38c075aa0 MDEV-31346 trx_purge_add_undo_to_history() is not optimal
trx_undo_set_state_at_finish(): Merge to its only caller,
trx_purge_add_undo_to_history().

trx_purge_add_undo_to_history(): Evaluate the condition related to
TRX_UNDO_STATE only once.

Tested by: Matthias Leich
2023-05-26 16:39:46 +03:00
Marko Mäkelä
db8765500e MDEV-31343 Another server hang with innodb_undo_log_truncate=ON
trx_purge_truncate_history(): While waiting for a write-fixed block
to become available, simply wait for an exclusive latch on it.
Also, simplify the iteration: first check for oldest_modification>2
(to ignore clean pages or pages belonging to the temporary tablespace)
and then compare the tablespace identifier.

Before releasing buf_pool.flush_list_mutex we will buffer-fix the block
of interest. In that way, buf_page_t::can_relocate() will not hold on
the block and it must remain in the buffer pool until we have acquired
an exclusive latch on it. If the block is still dirty, we will register
it with the tablespace truncation mini-transaction; else, we will simply
release the latch and buffer-fix and move to the next block.

This also reverts commit c4d7939989
because that fix should no longer be necessary; the wait for an
exclusive block latch should allow buf_pool_t::release_freed_page()
on the same block to proceed.

Tested by: Axel Schwenke, Matthias Leich
2023-05-26 16:16:10 +03:00
Alexander Barkov
9edb1a5ce3 MDEV-30483 After upgrade to 10.6 from Mysql 5.7 seeing "InnoDB: Column last_update in table mysql.innodb_table_stats is BINARY(4) NOT NULL but should be INT UNSIGNED NOT NULL"
Problem:

Field_timestampf implementations differ in MySQL and MariaDB:
- MariaDB sets the UNSIGNED_FLAG in Field::flags
- MySQL does not

The reference table structures
(defined in table_stats_schema and index_stats_schema)
expected the last_update column to have the DATA_UNSIGNED flag,
because MariaDB's Field_timestampf has the UNSIGNED_FLAG.
It worked fine on pure MariaDB installations.

However, if a MariaDB server starts over a MySQL-5.7 data directory during
a migration, the last_update column does not have DATA_UNSIGNED flag,
because MySQL's Field_timestampf does not have the UNSIGNED_FLAG.

This made InnoDB (after the migration from MySQL) complain into the server
error log about the unexpected data type.

The actual fix is done in storage/innobase/dict/dict0stats.cc:
It removes DATA_UNSIGNED from the prtype_mask member of the reference columns,
so now it does not require the underlying columns to have this flag.

The rest of the fix is needed for MTR tests.
The new data type plugin TYPE_MYSQL_TIMESTAMP implements a slightly modified
version of Field_timestampf, which removes the unsigned flag, so
it works like MySQL's Field_timestampf.

The MTR test ALTERs the data type of the columns
table_stats_schema.last_update and index_stats_schema.last_update
from TIMESTAMP to TYPE_MYSQL_TIMESTAMP, then makes InnoDB
verify the structure of the two statistics tables by creating
and populating an InnoDB table t1.

Without the fix made storage/innobase/dict/dict0stats.cc,
MTR complains about unexpected warnings in the server error log:

[ERROR] InnoDB: Column last_update in table mysql.innodb_table_stats is ...
[ERROR] InnoDB: Column last_update in table mysql.innodb_index_stats is ...

With the fix made storage/innobase/dict/dict0stats.cc these warnings
go away.
2023-05-25 05:25:39 +04:00
Thirunarayanan Balathandayuthapani
7737f15f87 MDEV-31333 fsp_free_page() fails to move the extent from
FSP_FREE_FRAG to FSP_FREE list

- This issue was caused by commit 0b47c126e3.
In fsp_free_page(), InnoDB should set XDES_FREE_BIT of the page before
moving the extent from FSP_FREE_FRAG to FSP_FREE list.
2023-05-24 15:15:29 +05:30
Marko Mäkelä
b220bb756b Merge bb-10.6-release into 10.6 2023-05-24 08:37:19 +03:00
Marko Mäkelä
98de15aba1 Merge bb-10.5-release into bb-10.6-release 2023-05-24 08:36:30 +03:00
Marko Mäkelä
383105dae1 Merge bb-10.5-release into 10.5 2023-05-24 08:28:20 +03:00
Marko Mäkelä
c5cf94b2dc MDEV-31234 fixup: Free some UNDO pages earlier
trx_purge_truncate_rseg_history(): Add a parameter to specify if
the entire rollback segment is safe to be freed. If not, we may
still be able to invoke trx_undo_truncate_start() and free some pages.
2023-05-24 08:25:26 +03:00
Marko Mäkelä
270eeeb523 Merge 10.5 into 10.6 2023-05-23 12:25:39 +03:00
Marko Mäkelä
9c35f9c9c1 MDEV-31234 fixup: Allow innodb_undo_log_truncate=ON after upgrade
trx_purge_truncate_history(): Relax a condition that would prevent
undo log truncation if the undo log tablespaces were "contaminated"
by the bug that commit e0084b9d31 fixed.
That is, trx_purge_truncate_rseg_history() would have invoked
flst_remove() on TRX_RSEG_HISTORY but not reduced TRX_RSEG_HISTORY_SIZE.

To avoid any regression with normal operation, we implement this
fixup during slow shutdown only. The condition on the history list
being empty is necessary: without it, in the test
innodb.undo_truncate_recover there may be much fewer than the
expected 90,000 calls to row_purge() before the truncation.
That is, we would truncate the undo tablespace before actually having
processed all undo log records in it.

To truncate such "contaminated" or "bloated" undo log tablespaces
(when using innodb_undo_tablespaces=2 or more)
you can execute the following SQL:

BEGIN;INSERT mysql.innodb_table_stats VALUES('','',DEFAULT,0,0,0);ROLLBACK;
SET GLOBAL innodb_undo_log_truncate=ON, innodb_fast_shutdown=0;
SHUTDOWN;

The first line creates a dummy InnoDB transaction, to ensure that there
will be some history to be purged during shutdown and that the undo
tablespaces will be truncated.
2023-05-23 12:20:27 +03:00
Marko Mäkelä
a5ce335ac9 MDEV-29593 fixup: Avoid a leak if rseg.undo_cached is corrupted
trx_purge_truncate_rseg_history(): Avoid a leak similar to the one
that was fixed in MDEV-31324, in case a supposedly cached undo log
page is not found in the rseg.undo_cached list.
2023-05-22 17:10:25 +03:00
Marko Mäkelä
eb2e074494 Merge 10.5 into 10.6 2023-05-22 08:38:21 +03:00
Teemu Ollakka
f307160218 MDEV-29293 MariaDB stuck on starting commit state
This commit contains a merge from 10.5-MDEV-29293-squash
into 10.6.

Although the bug MDEV-29293 was not reproducible with 10.6,
the fix contains several improvements for wsrep KILL query and
BF abort handling, and addresses the following issues:

* MDEV-30307 KILL command issued inside a transaction is
  problematic for galera replication:
  This commit will remove KILL TOI replication, so Galera side
  transaction context is not lost during KILL.
* MDEV-21075 KILL QUERY maintains nodes data consistency but
  breaks GTID sequence: This is fixed as well as KILL does not
  use TOI, and thus does not change GTID state.
* MDEV-30372 Assertion in wsrep-lib state: This was caused by
  BF abort or KILL when local transaction was in the middle
  of group commit. This commit disables THD::killed handling
  during commit, so the problem is avoided.
* MDEV-30963 Assertion failure !lock.was_chosen_as_deadlock_victim
  in trx0trx.h:1065: The assertion happened when the victim was
  BF aborted via MDL while it was committing. This commit changes
  MDL BF aborts so that transactions which are committing cannot
  be BF aborted via MDL. The RQG grammar attached in the issue
  could not reproduce the crash anymore.

Original commit message from 10.5 fix:

    MDEV-29293 MariaDB stuck on starting commit state

    The problem seems to be a deadlock between KILL command execution
    and BF abort issued by an applier, where:
    * KILL has locked victim's LOCK_thd_kill and LOCK_thd_data.
    * Applier has innodb side global lock mutex and victim trx mutex.
    * KILL is calling innobase_kill_query, and is blocked by innodb
      global lock mutex.
    * Applier is in wsrep_innobase_kill_one_trx and is blocked by
      victim's LOCK_thd_kill.

    The fix in this commit removes the TOI replication of KILL command
    and makes KILL execution less intrusive operation. Aborting the
    victim happens now by using awake_no_mutex() and ha_abort_transaction().
    If the KILL happens when the transaction is committing, the
    KILL operation is postponed to happen after the statement
    has completed in order to avoid KILL to interrupt commit
    processing.

    Notable changes in this commit:
    * wsrep client connections's error state may remain sticky after
      client connection is closed. This error message will then pop
      up for the next client session issuing first SQL statement.
      This problem raised with test galera.galera_bf_kill.
      The fix is to reset wsrep client error state, before a THD is
      reused for next connetion.
    * Release THD locks in wsrep_abort_transaction when locking
      innodb mutexes. This guarantees same locking order as with applier
      BF aborting.
    * BF abort from MDL was changed to do BF abort on server/wsrep-lib
      side first, and only then do the BF abort on InnoDB side. This
      removes the need to call back from InnoDB for BF aborts which originate
      from MDL and simplifies the locking.
    * Removed wsrep_thd_set_wsrep_aborter() from service_wsrep.h.
      The manipulation of the wsrep_aborter can be done solely on
      server side. Moreover, it is now debug only variable and
      could be excluded from optimized builds.
    * Remove LOCK_thd_kill from wsrep_thd_LOCK/UNLOCK to allow more
      fine grained locking for SR BF abort which may require locking
      of victim LOCK_thd_kill. Added explicit call for
      wsrep_thd_kill_LOCK/UNLOCK where appropriate.
    * Wsrep-lib was updated to version which allows external
      locking for BF abort calls.

    Changes to MTR tests:
    * Disable galera_bf_abort_group_commit. This test is going to
      be removed (MDEV-30855).
    * Make galera_var_retry_autocommit result more readable by echoing
      cases and expectations into result. Only one expected result for
      reap to verify that server returns expected status for query.
    * Record galera_gcache_recover_manytrx as result file was incomplete.
      Trivial change.
    * Make galera_create_table_as_select more deterministic:
      Wait until CTAS execution has reached MDL wait for multi-master
      conflict case. Expected error from multi-master conflict is
      ER_QUERY_INTERRUPTED. This is because CTAS does not yet have open
      wsrep transaction when it is waiting for MDL, query gets interrupted
      instead of BF aborted. This should be addressed in separate task.
    * A new test galera_bf_abort_registering to check that registering trx gets
      BF aborted through MDL.
    * A new test galera_kill_group_commit to verify correct behavior
      when KILL is executed while the transaction is committing.

    Co-authored-by: Seppo Jaakola <seppo.jaakola@iki.fi>
    Co-authored-by: Jan Lindström <jan.lindstrom@galeracluster.com>

Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
2023-05-22 00:42:05 +02:00
Teemu Ollakka
3f59bbeeae MDEV-29293 MariaDB stuck on starting commit state
The problem seems to be a deadlock between KILL command execution
and BF abort issued by an applier, where:
* KILL has locked victim's LOCK_thd_kill and LOCK_thd_data.
* Applier has innodb side global lock mutex and victim trx mutex.
* KILL is calling innobase_kill_query, and is blocked by innodb
  global lock mutex.
* Applier is in wsrep_innobase_kill_one_trx and is blocked by
  victim's LOCK_thd_kill.

The fix in this commit removes the TOI replication of KILL command
and makes KILL execution less intrusive operation. Aborting the
victim happens now by using awake_no_mutex() and ha_abort_transaction().
If the KILL happens when the transaction is committing, the
KILL operation is postponed to happen after the statement
has completed in order to avoid KILL to interrupt commit
processing.

Notable changes in this commit:
* wsrep client connections's error state may remain sticky after
  client connection is closed. This error message will then pop
  up for the next client session issuing first SQL statement.
  This problem raised with test galera.galera_bf_kill.
  The fix is to reset wsrep client error state, before a THD is
  reused for next connetion.
* Release THD locks in wsrep_abort_transaction when locking
  innodb mutexes. This guarantees same locking order as with applier
  BF aborting.
* BF abort from MDL was changed to do BF abort on server/wsrep-lib
  side first, and only then do the BF abort on InnoDB side. This
  removes the need to call back from InnoDB for BF aborts which originate
  from MDL and simplifies the locking.
* Removed wsrep_thd_set_wsrep_aborter() from service_wsrep.h.
  The manipulation of the wsrep_aborter can be done solely on
  server side. Moreover, it is now debug only variable and
  could be excluded from optimized builds.
* Remove LOCK_thd_kill from wsrep_thd_LOCK/UNLOCK to allow more
  fine grained locking for SR BF abort which may require locking
  of victim LOCK_thd_kill. Added explicit call for
  wsrep_thd_kill_LOCK/UNLOCK where appropriate.
* Wsrep-lib was updated to version which allows external
  locking for BF abort calls.

Changes to MTR tests:
* Disable galera_bf_abort_group_commit. This test is going to
  be removed (MDEV-30855).
* Record galera_gcache_recover_manytrx as result file was incomplete.
  Trivial change.
* Make galera_create_table_as_select more deterministic:
  Wait until CTAS execution has reached MDL wait for multi-master
  conflict case. Expected error from multi-master conflict is
  ER_QUERY_INTERRUPTED. This is because CTAS does not yet have open
  wsrep transaction when it is waiting for MDL, query gets interrupted
  instead of BF aborted. This should be addressed in separate task.
* A new test galera_kill_group_commit to verify correct behavior
  when KILL is executed while the transaction is committing.

Co-authored-by: Seppo Jaakola <seppo.jaakola@iki.fi>
Co-authored-by: Jan Lindström <jan.lindstrom@galeracluster.com>
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
2023-05-22 00:39:43 +02:00
Teemu Ollakka
6966d7fe4b MDEV-29293 MariaDB stuck on starting commit state
This is a backport from 10.5.

The problem seems to be a deadlock between KILL command execution
and BF abort issued by an applier, where:
* KILL has locked victim's LOCK_thd_kill and LOCK_thd_data.
* Applier has innodb side global lock mutex and victim trx mutex.
* KILL is calling innobase_kill_query, and is blocked by innodb
  global lock mutex.
* Applier is in wsrep_innobase_kill_one_trx and is blocked by
  victim's LOCK_thd_kill.

The fix in this commit removes the TOI replication of KILL command
and makes KILL execution less intrusive operation. Aborting the
victim happens now by using awake_no_mutex() and ha_abort_transaction().
If the KILL happens when the transaction is committing, the
KILL operation is postponed to happen after the statement
has completed in order to avoid KILL to interrupt commit
processing.

Notable changes in this commit:
* wsrep client connections's error state may remain sticky after
  client connection is closed. This error message will then pop
  up for the next client session issuing first SQL statement.
  This problem raised with test galera.galera_bf_kill.
  The fix is to reset wsrep client error state, before a THD is
  reused for next connetion.
* Release THD locks in wsrep_abort_transaction when locking
  innodb mutexes. This guarantees same locking order as with applier
  BF aborting.
* BF abort from MDL was changed to do BF abort on server/wsrep-lib
  side first, and only then do the BF abort on InnoDB side. This
  removes the need to call back from InnoDB for BF aborts which originate
  from MDL and simplifies the locking.
* Removed wsrep_thd_set_wsrep_aborter() from service_wsrep.h.
  The manipulation of the wsrep_aborter can be done solely on
  server side. Moreover, it is now debug only variable and
  could be excluded from optimized builds.
* Remove LOCK_thd_kill from wsrep_thd_LOCK/UNLOCK to allow more
  fine grained locking for SR BF abort which may require locking
  of victim LOCK_thd_kill. Added explicit call for
  wsrep_thd_kill_LOCK/UNLOCK where appropriate.
* Wsrep-lib was updated to version which allows external
  locking for BF abort calls.

Changes to MTR tests:
* Disable galera_bf_abort_group_commit. This test is going to
  be removed (MDEV-30855).
* Record galera_gcache_recover_manytrx as result file was incomplete.
  Trivial change.
* Make galera_create_table_as_select more deterministic:
  Wait until CTAS execution has reached MDL wait for multi-master
  conflict case. Expected error from multi-master conflict is
  ER_QUERY_INTERRUPTED. This is because CTAS does not yet have open
  wsrep transaction when it is waiting for MDL, query gets interrupted
  instead of BF aborted. This should be addressed in separate task.
* A new test galera_kill_group_commit to verify correct behavior
  when KILL is executed while the transaction is committing.

Co-authored-by: Seppo Jaakola <seppo.jaakola@iki.fi>
Co-authored-by: Jan Lindström <jan.lindstrom@galeracluster.com>
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
2023-05-22 00:33:37 +02:00
Vlad Lesin
b54e7b0cea MDEV-31185 rw_trx_hash_t::find() unpins pins too early
rw_trx_hash_t::find() acquires element->mutex, then unpins pins, used for
lf_hash element search. After that the "element" can be deallocated and
reused by some other thread.

If we take a look rw_trx_hash_t::insert()->lf_hash_insert()->lf_alloc_new()
calls, we will not find any element->mutex acquisition, as it was not
initialized yet before it's allocation. rw_trx_hash_t::insert() can reuse
the chunk, unpinned in rw_trx_hash_t::find().

The scenario is the following:

1. Thread 1 have just executed lf_hash_search() in
rw_trx_hash_t::find(), but have not acquired element->mutex yet.
2. Thread 2 have removed the element from hash table with
rw_trx_hash_t::erase() call.
3. Thread 1 acquired element->mutex and unpinned pin 2 pin with
lf_hash_search_unpin(pins) call.
4. Some thread purged memory of the element.
5. Thread 3 reused the memory for the element, filled element->id,
element->trx.
6. Thread 1 crashes with failed "DBUG_ASSERT(trx_id == trx->id)"
assertion.

Note that trx_t objects are also reused, see the code around trx_pools
for details.

The fix is to invoke "lf_hash_search_unpin(pins);" after element->trx is
stored in local variable in rw_trx_hash_t::find().

Reviewed by: Nikita Malyavin, Marko Mäkelä.
2023-05-19 15:50:20 +03:00
Marko Mäkelä
d2420669bd MDEV-31309 Innodb_buffer_pool_read_requests is not updated correctly
srv_export_innodb_status(): Update
export_vars.innodb_buffer_pool_read_requests as it was done
before commit a55b951e60 (MDEV-26827).
If innodb_status_variables[] pointed to a sharded variable, it would
only access the first shard.
2023-05-19 15:38:48 +03:00
Marko Mäkelä
df524dc06f MDEV-31308 InnoDB monitor trx_rseg_history_len was accidentally disabled by default
innodb_counter_info[]: Revert a change that was accidentally made in
commit 204e7225dc
2023-05-19 15:29:26 +03:00
Marko Mäkelä
f2c17cc9d9 MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress
This is a 10.6 port of commit 2f9e264781
from MariaDB Server 10.9 that is missing some optimization due to a
more complex redo log format and recovery logic
(which was simplified in commit 685d958e38).

The progress reporting of InnoDB crash recovery was rather intermittent.
Nothing was reported during the single-threaded log record parsing, which
could consume minutes when parsing a large log. During log application,
there only was progress reporting in background threads that would be
invoked on data page read completion.

The progress reporting here will be detailed like this:

InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
InnoDB: Read redo log up to LSN=1963895808
InnoDB: Multi-batch recovery needed at LSN 2534560930
InnoDB: Read redo log up to LSN=3312233472
InnoDB: Read redo log up to LSN=1599646720
InnoDB: Read redo log up to LSN=2160831488
InnoDB: To recover: LSN 2806789376/2806819840; 195082 pages
InnoDB: To recover: LSN 2806789376/2806819840; 63507 pages
InnoDB: Read redo log up to LSN=3195776000
InnoDB: Read redo log up to LSN=3687099392
InnoDB: Read redo log up to LSN=4165315584
InnoDB: To recover: LSN 4374395699/4374440960; 241454 pages
InnoDB: To recover: LSN 4374395699/4374440960; 123701 pages
InnoDB: Read redo log up to LSN=4508724224
InnoDB: Read redo log up to LSN=5094550528
InnoDB: To recover: 205230 pages

The previous messages "Starting a batch to recover" or
"Starting a final batch to recover" will be replaced by
"To recover: ... pages" messages.

If a batch lasts longer than 15 seconds, then there will be
progress reports every 15 seconds, showing the number of remaining pages.
For the non-final batch, the "To recover:" message includes two end LSN:
that of the batch, and of the recovered log. This is the primary measure
of progress. The batch will end once the number of pages to recover
reaches 0.

If recovery is possible in a single batch, the output will look like this,
with a shorter "To recover:" message that counts only the remaining pages:

InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
InnoDB: Read redo log up to LSN=1984539648
InnoDB: Read redo log up to LSN=2710875136
InnoDB: Read redo log up to LSN=3358895104
InnoDB: Read redo log up to LSN=3965299712
InnoDB: Read redo log up to LSN=4557417472
InnoDB: Read redo log up to LSN=5219527680
InnoDB: To recover: 450915 pages

We will also speed up recovery by improving the memory management and
implementing multi-threaded recovery of data pages that will not need
to be read into the buffer pool ("fake read"). Log application in the
"fake read" threads will be protected by an atomic being_recovered field
and exclusive buf_page_t::lock.

Recovery will reserve for data pages two thirds of the buffer pool,
or 256 pages, whichever is smaller. Previously, we could only use at most
one third of the buffer pool for buffered log records. This would typically
mean that with large buffer pools, recovery unnecessary consisted of
multiple batches.

If recovery runs out of memory, it will "roll back" or "rewind" the current
mini-transaction. The recv_sys.recovered_lsn and recv_sys.pages
will correspond to the "out of memory LSN", at the end of the previous
complete mini-transaction.

If recovery runs out of memory while executing the final recovery batch,
we can simply invoke recv_sys.apply(false) to make room, and resume
parsing.

If recovery runs out of memory before the final batch, we will
scan the redo log to the end and check for any missing or inconsistent
files. In this version of the patch, we will throw away any previously
buffered recv_sys.pages and rescan the log from the checkpoint onwards.

recv_sys_t::pages_it: A cached iterator to recv_sys.pages.

recv_sys_t::is_memory_exhausted(): Remove. We will have out-of-memory
handling deep inside recv_sys_t::parse().

recv_sys_t::rewind(), page_recv_t::recs_t::rewind():
Remove all log starting with a specific LSN.

IORequest::write_complete(), IORequest::read_complete():
Replaces fil_aio_callback().

read_io_callback(), write_io_callback(): Replaces io_callback().

IORequest::fake_read_complete(), fake_io_callback(), os_fake_read():
Process a "fake read" request for concurrent recovery.

recv_sys_t::apply_batch(): Choose a number of successive pages
for a recovery batch.

recv_sys_t::erase(recv_sys_t::map::iterator): Remove log records for a
page whose recovery is not in progress. Log application threads
will not invoke this; they will only set being_recovered=-1 to indicate
that the entry is no longer needed.

recv_sys_t::garbage_collect(): Remove all being_recovered=-1 entries.

recv_sys_t::wait_for_pool(): Wait for some space to become available
in the buffer pool.

mlog_init_t::mark_ibuf_exist(): Avoid calls to
recv_sys::recover_low() via ibuf_page_exists() and buf_page_get_low().
Such calls would lead to double locking of recv_sys.mutex, which
depending on implementation could cause a deadlock. We will use
lower-level calls to look up index pages.

buf_LRU_block_remove_hashed(): Disable consistency checks for freed
ROW_FORMAT=COMPRESSED pages. Their contents could be uninitialized garbage.
This fixes an occasional failure of the test
innodb.innodb_bulk_create_index_debug.

Tested by: Matthias Leich
2023-05-19 15:20:07 +03:00
Vlad Lesin
5422784792 MDEV-31256 fil_node_open_file() releases fil_system.mutex allowing other thread to open its file node
There is room between mutex_exit(&fil_system.mutex) and
mutex_enter(&fil_system.mutex) calls in fil_node_open_file(). During this
room another thread can open the node, and ut_ad(!node->is_open())
assertion in fil_node_open_file_low() can fail.

The fix is not to open node if it was already opened by another thread.
2023-05-19 14:52:47 +03:00
Marko Mäkelä
347e22fbf8 Merge bb-10.6-release into 10.6 2023-05-19 14:23:53 +03:00
Marko Mäkelä
06d555a41a Merge bb-10.5-release into 10.5 2023-05-19 14:23:04 +03:00
Marko Mäkelä
e5933b99d5 MDEV-31234 related cleanup
trx_purge_free_segment(), trx_purge_truncate_rseg_history():
Replace some unreachable code with debug assertions.
A buffer-fix does prevent pages from being evicted
from the buffer pool; see buf_page_t::can_relocate().

Tested by: Matthias Leich
2023-05-19 12:25:30 +03:00
Marko Mäkelä
37492960f3 Merge 10.5 into 10.6 2023-05-19 12:24:58 +03:00
Marko Mäkelä
e0084b9d31 MDEV-31234 InnoDB does not free UNDO after the fix of MDEV-30671
trx_purge_truncate_history(): Only call trx_purge_truncate_rseg_history()
if the rollback segment is safe to process. This will avoid leaking undo
log pages that are not yet ready to be processed. This fixes a regression
that was introduced in
commit 0de3be8cfd (MDEV-30671).

trx_sys_t::any_active_transactions(): Separately count XA PREPARE
transactions.

srv_purge_should_exit(): Terminate slow shutdown if the history size
does not change and XA PREPARE transactions exist in the system.
This will avoid a hang of the test innodb.recovery_shutdown.

Tested by: Matthias Leich
2023-05-19 12:19:26 +03:00
Marko Mäkelä
a3e5b5c4db Merge 10.5 into 10.6 2023-05-15 09:02:32 +03:00
Marko Mäkelä
c9eff1a144 MDEV-31254 InnoDB: Trying to read doublewrite buffer page
buf_read_page_low(): Remove an error message and a debug assertion
that can be triggered when using innodb_page_size=4k and
innodb_file_per_table=0. In that case, buf_read_ahead_linear()
may be invoked on page 255, which is one less than the first
page of the doublewrite buffer (256).
2023-05-12 15:04:50 +03:00
Marko Mäkelä
477285c8ea MDEV-31253 Freed data pages are not always being scrubbed
fil_space_t::flush_freed(): Renamed from buf_flush_freed_pages();
this is a backport of aa45850687 from 10.6.
Invoke log_write_up_to() on last_freed_lsn, instead of avoiding
the operation when the log has not yet been written.
A more costly alternative would be that log_checkpoint() would invoke
this function on every affected tablespace.
2023-05-12 14:57:14 +03:00
Marko Mäkelä
c271057288 Merge 10.5 into 10.6 2023-05-11 13:27:01 +03:00
Marko Mäkelä
279d0120f5 MDEV-29967 innodb_read_ahead_threshold (linear read-ahead) does not work
buf_read_ahead_linear(): Correct some calculations that were broken
in commit b1ab211dee (MDEV-15053).

Thanks to Daniel Black for providing a test case and initial debugging.

Tested by: Matthias Leich
2023-05-11 13:21:57 +03:00
Marko Mäkelä
7124911a2c MDEV-31158: Potential hang with ROW_FORMAT=COMPRESSED tables
btr_cur_need_opposite_intention(): Check also page_zip_available()
so that we will escalate to exclusive index latch when a non-leaf
page may have to be split further due to ROW_FORMAT=COMPRESSED page
overflow.

Tested by: Matthias Leich
2023-05-11 08:43:00 +03:00
Marko Mäkelä
4a668c1892 MDEV-29401 InnoDB history list length increased in 10.6 compared to 10.5
The InnoDB buffer pool and locking were heavily refactored in
MariaDB Server 10.6. Among other things, dict_sys.mutex was removed,
and the contended lock_sys.mutex was replaced with a combination of
lock_sys.latch and distributed latches in hash tables. Also, a
default value was changed to innodb_flush_method=O_DIRECT to improve
performance in write-heavy workloads.

One thing where an adjustment was missing is around the parameters
innodb_max_purge_lag (number of committed transactions waiting to
be purged), and innodb_max_purge_lag_delay
(maximum number of microseconds to delay a DML operation).

purge_coordinator_state::do_purge(): Pass the history_size to trx_purge()
and reset srv_dml_needed_delay if the history is empty.
Keep executing the loop non-stop as long as srv_dml_needed_delay is set.

trx_purge_dml_delay(): Made part of trx_purge().
Set srv_dml_needed_delay=0 when nothing can be purged (!n_pages_handled).

row_mysql_delay_if_needed(): Mimic the logic of
innodb_max_purge_lag_wait_update().

Reviewed by: Thirunarayanan Balathandayuthapani
2023-04-27 17:11:32 +03:00
Marko Mäkelä
5740638c4c MDEV-31132 Deadlock between DDL and purge of InnoDB history
log_free_check(): Assert that the caller must not hold
exclusive lock_sys.latch. This was the case for calls from
ibuf_delete_for_discarded_space(). This caused a deadlock with
another thread that would be holding a latch on a dirty page
that would need to be written so that the checkpoint would advance
and log_free_check() could return. That other thread was waiting
for a shared lock_sys.latch.

fil_delete_tablespace(): Do not invoke ibuf_delete_for_discarded_space()
because in DDL operations, we will be holding exclusive lock_sys.latch.

trx_t::commit(std::vector<pfs_os_file_t>&), innodb_drop_database(),
row_purge_remove_clust_if_poss_low(), row_undo_ins_remove_clust_rec(),
row_discard_tablespace_for_mysql():
Invoke ibuf_delete_for_discarded_space() on the deleted tablespaces after
releasing all latches.
2023-04-26 12:08:59 +03:00
Marko Mäkelä
d4265fbde5 MDEV-26055: Correct the formula for adaptive flushing
page_cleaner_flush_pages_recommendation(): If dirty_pct is
between innodb_max_dirty_pages_pct_lwm
and innodb_max_dirty_pages_pct,
scale the effort relative to how close we are to
innodb_max_dirty_pages_pct.

The previous formula was missing a multiplication by 100.

Tested by: Axel Schwenke
2023-04-26 11:53:42 +03:00
Marko Mäkelä
c22ab93f8a MDEV-26827 fixup: Prevent a hang in LRU eviction
buf_pool_t::page_cleaner_wakeup(): If for_LRU=true, wake up the page
cleaner immediately, also when it is in a timed wait. This avoids an
unnecessary delay of up to 1 second.
2023-04-25 15:03:38 +03:00
Marko Mäkelä
818d5e4814 Merge 10.5 into 10.6 2023-04-25 13:10:33 +03:00
Marko Mäkelä
50f3b7d164 MDEV-31124 Innodb_data_written miscounts doublewrites
When commit a5a2ef079c
implemented asynchronous doublewrite, the writes via
the doublewrite buffer started to be counted incorrectly,
without multiplying them by innodb_page_size.

srv_export_innodb_status(): Correctly count the
Innodb_data_written.

buf_dblwr_t: Remove submitted(), because it is close to written()
and only Innodb_data_written was interested in it. According to
its name, it should count completed and not submitted writes.

Tested by: Axel Schwenke
2023-04-25 12:17:06 +03:00
Oleksandr Byelkin
1d74927c58 Merge branch '10.4' into 10.5 2023-04-24 12:43:47 +02:00
Marko Mäkelä
0976afec88 MDEV-31114 Assertion !...is_waiting() failed in os_aio_wait_until_no_pending_writes()
os_aio_wait_until_no_pending_reads(), os_aio_wait_until_pending_writes():
Add a Boolean parameter to indicate whether the wait should be declared
in the thread pool.

buf_flush_wait(): The callers have already declared a wait, so let us
avoid doing that again, just call os_aio_wait_until_pending_writes(false).

buf_flush_wait_flushed(): Do not declare a wait in the rare case that
the buf_flush_page_cleaner thread has been shut down already.

buf_flush_page_cleaner(), buf_flush_buffer_pool(): In the code that runs
during shutdown, do not declare waits.

buf_flush_buffer_pool(): Remove a debug assertion that might fail.
What really matters here is buf_pool.flush_list.count==0.

buf_read_recv_pages(), srv_prepare_to_delete_redo_log_file():
Do not declare waits during InnoDB startup.
2023-04-24 09:57:58 +03:00
Thirunarayanan Balathandayuthapani
2c567b2fa3 MDEV-30996 insert.. select in presence of full text index freezes all other commits at commit time
- This patch does the following:
git revert --no-commit 673243c893
git revert --no-commit 6c669b9586
git revert --no-commit bacaf2d4f4
git checkout HEAD mysql-test
git revert --no-commit 1fd7d3a9ad

Above command reverts MDEV-29277, MDEV-25581, MDEV-29342.

When binlog is enabled, trasaction takes a lot of time to do
sync operation on innodb fts table. This leads to block
of other transaction commit. To avoid this failure, remove
the fulltext sync operation during transaction commit. So
reverted MDEV-25581 related patches.

We filed MDEV-31105 to avoid the memory consumption
problem during fulltext sync operation.
2023-04-24 11:06:56 +05:30
Alexander Barkov
9f98a2acd7 MDEV-30968 mariadb-backup does not copy Aria logs if aria_log_dir_path is used
- `mariadb-backup --backup` was fixed to fetch the value of the
   @@aria_log_dir_path server variable and copy aria_log* files
   from @@aria_log_dir_path directory to the backup directory.
   Absolute and relative (to --datadir) paths are supported.

   Before this change aria_log* files were copied to the backup
   only if they were in the default location in @@datadir.

- `mariadb-backup --copy-back` now understands a new my.cnf and command line
   parameter --aria-log-dir-path.

  `mariadb-backup --copy-back` in the main loop in copy_back()
   (when copying back from the backup directory to --datadir)
   was fixed to ignore all aria_log* files.

   A new function copy_back_aria_logs() was added.
   It consists of a separate loop copying back aria_log* files from
   the backup directory to the directory specified in --aria-log-dir-path.
   Absolute and relative (to --datadir) paths are supported.
   If --aria-log-dir-path is not specified,
   aria_log* files are copied to --datadir by default.

- The function is_absolute_path() was fixed to understand MTR style
  paths on Windows with forward slashes, e.g.
   --aria-log-dir-path=D:/Buildbot/amd64-windows/build/mysql-test/var/...
2023-04-21 19:08:35 +04:00
Marko Mäkelä
51e62cb3b3 MDEV-26782 InnoDB temporary tablespace: reclaiming of free space does not work
The motivation of this change is to allow undo pages for temporary tables
to be marked free as often as possible, so that we can avoid buf_pool.LRU
eviction (and writes) of undo pages that contain data that is
no longer needed. For temporary tables, no MVCC or purge of history
is needed, and reusing cached undo log pages might not help that much.

It is possible that this may cause some performance regression due to
more frequent allocation and freeing of undo log pages, but I only
measured a performance improvement.

trx_write_serialisation_history(): Never cache temporary undo log pages.

trx_undo_reuse_cached(): Assert that the rollback segment is persistent.

trx_undo_assign_low(): Add template<bool is_temp>. Never invoke
trx_undo_reuse_cached() for temporary tables.

Tested by: Matthias Leich
2023-04-21 17:58:26 +03:00
Marko Mäkelä
204e7225dc Cleanup: MONITOR_EXISTING trx_undo_slots_used, trx_undo_slots_cached
Let us remove explicit updates of MONITOR_NUM_UNDO_SLOT_USED
and MONITOR_NUM_UNDO_SLOT_CACHED, and let us compute the rough values
from trx_sys.rseg_array[] on demand.
2023-04-21 17:58:18 +03:00
Marko Mäkelä
86767bcc0f MDEV-29593 Purge misses a chance to free not-yet-reused undo pages
trx_purge_truncate_rseg_history(): If all other conditions for
invoking trx_purge_remove_log_hdr() hold, but the state is
TRX_UNDO_CACHED instead of TRX_UNDO_TO_PURGE, detach and free it.

Tested by: Matthias Leich
2023-04-21 17:58:09 +03:00
Marko Mäkelä
40eff3f868 MDEV-26827 fixup: hangs and !os_aio_pending_writes() assertion failures
buf_LRU_get_free_block(): Always wake up the page cleaner if needed
before exiting the inner loop.

srv_prepare_to_delete_redo_log_file():
Replace a debug assertion with a wait in debug builds.
Starting with commit 7e31a8e7fa
the debug assertion ut_ad(!os_aio_pending_writes())
could occasionally fail, while it would hold in core dumps of crashes.
The failure can be reproduced more easily by adding a sleep to the
write completion callback function, right before releasing to
write_slots.

srv_start(): Remove a bogus debug assertion
ut_ad(!os_aio_pending_writes()) that could fail in
mariadb-backup --prepare. In an rr replay trace, we had
buf_pool.flush_list.count==0 but write_slots->m_cache.m_pos==1
and buf_page_t::write_complete() was executing u_unlock().
2023-04-21 17:52:47 +03:00
Marko Mäkelä
e55e761eae MDEV-31084 assert(waiting) failed in TP_connection_generic::wait_end
buf_flush_wait_flushed(): Correct the logic for registering a wait
around buf_flush_wait() that
commit a091d6ac4e
recently broke. This should be easily repeatable when using a
non-default startup parameter:

	thread-handling=pool-of-threads
2023-04-21 16:49:59 +03:00
Marko Mäkelä
abe4c7bfd6 Merge 10.5 into 10.6 2023-04-21 16:38:22 +03:00
Marko Mäkelä
c6e58a8d17 MDEV-30753 fixup: Unsafe buffer page restoration
trx_purge_free_segment(): The buffer-fix only prevents a block from
being freed completely from the buffer pool, but it will not prevent
the block from being evicted. Recheck the page identifier after
acquiring an exclusive page latch. If it has changed, backtrack and
invoke buf_page_get_gen() to look up the page normally.
2023-04-21 16:19:39 +03:00
Marko Mäkelä
7e31a8e7fa MDEV-26827 fixup: Fix os_aio_wait_until_no_pending_writes()
io_callback(): Process the request before releasing the write slot.
Before commit a091d6ac4e
when we had a duplicated counter for writes, either ordering was fine.
Now, correctness depends on os_aio_wait_until_no_pending_writes().
2023-04-20 14:08:48 +03:00
Marko Mäkelä
27ff972be2 MDEV-26827 fixup: Do not hog buf_pool.mutex
buf_flush_LRU_list_batch(): When evicting clean pages,
release and reacquire the buf_pool.mutex after every 32 pages.
Also, eliminate some conditional branches.
2023-04-19 18:57:18 +03:00
Marko Mäkelä
0cda0e4e15 MDEV-31080 fil_validate() failures during deferred tablespace recovery
fil_space_t::create(), fil_space_t::add(): Expect the caller to
acquire and release fil_system.mutex. In this way, creating a tablespace
and adding the first (usually only) data file will be atomic.

recv_sys_t::recover_deferred(): Correctly protect some changes by
holding fil_system.mutex.

Tested by: Matthias Leich
2023-04-19 18:56:58 +03:00
Marko Mäkelä
78368e5866 MDEV-30863 fixup: Assertion failure when using innodb_undo_tablespaces=0
trx_assign_rseg_low(): Let us restore the debug variable look_for_rollover
to avoid assertion failures when a server that was created with
multiple undo tablespaces is being started with innodb_undo_tablespaces=0.
2023-04-19 15:52:11 +03:00
Marko Mäkelä
1892f5d8fc MDEV-30863 fixup: Hang in a debug build
trx_assign_rseg_low(): Correct a debug injection condition.
2023-04-19 14:46:49 +03:00