Commit graph

316 commits

Author SHA1 Message Date
Marko Mäkelä
620c55e708 Merge 10.4 into 10.5 2022-04-21 15:33:50 +03:00
Marko Mäkelä
394784095e Merge 10.3 into 10.4 2022-04-21 11:33:59 +03:00
Marko Mäkelä
4730314a70 MDEV-28369 ibuf_bitmap_mutex is an unnecessary contention point
The only purpose of ibuf_bitmap_mutex is to prevent a deadlock between
two concurrent invocations of ibuf_update_free_bits_for_two_pages_low()
on the same pair of bitmap pages, but in opposite order.
The mutex unnecessarily serializes the execution of the function
even when it is invoked on totally different tablespaces.
To avoid deadlocks, it suffices to ensure that the two page latches
are being acquired in a deterministic (sorted) order.
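
In outline, the idea looks like this (a standalone C++ sketch with
illustrative names, not the actual InnoDB latching code):

#include <cstdint>
#include <mutex>

struct bitmap_page { uint32_t page_no; std::mutex latch; };

void update_free_bits_for_two_pages(bitmap_page &a, bitmap_page &b)
{
  // The two bitmap pages are assumed distinct; latch the lower-numbered
  // page first, giving a deterministic (sorted) acquisition order, so
  // two concurrent calls on the same pair can never latch in opposite
  // order and deadlock.
  bitmap_page *first  = a.page_no < b.page_no ? &a : &b;
  bitmap_page *second = a.page_no < b.page_no ? &b : &a;
  std::lock_guard<std::mutex> l1(first->latch);
  std::lock_guard<std::mutex> l2(second->latch);
  // ... update the free bits on both pages ...
}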
2022-04-21 09:15:18 +03:00
Marko Mäkelä
7b957316cb Merge 10.3 into 10.4 2022-04-07 10:32:56 +03:00
Jan Lindström
3c99a48db3 MDEV-28247 : Disable background ibuf merge during Galera SST
This failure was caused by MDEV-25975, which removed the parameter
innodb_disallow_writes.

Added a check for wsrep_sst_disable_writes to the function
ibuf_merge_in_background().
2022-04-07 08:45:01 +03:00
Vlad Lesin
f6f055a191 Merge 10.3 into 10.4 2022-02-21 14:10:27 +03:00
Vlad Lesin
a6f258e47f MDEV-20605 Awaken transaction can miss inserted by other transaction records due to wrong persistent cursor restoration
Backported from 10.5 20e9e804c1 and
5948d7602e.

sel_restore_position_for_mysql() moves the persistent cursor position
forward after the btr_pcur_restore_position() call if the cursor's
relative position is BTR_PCUR_ON and the cursor points to a record whose
field values are NOT the same as in the stored record (plus some other
conditions that are not important for this case).

It was done because btr_pcur_restore_position() sets the page_cur_mode_t
mode to PAGE_CUR_LE for cursor->rel_pos == BTR_PCUR_ON before opening the
cursor. So we are searching for a record less than or equal to the stored
one, and if the found record is not equal to the stored one, it must be
less, and we need to move the cursor forward.

But there can be a situation where the stored record was purged and a new
one with the same key but a different value was inserted while
row_search_mvcc() was suspended. In this case, when the thread is
awakened, it invokes sel_restore_position_for_mysql(), which in turn
invokes btr_pcur_restore_position(), which returns false because the
found record does not match the stored record, and
sel_restore_position_for_mysql() moves the cursor position forward.

The above can lead to a case where the awakened row_search_mvcc() does
not see records inserted by other transactions while it slept. The mtr
test case shows an example of how this can happen.

The fix is to return a special value from the persistent cursor restore
function to notify its caller that the unique fields of the restored
record and the stored record are the same; in that case
sel_restore_position_for_mysql() does not move the cursor forward.
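
In outline (a standalone sketch; the enum and names are illustrative,
the real patch extends the restore function's return value in a similar
spirit):

enum class restore_status { SAME_ALL, SAME_UNIQ, NOT_SAME };

// Stub standing in for the extended btr_pcur restore call.
restore_status restore_position() { return restore_status::SAME_UNIQ; }

bool must_move_forward()
{
  switch (restore_position()) {
  case restore_status::SAME_ALL:   // the exact stored record was found
  case restore_status::SAME_UNIQ:  // same unique fields: a newer version
    return false;                  // of the row; do not skip it (the fix)
  case restore_status::NOT_SAME:   // found record is less than the stored
  default:                         // one; move the cursor forward as before
    return true;
  }
}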

Delete-marked records are correctly processed in row_search_mvcc().
Non-unique secondary indexes are "uniquified" by appending the PK, so
index->n_uniq should then equal index->n_fields. Hence there is no need
for additional checks in the fix.

If the transaction's read view cannot see the changes made to a secondary
index record, row_search_mvcc() requests the clustered index record in
order to check its transaction id and to get the corresponding record
version. After this, row_search_mvcc() commits the mtr to preserve the
clustered index latching order, and starts a new mtr. Between that mtr
commit and start, the secondary index pages are unlatched, and purge can
remove the record stored in the cursor, which causes duplicate rows in
the result set for non-locking reads, as the cursor position is restored
to the previously visited record.

To solve this, the change is simply switched off for non-locking reads.
It is quite a simple solution, and besides, the change does not make
sense for non-locking reads.

A more complex solution, more effective from a performance perspective,
is to create an mtr savepoint before requesting the clustered record and
to roll back to that savepoint afterwards. See MDEV-27557.

Another solution is to have a per-record transaction id for secondary
indexes. See MDEV-17598.

If either of those is implemented, just remove the select_lock_type
argument from sel_restore_position_for_mysql().
2022-02-21 12:49:54 +03:00
Vlad Lesin
20e9e804c1 MDEV-20605 Awaken transaction can miss inserted by other transaction records due to wrong persistent cursor restoration
sel_restore_position_for_mysql() moves the persistent cursor position
forward after the btr_pcur_restore_position() call if the cursor's
relative position is BTR_PCUR_ON and the cursor points to a record whose
field values are NOT the same as in the stored record (plus some other
conditions that are not important for this case).

It was done because btr_pcur_restore_position() sets the page_cur_mode_t
mode to PAGE_CUR_LE for cursor->rel_pos == BTR_PCUR_ON before opening the
cursor. So we are searching for a record less than or equal to the stored
one, and if the found record is not equal to the stored one, it must be
less, and we need to move the cursor forward.

But there can be a situation where the stored record was purged and a new
one with the same key but a different value was inserted while
row_search_mvcc() was suspended. In this case, when the thread is
awakened, it invokes sel_restore_position_for_mysql(), which in turn
invokes btr_pcur_restore_position(), which returns false because the
found record does not match the stored record, and
sel_restore_position_for_mysql() moves the cursor position forward.

The above can lead to a case where the awakened row_search_mvcc() does
not see records inserted by other transactions while it slept. The mtr
test case shows an example of how this can happen.

The fix is to return a special value from the persistent cursor restore
function to notify its caller that the unique fields of the restored
record and the stored record are the same; in that case
sel_restore_position_for_mysql() does not move the cursor forward.

Delete-marked records are correctly processed in row_search_mvcc().
Non-unique secondary indexes are "uniquified" by appending the PK, so
index->n_uniq should then equal index->n_fields. Hence there is no need
for additional checks in the fix.

If the transaction's read view cannot see the changes made to a secondary
index record, row_search_mvcc() requests the clustered index record in
order to check its transaction id and to get the corresponding record
version. After this, row_search_mvcc() commits the mtr to preserve the
clustered index latching order, and starts a new mtr. Between that mtr
commit and start, the secondary index pages are unlatched, and purge can
remove the record stored in the cursor, which causes duplicate rows in
the result set for non-locking reads, as the cursor position is restored
to the previously visited record.

To solve this, the change is simply switched off for non-locking reads.
It is quite a simple solution, and besides, the change does not make
sense for non-locking reads.

A more complex solution, more effective from a performance perspective,
is to create an mtr savepoint before requesting the clustered record and
to roll back to that savepoint afterwards. See MDEV-27557.

Another solution is to have a per-record transaction id for secondary
indexes. See MDEV-17598.

If either of those is implemented, just remove the select_lock_type
argument from sel_restore_position_for_mysql().
2022-02-14 17:35:04 +03:00
Marko Mäkelä
310dff5d84 MDEV-25783: Change buffer records are lost under heavy load
ibuf_read_merge_pages(): Disable some code that was added in MDEV-20394
in order to avoid a server hang if the change buffer is corrupted,
presumably due to the race condition in recovery that was later fixed in
MDEV-24449. The code will still be available in debug builds when
the command line option --debug=d,ibuf_merge_corruption is specified.
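
For illustration, such gating uses MariaDB's DBUG facility, roughly like
this (a sketch, not the exact call site):

#include <my_dbug.h>  // MariaDB debug macro library

void ibuf_read_merge_pages_sketch()
{
  // Active only in a debug build started with
  // --debug=d,ibuf_merge_corruption; a no-op otherwise.
  DBUG_EXECUTE_IF("ibuf_merge_corruption",
                  /* the disabled merge code would run here */;);
}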

Due to MDEV-19514, the impact of this code is much worse starting
with the 10.5 series. In older versions, the code was only enabled
during a shutdown with innodb_fast_shutdown=0, but in 10.5 it was
active during the normal operation of the server.
2021-06-07 19:07:45 +03:00
Marko Mäkelä
6ca065468f MDEV-19514 fixup: Optimize ibuf_merge_or_delete_for_page() calls 2021-05-31 15:44:04 +03:00
Marko Mäkelä
6c3e860cbf Merge 10.4 into 10.5 2021-04-14 11:35:39 +03:00
Marko Mäkelä
5008171b05 Merge 10.3 into 10.4 2021-04-14 10:33:59 +03:00
Marko Mäkelä
b8c8692fd9 MDEV-24620 ASAN heap-buffer-overflow in btr_pcur_restore_position()
Between btr_pcur_store_position() and btr_pcur_restore_position()
it is possible that purge empties a table and enlarges
index->n_core_fields and index->n_core_null_bytes.
Therefore, we must cache index->n_core_fields in
btr_pcur_t::old_n_core_fields so that btr_pcur_t::old_rec can be
parsed correctly.

Unfortunately, this is a huge change, because we will replace
"bool leaf" parameters with "ulint n_core"
(passing index->n_core_fields, or 0 for non-leaf pages).
For special cases where we know that index->is_instant() cannot hold,
we may also pass index->n_fields.
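
A condensed standalone sketch of the store/restore contract (illustrative
types, not the actual btr_pcur_t):

#include <cstdint>

struct index_sketch { uint32_t n_core_fields; };

struct pcur_sketch {
  const index_sketch *index;
  uint32_t old_n_core_fields;  // cached copy used to parse old_rec

  void store_position()
  {
    // Purge may enlarge index->n_core_fields before the restore;
    // cache the value that matches the stored copy of the record.
    old_n_core_fields = index->n_core_fields;
  }

  void restore_position() const
  {
    // Parse the stored record using old_n_core_fields, not the
    // possibly already enlarged index->n_core_fields.
  }
};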
2021-04-13 10:28:13 +03:00
Marko Mäkelä
450c017c2d Merge 10.2 into 10.3 2021-04-09 14:32:06 +03:00
Marko Mäkelä
35ee4aa4e3 MDEV-13103 fixup: Actually fix a crash during IMPORT TABLESPACE 2021-03-31 09:06:44 +03:00
Thirunarayanan Balathandayuthapani
aa4f76bed7 MDEV-25018 [FATAL] InnoDB: Unable to read page (of a dropped tablespace)
- This issue is caused by commit deadec4e68
(MDEV-24569). InnoDB fails to read the change buffer bitmap page
from a dropped tablespace. In ibuf_bitmap_get_map_page_func(), InnoDB
should fetch the page using the BUF_GET_POSSIBLY_FREED mode. Callers of
ibuf_bitmap_get_map_page() should be adjusted accordingly.
2021-03-03 17:18:26 +05:30
Thirunarayanan Balathandayuthapani
8d714db622 MDEV-24997 Assertion mtr->is_named_space(page_id.space()) in ibuf0ibuf.cc:624
- This is caused by commit deadec4e68
(MDEV-24569). InnoDB fails to set the tablespace associated with the
mini-transaction while resetting the change buffer bitmap bits of
the page.
2021-02-26 19:02:26 +05:30
Marko Mäkelä
deadec4e68 MDEV-24569: Change buffer merge modifies freed page
Starting with MDEV-15528, we will avoid writing freed pages back
to data files. During stress testing, the assertion
mach_read_from_4(frame + 4U) == block.page.id().page_no()
failed in log_phys_t::apply(), because before the server was
killed and restarted, change buffer merge was executed on a
previously freed page.

During recovery, the assertion would fail when InnoDB would apply
a FREE_PAGE and a WRITE record to the page.

ibuf_merge_or_delete_for_page(): Before applying any changes, check whether
the secondary index leaf page has already been freed according to
the allocation bitmap page. If it is freed, then we must reset the bits
in change buffer bitmap page.
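
In outline (a standalone sketch; the stubs stand for reads of the
allocation bitmap and change buffer bitmap pages within the
mini-transaction):

// Stubs standing in for real page reads in a mini-transaction.
static bool leaf_page_is_freed_in_alloc_bitmap() { return false; }
static void reset_ibuf_bitmap_bits() {}
static void apply_buffered_changes() {}

void merge_or_delete_for_page_sketch()
{
  if (leaf_page_is_freed_in_alloc_bitmap())
  {
    // The page was freed: applying changes would modify a freed page.
    // Only reset the change buffer bitmap bits for it.
    reset_ibuf_bitmap_bits();
    return;
  }
  apply_buffered_changes();
}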

ibuf_reset_bitmap(): Auxiliary function for repeated code.

mtr_t::sx_lock_space(): Replaces mtr_t::x_lock_space(). Due to the
lazy approach of the change buffer, the function
ibuf_merge_or_delete_for_page() may be executed in buf_page_create()
while the tablespace is already X-latched. An S-latch must not be
acquired while the thread already holds an X-latch, but a redundant
SX-latch is fine.

The fix was developed by Thirunarayanan Balathandayuthapani.
2021-01-14 14:26:24 +02:00
Marko Mäkelä
0c23e32d27 MDEV-24445 Using innodb_undo_tablespaces corrupts system tablespace
In the rewrite of MDEV-8139 (based on MDEV-15528), we introduced a
wrong assumption that any persistent tablespace that is not an .ibd
file is the system tablespace. This assumption is broken when
innodb_undo_tablespaces (files undo001, undo002, ...) is set to a
nonzero value. By default, we have innodb_undo_tablespaces=0 (the
persistent undo log is stored in the system tablespace).

In MDEV-15528 and MDEV-8139 we rewrote the page scrubbing logic
so that it will follow the tried-and-true write-ahead logging
protocol, first writing FREE_PAGE records and then in the page
flushing, zerofilling or hole-punching freed pages.

Unfortunately, the implementation included the wrong assumption
that anything that is not in an .ibd file must be the system tablespace.
This wrong assumption would cause overwrites of valid data pages in
the system tablespace.

mtr_t::m_freed_in_system_tablespace: Remove.

mtr_t::m_freed_space: The tablespace associated with m_freed_pages.

buf_page_free(): Take the tablespace and page number as a parameter,
instead of taking a page identifier.
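
A standalone sketch of the revised bookkeeping (illustrative types, not
the actual mtr_t):

#include <cstdint>
#include <set>

struct space_sketch { /* a tablespace: system, undo, or .ibd */ };

struct mtr_sketch {
  space_sketch *m_freed_space = nullptr;  // tablespace of m_freed_pages;
                                          // replaces the boolean
                                          // m_freed_in_system_tablespace
  std::set<uint32_t> m_freed_pages;

  // Take the tablespace and page number, not a raw page identifier,
  // so undo tablespaces cannot be mistaken for the system tablespace.
  void page_free(space_sketch &space, uint32_t page_no)
  {
    m_freed_space = &space;
    m_freed_pages.insert(page_no);
  }
};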
2020-12-18 19:23:34 +02:00
Marko Mäkelä
d7a5824899 Merge 10.4 into 10.5 2020-11-13 21:54:21 +02:00
Marko Mäkelä
972dc6ee98 Merge 10.3 into 10.4 2020-11-12 11:18:04 +02:00
Marko Mäkelä
150f447af1 Merge 10.2 into 10.3 2020-11-12 10:37:21 +02:00
Marko Mäkelä
bd528b0c93 MDEV-24182 ibuf_merge_or_delete_for_page() contains dead code
The function ibuf_merge_or_delete_for_page() was always being
invoked with update_ibuf_bitmap=true ever since
commit cd623508df
fixed up something after MDEV-9566.

Furthermore, the parameter page_size is never passed as a
null pointer, and therefore it had better be a reference to
a constant object.
2020-11-11 15:48:43 +02:00
Marko Mäkelä
898521e2dd Merge 10.4 into 10.5 2020-10-30 11:15:30 +02:00
Marko Mäkelä
2b6f804490 Merge 10.2 into 10.3 2020-10-28 10:44:40 +02:00
Marko Mäkelä
a8de8f261d Merge 10.2 into 10.3 2020-10-28 10:01:50 +02:00
Eugene Kosov
afc9d00c66 MDEV-23991 dict_table_stats_lock() has unnecessarily long scope
This patch removes dict_index_t::stats_latch. Table/index statistics are
now protected by dict_sys->mutex. That way, statistics computation can
happen in parallel in several threads, and dict_sys->mutex will be locked
only for a short period of time.

This patch is a joint work with Marko Mäkelä

dict_index_t::lock: make it mutable, which allows passing a const
pointer when only the lock is touched in an object

btr_height_get()
btr_get_size(): make index argument const for better type safety

btr_estimate_number_of_different_key_vals(): now returns computed values
instead of setting fields in dict_index_t directly

remove everything related to dict_index_t::stats_latch

dict_stats_index_set_n_diff(): now returns computed values instead
of setting fields in dict_index_t directly

dict_stats_analyze_index(): now returns computed values instead
of setting fields in dict_index_t directly
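
The shape of the change, as a standalone sketch (field names
illustrative):

#include <mutex>

struct index_stats { double n_diff = 0; /* ... */ };
struct index_sketch { double stat_n_diff = 0; };
static std::mutex dict_sys_mutex;  // stand-in for dict_sys->mutex

// The long-running computation touches no shared fields...
index_stats analyze_index(const index_sketch &)
{
  index_stats s;
  // ... scan pages, sample keys, fill s ...
  return s;
}

// ...and the results are published under the mutex only briefly.
void publish(index_sketch &index, const index_stats &s)
{
  std::lock_guard<std::mutex> g(dict_sys_mutex);
  index.stat_n_diff = s.n_diff;
}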

Reviewed by: Marko Mäkelä
2020-10-27 19:09:20 +03:00
Marko Mäkelä
118e258aaa MDEV-23855: Shrink fil_space_t
Merge n_pending_ios, n_pending_ops to std::atomic<uint32_t> n_pending.
Change some more fil_space_t members to uint32_t to reduce
the memory footprint.
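
One possible packing, as a standalone sketch (not necessarily the
commit's exact bit layout):

#include <atomic>
#include <cstdint>

// Two counters folded into one 32-bit atomic word so that either one
// can be adjusted with a single atomic operation.
static std::atomic<uint32_t> n_pending{0};
constexpr uint32_t PENDING_IO = 1;        // low 16 bits: pending I/Os
constexpr uint32_t PENDING_OP = 1 << 16;  // high 16 bits: pending ops

void acquire_for_io() { n_pending.fetch_add(PENDING_IO); }
void release_for_io() { n_pending.fetch_sub(PENDING_IO); }
void acquire()        { n_pending.fetch_add(PENDING_OP); }
void release()        { n_pending.fetch_sub(PENDING_OP); }
bool in_use()         { return n_pending.load() != 0; }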

fil_space_t::add(), fil_ibd_create(): Attach the already opened
handle to the tablespace, and enforce the fil_system.n_open limit.

dict_boot(): Initialize fil_system.max_assigned_id.

srv_boot(): Call srv_thread_pool_init() before anything else,
so that files should be opened in the correct mode on Windows.

fil_ibd_create(): Create the file in OS_FILE_AIO mode, just like
fil_node_open_file_low() does it.

dict_table_t::is_accessible(): Replaces fil_table_accessible().

Reviewed by: Vladislav Vaintroub
2020-10-26 17:53:54 +02:00
Marko Mäkelä
45ed9dd957 MDEV-23855: Remove fil_system.LRU and reduce fil_system.mutex contention
Also fixes MDEV-23929: innodb_flush_neighbors is not being ignored
for system tablespace on SSD

When the maximum configured number of files is exceeded, InnoDB will
close data files. We used to maintain a fil_system.LRU list and
a counter fil_node_t::n_pending to achieve this, at the huge cost
of multiple fil_system.mutex operations per I/O operation.

fil_node_open_file_low(): Implement a FIFO replacement policy:
The last opened file will be moved to the end of fil_system.space_list,
and files will be closed from the start of the list. However, we will
not move tablespaces in fil_system.space_list while
i_s_tablespaces_encryption_fill_table() is executing
(producing output for INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION)
because it may cause information of some tablespaces to go missing.
We also avoid this in mariabackup --backup because datafiles_iter_next()
assumes that the ordering is not changed.
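
The replacement policy in miniature (a standalone sketch, not the actual
fil_system code):

#include <cstddef>
#include <list>
#include <string>

std::list<std::string> space_list;  // stand-in for fil_system.space_list
constexpr size_t n_open_limit = 4;  // stand-in for the open-files limit

void on_file_opened(const std::string &name)
{
  space_list.push_back(name);       // last opened goes to the end
  while (space_list.size() > n_open_limit)
  {
    // close the file at the start of the list (first in, first out)
    space_list.pop_front();
  }
}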

IORequest: Fold more parameters to IORequest::type.

fil_space_t::io(): Replaces fil_io().

fil_space_t::flush(): Replaces fil_flush().

OS_AIO_IBUF: Remove. We will always issue synchronous reads of the
change buffer pages in buf_read_page_low().

We will always ignore some errors for background reads.

This should reduce fil_system.mutex contention a little.

fil_node_t::complete_write(): Replaces fil_node_t::complete_io().
On both read and write completion, fil_space_t::release_for_io()
will have to be called.

fil_space_t::io(): Do not acquire fil_system.mutex in the normal
code path.

xb_delta_open_matching_space(): Do not try to open the system tablespace
which was already opened. This fixes a file sharing violation in
mariabackup --prepare --incremental.

Reviewed by: Vladislav Vaintroub
2020-10-26 17:09:01 +02:00
Thirunarayanan Balathandayuthapani
3ba8f619e4 MDEV-23370 innodb_fts.innodb_fts_misc failed in buildbot, server crashed in dict_table_autoinc_destroy
This issue is caused by MDEV-22456 ad6171b91c. The fix involves the
backported version of the 10.4 patch MDEV-22778 5f2628d1ee and a few
parts of MDEV-17441 (e9a5f288f2).

dict_table_t::stats_latch_created: Removed

dict_table_t::stats_latch: make it a value member and always lock it for
simplicity, even for statistics-clone tables.

zip_pad_info_t::mutex_created: Removed

zip_pad_info_t::mutex: make it a value member instead of a pointer

os0once.h: Removed

dict_table_remove_from_cache_low(): Ensure that fts_free() is always
called, even if dict_mem_table_free() is deferred until
btr_search_lazy_free().

InnoDB would always create zip_pad_info_t::mutex and
dict_table_t::autoinc_mutex, even for tables that are not in
ROW_FORMAT=COMPRESSED and do not include any AUTO_INCREMENT column.
2020-10-25 15:53:17 +05:30
Marko Mäkelä
9028cc6b86 Cleanup: Make InnoDB page numbers uint32_t
InnoDB stores a 32-bit page number in page headers and in some
data structures, such as FIL_ADDR (consisting of a 32-bit page number
and a 16-bit byte offset within a page). For better compile-time
error detection and to reduce the memory footprint in some data
structures, let us use a uint32_t for the page number, instead
of ulint (size_t) which can be 64 bits.
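
For illustration, the FIL_ADDR layout described above with fixed-width
types (a sketch, not the actual definition):

#include <cstdint>

// FIL_ADDR: a 32-bit page number plus a 16-bit byte offset. Using
// uint32_t (instead of the possibly 64-bit ulint) matches the on-disk
// width and lets the compiler flag accidental narrowing.
struct fil_addr_sketch {
  uint32_t page;     // page number within the tablespace
  uint16_t boffset;  // byte offset within the page
};
static_assert(sizeof(uint32_t) == 4, "page numbers are 4 bytes on disk");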
2020-10-15 17:06:17 +03:00
Marko Mäkelä
ecb913c2a5 Cleanup: Compare page_id_t directly 2020-10-15 17:06:16 +03:00
Marko Mäkelä
9ae608d24d Merge 10.3 into 10.4 2020-10-01 13:41:36 +03:00
Marko Mäkelä
323500bfa9 Merge 10.2 into 10.3 2020-09-30 16:25:06 +03:00
Marko Mäkelä
74bd3683ca MDEV-23839 innodb_fast_shutdown=0 hang on change buffer merge
ibuf_merge_or_delete_for_page(): Do not attempt to invoke
ibuf_delete_recs() on a page of the change buffer itself.
The caller could already be holding ibuf->index->lock,
and an attempt to acquire it in S mode would hang the release server
or cause an assertion failure in rw_lock_s_lock_func() in a debug
server.

This problem was reproducible on 1 out of 2 runs of the following:
./mtr --no-reorder \
innodb.innodb-page_compression_default \
innodb.innodb-page_compression_snappy \
innodb.innodb-page_compression_zip \
innodb.innodb_wl6326_big innodb.xa_recovery
2020-09-29 11:04:12 +03:00
Marko Mäkelä
00cd53d39a MDEV-23719: Make lock_sys use page_id_t
Since commit 8ccb3caafb it should be
more efficient to use page_id_t rather than two separate variables
for tablespace identifier and page number.

lock_rec_fold(): Replaced with page_id_t::fold().

lock_rec_hash(): Replaced with lock_sys.hash(page_id).

lock_rec_expl_exist_on_page(), lock_rec_get_first_on_page_addr(),
lock_rec_get_first_on_page(): Replaced with lock_sys.get_first().
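
An illustrative page_id_t sketch (the real fold() may mix the bits
differently):

#include <cstdint>

struct page_id_sketch {
  uint64_t raw;  // (space id << 32) | page number, in one machine word
  constexpr page_id_sketch(uint32_t space, uint32_t page_no)
    : raw((uint64_t{space} << 32) | page_no) {}
  // One-word hash key: no need to combine two separate variables
  // (tablespace id and page number) at every lookup.
  constexpr uint64_t fold() const { return raw; }
};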
2020-09-17 14:08:41 +03:00
Marko Mäkelä
ca9276e37e Merge 10.1 into 10.2 2020-07-20 14:53:24 +03:00
Marko Mäkelä
57ec42bc32 MDEV-23190 InnoDB data file extension is not crash-safe
When InnoDB is extending a data file, it is updating the FSP_SIZE
field in the first page of the data file.

In commit 8451e09073 (MDEV-11556)
we removed a work-around for this bug and made recovery stricter,
by making it track changes to FSP_SIZE via redo log records, and
extend the data files before any changes are being applied to them.

It turns out that the function fsp_fill_free_list() is not crash-safe
with respect to this when it is initializing the change buffer bitmap
page (page 1, or generally, N*innodb_page_size+1). It uses a separate
mini-transaction that is committed (and will be written to the redo
log file) before the mini-transaction that actually extended the data
file. Hence, recovery can observe a reference to a page that is
beyond the current end of the data file.

fsp_fill_free_list(): Initialize the change buffer bitmap page in
the same mini-transaction.

The rest of the changes are fixing a bug that the use of the separate
mini-transaction was attempting to work around. Namely, we must ensure
that no other thread will access the change buffer bitmap page before
our mini-transaction has been committed and all page latches have been
released.

That is, for read-ahead as well as neighbour flushing, we must avoid
accessing pages that might not yet be durably part of the tablespace.

fil_space_t::committed_size: The size of the tablespace
as persisted by mtr_commit().

fil_space_t::max_page_number_for_io(): Limit the highest page
number for I/O batches to committed_size.

MTR_MEMO_SPACE_X_LOCK: Replaces MTR_MEMO_X_LOCK for fil_space_t::latch.

mtr_x_space_lock(): Replaces mtr_x_lock() for fil_space_t::latch.

mtr_memo_slot_release_func(): When releasing MTR_MEMO_SPACE_X_LOCK,
copy space->size to space->committed_size. In this way, read-ahead
or flushing will never be invoked on pages that do not yet exist
according to FSP_SIZE.
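
A standalone sketch of the invariant (illustrative types, not the actual
fil_space_t):

#include <atomic>
#include <cstdint>

// I/O batches must not reach past the size that a committed
// mini-transaction has made durable.
struct space_sketch {
  std::atomic<uint32_t> size{0};            // current FSP_SIZE in pages
  std::atomic<uint32_t> committed_size{0};  // persisted by mtr_commit()

  uint32_t max_page_number_for_io() const
  { return committed_size.load(std::memory_order_acquire); }

  // Called when the MTR_MEMO_SPACE_X_LOCK slot is released.
  void on_space_latch_release()
  { committed_size.store(size.load(), std::memory_order_release); }
};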
2020-07-20 14:48:56 +03:00
Eugene Kosov
0e86254b05 MDEV-22930 Unnecessary contention on rw_lock_list_mutex in ibuf_dummy_index_create()
1. Do not initialize dict_table_t::stats_latch in ibuf
2. Remove overengineering in GenericPolicy to speed things up

dict_mem_table_create(): add new argument init_stats_latch

ibuf_dummy_index_create(): do not initialize dict_table_t::stats_latch

GenericPolicy: add new members m_filename and m_line

sync_file_create_register()
sync_file_created_deregister()
sync_file_created_get()
CreateTracker: remove

rw_lock_t::created: a new debug member
2020-07-08 04:08:30 +03:00
Marko Mäkelä
d2c593c2a6 MDEV-22877 Avoid unnecessary buf_pool.page_hash S-latch acquisition
MDEV-15053 did not remove all unnecessary buf_pool.page_hash S-latch
acquisitions. There are code paths where we hold buf_pool.mutex
(which sufficiently protects buf_pool.page_hash against changes)
and unnecessarily acquire the latch. Many invocations of
buf_page_hash_get_locked() can be replaced with the much simpler
buf_pool.page_hash_get_low().

In the worst case the thread that is holding buf_pool.mutex will become
a victim of MDEV-22871, suffering from a spurious reader-reader conflict
with another thread that genuinely needs to acquire a buf_pool.page_hash
S-latch.

In many places, we were also evaluating page_id_t::fold() while holding
buf_pool.mutex. Low-level functions such as buf_pool.page_hash_get_low()
must get the page_id_t::fold() as a parameter.
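
The calling pattern in miniature (a standalone sketch with stubs):

#include <cstdint>
#include <mutex>

struct page_id_sketch { uint64_t fold() const { return 0; } };
struct block_sketch;
static std::mutex buf_pool_mutex;  // stand-in for buf_pool.mutex

block_sketch *page_hash_get_low(const page_id_sketch &, uint64_t)
{ return nullptr; }  // stub; the real function walks the hash chain

block_sketch *lookup(const page_id_sketch &id)
{
  const uint64_t fold = id.fold();     // hash before taking the mutex
  std::lock_guard<std::mutex> g(buf_pool_mutex);
  return page_hash_get_low(id, fold);  // no hashing inside the section
}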

buf_buddy_relocate(): Defer the hash_lock acquisition to the critical
section that starts by calling buf_page_t::can_relocate().
2020-06-12 16:22:03 +03:00
Marko Mäkelä
17a7bafec0 MDEV-22110 preparation: Remove mtr_memo_contains macros
Let us invoke the debug member functions of mtr_t directly.

mtr_t::memo_contains(): Change the parameter type to
const rw_lock_t&. This function cannot be invoked on
buf_block_t::lock.

The function mtr_t::memo_contains_flagged() is intended to be invoked
on buf_block_t* or rw_lock_t*, and it along with
mtr_t::memo_contains_page_flagged() are the way to check whether
a buffer pool page has been latched within a mini-transaction.
2020-06-10 07:50:09 +03:00
Marko Mäkelä
6877ef9a7c Merge 10.4 into 10.5 2020-06-05 20:36:43 +03:00
Marko Mäkelä
68d9d512e9 Merge 10.3 into 10.4 2020-06-05 18:05:22 +03:00
Marko Mäkelä
680463a8d9 Merge 10.2 into 10.3 2020-06-05 16:51:26 +03:00
Marko Mäkelä
efc70da5fd MDEV-22769 Shutdown hang or crash due to XA breaking locks
The background drop table queue in InnoDB is a work-around for
cases where the SQL layer is requesting DDL on tables on which
transactional locks exist.

One such case are XA transactions. Our test case exploits the
fact that the recovery of XA PREPARE transactions will
only resurrect InnoDB table locks, but not MDL that should
block any concurrent DDL.

srv_shutdown_t: Introduce the srv_shutdown_state=SRV_SHUTDOWN_INITIATED
for the initial part of shutdown, to wait for the background drop
table queue to be emptied.

srv_shutdown_bg_undo_sources(): Assign
srv_shutdown_state=SRV_SHUTDOWN_INITIATED
before waiting for the background drop table queue to be emptied.

row_drop_tables_for_mysql_in_background(): On slow shutdown, if
no active transactions exist (excluding ones that are in
XA PREPARE state), skip any tables on which locks exist.

row_drop_table_for_mysql(): Do not unnecessarily attempt to
drop InnoDB persistent statistics for tables that have
already been added to the background drop table queue.

row_mysql_close(): Relax an assertion, and free all memory
even if innodb_force_recovery=2 would prevent the background
drop table queue from being emptied.
2020-06-05 15:22:46 +03:00
Marko Mäkelä
b1ab211dee MDEV-15053 Reduce buf_pool_t::mutex contention
User-visible changes: The INFORMATION_SCHEMA views INNODB_BUFFER_PAGE
and INNODB_BUFFER_PAGE_LRU will report a dummy value FLUSH_TYPE=0
and will no longer report the PAGE_STATE value READY_FOR_USE.

We will remove some fields from buf_page_t and move much code to
member functions of buf_pool_t and buf_page_t, so that the access
rules of data members can be enforced consistently.

Evicting or adding pages in buf_pool.LRU will remain covered by
buf_pool.mutex.

Evicting or adding pages in buf_pool.page_hash will remain
covered by both buf_pool.mutex and the buf_pool.page_hash X-latch.

After this fix, buf_pool.page_hash lookups can entirely
avoid acquiring buf_pool.mutex, only relying on
buf_pool.hash_lock_get() S-latch.
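
The lookup path in miniature (a standalone sketch; buf_pool.mutex still
covers insertion and eviction):

#include <shared_mutex>

struct hash_shard { std::shared_mutex latch; /* hash chain ... */ };

const void *hash_lookup(hash_shard &shard /*, page id */)
{
  std::shared_lock<std::shared_mutex> s(shard.latch);  // S-latch only
  // ... walk the chain; return the block if found ...
  return nullptr;
}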

Similarly, buf_flush_check_neighbors() will rely solely on
buf_pool.mutex, with no buf_pool.page_hash latch at all.

The buf_pool.mutex is rather contended in I/O heavy benchmarks,
especially when the workload does not fit in the buffer pool.

The first attempt to alleviate the contention was the
buf_pool_t::mutex split in
commit 4ed7082eef
which introduced buf_block_t::mutex, which we are now removing.

Later, multiple instances of buf_pool_t were introduced
in commit c18084f71b
and recently removed by us in
commit 1a6f708ec5 (MDEV-15058).

UNIV_BUF_DEBUG: Remove. This option to enable some buffer pool
related debugging in otherwise non-debug builds has not been used
for years. Instead, we have been using UNIV_DEBUG, which is enabled
in CMAKE_BUILD_TYPE=Debug.

buf_block_t::mutex, buf_pool_t::zip_mutex: Remove. We can mainly rely on
std::atomic and the buf_pool.page_hash latches, and in some cases
depend on buf_pool.mutex or buf_pool.flush_list_mutex just like before.
We must always release buf_block_t::lock before invoking
unfix() or io_unfix(), to prevent a glitch where a block that was
added to the buf_pool.free list would appear X-latched. See
commit c5883debd6 for how this glitch
was finally caught in a debug environment.

We move some buf_pool_t::page_hash specific code from the
ha and hash modules to buf_pool, for improved readability.

buf_pool_t::close(): Assert that all blocks are clean, except
on aborted startup or crash-like shutdown.

buf_pool_t::validate(): No longer attempt to validate
n_flush[] against the number of BUF_IO_WRITE fixed blocks,
because buf_page_t::flush_type no longer exists.

buf_pool_t::watch_set(): Replaces buf_pool_watch_set().
Reduce mutex contention by separating the buf_pool.watch[]
allocation and the insert into buf_pool.page_hash.

buf_pool_t::page_hash_lock<bool exclusive>(): Acquire a
buf_pool.page_hash latch.
Replaces and extends buf_page_hash_lock_s_confirm()
and buf_page_hash_lock_x_confirm().

buf_pool_t::READ_AHEAD_PAGES: Renamed from BUF_READ_AHEAD_PAGES.

buf_pool_t::curr_size, old_size, read_ahead_area, n_pend_reads:
Use Atomic_counter.

buf_pool_t::running_out(): Replaces buf_LRU_buf_pool_running_out().

buf_pool_t::LRU_remove(): Remove a block from the LRU list
and return its predecessor. Incorporates buf_LRU_adjust_hp(),
which was removed.

buf_page_get_gen(): Remove a redundant call of fsp_is_system_temporary(),
for mode == BUF_GET_IF_IN_POOL_OR_WATCH, which is only used by
BTR_DELETE_OP (purge), which is never invoked on temporary tables.

buf_free_from_unzip_LRU_list_batch(): Avoid redundant assignments.

buf_LRU_free_from_unzip_LRU_list(): Simplify the loop condition.

buf_LRU_free_page(): Clarify the function comment.

buf_flush_check_neighbor(), buf_flush_check_neighbors():
Rewrite the construction of the page hash range. We will hold
the buf_pool.mutex for up to buf_pool.read_ahead_area (at most 64)
consecutive lookups of buf_pool.page_hash.

buf_flush_page_and_try_neighbors(): Remove.
Merge to its only callers, and remove redundant operations in
buf_flush_LRU_list_batch().

buf_read_ahead_random(), buf_read_ahead_linear(): Rewrite.
Do not acquire buf_pool.mutex, and iterate directly with page_id_t.

ut_2_power_up(): Remove. my_round_up_to_next_power() is inlined
and avoids any loops.

fil_page_get_prev(), fil_page_get_next(), fil_addr_is_null(): Remove.

buf_flush_page(): Add a fil_space_t* parameter. Minimize the
buf_pool.mutex hold time. buf_pool.n_flush[] is no longer updated
atomically with the io_fix, and we will protect most buf_block_t
fields with buf_block_t::lock. The function
buf_flush_write_block_low() is removed and merged here.

buf_page_init_for_read(): Use static linkage. Initialize the newly
allocated block and acquire the exclusive buf_block_t::lock while not
holding any mutex.

IORequest::IORequest(): Remove the body. We only need to invoke
set_punch_hole() in buf_flush_page() and nowhere else.

buf_page_t::flush_type: Remove. Replaced by IORequest::flush_type.
This field is only used during a fil_io() call.
That function already takes IORequest as a parameter, so we had
better introduce the rarely changing field there.

buf_block_t::init(): Replaces buf_page_init().

buf_page_t::init(): Replaces buf_page_init_low().

buf_block_t::initialise(): Initialise many fields, but
keep the buf_page_t::state(). Both buf_pool_t::validate() and
buf_page_optimistic_get() requires that buf_page_t::in_file()
be protected atomically with buf_page_t::in_page_hash
and buf_page_t::in_LRU_list.

buf_page_optimistic_get(): Now that buf_block_t::mutex
no longer exists, we must check buf_page_t::io_fix()
after acquiring the buf_pool.page_hash lock, to detect
whether buf_page_init_for_read() has been initiated.
We will also check the io_fix() before acquiring hash_lock
in order to avoid unnecessary computation.
The field buf_block_t::modify_clock (protected by buf_block_t::lock)
allows buf_page_optimistic_get() to validate the block.

buf_page_t::real_size: Remove. It was only used while flushing
pages of page_compressed tables.

buf_page_encrypt(): Add an output parameter that allows us to eliminate
buf_page_t::real_size. Replace a condition with a debug assertion.

buf_page_should_punch_hole(): Remove.

buf_dblwr_t::add_to_batch(): Replaces buf_dblwr_add_to_batch().
Add the parameter size (to replace buf_page_t::real_size).

buf_dblwr_t::write_single_page(): Replaces buf_dblwr_write_single_page().
Add the parameter size (to replace buf_page_t::real_size).

fil_system_t::detach(): Replaces fil_space_detach().
Ensure that fil_validate() will not be violated even if
fil_system.mutex is released and reacquired.

fil_node_t::complete_io(): Renamed from fil_node_complete_io().

fil_node_t::close_to_free(): Replaces fil_node_close_to_free().
Avoid invoking fil_node_t::close() because fil_system.n_open
has already been decremented in fil_space_t::detach().

BUF_BLOCK_READY_FOR_USE: Remove. Directly use BUF_BLOCK_MEMORY.

BUF_BLOCK_ZIP_DIRTY: Remove. Directly use BUF_BLOCK_ZIP_PAGE,
and distinguish dirty pages by buf_page_t::oldest_modification().

BUF_BLOCK_POOL_WATCH: Remove. Use BUF_BLOCK_NOT_USED instead.
This state was only being used for buf_page_t that are in
buf_pool.watch.

buf_pool_t::watch[]: Remove pointer indirection.

buf_page_t::in_flush_list: Remove. It was set if and only if
buf_page_t::oldest_modification() is nonzero.

buf_page_decrypt_after_read(), buf_corrupt_page_release(),
buf_page_check_corrupt(): Change the const fil_space_t* parameter
to const fil_node_t& so that we can report the correct file name.

buf_page_monitor(): Declare as an ATTRIBUTE_COLD global function.

buf_page_io_complete(): Split to buf_page_read_complete() and
buf_page_write_complete().

buf_dblwr_t::in_use: Remove.

buf_dblwr_t::buf_block_array: Add IORequest::flush_t.

buf_dblwr_sync_datafiles(): Remove. It was a useless wrapper of
os_aio_wait_until_no_pending_writes().

buf_flush_write_complete(): Declare static, not global.
Add the parameter IORequest::flush_t.

buf_flush_freed_page(): Simplify the code.

recv_sys_t::flush_lru: Renamed from flush_type and changed to bool.

fil_read(), fil_write(): Replaced with direct use of fil_io().

fil_buffering_disabled(): Remove. Check srv_file_flush_method directly.

fil_mutex_enter_and_prepare_for_io(): Return the resolved
fil_space_t* to avoid a duplicated lookup in the caller.

fil_report_invalid_page_access(): Clean up the parameters.

fil_io(): Return fil_io_t, which comprises fil_node_t and error code.
Always invoke fil_space_t::acquire_for_io() and let either the
sync=true caller or fil_aio_callback() invoke
fil_space_t::release_for_io().

fil_aio_callback(): Rewrite to replace buf_page_io_complete().

fil_check_pending_operations(): Remove a parameter, and remove some
redundant lookups.

fil_node_close_to_free(): Wait for n_pending==0. Because we no longer
do an extra lookup of the tablespace between fil_io() and the
completion of the operation, we must give fil_node_t::complete_io() a
chance to decrement the counter.

fil_close_tablespace(): Remove unused parameter trx, and document
that this is only invoked during the error handling of IMPORT TABLESPACE.

row_import_discard_changes(): Merged with the only caller,
row_import_cleanup(). Do not lock up the data dictionary while
invoking fil_close_tablespace().

logs_empty_and_mark_files_at_shutdown(): Do not invoke
fil_close_all_files(), to avoid a !needs_flush assertion failure
on fil_node_t::close().

innodb_shutdown(): Invoke os_aio_free() before fil_close_all_files().

fil_close_all_files(): Invoke fil_flush_file_spaces()
to ensure proper durability.

thread_pool::unbind(): Fix a crash that would occur on Windows
after srv_thread_pool->disable_aio() and os_file_close().
This fix was submitted by Vladislav Vaintroub.

Thanks to Matthias Leich and Axel Schwenke for extensive testing,
Vladislav Vaintroub for helpful comments, and Eugene Kosov for a review.
2020-06-05 12:35:46 +03:00
Marko Mäkelä
23047d3ed4 Merge 10.4 into 10.5 2020-05-18 17:30:02 +03:00
Marko Mäkelä
9e6e43551f Merge 10.3 into 10.4
We will expose some more std::atomic internals in Atomic_counter,
so that dict_index_t::lock will support the default assignment operator.
2020-05-16 07:39:15 +03:00
Marko Mäkelä
3d0bb2b7f1 Merge 10.2 into 10.3 2020-05-15 19:11:57 +03:00
Marko Mäkelä
ad6171b91c MDEV-22456 Dropping the adaptive hash index may cause DDL to lock up InnoDB
If the InnoDB buffer pool contains many pages for a table or index
that is being dropped or rebuilt, and if many of such pages are
pointed to by the adaptive hash index, dropping the adaptive hash index
may consume a lot of time.

The time-consuming operation of dropping the adaptive hash index entries
is being executed while the InnoDB data dictionary cache dict_sys is
exclusively locked.

It is not actually necessary to drop all adaptive hash index entries
at the time a table or index is being dropped or rebuilt. We can let
the LRU replacement policy of the buffer pool take care of this gradually.
For this to work, we must detach the dict_table_t and dict_index_t
objects from the main dict_sys cache, and once the last
adaptive hash index entry for the detached table is removed
(when the garbage page is evicted from the buffer pool) we can free
the dict_table_t and dict_index_t object.

Related to this, in MDEV-16283, we made ALTER TABLE...DISCARD TABLESPACE
skip both the buffer pool eviction and the drop of the adaptive hash index.
We shifted the burden to ALTER TABLE...IMPORT TABLESPACE or DROP TABLE.
We can remove the eviction from DROP TABLE. We must retain the eviction
in the ALTER TABLE...IMPORT TABLESPACE code path, so that in case the
discarded table is being re-imported with the same tablespace identifier,
the fresh data from the imported tablespace will replace any stale pages
in the buffer pool.

rpl.rpl_failed_drop_tbl_binlog: Remove the test. DROP TABLE can
no longer be interrupted inside InnoDB.

fseg_free_page(), fseg_free_step(), fseg_free_step_not_header(),
fseg_free_page_low(), fseg_free_extent(): Remove the parameter
that specifies whether the adaptive hash index should be dropped.

btr_search_lazy_free(): Lazily free an index when the last
reference to it is dropped from the adaptive hash index.

buf_pool_clear_hash_index(): Declare static, and move to the
same compilation unit with the bulk of the adaptive hash index
code.

dict_index_t::clone(), dict_index_t::clone_if_needed():
Clone an index that is being rebuilt while adaptive hash index
entries exist. The original index will be inserted into
dict_table_t::freed_indexes and dict_index_t::set_freed()
will be called.

dict_index_t::set_freed(), dict_index_t::freed(): Record that, or
check whether, the index has been freed. We will use the
impossible page number 1 to denote this condition.
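
A standalone sketch of the marker (illustrative type, not the actual
dict_index_t):

#include <cstdint>

// Page 1 of a tablespace is an allocation bitmap page, so no index
// root can ever live there; the value works as a "freed" marker.
struct index_sketch {
  uint32_t page = 0;                 // root page number
  void set_freed() { page = 1; }     // impossible root page number
  bool freed() const { return page == 1; }
};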

dict_index_t::n_ahi_pages(): Replaces btr_search_info_get_ref_count().

dict_index_t::detach_columns(): Move the assignment n_fields=0
to ha_innobase_inplace_ctx::clear_added_indexes().
We must have access to the columns when freeing the
adaptive hash index. Note: dict_table_t::v_cols[] will remain
valid. If virtual columns are dropped or added, the table
definition will be reloaded in ha_innobase::commit_inplace_alter_table().

buf_page_mtr_lock(): Drop a stale adaptive hash index if needed.

We will also reduce the number of btr_get_search_latch() calls
and enclose some more code inside #ifdef BTR_CUR_HASH_ADAPT
in order to benefit cmake -DWITH_INNODB_AHI=OFF.
2020-05-15 17:23:08 +03:00