The InnoDB write-ahead log ib_logfile0 is of fixed size,
specified by innodb_log_file_size. If the tail of the log
manages to overwrite the head (latest checkpoint) of the log,
crash recovery will be broken.
Let us clarify the messages about this, including adding
a message on the completion of a log checkpoint that notes
that the dangerous situation is over.
To reproduce the dangerous scenario, we will introduce the
debug injection label ib_log_checkpoint_avoid_hard, which will
avoid log checkpoints even harder than the previous
ib_log_checkpoint_avoid.
log_t::overwrite_warned: The first known dangerous log sequence number.
Set in log_close() and cleared in log_write_checkpoint_info(),
which will output a "Crash recovery was broken" message.
btr_search_guess_on_hash() would only acquire an index page latch if it
is invoked with ahi_latch=NULL. If it's invoked from
row_sel_try_search_shortcut_for_mysql() with ahi_latch!=NULL, a page
will not be latched, and row_search_mvcc() will get a pointer to the
record, which can be changed by some other transaction before the record
was stored in result buffer with row_sel_store_mysql_rec() call.
ahi_latch argument of btr_cur_search_to_nth_level_func() and
btr_pcur_open_with_no_init_func() is used only for
row_sel_try_search_shortcut_for_mysql().
btr_cur_search_to_nth_level_func(..., ahi_latch !=0, ...) is invoked
only from btr_pcur_open_with_no_init_func(..., ahi_latch !=0, ...),
which, in turns, is invoked only from
row_sel_try_search_shortcut_for_mysql().
I suppose that separate case with ahi_latch!=0 was intentionally
implemented to protect row_sel_store_mysql_rec() call in
row_search_mvcc() just after row_sel_try_search_shortcut_for_mysql()
call. After the ahi_latch was moved from row_seach_mvcc() to
row_sel_try_search_shortcut_for_mysql(), there is no need in it at all
if btr_search_guess_on_hash() latches a page unconditionally. And if
btr_search_guess_on_hash() latched the page, any access to the record in
row_sel_try_search_shortcut_for_mysql() after btr_pcur_open_with_no_init()
call will be protected with the page latch.
The fix is to remove ahi_latch argument from
btr_pcur_open_with_no_init_func(), btr_cur_search_to_nth_level_func()
and btr_search_guess_on_hash().
There will not be test, as to test it we need to freeze some SELECT
execution in the point between row_sel_try_search_shortcut_for_mysql()
and row_sel_store_mysql_rec() calls in row_search_mvcc(), and to change
the record in some other transaction to let row_sel_store_mysql_rec() to
store changed record in result buffer. Buf we can't do this with the
fix, as the page will be latched in btr_search_guess_on_hash() call.
Let us use the normal platform-specific preprocessor symbols
__linux__, __sun__, _AIX instead of some homebrew ones.
The preprocessor symbol UNIV_HPUX must have lost its meaning
by f6deb00a56 (note: the symbol
UNIV_HPUX10 is being checked for, but only UNIV_HPUX is defined).
buf_defer_drop_ahi(): Remove. Ever since
commit c7f8cfc9e7 (MDEV-27700)
it is safe to invoke btr_search_drop_page_hash_index(block, true)
to remove an orphan adaptive hash index.
Any attempt to upgrade page latches is prone to deadlocks. Recently,
we observed a few hangs that involved nothing more than a small table
consisting of one clustered index page, one secondary index page and
some undo pages.
Reason:
=======
Race condition between btr_search_drop_hash_index() and
btr_search_lazy_free(). One thread does resizing of buffer pool
and clears the ahi on all pages in the buffer pool, frees the
index and table while removing the last reference. At the same time,
other thread access index->heap in btr_search_drop_hash_index().
Solution:
=========
Acquire the respective ahi latch before checking index->freed()
btr_search_drop_page_hash_index(): Added new parameter to indicate
that drop ahi entries only if the index is marked as freed
btr_search_check_marked_free_index(): Acquire all ahi latches and
return true if the index was freed
Before version 10, GCC would think that a right shift of an
unsigned char returns int. Let us explicitly cast that back,
to silence a bogus -Wconversion warning.
buf_page_print(): Dump the buffer page 32 bytes (64 hexadecimal digits)
per line. In this way, the limitation in mtr
("Data too long for column 'line'") will not be triggered.
Also, do not bother decoding the page contents, because everything
is present in the hexadecimal output.
dict_index_find_on_id_low(): Merge to dict_index_get_if_in_cache_low().
The direct call in buf_page_print() was prone to crashing, in case the
table definition was concurrently evicted or dropped from the
data dictionary cache.
In commit 73fee39ea6 (MDEV-27985)
a regression was introduced that would cause bpage=nullptr to
be referenced.
buf_flush_LRU_list_batch(): Always terminate the loop upon
encountering a null pointer.
buf_flush_page(): Never wait for a page latch, even in checkpoint
flushing (flush_type == BUF_FLUSH_LIST), to prevent a hang of the
page cleaner threads when a large number of pages is latched.
In mysql/mysql-server@9542f3015b
it was claimed that such a hang only affects CREATE FULLTEXT INDEX.
Their fix was to retain buffer-fix but release exclusive latch
on non-leaf pages, and subsequently write to those pages while
they are not associated with the mini-transaction, which would
trip a debug assertion in the MariaDB version of
mtr_t::memo_modify_page() and cause potential corruption
when using the default MariaDB setting innodb_log_optimize_ddl=OFF.
This change essentially backports a small part of
commit 7cffb5f6e8 (MDEV-23399)
from MariaDB Server 10.5.7.
This is a backport of commit 4489a89c71
in order to remove the test innodb.redo_log_during_checkpoint
that would cause trouble in the DBUG subsystem invoked by
safe_mutex_lock() via log_checkpoint(). Before
commit 7cffb5f6e8
these mutexes were of different type.
The following options were introduced in
commit 2e814d4702 (mariadb-10.2.2)
and have little use:
innodb_disable_resize_buffer_pool_debug had no effect even in
MariaDB 10.2.2 or MySQL 5.7.9. It was introduced in
mysql/mysql-server@5c4094cf49
to work around a problem that was fixed in
mysql/mysql-server@2957ae4f99
(but the parameter was not removed).
innodb_page_cleaner_disabled_debug and innodb_master_thread_disabled_debug
are only used by the test innodb.redo_log_during_checkpoint
that will be removed as part of this commit.
innodb_dict_stats_disabled_debug is only used by that test,
and it is redundant because one could simply use
innodb_stats_persistent=OFF or the STATS_PERSISTENT=0 attribute
of the table in the test to achieve the same effect.
The comparison on the checkpoint age (number of log bytes
written since the previous checkpoint) is inaccurate, because
the previous FILE_CHECKPOINT record could span two 512-byte
log blocks, which will cause the LSN to increase by the size of the
log block header and footer.
We will still generate a redudant checkpoint if the previous
checkpoint wrote some FILE_MODIFY records before the FILE_CHECKPOINT
record.
Only one checkpoint may be in progress at a time.
The counter log_sys.n_pending_checkpoint_writes
was being protected by log_sys.mutex.
Let us replace it with the Boolean log_sys.checkpoint_pending.
buf_pool_t::watch_unset(): Reorder some code so that
no warning will be emitted in CMAKE_BUILD_TYPE=RelWithDebInfo.
It is unclear why invoking watch_is_sentinel() before
buf_fix_count() would make the warning disappear.
In commit 437da7bc54 (MDEV-19534),
the default value of the global variable srv_checksum_algorithm
in innochecksum was changed from SRV_CHECKSUM_ALGORITHM_INNODB
to implied 0 (innodb_checksum_algorithm=crc32). As a result,
the function buf_page_is_corrupted() would by default invoke
buf_calc_page_crc32() in innochecksum, and crc32_inited would hold.
This would cause "innochecksum" to fail on a particular page.
The actual problem is older, introduced in 2011 in
mysql/mysql-server@17e497bdb7
(MySQL 5.6.3). It should affect the validation of pages of old
data files that were written with innodb_checksum_algorithm=innodb.
When using innodb_checksum_algorithm=crc32 (the default setting
since MariaDB Server 10.2), some valid pages would be rejected
only because exactly one of the two checksum fields accidentally
matches the innodb_checksum_algorithm=crc32 value.
buf_page_is_corrupted(): Simplify the logic of non-strict
checksum validation, by always invoking buf_calc_page_crc32().
Remove a bogus condition that if only one of the checksum fields
contains the value returned by buf_calc_page_crc32(), the page
is corrupted.
buf_flush_freed_pages(): Assert that neither buf_pool.mutex
nor buf_pool.flush_list_mutex are held. Simplify the loops.
Return the tablespace and the number of pages written or punched.
buf_flush_LRU_list_batch(), buf_do_flush_list_batch():
Release buf_pool.mutex before invoking buf_flush_space().
buf_flush_list_space(): Acquire the mutexes only after invoking
buf_flush_freed_pages().
Reviewed by: Thirunarayanan Balathandayuthapani
As btrfs showed, a partial read of data in AIO /O_DIRECT circumstances can
really confuse MariaDB.
Filipe Manana (SuSE)[1] showed how database programmers can assume
O_DIRECT is all or nothing.
While a fix was done in the kernel side, we can do better in our code by
requesting that the rest of the block be read/written synchronously if
we do only get a partial read/write.
Per the APIs, a partial read/write can occur before an error, so
reattempting the request will leave the caller with a concrete error to
handle.
[1] https://lore.kernel.org/linux-btrfs/CABVffENfbsC6HjGbskRZGR2NvxbnQi17gAuW65eOM+QRzsr8Bg@mail.gmail.com/T/#mb2738e675e48e0e0778a2e8d1537dec5ec0d3d3a
Also spell synchronously correctly in other files.
Problem:
=======
InnoDB ran out of memory during recovery and it fails to
flush the dirty LRU blocks. The reason is that buffer pool
can ran out before the LRU list length reaches
BUF_LRU_OLD_MIN_LEN(256) threshold.
Fix:
====
During recovery, InnoDB should write out and evict all
dirty blocks.
.. to be the same as startup.
In resolving MDEV-27461, BUF_LRU_MIN_LEN (256) is the minimum number of
pages for the innodb buffer pool size. Obviously we need more than just
flushing pages. Taking the 16k page size and its default minimum, an
extra 25% is needed on top of the flushing pages to make a workable buffer
pool.
The minimum innodb_buffer_pool_chunk_size (1M) restricts the minimum
otherwise we'd have a pool made up of different chunk sizes.
The resulting minimum innodb buffer pool sizes are:
Page Size, Previously minimum (startup), with change.
4k 5M 2M
8k 5M 3M
16k 5M 5M
32k 24M 10M
64k 24M 20M
With this patch, SET GLOBAL innodb_buffer_pool_size minimums are
enforced.
The evident minimum system variable size for innodb_buffer_pool_size
is 2M, however this is only setable if using 4k page size. As
the order of the page_size and buffer_pool_size aren't fixed, we can't
hide this change.
Subsequent changes:
* innodb_buffer_pool_resize_with_chunks.test - raised of pool resize due to new
minimums. Chunk size also needed increase as the test was for
pool_size < chunk_size to generate a warning.
* Removed srv_buf_pool_min_size and replaced use with MYSQL_SYSVAR_NAME(buffer_pool_size).min_val
* Removed srv_buf_pool_def_size and replaced constant defination in
MYSQL_SYSVAR_LONGLONG(buffer_pool_size)
* Reordered ha_innodb to allow for direct use of MYSQL_SYSVAR_NAME(buffer_pool_size).min_val
* Moved buf_pool_size_align into ha_innodb to access to MYSQL_SYSVAR_NAME(buffer_pool_size).min_val
* loose-innodb_disable_resize_buffer_pool_debug is needed in the
innodb.restart.opt test so that under debug mode, resizing of the
innodb buffer pool can occur.
In commit 4c3ad24413 (MDEV-27416)
an unnecessarily strict wait condition was introduced in the
function buf_flush_wait(). Most callers actually only care that
the pages have been flushed, not that a checkpoint has completed.
Only in the buf_flush_sync() call for log resizing, we might care
about the log checkpoint. But, in fact,
srv_prepare_to_delete_redo_log_file() is explicitly disabling
checkpoints. So, we can simply remove the unnecessary wait loop.
Thanks to Krunal Bauskar for reporting this performance regression
that we failed to repeat in our testing.