- The commit 5fd3c7471e3e0673b50d309567c9747d36f09412(MDEV-24709)
resets the recv_no_ibuf_operations in
recv_recovery_from_checkpoint_start(), but InnoDB fails to reset
the variable recv_no_log_write() during that time and that leads
to the assert failure.
recv_recovery_from_checkpoint_start(): Clear the recv_no_ibuf_operations
flag at the same time when we enabled writes to the log.
The failure to clear the flag might have caused some missed
change buffer merges, at least to the secondary index of SYS_TABLES
that were accessed by trx_resurrect_table_locks() while the last
recovery batch was in progress.
Thanks to Thirunarayanan Balathandayuthapani for suggesting this fix.
This corresponds to 10.5 commit 39378e1366.
With a patched version of the test innodb.ibuf_not_empty (so that
it would trigger crash recovery after using the change buffer),
and patched code that would modify the os_thread_sleep() in
recv_apply_hashed_log_recs() to be 1ms as well as add a sleep of
the same duration to the end of recv_recover_page() when
recv_sys->n_addrs=0, we can demonstrate a race condition.
After disabling some debug checks in buf_all_freed_instance(),
buf_pool_invalidate_instance() and buf_validate(), we managed to
trigger an assertion failure in fseg_free_step(), on the XDES_FREE_BIT.
In other words, an trx_undo_seg_free() call during
trx_rollback_resurrected() was attempting a double-free of a page.
This was repeated about once in 400 to 500 test runs. With the fix
applied, the test passed 2,000 runs.
recv_apply_hashed_log_recs(): Do not only wait for recv_sys->n_addrs
to reach 0, but also wait for buf_get_n_pending_read_ios() to reach 0,
to guarantee that buf_page_io_complete() will not be executing
ibuf_merge_or_delete_for_page().
The function ibuf_merge_or_delete_for_page() was always being
invoked with update_ibuf_bitmap=true ever since
commit cd623508df
fixed up something after MDEV-9566.
Furthermore, the parameter page_size is never being passed as a
null pointer, and therefore it should better be a reference to
a constant object.
log_group_read_log_seg() returns error when:
1) Calculated log block number does not correspond to read log block
number. This can be caused by:
a) Garbage or an incompletely written log block. We can exclude this
case by checking log block checksum if it's enabled(see innodb-log-checksums,
encrypted log block contains checksum always).
b) The log block is overwritten. In this case checksum will be correct and
read log block number will be greater then requested one.
2) When log block length is wrong. In this case recv_sys->found_corrupt_log
is set.
3) When redo log block checksum is wrong. In this case innodb code
writes messages to error log with the following prefix: "Invalid log
block checksum."
The fix processes all the cases above.
buf_page_create() is invoked when page is initialized. So that
previous contents of the page ignored. In few cases, it calls
buf_page_get_gen() is called to fetch the page from buffer pool.
It should take x-latch on the page. If other thread uses the block
or block io state is different from BUF_IO_NONE then release the
mutex and check the state and buffer fix count again. For compressed
page, use the existing free block from LRU list to create new page.
Retry to fetch the compressed page if it is in flush list
fseg_create(), fseg_create_general(): Introduce block as a parameter
where segment header is placed. It is used to avoid repetitive
x-latch on the same page
Change the assert to check whether the page has SX latch and
X latch in all callee function of buf_page_create()
mtr_t::get_fix_count(): Get the buffer fix count of the given
block added by the mtr
FindBlock is added to find the buffer fix count of the given
block acquired by the mini-transaction
Regretfully, the parameter innodb_log_checksums was introduced
in MySQL 5.7.9 (the first GA release of that series) by
mysql/mysql-server@af0acedd88
which partly replaced a parameter that had been introduced in 5.7.8
mysql/mysql-server@22ba38218e
as innodb_log_checksum_algorithm.
Given that the CRC-32C operations are accelerated on many processor
implementations (AMD64 with SSE4.2; since MDEV-22669 also on IA-32
with SSE4.2, POWER 8 and later, ARMv8 with some extensions)
and by lookup tables when only generic SISD instructions are available,
there should be no valid reason to disable checksums.
In MariaDB 10.5.2, as a preparation for MDEV-12353, MDEV-19543 deprecated
and ignored the parameter innodb_log_checksums altogether. This should
imply that after a clean shutdown with innodb_log_checksums=OFF one
cannot upgrade to MariaDB Server 10.5 at all.
Due to these problems, let us deprecate the parameter innodb_log_checksums
and honor it only during server startup.
The command SET GLOBAL innodb_log_checksums will always set the
parameter to ON.
The purpose of the InnoDB doublewrite buffer is to make InnoDB
tolerant against cases where the server was killed in the middle
of a page write. (In Linux, killing a process may interrupt a
write system call, typically on a 4096-byte boundary.)
There may exist multiple copies of a page number in the doublewrite
buffer. Recovery should choose the latest valid copy of the page.
By design, the FIL_PAGE_LSN must not precede the latest checkpoint LSN
nor be later than the end of the recovered log.
For page_compressed and encrypted pages, we were missing proper
consistency checks. In the 10.4 data set generated for in MDEV-23231,
the data file contained a valid page_compressed page, and an
identical copy of that page was also present in the doublewrite
buffer. But, recovery would incorrectly consider the page invalid
and restore an uncompressed copy of the same page that had been
written before the log checkpoint. (In fact, no redo log was to
be applied to that page.)
buf_dblwr_process(): Validate the FIL_PAGE_LSN in the doublewrite
buffer pages, and always skip page 0, because those pages should
have been recovered by Datafile::restore_from_doublewrite() if
necessary.
Datafile::restore_from_doublewrite(): Choose the latest applicable
page from the doublewrite buffer.
recv_dblwr_t::find_page(): Also validate encrypted or
page_compressed pages.
recv_dblwr_t::validate_page(): New function to validate a page,
either a copy in a data file or in the doublewrite buffer.
Also validate encrypted or page_compressed pages.
This is joint work with Thirunarayanan Balathandayuthapani.
__pthread_cond_timedwait() in page cleaner hangs if os time moved
backwards.Workaround could be waking up the page cleaner thread in
logs_empty_and_mark_files_at_shutdown(). But there is possibility that
server could hang when server is running. So InnoDB should wake up page
cleaner thread periodically in srv_master_do_idle_tasks().
There can be multiple MLOG_CHECKPOINT record for the same checkpoint.
During recovery, InnoDB could encounter the previous MLOG_CHECKPOINT
for the checkpoint lsn. So the assertion
mlog_checkpoint_lsn <= recovered_lsn is wrong.
MemorySanitizer (clang -fsanitize=memory) requires that all code
be compiled with instrumentation enabled. The only exception is the
C runtime library. Failure to use instrumented libraries will cause
bogus messages about memory being uninitialized.
In WITH_MSAN builds, we must avoid calling getservbyname(),
because even though it is a standard library function, it is
not instrumented, not even in clang 10.
Note: Before MariaDB Server 10.5, ./mtr will typically fail
due to the old PCRE library, which was updated in MDEV-14024.
The following cmake options were tested on 10.5
in commit 94d0bb4dbe:
cmake \
-DCMAKE_C_FLAGS='-march=native -O2' \
-DCMAKE_CXX_FLAGS='-stdlib=libc++ -march=native -O2' \
-DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=Debug \
-DWITH_INNODB_{BZIP2,LZ4,LZMA,LZO,SNAPPY}=OFF \
-DPLUGIN_{ARCHIVE,TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT,SPIDER}=NO \
-DWITH_SAFEMALLOC=OFF \
-DWITH_{ZLIB,SSL,PCRE}=bundled \
-DHAVE_LIBAIO_H=0 \
-DWITH_MSAN=ON
MEM_MAKE_DEFINED(): An alias for VALGRIND_MAKE_MEM_DEFINED()
and __msan_unpoison().
MEM_GET_VBITS(), MEM_SET_VBITS(): Aliases for
VALGRIND_GET_VBITS(), VALGRIND_SET_VBITS(), __msan_copy_shadow().
InnoDB: Replace the UNIV_MEM_ macros with corresponding MEM_ macros.
ut_crc32_8_hw(), ut_crc32_64_low_hw(): Use the compiler built-in
functions instead of inline assembler when building WITH_MSAN.
This will require at least -msse4.2 when building for IA-32 or AMD64.
The inline assembler would not be instrumented, and would thus cause
bogus failures.
The background drop table queue in InnoDB is a work-around for
cases where the SQL layer is requesting DDL on tables on which
transactional locks exist.
One such case are XA transactions. Our test case exploits the
fact that the recovery of XA PREPARE transactions will
only resurrect InnoDB table locks, but not MDL that should
block any concurrent DDL.
srv_shutdown_t: Introduce the srv_shutdown_state=SRV_SHUTDOWN_INITIATED
for the initial part of shutdown, to wait for the background drop
table queue to be emptied.
srv_shutdown_bg_undo_sources(): Assign
srv_shutdown_state=SRV_SHUTDOWN_INITIATED
before waiting for the background drop table queue to be emptied.
row_drop_tables_for_mysql_in_background(): On slow shutdown, if
no active transactions exist (excluding ones that are in
XA PREPARE state), skip any tables on which locks exist.
row_drop_table_for_mysql(): Do not unnecessarily attempt to
drop InnoDB persistent statistics for tables that have
already been added to the background drop table queue.
row_mysql_close(): Relax an assertion, and free all memory
even if innodb_force_recovery=2 would prevent the background
drop table queue from being emptied.
Introduce a new ATTRIBUTE_NOINLINE to
ib::logger member functions, and add UNIV_UNLIKELY hints to callers.
Also, remove some crash reporting output. If needed, the
information will be available using debugging tools.
Furthermore, remove some fts_enable_diag_print output that included
indexed words in raw form. The code seemed to assume that words are
NUL-terminated byte strings. It is not clear whether a NUL terminator
is always guaranteed to be present. Also, UCS2 or UTF-16 strings would
typically contain many NUL bytes.
Maybe this patch will help catch problems like buffer overflow.
log_t::first_in_use: removed
log_t::buf: this is where mtr_t are supposed to append data
log_t::flush_buf: this is from server writes to a file
Those two buffers are std::swap()ped when some thread is gonna write
to a file
was restored.
Optionally rollback prepared XA's on "mariabackup --prepare".
The fix MUST NOT be ported on 10.5+, as MDEV-742 fix solves the issue for
slaves.
redo log during recovery
- InnoDB unnecessarily reads the page even though it has fully initialized
buffered redo log records. Allow the page initialization redo log to
apply for the page in buf_page_get_gen() during recovery.
- Renamed buf_page_get_gen() to buf_page_get_low()
- Newly added buf_page_get_gen() will check for buffered redo log for
the particular page id during recovery
- Added new function buf_page_mtr_lock() which basically latches the page
for the given latch type.
- recv_recovery_create_page() is inline function which creates a page
if it has page initialization redo log records.
- This issue is caused by MDEV-19176
(bba59abb03).
- Problem is that there is miscalculation of available memory during
recovery if innodb_buffer_pool_instances > 1.
- Ignore the buffer pool instance while calculating available_memory
- Removed recv_n_pool_free_frames variable and use buf_pool_get_n_pages()
instead.
- Moved the recv_sys->heap memory condition inside recv_parse_log_recs().
So that, InnoDB can mark the status as STORE_NO earlier.
- InnoDB uses one third of buffer pool chunk size for reading the redo
log records. In that case, we can avoid the scenario where buffer ran
out of memory issue during recovery.
The test encryption.innodb-redo-badkey was accidentally disabled
until commit 23657a2101 enabled
it recently. Once it was enabled, it started failing randomly.
recv_recover_corrupt_page(): Do not assume that any redo log exists
for the page. A page may be unnecessarily read by read-ahead.
When noting the corruption, reset recv_addr->state to RECV_PROCESSED,
so that even if the same page is re-read again, we will only
decrement recv_sys->n_addrs once.
The general reason why innodb redo log file is limited by 512G is that
log_block_convert_lsn_to_no() returns value limited by 1G. But there is no
need to have unique log block numbers in log group. The fix removes 512G
limit and limits log group size by
(uint32_t maximum value) * (minimum page size), which, in turns, can be
removed if fil_io() is no longer used for innodb redo log io.