Regretfully, the parameter innodb_log_checksums was introduced
in MySQL 5.7.9 (the first GA release of that series) by
mysql/mysql-server@af0acedd88
which partly replaced a parameter that had been introduced in 5.7.8
mysql/mysql-server@22ba38218e
as innodb_log_checksum_algorithm.
Given that the CRC-32C operations are accelerated on many processor
implementations (AMD64 with SSE4.2; since MDEV-22669 also on IA-32
with SSE4.2, POWER 8 and later, ARMv8 with some extensions)
and by lookup tables when only generic SISD instructions are available,
there should be no valid reason to disable checksums.
In MariaDB 10.5.2, as a preparation for MDEV-12353, MDEV-19543 deprecated
and ignored the parameter innodb_log_checksums altogether. This should
imply that after a clean shutdown with innodb_log_checksums=OFF one
cannot upgrade to MariaDB Server 10.5 at all.
Due to these problems, let us deprecate the parameter innodb_log_checksums
and honor it only during server startup.
The command SET GLOBAL innodb_log_checksums will always set the
parameter to ON.
Problem:
=======
InnoDB drops a column that has a foreign key relation on it. It then
tries to load the foreign key constraint during the rename step of the
COPY algorithm, even though foreign_key_checks is disabled.
Solution:
========
During the ALTER TABLE COPY algorithm, InnoDB now ignores the error while
loading the foreign key constraint if foreign_key_checks is disabled.
Instead, it issues a warning about the failure to load the foreign key
constraint.
This problem is caused by commit 6697135c6d (MDEV-21572). During recovery,
InnoDB prefetches the siblings of a change buffer index leaf page. The
prefetch is an asynchronous page read, and the recovery scenario was not
handled in buf_read_page_background(). This leads to the server refusing
to start up.
Solution:
=========
InnoDB shouldn't allow the change buffer index page siblings
to be prefetched.
fil_page_decompress(): Remove a rather useless debug check.
We should have test coverage for reading page_compressed pages
from files, either due to buffer pool page eviction or due to
server restarts.
A similar check was removed from fil_space_encrypt() in
commit 0b36c27e0c (MDEV-20307).
The usage message for the innodb_compression_algorithm system variable
did not list snappy, which was added as an optional compression algorithm
in MariaDB 10.1.3 and might actually work since
commit 90c52e5291 (MDEV-12615)
in MariaDB 10.1.24.
Unfortunately, we will also include unavailable compression algorithms
in the list, because ENUM parameters allow numeric values, and we do
not want innodb_compression_algorithm=3 to change its meaning depending
on how the source code was compiled.
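As a rough illustration of that constraint (the names and numbering below
are illustrative, not taken from the InnoDB source), the numeric position of
each algorithm must stay stable across builds, so unavailable algorithms
keep their slots:

/* Illustrative sketch only: an ENUM system variable maps numeric values to
   positions, so every build must list the algorithms in the same order,
   even when some of them were compiled out. */
enum compression_algorithm_sketch {
  ALGO_NONE   = 0,
  ALGO_ZLIB   = 1,
  ALGO_LZ4    = 2,   /* kept in the list even if liblz4 is unavailable */
  ALGO_LZO    = 3,   /* "=3" must mean the same thing in every build   */
  ALGO_LZMA   = 4,
  ALGO_BZIP2  = 5,
  ALGO_SNAPPY = 6
};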
InnoDB only reserves 13 bits for the heap number in the record header,
limiting the heap number to be at most 8191. But, when using
innodb_page_size=64k and secondary index records of 7 bytes each,
it is possible to exceed the maximum heap number.
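A quick calculation (a standalone sketch, not InnoDB code) shows why this
configuration can exhaust the 13-bit heap number:

#include <cstdio>

int main() {
  const unsigned max_heap_no = (1u << 13) - 1;  /* 13 bits => at most 8191  */
  const unsigned page_size   = 64 * 1024;       /* innodb_page_size=64k     */
  const unsigned rec_size    = 7;               /* tiny secondary index rec */
  /* Roughly 9362 such records fit in 65536 bytes, exceeding 8191. */
  std::printf("max heap number: %u, records that could fit: %u\n",
              max_heap_no, page_size / rec_size);
  return 0;
}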
btr_cur_optimistic_insert(): Let the operation fail if the
maximum number of records would be exceeded.
page_mem_alloc_heap(): Move to the same compilation unit with the
only caller, and let the operation fail if the maximum heap number
has been allocated already.
The debug assertion is bogus, and we had removed it in
commit b1ab211dee (MDEV-15053)
in the MariaDB Server 10.5 branch.
For a small data file, fil_space_extend_must_retry() would always
allocate a minimum size of 4*innodb_page_size.
It is possible that random read-ahead will be triggered for
a smaller file than this. In the observed case, the read-ahead
was triggered for a 6-page file that used ROW_FORMAT=COMPRESSED
with 8KiB page size. So, the desired file size was 49152 bytes (6*8192),
but the actual size was 65536 bytes (4*16384, that is, 4*innodb_page_size
with the default 16KiB page size).
innobase_pk_order_preserved(): Treat an added AUTO_INCREMENT
column in the same way as an added existing column.
In either case, the column values are not guaranteed to
be constant, and thus the ordering may change if such a column
is added before any existing PRIMARY KEY columns.
prepare_inplace_alter_table_dict(): Initialize
dict_table_t::persistent_autoinc before invoking
innobase_pk_order_preserved().
fil_system_t::keyrotate_next(): If space && space->is_in_rotation_list
does not hold, iterate from the start of the list.
In debug builds, we would typically have hit SIGSEGV because the
iterator would have wrapped a null pointer. It might also be that
we are dereferencing a stale pointer.
There is no test case, because the encryption is very nondeterministic
in nature, due to the use of background threads.
This scenario can be hit by setting the following:
SET GLOBAL innodb_encryption_threads=5;
SET GLOBAL innodb_encryption_rotate_key_age=0;
The test encryption.create_or_replace would occasionally fail,
because some fil_space_t::n_pending_ops would never be decremented.
fil_crypt_find_space_to_rotate(): If rotate_thread_t::should_shutdown()
holds due to innodb_encryption_threads having been reduced, do
release the reference.
fil_space_remove_from_keyrotation(), fil_space_next(): Declare the
functions static, simplify a little, and define in the same compilation
unit with the only caller, fil_crypt_find_space_to_rotate().
fil_crypt_key_mutex: Remove (unused).
lock_rec_has_to_wait_in_queue(): Remove an obviously redundant assertion
that was added in commit a8ec45863b
and also enclose a Galera-specific condition in #ifdef WITH_WSREP.
lock_rec_has_to_wait
wsrep_kill_victim
lock_rec_create_low
lock_rec_add_to_queue
DeadlockChecker::select_victim()
THD can't change from a normal transaction to a BF (brute force) transaction
here, thus there is no need to synchronize access in the wsrep_thd_is_BF
function.
lock_rec_has_to_wait_in_queue
Add condition that lock is not NULL and add assertions if we are in
strong state.
The purpose of the InnoDB doublewrite buffer is to make InnoDB
tolerant against cases where the server was killed in the middle
of a page write. (In Linux, killing a process may interrupt a
write system call, typically on a 4096-byte boundary.)
There may exist multiple copies of a page number in the doublewrite
buffer. Recovery should choose the latest valid copy of the page.
By design, the FIL_PAGE_LSN must not precede the latest checkpoint LSN
nor be later than the end of the recovered log.
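A minimal sketch of that validity criterion (standalone code with
hypothetical names; the real check is part of the doublewrite recovery code
described below):

#include <cstdint>

/* Hypothetical helper: a copy of a page is only applicable if its
   FIL_PAGE_LSN lies within [checkpoint LSN, end of recovered log]. */
static bool dblwr_copy_applicable(std::uint64_t page_lsn,
                                  std::uint64_t checkpoint_lsn,
                                  std::uint64_t recovered_end_lsn) {
  return page_lsn >= checkpoint_lsn && page_lsn <= recovered_end_lsn;
}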
For page_compressed and encrypted pages, we were missing proper
consistency checks. In the 10.4 data set generated for MDEV-23231,
the data file contained a valid page_compressed page, and an
identical copy of that page was also present in the doublewrite
buffer. But, recovery would incorrectly consider the page invalid
and restore an uncompressed copy of the same page that had been
written before the log checkpoint. (In fact, no redo log was to
be applied to that page.)
buf_dblwr_process(): Validate the FIL_PAGE_LSN in the doublewrite
buffer pages, and always skip page 0, because those pages should
have been recovered by Datafile::restore_from_doublewrite() if
necessary.
Datafile::restore_from_doublewrite(): Choose the latest applicable
page from the doublewrite buffer.
recv_dblwr_t::find_page(): Also validate encrypted or
page_compressed pages.
recv_dblwr_t::validate_page(): New function to validate a page,
either a copy in a data file or in the doublewrite buffer.
Also validate encrypted or page_compressed pages.
This is joint work with Thirunarayanan Balathandayuthapani.
row_vers_impl_x_locked_low(): clust_offsets may point to memory
that is allocated by mem_heap_alloc() and may have been freed.
For initializing clust_offsets, try to use the stack-allocated
buffer instead of a pointer that may point to freed memory.
This fixes a regression that was introduced in
commit f0aa073f2b (MDEV-20950).
Problem:
========
In row_merge_drop_indexes(), InnoDB drops only the index from the
dictionary and frees the index pages, but it keeps the index object
if the table is being used by other DML threads. It sets the online
status of the index to ONLINE_INDEX_ABORTED_DROPPED.
Removing the index from the dictionary doesn't remove the corresponding
adaptive hash index (AHI) entries of the index. When a block is being
reused, InnoDB tries to remove the AHI entries for the block, and it
fails if the index online status is ONLINE_INDEX_ABORTED_DROPPED.
Fix:
====
MDEV-22456 allows the AHI entries of the index to be dropped lazily,
so checking the online status in btr_search_drop_page_hash_index()
is meaningless and should be removed.
mysql/mysql-server@e00ad49edc,
which introduced WL#6326 in MySQL 5.7.2, added a buffer page
acquisition to the CHECK TABLE code (solely for the purpose of
obeying the changed latching order), but failed to check that
obeying the changed latching order), but failed to check that
a parent page actually exists. It would not necessarily exist in a
corrupted index where a parent page is missing pointer records
to child pages.
commit ad6171b91c (MDEV-22456)
introduced code to buf_page_create() that would lazily drop
adaptive hash index entries for an index that has been
evicted from the data dictionary cache.
Unfortunately, that call was missing adequate protection.
While btr_search_drop_page_hash_index(block) was executing,
the block could be reused for something else.
buf_page_create(): If btr_search_drop_page_hash_index() must be
invoked, pin the block before releasing the buf_pool->page_hash lock,
so that the block cannot be grabbed by other threads.
Problem:
=======
In btr_cur_optimistic_latch_leaves(), the left block is requested with
BUF_GET after the current block has been released. But there is no
guarantee that the left block is still available.
Fix:
====
(1) In btr_cur_optimistic_latch_leaves(), replace the BUF_GET with
BUF_GET_POSSIBLY_FREED for fetching left block.
(2) Once InnoDB acquires the left block, it should check that the block's
FIL_PAGE_NEXT matches the current block's page number. If it does not,
release cursor->left_block and return false.
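A simplified sketch of check (2) (placeholder code, not the actual
btr_cur_optimistic_latch_leaves() change; the FIL_PAGE_NEXT offset is an
assumption here):

#include <cstddef>
#include <cstdint>

static const std::size_t FIL_PAGE_NEXT_OFFSET = 12;  /* assumed header offset */

/* The page header stores the successor page number big-endian. */
static std::uint32_t read_be32(const unsigned char* p) {
  return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16)
       | (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

/* After fetching the left block with BUF_GET_POSSIBLY_FREED, trust it only
   if it still links forward to the current page. */
static bool left_block_still_linked(const unsigned char* left_page_frame,
                                    std::uint32_t current_page_no) {
  return read_be32(left_page_frame + FIL_PAGE_NEXT_OFFSET) == current_page_no;
}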
THD proc info was assigned from a stack-allocated temporary buffer
which went out of scope immediately after the assignment.
Fixed by removing the temporary buffer and assigning the proc info
from a string literal.
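The bug pattern in a generic form (illustrative code, not the server's
actual source):

#include <cstdio>

const char* proc_info;  /* stand-in for THD::proc_info, a raw pointer */

void buggy() {
  char buf[64];                                   /* stack-allocated buffer */
  std::snprintf(buf, sizeof buf, "checking permissions (%d)", 3);
  proc_info = buf;  /* BUG: buf goes out of scope when this function returns */
}

void fixed() {
  proc_info = "checking permissions";  /* string literal: static storage */
}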
Problem:
========
dict_load_table_one() doesn't handle the scenario where the clustered
index root page is FIL_NULL when the DICT_ERR_IGNORE_INDEX_ROOT mode
is set.
Fix:
====
InnoDB should set the file_unreadable flag when it cannot find the
clustered index root page.
Problem:
=======
fts_cache_append_deleted_doc_ids() holds the deleted_lock and tries to
access the size of deleted_doc_ids. In the meantime, fts_cache_clear()
frees the sync_heap before clearing deleted_doc_ids. This leads to an
invalid access of deleted_doc_ids.
Fix:
===
fts_cache_clear() should free the sync_heap after clearing
deleted_doc_ids.
- Due to commit fe95cb2e40 (MDEV-16125), the InnoDB master thread does
not need to call srv_resume_thread() and therefore there is no need to
wake up the thread. Due to that patch, the following dead code can be
removed:
srv_check_activity(): Make the parameter an in/out parameter and return
the recent activity value
innobase_active_small(): Removed
srv_active_wake_master_thread(): Removed
srv_wake_master_thread(): Removed
srv_active_wake_master_thread_low(): Removed
Simplify srv_master_thread(): remove the switch cases and add an assertion.
Replace srv_wake_master_thread() with srv_inc_activity_count()
INNOBASE_WAKE_INTERVAL: Removed
The srv_monitor_event and the srv_monitor_thread would not be
created when InnoDB is in read-only mode. Yet, some code would
unconditionally invoke os_event_set(srv_monitor_event).
__pthread_cond_timedwait() in the page cleaner hangs if the OS time moved
backwards. A workaround could be to wake up the page cleaner thread in
logs_empty_and_mark_files_at_shutdown(), but there is still a possibility
that the server could hang while it is running. So InnoDB should wake up
the page cleaner thread periodically in srv_master_do_idle_tasks().
There can be multiple MLOG_CHECKPOINT records for the same checkpoint.
During recovery, InnoDB could encounter a previous MLOG_CHECKPOINT record
for the checkpoint LSN. So the assertion
mlog_checkpoint_lsn <= recovered_lsn is wrong.
When InnoDB is extending a data file, it is updating the FSP_SIZE
field in the first page of the data file.
In commit 8451e09073 (MDEV-11556)
we removed a work-around for this bug and made recovery stricter,
by making it track changes to FSP_SIZE via redo log records, and
extend the data files before any changes are being applied to them.
It turns out that the function fsp_fill_free_list() is not crash-safe
with respect to this when it is initializing the change buffer bitmap
page (page 1, or generally, N*innodb_page_size+1). It uses a separate
mini-transaction that is committed (and will be written to the redo
log file) before the mini-transaction that actually extended the data
file. Hence, recovery can observe a reference to a page that is
beyond the current end of the data file.
fsp_fill_free_list(): Initialize the change buffer bitmap page in
the same mini-transaction.
The rest of the changes are fixing a bug that the use of the separate
mini-transaction was attempting to work around. Namely, we must ensure
that no other thread will access the change buffer bitmap page before
our mini-transaction has been committed and all page latches have been
released.
That is, for read-ahead as well as neighbour flushing, we must avoid
accessing pages that might not yet be durably part of the tablespace.
fil_space_t::committed_size: The size of the tablespace
as persisted by mtr_commit().
fil_space_t::max_page_number_for_io(): Limit the highest page
number for I/O batches to committed_size.
MTR_MEMO_SPACE_X_LOCK: Replaces MTR_MEMO_X_LOCK for fil_space_t::latch.
mtr_x_space_lock(): Replaces mtr_x_lock() for fil_space_t::latch.
mtr_memo_slot_release_func(): When releasing MTR_MEMO_SPACE_X_LOCK,
copy space->size to space->committed_size. In this way, read-ahead
or flushing will never be invoked on pages that do not yet exist
according to FSP_SIZE.
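A simplified sketch of the idea (illustrative types and member functions,
not the actual fil_space_t definition):

#include <cstdint>

/* Read-ahead and neighbour flushing must not touch pages beyond the size
   that a committed mini-transaction has made durable via FSP_SIZE. */
struct space_sketch {
  std::uint32_t size;            /* current size in pages; may not be durable */
  std::uint32_t committed_size;  /* size as persisted by mtr_commit()         */

  bool page_allowed_for_io(std::uint32_t page_no) const {
    return page_no < committed_size;
  }
};

/* When MTR_MEMO_SPACE_X_LOCK is released, the committed size catches up. */
inline void on_release_space_x_lock(space_sketch& s) {
  s.committed_size = s.size;
}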
In commit 0f90728bc0 (MDEV-16809)
we introduced the configuration option innodb_log_optimize_ddl
for controlling whether native index creation or table-rebuild
in InnoDB should avoid writing full redo log.
Fungo Wang reported that this option is causing occasional failures.
The reason is that pages may be written to data files in an
inconsistent state. Applying log records to such inconsistent pages
may fail.
The solution is to always invoke PageBulk::finish() before page latches
may be released, to ensure that the page contents are in a consistent
state.
Something similar was implemented in MySQL 8.0.13:
mysql/mysql-server@d1254b9473
buf_block_t::skip_flush_check: Remove. Suppressing consistency checks
is a bad idea.
PageBulk::needs_finish(): New predicate: Determine whether
PageBulk::finish() must fix up the page.
PageBulk::init(): Clear PAGE_DIRECTION to ensure that needs_finish()
will hold. We change the field from PAGE_NO_DIRECTION to 0
and back without writing redo log. This trick avoids the need
to introduce any new data member to PageBulk.
PageBulk::insert(): Replace some high-level accessors to bypass
debug assertions related to PAGE_HEAP_TOP that we will be violating
until finish() has been executed.
PageBulk::finish(): Tolerate m_rec_no==0. We must invoke this also
on an empty page, to ensure that PAGE_HEAP_TOP is initialized.
PageBulk::commit(): Always invoke finish().
PageBulk::release(), BtrBulk::pageSplit(), BtrBulk::storeExt(),
BtrBulk::finish(): Invoke PageBulk::finish().
This issue was originally reported by Fungo Wang, along with a fix, as
MySQL Bug #98990.
His suggested fix was applied as part of
mysql/mysql-server@a003fc373d
and released in MySQL 5.7.31.
i_s_metrics_fill(): Add the missing call to Field::set_notnull(),
and simplify some code.
Problem:
=======
- Read operations are always allowed to hold a secondary index leaf
latch and then look up the corresponding clustered index record.
The flush table operation acquires a secondary index latch while holding
a clustered index latch. This violates the latching order and can lead
to a deadlock.
Fix:
====
- The flush table operation should acquire the secondary index latch
before the clustered index latch, to avoid the latching order violation
with read operations.
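In generic terms (an illustrative sketch, not InnoDB's latching code), two
threads taking the same pair of latches in opposite orders is what makes
the deadlock possible; the fix makes both paths use the readers' order:

#include <mutex>

std::mutex secondary_index_latch;
std::mutex clustered_index_latch;

/* Readers: secondary index leaf latch first, then the clustered index. */
void reader_path() {
  std::lock_guard<std::mutex> s(secondary_index_latch);
  std::lock_guard<std::mutex> c(clustered_index_latch);
  /* ... look up the clustered index record ... */
}

/* The buggy flush path took clustered_index_latch first and then
   secondary_index_latch; combined with reader_path() this can deadlock.
   After the fix it follows the same order as readers. */
void fixed_flush_path() {
  std::lock_guard<std::mutex> s(secondary_index_latch);
  std::lock_guard<std::mutex> c(clustered_index_latch);
  /* ... flush work ... */
}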
MemorySanitizer (clang -fsanitize=memory) requires that all code
be compiled with instrumentation enabled. The only exception is the
C runtime library. Failure to use instrumented libraries will cause
bogus messages about memory being uninitialized.
In WITH_MSAN builds, we must avoid calling getservbyname(),
because even though it is a standard library function, it is
not instrumented, not even in clang 10.
Note: Before MariaDB Server 10.5, ./mtr will typically fail
due to the old PCRE library, which was updated in MDEV-14024.
The following cmake options were tested on 10.5
in commit 94d0bb4dbe:
cmake \
-DCMAKE_C_FLAGS='-march=native -O2' \
-DCMAKE_CXX_FLAGS='-stdlib=libc++ -march=native -O2' \
-DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=Debug \
-DWITH_INNODB_{BZIP2,LZ4,LZMA,LZO,SNAPPY}=OFF \
-DPLUGIN_{ARCHIVE,TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT,SPIDER}=NO \
-DWITH_SAFEMALLOC=OFF \
-DWITH_{ZLIB,SSL,PCRE}=bundled \
-DHAVE_LIBAIO_H=0 \
-DWITH_MSAN=ON
MEM_MAKE_DEFINED(): An alias for VALGRIND_MAKE_MEM_DEFINED()
and __msan_unpoison().
MEM_GET_VBITS(), MEM_SET_VBITS(): Aliases for
VALGRIND_GET_VBITS(), VALGRIND_SET_VBITS(), __msan_copy_shadow().
InnoDB: Replace the UNIV_MEM_ macros with corresponding MEM_ macros.
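A sketch of how such an alias can be wired up (the actual definition lives
in the server's memory-checking header and may differ; HAVE_valgrind is the
build flag assumed here):

/* Sketch only: route MEM_MAKE_DEFINED() to whichever checker is in use. */
#ifndef __has_feature
# define __has_feature(x) 0           /* for compilers without __has_feature */
#endif

#if defined(HAVE_valgrind)
# include <valgrind/memcheck.h>
# define MEM_MAKE_DEFINED(addr, len) VALGRIND_MAKE_MEM_DEFINED(addr, len)
#elif __has_feature(memory_sanitizer)
# include <sanitizer/msan_interface.h>
# define MEM_MAKE_DEFINED(addr, len) __msan_unpoison(addr, len)
#else
# define MEM_MAKE_DEFINED(addr, len) ((void) 0)
#endif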
ut_crc32_8_hw(), ut_crc32_64_low_hw(): Use the compiler built-in
functions instead of inline assembler when building WITH_MSAN.
This will require at least -msse4.2 when building for IA-32 or AMD64.
The inline assembler would not be instrumented, and would thus cause
bogus failures.
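For example, on IA-32/AMD64 the SSE4.2 CRC-32C instruction is available
through compiler intrinsics, which MemorySanitizer can instrument (a
standalone sketch, not the actual ut_crc32 code; compile with -msse4.2):

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <nmmintrin.h>  /* SSE4.2 intrinsics: _mm_crc32_u8, _mm_crc32_u64 */

static std::uint32_t crc32c(std::uint32_t crc,
                            const unsigned char* buf, std::size_t len) {
  crc = ~crc;
  while (len >= 8) {
    std::uint64_t chunk;
    std::memcpy(&chunk, buf, 8);                 /* unaligned-safe load */
    crc = static_cast<std::uint32_t>(_mm_crc32_u64(crc, chunk));
    buf += 8;
    len -= 8;
  }
  while (len--) crc = _mm_crc32_u8(crc, *buf++);
  return ~crc;
}

int main() {
  const unsigned char data[] = "123456789";
  std::printf("%#x\n", (unsigned) crc32c(0, data, 9)); /* expected 0xe3069283 */
  return 0;
}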