The debug parameter innodb_simulate_comp_failures injected compression
failures for ROW_FORMAT=COMPRESSED tables, breaking the pre-existing
logic that I had implemented in the InnoDB Plugin for MySQL 5.1 to prevent
compressed page overflows. A much better check is already achieved by
defining UNIV_ZIP_COPY at the compilation time.
(Only UNIV_ZIP_DEBUG is part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON.)
This issue is caused by MDEV-22456 ad6171b91c. Fix involves the backported version of 10.4 patch
MDEV-22778 5f2628d1ee and few parts of
MDEV-17441 (e9a5f288f2).
dict_table_t::stats_latch_created: Removed
dict_table_t::stats_latch: make value member and always lock it for
simplicity even for stats cloned table.
zip_pad_info_t::mutex_created: Removed
zip_pad_info_t::mutex: make member value instead of pointer
os0once.h: Removed
dict_table_remove_from_cache_low(): Ensure that fts_free() is always
called, even if dict_mem_table_free() is deferred until
btr_search_lazy_free().
InnoDB would always zip_pad_info_t::mutex and
dict_table_t::autoinc_mutex, even for tables are not in
ROW_FORMAT=COMPRESSED nor include any AUTO_INCREMENT column.
InnoDB only reserves 13 bits for the heap number in the record header,
limiting the heap number to be at most 8191. But, when using
innodb_page_size=64k and secondary index records of 7 bytes each,
it is possible to exceed the maximum heap number.
btr_cur_optimistic_insert(): Let the operation fail if the
maximum number of records would be exceeded.
page_mem_alloc_heap(): Move to the same compilation unit with the
only caller, and let the operation fail if the maximum heap number
has been allocated already.
MemorySanitizer (clang -fsanitize=memory) requires that all code
be compiled with instrumentation enabled. The only exception is the
C runtime library. Failure to use instrumented libraries will cause
bogus messages about memory being uninitialized.
In WITH_MSAN builds, we must avoid calling getservbyname(),
because even though it is a standard library function, it is
not instrumented, not even in clang 10.
Note: Before MariaDB Server 10.5, ./mtr will typically fail
due to the old PCRE library, which was updated in MDEV-14024.
The following cmake options were tested on 10.5
in commit 94d0bb4dbe:
cmake \
-DCMAKE_C_FLAGS='-march=native -O2' \
-DCMAKE_CXX_FLAGS='-stdlib=libc++ -march=native -O2' \
-DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=Debug \
-DWITH_INNODB_{BZIP2,LZ4,LZMA,LZO,SNAPPY}=OFF \
-DPLUGIN_{ARCHIVE,TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT,SPIDER}=NO \
-DWITH_SAFEMALLOC=OFF \
-DWITH_{ZLIB,SSL,PCRE}=bundled \
-DHAVE_LIBAIO_H=0 \
-DWITH_MSAN=ON
MEM_MAKE_DEFINED(): An alias for VALGRIND_MAKE_MEM_DEFINED()
and __msan_unpoison().
MEM_GET_VBITS(), MEM_SET_VBITS(): Aliases for
VALGRIND_GET_VBITS(), VALGRIND_SET_VBITS(), __msan_copy_shadow().
InnoDB: Replace the UNIV_MEM_ macros with corresponding MEM_ macros.
ut_crc32_8_hw(), ut_crc32_64_low_hw(): Use the compiler built-in
functions instead of inline assembler when building WITH_MSAN.
This will require at least -msse4.2 when building for IA-32 or AMD64.
The inline assembler would not be instrumented, and would thus cause
bogus failures.
page_zip_fields_decode(): Do not dereference index=NULL.
Instead, return NULL early. The only caller does not care
about the values of output parameters in that case.
This bug was introduced in MySQL 5.7.6 by
mysql/mysql-server@9eae0edb7a
and in MariaDB 10.2.2 by
commit 2e814d4702.
Thanks to my son for pointing this out after investigating
the output of a static analysis tool.
Introduce a new ATTRIBUTE_NOINLINE to
ib::logger member functions, and add UNIV_UNLIKELY hints to callers.
Also, remove some crash reporting output. If needed, the
information will be available using debugging tools.
Furthermore, remove some fts_enable_diag_print output that included
indexed words in raw form. The code seemed to assume that words are
NUL-terminated byte strings. It is not clear whether a NUL terminator
is always guaranteed to be present. Also, UCS2 or UTF-16 strings would
typically contain many NUL bytes.
In my micro-benchmarks memcmp(4196) 3 times faster than old
implementation. Also, it's generally better to use as less
reinterpret_casts<> as possible.
buf_is_zeroes(): renamed from buf_page_is_zeroes() and
argument changed to span<> for convenience.
st_::span<T>::const_iterator: fixed
page_zip-verify_checksum(): make argument byte* instead of void*
actually, page_zip_verify_checksum() generally allows all-zeroes
checksums because our CRC32 checksum is something like
crc_1 ^ crc_2 ^ crc_3
Also, all zeroes page is considered correct.
As a side effect fix nasty reinterpret_cast<> UB
Also, since c0f47a4a58 innodb_checksum_algorithm=full_crc32
exists which computes CRC32 in one go (without bitwise arithmetic)
cmake -DWITH_INNODB_EXTRA_DEBUG:BOOL=ON
was broken ever since commit 8777458a6e
(MDEV-6076 Persistent AUTO_INCREMENT for InnoDB).
There is a race condition between page reads that call
page_zip_validate() (while holding clustered index root page S-latch)
and writes that update PAGE_ROOT_AUTO_INC
(with buf_block_t::lock SX-latch, compatible with S-latch).
page_zip_validate_low(): Skip the PAGE_ROOT_AUTO_INC field on
clustered index root pages in order to avoid false positives.
offset_t: this is a type which represents one record offset.
It's unsigned short int.
a lot of functions: replace ulint with offset_t
btr_pcur_restore_position_func(),
page_validate(),
row_ins_scan_sec_index_for_duplicate(),
row_upd_clust_rec_by_insert_inherit_func(),
row_vers_impl_x_locked_low(),
trx_undo_prev_version_build():
allocate record offsets on the stack instead of waiting for rec_get_offsets()
to allocate it from mem_heap_t. So, reducing memory allocations.
RECORD_OFFSET, INDEX_OFFSET:
now it's less convenient to store pointers in offset_t*
array. One pointer occupies now several offset_t. And those constant are start
indexes into array to places where to store pointer values
REC_OFFS_HEADER_SIZE: adjusted for the new reality
REC_OFFS_NORMAL_SIZE:
increase size from 100 to 300 which means less heap allocations.
And sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) now is 600 bytes which
is smaller than previous 800 bytes.
REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality
rem0rec.h, rem0rec.ic, rem0rec.cc:
various arguments, return values and local variables types were changed to
fix numerous integer conversions issues.
enum field_type_t:
offset types concept was introduces which replaces old offset flags stuff.
Like in earlier version, 2 upper bits are used to store offset type.
And this enum represents those types.
REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed
get_type(), set_type(), get_value(), combine():
these are convenience functions to work with offsets and it's types
rec_offs_base()[0]:
still uses an old scheme with flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL
rec_offs_base()[i]:
these have type offset_t now. Two upper bits contains type.
ut_rnd_interval(): Remove the first parameter, which was mostly
passed as 0. Implement as a simple wrapper around ut_rnd_gen().
Trivially return 0 if the size of the interval is smaller than 2.
ut_rnd_ulint_counter, ut_rnd_gen_next_ulint(), ut_rnd_gen_ulint(): Remove.
ut_rnd_set_seed(): Unused function; remove.
ut_rnd_gen(): Renamed from page_cur_lcg_prng().
ut_rnd_current: The internal state of ut_rnd_gen().
page_cur_open_on_rnd_user_rec(): Replace linear search with
page_rec_get_nth().
In MDEV-11369 (instant ADD COLUMN) in MariaDB Server 10.3,
we introduced the hidden metadata record that must be the
first record in the clustered index if and only if
index->is_instant() holds.
To catch MDEV-19783, in
commit ed0793e096 and
commit 99dc40d6ac
we added some assertions to find cases where
the metadata record is missing while it should not be, or a
record exists when it should not. Those assertions were invalid
when traversing the PAGE_FREE list. That list can contain anything;
we must only be able to determine the successor and the size of
each garbage record in it.
page_validate(), page_simple_validate_old(), page_simple_validate_new():
Do not invoke page_rec_get_next_const() for traversing the PAGE_FREE
list, but instead use a lower-level accessor that does not attempt to
validate the REC_INFO_MIN_REC_FLAG.
page_copy_rec_list_end_no_locks(),
page_copy_rec_list_start(), page_delete_rec_list_start():
Add assertions.
btr_page_get_split_rec_to_left(): Remove a redundant return value,
and make the output parameter the return value.
btr_page_get_split_rec_to_right(), btr_page_split_and_insert(): Clean up.
btr_cur_pessimistic_delete(): code changed in a way that allows
to put more REC_INFO_MIN_REC_FLAG assertions inside btr_set_min_rec_mark().
Without that change tests innodb.innodb-table-online,
innodb.temp_table_savepoint and innodb_zip.prefix_index_liftedlimit fail.
Removed basically duplicated page_zip_validate() calls
which fails because of temporary(!) invariant violation.
That fixed innodb_zip.wl5522_debug_zip and
innodb_zip.prefix_index_liftedlimit
The bug affects MariaDB Server 10.3 or later, but it makes sense
to improve CHECK TABLE in earlier versions already.
page_validate(): Check REC_INFO_MIN_REC_FLAG in the records.
This allows CHECK TABLE to catch more bugs.
Until now, InnoDB inefficiently compared the aligned fields
FIL_PAGE_PREV, FIL_PAGE_NEXT to the byte-order-agnostic value FIL_NULL.
This is a backport of 32170f8c6d
from MariaDB Server 10.3.
Some places didn't match the previous rules, making the Floor
address wrong.
Additional sed rules:
sed -i -e 's/Place.*Suite .*, Boston/Street, Fifth Floor, Boston/g'
sed -i -e 's/Suite .*, Boston/Fifth Floor, Boston/g'
The record MLOG_ZIP_PAGE_COMPRESS is similar to MLOG_INIT_FILE_PAGE2
that it contains all the information needed to initialize the page.
Like for the other record, do initialize the entire page on recovery.
The predicate page_is_root(), which was added in MariaDB Server 10.2.2,
is based on a wrong assumption.
Under some circumstances, InnoDB can transform B-trees into a degenerate
state where a non-leaf page has no sibling pages. Because of this,
we cannot assume that a page that has no siblings is the root page.
This bug will be tracked as MDEV-19022.
Because of the bug that may affect many InnoDB data files, we must remove
and replace the wrong predicate. Using the wrong predicate can cause
corruption. A leaf page is not allowed to be empty except if it is the
root page, and the entire table is empty.
In MySQL 5.7, it was noticed that files are not portable between
big-endian and little-endian processor architectures
(such as SPARC and x86), because the original implementation of
innodb_checksum_algorithm=crc32 was not byte order agnostic.
A byte order agnostic implementation of innodb_checksum_algorithm=crc32
was only added to MySQL 5.7, not backported to 5.6. Consequently,
MariaDB Server versions 10.0 and 10.1 only contain the CRC-32C
implementation that works incorrectly on big-endian architectures,
and MariaDB Server 10.2.2 got the byte-order agnostic CRC-32C
implementation from MySQL 5.7.
MySQL 5.7 introduced a "legacy crc32" variant that is functionally
equivalent to the big-endian version of the original crc32 implementation.
Thanks to this variant, old data files can be transferred from big-endian
systems to newer versions.
Introducing new variants of checksum algorithms (without introducing
new names for them, or something on the pages themselves to identify
the algorithm) generally is a bad idea, because each checksum algorithm
is like a lottery ticket. The more algorithms you try, the more likely
it will be for the checksum to match on a corrupted page.
So, essentially MySQL 5.7 weakened innodb_checksum_algorithm=crc32,
and MariaDB 10.2.2 inherited this weakening.
We introduce a build option that together with MDEV-17957
makes innodb_checksum_algorithm=strict_crc32 strict again
by only allowing one variant of the checksum to match.
WITH_INNODB_BUG_ENDIAN_CRC32: A new cmake option for enabling the
bug-compatible "legacy crc32" checksum. This is only enabled on
big-endian systems by default, to facilitate an upgrade from
MariaDB 10.0 or 10.1. Checked by #ifdef INNODB_BUG_ENDIAN_CRC32.
ut_crc32_byte_by_byte: Remove (unused function).
legacy_big_endian_checksum: Remove. This variable seems to have
unnecessarily complicated the logic. When the weakening is enabled,
we must always fall back to the buggy checksum.
buf_page_check_crc32(): A helper function to compute one or
two CRC-32C variants.
Also, apply the MDEV-17957 changes to encrypted page checksums,
and remove error message output from the checksum function,
because these messages would be useless noise when mariabackup
is retrying reads of corrupted-looking pages, and not that
useful during normal server operation either.
The error messages in fil_space_verify_crypt_checksum()
should be refactored separately.
Problem:
Innodb_checksum_algorithm checks for all checksum algorithm to
validate the page checksum even though the algorithm is specified as
strict_crc32, strict_innodb, strict_none.
Fix:
Remove the checks for all checksum algorithm to validate the page
checksum if the algo is specified as strict_* values.
An INSERT into a temporary table would fail to set the
index page as modified. If there were no other write operations
(such as UPDATE or DELETE) to the page, and the page was evicted,
we would read back the old contents of the page, causing
corruption or loss of data.
page_cur_insert_rec_write_log(): Call mtr_t::set_modified()
for temporary tables. Normally this is part of the mlog_open()
call, but the mlog_open() call was only present in debug builds.
This regression was caused by
commit 48192f963a
which was preparation for MDEV-11369 and supposed to affect
debug builds only.
Thanks to Thirunarayanan Balathandayuthapani for debugging.
recv_parse_log_recs(): Check for corruption before checking for
end-of-log-buffer.
mlog_parse_initial_log_record(), page_cur_parse_delete_rec():
Flag corruption for out-of-bounds values, and let the caller
dump the corrupted redo log extract.
Introduce the configuration option innodb_log_optimize_ddl
for controlling whether native index creation or table-rebuild
in InnoDB should keep optimizing the redo log
(and writing MLOG_INDEX_LOAD records to ensure that
concurrent backup would fail).
By default, we have innodb_log_optimize_ddl=ON, that is,
the default behaviour that was introduced in MariaDB 10.2.2
(with the merge of InnoDB from MySQL 5.7) will be unchanged.
BtrBulk::m_trx: Replaces m_trx_id. We must be able to check for
KILL QUERY even if !m_flush_observer (innodb_log_optimize_ddl=OFF).
page_cur_insert_rec_write_log(): Declare globally, so that this
can be called from PageBulk::insert().
row_merge_insert_index_tuples(): Remove the unused parameter trx_id.
row_merge_build_indexes(): Enable or disable redo logging based on
the innodb_log_optimize_ddl parameter.
PageBulk::init(), PageBulk::insert(), PageBulk::finish(): Write
redo log records if needed. For ROW_FORMAT=COMPRESSED, redo log
will be written in PageBulk::compress() unless we called
m_mtr.set_log_mode(MTR_LOG_NO_REDO).
Revert the dead code for MySQL 5.7 multi-master replication (GCS),
also known as
WL#6835: InnoDB: GCS Replication: Deterministic Deadlock Handling
(High Prio Transactions in InnoDB).
Also, make innodb_lock_schedule_algorithm=vats skip SPATIAL INDEX,
because the code does not seem to be compatible with them.
Add FIXME comments to some SPATIAL INDEX locking code. It looks
like Galera write-set replication might not work with SPATIAL INDEX.