If a table has no unique indexes, write set key information will be collected on all columns in the table.
The write set key information has room for at most 3500 bytes per column, and if a varchar column of such a table is longer than
this limit, a crash currently follows.
The fix in this commit is to truncate key values extracted from such long varchar columns to at most 3500 bytes.
This may lead to false positive certification failures for transactions that operate on separate cluster nodes and update, insert, or delete table rows that differ only in the part of such long columns beyond the 3500-byte boundary.
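The effect of the truncation can be illustrated with a minimal sketch. The constant and function names below are hypothetical and are not the identifiers used in the wsrep code; only the 3500-byte limit comes from the description above.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Limit taken from the commit message; the constant name is hypothetical.
constexpr std::size_t MAX_KEY_PART_LEN = 3500;

// Return the bytes of a column value that would go into the write set key,
// cut to at most MAX_KEY_PART_LEN bytes.
std::string_view key_part_for_write_set(std::string_view column_value)
{
  return column_value.substr(0, MAX_KEY_PART_LEN);
}

// In this simplified model, two rows conflict during certification
// if their truncated key parts compare equal.
bool keys_conflict(std::string_view a, std::string_view b)
{
  return key_part_for_write_set(a) == key_part_for_write_set(b);
}
```

Two 4000-byte values that differ only in their last byte yield the same key part here, which is the false positive mentioned above.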
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
trx_rseg_header_create(): Add a parameter for the value that is
to be written to TRX_RSEG_MAX_TRX_ID. If we omit this write, then
the updated test innodb.undo_truncate will fail for the 4k, 8k, 16k
page sizes. This was broken ever since
commit 947efe17ed (MDEV-15158)
removed the writes of transaction identifiers to the TRX_SYS page.
srv_do_purge(): Truncate undo tablespaces also during slow shutdown
(innodb_fast_shutdown=0).
Thanks to Krunal Bauskar for noticing this problem.
This patch is the plan D variant for fixing potential mutex locking
order problems exercised by BF aborting and KILL command execution.
In this approach, the KILL command is replicated as a TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting happening in parallel with KILL command execution
either, and potential mutex deadlocks between the different mutex
access paths of KILL command execution and BF aborting cannot
happen.
TOI replication is used, in this approach, purely as a means
to provide isolated KILL command execution in the first node.
The KILL command should not (and must not) be applied in secondary
nodes. In this patch, we ensure this by skipping KILL
execution in secondary nodes, in the applying phase, where we
bail out if an applier thread is trying to execute a KILL command.
This is effective, but skipping the applying of the KILL command
could happen much earlier as well.
This patch also fixes the mutex locking order and unprotected
THD member accesses in the BF aborting case. We try to hold
THD::LOCK_thd_data during BF aborting. The only case where this
is not possible is in wsrep_abort_transaction before
the call to wsrep_innobase_kill_one_trx, where we take InnoDB
mutexes first and then THD::LOCK_thd_data.
This will also fix a possible race condition during
close_connection and while wsrep is disconnecting
connections.
Added the wsrep_bf_kill_debug test case.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
At least since commit 055a3334ad
(MDEV-13564) the undo log truncation in InnoDB did not work correctly.
The main issue is that during the execution of
trx_purge_truncate_history() some pages of the newly truncated
undo tablespace could be discarded.
fsp_try_extend_data_file(): Apply the peculiar rounding of
fil_space_t::size_in_header only to the system tablespace,
whose size can be expressed in megabytes in a configuration parameter.
Other files may freely grow by a number of pages.
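A minimal sketch of the rounding rule described above, assuming that the size is tracked in pages and that the page size is a power of two. The function names and the boolean flag are hypothetical stand-ins for the real checks in fsp_try_extend_data_file().

```cpp
#include <cstdint>

// Number of pages in one megabyte for the given page size in bytes.
constexpr std::uint32_t pages_per_megabyte(std::uint32_t page_size)
{
  return (1U << 20) / page_size;
}

// Size, in pages, to record in the tablespace header after an extension.
// is_system_tablespace is a hypothetical flag for illustration.
std::uint32_t size_in_header_after_extend(std::uint32_t old_pages,
                                          std::uint32_t added_pages,
                                          std::uint32_t page_size,
                                          bool is_system_tablespace)
{
  std::uint32_t pages = old_pages + added_pages;
  if (is_system_tablespace) {
    // Only the system tablespace ignores fragments of a full megabyte.
    pages -= pages % pages_per_megabyte(page_size);
  }
  return pages;
}
```

With 16KiB pages (64 pages per megabyte), extending a 640-page file by 10 pages records 640 pages for the system tablespace but 650 pages for any other file.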
fseg_alloc_free_page_low(): Do allow the extension of undo tablespaces,
and mention the file name in the error message.
mtr_t::commit_shrink(): Implement crash-safe shrinking of a tablespace
file. First, durably write the log, then shrink the file, and finally
release the page latches of the rebuilt tablespace. Refactored from
trx_purge_truncate_history().
log_write_and_flush_prepare(), log_write_and_flush(): New functions
to durably write log during mtr_t::commit_shrink().
- Handle stored function conditions correctly, with the same logic as with UDFs.
- When running queries on Spider SE, by default, we do not push down WHERE conditions that use UDFs or stored functions to remote data nodes, unless the user requests it (by setting spider_use_pushdown_udf).
- Disable direct update/delete when a UDF condition is skipped.
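The decision described in the list above can be sketched as a pair of predicates. The struct and function names are hypothetical; only spider_use_pushdown_udf is a real setting, represented here by a plain boolean.

```cpp
// Hypothetical description of one WHERE condition.
struct Condition {
  bool uses_udf_or_stored_function;
};

// Whether the condition may be sent to the remote data node.
// use_pushdown_udf stands in for the spider_use_pushdown_udf setting.
bool may_push_down(const Condition &cond, bool use_pushdown_udf)
{
  return !cond.uses_udf_or_stored_function || use_pushdown_udf;
}

// Direct update/delete is allowed only if the condition was not skipped.
bool may_use_direct_update(const Condition &cond, bool use_pushdown_udf)
{
  return may_push_down(cond, use_pushdown_udf);
}
```

Presumably the reason for the last rule is that a direct update or delete executed remotely without the skipped condition could affect rows that the full WHERE clause would have excluded.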
btr_defragment_save_defrag_stats_if_needed(): Do not save
defragmentation statistics for temporary tables.
They are exempt from defragmentation anyway
(ha_innobase::optimize() never invokes defragmentation for them),
and the user-visible names are not available inside InnoDB.
Furthermore, InnoDB assumes that temporary tables are never accessed
by threads other than the one that handles the session with which
the temporary table is associated.
Furthermore, we simplify the test innodb.innodb_defrag_stats
and include a test case that demonstrates that defragmentation
statistics are no longer being saved for temporary tables.
dict_index_t::clear_instant_alter(): When searching for an AUTO_INCREMENT column,
do not skip the beginning of the list, because the field can be at the beginning of the list.
InnoDB could evict the FTS auxiliary table in
row_fts_merge_insert(). So bulk insert could be
dealing with a garbage FTS auxiliary table. The patch
delays closing the table in row_fts_merge_insert().
The st_blksize returned by fstat(2) is not documented to be
a power of 2, like we assumed in
commit 58252fff15 (MDEV-26040).
While on Linux, the st_blksize appears to report the file system
block size (which hopefully is not smaller than the sector size
of the underlying block device), on FreeBSD we observed
st_blksize values that might have been something similar to st_size.
Also IBM AIX was affected by this. A simple test case would
lead to a crash when using the minimum innodb_buffer_pool_size=5m
on both FreeBSD and AIX:
seq -f 'create table t%g engine=innodb select * from seq_1_to_200000;' \
1 100|mysql test&
seq -f 'create table u%g engine=innodb select * from seq_1_to_200000;' \
1 100|mysql test&
We will fix this by not trusting st_blksize at all, and assuming that
the smallest allowed write size (for O_DIRECT) is 4096 bytes. We hope
that no storage systems with larger block size exist. Anything larger
than 4096 bytes should be unlikely, given that it is the minimum
virtual memory page size of many contemporary processors.
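One way in which a power-of-2 assumption can go wrong is rounding by bit mask. The sketch below is only an illustration of that general pitfall, not the actual InnoDB code; all names are hypothetical.

```cpp
#include <cstddef>

// True if n is a nonzero power of two.
constexpr bool is_power_of_2(std::size_t n)
{
  return n != 0 && (n & (n - 1)) == 0;
}

// Round n down to a multiple of block by masking.
// This is only correct when block is a power of two.
constexpr std::size_t round_down_mask(std::size_t n, std::size_t block)
{
  return n & ~(block - 1);
}
```

For example, round_down_mask(10000, 4096) gives 8192, a multiple of 4096, while round_down_mask(10000, 5000) gives 9232, which is not a multiple of 5000. Using a fixed 4096 avoids depending on whatever fstat(2) happens to report.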
MariaDB Server on Microsoft Windows was not affected by this.
While the 512-byte sector size of the venerable Seagate ST-225 is still
in widespread use, the minimum innodb_page_size is 4096 bytes, and
innodb_log_file_size can be set in integer multiples of 65536 bytes.
The only occasion where InnoDB uses smaller data file block sizes than
4096 bytes is with ROW_FORMAT=COMPRESSED tables with KEY_BLOCK_SIZE=1
or KEY_BLOCK_SIZE=2 (or innodb_page_size=4096). For such tables,
we will from now on preallocate space in integer multiples of 4096 bytes
and let regular writes extend the file by 1024, 2048, or 3072 bytes.
The view INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES.FS_BLOCK_SIZE
should report the raw st_blksize.
For page_compressed tables, the function fil_space_get_block_size()
will map to 512 any st_blksize value that is larger than 4096.
os_file_set_size(): Assume that the file system block size is 4096 bytes,
and only support extending files to integer multiples of 4096 bytes.
fil_space_extend_must_retry(): Round down the preallocation size to
an integer multiple of 4096 bytes.
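The size arithmetic described above can be sketched as follows. The constant and function names are hypothetical; the numbers 4096 and 512 are taken from the description.

```cpp
#include <cstdint>

// Block size assumed by the fix; the constant name is hypothetical.
constexpr std::uint64_t ASSUMED_BLOCK_SIZE = 4096;

// Largest multiple of 4096 bytes that does not exceed the requested size.
constexpr std::uint64_t preallocation_size(std::uint64_t requested_bytes)
{
  return requested_bytes - requested_bytes % ASSUMED_BLOCK_SIZE;
}

// Block size used for page_compressed tables, following the rule above:
// any st_blksize value larger than 4096 is mapped to 512.
constexpr std::uint32_t page_compressed_block_size(std::uint32_t st_blksize)
{
  return st_blksize > 4096 ? 512 : st_blksize;
}
```

For a KEY_BLOCK_SIZE=1 table that should grow to 7 pages (7168 bytes), this preallocates 4096 bytes and leaves the remaining 3072 bytes to be added by regular writes.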
FTS indexes have prefix_len=1 or prefix_len=0, as stated by a comment in
mysql_prepare_create_table().
Thus, a newly added assertion should be relaxed for FTS indexes.
The bug happens when a partially indexed CHAR or VARCHAR field is converted from
utf8mb3 to utf8mb4.
The fix is to relax the assertions. For some time, dict_index_t and dict_table_t
are out of sync: dict_index_t has a new prefix_len, which
is the product of the user-provided length and charset->mbmaxlen, but
the table still has the old mbmaxlen, and the assertion fails. This happens only
during utf8mb3 -> utf8mb4 conversions, and the magic number 4 comes from
utf8mb_4_.
At the end of ALTER TABLE (innobase_rename_or_enlarge_columns_cache()),
dict_index_t and dict_table_t become synchronized
again and stay so from then on. For example, they will be synchronized
on table load, and the newly added assertion proves that.
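The mismatch can be made concrete with a small sketch. The function names are hypothetical, and the relaxed check is only an approximation of the idea described above, not the actual assertion in InnoDB.

```cpp
#include <cstdint>

// Prefix length in bytes for a prefix of user_chars characters.
constexpr std::uint32_t prefix_len_bytes(std::uint32_t user_chars,
                                         std::uint32_t mbmaxlen)
{
  return user_chars * mbmaxlen;
}

// Hypothetical relaxed check: the index prefix_len matches either the
// mbmaxlen still stored in the table, or 4 (utf8mb4) while a
// utf8mb3 -> utf8mb4 conversion is in progress.
constexpr bool prefix_len_plausible(std::uint32_t prefix_len,
                                    std::uint32_t table_mbmaxlen)
{
  return prefix_len % table_mbmaxlen == 0 || prefix_len % 4 == 0;
}
```

For a 10-character prefix, the index holds 40 bytes after the conversion while the table still reports mbmaxlen=3. A strict check against 3 would fail for 40, whereas the relaxed form accepts it.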
Improve documentation of performance_schema tables by appending COLUMN
comments to tables. Additionally improve test coverage and update corresponding
tests.
init_mutex_v1_t: Stop lying that the mutex parameter is const.
GCC 11.2.0 assumes that it is and could complain about any mysql_mutex_t
being uninitialized even after mysql_mutex_init() as long as
PLUGIN_PERFSCHEMA is enabled.
init_rwlock_v1_t, init_cond_v1_t: Remove untruthful const qualifiers.
Note: init_socket_v1_t is expecting that the socket fd has already
been created before PSI_SOCKET_CALL(init_socket), and therefore that
parameter really is being treated as a pointer to const.
Thanks to Theodore Brockman on Zulip for noticing this
on OSX ARM64 and testing this patch.
Per https://github.com/google/cpu_features/pull/150/files
CMAKE_SYSTEM_PROCESSOR is arm64 on Apple.
Without this, the compilation fails:
[ 80%] Building CXX object storage/rocksdb/CMakeFiles/rocksdblib.dir/rocksdb/util/crc32c.cc.o
/mariadb/storage/rocksdb/rocksdb/util/crc32c.cc:500:18: error: use of undeclared identifier 'isSSE42'
has_fast_crc = isSSE42();
^
/mariadb/storage/rocksdb/rocksdb/util/crc32c.cc:1230:7: error: use of undeclared identifier 'isSSE42'
if (isSSE42()) {
^
/mariadb/storage/rocksdb/rocksdb/util/crc32c.cc:1231:9: error: use of undeclared identifier 'isPCLMULQDQ'
if (isPCLMULQDQ()) {
^
This can be reverted when the RocksDB submodule is updated.
ee4bd4780b
trx_purge_rseg_get_next_history_log(): Fix a race condition that
was introduced in commit e46f76c974
(MDEV-15912). The buffer pool page contents must not be accessed
while not holding a page latch. The page latch was released by
mtr_t::commit().
This race resulted in an ASAN heap-use-after-poison during a stress test.
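The general rule behind this fix can be sketched with standard C++ primitives. The struct and field below are hypothetical stand-ins; InnoDB uses its own page latches and mini-transactions rather than std::mutex.

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a buffer pool page protected by a latch.
struct Page {
  std::mutex latch;
  std::uint64_t next_undo_no = 0;
};

// Copy what is needed while the latch is held, then work with the copy.
std::uint64_t read_next_undo_no(Page &page)
{
  std::uint64_t copy;
  {
    std::lock_guard<std::mutex> guard(page.latch);
    copy = page.next_undo_no;  // read while latched
  }                            // latch released here, as by mtr_t::commit()
  return copy;                 // the page itself is not touched any more
}
```

Once the latch is released, the page may be evicted or reused, so any later access must go through the copied value.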
To avoid potential race conditions between concurrent access to
dict_table_t::freed_indexes, let us consistently use
dict_table_t::autoinc_mutex.
dict_table_remove_from_cache_low(): To avoid extensive hold time
of table->autoinc_mutex, unconditionally free the FTS data structures.
ha_innobase::check_if_supported_inplace_alter(): Do not invoke
innobase_table_is_empty() if the tablespace has been discarded.
That is, native ALTER TABLE in InnoDB will treat an empty table
in the same way as a table whose tablespace has been discarded.
(Note: ALTER TABLE...ALGORITHM=COPY will fail if the tablespace
has been discarded.)
This fixes a crash that was introduced
in commit c755974775 (MDEV-19611).
Problem:
=======
The last AHI page for two indexes of a dropped table is being
freed at the same time by two threads. One thread frees the
table heap and the other thread tries to access the table heap again.
This leads to an ASAN failure in btr_search_lazy_free().
Solution:
========
InnoDB uses autoinc_mutex to avoid the race condition
in btr_search_lazy_free().
Designated initializers were introduced in ISO/IEC 9899:1999 (C99),
but the C code base of MariaDB is supposed to be compatible with the
1990 version of the standard.
The InnoDB code base was switched from C to C++ in
MySQL 5.6 and MariaDB 10.0. C++ did not introduce syntax for
designated initializers until ISO/IEC 14882:2020.
Our C++ code base is still stuck with the 2011 or earlier version of
that standard.
Therefore, this check as well as the macro STRUCT_FLD are best removed.
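Without designated initializers, aggregates can still be initialized positionally, with comments carrying the field names. The struct below is a hypothetical example and is not taken from the MariaDB code base.

```cpp
// Hypothetical struct, for illustration only.
struct FieldInfo {
  const char *name;
  unsigned length;
  bool nullable;
};

// Positional aggregate initialization is valid in C++11 and earlier.
// The values must appear in declaration order.
static const FieldInfo fields[] = {
  {/* name */ "space_id", /* length */ 4, /* nullable */ false},
  {/* name */ "comment", /* length */ 64, /* nullable */ true},
};
```

The cost of this form is that reordering the struct members silently changes the meaning of every initializer, which is what designated initializers would have guarded against.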
PageConverter::update_index_page(): Always validate the PAGE_INDEX_ID.
Failure to do so could cause a crash when iterating
secondary index pages. This was caught by the 10.4 test
innodb.full_crc32_import.
A delete-marked record is in the secondary index, and the corresponding
record has already been purged from the clustered index. We cannot detect
whether such a record is historical, and we should not: the algorithm of
row_ins_check_foreign_constraint() skips such records anyway.