The InnoDB DATA DIRECTORY attribute is not implemented via
symbolic links but something similar, *.isl files that contain
the names of data files.
InnoDB failed to ignore the DATA DIRECTORY attribute even though
the server was started with --skip-symbolic-links.
Native ALTER TABLE in InnoDB will retain the DATA DIRECTORY attribute
of the table, no matter if the table will be rebuilt or not.
Generic ALTER TABLE (with ALGORITHM=COPY) as well as TRUNCATE TABLE
will discard the DATA DIRECTORY attribute.
All tests have been run with and without the ./mtr option
--mysqld=--skip-symbolic-links
and some tests that use the InnoDB DATA DIRECTORY attribute
have been adjusted for this.
.. to be the same as startup.
In resolving MDEV-27461, BUF_LRU_MIN_LEN (256) is the minimum number of
pages for the innodb buffer pool size. Obviously we need more than just
flushing pages. Taking the 16k page size and its default minimum, an
extra 25% is needed on top of the flushing pages to make a workable buffer
pool.
The minimum innodb_buffer_pool_chunk_size (1M) restricts the minimum
otherwise we'd have a pool made up of different chunk sizes.
The resulting minimum innodb buffer pool sizes are:
Page Size, Previously minimum (startup), with change.
4k 5M 2M
8k 5M 3M
16k 5M 5M
32k 24M 10M
64k 24M 20M
With this patch, SET GLOBAL innodb_buffer_pool_size minimums are
enforced.
The evident minimum system variable size for innodb_buffer_pool_size
is 2M, however this is only setable if using 4k page size. As
the order of the page_size and buffer_pool_size aren't fixed, we can't
hide this change.
Subsequent changes:
* innodb_buffer_pool_resize_with_chunks.test - raised of pool resize due to new
minimums. Chunk size also needed increase as the test was for
pool_size < chunk_size to generate a warning.
* Removed srv_buf_pool_min_size and replaced use with MYSQL_SYSVAR_NAME(buffer_pool_size).min_val
* Removed srv_buf_pool_def_size and replaced constant defination in
MYSQL_SYSVAR_LONGLONG(buffer_pool_size)
* Reordered ha_innodb to allow for direct use of MYSQL_SYSVAR_NAME(buffer_pool_size).min_val
* Moved buf_pool_size_align into ha_innodb to access to MYSQL_SYSVAR_NAME(buffer_pool_size).min_val
* loose-innodb_disable_resize_buffer_pool_debug is needed in the
innodb.restart.opt test so that under debug mode, resizing of the
innodb buffer pool can occur.
There were no InnoDB changes in the MySQL 5.7.37 release that would be
relevant to MariaDB Server. We will merely update the reported
InnoDB version number.
create_log_files(): Check log_set_capacity() before modifying
or creating any log files.
innobase_start_or_create_for_mysql(): If create_log_files()
fails and we were initializing a new database, delete the
system tablespace files before exiting.
fil_space_decrypt(): change signature to return status via dberr_t only.
Also replace impossible condition with an assertion and prove it via
test cases.
Mutex order violation when wsrep bf thread kills a conflicting trx,
the stack is
wsrep_thd_LOCK()
wsrep_kill_victim()
lock_rec_other_has_conflicting()
lock_clust_rec_read_check_and_lock()
row_search_mvcc()
ha_innobase::index_read()
ha_innobase::rnd_pos()
handler::ha_rnd_pos()
handler::rnd_pos_by_record()
handler::ha_rnd_pos_by_record()
Rows_log_event::find_row()
Update_rows_log_event::do_exec_row()
Rows_log_event::do_apply_event()
Log_event::apply_event()
wsrep_apply_events()
and mutexes are taken in the order
lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data
When a normal KILL statement is executed, the stack is
innobase_kill_query()
kill_handlerton()
plugin_foreach_with_mask()
ha_kill_query()
THD::awake()
kill_one_thread()
and mutexes are
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
This patch is the plan D variant for fixing potetial mutex locking
order exercised by BF aborting and KILL command execution.
In this approach, KILL command is replicated as TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting to happen in parallel with KILL command execution
either. Potential mutex deadlocks between the different mutex
access paths with KILL command execution and BF aborting cannot
therefore happen.
TOI replication is used, in this approach, purely as means
to provide isolated KILL command execution in the first node.
KILL command should not (and must not) be applied in secondary
nodes. In this patch, we make this sure by skipping KILL
execution in secondary nodes, in applying phase, where we
bail out if applier thread is trying to execute KILL command.
This is effective, but skipping the applying of KILL command
could happen much earlier as well.
This also fixed unprotected calls to wsrep_thd_abort
that will use wsrep_abort_transaction. This is fixed
by holding THD::LOCK_thd_data while we abort transaction.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
The initial test case for MySQL Bug #33053297 is based on
mysql/mysql-server@27130e2507.
innobase_get_field_from_update_vector is not a suitable function to fetch
updated row info, as well as parent table's update vector is not always
suitable. For instance, in case of DELETE it contains undefined data.
castade->update vector seems to be good enough to fetch all base columns
update data, and besides faster, and less error-prone.
SysTablespace::file_not_found(): If the system tablespace cannot be
found and innodb_force_recovery has been specified, refuse to start up.
The system tablespace is necessary for accessing any InnoDB tables,
because it contains the TRX_SYS page (the state of transactions)
and the InnoDB data dictionary.
This is similar to our handling of innodb_read_only except that
we will happily create the InnoDB temporary tablespace even if
innodb_force_recovry is set.
Based on mysql/mysql-server@bc9c46bf28
but without sleeps.
The test was verified to hit the debug assertion if the change to
fts_add_doc_by_id() in commit 2d98b967e3
was reverted.
fts_cache_t::total_size_at_sync: New field, to sample total_size.
fts_add_doc_by_id(): Invoke sync if total_size has grown too much
since the previous sync request. (Maintain cache->total_size_at_sync.)
ib_wqueue_t::length: Caches ib_list_len(*items).
ib_wqueue_len(): Removed. We will refer to fts_optimize_wq->length
directly.
Based on mysql/mysql-server@bc9c46bf28
trx_commit_in_memory(): Do not release the rseg reference before
trx_undo_commit_cleanup() has been invoked and the current transaction
is truly done with the rollback segment. The purpose of the reference
count is to prevent data races with trx_purge_truncate_history().
This is based on
mysql/mysql-server@ac79aa1522.
InnoDB commit fails when consecutive FTS_DOC_ID value
is greater than 4294967295.
Fix is that InnoDB should remove the delta FTS_DOC_ID
value limitations and fts should encode 8 byte value,
remove FTS_DOC_ID_MAX_STEP variable. Replaced the
fts0vlc.ic file with fts0vlc.h
fts_encode_int(): Should be able to encode 10 bytes value
fts_get_encoded_len(): Should get the length of the value
which has 10 bytes
fts_decode_vlc(): Add debug assertion to verify the maximum
length allowed is 10.
mach_read_uint64_little_endian(): Reads 64 bit stored in
little endian format
Added a unit test case which check for minimum and maximum
value to do the fts encoding
create_table_info_t::innobase_table_flags(): Refuse to create
a PAGE_COMPRESSED table with PAGE_COMPRESSION_LEVEL=0 if also
innodb_compression_level=0.
The parameter value innodb_compression_level=0 was only somewhat
meaningful for testing or debugging ROW_FORMAT=COMPRESSED tables.
For the page_compressed format, it never made any sense, and the
check in dict_tf_is_valid_not_redundant() that was added in
72378a2583 (MDEV-12873) would cause
the server to crash.
In commit 1cb218c37c (MDEV-26450)
we introduced the function log_write_and_flush(), which may
compete with log_checkpoint() invoking log_write_flush_to_disk_low()
from another thread.
The assertion n_pending_flushes==1 is too strict.
There is no possibility of a race condition here, because
fil_flush() is protected by fil_system->mutex and the
rest will be protected by log_sys->mutex.
log_write_flush_to_disk_low(), log_write_and_flush():
Relax the assertions to test for a nonzero count.
If a table has no unique indexes, write set key information will be collected on all columns in the table.
The write set key information has space only for max 3500 bytes for individual column, and if a varchar colummn of such non-primary key table is longer than
this limit, currently a crash follows.
The fix in this commit, is to truncate key values extracted from such long varhar columns to max 3500 bytes.
This may potentially lead to false positive certification failures for transactions, which operate on separate cluster nodes, and update/insert/delete table rows, which differ only in the part of such long columns after 3500 bytes border.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
This patch is the plan D variant for fixing potetial mutex locking
order exercised by BF aborting and KILL command execution.
In this approach, KILL command is replicated as TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting to happen in parallel with KILL command execution
either. Potential mutex deadlocks between the different mutex
access paths with KILL command execution and BF aborting cannot
therefore happen.
TOI replication is used, in this approach, purely as means
to provide isolated KILL command execution in the first node.
KILL command should not (and must not) be applied in secondary
nodes. In this patch, we make this sure by skipping KILL
execution in secondary nodes, in applying phase, where we
bail out if applier thread is trying to execute KILL command.
This is effective, but skipping the applying of KILL command
could happen much earlier as well.
This patch also fixes mutex locking order and unprotected
THD member accesses on bf aborting case. We try to hold
THD::LOCK_thd_data during bf aborting. Only case where it
is not possible is at wsrep_abort_transaction before
call wsrep_innobase_kill_one_trx where we take InnoDB
mutexes first and then THD::LOCK_thd_data.
This will also fix possible race condition during
close_connection and while wsrep is disconnecting
connections.
Added wsrep_bf_kill_debug test case
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
At least since commit 055a3334ad
(MDEV-13564) the undo log truncation in InnoDB did not work correctly.
The main issue is that during the execution of
trx_purge_truncate_history() some pages of the newly truncated
undo tablespace could be discarded.
fsp_try_extend_data_file(): Apply the peculiar rounding of
fil_space_t::size_in_header only to the system tablespace,
whose size can be expressed in megabytes in a configuration parameter.
Other files may freely grow by a number of pages.
fseg_alloc_free_page_low(): Do allow the extension of undo tablespaces,
and mention the file name in the error message.
mtr_t::commit_shrink(): Implement crash-safe shrinking of a tablespace
file. First, durably write the log, then shrink the file, and finally
release the page latches of the rebuilt tablespace. Refactored from
trx_purge_truncate_history().
log_write_and_flush_prepare(), log_write_and_flush(): New functions
to durably write log during mtr_t::commit_shrink().
btr_defragment_save_defrag_stats_if_needed(): Do not save
defragmentation statistics for temporary tables.
They are exempt of defragmentation anyway
(ha_innobase::optimize() never invokes defragmentation for them),
and the user-visible names are not available inside InnoDB.
Furthermore, InnoDB assumes that temporary tables are never accessed
by other threads than the one that handles the session with which
the temporary table is associated with.
Furthermore, we simplify the test innodb.innodb_defrag_stats
and include a test case that demonstrates that defragmentation
statistics are no longer being saved for temporary tables.
InnoDB could evict the fts auxiliary table in
row_fts_merge_insert(). So bulk insert could be
dealing with garbage FTS auxiliary table.Patch
should delay closing the table in row_fts_merge_insert().
The st_blksize returned by fstat(2) is not documented to be
a power of 2, like we assumed in
commit 58252fff15 (MDEV-26040).
While on Linux, the st_blksize appears to report the file system
block size (which hopefully is not smaller than the sector size
of the underlying block device), on FreeBSD we observed
st_blksize values that might have been something similar to st_size.
Also IBM AIX was affected by this. A simple test case would
lead to a crash when using the minimum innodb_buffer_pool_size=5m
on both FreeBSD and AIX:
seq -f 'create table t%g engine=innodb select * from seq_1_to_200000;' \
1 100|mysql test&
seq -f 'create table u%g engine=innodb select * from seq_1_to_200000;' \
1 100|mysql test&
We will fix this by not trusting st_blksize at all, and assuming that
the smallest allowed write size (for O_DIRECT) is 4096 bytes. We hope
that no storage systems with larger block size exist. Anything larger
than 4096 bytes should be unlikely, given that it is the minimum
virtual memory page size of many contemporary processors.
MariaDB Server on Microsoft Windows was not affected by this.
While the 512-byte sector size of the venerable Seagate ST-225 is still
in widespread use, the minimum innodb_page_size is 4096 bytes, and
innodb_log_file_size can be set in integer multiples of 65536 bytes.
The only occasion where InnoDB uses smaller data file block sizes than
4096 bytes is with ROW_FORMAT=COMPRESSED tables with KEY_BLOCK_SIZE=1
or KEY_BLOCK_SIZE=2 (or innodb_page_size=4096). For such tables,
we will from now on preallocate space in integer multiples of 4096 bytes
and let regular writes extend the file by 1024, 2048, or 3072 bytes.
The view INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES.FS_BLOCK_SIZE
should report the raw st_blksize.
For page_compressed tables, the function fil_space_get_block_size()
will map to 512 any st_blksize value that is larger than 4096.
os_file_set_size(): Assume that the file system block size is 4096 bytes,
and only support extending files to integer multiples of 4096 bytes.
fil_space_extend_must_retry(): Round down the preallocation size to
an integer multiple of 4096 bytes.
To avoid potential race conditions between concurrent access to
dict_table_t::freed_indexes, let us consistently use
dict_table_t::autoinc_mutex.
dict_table_remove_from_cache_low(): To avoid extensive hold time
of table->autoinc_mutex, unconditionally free the FTS data structures.
Problem:
=======
The last AHI page for two indexes of an dropped table is being
freed at the same time by two threads. One thread frees the
table heap and other thread tries to access table heap again.
It leads to asan failure in btr_search_lazy_free().
Solution:
========
InnoDB uses autoinc_mutex to avoid the race condition
in btr_search_lazy_free()
Designated initializers were introduced in ISO/IEC 9899:1999 (C99),
but the C code base of MariaDB is supposed to be compatible with the
1990 version of the standard.
The InnoDB code based was switched from C to C++ in
MySQL 5.6 and MariaDB 10.0. C++ did not introduce syntax for
designated initializers until ISO/IEC 14882:2020.
Our C++ code base is still stuck with the 2011 or earlier version of
that standard.
Therefore, this check as well as the macro STRUCT_FLD are best removed.
PageConverter::update_index_page(): Always validate the PAGE_INDEX_ID.
Failure to do so could cause a crash when iterating
secondary index pages. This was caught by the 10.4 test
innodb.full_crc32_import.
Import operation without .cfg file fails when there is mismatch of index
between metadata table and .ibd file. Moreover, MDEV-19022 shows
that InnoDB can end up with index tree where non-leaf page has only
one child page. So it is unsafe to find the secondary index root page.
This patch does the following when importing the table without .cfg file:
1) If the metadata contains more than one index then InnoDB stops
the import operation and report the user to drop all secondary
indexes before doing import operation.
2) When the metadata contain only clustered index then InnoDB finds the
index id by reading page 0 & page 3 instead of traversing the
whole tablespace.
pars_info_bind_id(): Remove the parameter copy_name. It was always
being passed as constant TRUE or true. It turns out that copying
the string is completely unnecessary. In all calls except the one
in fts_get_select_columns_str() and fts_doc_fetch_by_doc_id(),
the parameter is being passed as a compile-time constant, and therefore
the pointer cannot become stale. In that special call, the string
that is being passed is allocated from the same memory heap that
pars_info_bind_id() would have been using.
pars_info_add_id(): Remove (unused declaration).
1. rename option DEPENDENCIES in MYSQL_ADD_PLUGIN() to DEPENDS
to be consistent with other cmake commands and macros
2. use this DEPENDS option in plugins
3. add dependencies to the plugin embedded target too
4. plugins don't need to add GenError dependency explicitly,
all plugins depend on it automatically
len was containing garbage, since vctempl->mysql_col_offset was
containing old value while calling row_mysql_store_col_in_innobase_format
from innobase_get_computed_value().
It was not updated after the first ALTER TABLE call, because it's INPLACE
logic considered there's nothing to update, and exited immediately from
ha_innobase::inplace_alter_table().
However, vcol metadata needs an update, since vcols structure is changed
in mysql record.
The regression was introduced by 12614af1fe. There, refcount==1 condition
was removed, which turned out to be crucial, though racy. The idea was to
update vc_templ after each (sequencing) ALTER TABLE.
We should do the same another way, and there may be a plenty of solutions,
but the simplest one is to add a following condition:
if vcol structure is changed, drop vc_templ; it will be recreated on next
ha_innobase::open() call.
in prepare_inplace_alter_table. It is safe, since innodb inplace changes
require at least HA_ALTER_INPLACE_SHARED_LOCK_AFTER_PREPARE, which
guarantee MDL_EXCLUSIVE on this stage.
alter_templ_needs_rebuild() also has to track the columns not indexed, to
keep vc_templ correct.
Note that vc_templ is always kept constructed and available after
ha_innobase::open() call, even on INSERT, though no virtual columns are
evaluated during that statement
inside innodb.
In the test case suplied, it will be recreated on the second ALTER TABLE.
ha_innobase::prepare_inplace_alter_table(): Remove always-true conditions.
Near the start of the function, we would already have returned if
no ALTER TABLE operation flags were set that would require special
action from InnoDB.
It turns out that the conditions were redundant already when they were
introduced in mysql/mysql-server@241387a2b6
and in commit 068c61978e.
Thanks to Nikita Malyavin for noticing this.
Thanks to Nikita Malyavin for noticing this.
The dead code that was originally introduced in
mysql/mysql-server@b8bd31740c
was added in commit 2e814d4702
to this code base.
btr_scrub_start_space(): Avoid an unnecessary tablespace lookup
and related acquisition of fil_system->mutex. In MariaDB Server 10.3
we would get deadlocks between that mutex and a crypt_data mutex.
The fix was developed by Thirunarayanan Balathandayuthapani.