PROBLEM
-------
1. The customer had presented a stack in which many threads were waiting on
multiple mutexes such as LOCK_Status, srv_innodb_monitor_mutex, ibuf_mutex etc.
2. The root cause was that the AHI latch was held in S (shared) mode by a
thread which was doing a truncate of a large table.
3. Another thread was trying to acquire the AHI latch in X (exclusive) mode.
4. With our lock implementation, any thread requesting an X lock blocks the
rest of the threads requesting S (shared) locks; this caused many threads
to wait for this shared lock (see the sketch after this list).
5. The main reason why we hold the latches during truncate is to avoid
disabling the AHI during truncate.
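
A minimal sketch of why a single X request stalls all later S requests
under a writer-preferring rw-lock. This is an illustration only, not
InnoDB's actual rw_lock_t implementation:

    #include <condition_variable>
    #include <mutex>

    // Writer-preferring rw-lock sketch: once a writer is *waiting*,
    // new readers block too, so one X request stalls later S requests.
    struct rw_lock_sketch {
      std::mutex m;
      std::condition_variable cv;
      int readers = 0;
      int waiting_writers = 0;
      bool writer = false;

      void s_lock() {
        std::unique_lock<std::mutex> lk(m);
        // New readers also wait while any writer is waiting.
        cv.wait(lk, [&] { return !writer && waiting_writers == 0; });
        ++readers;
      }
      void s_unlock() {
        std::lock_guard<std::mutex> lk(m);
        if (--readers == 0) cv.notify_all();
      }
      void x_lock() {
        std::unique_lock<std::mutex> lk(m);
        ++waiting_writers;
        cv.wait(lk, [&] { return !writer && readers == 0; });
        --waiting_writers;
        writer = true;
      }
      void x_unlock() {
        std::lock_guard<std::mutex> lk(m);
        writer = false;
        cv.notify_all();
      }
    };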
FIX
InnoDB purge thread locks the root page of the clustered index
while accessing the undo log records, and later the same thread
tries to open the table, initialize statistics, and lock the
clustered index root page again while doing virtual column
computation, hanging on the root page latch it already holds.
Solution:
=========
InnoDB should prevent statistics initialization when the
table is being opened by the purge thread.
Between btr_pcur_store_position() and btr_pcur_restore_position()
it is possible that purge empties a table and enlarges
index->n_core_fields and index->n_core_null_bytes.
Therefore, we must cache index->n_core_fields in
btr_pcur_t::old_n_core_fields so that btr_pcur_t::old_rec can be
parsed correctly.
Unfortunately, this is a huge change, because we will replace
"bool leaf" parameters with "ulint n_core"
(passing index->n_core_fields, or 0 for non-leaf pages).
For special cases where we know that index->is_instant() cannot hold,
we may also pass index->n_fields.
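
A minimal sketch of the caching idea, reusing the field and function
names from this description; the surrounding types are simplified
placeholders, not the actual InnoDB definitions:

    #include <cstddef>

    using ulint = std::size_t;

    struct dict_index_sketch {
      ulint n_core_fields;  // may be enlarged by purge emptying the table
    };

    struct btr_pcur_sketch {
      const unsigned char* old_rec = nullptr;  // stored record (simplified)
      ulint old_n_core_fields = 0;             // snapshot taken at store time

      void store_position(const dict_index_sketch& index,
                          const unsigned char* rec) {
        old_rec = rec;
        // Cache the value now: by restore time, purge may have changed it.
        old_n_core_fields = index.n_core_fields;
      }

      // restore_position() must parse old_rec using old_n_core_fields,
      // not the possibly-enlarged index.n_core_fields.
    };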
Problem:
========
InnoDB fails to clean up the index stub if it fails to add the
virtual index which contains a new virtual column. But it clears
the newly added virtual column from the index in clear_added_indexes()
during inplace_alter_table. On commit, InnoDB evicts and
reloads the table; in case of rollback, this does not happen.
InnoDB clears the ABORTED index while opening the table
or doing DDL. In the meantime, InnoDB can access
the dropped virtual index columns while creating the prebuilt
struct or while rolling back concurrent DML.
Solution:
==========
(1) InnoDB should maintain the newly added virtual column while
rolling back the newly added virtual index.
(2) InnoDB must not defer the index removal
if the ALTER TABLE is executed with LOCK=EXCLUSIVE.
(3) For LOCK=SHARED, InnoDB should check whether the table
has any transaction lock other than the ALTER transaction
before deferring removal of the index stub.
Replaced has_new_v_col with dict_add_vcol_info in dict_index_t to
indicate whether the index has any new virtual column.
dict_index_t::has_new_v_col(): Returns whether the index has a
newly added virtual column; it does not say which columns are the
newly added virtual columns.
ha_innobase_inplace_ctx::is_new_vcol(): Return whether the
given column is added as a part of the current alter.
ha_innobase_inplace_ctx::clean_new_vcol_index(): Copy the newly
added virtual column to new_vcol_info in dict_index_t. Replace
the column in the index fields with virtual column stored
in new_vcol_info.
dict_index_t::assign_new_v_col(): Store the number of virtual
columns added in the index as part of the ALTER TABLE.
dict_index_t::get_n_new_vcol(): Get the number of newly added
virtual columns.
dict_index_t::assign_drop_v_col(): Allocate the memory for
adding a new virtual column in new_vcol_info.
dict_index_t::add_drop_v_col(): Add the newly added virtual
column to new_vcol_info.
dict_table_t::has_lock_for_other_trx(): Whether the table has
any transaction lock other than that of the given transaction.
row_merge_drop_indexes(): Add parameter alter_trx and check
whether the table has any lock other than that of the ALTER transaction.
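
A hedged sketch of the kind of check dict_table_t::has_lock_for_other_trx()
performs; the lock-list representation below is a simplified assumption,
not the actual InnoDB lock structures:

    #include <vector>

    struct trx_sketch {};

    struct lock_sketch {
      const trx_sketch* trx;  // transaction owning this table lock
    };

    struct table_sketch {
      std::vector<lock_sketch> locks;  // locks currently held on the table
    };

    // True if any lock on the table belongs to a transaction other than
    // the given (ALTER) transaction; only then must index removal be
    // deferred.
    bool has_lock_for_other_trx_sketch(const table_sketch& table,
                                       const trx_sketch* alter_trx) {
      for (const lock_sketch& lock : table.locks)
        if (lock.trx != alter_trx)
          return true;
      return false;
    }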
When a table has been evicted from dict_sys and reloaded internally by
InnoDB for FOREIGN KEY processing, statistics may not be initialized,
but nevertheless row_update_cascade_for_mysql() could invoke
dict_stats_update_if_needed(). In that case, we cannot really update
the statistics. For tables that have STATS_PERSISTENT=1 and
STATS_AUTO_RECALC=1, ANALYZE TABLE might have to be executed later.
dict_stats_update_if_needed(): Replace the assertion with
a conditional early return.
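
A minimal sketch of replacing the assertion with a conditional early
return; the struct is a placeholder, with only the field name
stat_initialized following the InnoDB convention:

    struct dict_table_sketch {
      bool stat_initialized;  // false if the table was reloaded internally
    };

    void dict_stats_update_if_needed_sketch(dict_table_sketch* table) {
      // Before: a debug assertion on table->stat_initialized would fail
      // for tables reloaded internally for FOREIGN KEY processing.
      if (!table->stat_initialized) {
        return;  // cannot update statistics that were never initialized
      }
      // ... proceed with the normal statistics update ...
    }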
trx_sys_any_active_transactions(): Remove a bogus debug assertion.
In trx_commit_in_memory() and trx_erase_lists(), we will remove
the transaction from trx_sys->rw_trx_list and set the state to
TRX_STATE_COMMITTED_IN_MEMORY.
A failed buffer pool allocation could cause server failure in
different, confusing ways: InnoDB fails to free the buffer pool
instance mutex and zip mutex if the allocation of a buffer pool
instance chunk fails. The buffer pool is then freed before the
mutexes, leading to a double free of memory when the mutexes are
freed during shutdown.
InnoDB should skip the dropped aborted FTS_DOC_ID_INDEX while
checking for an existing FTS_DOC_ID_INDEX in the table. InnoDB
should be able to create a new FTS_DOC_ID_INDEX if the fulltext
index is being added for the first time.
During the prepare phase of restoring backups, "mariabackup" does
not seem to allow (or recognize) the option "innodb_force_recovery"
for the embedded InnoDB server instance that it starts.
If page corruption is observed during page recovery, the prepare step
fails. While this is indeed the correct behavior ideally, allowing
this option to be set in case of emergencies might be useful when
the current backup is the only copy available. Some error messages
during "--prepare" suggest to set "innodb_force_recovery" to 1:
[ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
For backwards compatibility, "mariabackup --innobackupex --apply-log"
should also have this option.
Signed-off-by: Srinidhi Kaushik <shrinidhi.kaushik@gmail.com>
Virtual column fields are not found in the prebuilt data type, so we
should match InnoDB fields with the `get_innobase_type_from_mysql_type`
method.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
- Aborting of fulltext index creation fails to remove the
index from the SYS_INDEXES table. When we try to reload the
table definition, InnoDB fails with an index count mismatch
error. InnoDB should remove the index from SYS_INDEXES while
rolling back the secondary index creation.
- The import table operation fails to punch the hole in
the filesystem when a page compressed table is involved.
To achieve that, InnoDB first punches the hole for
the I/O buffer size (1 MiB). After that, InnoDB should write
page by page when page compression is involved (see the
sketch below).
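
A hedged sketch of hole-punching on Linux via fallocate(2), the
mechanism that page_compressed tablespaces rely on; this is an
illustration, not the actual InnoDB I/O code path:

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <fcntl.h>
    #include <sys/types.h>

    // Deallocate the byte range [off, off + len) of an open file while
    // keeping the apparent file size unchanged; reads of the range then
    // return zeroes and the filesystem reclaims the blocks.
    // Returns 0 on success, -1 (with errno set) on failure.
    int punch_hole_sketch(int fd, off_t off, off_t len)
    {
      return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                       off, len);
    }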
Add condition on trx->state == TRX_STATE_COMMITTED_IN_MEMORY in order to
avoid unnecessary work. If a transaction has already been committed or
rolled back, it will release its locks in lock_release() and let
the waiting thread(s) continue execution.
Let BF wait on lock_rec_has_to_wait, and if necessary the other BF
is replayed.
wsrep_trx_order_before
If BF is not even replicated yet then they are ordered
correctly.
bg_wsrep_kill_trx
Make sure victim_trx is found and also check its state. If the
state is TRX_STATE_COMMITTED_IN_MEMORY, the transaction is
already committed or rolled back and will release its locks
soon.
wsrep_assert_no_bf_bf_wait
A transaction requesting a new record lock should be in TRX_STATE_ACTIVE.
The conflicting transaction can be in TRX_STATE_ACTIVE,
TRX_STATE_COMMITTED_IN_MEMORY or TRX_STATE_PREPARED.
If the conflicting transaction is already committed in memory or
prepared, we should wait. When a transaction is committed in memory,
we hold the trx mutex but not lock_sys->mutex. Therefore, we
could end up here before the transaction has had time to do the
lock_release() that is protected by lock_sys->mutex.
lock_rec_has_to_wait
We can very well let BF wait normally, as the other BF will be
replayed in case of conflict. For debug builds we do
additional sanity checks to catch any unsupported BF wait.
wsrep_kill_victim
Check whether the victim is already in the
TRX_STATE_COMMITTED_IN_MEMORY state; if it is, we can return.
lock_rec_dequeue_from_page
lock_rec_unlock
Remove unnecessary wsrep_assert_no_bf_bf_wait function calls.
We can very well let BF wait here.
The function row_upd_clust_step() is invoking several static functions,
some of which used to commit the mini-transaction in some cases.
If innobase_get_computed_value() failed for some reason,
we would fail to invoke mtr_t::commit() and release buffer pool
page latches. This would likely lead to a hanging server later.
This regression was introduced in
commit 97db6c15ea (MDEV-20618).
row_upd_index_is_referenced(), row_upd_sec_index_entry():
Cleanup: Replace some ibool with bool.
row_upd_clust_rec_by_insert(), row_upd_clust_rec(): Guarantee that
the mini-transaction will always remain in active state.
row_upd_del_mark_clust_rec(): Guarantee that
the mini-transaction will always remain in active state.
This fixes one "leak" of mini-transaction on DB_COMPUTE_VALUE_FAILED.
row_upd_clust_step(): Use only one return path, which will always
invoke mtr.commit(). After a failed row_upd_store_row() call, we
will no longer "leak" the mini-transaction.
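
A minimal sketch of the single-return-path shape; the helper names
below are placeholders, not the actual static functions:

    enum dberr_sketch { DB_SUCCESS_SKETCH, DB_COMPUTE_VALUE_FAILED_SKETCH };

    struct mtr_sketch {
      void start() {}
      void commit() {}  // releases any buffer pool page latches held
    };

    dberr_sketch do_update_work(mtr_sketch&) {
      return DB_SUCCESS_SKETCH;  // placeholder for the real work
    }

    dberr_sketch row_upd_clust_step_sketch() {
      mtr_sketch mtr;
      mtr.start();
      // Any failure (such as a virtual column computation error) only
      // sets err; there is no early return that skips mtr.commit().
      dberr_sketch err = do_update_work(mtr);
      mtr.commit();  // single exit point: latches are always released
      return err;
    }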
This fix was verified by RQG on 10.6 (depending on MDEV-371 that
was introduced in 10.4). Unfortunately, it is challenging to
create a regression test for this, and a test case could soon become
invalid as more bugs in virtual column evaluation are fixed.
As suggested by Vladislav Vaintroub, let us remove misleading
and malformatted startup messages.
Even if the global variable srv_use_atomic_writes were set, we would
still invoke my_test_if_atomic_write() to check if writes are atomic
with a particular page size.
When using the default innodb_page_size=16k, page writes should be
atomic on NTFS when using ROW_FORMAT=COMPRESSED and KEY_BLOCK_SIZE<=4.
Disabling srv_use_atomic_writes when innodb_file_per_table=OFF does
not make sense, because that is a dynamic parameter.
We also correct the documentation string of innodb_use_atomic_writes
and remove the duplicate variable innobase_use_atomic_writes.
The debug parameter innodb_simulate_comp_failures injected compression
failures for ROW_FORMAT=COMPRESSED tables, breaking the pre-existing
logic that I had implemented in the InnoDB Plugin for MySQL 5.1 to prevent
compressed page overflows. A much better check is already achieved by
defining UNIV_ZIP_COPY at the compilation time.
(Only UNIV_ZIP_DEBUG is part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON.)
In commit eaeb8ec4b8 (MDEV-24653)
an incorrect debug assertion was introduced.
btr_pcur_store_position(): If the only record in the page is the
instant ALTER TABLE metadata record, we cannot expect there to be
a successor page. The situation could be improved by MDEV-24673 later.
Before MDEV-14638, there was no race condition between the
execution of fetch_data_into_cache() and transaction commit.
fetch_data_into_cache(): Acquire trx_t::mutex before checking
trx_t::state, to prevent a concurrent transition from
TRX_STATE_COMMITTED_IN_MEMORY to TRX_STATE_NOT_STARTED
in trx_commit_in_memory().
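
A minimal sketch of the locking rule behind the fix, using standard
C++ stand-ins for the InnoDB mutex and state; illustrative only:

    #include <mutex>

    enum trx_state_sketch {
      TRX_STATE_NOT_STARTED_SKETCH,
      TRX_STATE_ACTIVE_SKETCH,
      TRX_STATE_COMMITTED_IN_MEMORY_SKETCH
    };

    struct trx_sketch {
      std::mutex mutex;        // protects state transitions
      trx_state_sketch state;
    };

    // Read trx->state only while holding trx->mutex, so a concurrent
    // commit cannot move COMMITTED_IN_MEMORY -> NOT_STARTED between
    // the check and the use of the state.
    bool is_committed_in_memory_sketch(trx_sketch& trx) {
      std::lock_guard<std::mutex> lk(trx.mutex);
      return trx.state == TRX_STATE_COMMITTED_IN_MEMORY_SKETCH;
    }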
ha_innobase::info_low(): While collecting statistics for
ANALYZE TABLE, ensure that dict_stats_process_entry_from_recalc_pool()
is not executing on the same table.
We observed result differences for the test innodb.innodb_stats because
dict_stats_empty_index() was being invoked by the background statistics
calculation while ha_innobase::analyze() was executing
dict_stats_analyze_index_level().
Tests with 4096-byte sector size confirm that it is
safe to use O_DIRECT with page_compressed tables.
That had been disabled on Linux, in an attempt to fix MDEV-21584
which had been filed for the O_DIRECT problems earlier.
The fil_node_t::block_size was being set mostly correctly until
commit 10dd290b4b (MDEV-17380)
introduced a regression in MariaDB Server 10.4.4.
fil_node_t::read_page0(): Initialize fil_node_t::block_size.
This will probably make similar code in fil_space_extend_must_retry()
redundant, but we play it safe and will not remove that code.
Thanks to Vladislav Vaintroub for testing this on Microsoft Windows
using an old-fashioned rotational hard disk with 4KiB sector size.
Reviewed by: Vladislav Vaintroub
This had been originally added in
mysql/mysql-server@192bb153b6
with the motivation to disable O_DIRECT for the dedicated tablespace
for temporary tables. In MariaDB Server,
commit 5eb539555b (MDEV-12227)
should be a better solution.
The code became orphaned later in
mysql/mysql-server@c61244c0e6
and it had been applied to MariaDB Server 10.2.2 in
commit 2e814d4702 and
commit fec844aca8.
Thanks to Vladislav Vaintroub for spotting this.
The key value can be longer than REC_VERSION_56_MAX_INDEX_COL_LEN,
and this leads to an out-of-bounds array reference. Use dynamic
memory allocation based on the actual maximum length of the key value.
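
A minimal sketch of the pattern, not the actual buffer handling code:

    #include <cstddef>
    #include <vector>

    // Before: a fixed unsigned char buf[REC_VERSION_56_MAX_INDEX_COL_LEN]
    // could be overrun by a longer key value. After: size the buffer
    // dynamically from the actual maximum key length.
    std::vector<unsigned char> make_key_buffer_sketch(std::size_t max_key_len)
    {
      return std::vector<unsigned char>(max_key_len);
    }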
Online log apply for an insert operation on a ROW_FORMAT=REDUNDANT
table fails with an index->is_instant() assertion. Purge can reset
n_core_fields while ALTER is waiting to upgrade MDL for the commit
phase of the DDL. In the meantime, any insert DML that tries to log
the operation fails because the index is no longer instant.
row_log_get_n_core_fields(): Get the n_core_fields of online log
for the given index.
rec_get_converted_size_comp_prefix_low(): Use n_core_fields of online
log when InnoDB calculates the size of data tuple during redundant
row format table rebuild.
rec_convert_dtuple_to_rec_comp(): Use n_core_fields of online log
when InnoDB does the conversion of data tuple to record during
redundant row format table rebuild.
- Add a test case which has more than 129 instant columns.
MDEV-25105 (commit 7a4fbb55b0)
in MariaDB 10.6 will refuse the innodb_checksum_algorithm
values none, innodb, strict_none, strict_innodb.
We will issue a deprecation warning if innodb_checksum_algorithm
is set to any of these non-default unsafe values.
innodb_checksum_algorithm=crc32 was made the default in
MySQL 5.7 and MariaDB Server 10.2, and given that older versions
of the server have reached their end of life, there is no valid
reason to use anything else than innodb_checksum_algorithm=crc32
or innodb_checksum_algorithm=strict_crc32 in MariaDB 10.3.
Reviewed by: Sergei Golubchik
InnoDB sets the space in dict_table_t to NULL when the table
is discarded. So InnoDB should not use the space field of the
table to detect whether the given tablespace is the
temporary tablespace.
btr_node_ptr_max_size(): Let us remove the debug assertion that was
added in MDEV-14637. The assertion assumed that no additional
indexes exist in mysql.innodb_index_stats or mysql.innodb_table_stats.
The code path is working around an incorrect definition of a table,
interpreting VARCHAR(64) as the more correct VARCHAR(199).
No test case will be added, because MDEV-24579 proves that executing
DDL on the statistics tables involves a race condition. The test
case included the following:
ALTER TABLE mysql.innodb_index_stats ADD KEY (stat_name);
CREATE TABLE t (a INT) ENGINE=InnoDB STATS_PERSISTENT=1;
Incorrect processing of an auto-incrementing field in the
WSREP-related code during applying transactions results in
a duplicate key being created. This is due to the fact that
at the beginning of the write_row() and update_row() functions,
the values of the auto-increment parameters are used, which
are read from the parameters of the current thread, but further
along the code other values are used, which are read from global
variables (when applying a transaction). This can happen when
the cluster configuration has changed while applying a transaction
(for example in the high_priority_service mode for Galera 4).
Later, during IST processing, the duplicate key is detected, and
processing of the DB_DUPLICATE_KEY return code (inside InnoDB,
in the write_row() handler) results in a call to the
wsrep_thd_self_abort() function.
row_prebuilt_t::m_no_prefetch: Remove (it was always false).
row_prebuilt_t::m_read_virtual_key: Remove (it was always false).
Only ha_innopart ever set these fields.
innobase_rename_table(): Invoke dict_stats_wait_bg_to_stop_using_table()
to ensure that dict_stats_update() cannot be accessing the table name
that we will be modifying. If we are executing RENAME rather than TRUNCATE,
reset the flag at the end so that persistent statistics can be calculated
again.
The race condition was encountered with ASAN and rr.
Sorry, there is no test case, just as there is none for anything
related to dict_stats_wait_bg_to_stop_using_table(). The entire code
is an ugly work-around for the failure of
dict_stats_process_entry_from_recalc_pool() to acquire MDL.
Note: It appears that an ALTER TABLE that is not rebuilding the table
will fail to reset the flag that blocks the processing of statistics.
In btr_index_rec_validate(), the externally stored column
check is missing when matching the length of the field
with the length of the field data stored in the record.
Fetch the length of the externally stored part and compare it
with the fixed field length.
This is a backport of commit 18535a4028
from 10.6.
lock_release(): Implement innodb_evict_tables_on_commit_debug.
Before releasing any locks, collect the identifiers of tables to
be evicted. After releasing all locks, look up for the tables and
evict them if it is safe to do so.
trx_commit_in_memory(): Invoke trx_update_mod_tables_timestamp()
before lock_release(), so that our locks will protect the tables
from being evicted.
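
A hedged sketch of the commit-time flow described above; the helper
names are placeholders with trivial stub bodies, and only the ordering
relative to lock release follows the text:

    #include <cstdint>
    #include <vector>

    using table_id_sketch = std::uint64_t;

    // Trivial stubs standing in for the real InnoDB operations.
    std::vector<table_id_sketch> collect_modified_table_ids_sketch()
    { return {}; }
    void release_all_locks_sketch() {}                     // lock_release()
    bool is_safe_to_evict_sketch(table_id_sketch) { return false; }
    void evict_table_sketch(table_id_sketch) {}

    void commit_in_memory_sketch() {
      // While our locks still protect the tables from being evicted,
      // collect the identifiers of the tables to be evicted (and update
      // their modification timestamps, per trx_update_mod_tables_timestamp()).
      std::vector<table_id_sketch> to_evict =
          collect_modified_table_ids_sketch();
      release_all_locks_sketch();
      // Only after all locks are released, look the tables up and evict
      // them, and only when it is safe to do so.
      for (table_id_sketch id : to_evict)
        if (is_safe_to_evict_sketch(id))
          evict_table_sketch(id);
    }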