Backported from 10.5 20e9e804c1 and
5948d7602e.
sel_restore_position_for_mysql() moves forward persistent cursor
position after btr_pcur_restore_position() call if cursor relative position
is BTR_PCUR_ON and the cursor points to the record with NOT the same field
values as in a stored record(and some other not important for this case
conditions).
It was done because btr_pcur_restore_position() sets
page_cur_mode_t mode to PAGE_CUR_LE for cursor->rel_pos == BTR_PCUR_ON
before opening cursor. So we are searching for the record less or equal
to stored one. And if the found record is not equal to stored one, then
it is less and we need to move cursor forward.
But there can be a situation when the stored record was purged, but the
new one with the same key but different value was inserted while
row_search_mvcc() was suspended. In this case, when the thread is
awaken, it will invoke sel_restore_position_for_mysql(), which, in turns,
invoke btr_pcur_restore_position(), which will return false because found
record don't match stored record, and
sel_restore_position_for_mysql() will move forward cursor position.
The above can lead to the case when awaken row_search_mvcc() do not see
records inserted by other transactions while it slept. The mtr test case
shows the example how it can be.
The fix is to return special value from persistent cursor restoring
function which would notify its caller that uniq fields of restored
record and stored record are the same, and in this case
sel_restore_position_for_mysql() don't move cursor forward.
Delete-marked records are correctly processed in row_search_mvcc().
Non-unique secondary indexes are "uniquified" by adding the PK, the
index->n_uniq should then be index->n_fields. So there is no need in
additional checks in the fix.
If transaction's readview can't see the changes made in secondary index
record, it requests clustered index record in row_search_mvcc() to check
its transaction id and get the correspondent record version. After this
row_search_mvcc() commits mtr to preserve clustered index latching
order, and starts mtr. Between those mtr commit and start secondary
index pages are unlatched, and purge has the ability to remove stored in
the cursor record, what causes rows duplication in result set for
non-locking reads, as cursor position is restored to the previously
visited record.
To solve this the changes are just switched off for non-locking reads,
it's quite simple solution, besides the changes don't make sense for
non-locking reads.
The more complex and effective from performance perspective solution is
to create mtr savepoint before clustered record requesting and rolling
back to that savepoint after that. See MDEV-27557.
One more solution is to have per-record transaction id for secondary
indexes. See MDEV-17598.
If any of those is implemented, just remove select_lock_type argument in
sel_restore_position_for_mysql().
The sys_var class has the deprecation_substitute member to mark the
deprecated variables. As it's set, the server produces warnings when
these variables are used. However, the plugin has no means to utilize
that functionality.
So, the PLUGIN_VAR_DEPRECATED flag is introduced to set the
deprecation_substitute with the empty string. A non-empty string can
make the warning more informative, but there's no nice way seen to
specify it, and not that needed at the moment.
Fixed a condition where designated group commit lead was woken in release,
but returned early without trying to take over the lock.
So, the group commit locks would be unowned, and the waiters could
be stalled, until another thread comes that would on whatever reasons
flush the redo log.
Also, use better criteria for choosing potential next group commit lead.
convert_error_code_to_mysql(): Use the correct limit FK_MAX_CASCADE_DEL
in the error message. The DICT_FK_MAX_RECURSIVE_LOAD applies to
the number of foreign key constraints in table definitions,
not to the number of rows that are visited while processing
a foreign key constraint.
MDEV-22500 Assertion `thd->killed != 0' failed in ha_maria::enable_indexes
For MDEV-17223 the issue was an assert that didn't take into account that
we could get duplicate key errors when enablling unique indexes.
Fixed by not retrying repair in case of duplicate key error for this
case, which avoids the assert.
For MDEV-22500 I removed the assert, as it's not critical (just a way to
find potential wrong code) and we will anyway get things logged in the
error log if this happens. This case cannot triggered an assert in 10.3
but I verified that it would trigger in 10.5 and that this patch fixes
it.
Some GNU/Linux distributions ship a zlib that is modified to use
the s390x DFLTCC instruction. That modification would essentially
redefine compressBound(sourceLen) as (sourceLen * 16 + 2308) / 8 + 6.
Let us relax the tests for InnoDB ROW_FORMAT=COMPRESSED to cope with
such a weaker compression guarantee.
create_table_info_t::row_size_is_acceptable(): Remove a bogus debug-only
assertion that would fail to hold for the test innodb_zip.bug36169.
The function page_zip_empty_size() may indeed return 0.
Post-push fix: remove unstable test.
The test was developed to find the reason of duplicated rows caused by
MDEV-20605 fix. The test is not necessary as the reason was found and
the bug was fixed.
sel_restore_position_for_mysql() moves forward persistent cursor
position after btr_pcur_restore_position() call if cursor relative position
is BTR_PCUR_ON and the cursor points to the record with NOT the same field
values as in a stored record(and some other not important for this case
conditions).
It was done because btr_pcur_restore_position() sets
page_cur_mode_t mode to PAGE_CUR_LE for cursor->rel_pos == BTR_PCUR_ON
before opening cursor. So we are searching for the record less or equal
to stored one. And if the found record is not equal to stored one, then
it is less and we need to move cursor forward.
But there can be a situation when the stored record was purged, but the
new one with the same key but different value was inserted while
row_search_mvcc() was suspended. In this case, when the thread is
awaken, it will invoke sel_restore_position_for_mysql(), which, in turns,
invoke btr_pcur_restore_position(), which will return false because found
record don't match stored record, and
sel_restore_position_for_mysql() will move forward cursor position.
The above can lead to the case when awaken row_search_mvcc() do not see
records inserted by other transactions while it slept. The mtr test case
shows the example how it can be.
The fix is to return special value from persistent cursor restoring
function which would notify its caller that uniq fields of restored
record and stored record are the same, and in this case
sel_restore_position_for_mysql() don't move cursor forward.
Delete-marked records are correctly processed in row_search_mvcc().
Non-unique secondary indexes are "uniquified" by adding the PK, the
index->n_uniq should then be index->n_fields. So there is no need in
additional checks in the fix.
If transaction's readview can't see the changes made in secondary index
record, it requests clustered index record in row_search_mvcc() to check
its transaction id and get the correspondent record version. After this
row_search_mvcc() commits mtr to preserve clustered index latching
order, and starts mtr. Between those mtr commit and start secondary
index pages are unlatched, and purge has the ability to remove stored in
the cursor record, what causes rows duplication in result set for
non-locking reads, as cursor position is restored to the previously
visited record.
To solve this the changes are just switched off for non-locking reads,
it's quite simple solution, besides the changes don't make sense for
non-locking reads.
The more complex and effective from performance perspective solution is
to create mtr savepoint before clustered record requesting and rolling
back to that savepoint after that. See MDEV-27557.
One more solution is to have per-record transaction id for secondary
indexes. See MDEV-17598.
If any of those is implemented, just remove select_lock_type argument in
sel_restore_position_for_mysql().
Before commit 86dc7b4d4c (MDEV-24626)
all tablespace ID that needed recovery were known already in
recv_init_crash_recovery_spaces().
recv_sys_t::recover_deferred(): Invoke fil_names_dirty(space) on
the newly initialized tablespace. In this way, if the next log
checkpoint occurs at some LSN that is after the initialization of
the tablespace and before the last recovered LSN, a FILE_MODIFY
record will be written, so that a subsequent recovery will succeed.
The recovery was broken when
commit 0261eac57f merged the 10.5
commit f443cd1100 (MDEV-27022).
- regression got revealed while running tpcc workload.
- as part of MDEV-25919 changes logic for statistics computation was revamped.
- if the table has changed to certain threshold then table is added to
statistics recomputation queue (dict_stats_recalc_pool_add)
- after the table is added to queue the background statistics thread is
notified
- during revamp the condition to notify background statistics threads was
wrongly updated to check if the queue/vector is empty when it should
check if there is queue/vector has entries to process.
- vec.begin() == vec.end() : only when vector is empty
- also accessing these iterator outside the parallely changing vector is not
safe
- fix now tend to notify background statistics thread if the logic adds
an entry to the queue/vector.
row_sel_sec_rec_is_for_clust_rec() treats empty BLOB prefix field in
secondary index as a field equal to any external BLOB field in clustered
index. Row_sel_get_clust_rec_for_mysql::operator() doesn't zerro out
clustered record pointer in row_search_mvcc(), and row_search_mvcc()
thinks that delete-marked secondary index record has visible for
"CHECK TABLE"'s read view old-versioned clustered index record, and
row_scan_index_for_mysql() counts it as a row.
The fix is to execute row_sel_sec_rec_is_for_blob() in
row_sel_sec_rec_is_for_clust_rec() if clustered field contains BLOB's
reference.
mtr_t::is_block_dirtied(), mtr_t::memo_push(): Never set m_made_dirty
for pages of the temporary tablespace. Ever since
commit 5eb539555b
we never add those pages to buf_pool.flush_list.
mtr_t::commit(): Implement part of mtr_t::prepare_write() here,
and avoid acquiring log_sys.mutex if no log is written.
During IMPORT TABLESPACE fixup, we do not write log, but we must
add pages to buf_pool.flush_list and for that, be prepared
to acquire log_sys.flush_order_mutex.
mtr_t::do_write(): Replaces mtr_t::prepare_write().
The aim of the InnoDB change buffer is to avoid delays when a leaf page
of a secondary index is not present in the buffer pool, and a record needs
to be inserted, delete-marked, or purged. Instead of reading the page into
the buffer pool for making such a modification, we may insert a record to
the change buffer (a special index tree in the InnoDB system tablespace).
The buffered changes are guaranteed to be merged if the index page
actually needs to be read later.
The change buffer could be useful when the database is stored on a
rotational medium (hard disk) where random seeks are slower than
sequential reads or writes.
Obviously, the change buffer will cause write amplification, due to
potentially large amount of metadata that is being written to the
change buffer. We will have to write redo log records for modifying
the change buffer tree as well as the user tablespace. Furthermore,
in the user tablespace, we must maintain a change buffer bitmap page
that uses 2 bits for estimating the amount of free space in pages,
and 1 bit to specify whether buffered changes exist. This bitmap needs
to be updated on every operation, which could reduce performance.
Even if the change buffer were free of bugs such as MDEV-24449
(potentially causing the corruption of any page in the system tablespace)
or MDEV-26977 (corruption of secondary indexes due to a currently
unknown reason), it will make diagnosis of other data corruption harder.
Because of all this, it is best to disable the change buffer by default.
If innodb_flush_method=O_DSYNC, log_sys.flushed_to_disk_lsn is changed
without 'flush_lock' protection inside log_write().
This leads to a race condition, if there are 2 threads running in parallel,
doing log_write_up_to() with different values for 'flush_to_disk'
In this case, log_write() and log_write_flush_to_disk_low() can execute at
the same time, and both would change flushed_lsn.
The fix is to remove special treatment of durable writes from log_write().
There is no apparent reason for this special treatment, log_write_flush_to_disk_low()
is already optimized for durable writes.
Nor there is an apparent reason to call log_flush_notify() more often in
for O_DSYNC.
buf_page_get_low(): If the page was read-fixed, validate the page ID
because the page could have been marked as corrupted. We should retry
the page read in this case, instead of returning a soon-to-be-evicted
corrupted page to the caller.
This was initially only observed on Microsoft Windows.
On Linux, this was repeated after adding a sleep
to buf_pool_t::corrupted_evict() between
bpage->zip.fix.fetch_sub() and bpage->lock.x_unlock().
In commit 9bc874a594 (MDEV-23497)
the configuration option innodb_read_only_compressed was introduced
to giver users advance notice of a plan to remove ROW_FORMAT=COMPRESSED
support for InnoDB.
Based on user feedback, this plan has been scrapped.
Even though ROW_FORMAT=COMPRESSED is a dead end and causes some
overhead for InnoDB data structures, we can live with that.
Now that we know that some users really want to keep using
ROW_FORMAT=COMPRESSED, the previous default value of the parameter
innodb_read_only_compressed=ON should be changed to OFF, to allow
smooth upgrades to 10.6 and later versions, without requiring users
to update any configuration file.
- Store the deferred tablespace name while loading the tablespace
for backup process.
- Mariabackup stores the list of space ids which has page0 INIT_PAGE
records. backup_first_page_op() and first_page_init() was introduced
to track the page0 INIT_PAGE records.
- backup_file_op() and log_file_op() was changed to handle
FILE_MODIFY redo log records. It is used to identify the
deferred tablespace space id.
- Whenever file operation redo log was processed by backup,
backup_file_op() should check whether the space name exist
in deferred tablespace. If it is then it needs to store the
space id, name when FILE_MODIFY, FILE_RENAME redo log processed
and it should delete the tablespace name from defer list in other
cases.
- backup_fix_ddl() should check whether deferred tablespace has
any page0 init records. If it is then consider the tablespace
as newly created tablespace. If not then backup should try
to reload the tablespace with SRV_BACKUP_NO_DEFER mode to
avoid the deferring of tablespace.