Replace wait_for_pattern_in_file.inc and all of its uses
to use search_pattern_in_file.inc with SEARCH_WAIT.
Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Sergei Golubchik <serg@mariadb.org>
For a primary configured with wait_point=AFTER_SYNC, if two threads
T1 (binlogging through MYSQL_BIN_LOG::write()) and T2 were
binlogging at the same time, T1 could accidentally wait for its
semi-sync ACK using the binlog coordinates of T2. Prior to
MDEV-33551, this only resulted in delayed transactions, because all
transactions shared the same condition variable for ACK signaling.
However, with the MDEV-33551 changes, each thread has its own
condition variable to signal. So T1 could wait indefinitely when
either:
1) T1's ACK is received but not T2's when T1 goes into
wait_after_sync(), because the ACK receiver thread has already
notified about the T1 ACK, but T1 was _actually_ waiting on T2's
ACK, and therefore tries to wait (in vain).
2) T1 goes to wait_after_sync() before any ACKs have arrived. When
T1's ACK comes in, T1 is woken up; however, sees it needs to wait
more (because it was actually waiting on T2's ACK), and goes to wait
again (this time, in vain).
Note that the actual cause of T1 waiting on T2's binlog coordinates
is when MYSQL_BIN_LOG::write() would call
Repl_semisync_master::wait_after_sync(), the binlog offset parameter
was read as the end of MYSQL_BIN_LOG::log_file, which is shared
among transactions. So if T2 had updated the binary log _after_ T1
had released LOCK_log, but not yet invoked wait_after_sync(), it
would use the end of the binary log file as the binlog offset, which
was that of T2 (or any future transaction).
The fix in this patch ensures consistency between the binary log
coordinates a transaction uses between report_binlog_update() and
wait_after_sync().
Reviewed By
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
This is a preparatory commit for MDEV-35109 to make its
testing code cleaner (and harden other tests too).
The DEBUG_DBUG point simulate_delay_semisync_slave_reply
up to this patch used my_sleep() to delay an ACK response,
but sleeps are prone to test failures on machines that
run tests when already having a heavy load (e.g. on
buildbot).
This patch changes this DEBUG_DBUG sleep to use DEBUG_SYNC
to coordinate exactly when a slave should send its reply,
which is safer and faster.
As DEBUG_SYNC can't be used while a server is shutting
down, to synchronize threads with SHUTDOWN WAIT FOR SLAVES
logic, we use and extend wait_for_pattern_in_file.inc to
wait for an informational error message in the logic to
indicate that the shutdown process has reached the
intended state (i.e. indicating that the shutdown has
been delayed to await semi-sync ACKs). Specifically, the
extensions are as follows:
1. wait_for_pattern_in_file.inc is extended with parameter
wait_for_pattern_count as a number that indicates the
number of times a pattern should occur in the file before
return control back to the calling script.
2. search_for_pattern_in_file.inc is extended with parameter
SEARCH_ABORT_IS_SUCCESS to inverse the error/success
logic, so the SEARCH_ABORT condition can be used to
indicate success, rather than error.
This is an extension of MDEV-30423 "Deadlock on Replica during BACKUP
STAGE BLOCK_COMMIT on XA transactions"
The original commit in MDEV-30423 was not complete as some usage in XA of
MDL_BACKUP_COMMIT locks did not set thd->backup_commit_lock.
This is required to be set when using parallel replication.
Fixed by ensuring that all usage of BACKUP_COMMIT lock i XA is uniform and
all sets thd->backup_commit_lock. I also changed all locks to be
MDL_EXPLICIT to keep also that part uniform.
A regression test is added.
Post-fix for MDEV-35144.
Cannot allocate options values on the statement arena, because
HA_CREATE_INFO is shallow-copied for every execution, so if the
option_list was initially empty, it will be reset for every execution
and any values allocated on the statement arena will be lost.
Cannot allocate option values on the execution arena, because
HA_CREATE_INFO is shallow-copied for every execution, so if the
option_list was initially NOT empty, any values appended to the
end will be preserved and if they're on the execution arena their
content will be destroyed.
Let's use thd->change_item_tree() to save and restore necessary pointers
for every execution.
followup for 3da565c41d
lock_rec_unlock_unmodified() is executed either under lock_sys.wr_lock()
or under a combination of lock_sys.rd_lock() + record locks hash table
cell latch. It also requests page latch to check if locked records were
changed by the current transaction or not.
Usually InnoDB requests page latch to find the certain record on the
page, and then requests lock_sys and/or record lock hash cell latch to
request record lock. lock_rec_unlock_unmodified() requests the latches
in the opposite order, what causes deadlocks. One of the possible
scenario for the deadlock is the following:
thread 1 - lock_rec_unlock_unmodified() is invoked under locks hash table
cell latch, the latch is acquired;
thread 2 - purge thread acquires page latch and tries to remove
delete-marked record, it invokes lock_update_delete(), which
requests locks hash table cell latch, held by thread 1;
thread 1 - requests page latch, held by thread 2.
To fix it we need to release lock_sys.latch and/or lock hash cell latch,
acquire page latch and re-acquire lock_sys related latches.
When lock_sys.latch and/or lock hash cell latch are released in
lock_release_on_prepare() and lock_release_on_prepare_try(), the page on
which the current lock is held, can be merged. In this case the bitmap
of the current lock must be cleared, and the new lock must be added to
the end of trx->lock.trx_locks list, or bitmap of already existing lock
must be changed.
The new field trx_lock_t::set_nth_bit_calls indicates if new locks
(bits in existing lock bitmaps or new lock objects) were created during
the period when lock_sys was released in trx->lock.trx_locks list
iteration loop in lock_release_on_prepare() or
lock_release_on_prepare_try(). And, if so, we traverse the list again.
The block can be freed during pages merging, what causes assertion
failure in buf_page_get_gen(), as btr_block_get() passes BUF_GET as page
get mode to it. That's why page_get_mode parameter was added to
btr_block_get() to pass BUF_GET_POSSIBLY_FREED from
lock_release_on_prepare() and lock_release_on_prepare_try() to
buf_page_get_gen().
As searching for id of trx, which modified secondary index record, is
quite expensive operation, restrict its usage for master. System variable
was added to remove the restriction for testing simplifying. The
variable exists only either for debug build or for build with
-DINNODB_ENABLE_XAP_UNLOCK_UNMODIFIED_FOR_PRIMARY option to increase the
probability of catching bugs for release build with RQG.
Note that the code, which does primary index lookup to find out what
transaction modified secondary index record, is necessary only when
there is no primary key and no unique secondary key on replica with row
based replication, because only in this case extra X locks on unmodified
records can be set during scan phase.
Reviewed by Marko Mäkelä.
There is no need to exclude exclusive non-gap locks from the procedure
of locks releasing on XA PREPARE execution in
lock_release_on_prepare_try() after commit
17e59ed3aa (MDEV-33454), because
lock_rec_unlock_unmodified() should check if the record was modified
with the XA, and release the lock if it was not.
lock_release_on_prepare_try(): don't skip X-locks, let
lock_rec_unlock_unmodified() to process them.
lock_sec_rec_some_has_impl(): add template parameter for not acquiring
trx_t::mutex for the case if a caller already holds the mutex, don't
crash if lock's bitmap is clean.
row_vers_impl_x_locked(), row_vers_impl_x_locked_low(): add new argument
to skip trx_t::mutex acquiring.
rw_trx_hash_t::validate_element(): don't acquire trx_t::mutex if the
current thread already holds it.
Thanks to Andrei Elkin for finding the bug.
Reviewed by Marko Mäkelä, Debarun Banerjee.
convert_error_code_to_mysql(): Treat DB_DEADLOCK and DB_RECORD_CHANGED
in the same way, that is, signal to the SQL layer that the transaction
had been rolled back.
Replication of non-transactional engines is experimental and
uses TOI. This naturally means that if there is open transaction
with transactional engine it's changes will be rolled back.
Fixed by adding error message if non-transactional engine
is part of multi-engine transaction with warning.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Clean up configuration and tests. Add wait conditions to make
sure test continues from clean state.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
If semi-sync is switched off then on while a transaction is
in-between binlogging and waiting for an ACK, the semi-sync state of
the transaction is removed, leading to a debug assertion that
indicates the transaction tried to wait, but cannot receive an ACK
signal. More specifically, when semi-sync is switched off, the
Active_tranx list is cleared (where a transaction adds an entry to
this list during binlogging), and each entry in this list saves the
thread which will wait for an ACK, and the thread has the COND
variable to signal to wake itself. So if the entry is lost, the
Ack_receiver thread won’t be able to find the thread to wake up when
an ACK comes in
The fix is to ensure that the entry exists before awaiting the ACK,
and if there is no entry, skip the wait. In debug builds, an
informative message is written explaining that the transaction is
skipping its wait. Additional debug-build only logic is added to
ensure that the cause of the missing entry is due to semi-sync being
turned off and on
Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
- InnoDB fulltext rebuilds the FTS COMMON table while adding the
new fulltext index. This can be optimized by avoiding rebuilding
the FTS COMMON table in case of FTS COMMON TABLE already exists.
Reviewed-by: Marko Mäkelä <marko.makela@mariadb.com>
When adding a column or index that uses plugin-defined
sysvar-based options with CREATE ... LIKE the server
was using the current value of the sysvar, not the default one.
Because parse_option_list() function was used both in create
and open and it tried to guess when it's create (need to use
current sysvar value and add a new name=value pair to the list)
or open (need to use default, without extending the list).
Let's move the list extending functionality into a separate
function and call it explicitly when needed. Operations that
add new objects (CREATE, ALTER ... ADD) will extend the list,
other operations (ALTER, CREATE ... LIKE, open) will not.
The invariant of write-ahead logging is that before any change to a
page is written to the data file, the corresponding log record must
must first have been durably written.
On crash recovery, there were some sloppy checks for this. Let us
implement accurate checks and flag an inconsistency as a hard error,
so that we can avoid further corruption of a corrupted database.
For data extraction from the corrupted database, innodb_force_recovery
can be used.
Before recovery is reading any data pages or invoking
buf_dblwr_t::recover() to recover torn pages from the
doublewrite buffer, InnoDB will have parsed the log until the
final LSN and updated log_sys.lsn to that. So, we can rely on
log_sys.lsn at all times. The doublewrite buffer recovery has been
refactored in such a way that the recv_sys.dblwr.pages may be consulted
while discovering files and their page sizes, but nothing will be
written back to data files before buf_dblwr_t::recover() is invoked.
A section of the test mariabackup.innodb_redo_overwrite
that is parsing some mariadb-backup --backup output has
been removed, because that output "redo log block is overwritten"
would often be missing in a Microsoft Windows environment
as a result of these changes.
recv_max_page_lsn, recv_lsn_checks_on: Remove.
recv_sys_t::validate_checkpoint(): Validate the write-ahead-logging
condition at the end of the recovery.
recv_dblwr_t::validate_page(): Keep track of the maximum LSN
(if we are checking a non-doublewrite copy of a page) but
do not complain LSN being in the future. The doublewrite buffer
is a special case, because it will be read early during recovery.
Besides, starting with commit 762bcb81b5
the dblwr=true copies of pages may legitimately be "too new".
recv_dblwr_t::find_page(): Find a valid page with the smallest
FIL_PAGE_LSN that is in the valid range for recovery.
recv_dblwr_t::restore_first_page(): Replaced by find_page().
Only buf_dblwr_t::recover() will write to data files.
buf_dblwr_t::recover(): Simplify the message output. Do attempt
doublewrite recovery on user page read error. Ignore doublewrite
pages whose FIL_PAGE_LSN is outside the usable bounds. Previously,
we could wrongly recover a too new page from the doublewrite buffer.
It is unlikely that this could have lead to an actual error.
Write back all recovered pages from the doublewrite buffer here,
including for the first page of any tablespace.
buf_page_is_corrupted(): Distinguish the return values
CORRUPTED_FUTURE_LSN and CORRUPTED_OTHER.
buf_page_check_corrupt(): Return the error code DB_CORRUPTION
in case the LSN is in the future.
Datafile::read_first_page(): Handle FSP_SPACE_FLAGS=0xffffffff
in the same way on both 32-bit and 64-bit architectures.
Datafile::read_first_page_flags(): Split from read_first_page().
Take a copy of the first page as a parameter.
recv_sys_t::free_corrupted_page(): Take the file as a parameter
and return whether a message was displayed. This avoids some duplicated
and incomplete error messages.
buf_page_t::read_complete(): Remove some redundant output and always
display the name of the corrupted file. Never return DB_FAIL;
use it only in internal error handling.
IORequest::read_complete(): Assume that buf_page_t::read_complete()
will have reported any error.
fil_space_t::set_corrupted(): Return whether this is the first time
the tablespace had been flagged as corrupted.
Datafile::validate_first_page(), fil_node_open_file_low(),
fil_node_open_file(), fil_space_t::read_page0(),
fil_node_t::read_page0(): Add a parameter for a copy of the
first page, and a parameter to indicate whether the FIL_PAGE_LSN
check should be suppressed. Before buf_dblwr_t::recover() is
invoked, we cannot validate the FIL_PAGE_LSN, but we can trust the
FSP_SPACE_FLAGS and the tablespace ID that may be present in a
potentially too new copy of a page.
Reviewed by: Debarun Banerjee
Remove workaround for MDEV-13941, it served for 5 years,and all affected
pre-release 10.2 installation should have been already fixed in between.
Apparently Innodb is using is_sparse parameter in os_file_set_size()
inconsistently, and it passes is_sparse=false now during first file
extension. With MDEV-13941 workaround in place, it would unsparse
the file, which is makes compression not to work at all anymore.
Implement variable legacy_xa_rollback_at_disconnect to support
backwards compatibility for applications that rely on the pre-10.5
behavior for connection disconnect, which is to rollback the
transaction (in violation of the XA specification).
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Problem:
=======
- Redundant table fails to insert into the table after
instant drop blob column. Instant drop column only marking
the column as hidden and consecutive insert statement tries
to insert NULL value for the dropped BLOB column and returns
the fixed length of the blob type as 65535. This lead to
row size too large error.
Fix:
====
For redundant table, if the non-fixed dropped column can be null
then set the length of the field type as 0.
- InnoDB fails to set the index information or index number
for the spatial index error HA_ERR_NULL_IN_SPATIAL.
row_build_spatial_index_key(): Initialize the tmp_mbr array completely.
check_if_supported_inplace_alter(): Fix the spelling mistake of alter
Sometimes, in MariaDB Server 10.5 but apparently not in later branches,
the test would hang because con1 and con2 would be blocked in
debug_sync (for example, lock_wait_suspend_thread_enter and
row_ins_sec_index_entry_dup_locks_created) and therefore blocking
the purge of transactions from completing.
To prevent an occasional DEBUG_SYNC induced hang in the test, we will
wait for everything to be purged, except the last 2 transactions.
This change should be null-merged to 10.6, because the test is not
failing in 10.6 or later major versions.
Hash index is vcol-based wrapper (MDEV-371). row_end is added to
unique index. So when row_end is updated unique hash index must be
recalculated via vcol_update_fields(). DELETE did not update virtual
fields, so DELETE HISTORY was getting wrong hash value.
The fix does update_virtual_fields() on vers_update_end() so in every
case row_end is updated virtual fields are updated as well.
work consistently on replication
Row-based replication does not execute CREATE .. SELECT but instead
CREATE TABLE. CREATE .. SELECT creates implict system fields on
unusual place: in-between declared fields and select fields. That was
done because select_field_pos logic requires select fields go last in
create_list.
So, CREATE .. SELECT on master and CREATE TABLE on slave create system
fields on different positions and replication gets field mismatch.
To fix this we've changed CREATE .. SELECT to create implicit system
fields on usual place in the end and updated select_field_pos for
handling this case.
When calculate_cond_selectivity_for_table() takes into account multi-
column selectivities from range access, it tries to take-into account
that selectivity for some columns may have been already taken into account.
For example, for range access on IDX1 using {kp1, kp2}, the selectivity
of restrictions on "kp2" might have already been taken into account
to some extent.
So, the code tries to "discount" that using rec_per_key[] estimates.
This seems to be wrong and unreliable: the "discounting" may produce a
rselectivity_multiplier number that hints that the overall selectivity
of range access on IDX1 was greater than 1.
Do a conservative fix: if we arrive at conclusion that selectivity of
range access on condition in IDX1 >1.0, clip it down to 1.
row_purge_reset_trx_id(): Reserve large enough offsets for accomodating
the maximum width PRIMARY KEY followed by DB_TRX_ID,DB_ROLL_PTR.
Reviewed by: Thirunarayanan Balathandayuthapani
Don't allow the referencing key column from NULL TO NOT NULL
when
1) Foreign key constraint type is ON UPDATE SET NULL
2) Foreign key constraint type is ON DELETE SET NULL
3) Foreign key constraint type is UPDATE CASCADE and referenced
column declared as NULL
Don't allow the referenced key column from NOT NULL to NULL
when foreign key constraint type is UPDATE CASCADE
and referencing key columns doesn't allow NULL values
get_foreign_key_info(): InnoDB sends the information about
nullability of the foreign key fields and referenced key fields.
fk_check_column_changes(): Enforce the above rules for COPY
algorithm
innobase_check_foreign_drop_col(): Checks whether the dropped
column exists in existing foreign key relation
innobase_check_foreign_low() : Enforce the above rules for
INPLACE algorithm
dict_foreign_t::check_fk_constraint_valid(): This is used
by CREATE TABLE statement to check nullability for foreign
key relation.