If replicating an event in ROW format, and InnoDB detects a deadlock
while searching for a row, the row event will raise an error and roll
back in InnoDB, and indicate that the binlog cache also needs to be
cleared by marking thd->transaction_rollback_request. In the normal
case, this triggers an error in Rows_log_event::do_apply_event()
and causes a rollback. During the Rows_log_event::do_apply_event()
cleanup of a successful event application, there is a DBUG_ASSERT in
log_event_server.cc::rows_event_stmt_cleanup() which encodes the
expectation that thd->transaction_rollback_request cannot be set,
because the general rollback (i.e. not the InnoDB rollback) should
already have happened. However, if the replica is configured to skip
deadlock errors, the rows event logic clears the error and continues
on as if no error happened. This leaves
thd->transaction_rollback_request set while in
rows_event_stmt_cleanup(), thereby triggering the assertion.
This patch fixes this in the following ways:
1) The assertion is invalid and is therefore removed.
2) The rollback path is forced in rows_event_stmt_cleanup() if
transaction_rollback_request is set, as sketched below.
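A minimal sketch of the fixed decision, using stand-in types and stub
helpers (the real function is rows_event_stmt_cleanup() in
log_event_server.cc):

  // Stand-in for THD; only the flag relevant here is modelled.
  struct THD_stub
  {
    bool transaction_rollback_request; // set by InnoDB on deadlock
  };

  static int rollback_stmt_and_trans(THD_stub*) { return 0; } // stub
  static int commit_stmt(THD_stub*)             { return 0; } // stub

  int rows_event_stmt_cleanup_sketch(THD_stub *thd, int error)
  {
    // The old DBUG_ASSERT(!thd->transaction_rollback_request) is gone:
    // with --slave-skip-errors the event's error may have been cleared
    // while the rollback request is still pending.
    if (error || thd->transaction_rollback_request)
      return rollback_stmt_and_trans(thd); // force the rollback path
    return commit_stmt(thd);
  }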
Note the differing behavior between transactions skipped due to
deadlock errors and those skipped due to other errors. When a
transaction is
skipped due to an ignored deadlock error, the entire transaction is
rolled back and skipped (though note MDEV-33930 which allows
statements in the same transaction after the deadlock-inducing one
to commit). When a transaction is skipped due to ignoring a
different error, only the erroring statements are rolled back and
skipped - the rest of the transaction executes as normal. The
effect of this can be seen in the test results. The test case added
to rpl_skip_error.test shows that, within a larger transaction, only
the statements failing with ignored non-deadlock errors are skipped.
A diff between rpl_temporary_error2_skip_all.result and
rpl_temporary_error2.result shows that all statements in the errored
transaction are rolled back (diff pasted below):
: diff rpl_temporary_error2.result rpl_temporary_error2_skip_all.result
49c49
< 2 1
---
> 2 NULL
51c51
< 4 1
---
> 4 NULL
53c53
< * There will be two rows in t2 due to the retry.
---
> * There will be one row in t2 because the ignored deadlock does not retry.
57d56
< 1
59c58
< 1
---
> 0
Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
On Windows systems, occurrences of ERROR_SHARING_VIOLATION due to
conflicting share modes between processes accessing the same file can
result in CreateFile failures.
mysys' my_open() already incorporates a workaround by implementing
wait/retry logic on Windows.
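For illustration, a minimal sketch of that kind of wait/retry loop
(the retry count and delay are made-up values; the real logic lives
in mysys, not in this form):

  #include <windows.h>

  // Retry CreateFile for a while when another process holds a
  // conflicting share mode (ERROR_SHARING_VIOLATION).
  HANDLE open_with_retry(const char *name, DWORD access, DWORD share)
  {
    for (int attempt= 0; attempt < 10; attempt++) // made-up retry count
    {
      HANDLE h= CreateFileA(name, access, share, NULL,
                            OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
      if (h != INVALID_HANDLE_VALUE ||
          GetLastError() != ERROR_SHARING_VIOLATION)
        return h;
      Sleep(100); // made-up delay: back off and retry
    }
    return INVALID_HANDLE_VALUE;
  }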
But this does not help if files are opened using shell redirection,
as mysqltest traditionally did it, i.e. via

  --exec echo "some text" > output_file

In such cases it is cmd.exe that opens the output_file, and it
won't do any sharing-violation retries.
This commit addresses the issue by introducing a new built-in command,
'write_line', in mysqltest. This new command serves as a brief alternative
to 'write_file', with a single line output, that also resolves variables
like "exec" would.
Internally, this command uses my_open(), and therefore the
retry-on-error logic.
Hopefully this will eliminate the very sporadic "can't open file because
it is used by another process" error on CI.
Issue:
------
The actual order of acquisition of the IBUF pessimistic insert mutex
(SYNC_IBUF_PESS_INSERT_MUTEX) and IBUF header page latch
(SYNC_IBUF_HEADER) w.r.t space latch (SYNC_FSP) differs from the order
defined in sync0types.h. It was not discovered earlier because the
path through ibuf_remove_free_page was not covered by an mtr test.
The ideal order, as defined in sync0types.h, is as follows:
SYNC_IBUF_HEADER -> SYNC_IBUF_PESS_INSERT_MUTEX -> SYNC_FSP
In ibuf_remove_free_page we acquire the space latch first, giving
the following order and resulting in an assertion failure with
innodb_sync_debug=ON:
SYNC_FSP -> SYNC_IBUF_HEADER -> SYNC_IBUF_PESS_INSERT_MUTEX
Fix:
----
We do maintain this order in other places and there doesn't seem to
be any real issue here. To reduce the impact in GA versions, we
avoid extensive changes to the mutex ordering to match the current
SYNC_IBUF_PESS_INSERT_MUTEX order. Instead, we relax the ordering
check for the IBUF pessimistic insert mutex by using
SYNC_NO_ORDER_CHECK.
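The rule that innodb_sync_debug enforces can be modelled in isolation
(the numeric levels here are illustrative, not the real sync0types.h
values):

  #include <cassert>
  #include <vector>

  enum
  {
    SYNC_FSP= 400,
    SYNC_IBUF_PESS_INSERT_MUTEX= 410,
    SYNC_IBUF_HEADER= 420,
    SYNC_NO_ORDER_CHECK= 3000 // exempt from the ordering check
  };

  thread_local std::vector<int> held; // levels this thread already holds

  void latch_acquire(int level)
  {
    // Latches must be taken in strictly descending level order,
    // unless the latch is exempt from checking.
    if (level != SYNC_NO_ORDER_CHECK)
      for (int h : held)
        assert(level < h);
    held.push_back(level);
  }

Acquiring SYNC_FSP first and then one of the ibuf latches violates
the descending rule and fires the assert; tagging a latch with
SYNC_NO_ORDER_CHECK exempts it, which is what the fix does for the
pessimistic insert mutex.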
The test was waiting for the INSERT to roll back, but the
wait_condition was too tight: the thread state could be either
"Freeing items" or "Rollback". Fixed the wait_condition to expect
either of them.
create_partitioning_metadata() should only mark the transaction
read-write if it actually did anything (that is, if the table is
partitioned). Otherwise it is a no-op; it is called even for
temporary tables and shouldn't do anything at all.
If some unique key fields are nullable, a unique index can contain
several records with the same key field values, as long as at least
one key field is NULL, because NULL != NULL.
When a transaction is resumed after waiting on a record with at
least one key field equal to NULL, and the record stored in the
persistent cursor has been deleted, the persistent cursor can be
restored to a record with all key fields equal to the stored ones
but with at least one field equal to NULL. Such a record was wrongly
treated as having the same unique key as the record stored in the
persistent cursor, which is incorrect because NULL != NULL.
The fix is to check whether at least one unique field is NULL in the
restored persistent cursor position and, if so, not to treat the
record as one with the same unique key as in the stored record key.
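A minimal model of the added check (stand-in record representation;
the real code inspects the restored btr_pcur position):

  #include <optional>
  #include <vector>

  using Field= std::optional<long>; // nullopt models SQL NULL

  // Equal field values alone are not enough after restoring the
  // cursor: if any unique-key field is NULL, the record cannot be
  // "the same" row, because NULL != NULL in unique-index semantics.
  bool same_unique_key(const std::vector<Field> &stored,
                       const std::vector<Field> &restored)
  {
    for (size_t i= 0; i < stored.size(); i++)
    {
      if (!restored[i].has_value()) // any NULL => not the same key
        return false;
      if (stored[i] != restored[i])
        return false;
    }
    return true;
  }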
dict_index_t::nulls_equal was removed, as it was initially developed
for "intrinsic tables", which never existed in MariaDB, and there is
no code that would set it to true.
Reviewed by Marko Mäkelä.
The test failure in rpl.rpl_domain_id_filter_restart is caused by
MDEV-33887. That is, the test uses master_pos_wait() (called
indirectly by sync_slave_with_master) to wait for the replica to
catch up to the master. However, the waited-on transaction is
ignored by the configured
CHANGE MASTER TO IGNORE_DOMAIN_IDS=().
As MDEV-33887 reports, due to the IO thread updating the binlog
coordinates and the SQL thread updating the GTID state, if the
replica is stopped in-between these updates, the replica state will
be inconsistent. That is, the test expects that the GTID state will
be updated, so upon restart, the replica will be up-to-date.
However, if the replica is stopped before the SQL thread updates its
GTID state, then upon restart, the replica will fetch the previously
ignored event, which is no longer ignored upon restart, and execute
it. This leads to the sporadic extra row in t2.
This patch changes the test to use master_gtid_wait() instead of
master_pos_wait(), to ensure the replica state is consistent with
the master state.
- This issue is caused by commit 188c5da72a (MDEV-32453).
InnoDB fails to end the bulk insert for the table after
applying the bulk insert operation. This leads to an assertion
failure during the commit process.
Before MDEV-15158, wsrep xid information was stored in only one place:
in the TRX_SYS page. Starting with 10.3, it is not stored there but
in the rollback segment header pages, and the latest one is what
matters. MDEV-19229 allows the undo tablespaces to be rebuilt when
innodb_undo_tablespaces is changed on startup. Previously it was not
possible to change that parameter.
As a result, rollback segment header pages could contain several
stored wsrep xids, and when undo tablespaces were rebuilt, there was
an effort to restore the wsrep xid back to the rollback segment
header page; but because there were several of them, the latest
wsrep xid could be overwritten with an older one.
trx_rseg_read_wsrep_checkpoint
trx_rseg_init_wsrep_xid
  Return true if the xid read is a wsrep xid, false if not.
trx_rseg_mem_restore
  Try to read the wsrep xid and, if it is found, copy it to
  trx_sys.recovered_wsrep_xid if the read xid has a larger seqno.
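A sketch of the rule that trx_rseg_mem_restore now follows (stand-in
xid type; the field names are not the real ones):

  #include <cstdint>

  struct wsrep_xid_stub { bool is_wsrep; int64_t seqno; };

  // A rollback segment header may hold an older wsrep xid; it must
  // not overwrite a newer one that was already recovered.
  void maybe_update_recovered(wsrep_xid_stub &recovered,
                              const wsrep_xid_stub &read_xid)
  {
    if (read_xid.is_wsrep && read_xid.seqno > recovered.seqno)
      recovered= read_xid;
  }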
increase the MASTER_CONNECT_RETRY time under valgrind; otherwise
the slave gives up retrying before the master is ready.
also, cosmetic cleanup of rpl_semi_sync_master_shutdown.test
The crash when running mysqlbinlog on a binlog file containing a
SEQUENCE was caused by the MDEV-29621 fixes, which did not check
whether the slave or the binlog applier executes a block introduced
there. The block is meaningful only for the parallel slave applier,
so it is safe to fix this bug by identifying the actual applier and
skipping the block when it is the mysqlbinlog one.
In commit d74d95961a (MDEV-18543)
there was an error that would cause the hidden metadata record
to be deleted, and therefore cause the table to appear corrupted
when it is reloaded into the data dictionary cache.
PageConverter::update_records(): Do not delete the metadata record,
but do validate it.
RecIterator::open(): Make the API more similar to 10.6, to simplify
merges.
When the system variable @@debug_dbug was assigned to
some expression, Sys_debug_dbug::do_check() did not properly
convert the value from the expression character set to utf8.
So the value was erroneously re-interpreted as utf8 without
conversion. In the case of a tricky expression character set
(e.g. utf16le), this led to unexpected results.
Fix:
Reuse Sys_var_charptr::do_string_check() in Sys_debug_dbug::do_check().
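To see why mere reinterpretation fails: the string "d1" in utf16le is
the bytes 64 00 31 00, which is not a valid utf8 encoding of "d1". A
stand-alone conversion sketch (BMP characters only, no surrogate
handling; the server uses its own charset routines instead):

  #include <cstdint>
  #include <string>

  std::string utf16le_to_utf8(const uint8_t *p, size_t len)
  {
    std::string out;
    for (size_t i= 0; i + 1 < len; i+= 2)
    {
      uint16_t c= uint16_t(p[i] | (p[i + 1] << 8));
      if (c < 0x80)
        out+= char(c);                       // 1-byte sequence
      else if (c < 0x800)
      {
        out+= char(0xC0 | (c >> 6));         // 2-byte sequence
        out+= char(0x80 | (c & 0x3F));
      }
      else
      {
        out+= char(0xE0 | (c >> 12));        // 3-byte sequence
        out+= char(0x80 | ((c >> 6) & 0x3F));
        out+= char(0x80 | (c & 0x3F));
      }
    }
    return out;
  }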
The signal handler thread can use various runtime
resources when processing a SIGHUP (e.g. master-info information)
due to calling into reload_acl_and_cache(). Currently, the shutdown
process waits for the termination of the signal thread after
performing cleanup. However, this could cause resources actively
used by the signal handler to be freed while reload_acl_and_cache()
is processing.
The specific resource that caused MDEV-30260 is a race condition for
the hostname_cache, such that mysqld would delete it in
clean_up()::hostname_cache_free(), before the signal handler would
use it in reload_acl_and_cache()::hostname_cache_refresh().
Another similar resource is the active_mi/master_info_index. There
was a race between its deletion by the main thread in end_slave()
and its usage by the signal handler as part of
Master_info_index::flush_all_relay_logs.read(active_mi) in
reload_acl_and_cache().
This patch fixes these race conditions by moving the point at which
server shutdown waits for the signal handler to die to after the
server-level threads have been killed (i.e., as a last step of
close_connections()). With respect to the hostname_cache, active_mi
and master_info_cache, this ensures that they cannot be destroyed
while the signal handler is still active and potentially using
them.
Additionally:
1) This requires that Events memory is still in place for SIGHUP
handling's mysql_print_status(). So event deinitialization is moved
into clean_up(), but the event scheduler still needs to be stopped
in close_connections() at the same spot.
2) The function kill_server_thread is no longer used, so it is
deleted.
3) The timeout to wait for the death of the signal thread was not
consistent with the comment. The comment mentioned up to 10 seconds,
whereas it was actually 0.01s. The code has been fixed to wait up to
10 seconds.
4) A warning has been added if the signal handler thread fails to
exit in time.
5) Added a pthread_join() at the end of
wait_for_signal_thread_to_end(), with a warning if the thread hadn't
ended within 10s. Note this also removes the pthread_detached
attribute from the signal_thread to allow for the pthread_join()
(see the sketch below).
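A sketch of the resulting wait (names and intervals are illustrative,
not the exact server code):

  #include <cstdio>
  #include <pthread.h>
  #include <unistd.h>

  pthread_t signal_thread;                  // created joinable, not detached
  volatile bool signal_thread_in_use= true; // cleared by the handler on exit

  void wait_for_signal_thread_to_end_sketch()
  {
    // Poll for up to 10 seconds (1000 * 10ms); the old code waited
    // only 0.01s despite the comment promising 10 seconds.
    for (int i= 0; i < 1000 && signal_thread_in_use; i++)
      usleep(10000);
    if (signal_thread_in_use)
      fprintf(stderr, "Warning: signal handler thread did not exit\n");
    pthread_join(signal_thread, NULL); // legal because not detached
  }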
Reviewed By:
============
Vladislav Vaintroub <wlad@mariadb.com>
Andrei Elkin <andrei.elkin@mariadb.com>
The problem was that if wsrep_load_data_splitting was used,
streaming replication (SR) parameters were set for a MyISAM
table. Galera does not currently support SR for MyISAM.
The fix is to ignore the wsrep_load_data_splitting setting (with a
warning) if the table is not an InnoDB table.
This is the 10.6+ variant of the fix.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
The problem was a too-tight condition in ha_commit_trans that did
not allow non-transactional storage engines to participate in 2PC
in the Galera case. This is required because a transaction using
e.g. stored procedures might read the mysql.proc table inside a
transaction, and that table currently uses the Aria storage engine,
which does not support 2PC.
Fixed by allowing storage engines that do not support two-phase
commit to participate in a 2PC transaction when they are read-only;
these are committed later separately, as sketched below.
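The shape of the relaxed condition, using stand-in types (the real
check lives in ha_commit_trans()):

  struct Ha_info_stub
  {
    bool supports_2pc;  // the engine implements prepare()
    bool is_read_write; // the engine modified data in this transaction
  };

  // Engines without two-phase commit may still participate if they
  // only read: they have nothing to prepare and are committed
  // separately later.
  bool allowed_in_2pc(const Ha_info_stub &ht)
  {
    return ht.supports_2pc || !ht.is_read_write;
  }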
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Even after commit b8a6719889 there
is an anomaly where a locking read could return inconsistent results.
If a locking read would have to wait for a record lock, then by the
definition of a read view, the modifications made by the current lock
holder cannot be visible in the read view. This is because the read
view must exclude any transactions that had not been committed at the
time when the read view was created.
lock_rec_convert_impl_to_expl_for_trx(), lock_rec_convert_impl_to_expl():
Return an unsafe-to-dereference pointer to a transaction that holds or
held the lock, or nullptr if the lock was available.
lock_clust_rec_modify_check_and_lock(),
lock_sec_rec_read_check_and_lock(),
lock_clust_rec_read_check_and_lock():
Return DB_RECORD_CHANGED if innodb_strict_isolation=ON and the
lock was being held by another transaction.
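A sketch of the resulting decision, with stand-in types and error
codes (the real checks live in the lock_*_check_and_lock() functions
listed above):

  enum dberr_stub { DB_SUCCESS, DB_RECORD_CHANGED, DB_LOCK_WAIT };

  struct trx_stub { bool strict_isolation; };

  dberr_stub lock_and_check(const trx_stub &me, const trx_stub *holder)
  {
    if (holder == nullptr)
      return DB_SUCCESS;        // the lock was available
    if (me.strict_isolation)
      return DB_RECORD_CHANGED; // holder's changes can't be in our view
    return DB_LOCK_WAIT;        // classic behaviour: wait for the lock
  }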
The test case, which is based on a bug report by Zhuang Liu,
covers the function lock_sec_rec_read_check_and_lock().
Reviewed by: Vladislav Lesin
Keep track of each recently active XID, recording which worker it was queued
on. If an XID might still be active, choose the same worker to queue event
groups that refer to the same XID to avoid conflicts.
Otherwise, schedule the XID freely in the next round-robin slot.
This way, XA PREPARE can normally be scheduled without restrictions (unless
duplicate XID transactions come close together). This improves scheduling
and parallelism over the old method, where the worker thread to schedule XA
PREPARE on was fixed based on a hash value of the XID.
XA COMMIT will normally be scheduled on the same worker as XA PREPARE, but
can be a different one if the XA PREPARE is far back in the event history.
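A simplified model of the bookkeeping (the real implementation bounds
and expires the set of remembered XIDs; this sketch never trims it):

  #include <string>
  #include <unordered_map>

  struct XidScheduler
  {
    std::unordered_map<std::string, int> recent; // XID -> worker index
    int next_rr= 0, n_workers;

    explicit XidScheduler(int workers) : n_workers(workers) {}

    int schedule(const std::string &xid)
    {
      auto it= recent.find(xid);
      if (it != recent.end())
        return it->second;           // possibly-active XID: same worker
      int w= next_rr++ % n_workers;  // otherwise schedule freely
      recent[xid]= w;                // remember for the matching XA COMMIT
      return w;
    }
  };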
The testcase and the code for trimming the dynamic array are due to
Andrei.
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
On Microsoft Windows, ReadFile() as well as WriteFile() limit the size
of the request to DWORD, which is 32 bits (at most 4 GiB - 1) also on
64-bit systems.
On FreeBSD, sysctl debug.iosize_max_clamp could limit the size of a
write request to INT_MAX. The size of a read request is always limited
to INT_MAX. This would allow the request size to be 4095 bytes more than
the Linux limit (0x7ffff000 according to "man 2 read" and "man 2 write").
On OpenBSD, Solaris and possibly NetBSD, the read request size is limited
to SSIZE_T_MAX, which would be half the current maximum
innodb_log_buffer_size. This should not be much of an issue anyway,
because on contemporary 64-bit platforms, the virtual addresses are
limited to 48 bits.
IBM AIX documentation mentions OFF_MAX which would apply when
a 64-bit application is running on a 32-bit kernel.
Let us declare innodb_log_buffer_size as 32-bit unsigned and make the
maximum 0x7ffff000, to be compatible with the least common
denominator (Linux).
The maximum innodb_sort_buffer_size already was 64 MiB,
which is not a problem.
SyncFileIO::execute(): Assert that the size of a synchronous read or
write request is limited to the maximum.
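A sketch of the assertion (stand-in names; the limit is the Linux
per-call value adopted above):

  #include <cassert>
  #include <cstddef>

  static const size_t MAX_SYNC_IO_BYTES= 0x7ffff000; // Linux read/write cap

  void sync_file_io_execute_sketch(size_t request_bytes)
  {
    // A single synchronous request must fit every platform's limit.
    assert(request_bytes <= MAX_SYNC_IO_BYTES);
    // ... issue one pread()/pwrite() (or ReadFile()/WriteFile()) ...
  }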
Reviewed by: Vladislav Vaintroub
A GTID event can have variable length, with contributing factors
such as the variable length from the flags2 and optional extra flags
fields. These fields are bitmaps, where each set bit indicates an
additional value that should be appended to the event, e.g.
multi-engine transactions append a number to indicate the number of
additional engines a transaction uses. However, if a flags bit is
set but no additional fields are appended to the event, MDEV-33672
reports that the server can still try to read from memory as if the
fields did exist. Note, however, that in debug builds this
condition is asserted for FL_EXTRA_MULTI_ENGINE.
This patch fixes this by checking that the length of the event is
aligned with the expectation set by the flags for FL_PREPARED_XA,
FL_COMPLETED_XA, and FL_EXTRA_MULTI_ENGINE.
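The fix amounts to validating the advertised optional payload against
the bytes actually present. A sketch with made-up flag values and
payload sizes (the real layout is defined by Gtid_log_event):

  #include <cstddef>
  #include <cstdint>

  // Made-up flag values and payload sizes, for illustration only.
  enum { FL_PREPARED_XA= 0x01, FL_COMPLETED_XA= 0x02,
         FL_EXTRA_MULTI_ENGINE= 0x04 };

  // Return false if the flags promise more optional payload than the
  // event actually carries, instead of reading past the end of it.
  bool gtid_event_len_ok(const uint8_t *p, const uint8_t *end,
                         unsigned flags)
  {
    size_t need= 0;
    if (flags & FL_EXTRA_MULTI_ENGINE)
      need+= 1;   // one byte: count of extra engines
    if (flags & (FL_PREPARED_XA | FL_COMPLETED_XA))
      need+= 6;   // made-up size of the XID-related payload
    return size_t(end - p) >= need;
  }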
Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Reason:
=======
- InnoDB fails to apply the buffered insert operation if an
AFTER INSERT trigger modifies the same table. This behaviour
leads to an empty table for the subsequent insert operation
and a server abort.
Solution:
=========
- InnoDB should apply the buffered insert operation if an
"after insert" trigger changes the same table.
Commit 6dce6aeceb breaks out of a loop in ha_partition::info()
when some partitions aren't opened, in which case the
auto_increment_value assertion will fail. This commit patches that
hole.
the value of 200 isn't enough for some tests anymore; this causes
some random threads to become non-instrumented, and any table
operations there are not reflected in the perfschema. If, say, a
DROP TABLE doesn't change the perfschema state, perfschema tables
might show ghost tables that no longer exist in the server.
- InnoDB unnecessarily reserves free extents during blob page
allocation, even though btr_page_alloc() can handle reserving an
extent when the existing reserved pages run out.
on a busy system it might take time for buffer_page_written_index_leaf
to reach the correct value. Wait for it.
also, tag identical statements to be different in the result file.
The fix in this commit handles appending of foreign key values to
the write set so that database and table names are converted from
the file-path format to the table-name format. This is compatible
with key values appended from elsewhere in the code base.
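For illustration, a toy decoder for the @xxxx filename encoding
(simplified; the server's own conversion routines handle the full
rules). The on-disk name foo@002dbar denotes the table foo-bar:

  #include <string>

  static int hexval(char c)
  {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
  }

  // Decode "@xxxx" escapes: "foo@002dbar" -> "foo-bar".
  std::string filepath_to_tablename(const std::string &in)
  {
    std::string out;
    for (size_t i= 0; i < in.size(); )
    {
      if (in[i] == '@' && i + 4 < in.size())
      {
        int v= 0;
        bool ok= true;
        for (size_t j= i + 1; j <= i + 4; j++)
        {
          int h= hexval(in[j]);
          if (h < 0) { ok= false; break; }
          v= v * 16 + h;
        }
        if (ok) { out+= char(v); i+= 5; continue; }
      }
      out+= in[i++];
    }
    return out;
  }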
There is an mtr test, galera.galera_table_with_hyphen, for
regression testing.
Reviewer: monty@mariadb.com
MDEV-26473 fixed a segmentation fault at startup between the handle
manager thread and the binlog background thread, such that the
binlog background thread could be started and submit a job to the
handle manager, before it had initialized. While MDEV-26473 made
the handle manager initialize before the main thread started the
normal binary logs, it did not account for the recovery case. That
is, there is still a possibility of a segmentation fault
when a server is recovering using the binary logs such that it can
open the binary logs, start the binlog background thread, and submit
a job to the handle manager before it is initialized.
This patch fixes this by moving the initialization of the mysql
handle manager to happen prior to recovery.
Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
Delayed_insert has its own THD (initialized in mysql_insert()) and
hence its own LEX. Delayed_insert initializes very few parameters
for its LEX, and 'duplicates' is not among them. Now we copy this
missing parameter from the parser LEX (as well as sql_command).
When INSERT does auto-create for t1, all its handler instances are
closed by alter_close_table(). At this time, down the stack,
maria_close() clears share->state_history. Later, when we unlock
the tables, the Aria transaction manager accesses the old share
instance (the one from before t1 was closed) and tries to reset its
state_history.
The problem is that maria_close() didn't remove the table from the
transaction's list (used_tables). The fix calls
_ma_remove_table_from_trnman(), which is triggered by
HA_EXTRA_PREPARE_FOR_RENAME.
Some fixes related to commit f838b2d799 and
Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row()
for system-versioned tables were provided by Nikita Malyavin.
This was required by test versioning.rpl,trx_id,row.