When converting a table (test.s3_table) from S3 to another engine, the
following will be logged to the binary log:
DROP TABLE IF EXISTS test.t1;
CREATE OR REPLACE TABLE test.t1 (...) ENGINE=new_engine
INSERT rows to test.t1 in binary-row-log-format
The bug is that the above statements are logged one by one to the binary
log. This means that a fast slave, configured to use the same S3 storage
as the master, would be able to execute the DROP and CREATE from the
binary log before the master has finished the ALTER TABLE.
In this case the slave would ignore the DROP (as it's on a S3 table) but
it will stop on CREATE of the local tale, as the table is still exists in
S3. The REPLACE part will be ignored by the slave as it can't touch the
S3 table.
The fix is to ensure that all the above statements is written to binary
log AFTER the table has been deleted from S3.
- Rewrote bool Query_compressed_log_event::write() to make it more readable
(no logic changes).
- Changed DBUG_PRINT of 'is_error:' to 'is_error():' to make it easier to
find error: in traces.
- Ensure that 'db' is never null in Query_log_event (Simplified code).
The used code is largely based on code from Tencent
The problem is that in some rare cases there may be a conflict between .frm
files and the files in the storage engine. In this case the DROP TABLE
was not able to properly drop the table.
Some MariaDB/MySQL forks has solved this by adding a FORCE option to
DROP TABLE. After some discussion among MariaDB developers, we concluded
that users expects that DROP TABLE should always work, even if the
table would not be consistent. There should not be a need to use a
separate keyword to ensure that the table is really deleted.
The used solution is:
- If a .frm table doesn't exists, try dropping the table from all storage
engines.
- If the .frm table exists but the table does not exist in the engine
try dropping the table from all storage engines.
- Update storage engines using many table files (.CVS, MyISAM, Aria) to
succeed with the drop even if some of the files are missing.
- Add HTON_AUTOMATIC_DELETE_TABLE to handlerton's where delete_table()
is not needed and always succeed. This is used by ha_delete_table_force()
to know which handlers to ignore when trying to drop a table without
a .frm file.
The disadvantage of this solution is that a DROP TABLE on a non existing
table will be a bit slower as we have to ask all active storage engines
if they know anything about the table.
Other things:
- Added a new flag MY_IGNORE_ENOENT to my_delete() to not give an error
if the file doesn't exist. This simplifies some of the code.
- Don't clear thd->error in ha_delete_table() if there was an active
error. This is a bug fix.
- handler::delete_table() will not abort if first file doesn't exists.
This is bug fix to handle the case when a drop table was aborted in
the middle.
- Cleaned up mysql_rm_table_no_locks() to ensure that if_exists uses
same code path as when it's not used.
- Use non_existing_Table_error() to detect if table didn't exists.
Old code used different errors tests in different position.
- Table_triggers_list::drop_all_triggers() now drops trigger file if
it can't be parsed instead of leaving it hanging around (bug fix)
- InnoDB doesn't anymore print error about .frm file out of sync with
InnoDB directory if .frm file does not exists. This change was required
to be able to try to drop an InnoDB file when .frm doesn't exists.
- Fixed bug in mi_delete_table() where the .MYD file would not be dropped
if the .MYI file didn't exists.
- Fixed memory leak in Mroonga when deleting non existing table
- Fixed memory leak in Connect when deleting non existing table
Bugs fixed introduced by the original version of this commit:
MDEV-22826 Presence of Spider prevents tables from being force-deleted from
other engines
MDEV-22531 Remove maria::implicit_commit()
MDEV-22607 Assertion `ha_info->ht() != binlog_hton' failed in
MYSQL_BIN_LOG::unlog_xa_prepare
From the handler point of view, Aria now looks like a transactional
engine. One effect of this is that we don't need to call
maria::implicit_commit() anymore.
This change also forces the server to call trans_commit_stmt() after doing
any read or writes to system tables. This work will also make it easier
to later allow users to have system tables in other engines than Aria.
To handle the case that Aria doesn't support rollback, a new
handlerton flag, HTON_NO_ROLLBACK, was added to engines that has
transactions without rollback (for the moment only binlog and Aria).
Other things
- Moved freeing of MARIA_SHARE to a separate function as the MARIA_SHARE
can be still part of a transaction even if the table has closed.
- Changed Aria checkpoint to use the new MARIA_SHARE free function. This
fixes a possible memory leak when using S3 tables
- Changed testing of binlog_hton to instead test for HTON_NO_ROLLBACK
- Removed checking of has_transaction_manager() in handler.cc as we can
assume that as the transaction was started by the engine, it does
support transactions.
- Added new class 'start_new_trans' that can be used to start indepdendent
sub transactions, for example while reading mysql.proc, using help or
status tables etc.
- open_system_tables...() and open_proc_table_for_Read() doesn't anymore
take a Open_tables_backup list. This is now handled by 'start_new_trans'.
- Split thd::has_transactions() to thd::has_transactions() and
thd::has_transactions_and_rollback()
- Added handlerton code to free cached transactions objects.
Needed by InnoDB.
squash! 2ed35999f2a2d84f1c786a21ade5db716b6f1bbc
All changes (except one) is of type
thd->transaction. -> thd->transaction->
thd->transaction points by default to 'thd->default_transaction'
This allows us to 'easily' have multiple active transactions for a
THD object, like when reading data from the mysql.proc table
Analysis:
========
RESET MASTER TO # command deletes all binary log files listed in the index
file, resets the binary log index file to be empty, and creates a new binary
log with number #. When the user provided binary log number is greater than
the max allowed value '2147483647' server fails to generate a new binary log.
The RESET MASTER statement marks the binlog closure status as
'LOG_CLOSE_TO_BE_OPENED' and exits. Statements which follow RESET MASTER
try to write to binary log they find the log_state != LOG_CLOSED and
proceed to write to binary log cache and it results in crash.
Fix:
===
During MYSQL_BIN_LOG open, if generation of new binary log name fails then the
"log_state" needs to be marked as "LOG_CLOSED". With this further statements
will find binary log as closed and they will skip writing to the binary log.
This is a backport of the applicable part of
commit 93475aff8d and
commit 2c39f69d34
from 10.4.
Before 10.4 and Galera 4, WSREP_ON is a macro that points to
a global Boolean variable, so it is not that expensive to
evaluate, but we will add an unlikely() hint around it.
WSREP_ON_NEW: Remove. This macro was introduced in
commit c863159c32
when reverting WSREP_ON to its previous definition.
We replace some use of WSREP_ON with WSREP(thd), like it was done
in 93475aff8d. Note: the macro
WSREP() in 10.1 is equivalent to WSREP_NNULL() in 10.4.
Item_func_rand::seed_random(): Avoid invoking current_thd
when WSREP is not enabled.
When binlog is disabled, WSREP will not behave correctly when
SAVEPOINT ROLLBACK is executed since we don't register handlers for such case.
Fixed by registering WSREP handlerton for SAVEPOINT related commands.
MDEV-21604
Added "virtual" low level write function encrypt_or_write that is set
to point to either normal or encrypted write functions.
This patch also fixes a possible memory leak if writing to binary log fails.
MDEV-21605 Clean up and speed up interfaces for binary row logging
MDEV-21617 Bug fix for previous version of this code
The intention is to have as few 'if' as possible in ha_write() and
related functions. This is done by pre-calculating once per statement the
row_logging state for all tables.
Benefits are simpler and faster code both when binary logging is disabled
and when it's enabled.
Changes:
- Added handler->row_logging to make it easy to check it table should be
row logged. This also made it easier to disabling row logging for system,
internal and temporary tables.
- The tables row_logging capabilities are checked once per "statements
that updates tables" in THD::binlog_prepare_for_row_logging() which
is called when needed from THD::decide_logging_format().
- Removed most usage of tmp_disable_binlog(), reenable_binlog() and
temporary saving and setting of thd->variables.option_bits.
- Moved checks that can't change during a statement from
check_table_binlog_row_based() to check_table_binlog_row_based_internal()
- Removed flag row_already_logged (used by sequence engine)
- Moved binlog_log_row() to a handler::
- Moved write_locked_table_maps() to THD::binlog_write_table_maps() as
most other related binlog functions are in THD.
- Removed binlog_write_table_map() and binlog_log_row_internal() as
they are now obsolete as 'has_transactions()' is pre-calculated in
prepare_for_row_logging().
- Remove 'is_transactional' argument from binlog_write_table_map() as this
can now be read from handler.
- Changed order of 'if's in handler::external_lock() and wsrep_mysqld.h
to first evaluate fast and likely cases before more complex ones.
- Added error checking in ha_write_row() and related functions if
binlog_log_row() failed.
- Don't clear check_table_binlog_row_based_result in
clear_cached_table_binlog_row_based_flag() as it's not needed.
- THD::clear_binlog_table_maps() has been replaced with
THD::reset_binlog_for_next_statement()
- Added 'MYSQL_OPEN_IGNORE_LOGGING_FORMAT' flag to open_and_lock_tables()
to avoid calculating of binary log format for internal opens. This flag
is also used to avoid reading statistics tables for internal tables.
- Added OPTION_BINLOG_LOG_OFF as a simple way to turn of binlog temporary
for create (instead of using THD::sql_log_bin_off.
- Removed flag THD::sql_log_bin_off (not needed anymore)
- Speed up THD::decide_logging_format() by remembering if blackhole engine
is used and avoid a loop over all tables if it's not used
(the common case).
- THD::decide_logging_format() is not called anymore if no tables are used
for the statement. This will speed up pure stored procedure code with
about 5%+ according to some simple tests.
- We now get annotated events on slave if a CREATE ... SELECT statement
is transformed on the slave from statement to row logging.
- In the original code, the master could come into a state where row
logging is enforced for all future events if statement could be used.
This is now partly fixed.
Other changes:
- Ensure that all tables used by a statement has query_id set.
- Had to restore the row_logging flag for not used tables in
THD::binlog_write_table_maps (not normal scenario)
- Removed injector::transaction::use_table(server_id_type sid, table tbl)
as it's not used.
- Cleaned up set_slave_thread_options()
- Some more DBUG_ENTER/DBUG_RETURN, code comments and minor indentation
changes.
- Ensure we only call THD::decide_logging_format_low() once in
mysql_insert() (inefficiency).
- Don't annotate INSERT DELAYED
- Removed zeroing pos_in_table_list in THD::open_temporary_table() as it's
already 0
MDEV-19964 S3 replication support
Added new configure options:
s3_slave_ignore_updates
"If the slave has shares same S3 storage as the master"
s3_replicate_alter_as_create_select
"When converting S3 table to local table, log all rows in binary log"
This allows on to configure slaves to have the S3 storage shared or
independent from the master.
Other thing:
Added new session variable '@@sql_if_exists' to force IF_EXIST to DDL's.
* Remove dead code
* MDEV-21675 Data inconsistency after multirow insert rollback
This patch fixes data inconsistencies that happen after rollback of
multirow inserts, with binlog disabled.
For example, statements such as `INSERT INTO t1 VALUES (1,'a'),(1,'b')`
that fail with duplicate key error. In such cases the whole statement
is rolled back. However, with wsrep_emulate_binlog in effect, the
IO_CACHE would not be truncated, and the pending rows events would be
replicated to the rest of the cluster. In the above example, it would
result in row (1,'a') being replicated, whereas locally the statement
is rolled back entirely. Making the cluster inconsistent.
The patch changes the code so that prior to statement rollback,
pending rows event are removed and the stmt cache reset.
That patch also introduces MTR tests that excercise multirow insert
statements for regular, and streaming replication.
Lifted long standing limitation to the XA of rolling it back at the
transaction's
connection close even if the XA is prepared.
Prepared XA-transaction is made to sustain connection close or server
restart.
The patch consists of
- binary logging extension to write prepared XA part of
transaction signified with
its XID in a new XA_prepare_log_event. The concusion part -
with Commit or Rollback decision - is logged separately as
Query_log_event.
That is in the binlog the XA consists of two separate group of
events.
That makes the whole XA possibly interweaving in binlog with
other XA:s or regular transaction but with no harm to
replication and data consistency.
Gtid_log_event receives two more flags to identify which of the
two XA phases of the transaction it represents. With either flag
set also XID info is added to the event.
When binlog is ON on the server XID::formatID is
constrained to 4 bytes.
- engines are made aware of the server policy to keep up user
prepared XA:s so they (Innodb, rocksdb) don't roll them back
anymore at their disconnect methods.
- slave applier is refined to cope with two phase logged XA:s
including parallel modes of execution.
This patch does not address crash-safe logging of the new events which
is being addressed by MDEV-21469.
CORNER CASES: read-only, pure myisam, binlog-*, @@skip_log_bin, etc
Are addressed along the following policies.
1. The read-only at reconnect marks XID to fail for future
completion with ER_XA_RBROLLBACK.
2. binlog-* filtered XA when it changes engine data is regarded as
loggable even when nothing got cached for binlog. An empty
XA-prepare group is recorded. Consequent Commit-or-Rollback
succeeds in the Engine(s) as well as recorded into binlog.
3. The same applies to the non-transactional engine XA.
4. @@skip_log_bin=OFF does not record anything at XA-prepare
(obviously), but the completion event is recorded into binlog to
admit inconsistency with slave.
The following actions are taken by the patch.
At XA-prepare:
when empty binlog cache - don't do anything to binlog if RO,
otherwise write empty XA_prepare (assert(binlog-filter case)).
At Disconnect:
when Prepared && RO (=> no binlogging was done)
set Xid_cache_element::error := ER_XA_RBROLLBACK
*keep* XID in the cache, and rollback the transaction.
At XA-"complete":
Discover the error, if any don't binlog the "complete",
return the error to the user.
Kudos
-----
Alexey Botchkov took to drive this work initially.
Sergei Golubchik, Sergei Petrunja, Marko Mäkelä provided a number of
good recommendations.
Sergei Voitovich made a magnificent review and improvements to the code.
They all deserve a bunch of thanks for making this work done!
Support for galera GTID consistency thru cluster. All nodes in cluster
should have same GTID for replicated events which are originating from cluster.
Cluster originating commands need to contain sequential WSREP GTID seqno
Ignore manual setting of gtid_seq_no=X.
In master-slave scenario where master is non galera node replicated GTID is
replicated and is preserved in all nodes.
To have this - domain_id, server_id and seqnos should be same on all nodes.
Node which bootstraps the cluster, to achieve this, sends domain_id and
server_id to other nodes and this combination is used to write GTID for events
that are replicated inside cluster.
Cluster nodes that are executing non replicated events are going to have different
GTID than replicated ones, difference will be visible in domain part of gtid.
With wsrep_gtid_domain_id you can set domain_id for WSREP cluster.
Functions WSREP_LAST_WRITTEN_GTID, WSREP_LAST_SEEN_GTID and
WSREP_SYNC_WAIT_UPTO_GTID now works with "native" GTID format.
Fixed galera tests to reflect this chances.
Add variable to manually update WSREP GTID seqno in cluster
Add variable to manipulate and change WSREP GTID seqno. Next command
originating from cluster and on same thread will have set seqno and
cluster should change their internal counter to it's value.
Behavior is same as using @@gtid_seq_no for non WSREP transaction.
Problem:
-------
Accessing a member within 'xid_count_per_binlog' structure results in
following error when 'UBSAN' is enabled.
member access within address 0xXXX which does not point to an object of type
'xid_count_per_binlog'
Analysis:
---------
The problem appears to be that no constructor for 'xid_count_per_binlog' is
being called, and thus the vtable will not be initialized.
Fix:
---
Defined a parameterized constructor for 'xid_count_per_binlog' class.