2706 commits

Author SHA1 Message Date
Aleksey Midenkov
93c8252f02 MDEV-25292 Atomic CREATE OR REPLACE TABLE
Atomic CREATE OR REPLACE keeps the old table intact if the
command fails or the server crashes. That is done by creating
a table with a temporary name and filling it with the data
(for CREATE OR REPLACE .. SELECT), then renaming the original table
to another temporary (backup) name and renaming the replacement table
to the original name. The backup table is kept until the last chance of
failure; if a failure happens, the replacement table is dropped and the
backup is restored. When the command is complete and logged, the backup
table is deleted.

Atomic replace algorithm

  Two DDL chains are used for CREATE OR REPLACE:
  ddl_log_state_create (C) and ddl_log_state_rm (D).

  1. (C) Log CREATE_TABLE_ACTION of TMP table (drops TMP table);
  2. Create new table as TMP;
  3. Do everything with TMP (like insert data);

  finalize_atomic_replace():
  4. Link chains: (D) is executed only if (C) is closed;
  5. (D) Log DROP_ACTION of BACKUP;
  6. (C) Log RENAME_TABLE_ACTION from ORIG to BACKUP (replays BACKUP -> ORIG);
  7. Rename ORIG to BACKUP;
  8. (C) Log CREATE_TABLE_ACTION of ORIG (drops ORIG);
  9. Rename TMP to ORIG;

  finalize_ddl() in case of success:
  10. Close (C);
  11. Replay (D): BACKUP is dropped.

  finalize_ddl() in case of error:
  10. Close (D);
  11. Replay (C):
    1) ORIG is dropped (only after finalize_atomic_replace());
    2) BACKUP renamed to ORIG (only after finalize_atomic_replace());
    3) drop TMP.

  If a crash happens, (C) or (D) is replayed in reverse order. (C) is
  replayed if the crash happens before it is closed; otherwise (D) is
  replayed.
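
  The steps above can be modelled with a toy in-memory catalog. This is
  only a sketch, not MariaDB code: Catalog, Chain, drop_table(),
  rename_table() and create_or_replace() are hypothetical names, and
  chain linking (step 4) is reduced to replaying exactly one of the two
  chains.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy catalog: table name -> contents. Not MariaDB code.
using Catalog = std::map<std::string, std::string>;
using Undo = std::function<void(Catalog&)>;

// A DDL log chain: actions are logged in order and replayed in reverse.
struct Chain {
  std::vector<Undo> actions;
  void log(Undo a) { actions.push_back(std::move(a)); }
  void close() { actions.clear(); }  // a closed chain is never replayed
  void replay(Catalog& c) {
    for (auto it = actions.rbegin(); it != actions.rend(); ++it) (*it)(c);
    actions.clear();
  }
};

void drop_table(Catalog& c, const std::string& name) { c.erase(name); }

void rename_table(Catalog& c, const std::string& from, const std::string& to) {
  auto it = c.find(from);
  if (it == c.end()) return;  // nothing to rename
  std::string data = it->second;
  c.erase(it);
  c[to] = data;
}

// Runs steps 1-11 on a copy of the catalog. `fail` simulates an error
// raised after finalize_atomic_replace() has completed.
Catalog create_or_replace(Catalog cat, const std::string& new_data, bool fail) {
  Chain create_chain, rm_chain;                                             // (C), (D)
  create_chain.log([](Catalog& c) { drop_table(c, "TMP"); });               // 1
  cat["TMP"] = new_data;                                                    // 2, 3
  // 4: linking is modelled by replaying exactly one chain below.
  rm_chain.log([](Catalog& c) { drop_table(c, "BACKUP"); });                // 5
  create_chain.log([](Catalog& c) { rename_table(c, "BACKUP", "ORIG"); });  // 6
  rename_table(cat, "ORIG", "BACKUP");                                      // 7
  create_chain.log([](Catalog& c) { drop_table(c, "ORIG"); });              // 8
  rename_table(cat, "TMP", "ORIG");                                         // 9
  if (fail) {
    rm_chain.close();          // 10 (error)
    create_chain.replay(cat);  // 11: drop ORIG, BACKUP -> ORIG, drop TMP
  } else {
    create_chain.close();      // 10 (success)
    rm_chain.replay(cat);      // 11: drop BACKUP
  }
  return cat;
}
```

  Starting from a catalog that holds only ORIG, the success path leaves
  ORIG with the new contents and the error path leaves ORIG with the old
  contents; in both cases neither TMP nor BACKUP survives.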

Temporary table for CREATE OR REPLACE

  Before dropping "old" table, CREATE OR REPLACE creates "tmp" table.
  ddl_log_state_create holds the drop of the "tmp" table.  When
  everything is OK (data is inserted, "tmp" is ready) ddl_log_state_rm
  is written to replace "old" with "tmp". Until ddl_log_state_create
  is closed ddl_log_state_rm is not executed.

  After the binlogging is done ddl_log_state_create is closed. At that
  point ddl_log_state_rm is executed and "old" is replaced with
  "tmp". That is: final rename is done by the DDL log.

  Given that important role of the DDL log in the CREATE OR REPLACE
  operation, replay of ddl_log_state_rm must fail at the first error hit
  and print the error message if possible. For example, a foreign key
  error is discovered at this phase: InnoDB refuses to drop the "old"
  table and returns the corresponding foreign key error code.

Additional notes

  - CREATE TABLE without REPLACE is not affected by this commit.

  - Engines having HTON_EXPENSIVE_RENAME flag set are not affected by
    this commit.

  - CREATE TABLE .. SELECT XID usage is fixed and now there is no need
    to log DROP TABLE via DDL_CREATE_TABLE_PHASE_LOG (see comments in
    do_postlock()). XID is now correctly updated so it disables
    DDL_LOG_DROP_TABLE_ACTION. Note that binary log is flushed at the
    final stage when the table is ready. So if we have XID in the
    binary log we don't need to drop the table.

  - Three variations of CREATE OR REPLACE handled:

    1. CREATE OR REPLACE TABLE t1 (..);
    2. CREATE OR REPLACE TABLE t1 LIKE t2;
    3. CREATE OR REPLACE TABLE t1 SELECT ..;

  - Test case uses 6 combinations for engines (aria, aria_notrans,
    myisam, ib, lock_tables, expensive_rename) and 2 combinations for
    binlog types (row, stmt). Combinations help to check differences
    between the results. Error failures are tested for the above three
    variations.

  - expensive_rename tests CREATE OR REPLACE without atomic
    replace. The effect should be the same as with the old behaviour
    before this commit.

  - Triggers mechanism is unaffected by this change. This is tested in
    create_replace.test.

  - LOCK TABLES is affected. Lock restoration must be done after "rm"
    chain is replayed.

  - Moved ddl_log_complete() from send_eof() to finalize_ddl(). This
    checkpoint was not executed before for normal CREATE TABLE but is
    executed now.

  - CREATE TABLE will now also roll back if writing to the binary
    log fails. See rpl_gtid_strict.test.

Rename and drop via DDL log

  We replay ddl_log_state_rm to drop the old table and rename the
  temporary table. In that case we must throw the correct error
  message if ddl_log_revert() fails (for example, on an FK error).

  If the table is deleted earlier and not via the DDL log, and a crash
  happens, the create chain is not closed. The linked drop chain is not
  executed and the new table is not installed, but the old table is
  already deleted.

ddl_log.cc changes

  Now we can place action before DDL_LOG_DROP_INIT_ACTION and it will
  be replayed after DDL_LOG_DROP_TABLE_ACTION.

  The report_error parameter of ddl_log_revert() makes it fail at the
  first error and print the error message if possible.
  ddl_log_execute_action() can now print an error message.

  Since we can now handle errors from ddl_log_execute_action() (in
  case of non-recovery execution), the unconditional setting of
  "error= TRUE" is wrong (it was wrong anyway because it was overwritten
  at the end of the function).

On XID usage

  Like with all other atomic DDL operations XID is used to avoid
  inconsistency between master and slave in the case of a crash after
  binary log is written and before ddl_log_state_create is closed. On
  recovery XIDs are taken from binary log and corresponding DDL log
  events get disabled.  That is done by
  ddl_log_close_binlogged_events().
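
  The idea can be illustrated with a small sketch. It is not the code of
  ddl_log_close_binlogged_events(): DdlLogEntry, close_binlogged_entries()
  and the convention that xid 0 means "no XID" are assumptions made for
  the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified DDL log execute entry.
struct DdlLogEntry {
  std::uint64_t xid;  // 0 is assumed to mean "no XID"
  bool active;        // inactive entries are skipped by recovery
};

// Disables every active entry whose XID was found in the binary log:
// the statement was binlogged, so its rollback must not be replayed.
// Returns the number of entries disabled.
std::size_t close_binlogged_entries(
    std::vector<DdlLogEntry>& log,
    const std::unordered_set<std::uint64_t>& binlog_xids) {
  std::size_t disabled = 0;
  for (DdlLogEntry& e : log) {
    if (e.active && e.xid != 0 && binlog_xids.count(e.xid) != 0) {
      e.active = false;
      ++disabled;
    }
  }
  return disabled;
}
```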

On linking two chains together

  Chains are executed in the ascending order of entry_pos of execute
  entries. But entry_pos assignment order is undefined: it may assign
  bigger number for the first chain and then smaller number for the
  second chain. So the execution order in that case will be reverse:
  second chain will be executed first.

  To avoid that we link one chain to another. While the base chain
  (ddl_log_state_create) is active the secondary chain
  (ddl_log_state_rm) is not executed. That is: only one chain can be
  executed in two linked chains.

  The interface ddl_log_link_chains() was done in "MDEV-22166
  ddl_log_write_execute_entry() extension".
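
  A minimal sketch of the ordering problem and of the effect of linking
  follows. ExecuteEntry and chains_to_replay() are hypothetical; only
  entry_pos and the chain roles come from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical execute entry of one DDL log chain.
struct ExecuteEntry {
  std::string name;
  unsigned entry_pos;  // recovery visits entries in ascending entry_pos
  bool active;         // false once the chain is closed
  int base;            // index of the base chain it is linked to, or -1
};

// Returns the names of the chains recovery would replay, in order.
// A linked chain is skipped while its base chain is still active.
std::vector<std::string> chains_to_replay(const std::vector<ExecuteEntry>& entries) {
  std::vector<const ExecuteEntry*> sorted;
  for (const ExecuteEntry& e : entries) sorted.push_back(&e);
  std::sort(sorted.begin(), sorted.end(),
            [](const ExecuteEntry* a, const ExecuteEntry* b) {
              return a->entry_pos < b->entry_pos;
            });
  std::vector<std::string> replayed;
  for (const ExecuteEntry* e : sorted) {
    if (!e->active) continue;
    if (e->base >= 0 && entries[static_cast<std::size_t>(e->base)].active) continue;
    replayed.push_back(e->name);
  }
  return replayed;
}
```

  With entry_pos 7 for the create chain and 3 for the rm chain, unlinked
  chains would replay "rm" first. Linking "rm" to "create" means only
  "create" is replayed while it is active, and only "rm" once it is
  closed.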

More on CREATE OR REPLACE .. SELECT

  We use create_and_open_tmp_table() like in ALTER TABLE to create
  temporary TABLE object (tmp_table is (NON_)TRANSACTIONAL_TMP_TABLE).

  After we created such TABLE object we use create_info->tmp_table()
  instead of table->s->tmp_table when we need to check for
  parser-requested tmp-table.

  External locking is required for a temporary table created by
  create_and_open_tmp_table(). For example, that disables logging for
  Aria transactional tables, and without that (when no
  mysql_lock_tables() is done) it cannot work correctly.

  For making external lock the patch requires Aria table to work in
  non-transactional mode. That is usually done by
  ha_enable_transaction(false). But we cannot disable transaction
  completely because: 1. binlog rollback removes pending row events
  (binlog_remove_pending_rows_event()). The row events are added
  during CREATE .. SELECT data insertion phase. 2. replication slave
  highly depends on transaction and cannot work without it.

  So we put temporary Aria table into non-transactional mode with
  "thd->transaction->on hack". See comment for on_save variable.

  Note that Aria table has internal_table mode. But we cannot use it
  because:

  if (!internal_table)
  {
    mysql_mutex_lock(&THR_LOCK_myisam);
    old_info= test_if_reopen(name_buff);
  }

  For internal_table test_if_reopen() is not called and we get a new
  MARIA_SHARE for each file handler. In that case duplicate errors are
  missed because insert and lookup in CREATE .. SELECT is done via two
  different handlers (see create_lookup_handler()).

  For temporary table before dropping TABLE_SHARE by
  drop_temporary_table() we must do ha_reset(). ha_reset() releases
  storage share. Without that the share is kept and the second CREATE
  OR REPLACE .. SELECT fails with:

    HA_ERR_TABLE_EXIST (156): MyISAM table '#sql-create-b5377-4-t2' is
    in use (most likely by a MERGE table). Try FLUSH TABLES.

    HA_EXTRA_PREPARE_FOR_DROP also removes MYISAM_SHARE, but that is
    not needed as ha_reset() does the job.

  ha_reset() is usually done by
  mark_tmp_table_as_free_for_reuse(). But we don't need that mechanism
  for our temporary table.

Atomic_info in HA_CREATE_INFO

  Many functions in CREATE TABLE pass the same parameters. These
  parameters are part of table creation info and should be in
  HA_CREATE_INFO (or whatever). Passing parameters via single
  structure is much easier for adding new data and
  refactoring.

InnoDB changes (revised by Marko Mäkelä)

  row_rename_table_for_mysql(): Specify the treatment of FOREIGN KEY
  constraints in a 4-valued enum parameter. In cases where FOREIGN KEY
  constraints cannot exist (partitioned tables, or internal tables of
  FULLTEXT INDEX), we can use the mode RENAME_IGNORE_FK.
  The mode RENAME_REBUILD is for any DDL operation that rebuilds the
  table inside InnoDB, such as TRUNCATE and native ALTER TABLE
  (or OPTIMIZE TABLE). The mode RENAME_ALTER_COPY is used solely
  during non-native ALTER TABLE in ha_innobase::rename_table().
  Normal ha_innobase::rename_table() will use the mode RENAME_FK.
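
  The mapping described above could be sketched as follows. The four
  mode names are taken from this message; the enum name rename_fk_mode,
  the enumerator order, the Operation type and mode_for() are
  illustrative assumptions and not the actual InnoDB declarations.

```cpp
#include <cassert>

// Mode names come from the commit message; everything else is assumed.
enum rename_fk_mode {
  RENAME_IGNORE_FK,   // FOREIGN KEY constraints cannot exist
  RENAME_REBUILD,     // table is rebuilt inside InnoDB
  RENAME_ALTER_COPY,  // non-native (copying) ALTER TABLE
  RENAME_FK           // normal rename: constraints are renamed too
};

// Hypothetical classification of the callers named in the text.
enum class Operation {
  PartitionOrFulltextInternal,
  RebuildInsideInnoDB,  // TRUNCATE, native ALTER TABLE, OPTIMIZE TABLE
  NonNativeAlterCopy,
  PlainRename
};

rename_fk_mode mode_for(Operation op) {
  switch (op) {
    case Operation::PartitionOrFulltextInternal: return RENAME_IGNORE_FK;
    case Operation::RebuildInsideInnoDB:         return RENAME_REBUILD;
    case Operation::NonNativeAlterCopy:          return RENAME_ALTER_COPY;
    case Operation::PlainRename:                 return RENAME_FK;
  }
  return RENAME_FK;  // unreachable for valid input
}
```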

  CREATE OR REPLACE will rename the old table (if one exists) along
  with its FOREIGN KEY constraints into a temporary name. The replacement
  table will be initially created with another temporary name.
  Unlike in ALTER TABLE, all FOREIGN KEY constraints must be renamed
  and not inherited as part of these operations, using the mode RENAME_FK.

  dict_get_referenced_table(): Let the callers convert names when needed.

  create_table_info_t::create_foreign_keys(): CREATE OR REPLACE creates
  the replacement table with a temporary name, so for
  self-references foreign->referenced_table will be a table with a
  temporary name and charset conversion must be skipped for it.

Reviewed by:

  Michael Widenius <monty@mariadb.org>
2022-08-31 11:55:04 +03:00
Sergei Golubchik
c38b8f49b8 cleanup: consolidate binlog-related THD::*_used into one bitmap 2022-08-10 15:03:10 +02:00
Marko Mäkelä
f53f64b7b9 Merge 10.8 into 10.9 2022-07-28 10:47:33 +03:00
Marko Mäkelä
f79cebb4d0 Merge 10.7 into 10.8 2022-07-28 10:33:26 +03:00
Marko Mäkelä
742e1c727f Merge 10.6 into 10.7 2022-07-27 18:26:21 +03:00
Marko Mäkelä
30914389fe Merge 10.5 into 10.6 2022-07-27 17:52:37 +03:00
Marko Mäkelä
098c0f2634 Merge 10.4 into 10.5 2022-07-27 17:17:24 +03:00
Oleksandr Byelkin
3bb36e9495 Merge branch '10.3' into 10.4 2022-07-27 11:02:57 +02:00
Andrei
8d238d4726 MDEV-28609 refine gtid-strict-mode to ignore same server-id gtid from the past
... on semisync slave

To provide semisync master crash recovery, same-server-id transactions
were made acceptable for execution on the semisync slave in strict gtid
mode (see MDEV-27760).
That, however, caused an out-of-order error on the transaction's
originating master in a circular setup.
The error was fair in the sense of the gtid strict mode rule, as indeed
in a circular setup the replicated transaction
already exists in the local binlog.

This commit fixes it by ignoring, on a semisync slave in gtid strict
mode, those gtids that exist in the slave's binlog, which effectively
restores the default same-server-id ignore policy.
At the same time the fix complies with the MDEV-21117 semisync slave
recovery, which accepts same-server-id transactions that do not exist in
the local binlog.
2022-07-26 16:01:14 +03:00
Andrei
5bf4dee369 MDEV-28948 FLUSH BINARY LOGS waits/hangs on mysql_mutex_unlock(&LOCK_index)
The hang may be caused by a 1pc branch that was fixed by MDEV-26031 in
10.6 and up. That commit did not look relevant in 10.5 and below,
so it was not pushed to the lower branches.

To possibly tackle the reported issue,
MDEV-26031 is now backported with a test that,
unlike in 10.6, does not expose the former bug in 10.5.
The test is only needed for checking the refined logic
inside MYSQL_BIN_LOG::write_transaction_to_binlog.
The latter is made to do away with xid-unlogging (which is suspected
to have been at fault) for xid-less transactions.
2022-07-26 10:46:01 +03:00
Brandon Nesterenko
555c12a541 MDEV-21087/MDEV-21433: ER_SLAVE_INCIDENT arrives at slave without failure specifics
Problem:
=======

This patch addresses two issues:

 1. An incident event can be incorrectly reported for transactions
which are rolled back successfully. That is, an incident event
should only be generated for failed “non-transactional transactions”
(i.e., those which modify non-transactional tables) because they
cannot be rolled back.

 2. When the MariaDB slave stops with an error on receiving the incident
event, there is no description of what led to it, neither in the event
nor in the master's error log.

Solution:
========

Before reporting an incident event for a transaction, first validate
that it is “non-transactional” (i.e. cannot be safely rolled back).
To determine if a transaction is non-transactional,
  lex->stmt_accessed_table(LEX::STMT_WRITES_NON_TRANS_TABLE)
is used because it is set previously in
THD::decide_logging_format().

Additionally, when an incident event is written, write an error
message to the server’s error log to indicate the underlying issue.

Reviewed by:
===========
Andrei Elkin <andrei.elkin@mariadb.com>
2022-07-25 16:26:53 -06:00
Vladislav Vaintroub
016dd21371 MDEV-27142 disable text mode for Windows stdio by default
This avoids LF->CRLF conversion by the C runtime, which historically has
been rather buggy (see MDEV-9409)

Disabling text mode also fixes --binary-mode in the command line client
to work the same on Windows as it does elsewhere.

The user-visible effect is that some text files, e.g. the output of
mysqldump or mysqlbinlog, will have LF end-of-lines instead of CRLF.
That should be acceptable, as even Notepad can read these Unix EOLs
since 2018 (on older Windows, Wordpad can).

Leave the error log in text (CRLF) mode for now, for the sake of old Windows.
2022-07-18 13:18:03 +02:00
Marko Mäkelä
5a33a37682 Merge 10.8 into 10.9 2022-06-07 09:20:07 +03:00
Marko Mäkelä
57d4a242da Merge 10.7 into 10.8 2022-06-06 16:22:09 +03:00
Marko Mäkelä
7e39470e33 Merge 10.6 into 10.7 2022-06-06 14:56:20 +03:00
Marko Mäkelä
2f8d0af883 Merge 10.5 into 10.6 2022-06-02 17:39:13 +03:00
Marko Mäkelä
4b3c3e526e Merge 10.4 into 10.5 2022-06-02 16:51:13 +03:00
mkaruza
ebbd5ef6e2 MDEV-27862 Galera should replicate nextval()-related changes in sequences with INCREMENT <> 0, at least NOCACHE ones with engine=InnoDB
Sequence storage engine is not transactional, so the cache will be
written in stmt_cache, which is not replicated in the cluster. To fix
this, replicate what is available in both trans_cache and stmt_cache.

Sequences will only work when the NOCACHE keyword is used when the
sequence is created. If WSREP is enabled and we don't have this keyword,
report an error indicating that the sequence will not work correctly in
the cluster.

When binlog is enabled, the statement cache will be cleared in the
transaction before COMMIT, so the cache generated from the sequence will
not be replicated. We need to keep the cache until replication.

Tests are re-recorded because of replication changes that were
introduced with this PR.

Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
2022-05-30 12:43:52 +03:00
Sergei Golubchik
93e64d1f58 cleanup: log_current_statement and OPTION_KEEP_LOG
rename OPTION_KEEP_LOG -> OPTION_BINLOG_THIS_TRX.
    Meaning: transaction cache will be written to binlog even on rollback.

    convert log_current_statement to OPTION_BINLOG_THIS_STMT.
    Meaning: the statement will be written to binlog (or trx binlog cache)
    even if it normally wouldn't be.

    setting OPTION_BINLOG_THIS_STMT must always set OPTION_BINLOG_THIS_TRX,
    otherwise the statement won't be logged if the transaction is rolled back.
    Use OPTION_BINLOG_THIS to set both.
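
    The invariant can be written down as a small sketch. The bit
    positions below are placeholders chosen for the example, not the
    values used in the server, and flags_consistent() is a hypothetical
    helper.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder bit values; the real definitions live in the server source.
constexpr std::uint64_t OPTION_BINLOG_THIS_STMT = 1ULL << 0;
constexpr std::uint64_t OPTION_BINLOG_THIS_TRX  = 1ULL << 1;
// Setting the statement flag alone is not allowed, so a combined
// constant sets both at once.
constexpr std::uint64_t OPTION_BINLOG_THIS =
    OPTION_BINLOG_THIS_STMT | OPTION_BINLOG_THIS_TRX;

// Hypothetical check: THIS_STMT must never be set without THIS_TRX.
bool flags_consistent(std::uint64_t options) {
  return (options & OPTION_BINLOG_THIS_STMT) == 0 ||
         (options & OPTION_BINLOG_THIS_TRX) != 0;
}
```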
2022-05-06 10:45:17 +03:00
Marko Mäkelä
504a3b32f6 Merge 10.8 into 10.9 2022-04-28 15:54:03 +03:00
Marko Mäkelä
133c2129cd Merge 10.7 into 10.8 2022-04-27 10:43:00 +03:00
Marko Mäkelä
638afc4acf Merge 10.6 into 10.7 2022-04-26 18:59:40 +03:00
Marko Mäkelä
fae0ccad6e Merge 10.5 into 10.6 2022-04-21 17:46:40 +03:00
Marko Mäkelä
620c55e708 Merge 10.4 into 10.5 2022-04-21 15:33:50 +03:00
Marko Mäkelä
394784095e Merge 10.3 into 10.4 2022-04-21 11:33:59 +03:00
Sergei Golubchik
bbdec04d59 MDEV-24317 Data race in LOGGER::init_error_log at sql/log.cc:1443 and in LOGGER::error_log_print at sql/log.cc:1181
don't initialize error_log_handler_list in set_handlers()
* error_log_handler_list is initialized to LOG_FILE early, in init_base()
* set_handlers always reinitializes it to LOG_FILE, so it's pointless
* after init_base() concurrent threads start using sql_log_warning,
  so following set_handlers() shouldn't modify error_log_handler_list
  without some protection
2022-04-12 13:07:20 +02:00
Marko Mäkelä
8680eedb26 Merge 10.8 into 10.9 2022-03-30 09:41:14 +03:00
Marko Mäkelä
5c69e93630 Merge 10.7 into 10.8 2022-03-30 09:34:07 +03:00
Marko Mäkelä
a4d753758f Merge 10.6 into 10.7 2022-03-30 08:52:05 +03:00
Marko Mäkelä
b242c3141f Merge 10.5 into 10.6 2022-03-29 16:16:21 +03:00
Marko Mäkelä
d62b0368ca Merge 10.4 into 10.5 2022-03-29 12:59:18 +03:00
mkaruza
97f237e66d MDEV-25912 wsrep does not identify checksummed events correctly
For GTID consistency, a GTID event was artificially added before
replication happened. This event should not contain a calculated CHECKSUM.

Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
2022-03-28 14:10:27 +03:00
Alexey Yurchenko
9d7e596ba6 MDEV-26971: JSON file interface to wsrep node state.
Integration with status reporter in wsrep-lib.

Status reporter reports changes in wsrep state and logged errors/
warnings to a json file which then can be read and interpreted by
an external monitoring tool.

Rationale: until the server is fully initialized it is inaccessible
to clients and the only source of information is the error log, which
is not machine-friendly. Since a wsrep node can spend a very long time
in the initialization phase (state transfer), it may be a very long time
that automatic tools can't easily monitor its liveness and progression.

New variable: wsrep_status_file specifies the output file name.
If not set, no file is created and no reporting is done.

Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
2022-03-18 16:38:41 +01:00
Oleksandr Byelkin
4fb2cb1a30 Merge branch '10.7' into 10.8 2022-02-04 14:50:25 +01:00
Oleksandr Byelkin
9ed8deb656 Merge branch '10.6' into 10.7 2022-02-04 14:11:46 +01:00
Oleksandr Byelkin
f5c5f8e41e Merge branch '10.5' into 10.6 2022-02-03 17:01:31 +01:00
Oleksandr Byelkin
cf63eecef4 Merge branch '10.4' into 10.5 2022-02-01 20:33:04 +01:00
Oleksandr Byelkin
a576a1cea5 Merge branch '10.3' into 10.4 2022-01-30 09:46:52 +01:00
Oleksandr Byelkin
41a163ac5c Merge branch '10.2' into 10.3 2022-01-29 15:41:05 +01:00
Sachin
0c5d1342ae MDEV-11675 Lag Free Alter On Slave
This commit implements two phase binloggable ALTER.
When the new variable

      @@session.binlog_alter_two_phase = YES

is set, an ALTER query gets logged in two parts: the START ALTER and the
COMMIT or ROLLBACK ALTER. START ALTER is written to the binlog as soon as
the necessary locks have been acquired for the table. The timing is
such that any concurrent DMLs that update the same table are either
committed, and thus logged into the binary log having done their work on
the old version of the table, or will be queued for execution on its new
version.

The "COMPLETE" part, COMMIT or ROLLBACK ALTER, is written at the very
point where a normal "single-piece" ALTER is logged, that is, after most
of the query work is done. When the result is positive, COMMIT ALTER is
written; otherwise ROLLBACK ALTER is written with the specific error
that happened after the START ALTER phase.
Replication of two-phase binloggable ALTER is
cross-version safe. Specifically, an OLD slave merely does not
recognize the start alter part, while still being able to process and
memorize its gtid.

Two phase logged ALTER is read from binlog by mysqlbinlog to produce
BINLOG 'string', where 'string' contains a base64 encoded
Query_log_event containing either the start part of ALTER or a
completion part. The Query details can be displayed with the `-v` flag,
similarly to ROW format events. Note that mysqlbinlog output containing
parts of a two-phase binloggable ALTER can be processed correctly only
by a binlog_alter_two_phase server.

@@log_warnings > 2 can reveal details of binlogging and slave side
processing of the ALTER parts.

The current commit also carries fixes to the following list of
reported bugs:
MDEV-27511, MDEV-27471, MDEV-27349, MDEV-27628, MDEV-27528.

Thanks to all people involved in the early discussion of the feature,
including Kristian Nielsen, and to those who helped to design, implement
and test: Sergei Golubchik, Andrei Elkin who took the burden of the
implementation completion, Sujatha Sivakumar, Brandon
Nesterenko, Alice Sherepa, Ramesh Sivaraman, Jan Lindstrom.
2022-01-27 21:25:07 +02:00
Jan Lindström
2b6f235ae0 MDEV-21308 : WSREP: binlog ... cache not empty warnings on server with WSREP disabled
Remove output if wsrep is not enabled.
2022-01-22 09:14:26 +02:00
Brandon Nesterenko
96de6bfd5e MDEV-16091: Seconds_Behind_Master spikes to millions of seconds
Problem:
========
A slave’s relay log format description event is used when
calculating Seconds_Behind_Master (SBM). This forces the SBM
value to spike when processing these events, as their creation
time is set to the timestamp at which the IO thread begins.

Solution:
========
When the slave generates a format description event, mark the
event as a relay log event so it does not update the
rli->last_master_timestamp variable.
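
A simplified model of the fix is sketched below. It is not the server
code: Event, RelayLogInfo, the `when` and `relay_log_event` fields and
both functions are assumptions; only last_master_timestamp is named in
this message.

```cpp
#include <cassert>

// Hypothetical, simplified event and relay log info structures.
struct Event {
  long when;             // creation timestamp carried by the event
  bool relay_log_event;  // true if generated by the slave for its relay log
};

struct RelayLogInfo {
  long last_master_timestamp;
};

// Events generated for the relay log do not describe master progress,
// so they must not move last_master_timestamp.
void apply_event(RelayLogInfo& rli, const Event& ev) {
  if (!ev.relay_log_event) rli.last_master_timestamp = ev.when;
}

long seconds_behind_master(const RelayLogInfo& rli, long now) {
  return now - rli.last_master_timestamp;
}
```

In this model a slave-generated format description event stamped with an
old IO thread start time leaves the computed lag unchanged, while a real
master event still updates it.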

Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
2022-01-04 11:21:33 -07:00
Marko Mäkelä
7dfaded962 Merge 10.6 into 10.7 2022-01-04 09:55:58 +02:00
Marko Mäkelä
3f5726768f Merge 10.5 into 10.6 2022-01-04 09:26:38 +02:00
Andrei
30b917d34a MDEV-27039 Trying to lock mutex ... when the mutex was already locked
The reason for the double lock was an extraneous ha_flush_logs().
Unlike in upstream, it is unnecessary in MariaDB, which uses a binlog
checkpoint mechanism to keep PURGE or RESET-MASTER from disturbing
transaction recovery. That is, should a trx
be prepared but its binlog file be gone, the trx is then committed on disk too.
Those facts have always been verified by the existing tests of

  binlog.binlog_{checkpoint,xa_recover}.test.

A regression test for the bug is included though.
2022-01-03 13:24:50 +02:00
Julius Goryavsky
55bb933a88 Merge branch 10.4 into 10.5 2021-12-26 12:51:04 +01:00
Leandro Pacheco
0165a06322 result of wsrep logic in queue_for_group_commit was being ignored
This could cause out-of-order wsrep checkpoints due to the wsrep-specific
leader code not being executed in
`MYSQL_BIN_LOG::write_transaction_to_binlog_events`.
Move the original result assignment to before the wsrep logic to prevent that.

Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
2021-12-23 11:51:31 +02:00
Oleksandr Byelkin
8bd21167d2 Merge branch '10.6' into 10.7 2021-11-05 21:01:15 +01:00
Oleksandr Byelkin
109fc67d4d Merge branch '10.5' into 10.6 2021-11-05 20:35:45 +01:00
Oleksandr Byelkin
8635be6a29 Merge branch '10.4' into 10.5 2021-11-05 20:33:57 +01:00