that is not in binlog.
Post-crash recovery of --rpl-semi-sync-slave-enabled server
failed to recognize a transaction in-doubt that needed rolled back.
A prepared-but-not-in-binlog transaction gets committed instead
to possibly create inconsistency with a master (e.g the way it was observed
in the bug report).
The semisync recovery is corrected now with initializing binlog coordinates
of any transaction in-doubt to the maximum offset which is
unreachable.
In effect when a prepared transaction that is not found in binlog
it will be decided to rollback because it's guaranteed to reside
in a truncated tail area of binlog.
Mtr tests are reinforced to cover the described scenario.
* FreeBSD returns errno 31 (EMLINK, Too many links),
not 40 (ELOOP, Too many levels of symbolic links)
* (`mysqlbinlog|mysql`) was just crazy, why did it ever work?
* socket_ipv6.inc check (that checked whether ipv6 is supported)
only worked correctly when ipv6 was supported
* perfschema.socket_summary_by_instance was changing global variables
and then skip-ing the test (because on missing ipv6)
New Feature:
============
Extend mariadb-binlog command-line tool to allow for filtering
events using GTID domain and server ids. The functionality mimics
that of a replica server’s DO_DOMAIN_IDS, IGNORE_DOMAIN_IDS, and
IGNORE_SERVER_IDS from CHANGE MASTER TO. For completeness, this
patch additionally adds the option --do-server-ids as an alias for
--server-id, which now accepts a list of server ids instead of a
single one.
Example usage:
mariadb-binlog --do-domain-ids=2,3,4 --do-server-ids=1,3
master-bin.000001
Functional Notes:
1. --do-domain-ids cannot be combined with --ignore-domain-ids
2. --do-server-ids cannot be combined with --ignore-server-ids
3. A domain id filter can be combined with a server id filter
4. When any new filter options are combined with the
--gtid-strict-mode option, events from excluded domains/servers are
not validated.
5. Domain/server id filters can be combined with GTID ranges (i.e.
specifications of --start-position and --stop-position). However,
because the --stop-position option implicitly undertakes filtering
to only output events within its range of domains, when combined
with --do-domain-ids or --ignore-domain-ids, output will consist of
the intersection between the filters. Specifically, with
--do-domain-ids and --stop-position, only events with domain ids
present in both argument lists will be output. Conversely, with
--ignore-domain-ids and --stop-position, only events with domain ids
present in the --stop-position and absent from the
--ignore-domain-ids options will be output.
Reviewed By
============
Andrei Elkin <andrei.elkin@mariadb.com>
Problem:
========
When the mysql.gtid_slave_pos table uses the InnoDB engine, and
mysqld starts, it reads the table and begins a transaction. After
reading the value, it should end the transaction and release all
associated locks. The bug reported in DBAAS-7828 shows that when
autocommit is off, the locks are not released, resulting in
indefinite hangs on future attempts to change gtid_slave_pos. In
particular, the transaction was not properly finalized because
thd->server_status was not updated to reflect the end of the
transaction.
Solution:
========
This patch updates the code to properly commit the transaction after
reading gtid_slave_pos during mysqld start-up.
Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
Problem:
========
When using mariadb-binlog with --raw and --stop-never, events from
the master's currently active log file should be written to their
respective log file specified by --result-file, and shown on-disk.
There is a bug where mariadb-binlog does not flush the result file
to disk when new events are received
Solution:
========
Add a function call to flush mariadb-binlog’s result file after
receiving an event in --raw mode.
Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
Commit 6c39eaeb1 made the crash recovery dependent on server_id.
The crash recovery could fail when restoring a new instance from
original crashed data directory USING A NEW SERVER ID.
The issue doesn't exist in previous major versions before 10.6.
Root cause is when generating the input XID to be searched in the hash,
server id is populated with the current server id.
So if the server id changed when recovering, the XID couldn't be found
in the hash due to server id doesn't match.
This fix is to use original server id when creating the input XID
object in function `xarecover_do_commit_or_rollback`.
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
The first step for deprecating innodb_autoinc_lock_mode(see MDEV-27844) is:
- to switch statement binlog format to ROW if binlog format is MIXED and
the statement changes autoincremented fields
- issue warnings if innodb_autoinc_lock_mode == 2 and binlog format is
STATEMENT
The warning out of OPTIMIZE
Statement is unsafe because it uses a system function
was indeed counterfactual and was resulted by checking an
insufficiently strict property of lex' sql_command_flags.
Fixed with deploying an additional checking of weather
the current sql command that modifes a share->non_determinstic_insert
table is capable of generating ROW format events.
The extra check rules out the unsafety to OPTIMIZE et al, while the
existing check continues to do so to CREATE TABLE (which is
perculiarly tagged as ROW-event generative sql command).
As a side effect sql_sequence.binlog test gets corrected and
binlog_stm_unsafe_warning.test is reinforced to add up
an unsafe CREATE..SELECT test.
This patch fixes two issues:
First, it fixes test failure due to GTID List events
having inconsistent ordering of domain ids. In
particular, this patch ensures that a GTID list log
event will have its GTIDs ordered by domain id
(ascending) followed by sequence number (ascending).
Second, it fixes an assert which could use an
unintialized variable.
Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
This commit implements two phase binloggable ALTER.
When a new
@@session.binlog_alter_two_phase = YES
ALTER query gets logged in two parts, the START ALTER and the COMMIT
or ROLLBACK ALTER. START Alter is written in binlog as soon as
necessary locks have been acquired for the table. The timing is
such that any concurrent DML:s that update the same table are either
committed, thus logged into binary log having done work on the old
version of the table, or will be queued for execution on its new
version.
The "COMPLETE" COMMIT or ROLLBACK ALTER are written at the very point
of a normal "single-piece" ALTER that is after the most of
the query work is done. When its result is positive COMMIT ALTER is
written, otherwise ROLLBACK ALTER is written with specific error
happened after START ALTER phase.
Replication of two-phase binloggable ALTER is
cross-version safe. Specifically the OLD slave merely does not
recognized the start alter part, still being able to process and
memorize its gtid.
Two phase logged ALTER is read from binlog by mysqlbinlog to produce
BINLOG 'string', where 'string' contains base64 encoded
Query_log_event containing either the start part of ALTER, or a
completion part. The Query details can be displayed with `-v` flag,
similarly to ROW format events. Notice, mysqlbinlog output containing
parts of two-phase binloggable ALTER is processable correctly only by
binlog_alter_two_phase server.
@@log_warnings > 2 can reveal details of binlogging and slave side
processing of the ALTER parts.
The current commit also carries fixes to the following list of
reported bugs:
MDEV-27511, MDEV-27471, MDEV-27349, MDEV-27628, MDEV-27528.
Thanks to all people involved into early discussion of the feature
including Kristian Nielsen, those who helped to design, implement and
test: Sergei Golubchik, Andrei Elkin who took the burden of the
implemenation completion, Sujatha Sivakumar, Brandon
Nesterenko, Alice Sherepa, Ramesh Sivaraman, Jan Lindstrom.
The assert was caused by an error of XA transaction that had
BINLOG 'base64_string' statement.
The statement failed because of lack of checking whether the encoded
replication event was handled by the slave applier thread.
If it's not the slave applier no error should be generated, but it was
in this case, see a test added.
Fixed along with the idea borrowed the upstream to introduce a check
of which applier executes the replication event and do not
report any error if the applier is a regular server client.
New Feature:
===========
This commit extends the mariadb-binlog capabilities to allow events
to be filtered by GTID ranges. More specifically, the
--start-position and --stop-position arguments have been extended to
accept values formatted as a list of GTID positions, e.g.
--start-position=0-1-0,1-2-55. The following specific capabilities
are addressed:
1) GTIDs can be used to filter results on local binlog files
2) GTIDs can be used to filter results from remote servers
3) Implemented --gtid-strict-mode that ensures the GTID event
stream in each domain is monotonically increasing
4) Added new level of verbosity in mysqlbinlog -vvv to print
additional diagnostic information/warnings about invalid GTID
states
5) For a given GTID range, its start and stop position parameters
aim to mimic the behaviors of
CHANGE MASTER TO MASTER_USE_GTID=slave_pos and
START SLAVE UNTIL master_gtid_pos=<GTID>, respectively. In
particular, the start-position list expresses a gtid state of
the server, similarly to how @@global.gtid_slave_pos expresses
the gtid state of a slave server when connecting to a master
with MASTER_USE_GTID=slave_pos.
The GTID start-position list is exclusive and the
stop-position list is inclusive. This allows users to receive
events strictly after those that they already have, and is
useful in cases of point in (logical) time recovery including
1) events were received out of order and should be re-sent, or
2) specifying the gtid state of a slave to get events newer
than their current state. If a seq_no is 0 for start-position,
it means to include the entirety of the domain. If a seq_no is
0 for stop-position, it means to exclude all events from that
domain. The GTIDs provided in a start position argument must
match with the GTID state of the first processed log (i.e.
those listed in the Gtid_list event). If a stop position is
provided, the events that are output are limited to only those
with domain ids listed in the argument. When specifying
combinations of start and stop positions, the following
behaviors are expected:
[--start-position without --stop-position]: Events that have domain
ids in the start position are output if their seq_no occurs after
the respective start position. Events with domain ids that are
unspecified in the start position list are also output. Note that if
the Gtid_list event of the first binary log is populated (i.e.
non-empty), each domain in the Gtid_list must be present in the
start-position list with a seq_no at or after the listed value.
This behavior mimics how a slave only processes events after the
state provided by @@global.gtid_slave_pos when connecting to a
master with CHANGE MASTER TO MASTER_USE_GTID=slave_pos.
[--stop-position without --start-position]: Output is limited to
only events with both 1) domain ids that are present in the given
stop position list and 2) seq_nos that are less than or equal to
their respective stop GTID. Once all GTIDs in the stop position
list have been processed, the program will stop processing log
files. This behavior mimics how
START SLAVE UNTIL master_gtid_pos=<G>
has a slave only process events with domain ids present in G with
their seq_nos at or before the respective gtid.
[--start-position and --stop-position]: Output consists of the
intersection between the events permitted by both the start and stop
position rules. More concretely, the output can be defined by a
union of the following rules:
1. For domains which exist in both the start and stop position
lists, the events which exist in-between these positions
(exclusive start, inclusive stop) are output
2. For all other events, the rules of
[--stop-position without --start-position] are followed
This is due to the implicit filtering within each individual rule.
Even though the start position rule always includes events from
unspecified domains, the stop position rule takes precedence because
it always excludes events from unspecified domains. In other words,
events which the start position rule would have included would then
always be excluded by the stop position rule.
[neither --start-position nor --stop-position]: Events are not
omitted based on GTID positioning; however, --gtid-strict-mode and
-vvv can still analyze gtid correctness for warning and error
reporting.
[repeated specification of --start-position or --stop-position]:
Subsequent specifications of start and stop positions completely
override previous ones. E.g., if invoked as
mysqlbinlog --start-position=<G1> --start-position=<G2> ...
All GTIDs specified in G1 are ignored and only those specified in G2
are used for the start position.
A few additional notes:
1) this commit squashes together the commits:
f4319661120e-78a9d49907ba
2) Changed rpl.rpl_blackhole_row_annotate test because it has
out of order GTIDs in its binlog, so I added
--skip-gtid-strict-mode
3) After all binlog events have been written, the session server
id and domain id are reset to their values in the global state
Reviewed By:
===========
Andrei Elkin: <andrei.elkin@mariadb.com>
The InnoDB redo log used to be formatted in blocks of 512 bytes.
The log blocks were encrypted and the checksum was calculated while
holding log_sys.mutex, creating a serious scalability bottleneck.
We remove the fixed-size redo log block structure altogether and
essentially turn every mini-transaction into a log block of its own.
This allows encryption and checksum calculations to be performed
on local mtr_t::m_log buffers, before acquiring log_sys.mutex.
The mutex only protects a memcpy() of the data to the shared
log_sys.buf, as well as the padding of the log, in case the
to-be-written part of the log would not end in a block boundary of
the underlying storage. For now, the "padding" consists of writing
a single NUL byte, to allow recovery and mariadb-backup to detect
the end of the circular log faster.
Like the previous implementation, we will overwrite the last log block
over and over again, until it has been completely filled. It would be
possible to write only up to the last completed block (if no more
recent write was requested), or to write dummy FILE_CHECKPOINT records
to fill the incomplete block, by invoking the currently disabled
function log_pad(). This would require adjustments to some logic around
log checkpoints, page flushing, and shutdown.
An upgrade after a crash of any previous version is not supported.
Logically empty log files from a previous version will be upgraded.
An attempt to start up InnoDB without a valid ib_logfile0 will be
refused. Previously, the redo log used to be created automatically
if it was missing. Only with with innodb_force_recovery=6, it is
possible to start InnoDB in read-only mode even if the log file
does not exist. This allows the contents of a possibly corrupted
database to be dumped.
Because a prepared backup from an earlier version of mariadb-backup
will create a 0-sized log file, we will allow an upgrade from such
log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system
tablespace looks valid.
The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced
with 64-byte log checkpoint blocks at 0x1000 and 0x2000.
The start of log records will move from 0x800 to 0x3000. This allows us
to use 4096-byte aligned blocks for all I/O in a future revision.
We extend the MDEV-12353 redo log record format as follows.
(1) Empty mini-transactions or extra NUL bytes will not be allowed.
(2) The end-of-minitransaction marker (a NUL byte) will be replaced
with a 1-bit sequence number, which will be toggled each time when the
circular log file wraps back to the beginning.
(3) After the sequence bit, a CRC-32C checksum of all data
(excluding the sequence bit) will written.
(4) If the log is encrypted, 8 bytes will be written before
the checksum and included in it. This is part of the
initialization vector (IV) of encrypted log data.
(5) File names, page numbers, and checkpoint information will not be
encrypted. Only the payload bytes of page-level log will be encrypted.
The tablespace ID and page number will form part of the IV.
(6) For padding, arbitrary-length FILE_CHECKPOINT records may be written,
with all-zero payload, and with the normal end marker and checksum.
The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.
In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will
no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup
will require a valid log file. When resizing the log, we will create
a logically empty ib_logfile101 at the current LSN and use an atomic rename
to replace ib_logfile0 with it. See the test innodb.log_file_size.
Because there is no mandatory padding in the log file, we are able
to create a dummy log file as of an arbitrary log sequence number.
See the test mariabackup.huge_lsn.
The parameter innodb_log_write_ahead_size and the
INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed.
The minimum value of innodb_log_buffer_size will be increased to 2MiB
(because log_sys.buf will replace recv_sys.buf) and the increment
adjusted to 4096 bytes (the maximum log block size).
The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed:
os_log_fsyncs
os_log_pending_fsyncs
log_pending_log_flushes
log_pending_checkpoint_writes
The following status variables will be removed:
Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs)
Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design)
log_sys.get_block_size(): Return the physical block size of the log file.
This is only implemented on Linux and Microsoft Windows for now, and for
the power-of-2 block sizes between 64 and 4096 bytes (the minimum and
maximum size of a checkpoint block). If the block size is anything else,
the traditional 512-byte size will be used via normal file system
buffering.
If the file system buffers can be bypassed, a message like the following
will be issued:
InnoDB: File system buffers for log disabled (block size=512 bytes)
InnoDB: File system buffers for log disabled (block size=4096 bytes)
This has been tested on Linux and Microsoft Windows with both sizes.
On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC.
Tests in 3 different environments where the log is stored in a device
with a physical block size of 512 bytes are yielding better throughput
without O_DIRECT. This could be due to the fact that in the event the
last log block is being overwritten (if multiple transactions would
become durable at the same time, and each of will write a small
number of bytes to the last log block), it should be faster to re-copy
data from log_sys.buf or log_sys.flush_buf to the kernel buffer,
to be finally written at fdatasync() time.
The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for
data files. This option will enable O_DIRECT on the log file on Linux.
It may be unsafe to use when the storage device does not support
FUA (Force Unit Access) mode.
When the server is compiled WITH_PMEM=ON, we will use memory-mapped
I/O for the log file if the log resides on a "mount -o dax" device.
We will identify PMEM in a start-up message:
InnoDB: log sequence number 0 (memory-mapped); transaction id 3
On Linux, we will also invoke mmap() on any ib_logfile0 that resides
in /dev/shm, effectively treating the log file as persistent memory.
This should speed up "./mtr --mem" and increase the test coverage of
PMEM on non-PMEM hardware. It also allows users to estimate how much
the performance would be improved by installing persistent memory.
On other tmpfs file systems such as /run, we will not use mmap().
mariadb-backup: Eliminated several variables. We will refer
directly to recv_sys and log_sys.
backup_wait_for_lsn(): Detect non-progress of
xtrabackup_copy_logfile(). In this new log format with
arbitrary-sized blocks, we can only detect log file overrun
indirectly, by observing that the scanned log sequence number
is not advancing.
xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit,
because we are not allowed to modify the server's log file, and our
memory mapping is read-only.
trx_flush_log_if_needed_low(): Do not use the callback on pmem.
Using neither flush_lock nor write_lock around PMEM writes seems
to yield the best performance. The pmem_persist() calls may
still be somewhat slower than the pwrite() and fdatasync() based
interface (PMEM mounted without -o dax).
recv_sys_t::buf: Remove. We will use log_sys.buf for parsing.
recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE.
recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn.
recv_sys_t, log_sys_t: Removed many data members.
recv_sys.lsn: Renamed from recv_sys.recovered_lsn.
recv_sys.offset: Renamed from recv_sys.recovered_offset.
log_sys.buf_size: Replaces srv_log_buffer_size.
recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset]
when the buffer is being allocated from the memory heap.
recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is
backed by ib_logfile0. The pointer will wrap from recv_sys.len
(log_sys.file_size) to log_sys.START_OFFSET. For the record that
wraps around, we may copy file name or record payload data to
the auxiliary buffer decrypt_buf in order to have a contiguous
block of memory. The maximum size of a record is less than
innodb_page_size bytes.
recv_sys_t::parse(): Take the smart pointer as a template parameter.
Do not temporarily add a trailing NUL byte to FILE_ records, because
we are not supposed to modify the memory-mapped log file. (It is
attached in read-write mode already during recovery.)
recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse().
recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be
returned on PMEM, use recv_ring to wrap around the buffer to the start.
mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free
on PMEM, because it has no meaning on the mmap-based log.
log_sys.write_to_buf: Count writes to log_sys.buf. Replaces
srv_stats.log_write_requests and export_vars.innodb_log_write_requests.
Protected by log_sys.mutex. Updated consistently in log_close().
Previously, mtr_t::commit() conditionally updated the count,
which was inconsistent.
log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf,
for writing to log_sys.log (the ib_logfile0). Replaces
srv_stats.log_writes and export_vars.innodb_log_writes.
Protected by log_sys.mutex.
log_sys.waits: Count waits in append_prepare(). Replaces
srv_stats.log_waits and export_vars.innodb_log_waits.
recv_recover_page(): Do not unnecessarily acquire
log_sys.flush_order_mutex. We are inserting the blocks in arbitary
order anyway, to be adjusted in recv_sys.apply(true).
We will change the definition of flush_lock and write_lock to
avoid potential false sharing. Depending on sizeof(log_sys) and
CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could
share a cache line with each other or with the last data members
of log_sys.
Thanks to Matthias Leich for providing https://rr-project.org traces
for various failures during the development, and to
Thirunarayanan Balathandayuthapani for his help in debugging
some of the recovery code. And thanks to the developers of the
rr debugger for a tool without which extensive changes to InnoDB
would be very challenging to get right.
Thanks to Vladislav Vaintroub for useful feedback and
to him, Axel Schwenke and Krunal Bauskar for testing the performance.
CREATE-OR-REPLACE SEQUENCE is not logged with Gtid event DDL flag
which affects its slave parallel execution.
Unlike other DDL:s it can occur in concurrent execution with following transactions
which can lead to various errors, including asserts like
(mdl_request->type != MDL_INTENTION_EXCLUSIVE && mdl_request->type != MDL_EXCLUSIVE) || !(get_thd()->rgi_slave && get_thd()->rgi_slave->is_parallel_exec && lock->check_if_conflicting_replication_locks(this)
in MDL_context::acquire_lock.
Fixed to wrap internal statement level commit with save-
and-restore of TRANS_THD::m_unsafe_rollback_flags.
failed in Diagnostics_area::set_ok_status in my_ok from
mysql_sql_stmt_prepare
Analysis: Before PREPARE is executed, binlog_format is STATEMENT.
This PREPARE had SET STATEMENT which sets binlog_format to ROW. Now after
PREPARE is done we reset the binlog_format (back to STATEMENT). But we have
temporary table, it doesn't let changing binlog_format=ROW to
binlog_format=STATEMENT and gives error which goes unreported. This
unreported error eventually causes assertion failure.
Fix: Change return type for LEX::restore_set_statement_var() to bool and
make it return error state.