Commit graph

4,371 commits

Author SHA1 Message Date
Kristian Nielsen
5335c85a47 Binlog-in-engine: Fix incorrect handling of internal 2pc rollback
The error handling for internal 2pc transactions (eg. RocksDB/Spider) would
incorrectly try to handle the engine binlog_unlog() during rollback, in
binlog_post_rollback(); this should instead be handled solely in
log_and_order() and unlog(). This could trigger for example in parallel
replication error handling, causing assertions when wrongly entering XA code
paths.

Also fix a couple bugs found during debug:

 - Don't send format description event to the slave from before the starting
   GTID position, as that can cause the slave to wrongly drop temporary
   tables.

 - When looking up the initial GTID position for a new dump thread, wait for
   the necessary part of the binlog to become durable before reading it.

 - Don't error when searching the initial GTID position if reaching EOF of
   the durable portion, instead search back to an earlier GTID state record.

 - A rare race in the test framework that could fail to kill off lingering
   dump threads before RESET MASTER.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:34:24 +02:00
Kristian Nielsen
956942461a Binlog-in-innodb: Small compile/test fixes after 11.4 rebase
Following rebase on latest 11.4, a few compile and test errors need
fixing. For now, these fixes are not distributed on the individual
patches in the series, but just on top here.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:32:41 +02:00
Kristian Nielsen
caaf6221a6 Binlog-in-engine: XA: Fix hang during server shutdown.
Whenever a record is written to the binlog, it must be entered into the
pending LSN fifo. This was missing for XA PREPARE and XA ROLLBACK. If a
prepare or rollback record was at the end of the binlog, the tablespace
close during shutdown would hang waiting for the record to be marked
durable, which never happened as it was missing from the LSN fifo.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:30:41 +02:00
Kristian Nielsen
b37107b4d1 Binlog-in-engine: Fix 3 bugs found during RQG testing
1. Fix the GTID lookup of a connecting slave/dump thread to not look at
parts of the binlog that are not yet durable on disk on the master. This
could cause the dump thread to be ahead of the valid durable end-point of
the reader, causing assertion.

2. Fix bug in the flushing of binlog pages. The background flush thread
would incorrectly flush at most one page per pthread_cond wakeup, which
would cause it to fall behind and delay flushing binlog pages to disk.

3. Fix incorrect check during InnoDB recovery scan of redo log; binlog
redo records are allowed to be larger than InnoDB tablespace page size.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:30:37 +02:00
Kristian Nielsen
10218b8d85 Binlog-in-engine: Initial support for 2pc and XA
At XA PREPARE, spill all events (including COMMIT end event) as OOB, and
call into the engine to binlog a PREPARE record. Store the OOB reference
along with the XID in an engine-binlog internal hash.

At XA COMMIT, fetch the OOB reference from the internal hash and put it into
a COMMIT record for the transaction.

For both user XA and internal two-phase commit between binlog and
other storage engine, write the XID into an XA complete event in the
same mtr as the commit record. This record will be later used to be
able to consistently recover (commit or rollback) prepared
transactions in the other engines, depending on whether binlog write
became durable before the crash or not.

At XA ROLLBACK, merely put in an XA complete event.

Maintain reference counts for pending prepared XA transactions, and
for pending two-phase commit records, to make sure binlog files
containing these will not be purged while those transactions are
active.

Implement the necessary "unlog" mechanisms so that the reference
counts can be released only after all other participating engines have
durably committed (respectively XA prepared/rolled back) their part of
the transaction.

This commit does not handle XA/binlog crash recovery; that will come in a
later patch.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:29:45 +02:00
Kristian Nielsen
649851f52b Binlog-in-engine: Fix binary search for GTID position
When finding the midpoint for each step in the binary search, that
midpoint was not correctly rounded to the nearest page containing a
GTID state record (when the range from the low to the high point is an
odd multiple of number of innodb_binlog_state_interval bytes). This
caused the search to look at the wrong page (and assert in debug
build).

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:28:01 +02:00
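The rounding issue fixed in 649851f52b can be illustrated with a minimal C++ sketch. It assumes GTID state records sit at offsets that are multiples of innodb_binlog_state_interval, and it rounds the midpoint down to such a multiple. The function name state_midpoint() and its parameters are hypothetical and not taken from the MariaDB source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns a binary-search midpoint that is itself a
// multiple of `interval`, assuming GTID state records sit at such multiples.
// `low` and `high` must already be multiples of `interval`, with low <= high.
uint64_t state_midpoint(uint64_t low, uint64_t high, uint64_t interval)
{
  assert(interval > 0 && low <= high);
  assert(low % interval == 0 && high % interval == 0);
  uint64_t steps= (high - low) / interval;  // whole intervals in the range
  return low + (steps / 2) * interval;      // round down to a record boundary
}
```

With low=0, high=300 and an interval of 100 (an odd multiple), the plain average 150 is not a record boundary, while the sketch returns 100.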
Kristian Nielsen
cf126c83d4 Binlog-in-engine: Fix busy-wait in binlog reader
Fix a couple of bugs in the checks in ha_innodb_binlog_reader::wait_available()
for whether new data is available. These could cause the reader (eg. binlog dump
thread on the master) to busy-loop instead of properly pthread_cond_wait()'ing
for new data to become durable and available for sending to the slave.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:27:51 +02:00
Kristian Nielsen
dcfe9aec50 Binlog-in-engine: Support for combined stmt/trx cache in binlog
In most cases, at least one of the stmt and trx caches is empty when binlogging
an event group. However, it is possible to have data in both when using
autocommit and combining non-transactional and transactional changes in the
same statement.

This patch implements handling this case, the main issue being the existence
of two independent out-of-band references. The commit record is extended to
contain up to two oob references, and the reader is extended to be able to
read both of them.

For simplicity, when this (rare) case occurs, we always spill the full
content of both caches (except the GTID event) as oob.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:27:40 +02:00
Kristian Nielsen
89f0ae10c6 Binlog-in-engine: SHOW BINLOG EVENTS improvements.
Improve SHOW BINLOG EVENTS FROM when the specified position does not
correspond exactly to a valid binlog record position. Scan the page
containing the requested position, and start from the first valid point at
or after that position.

If a position is specified that is past the end of data available in the
binlog file, SHOW BINLOG EVENTS returns empty.

Make the format description event written at server startup also have
end_log_pos=0, for consistency.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:24:40 +02:00
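The position handling described in 89f0ae10c6 can be sketched as a lookup over the record start offsets found while scanning one page. This is only an illustration under assumptions: the function first_valid_position() and the idea of collecting the offsets into a sorted vector are hypothetical, and the real code may continue past the page rather than report "none".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical: `record_starts` holds the start offsets of the valid records
// found by scanning the page, in ascending order. Returns the first offset
// at or after `requested`, or -1 if the page has no such record.
int64_t first_valid_position(const std::vector<uint64_t> &record_starts,
                             uint64_t requested)
{
  assert(std::is_sorted(record_starts.begin(), record_starts.end()));
  auto it= std::lower_bound(record_starts.begin(), record_starts.end(),
                            requested);
  if (it == record_starts.end())
    return -1;
  return static_cast<int64_t>(*it);
}
```

A requested position that falls in the middle of a record is thus moved forward to the next record start instead of producing an error.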
Kristian Nielsen
1fca55e590 Binlog-in-engine: Fix out-of-bounds read
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:24:37 +02:00
Kristian Nielsen
9463726469 Binlog-in-engine: Fixes for some review comments
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:24:15 +02:00
Kristian Nielsen
6a4c541fb5 Binlog-in-engine: Bug fix around crash-safe slave
Fix race where trx_group_commit_leader() was accessing the group commit
queue after waking up participants, which can invalidate the queue. Instead
do the remaining operations in the individual thread for each group commit
participant.

Also fix a problem where entries could be inserted out-of-order in the
pending LSN fifo, when the queue was empty after removing a later LSN, and
then an earlier LSN got inserted. This could move back the durable binlog
offset, causing slaves to not receive events.

Seen as sporadic failures of test case
binlog_in_engine.mariabackup_slave_provision_nolock.

A few other test tweaks to make them robust to sporadic failures.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:15:16 +02:00
Kristian Nielsen
b4f59c9e54 Binlog-in-engine: Report master restart to slave
Write a single format description event to the engine binlog at server
startup.

This format description event - like for the legacy binlog - is used to
inform the slave server about the master restart. This is used by the slave
to drop any temporary tables that were binlogged by the master before the
restart, and are now implicitly dropped by the restart.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:15:03 +02:00
Kristian Nielsen
c69b86d468 Binlog-in-engine: Support for new binlog format in mysqlbinlog
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:13:04 +02:00
Kristian Nielsen
c1ad984aa5 Binlog-in-engine: Clean up gtid state reading
Refactor the code to use binlog_chunk_reader for reading a GTID state
record, getting rid of the duplicate logic in the old special-purpose GTID
state reading code. This also removes the assumption that GTID state fits in
a single page (untested for now though).

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:13:01 +02:00
Kristian Nielsen
7b45ddd0c3 Binlog-in-engine: Handle mixing transactional and non-transactional tables
When updating non-transactional tables inside a multi-statement transaction,
and binlog_direct_non_transactional_updates=1, then the non-transactional
updates are binlogged directly through the statement cache while the
transaction cache is still being added to in the main transaction.

Thus, move the engine_binlog_info out from binlog_cache_mngr and into the
individual stmt/trx binlog_cache_data, so that we can have separate
engine_binlog_info active for the statement and the transaction cache.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:12:09 +02:00
Kristian Nielsen
836fd2cefc Binlog-in-engine: Handle recovery when all but one binlog files have been purged
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:11:20 +02:00
Kristian Nielsen
7e6e8724aa Binlog-in-engine: Handle single event writes larger than binlog size
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:11:11 +02:00
Kristian Nielsen
fb050e7981 Binlog-in-engine: Implement dynamically changing binlog max size
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:10:23 +02:00
Kristian Nielsen
5a79c16fd7 Binlog-in-engine: Implement savepoint support
Support for SAVEPOINT, ROLLBACK TO SAVEPOINT, rolling back a failed
statement (keeping active transaction), and rolling back transaction.

For savepoints (and start-of-statement), if the binlog data to be rolled
back is still in the in-memory part of trx cache we can just truncate the
cache to the point.

But if we need to spill cache contents as out-of-band data containing one or
more savepoints/start-of-statement point, then split the spill at each point
and inform the engine of the savepoints.

In InnoDB, at savepoint set, save the state of the forest of perfect binary
trees being built. Then at rollback, restore the appropriate state.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:10:21 +02:00
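The simple case in 5a79c16fd7, where the data to roll back is still in the in-memory part of the trx cache, can be sketched as follows. The TrxCache type and its members are hypothetical stand-ins; the sketch does not cover the spilled out-of-band case or the InnoDB-side tree state.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical in-memory transaction cache. A savepoint is simply the cache
// length at the time it was set, so rollback is a truncation to that length.
struct TrxCache
{
  std::string data;

  void append(const std::string &event) { data+= event; }

  // Remember the current end of the cache as the savepoint.
  size_t savepoint() const { return data.size(); }

  // Discard everything binlogged after the savepoint.
  void rollback_to(size_t sp)
  {
    assert(sp <= data.size());
    data.resize(sp);
  }
};
```

Rolling back a failed statement works the same way in this model, using the cache length recorded at the start of the statement.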
Kristian Nielsen
df4bec2123 MDEV-34705: Binlog-in-engine: Binlog reader to read whole page at a time
Instead of returning only one chunk at a time, make
ha_innodb_binlog_reader::read_data() try to read all chunks on the page.
This reduces the number of times each reader has to latch pages in the page
fifo, which contends for a global mutex also shared with the writer.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:05:27 +02:00
Kristian Nielsen
0dcab78e0d MDEV-34705: Binlog-in-engine: Crash-safe slave
This patch makes replication crash-safe with the new binlog implementation,
even when --innodb-flush-log-at-trx-commit=0|2. The point is to not send any
binlog events to the slave until they have become durable on master, thus
avoiding that a slave may replicate a transaction that is lost during master
recovery, diverging the slave from the master.

Keep track of which point in the binlog has been durably synced to disk
(meaning the corresponding LSN has been durably synced to disk in the InnoDB
redo log). Each write to the binlog inserts an entry with offset and
corresponding LSN in a FIFO. Dump threads will first read only up to the
durable point in the binlog. A dump thread will then check the LSN fifo, and
do an InnoDB redo log sync if anything is pending. Then the FIFO is emptied
of any LSNs that have now become durable, and the durable point in the
binlog is updated and reading the binlog can continue.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 12:04:37 +02:00
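The bookkeeping described in 0dcab78e0d can be modelled with a small C++ sketch. Each binlog write records its end offset together with the redo LSN that covers it; once the redo log is durable up to some LSN, the matching entries are popped and the durable binlog offset advances. The type PendingLsnFifo and its member names are hypothetical, and locking and the actual redo log sync are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical model of the pending-LSN FIFO; names are illustrative only.
struct PendingLsnFifo
{
  struct Entry { uint64_t end_offset; uint64_t lsn; };
  std::deque<Entry> pending;
  uint64_t durable_offset= 0;

  // Called for each binlog write: remember which redo LSN covers it.
  // Entries are assumed to arrive in non-decreasing LSN order.
  void record_write(uint64_t end_offset, uint64_t lsn)
  {
    assert(pending.empty() || pending.back().lsn <= lsn);
    pending.push_back({end_offset, lsn});
  }

  // Called once the redo log is durable up to `flushed_lsn`: drop entries
  // that are now durable and advance the durable point in the binlog.
  uint64_t advance(uint64_t flushed_lsn)
  {
    while (!pending.empty() && pending.front().lsn <= flushed_lsn)
    {
      durable_offset= pending.front().end_offset;
      pending.pop_front();
    }
    return durable_offset;
  }
};
```

In this model a dump thread would read only up to durable_offset, and call advance() after a redo log sync to learn how much further it may read.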
Kristian Nielsen
b96c487a59 MDEV-34705: Binlog-in-engine: Fix hang with event group of specific size
If the event group fitted in the binlog cache without the GTID event but not
with, the code would attempt to spill part of the GTID event as out-of-band
data, which is not correct. In release builds this would hang the server as
the spilling would try to lock an already owned mutex.

Fix by checking if the GTID event fits, and spilling any non-GTID data as
oob if it does not.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:55:02 +02:00
Kristian Nielsen
9bd57017f6 MDEV-34705: Binlog-in-engine: Attempt to fix assertion in do_fdatasync()
After temporarily releasing the mutex during wait in
fsp_binlog_page_fifo::do_fdatasync(), the state may have changed, so be
sure to re-check to avoid fdatasync() on a now stale fh.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:55:00 +02:00
Kristian Nielsen
fdbfc12a74 MDEV-34705: Binlog-in-engine: Improved page fifo
Some basic improvements to the binlog-specific page fifo to hopefully get
reasonable scalability as a starting point.

The fifo is still protected by a global mutex, but some effort is taken to
reduce the duration a thread is holding the mutex.

Use a cyclic array instead of a linked list so pages can be looked up in
constant time. And cache allocated page objects to avoid repeated
malloc/free while holding the mutex.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:54:40 +02:00
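The constant-time lookup mentioned in fdbfc12a74 can be sketched with a ring of preallocated slots indexed by page number. This is a simplified illustration, not the actual fsp_binlog_page_fifo: the PageRing class is hypothetical, the payload is a plain int standing in for a page object, and the mutex is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical ring of page slots. The slot for a page is found by index
// arithmetic instead of walking a linked list, and slots are allocated once
// up front so no malloc/free happens per page.
class PageRing
{
  std::vector<int> slots;     // stand-in payload: one int per page
  uint64_t first_page_no= 0;  // oldest page still held
  uint64_t count= 0;          // number of pages currently held
public:
  explicit PageRing(size_t capacity) : slots(capacity) {}

  // Append the next page in sequence; returns false if the ring is full.
  bool push(int payload)
  {
    if (count == slots.size())
      return false;
    slots[(first_page_no + count) % slots.size()]= payload;
    ++count;
    return true;
  }

  // Release the oldest page, freeing its slot for reuse.
  void pop_front()
  {
    assert(count > 0);
    ++first_page_no;
    --count;
  }

  // Constant-time lookup by page number; nullptr if the page is not held.
  int *find(uint64_t page_no)
  {
    if (page_no < first_page_no || page_no >= first_page_no + count)
      return nullptr;
    return &slots[page_no % slots.size()];
  }
};
```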
Kristian Nielsen
24c1a891a5 MDEV-34705: Binlog-in-engine: Reduce struct fsp_binlog_page_entry size
The file_no and page_no values are not really needed in the page object,
so remove them to save a bit of memory.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:54:34 +02:00
Kristian Nielsen
5115cd7425 MDEV-34705: Binlog-in-engine: mariadb-backup integration
InnoDB binlog files are now backed up along with other InnoDB data by
mariadb-backup.

The files are copied after backup locks have been released. Backup files
created later than the backup LSN are skipped. Then during --prepare, any
data missing from the hot-copied binlog files will be restored by the
binlog recovery code, and any excess data written after the backup LSN will
be zeroed out.

A couple test cases test taking a consistent backup of a server with active
traffic during the backup, by provisioning a slave from the restored binlog
position and checking that the slave can replicate from the original master
and get identical data.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:54:26 +02:00
Kristian Nielsen
2c77e8c85a MDEV-34705: Binlog-in-engine: Implement refcounting outstanding OOB records
Keep track of, for each binlog file, how many open transactions have
out-of-band data starting in that file. Then at the start of each new binlog
file, in the header page, record the file_no of the earliest file that this
file might contain commit records with references back to OOB records in
that earlier file.

Use this in PURGE BINARY LOGS, so that when a dump thread (slave connection)
is active in file number N, and that file (or a later one) may require
looking back in an earlier file number M for out-of-band records, purge will
stop already at file number M. This way, we avoid that purge accidentally
deletes some binlog file that a dump thread would later get an error on
because it needs to read out-of-band data.

This patch also includes placeholder data for a similar facility for XA
references. The actual implementation of support for XA is for later though.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:54:24 +02:00
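The purge rule from 2c77e8c85a can be sketched as a small computation over the per-file header values. The sketch assumes each file's header value is available in a map from file_no to the earliest file_no it may reference for out-of-band records; the function purge_limit() and that map are hypothetical and only illustrate the rule for a single dump thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical: earliest_oob_ref maps file_no -> earliest file_no whose
// out-of-band records may be referenced from commit records in that file.
// Returns the lowest file_no that must be kept for a reader that is active
// in `reader_file_no`; purge would stop at this file.
uint64_t purge_limit(const std::map<uint64_t, uint64_t> &earliest_oob_ref,
                     uint64_t reader_file_no)
{
  uint64_t limit= reader_file_no;
  // The reader's current file and every later file may look back.
  for (auto it= earliest_oob_ref.lower_bound(reader_file_no);
       it != earliest_oob_ref.end(); ++it)
    limit= std::min(limit, it->second);
  return limit;
}
```

For example, if file 6 records file 3 as its earliest reference, a dump thread active in file 6 keeps files 3 and later from being purged.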
Kristian Nielsen
a0decae938 MDEV-34705: Binlog-in-engine: Integration with server-layer code
Mostly various fixes to avoid initializing or creating any data or files for
the legacy binlog.

A possible later refinement could be to sub-class the binlog class
differently for legacy and in-engine binlogs, writing separate virtual
functions for behaviour that differ, extracting common functionality into
sub-methods. This could remove some if (opt_binlog_engine_hton)
conditionals.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:48:25 +02:00
Kristian Nielsen
bdb88e5561 MDEV-34705: Binlog-in-engine: More compiler warning fixes
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:48:23 +02:00
Kristian Nielsen
bc7d084576 MDEV-34705: Binlog-in-engine: Fix MSAN uninitialized warning in binlog_flush
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:48:20 +02:00
Kristian Nielsen
c2e8fd1f1a MDEV-34705: Binlog-in-engine: Work-around compiler warning
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:47:32 +02:00
Kristian Nielsen
ccbbb1ff24 MDEV-34705: Binlog-in-engine: Fix uninitialized variable in binlog discovery
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:47:28 +02:00
Kristian Nielsen
3c713c7498 MDEV-34705: Binlog-in-engine: Implement file header page
Now the first page of each binlog tablespace file is reserved as a file
header, replacing the use of extra fields in the first gtid state record of
the file. The header is primarily used during recovery, especially to get
the file LSN before which no redo should be applied to the file.

Using a dedicated page makes it possible to durably sync the file header to
disk after RESET MASTER (and at first server startup) and not have it
overwritten (and potentially corrupted) later; this guarantees that the
recovery will have at least one file header to look at to determine from
which LSN to apply redo records.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:47:17 +02:00
Kristian Nielsen
881ed99f8f MDEV-34705: Binlog-in-engine: Use separate 4k pagesize for binlog files
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:47:12 +02:00
Kristian Nielsen
490168b82f MDEV-34705: Binlog-in-engine: Use the whole page for binlog data
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:47:07 +02:00
Kristian Nielsen
336dab063c MDEV-34705: Binlog-in-engine: Implement page checksum
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:47:02 +02:00
Kristian Nielsen
5172e5a0e2 MDEV-34705: Binlog-in-engine: Recovery testcase + few bugfixes
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:47:00 +02:00
Kristian Nielsen
1966a83967 MDEV-34705: Binlog-in-engine: First working recovery
Still needs more testing.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:46:56 +02:00
Kristian Nielsen
857d3451a1 MDEV-34705: Binlog-in-engine: Implement SHOW BINLOG EVENTS
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:44:12 +02:00
Kristian Nielsen
319e635710 MDEV-34705: Binlog-in-engine: Implement legacy SHOW MASTER STATUS
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:43:24 +02:00
Kristian Nielsen
5bd458038d MDEV-34705: binlog-in-engine: New recovery preparatory commit
Some smaller refactoring and additions to prepare for new approach to
recovery of binlog tablespaces.

Store at the head of each binlog file the start LSN and the file size.

The final page of a binlog file is now not released in the page fifo until
mtr is committed. This ensures that all changes to a binlog file are redo
logged when the tablespace is closed, which simplifies things as then at
most the two most recent binlog files will need redo records to be
re-applied during recovery.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:43:12 +02:00
Kristian Nielsen
2b5bb8d03d MDEV-34705: Binlog-in-engine: No use of InnoDB tablespace and bufferpool
In preparation for a simplified, lower-level recovery of binlog files
implemented in InnoDB, remove use of InnoDB tablespaces and buffer pool from
the binlog code. Instead, a custom binlog page fifo replaces the general
buffer pool for binlog pages, and tablespaces are replaced by simple file_no
references.

The new binlog page fifo is deliberately naively written in this commit for
simplicity, until the new recovery is complete and proven with tests; later
it can be improved for better efficiency and scalability. This first version
uses a simple global mutex, linear scans of linked lists, repeated
alloc/free of pages, and a simple background flush thread that uses
synchronous pwrite() one page after another. Error handling is also mostly
omitted in this first version.

The page header/footer is not changed in this commit, nor is the pagesize,
to be done in a later patch.

The call to mtr_t::write_binlog() is currently commented-out in function
fsp_log_binlog_write() as it asserts in numerous places. To be enabled when
those asserts are fixed. For the same reason, the code does not yet
implement binlog_write_up_to(lsn_t lsn), to be done once mtr_t operations
are working.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:42:43 +02:00
Kristian Nielsen
77809c8855 MDEV-34705: Binlog-in-engine: Implement DELETE_DOMAIN_ID for FLUSH
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:39:19 +02:00
Kristian Nielsen
2f1a40ae35 MDEV-34705: Binlog-in-engine: Implement PURGE BINARY LOGS
Still to do: restrict auto-purge so that it does not purge any binlog
file with out-of-band data that might still be needed by a connected slave.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:38:30 +02:00
Kristian Nielsen
ff2b101609 MDEV-34705: Binlog-in-engine: Handful of fixes
Fix missing WORDS_BIGENDIAN define in ut0compr_int.cc.

Fix misaligned read buffer for O_DIRECT.

Fix wrong/missing update_binlog_end_pos() in binlog group commit.

Fix race where active_binlog_file_no incremented too early.

Fix wrong assertion when reader reaches the very start of (active+1).

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:38:26 +02:00
Kristian Nielsen
ef6e5823b8 MDEV-34705: Binlog-in-engine: Buildbot fixes
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:38:14 +02:00
Kristian Nielsen
bbe3fc3b57 MDEV-34075: Binlog-in-engine: Some test and review fixes
Enable binlog_in_engine as a default suite.

Fix embedded and Windows build failures.

Use sql_print_(error|warning) over ib::error() and ib::warn().

Use small_vector<> for the innodb_binlog_oob_reader instead of a custom
implementation.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:38:01 +02:00
Kristian Nielsen
d3bd9b83ca MDEV-34705: Binlog-in-engine: Misc. small fixes to make normal test suite mostly pass
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:37:57 +02:00
Kristian Nielsen
bb41951d6e MDEV-34705: Binlog-in-engine: Implement RESET MASTER
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
2026-04-08 11:37:08 +02:00