Validate the chunk_len in the binlog chunk reader, so we don't try to read
data outside of the page.
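As a rough illustration of the kind of bounds check meant here, the sketch
below validates a length field before slicing a page. The 3-byte chunk
header (one type byte plus a 2-byte little-endian length) and the names
read_chunk and PAGE_SIZE are assumptions made for the example, not the
actual on-disk format or reader API.

```python
PAGE_SIZE = 16 * 1024  # assumed page size for this sketch
HEADER_LEN = 3         # hypothetical header: 1 type byte + 2-byte length


def read_chunk(page: bytes, offset: int) -> bytes:
    """Return the chunk payload at `offset`, or raise ValueError if corrupt."""
    if offset < 0 or offset + HEADER_LEN > len(page):
        raise ValueError("chunk header lies outside the page")
    chunk_len = int.from_bytes(page[offset + 1:offset + 3], "little")
    start = offset + HEADER_LEN
    # The check this sketch is about: never trust chunk_len blindly.
    if chunk_len > len(page) - start:
        raise ValueError("chunk_len points outside the page")
    return page[start:start + chunk_len]
```

A corrupt length is then reported as an error instead of silently reading
past the end of the page buffer.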
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The binlog page size is currently fixed at 16k. Don't attempt to read
using a different page size found in the file header; the code doesn't
support it.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Don't attempt to rotate any old-style binlog files if trying to set
@@binlog_checksum. The @@binlog_checksum variable is unused with the
new binlog, and there are no old-style binlog files to rotate.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The flags trx->active_commit_ordered and trx->active_prepare got cleared in
trx_init() during the fast part of commit (ie. commit_ordered()). This is
too early: the values are lost by the time processing reaches
trx_commit_complete_for_mysql(). This caused the MDEV-232 optimization to be
omitted, adding an extra fsync() at the end of commit when using the legacy
binlog and causing a severe performance regression.
The values of trx->active_commit_ordered and trx->active_prepare must
persist to the end of commit processing, same as trx->is_registered. With
this patch, active_commit_ordered and active_prepare are cleared together
with trx->is_registered in trx_deregister_from_2pc(), and asserted to be
cleared when trx->is_registered is set for a following transaction.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Fix a missing error check when opendir() returns NULL, leading to
SIGSEGV when the directory does not exist or otherwise cannot be read.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The test requires large amounts of CPU and runs too slowly in Valgrind to
make sense; it can even occasionally time out on loaded machines.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The test was for some reason incorrectly doing SHOW BINLOG EVENTS when the
binlogging of the prior event is deliberately non-deterministic as to which
binlog file it will appear in, causing the test to depend on timing.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Don't pass uninitialized values in function call. MSAN complains about this
(even when the called function never accesses the uninitialized values, and
even when the function is constexpr .oO).
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Add mtr tests for some pieces of code found not covered in coverage report
analysis.
Fix two bugs found by the added tests:
- Starting server when there is exactly one binlog file that is empty
(crash during RESET MASTER?) would try to add to the XID hash twice,
which would cause an error during restart.
- A GTID binlog state that _exactly_ fits in a page would cause incorrect
handling of page creation in the page fifo, attempting to create the same
page twice.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The GTID state records written to the binlog must contain the state
corresponding to exactly the position at which they appear: All GTIDs
mentioned in the record must be fully contained in the part of the binlog
that is prior to the state record.
There was a corner case when the state record was written right at the start
of a commit record. Since the in-memory state was updated just prior to the
write, this record would incorrectly contain the GTID that appears just
_after_ it. This could be seen as a test failure of
binlog_in_engine.rpl_gtid_index when run with
--innodb-binlog-state-interval=32K.
Fixed by updating the in-memory state only at the point where the GTID is
actually written into the binlog, after any prior GTID state records have
been written.
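The ordering in this fix can be sketched as follows. This is a toy model
only: the class BinlogSketch, its record tuples and the state_record_due
flag are invented for illustration and do not mirror the real data
structures.

```python
class BinlogSketch:
    """Toy binlog: a list of records plus an in-memory GTID state."""

    def __init__(self):
        self.records = []  # records in the order they were written
        self.state = {}    # (domain_id, server_id) -> last seq_no written

    def write_state_record(self):
        # Snapshot only what is already in the binlog before this point.
        self.records.append(("state", dict(self.state)))

    def write_gtid(self, domain_id, server_id, seq_no,
                   state_record_due=False):
        if state_record_due:
            # Any pending state record goes out first ...
            self.write_state_record()
        self.records.append(("gtid", (domain_id, server_id, seq_no)))
        # ... and the in-memory state is updated only once the GTID itself
        # has been written.
        self.state[(domain_id, server_id)] = seq_no
```

With this ordering, a state record written right before GTID 0-1-10 holds
0-1-9, never 0-1-10.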
Another state record issue was when the initial state record at the start of
the binlog file was very large, crossing into the first differential state
record normally at offset innodb_binlog_state_interval (MDEV-38592).
When this happens the code that searches for GTID state finds no diff state
record at that position; this is fine and the code just searches back to the
initial full state record. However a spurious error about corrupt binlog was
written to the error log when not finding a start chunk at the expected
position, because the binlog_chunk_reader did not have the skip_partial flag
set to allow finding the middle of a record at the position.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When binlogging Non-InnoDB DML (such as MyISAM), let's at least write the
redo log to the O/S filesystem cache (but not fsync it) when
--innodb-flush-log-at-trx-commit is 0 or 2. This matches the behaviour
of the legacy binlog, where sync_binlog=0 will still write the binlog to the
O/S filesystem cache. This way, the binlog data will be recovered after a
crash of the server process if the O/S kernel stays intact.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Remove the --innodb-undo-directory=undos command-line argument from the
test, as it causes failures when the test suite is run from distro package
and the test directory is not writeable, and it's not relevant for what is
being tested in that test case.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Don't use (and crash on) any --binlog-directory option specified for
--backup, always use the value fetched from the running server.
Ensure a slash in-between path components when using a relative path for options
such as --innodb-undo-directory and --binlog-directory.
Clarify the description of the --binlog-directory option.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
SAVEPOINT inside a trigger doesn't work correctly. Setting a savepoint
inside a trigger somehow loses the implicit savepoint set at transaction
start, so that the partial changes are left if the statement later fails.
Referencing an existing savepoint claims the savepoint does not exist (and
it is in any case very unclear what exactly it should mean to rollback to a
savepoint from the middle of a statement, or set in the middle of a prior
statement).
These problems are independent of binlog-in-engine, but in the new binlog
implementation we are trying to make things work more correctly and
robustly, so let's disallow use of savepoints inside triggers. The new
binlog is off by default, so backwards compatibility is less of a concern,
though arguably disallowing savepoints in triggers would be better done
unconditionally.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The code for binlogging out-of-band data was missing an appropriate call to
log_free_check(). This call is needed to throttle write activity and wait
for an InnoDB checkpoint, when the redo log is too small (or otherwise has
insufficient space available) to accommodate the write activity.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
There was a race where a new GTID could be allocated (but not written to the
binlog) during the FLUSH, so that the GTID state written at the start of the
new binlog file was incorrect. This in turn could lead to a duplicate GTID
being sent to the slave if it happens to reconnect at that exact point.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
FLUSH BINARY LOGS before dumping, to make sure the file is on disk and the
mysqlbinlog output does not depend on timing.
Treat a completely empty (all zeros) file the same as a file with the header
page written but no events yet.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The error handling path forgot to unlock the LOCK_log mutex, hanging the
server or causing assertion mysql_mutex_assert_not_owner.
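The general pattern, releasing the lock on every exit path including the
error path, can be sketched like this. The lock log_lock merely stands in
for LOCK_log, and rotate() is a made-up function for the example.

```python
import threading

log_lock = threading.Lock()  # stands in for LOCK_log in this sketch


def rotate(fail: bool) -> str:
    # The with-block releases the lock on normal return and on error alike.
    with log_lock:
        if fail:
            raise RuntimeError("rotate failed")
        return "rotated"
```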
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When ddl log recovery needs to binlog during the crash recovery, the
GTID was binlogged without the required "ddl" marker. This caused wrong
behaviour on the slave when using parallel replication.
Fixed by explicitly marking the "current statement" as DDL when binlogging
in ddl log crash recovery.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This patch fixes ALTER TABLE calling wakeup_subsequent_commits() too early,
which allowed following event groups to commit out-of-order in parallel
replication. Fixed by calling suspend_subsequent_commits() at the start
of the ALTER.
Could be seen as an assertion:
!tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id
(Normally this is prevented because an ALTER TABLE will run in its own GCO,
and thus no following event groups can even start; however the missing DDL
mark caused by MDEV-38429 made this visible. And calling
wakeup_subsequent_commits() too early is wrong in any case).
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
If binlog files are deleted or otherwise unreadable during server restart,
don't make the server unstartable. Instead, start up, recovering what is
available, but complaining in the error log.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The GTID position at slave connect is found from the GTID state records
written at the start and at every --innodb-binlog-state-interval bytes of
the binlog files. There was a bug that for a binlog group commit, the binlog
state written was the one corresponding to the last GTID in the group,
regardless of where during the binlogging of the group it was written. Thus,
it could mistakenly write a GTID state record of 0-1-10, say, followed by a
lower GTID 0-1-9. This could cause a slave connecting at 0-1-10 to receive
an extra GTID 0-1-9, and replication would diverge.
Fix by maintaining a full GTID binlog state inside the engine binlog, same
as is done for the differential state.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Add user documentation for the new binlog implementation. And add error
messages for the remaining configuration options that are not available
with the new binlog.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This happened when the first OOB record of an event group spans two binlog
files, say N and N+1. The reference counting would wrongly attribute the OOB
to N+1, allowing N to be purged while it was still needed.
This for example could cause server restart to fail when it tries to recover
the GTID state from N+1, unable to follow OOB references to N because it was
purged before the server restart.
Fix by:
- Incrementing the OOB refcount _before_ binlogging the first OOB record.
- Decrementing the refcount only _after_ binlogging is complete.
- Protecting from purge any files referenced from the active file.
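The three points above can be sketched with a small reference-count table.
The class OobRefcounts and its method names are hypothetical; the real code
tracks more state than this.

```python
class OobRefcounts:
    """Toy per-file reference counts for in-progress OOB event groups."""

    def __init__(self):
        self.refs = {}  # file_no -> number of event groups pinning the file

    def begin_oob(self, file_no: int) -> int:
        # Take the reference BEFORE the first OOB record is written, so it
        # is attributed to the file where the record starts.
        self.refs[file_no] = self.refs.get(file_no, 0) + 1
        return file_no

    def end_oob(self, file_no: int) -> None:
        # Drop the reference only AFTER binlogging of the group is complete.
        self.refs[file_no] -= 1
        if self.refs[file_no] == 0:
            del self.refs[file_no]

    def can_purge(self, file_no: int, active_file_refs=()) -> bool:
        # Files referenced from the active file are protected as well.
        return file_no not in self.refs and file_no not in active_file_refs
```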
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
For CREATE TEMPORARY TABLE ... SELECT, InnoDB had code to not start a new
transaction for the CREATE TEMPORARY (correct). But the code that handled
failure for the SELECT part (ha_innobase::extra(HA_EXTRA_ABORT_ALTER_COPY))
was missing a check for CREATE TEMPORARY, so it would roll back the entire
transaction, which is wrong, and could lead to inconsistency with binlog or
other engines in the same transaction.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When using the InnoDB-implemented binlog with another transactional storage
engine, or with explicit user XA transactions, recover such transactions
consistently from the binlog at server startup.
When a transaction is prepared with an XID, the binlog records a "prepare"
record containing the XID and link to the out-of-band replication event
data.
When a previously prepared transaction is committed, the commit record links
to the oob data referenced from the prepare record, and the record is
preceded by an "XA complete" record containing the XID.
If instead a prepared transaction is rolled back, just an "XA complete"
record is binlogged with the XID and a "rollback" flag.
While any prepared XA transactions are active, maintain in-memory reference
counts in each binlog file, and in each binlog file record the file_no of
the earliest binlog file containing any XID records of still active
transactions.
When the server restarts (possibly after crash), look up the file_no of the
earliest binlog file that may contain active XID records, if any. Scan the
binlogs from that point and record any XID prepare or complete records.
For any XID prepare record, record oob data and reference count, recovering
the in-memory state present before the server restart. Return a hash to the
server layer containing each active XID in the binlog and its state
(prepared, committed, rolled back).
On the server layer, ask each engine for a list of pending XID in prepared
state. If the binlog state of an XID is committed, commit in the engine. If
the binlog state is rolled back or is missing, roll back in the engine. If
the binlog state is prepared, _and_ all participating engines have the
transaction prepared also, then leave the transaction prepared. If a binlog
prepared transaction is missing from an engine, then roll it back in any
other engines and in the binlog (this is to handle a crash in the middle of
an XA PREPARE).
The result is that multi-engine (or non-InnoDB) transactions, as well as
user XA transactions, will be recovered after a crash consistent with the
binlog content.
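The server-layer decision described above can be summarised in a small
function. This is a sketch under assumptions: the function resolve_xid, the
string state names and the use of engine-name sets are invented for the
example and are not the server's actual interfaces.

```python
def resolve_xid(binlog_state, participants, prepared_in):
    """Decide what to do with one XID after restart.

    binlog_state: "committed", "rolled_back", "prepared", or None when the
                  binlog has no record of the XID.
    participants: set of engines taking part in the transaction
                  (hypothetical input for this sketch).
    prepared_in:  set of engines reporting the XID as prepared.
    """
    if binlog_state == "committed":
        return "commit"
    if binlog_state in ("rolled_back", None):
        return "rollback"
    if binlog_state == "prepared":
        # Stay prepared only if every participating engine has it prepared;
        # otherwise assume a crash in the middle of XA PREPARE.
        if participants <= prepared_in:
            return "keep_prepared"
        return "rollback"
    raise ValueError(f"unknown binlog state: {binlog_state!r}")
```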
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
If the user manually copies an engine-implemented binlog file and runs
mysqlbinlog on it, any following or oob-referenced files may not be
available to read from. Treat this as end-of-file rather than an error
(so we can output at least any part of the file that is available).
But still output a message about the failure to open the file, to give
some indication why the dump stopped at that point.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
1. Handle RELEASE SAVEPOINT, removing any released savepoint from the list
of savepoints pending in the cache. Also fix a bug in the server layer;
RELEASE SAVEPOINT removes the specified savepoint _and_ any later
savepoints; engines were not informed of the removal of the later ones, if
any.
2. Fix a bug when spilling non-transactional statement data inside of a
transaction using savepoints. The spill of the statement cache must not
spill any savepoints, those apply only to the trx cache.
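The RELEASE SAVEPOINT semantics from point 1 can be illustrated on a plain
list. The (name, cache_offset) tuples and the function release_savepoint
are assumptions for this sketch, not the server's representation.

```python
def release_savepoint(pending, name):
    """Remove `name` and every later savepoint; return the removed ones.

    `pending` is a list of (name, cache_offset) tuples, oldest first.
    """
    for i, (sp_name, _) in enumerate(pending):
        if sp_name == name:
            released = pending[i:]
            del pending[i:]
            # Every entry in `released` must be reported to the engines,
            # not just the one that was named.
            return released
    raise KeyError(name)
```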
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When an empty user XA transaction is committed (or rolled back), we do
not need to binlog any real transaction, but we still need to binlog a
rollback record to clear the XID for reuse and free the prepare record from
purge and from recovery.
Also fix a bug that in case of error adding to the innodb binlog internal
xid hash (eg. duplicate), we must still ensure that the written XA prepare
record is entered into the pending LSN fifo, so we can track when it becomes
durable (there was a shutdown hang possible if a prepare record was last in
the binlog and XID insert failed so the record was never marked durable).
Bugs found in RQG runs.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When the tablespace is closed during shutdown, it waits for the last record
written in the tablespace to be durable in the InnoDB redo log.
There was an obvious mistake/race in the code in function
ibb_wait_durable_offset(), which did not check for the waited-for condition
before doing the wait.
Thus if the last record in the binlog file became durable between the check
in fsp_binlog_tablespace_close() and the wait in ibb_wait_durable_offset(),
it would wait for new data to be written; this could cause a hang.
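The underlying rule is the usual one for condition waits: test the
condition under the lock before sleeping. A minimal sketch, where the class
DurableOffset and its methods are invented for illustration:

```python
import threading


class DurableOffset:
    """Toy tracker of how far the binlog is durable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._durable = 0

    def advance(self, offset: int) -> None:
        with self._cond:
            if offset > self._durable:
                self._durable = offset
                self._cond.notify_all()

    def wait_durable(self, offset: int, timeout=None) -> bool:
        with self._cond:
            # wait_for() evaluates the predicate before sleeping, so an
            # offset that is already durable returns immediately.
            return bool(self._cond.wait_for(
                lambda: self._durable >= offset, timeout))
```

If the offset became durable just before the call, wait_durable() returns
at once rather than blocking until more data is written.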
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Adjust an assertion that checks that no unspilled savepoints are left after
spilling the trx cache as OOB data to the engine binlog.
If the savepoint is at the very end of the trx cache when we spill, there is
no need to spill that particular savepoint to the engine (and we do not do
so); the savepoint can still be rolled back to by truncating the in-memory
part of the cache.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
1. The binlog recovery layer gets duplicate records in case of multi-batch
recovery, seemingly any partial mini-transaction is re-applied from the
start at a batch boundary. Thus, we need to be able to accept multiple
duplicate records belonging to the same mtr.
2. All records in a mini-transaction share the same LSN. Thus, when an mtr
spans two binlog files, the LSN of the mtr may compare larger than the
start_lsn of file_no+1, while part of that mtr (but not all of it) applies
to file_no. In this case, the record could incorrectly be interpreted as
applying to file_no+2 instead of file_no.
Both bugs found as test failures of binlog_in_engine.recover_concurrent_dml.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This recovery testcase more aggressively exercises the recovery. It
runs a parallel DML load on the master and crashes it at an arbitrary
point; then checks the self-consistency of the transactions to test
for partially/incorrectly recovered individual transactions, and
replicates to the slave and tests consistency between master and slave.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This patch creates binlog-in-engine MTR test equivalents for
rpl_mysqlbinlog_slave_consistency. One new test is a basic variant,
which tests the exact same behavior as the rpl test version, and the
other uses out-of-band binlogging to ensure the slave and mysqlbinlog
event replay is consistent. The rpl.rpl_mysqlbinlog_slave_consistency
test validates that the domain and server id filtering logic of
mariadb-binlog matches that of a replica server. The purpose of this
patch is then to ensure that the expected filtering behavior is still
consistent when using binlog-in-innodb.
Changes to the replication-side of tests are as follows:
1. So the binlog_in_engine tests can find the correct include files,
the paths to the files from the `source` commands are made absolute.
2. To support the out-of-band binlogging variant, the
tables/transactions in sql_multisource.inc and
sql_out_of_order_gtid.inc are extended with a configurable size
longblob, so the OOB test can use large transactions.
3. To add some complexity for the OOB testing, the transactions in
sql_multisource.inc and sql_out_of_order_gtid.inc are staggered
using different connections and run concurrently so the OOB data
is overlapping. This is done for cases using different domain ids
as well as the same domain id (though commit order is still
enforced).
Converts test rpl_row_basic_3innodb to binlog in engine with two
variants:
1) Master uses binlog-in-innodb and slave uses file binlog
2) Master uses file binlog and slave uses binlog-in-innodb
It is legal to re-use the same name for a SAVEPOINT in a single transaction.
In this case, the new savepoint overrides the old one, effectively deleting
the old one. The code was not correctly handling this, and could end up with
an invalid list of pending savepoints, causing assertion and malfunction.
Fix by searching the list for any existing occurrence of the savepoint being
set, and deleting any such occurrence found before inserting it at the end
of the list.
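In list terms the fix amounts to the following. The (name, cache_offset)
tuples and the function set_savepoint are hypothetical, and name matching
is simplified to exact string comparison for this sketch.

```python
def set_savepoint(pending, name, offset):
    """Set savepoint `name`, replacing any earlier one with the same name.

    `pending` is a list of (name, cache_offset) tuples, oldest first.
    """
    # Drop any existing occurrence first, then append at the end.
    pending[:] = [sp for sp in pending if sp[0] != name]
    pending.append((name, offset))
```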
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When starting on a new binlog, during server initialization we force
initialization and durable sync to disk of the initial header page and gtid
state record of the new file_no=0. But the code mistakenly took this code
path also when restarting on an existing binlog.
If the server was restarted on a binlog with current position exactly at the
end of a page, this caused an attempt to write a zero byte record at the
start of the next page (when that page_no was not divisible by the
--innodb-binlog-state-interval), which caused double creation of the page on
the next real write and invalid state/assertion in the page fifo.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>