InnoDB originally skipped the redo logging of PAGE_MAX_TRX_ID changes
until I enabled it in commit e76b873f24
that was part of MySQL 5.5.5 already.
Later, when a more complete history of the InnoDB Plugin for MySQL 5.1
(aka branches/zip in the InnoDB subversion repository) and of the
planned-to-be closed-source branches/innodb+ that became the basis of
InnoDB in MySQL 5.5 was pushed to the MySQL source repository, the
change was part of commit 509e761f06:
------------------------------------------------------------------------
r5038 | marko | 2009-05-19 22:59:07 +0300 (Tue, 19 May 2009) | 30 lines
branches/zip: Write PAGE_MAX_TRX_ID to the redo log. Otherwise,
transactions that are started before the rollback of incomplete
transactions has finished may have an inconsistent view of the
secondary indexes.
dict_index_is_sec_or_ibuf(): Auxiliary function for controlling
updates and checks of PAGE_MAX_TRX_ID: check whether an index is a
secondary index or the insert buffer tree.
page_set_max_trx_id(), page_update_max_trx_id(),
lock_rec_insert_check_and_lock(),
lock_sec_rec_modify_check_and_lock(), btr_cur_ins_lock_and_undo(),
btr_cur_upd_lock_and_undo(): Add the parameter mtr.
page_set_max_trx_id(): Allow mtr to be NULL. When mtr==NULL, do not
attempt to write to the redo log. This only occurs when creating a
page or reorganizing a compressed page. In these cases, the
PAGE_MAX_TRX_ID will be set correctly during the application of redo
log records, even though there is no explicit log record about it.
btr_discard_only_page_on_level(): Preserve PAGE_MAX_TRX_ID. This
function should be unreachable, though.
btr_cur_pessimistic_update(): Update PAGE_MAX_TRX_ID.
Add some assertions for checking that PAGE_MAX_TRX_ID is set on all
secondary index leaf pages.
rb://115 tested by Michael, fixes Issue #211
------------------------------------------------------------------------
After this fix, some bogus references to recv_recovery_is_on()
remained. Also, some references could be replaced with
references to index->is_dummy to prepare us for MDEV-14481
(background redo log apply).
- Make Rdb_binlog_manager::unpack_value to not have a stack overrun
when it is reading invalid data (which it currently does as we in
MariaDB do not store binlog coordinates under BINLOG_INFO_INDEX_NUMBER,
see comments in MDEV-14892 for details).
- We may need to store these coordinates in the future, so instead of
removing the call of this function, let's make it work properly for
all possible inputs.
Problems --------
The slave io thread did not conduct integrity check
for a group of row-based events. Specifically it tolerates missed
terminal block event that must be flagged with STMT_END. Failure to
react on its loss can confuse the applier thread in various ways.
Another potential issue was that there were no check of impossible
second in row Gtid-log-event while the slave io thread is receiving
to be skipped events after reconnect.
Fixes
-----
The slave io thread is made by this patch to track the rows event
STMT_END status.
Whenever at next event reading the IO thread finds out that a preceding
Rows event did not actually had the flag, an
explicit error is issued.
Replication can be resumed after the source of failure is eliminated,
see a provided test.
Note that currently the row-based group integrity check excludes
the compressed version 2 Rows events (which are not generated by MariaDB
master).
Its uncompressed counterpart is manually tested.
The 2nd issue is covered to produce an error in case the io thread
receives a successive Gtid_log_event while it is post-reconnect
skipping.
In CREATE SEQUENCE or CREATE TEMPORARY SEQUENCE, we should not start
an InnoDB transaction for inserting the sequence status record into
the underlying no-rollback table. Because we did this, a debug assertion
failure would fail in START TRANSACTION WITH CONSISTENT SNAPSHOT after
CREATE TEMPORARY SEQUENCE was executed.
row_ins_step(): Do not start the transaction. Let the caller do that.
que_thr_step(): Start the transaction before calling row_ins_step().
row_ins_clust_index_entry(): Skip locking and undo logging for no-rollback
tables, even for temporary no-rollback tables.
row_ins_index_entry(): Allow trx->id==0 for no-rollback tables.
row_insert_for_mysql(): Do not start a transaction for no-rollback tables.
trx reference counter was updated under mutex and read without any
protection. This is both slow and unsafe. Use atomic operations for
reference counter accesses.
trx_sys_t::rw_trx_set is implemented as std::set, which does a few quite
expensive operations under trx_sys_t::mutex protection: e.g. malloc/free
when adding/removing elements. Traversing b-tree is not that cheap either.
This has negative scalability impact, which is especially visible when running
oltp_update_index.lua benchmark on a ramdisk.
To reduce trx_sys_t::mutex contention std::set is replaced with LF_HASH. None
of LF_HASH operations require trx_sys_t::mutex (nor any other global mutex)
protection.
Another interesting issue observed with std::set is reproducible ~2% performance
decline after benchmark is ran for ~60 seconds. With LF_HASH results are stable.
All in all this patch optimises away one of three trx_sys->mutex locks per
oltp_update_index.lua query. The other two critical sections became smaller.
Relevant clean-ups:
Replaced rw_trx_set iteration at startup with local set. The latter is needed
because values inserted to rw_trx_list must be ordered by trx->id.
Removed redundant conditions from trx_reference(): it is (and even was) never
called with transactions that have trx->state == TRX_STATE_COMMITTED_IN_MEMORY.
do_ref_count doesn't (and probably even didn't) make any sense: now it is called
only when reference counter increment is actually requested.
Moved condition out of mutex in trx_erase_lists().
trx_rw_is_active(), trx_rw_is_active_low() and trx_get_rw_trx_by_id() were
greatly simplified and replaced by appropriate trx_rw_hash_t methods.
Compared to rw_trx_set, rw_trx_hash holds transactions only in PREPARED or
ACTIVE states. Transactions in COMMITTED state were required to be found
at InnoDB startup only. They are now looked up in the local set.
Removed unused trx_assert_recovered().
Removed unused innobase_get_trx() declaration.
Removed rather semantically incorrect trx_sys_rw_trx_add().
Moved information printout from trx_sys_init_at_db_start() to
trx_lists_init_at_db_start().
The warning was originally added in
commit c67663054a
(MySQL 4.1.12, 5.0.3) to trace claimed undo log corruption that
was analyzed in https://lists.mysql.com/mysql/176250
on November 9, 2004.
Originally, the limit was 20,000 undo log headers or transactions,
but in commit 9d6d1902e0
in MySQL 5.5.11 it was increased to 2,000,000.
The message can be triggered when the progress of purge is prevented
by a long-running transaction (or just an idle transaction whose
read view was started a long time ago), by running many transactions
that UPDATE or DELETE some records, then starting another transaction
with a read view, and finally by executing more than 2,000,000
transactions that UPDATE or DELETE records in InnoDB tables. Finally,
when the oldest long-running transaction is completed, purge would
run up to the next-oldest transaction, and there would still be more
than 2,000,000 transactions to purge.
Because the message can be triggered when the database is obviously
not corrupted, it should be removed. Heavy users of InnoDB should be
monitoring the "History list length" in SHOW ENGINE INNODB STATUS;
there is no need to spam the error log.
Problem was timing between the thread that was killed and reading the
binary log.
Updated the test to wait until the killed thread was properly terminated
before checking what's in the binary log.
To make check safe, I changed "threads_connected" to be updated after
thd::cleanup() is done, to ensure that all binary logs updates are done
before the variable is changed. This was mainly done to get the
test deterministic and have now other real influence in how the server
works.
recv_log_recover_10_3(): Determine if a log from MariaDB 10.3 is clean.
recv_find_max_checkpoint(): Allow startup with a clean 10.3 redo log.
srv_prepare_to_delete_redo_log_files(): When starting up with a 10.3 log,
display a "Downgrading redo log" message instead of "Upgrading".
The XtraDB option innodb_track_changed_pages causes
the function log_group_read_log_seg() to be invoked
even when recv_sys==NULL, leading to the SIGSEGV.
This regression was caused by
MDEV-11027 InnoDB log recovery is too noisy
with recursive reference in subquery
If a recursive CTE uses a subquery with recursive reference then
the virtual function reset() must be called after each iteration
performed at the execution of the CTE.
This bug affected tables where the PRIMARY KEY contains variable-length
columns, and ROW_FORMAT is COMPACT or DYNAMIC.
rec_init_offsets_comp_ordinary(): Do not short-cut the parsing
of the record header for records that contain explicit values
for instantly added columns.
rec_copy_prefix_to_buf(): Copy more header for records that
contain explicit values for instantly added columns.