When SUX_LOCK_GENERIC is defined, the srw_mutex, srw_lock, sux_lock are
implemented based on pthread_mutex_t and pthread_cond_t. This is the
only option for systems that lack a futex-like system call.
In the SUX_LOCK_GENERIC mode, if pthread_mutex_init() allocates
some resources that need to be freed by pthread_mutex_destroy(),
a memory leak could occur when we repeatedly invoke
pthread_mutex_init() without a pthread_mutex_destroy() in between.
pthread_mutex_wrapper::initialized: A debug field to track whether
pthread_mutex_init() has been invoked. This also helps find bugs
like the one that was fixed by
commit 1c8af2ae53 (MDEV-34422);
one simply needs to add -DSUX_LOCK_GENERIC to the CMAKE_CXX_FLAGS
to catch that particular bug on the initial server bootstrap.
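The idea of a debug field that tracks pthread_mutex_init() can be
sketched as follows. This is a minimal illustration, not the InnoDB
class: the name tracked_mutex and all of its members are hypothetical,
and the real pthread_mutex_wrapper may differ in every detail.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical stand-in for pthread_mutex_wrapper; not the InnoDB code.
class tracked_mutex
{
  pthread_mutex_t lock_;
  bool initialized_ = false;  // debug aid: has init() been invoked?
public:
  bool is_initialized() const { return initialized_; }

  void init()
  {
    // A second init() without destroy() in between could leak resources.
    assert(!initialized_);
    int err = pthread_mutex_init(&lock_, nullptr);
    assert(err == 0);
    (void) err;
    initialized_ = true;
  }

  void destroy()
  {
    // Destroying an object that was never initialized is a bug.
    assert(initialized_);
    int err = pthread_mutex_destroy(&lock_);
    assert(err == 0);
    (void) err;
    initialized_ = false;
  }

  void wr_lock() { assert(initialized_); pthread_mutex_lock(&lock_); }
  void wr_unlock() { assert(initialized_); pthread_mutex_unlock(&lock_); }
};
```

With such a flag, an unbalanced init()/destroy() pair fails an
assertion in a debug build instead of silently leaking or corrupting.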
buf_block_init(), buf_page_init_for_read(): Invoke block_lock::init()
because buf_page_t::init() will no longer do that.
buf_page_t::init(): Instead of invoking lock.init(), assert that it
has already been invoked (the lock is vacant).
add_fts_index(), build_fts_hidden_table(): Explicitly invoke
index_lock::init() in order to avoid a pthread_mutex_destroy()
invocation on an uninitialized object.
srw_lock_debug::destroy(): Invoke readers_lock.destroy().
trx_sys_t::create(): Invoke trx_rseg_t::init() on all rollback segments
in order to guarantee a deterministic state for shutdown, even if
InnoDB fails to start up.
trx_rseg_array_init(), trx_temp_rseg_create(), trx_rseg_create():
Invoke trx_rseg_t::destroy() before trx_rseg_t::init() in order to
balance pthread_mutex_init() and pthread_mutex_destroy() calls.
Before MDEV-15158, wsrep xid information was stored in only one place:
in the TRX_SYS page. Starting with 10.3, it is not stored there but
in the rollback segment header pages, and the latest one is what
matters. MDEV-19229 allows the undo tablespaces to be rebuilt when
innodb_undo_tablespaces is changed on startup. Previously it was not
possible to change that parameter.
As a result of these changes, rollback segment header pages could
contain several wsrep XIDs. When the undo tablespaces were rebuilt,
an attempt was made to restore the wsrep XID to a rollback segment
header page, but because there were several of them, the latest
wsrep XID was overwritten with an older one.
trx_rseg_read_wsrep_checkpoint(), trx_rseg_init_wsrep_xid():
Return true if the XID that was read is a wsrep XID, false if not.
trx_rseg_mem_restore(): Try to read the wsrep XID, and if it is found,
copy it to trx_sys.recovered_wsrep_xid if the XID that was read has
a larger seqno.
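The "keep the XID with the larger seqno" rule can be sketched as
follows. The type xid_sketch and the function keep_latest_wsrep_xid()
are hypothetical simplifications for illustration; the real XID
structure and the InnoDB functions are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for an XID that carries a wsrep seqno.
struct xid_sketch
{
  bool is_wsrep;   // whether this is a wsrep XID at all
  int64_t seqno;   // wsrep sequence number; meaningful only if is_wsrep
};

// Copy 'candidate' to 'recovered' only if it is a wsrep XID with a larger
// seqno than what was recovered so far. Returns true if a copy was made.
inline bool keep_latest_wsrep_xid(xid_sketch &recovered,
                                  const xid_sketch &candidate)
{
  if (!candidate.is_wsrep)
    return false;
  if (recovered.is_wsrep && recovered.seqno >= candidate.seqno)
    return false;
  recovered = candidate;
  return true;
}
```

Applying this to every rollback segment header page in any order
leaves the XID with the largest seqno in place.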
flst_read_addr(): Remove assertions. Instead, we will check these
conditions in the callers and avoid a crash in case of corruption.
We will check the conditions more carefully, because the callers
know more exact bounds for the page numbers and the byte offsets
within pages.
flst_remove(), flst_add_first(), flst_add_last(): Add a parameter
for passing fil_space_t::free_limit. None of the lists may point to
pages that are beyond the current initialized length of the
tablespace.
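A caller-side check of a list address against the initialized length
of the tablespace might look like the sketch below. All names are
hypothetical, and the header and trailer sizes are assumptions made
for the example; the exact bounds used by each InnoDB caller are not
shown in this message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a file address (page number, byte offset).
struct list_addr_sketch { uint32_t page; uint16_t boffset; };

constexpr uint32_t NULL_PAGE = 0xFFFFFFFFU;  // assumed "no page" marker
constexpr uint32_t HEADER_BYTES = 38;        // assumed page header size
constexpr uint32_t TRAILER_BYTES = 8;        // assumed page trailer size

// Returns true if 'addr' is either the end-of-list marker or points into
// the payload area of a page below 'free_limit'. A false return would be
// treated as corruption by the caller instead of crashing.
inline bool list_addr_is_plausible(list_addr_sketch addr,
                                   uint32_t free_limit, uint32_t page_size)
{
  if (addr.page == NULL_PAGE)
    return true;
  if (addr.page >= free_limit)
    return false;
  return addr.boffset >= HEADER_BYTES &&
    addr.boffset < page_size - TRAILER_BYTES;
}
```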
trx_rseg_mem_restore(): Access the first page of the tablespace,
so that we will correctly recover rseg->space->free_limit
in case some log based recovery is pending.
ibuf_remove_free_page(): Only look up the root page once, and
validate the last page number.
Reviewed by: Debarun Banerjee
TrxUndoRsegs is a wrapper for a vector of trx_rseg_t*. It has two
constructors, both of which initialize the vector with only one element.
They are used to push a transaction's rseg (singular) to the purge queue.
There is no function to add elements to the vector. The default constructor
is used only for the declaration of NullElement.
TrxUndoRsegs was introduced in WL#6915 in MySQL 5.7. MySQL 5.7
would unnecessarily let the purge of history parse the
temporary undo records, and then look up the table (via a global hash
table), and only at the point of processing the parsed undo log record
determine that the table is a temporary table and the undo record must be
thrown away.
In MariaDB 10.2 we have two disjoint sets of rollback segments (128 for
persistent, 128 for temporary), and purge does not even see the temporary
tables. The only reason why temporary tables are visible to other threads
is a SQL layer bug (MDEV-17805).
purge_sys_t::choose_next_log(): merge the relevant part
of TrxUndoRsegsIterator::set_next() to the start of
purge_sys_t::choose_next_log().
purge_sys_t::rseg_get_next_history_log(): add a tail call of
purge_sys_t::choose_next_log() and adjust the callers, to simplify the
control flow further.
purge_sys.pq_mutex and purge_sys.purge_queue: make them private by adding
some simple accessor functions.
trx_purge_cleanse_purge_queue(): make it a member of purge_sys_t to
have access to the private purge_sys.pq_mutex and purge_sys.purge_queue;
simplify the code by using a simple array copy and clearing the purge
queue instead of popping each purge queue element.
rseg_t::last_commit_and_offset: exchange trx_no and offset bits to avoid
bitwise operations during pushing to/popping from purge queue.
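Why the bit order matters can be illustrated with a small sketch. The
layout below (trx_no in the upper 48 bits, offset in the lower 16 bits)
is an assumption made for the example, and all names are hypothetical.
With the transaction number in the most significant bits, the packed
value can be compared directly, so the queue needs no shifting or
masking when ordering elements.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Assumed layout: trx_no in the high 48 bits, page offset in the low 16.
constexpr uint64_t pack_commit(uint64_t trx_no, uint16_t offset)
{
  return trx_no << 16 | offset;
}
constexpr uint64_t trx_no_of(uint64_t packed) { return packed >> 16; }
constexpr uint16_t offset_of(uint64_t packed)
{
  return static_cast<uint16_t>(packed);
}

// Min-heap on the packed value: the smallest trx_no comes out first,
// using plain integer comparison on the packed representation.
using purge_queue_sketch =
  std::priority_queue<uint64_t, std::vector<uint64_t>,
                      std::greater<uint64_t>>;
```

Had the offset been stored in the high bits, every comparison would
first have to extract trx_no from both operands.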
Thanks to Marko Mäkelä for the historical overview of TrxUndoRsegs development.
Reviewed by: Marko Mäkelä
Some fixes related to commit f838b2d799 and
Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row()
for system-versioned tables were provided by Nikita Malyavin.
This was required by test versioning.rpl,trx_id,row.
Fix a scenario where `mariabackup --prepare` fails with assertion
`!m_modifications || !recv_no_log_write' in `mtr_t::commit()`. This
happens if the prepare step of the backup encounters a data directory
which happens to store wsrep xid position in TRX SYS page (this is no
longer the case since 10.3.5). And since MDEV-17458,
`trx_rseg_array_init()` handles this case by copying the xid position
to rollback segments, before clearing the xid from TRX SYS page.
However, this step should be avoided when `trx_rseg_array_init()` is
invoked from mariabackup. The relevant code was surrounded by the
condition `srv_operation == SRV_OPERATION_NORMAL`. An additional check
ensures that we are not trying to copy an xid position which has
already been zeroed.
The linear read-ahead (enabled by nonzero innodb_read_ahead_threshold)
works best if index leaf pages or undo log pages have been allocated
on adjacent page numbers. The read-ahead is assumed not to be helpful
in other types of page accesses, such as non-leaf index pages.
buf_page_get_low(): Do not invoke buf_page_t::set_accessed(),
buf_page_make_young_if_needed(), or buf_read_ahead_linear().
We will invoke them in those callers of buf_page_get_gen() or
buf_page_get() where it makes sense: the access is not
one-time-on-startup and the page is not going to be freed soon.
btr_copy_blob_prefix(), btr_pcur_move_to_next_page(),
trx_undo_get_prev_rec_from_prev_page(),
trx_undo_get_first_rec(), btr_cur_t::search_leaf(),
btr_cur_t::open_leaf(): Invoke buf_read_ahead_linear().
We will not invoke linear read-ahead in functions that would
essentially allocate or free pages, because pages that are
freshly allocated are expected to be initialized by buf_page_create()
and not read from the data file. Likewise, freeing pages should
not involve accessing any sibling pages, except for freeing
singly-linked lists of BLOB pages.
We will not invoke read-ahead in btr_cur_t::pessimistic_search_leaf()
or in a pessimistic operation of btr_cur_t::open_leaf(), because
it is assumed that pessimistic operations should be preceded by
optimistic operations, which should already have invoked read-ahead.
buf_page_make_young_if_needed(): Invoke also buf_page_t::set_accessed()
and return the result.
btr_cur_nonleaf_make_young(): Like buf_page_make_young_if_needed(),
but do not invoke buf_page_t::set_accessed().
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
Before MariaDB 10.3.5, the binlog position was stored in the TRX_SYS page;
since then, it has been stored in rollback segments. There is code to read the
legacy position from TRX_SYS to handle upgrades. The problem arose when the
legacy position happened to compare larger than the position found in
rollback segments; in that case, the old TRX_SYS position would incorrectly
be preferred over the newer position from rollback segments.
Fixed by always preferring a position from rollback segments over a legacy
position.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This commit can cause the wrong (old) binlog position to be recovered by
mariabackup --prepare. It compares the values of FIL_PAGE_LSN to
determine which binlog position is the last one and should be
recovered. However, it is not guaranteed that the FIL_PAGE_LSN order matches the
commit order, as is assumed by the code. This is because the page LSN could be
modified by an unrelated update of the page after the commit.
In one example, the recovery first encountered this in trx_rseg_mem_restore():
lsn=27282754 binlog position (./master-bin.000001, 472908)
and then later:
lsn=27282699 binlog position (./master-bin.000001, 477164)
The last one 477164 is the correct position. However, because the LSN
encountered for the first one is higher, that position is recovered instead.
This results in a too-old binlog position being recovered, and a newly
provisioned slave will start replicating too early and get a duplicate
key error or similar.
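The two observations above can be replayed in a small sketch. The
struct and both selection functions are hypothetical; comparing by
(file name, offset) is shown only as a contrast to the LSN-based choice
and is not claimed to be what the server does. The plain string
comparison of file names is an assumption that holds only for names
with equal-width numeric suffixes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record of what recovery saw in one rollback segment header.
struct binlog_candidate
{
  uint64_t page_lsn;
  std::string file;
  uint64_t offset;
};

// Flawed rule: the candidate on the page with the highest LSN wins.
inline const binlog_candidate &
pick_by_lsn(const std::vector<binlog_candidate> &seen)
{
  assert(!seen.empty());
  const binlog_candidate *best = &seen.front();
  for (const binlog_candidate &c : seen)
    if (c.page_lsn > best->page_lsn)
      best = &c;
  return *best;
}

// Contrast: the candidate with the largest (file, offset) wins.
inline const binlog_candidate &
pick_by_position(const std::vector<binlog_candidate> &seen)
{
  assert(!seen.empty());
  const binlog_candidate *best = &seen.front();
  for (const binlog_candidate &c : seen)
    if (std::tie(c.file, c.offset) > std::tie(best->file, best->offset))
      best = &c;
  return *best;
}
```

For the two entries quoted above, the LSN rule selects offset 472908,
while the position rule selects the correct offset 477164.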
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Let us remove explicit updates of MONITOR_NUM_UNDO_SLOT_USED
and MONITOR_NUM_UNDO_SLOT_CACHED, and let us compute the rough values
from trx_sys.rseg_array[] on demand.
trx_assign_rseg_low(): Simplify the debug check.
trx_rseg_t::reinit(): Reset the skip_allocation() flag.
This logic was broken in the merge
commit 3e2ad0e918
of commit 0de3be8cfd
(that is, innodb_undo_log_truncate=ON would never be "completed").
Tested by: Matthias Leich
Because downgrades from 11.0 to older MariaDB server are not possible
due to the removal of the InnoDB change buffer, there is no need to
access the field TRX_UNDO_NEEDS_PURGE anymore.
It is not safe to invoke trx_purge_free_segment() or execute
innodb_undo_log_truncate=ON before all undo log records in
the rollback segment have been processed.
A prominent failure that would occur due to premature freeing of
undo log pages is that trx_undo_get_undo_rec() would crash when
trying to copy an undo log record to fetch the previous version
of a record.
If trx_undo_get_undo_rec() was not invoked in the unlucky time frame,
then the symptom would be that some committed transaction history is
never removed. This would be detected by CHECK TABLE...EXTENDED that
was implemented in commit ab0190101b.
Such a garbage collection leak should be possible even when using
innodb_undo_log_truncate=OFF, just involving trx_purge_free_segment().
trx_rseg_t::needs_purge: Change the type from Boolean to a transaction
identifier, noting the most recent non-purged transaction, or 0 if
everything has been purged. On transaction start, we initialize this
to 1 more than the transaction start ID. On recovery, the field may be
adjusted to the transaction end ID (TRX_UNDO_TRX_NO) if it is larger.
The field TRX_UNDO_NEEDS_PURGE becomes write-only; only some debug
assertions will validate the value. The field reflects the old
inaccurate Boolean field trx_rseg_t::needs_purge.
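The change from a Boolean to a transaction identifier can be sketched
as follows. The struct rseg_sketch and its member functions are
hypothetical. Keeping the maximum on transaction start and treating
"needs_purge == 0" as the only condition for truncation are
simplifying assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, single-threaded stand-in for the needs_purge bookkeeping.
struct rseg_sketch
{
  uint64_t needs_purge = 0;  // 0 means that everything has been purged

  // On transaction start: note 1 more than the transaction start ID.
  void on_trx_start(uint64_t trx_id)
  {
    if (needs_purge <= trx_id)
      needs_purge = trx_id + 1;
  }

  // On recovery: adjust to the transaction end ID if it is larger.
  void on_recovery(uint64_t trx_no)
  {
    if (needs_purge < trx_no)
      needs_purge = trx_no;
  }

  // Purge has processed all history of this rollback segment.
  void on_all_purged() { needs_purge = 0; }

  // Freeing or truncating is safe only when nothing is left to purge.
  bool may_truncate() const { return needs_purge == 0; }
};
```

Unlike a Boolean, the identifier records how recent the unpurged
history is, so a stale "nothing to purge" state cannot be observed
while a newer transaction is still pending.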
trx_undo_mem_create_at_db_start(), trx_undo_lists_init(),
trx_rseg_mem_restore(): Remove the parameter max_trx_id.
Instead, store the maximum in trx_rseg_t::needs_purge,
where trx_rseg_array_init() will find it.
trx_purge_free_segment(): Continuously hold a lock on
trx_rseg_t to prevent any concurrent allocation of undo log.
trx_purge_truncate_rseg_history(): Only invoke trx_purge_free_segment()
if the rollback segment is empty and there are no pending transactions
associated with it.
trx_purge_truncate_history(): Only proceed with innodb_undo_log_truncate=ON
if trx_rseg_t::needs_purge indicates that all history has been purged.
Tested by: Matthias Leich
The purpose of the change buffer was to reduce random disk access,
which could be useful on rotational storage, but maybe less so on
solid-state storage.
When we wished to
(1) insert a record into a non-unique secondary index,
(2) delete-mark a secondary index record,
(3) delete a secondary index record as part of purge (but not ROLLBACK),
and the B-tree leaf page to which the record belongs was not in the buffer
pool, we inserted a record into the change buffer B-tree, indexed by
the page identifier. When the page was eventually read into the buffer
pool, we looked up the change buffer B-tree for any modifications to the
page, and applied them upon the completion of the read operation. This
was called the insert buffer merge.
We remove the change buffer, because it has been the source of
various hard-to-reproduce corruption bugs, including those fixed in
commit 5b9ee8d819 and
commit 165564d3c3 but not limited to them.
A downgrade will fail with a clear message starting with
commit db14eb16f9 (MDEV-30106).
buf_page_t::state: Merge IBUF_EXIST to UNFIXED and
WRITE_FIX_IBUF to WRITE_FIX.
buf_pool_t::watch[]: Remove.
trx_t: Move isolation_level, check_foreigns, check_unique_secondary,
bulk_insert into the same bit-field. The only purpose of
trx_t::check_unique_secondary is to enable bulk insert into an
empty table. It no longer enables insert buffering for UNIQUE INDEX.
btr_cur_t::thr: Remove. This field was originally needed for change
buffering. Later, its use was extended to cover SPATIAL INDEX.
Much of the time, rtr_info::thr holds this field. When it does not,
we will add parameters to SPATIAL INDEX specific functions.
ibuf_upgrade_needed(): Check if the change buffer needs to be updated.
ibuf_upgrade(): Merge and upgrade the change buffer after all redo log
has been applied. Free any pages consumed by the change buffer, and
zero out the change buffer root page to mark the upgrade completed,
and to prevent a downgrade to an earlier version.
dict_load_tablespaces(): Renamed from
dict_check_tablespaces_and_store_max_id(). This needs to be invoked
before ibuf_upgrade().
btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics.
The change buffer merge does not need this function anymore.
btr_page_alloc(): Renamed from btr_page_alloc_low(). We no longer
allocate any change buffer pages.
row_search_index_entry(), btr_lift_page_up(): Add a parameter thr
for the SPATIAL INDEX case.
rtr_page_split_and_insert(): Specialized from btr_page_split_and_insert().
rtr_root_raise_and_insert(): Specialized from btr_root_raise_and_insert().
Note: The support for upgrading from the MySQL 3.23 or MySQL 4.0
change buffer format that predates the MySQL 4.1 introduction of
the option innodb_file_per_table was removed in MySQL 5.6.5
as part of mysql/mysql-server@69b6241a79
and MariaDB 10.0.11 as part of 1d0f70c2f8.
In the tests innodb.log_upgrade and innodb.log_corruption, we create
valid (upgraded) change buffer pages.
Tested by: Matthias Leich
trx_sys_t::undo_log_nonempty: Set to true if there are undo logs
to rollback and purge.
The algorithm for re-creating the undo tablespace when
trx_sys_t::undo_log_nonempty is disabled:
1) trx_sys_t::reset_page(): Reset the TRX_SYS page and assign all
rollback segment slots from 1..127 to FIL_NULL
2) Free the rollback segment header page of system tablespace
for the slots 1..127
3) Update the binlog and WSREP information in system tablespace
rollback segment header
Steps (1), (2) and (3) should happen atomically within a
single mini-transaction.
4) srv_undo_delete_old_tablespaces(): Delete the old undo tablespaces
present in the undo log directory
5) Make a checkpoint to get rid of the redo log records of the old
undo tablespaces
6) Assign a new start space id for the undo log tablespaces
7) Re-create the specified undo log tablespaces. InnoDB uses the same
mtr for this step and step (6)
8) Make a checkpoint again, so that the server or mariabackup
can read page 0 of the undo log tablespaces before applying
the redo log
srv_undo_tablespaces_reinit(): Recreate the undo log tablespaces.
It resets the TRX_SYS page, deletes the old undo tablespaces, and
updates the binlog offset and the write-set replication checkpoint
in the system rollback segment page
trx_rseg_update_binlog_offset(): Added 2 new parameters to pass
binlog file name and binlog offset
trx_rseg_array_init(): Return error if the rollback segment
slot points to non-existent tablespace
srv_undo_tablespaces_init(): Added new parameter mtr
to initialize all undo tablespaces
trx_assign_rseg_low(): Allow the transaction to use the rollback
segment slots(1..127) even if InnoDB failed to change to the
requested innodb_undo_tablespaces=0
srv_start(): Override the user specified value of
innodb_undo_tablespaces variable with already existing actual
undo tablespaces
wf_incremental_process(): Detect whether the TRX_SYS page has been
modified since the last backup. If it has, the incremental backup
fails and reports that a full backup must be taken again
xb_assign_undo_space_start(): Remove the function, because
page 0 of undo001 contains the first undo space id
Added test case to test the scenario during startup and mariabackup
incremental process too.
Reviewed-by : Marko Mäkelä
Tested-by : Matthias Leich
In MySQL 5.7, rollback segments 1 to 32 are used for temporary tables,
which is an unnecessary file format change from MySQL 5.6.
This format change was avoided in MariaDB Server by
commit 124bae082b (MDEV-12289).
An upgrade from MySQL 5.7 would crash due to dereferencing a null pointer,
which is a regression due to
commit 0b47c126e3 (MDEV-13542).
trx_rseg_t::get(): Return nullptr if no tablespace exists. This is where
the upgrade would crash.
trx_rseg_mem_restore(): Return DB_TABLESPACE_NOT_FOUND if the
undo tablespace does not exist. This is likely dead code.
The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.
We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.
This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.
We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.
Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.
buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.
btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.
btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().
FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.
trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
trx_sys_create_sys_pages(): Merged with trx_sysf_create().
dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.
dir_pathname(): Replaces os_file_make_new_pathname().
row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.
btr_decryption_failed(): Report decryption failures
dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.
dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.
dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.
dict_table_is_corrupted(): Remove.
dict_index_t::is_btree(): Determine if the index is a valid B-tree.
BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures; error codes will be returned instead.
buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().
fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.
btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.
opt_search_plan_for_table(): Simplify the code.
row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.
row_undo_mod_upd_exist_sec(): Simplify the code.
row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().
fut_get_ptr(): Replace with buf_page_get_gen() calls.
buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.
buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.
fseg_page_is_allocated(): Replaces fseg_page_is_free().
fts_drop_common_tables(): Return an error if the transaction
was rolled back.
fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.
fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.
Clean up mtr_t::page_lock()
buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.
buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.
recv_recover_page(): Return whether the operation succeeded.
recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.
- In 10.6, the trx_rseg_t mutex was ported to use a latch. As part of this
porting, its profiling was removed. This patch re-enables it, given that
the latch continues to occupy the top slots in the contention list.
buf_page_t::frame: Moved from buf_block_t::frame.
All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED
pages will have frame=nullptr, while all 'fat' buf_block_t
will have a non-null frame pointing to aligned innodb_page_size bytes.
This eliminates the need for separate states for
BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE.
buf_page_t::lock: Moved from buf_block_t::lock. That is, all block
descriptors will have a page latch. The IO_PIN state that was used
for discarding or creating the uncompressed page frame of a
ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix
and page X-latch.
page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status
of buf_page_t with a single std::atomic<uint32_t>. All modifications
will use store(), fetch_add(), fetch_sub(). This space was previously
wasted on alignment on 64-bit systems. We will use the following encoding
that combines a state (partly read-fix or write-fix) and a buffer-fix
count:
buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED)
buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY)
buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH)
buf_page_t::FREED=3 + fix: pages marked as freed in the file
buf_page_t::UNFIXED=1U<<29 + fix: normal pages
buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge
buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite)
buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched)
buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched)
buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf
buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite)
buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to
UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch.
buf_page_t::read_complete(): Renamed from buf_page_read_complete().
Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch.
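The encoding and the two completion transitions can be sketched as
follows. The constant values are the ones listed above; the namespace
buf_sketch, the helper STATE_MASK, fix_count() and the free-function
form of write_complete() and read_complete() are hypothetical
simplifications, not the buf_page_t member functions. Latching is
omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

namespace buf_sketch {
// State values as listed in the commit message.
enum : uint32_t {
  NOT_USED = 0, MEMORY = 1, REMOVE_HASH = 2, FREED = 3,
  UNFIXED = 1U << 29, IBUF_EXIST = 2U << 29, REINIT = 3U << 29,
  READ_FIX = 4U << 29, WRITE_FIX = 5U << 29,
  WRITE_FIX_IBUF = 6U << 29, WRITE_FIX_REINIT = 7U << 29,
  STATE_MASK = 7U << 29  // hypothetical helper: the top 3 bits
};

inline bool is_read_fixed(uint32_t s) { return s >= READ_FIX && s < WRITE_FIX; }
inline bool is_write_fixed(uint32_t s) { return s >= WRITE_FIX; }

// Buffer-fix count that is added on top of the base state.
inline uint32_t fix_count(uint32_t s)
{
  if (s >= UNFIXED)
    return s & ~STATE_MASK;
  return s >= FREED ? s - FREED : 0;
}

// WRITE_FIX or WRITE_FIX_REINIT become UNFIXED; WRITE_FIX_IBUF becomes
// IBUF_EXIST. One fetch_sub() keeps the buffer-fix count intact.
inline void write_complete(std::atomic<uint32_t> &state)
{
  const uint32_t s = state.load(std::memory_order_relaxed);
  assert(is_write_fixed(s));
  const uint32_t from = s & STATE_MASK;
  const uint32_t to = from == WRITE_FIX_IBUF ? IBUF_EXIST : UNFIXED;
  state.fetch_sub(from - to, std::memory_order_release);
}

// READ_FIX becomes UNFIXED or IBUF_EXIST.
inline void read_complete(std::atomic<uint32_t> &state, bool ibuf_may_exist)
{
  assert(is_read_fixed(state.load(std::memory_order_relaxed)));
  const uint32_t to = ibuf_may_exist ? IBUF_EXIST : UNFIXED;
  state.fetch_sub(READ_FIX - to, std::memory_order_release);
}
}  // namespace buf_sketch
```

Because the state occupies the most significant bits, a state change
is a subtraction of a multiple of 1U<<29, which leaves any concurrent
buffer-fix count in the low bits untouched.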
buf_page_t::can_relocate(): If the page latch is being held or waited for,
or the block is buffer-fixed or io-fixed, return false. (The condition
on the page latch is new.)
Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we
will acquire the page latch before fix(), and unfix() before unlocking.
buf_page_t::flush(): Replaces buf_flush_page(). Optimize the
handling of FREED pages.
buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held
by the caller.
buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates.
buf_page_get_low(): Ignore guesses that are read-fixed because they
may not yet be registered in buf_pool.page_hash and buf_pool.LRU.
buf_page_optimistic_get(): Acquire latch before buffer-fixing.
buf_page_make_young(): Leave read-fixed blocks alone, because they
might not be registered in buf_pool.LRU yet.
recv_sys_t::recover_deferred(), recv_sys_t::recover_low():
Possibly fix MDEV-26326, by holding a page X-latch instead of
only buffer-fixing the page.
trx_rseg_header_create(): Add a parameter for the value that is
to be written to TRX_RSEG_MAX_TRX_ID. If we omit this write, then
the updated test innodb.undo_truncate will fail for the 4k, 8k, 16k
page sizes. This was broken ever since
commit 947efe17ed (MDEV-15158)
removed the writes of transaction identifiers to the TRX_SYS page.
srv_do_purge(): Truncate undo tablespaces also during slow shutdown
(innodb_fast_shutdown=0).
Thanks to Krunal Bauskar for noticing this problem.
redo_rseg_mutex, noredo_rseg_mutex: Remove the PERFORMANCE_SCHEMA keys.
The rollback segment mutex will be uninstrumented.
trx_sys_t: Remove pointer indirection for rseg_array, temp_rseg.
Align each element to the cache line.
trx_sys_t::rseg_id(): Replaces trx_rseg_t::id.
trx_rseg_t::ref: Replaces needs_purge, trx_ref_count, skip_allocation
in a single std::atomic<uint32_t>.
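Combining two flags and a reference count in one atomic word can be
sketched as follows. The class name, the member functions and the bit
assignment (bit 0 for skip_allocation, bit 1 for needs_purge, the
remaining bits for the reference count) are assumptions made for this
illustration and are not taken from this message.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch of flags plus a reference count in one word.
class rseg_ref_sketch
{
  static constexpr uint32_t SKIP = 1;         // assumed: skip allocation
  static constexpr uint32_t NEEDS_PURGE = 2;  // assumed: history exists
  static constexpr uint32_t REF = 4;          // one reference

  std::atomic<uint32_t> ref{0};
public:
  void set_skip_allocation() { ref.fetch_or(SKIP, std::memory_order_relaxed); }
  void clear_skip_allocation() { ref.fetch_and(~SKIP, std::memory_order_relaxed); }
  void set_needs_purge() { ref.fetch_or(NEEDS_PURGE, std::memory_order_relaxed); }

  bool skip_allocation() const { return ref.load(std::memory_order_relaxed) & SKIP; }
  bool needs_purge() const { return ref.load(std::memory_order_relaxed) & NEEDS_PURGE; }
  bool is_referenced() const { return ref.load(std::memory_order_relaxed) >= REF; }

  // Take a reference unless allocation is being skipped. Checking the flag
  // and incrementing the count happen in one atomic step.
  bool acquire_if_available()
  {
    uint32_t r = ref.load(std::memory_order_relaxed);
    while (!(r & SKIP))
      if (ref.compare_exchange_weak(r, r + REF, std::memory_order_acquire,
                                    std::memory_order_relaxed))
        return true;
    return false;
  }

  void release()
  {
    assert(is_referenced());
    ref.fetch_sub(REF, std::memory_order_release);
  }
};
```

With separate fields, a thread could observe the flag and the count at
different moments; a single word removes that window.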
trx_rseg_t::latch: Replaces trx_rseg_t::mutex.
trx_rseg_t::history_size: Replaces trx_sys_t::rseg_history_len
trx_sys_t::history_size_approx(): Replaces trx_sys.rseg_history_len
in those places where the exact count does not matter. We must not
acquire any trx_rseg_t::latch while holding index page latches, because
normally the trx_rseg_t::latch is acquired before any page latches.
trx_sys_t::history_exists(): Replaces trx_sys.rseg_history_len!=0
with an approximation.
We remove some unnecessary trx_rseg_t::latch acquisition around
trx_undo_set_state_at_prepare() and trx_undo_set_state_at_finish().
Those operations will only access fields that remain constant
after trx_rseg_t::init().
trx_undo_mem_create_at_db_start(): Relax too strict upgrade checks
that were introduced in
commit e46f76c974 (MDEV-15912).
On commit, pages will typically be set to TRX_UNDO_CACHED state.
Having the type TRX_UNDO_INSERT in such pages is common and
unproblematic; the type would be reset in trx_undo_reuse_cached().
trx_rseg_array_init(): On failure, clean up the rollback segments
that were initialized so far, to avoid an assertion failure later
during shutdown.