mariadb/storage/innobase/include/trx0rec.h
Marko Mäkelä ab0190101b MDEV-24402: InnoDB CHECK TABLE ... EXTENDED
Until now, the attribute EXTENDED of CHECK TABLE was ignored by InnoDB,
and InnoDB only counted the records in each index according
to the current read view. Unless the attribute QUICK was specified, the
function btr_validate_index() would be invoked to validate the B-tree
structure (the sibling and child links between index pages).

The EXTENDED check will not only count all index records according to the
current read view, but also ensure that any delete-marked records in the
clustered index are waiting for the purge of history, and that all
secondary index records point to a version of the clustered index record
that is waiting for the purge of history. In other words, no index may
contain orphan records. Normal MVCC reads and the non-EXTENDED version
of CHECK TABLE would ignore these orphans.

Unpurged records merely result in warnings (at most one per index),
not errors, and no indexes will be flagged as corrupted due to such
garbage. It will remain possible to SELECT data from such indexes or
tables (which will skip such records) or to rebuild the table to
reclaim some space.

We introduce purge_sys.end_view that will be (almost) a copy of
purge_sys.view at the end of a batch of purging committed transaction
history. It is not an exact copy, because if the size of a purge batch
is limited by innodb_purge_batch_size, some records that
purge_sys.view would allow to be purged will be left over for
subsequent batches.

The purge_sys.view is relevant in the purge of committed transaction
history, to determine if records are safe to remove. The new
purge_sys.end_view is relevant in MVCC operations and in
CHECK TABLE ... EXTENDED. It tells which undo log records are
safe to access (have not been discarded at the end of a purge batch).

purge_sys.clone_oldest_view<true>(): In trx_lists_init_at_db_start(),
clone the oldest read view similar to purge_sys_t::clone_end_view()
so that CHECK TABLE ... EXTENDED will not report bogus failures between
InnoDB restart and the completed purge of committed transaction history.

purge_sys_t::is_purgeable(): Replaces purge_sys_t::changes_visible()
in the case that purge_sys.latch will not be held by the caller.
Among other things, this guards access to BLOBs. It is not safe to
dereference any BLOBs of a delete-marked purgeable record, because
they may have already been freed.

purge_sys_t::view_guard::view(): Return a reference to purge_sys.view
that will be protected by purge_sys.latch, held by purge_sys_t::view_guard.

purge_sys_t::end_view_guard::view(): Return a reference to
purge_sys.end_view while it is protected by purge_sys.end_latch.
Whenever a thread needs to retrieve an older version of a clustered
index record, it will hold a page latch on the clustered index page
and potentially also on a secondary index page that points to the
clustered index page. If these pages contain purgeable records that
would be accessed by a currently running purge batch, the progress of
the purge batch would be blocked by the page latches. Hence, it is
safe to make a copy of purge_sys.end_view while holding an index page
latch, and consult the copy of the view to determine whether a record
should already have been purged.

btr_validate_index(): Remove a redundant check.

row_check_index_match(): Check if a secondary index record and a
version of a clustered index record match each other.

row_check_index(): Replaces row_scan_index_for_mysql().
Count the records in each index directly, duplicating the relevant
logic from row_search_mvcc(). Initialize check_table_extended_view
for CHECK ... EXTENDED while holding an index leaf page latch.
If we encounter an orphan record, the copy of purge_sys.end_view that
we make is safe for visibility checks, and trx_undo_get_undo_rec() will
check for the safety to access each undo log record. Should that check
fail, we should return DB_MISSING_HISTORY to report a corrupted index.
The EXTENDED check tries to match each secondary index record with
every available clustered index record version, by duplicating the logic
of row_vers_build_for_consistent_read() and invoking
trx_undo_prev_version_build() directly.

Before invoking row_check_index_match() on delete-marked clustered index
record versions, we will consult purge_sys.is_purgeable() in order to
avoid accessing freed BLOBs.

We will always check that the DB_TRX_ID or PAGE_MAX_TRX_ID does not
exceed the global maximum. Orphan secondary index records will be
flagged only if everything up to PAGE_MAX_TRX_ID has been purged.
We warn also about clustered index records whose nonzero DB_TRX_ID
should have been reset in purge or rollback.

trx_set_rw_mode(): Move an assertion from ReadView::set_creator_trx_id().

trx_undo_prev_version_build(): Remove two debug-only parameters,
and return an error code instead of a Boolean.

trx_undo_get_undo_rec(): Return a pointer to the undo log record,
or nullptr if one cannot be retrieved. Instead of consulting the
purge_sys.view, consult the purge_sys.end_view to determine which
records can be accessed.

trx_undo_get_rec_if_purgeable(): A variant of trx_undo_get_undo_rec()
that will consult purge_sys.view instead of purge_sys.end_view.

TRX_UNDO_CHECK_PURGEABILITY: A new parameter to
trx_undo_prev_version_build(), passed by row_vers_old_has_index_entry()
so that purge_sys.view instead of purge_sys.end_view will be consulted
to determine whether a secondary index record may be safely purged.

row_upd_changes_disowned_external(): Remove. This should be more
expensive than briefly latching purge_sys in trx_undo_prev_version_build()
(which may make use of transactional memory).

row_sel_reset_old_vers_heap(): New function, split from
row_sel_build_prev_vers_for_mysql().

row_sel_build_prev_vers_for_mysql(): Reorder some parameters
to simplify the call to row_sel_reset_old_vers_heap().

row_search_for_mysql(): Replaced with direct calls to row_search_mvcc().

sel_node_get_nth_plan(): Define inline in row0sel.h

open_step(): Define at the call site, in simplified form.

sel_node_reset_cursor(): Merged with the only caller open_step().
---
ReadViewBase::check_trx_id_sanity(): Remove.
Let us handle "future" DB_TRX_ID in a more meaningful way:

row_sel_clust_sees(): Return DB_SUCCESS if the record is visible,
DB_SUCCESS_LOCKED_REC if it is invisible, and DB_CORRUPTION if
the DB_TRX_ID is in the future.

row_undo_mod_must_purge(), row_undo_mod_clust(): Silently ignore
corrupted DB_TRX_ID. We are in ROLLBACK, and we should have noticed
that corruption when we were about to modify the record in the first
place (leading us to refuse the operation).

row_vers_build_for_consistent_read(): Return DB_CORRUPTION if
DB_TRX_ID is in the future.

Tested by: Matthias Leich
Reviewed by: Vladislav Lesin
2022-10-21 10:02:54 +03:00

322 lines
13 KiB
C

/*****************************************************************************
Copyright (c) 1996, 2016, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2017, 2022, MariaDB Corporation.
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA
*****************************************************************************/
/**************************************************//**
@file include/trx0rec.h
Transaction undo log record
Created 3/26/1996 Heikki Tuuri
*******************************************************/
#pragma once
#include "trx0types.h"
#include "row0types.h"
#include "mtr0mtr.h"
#include "rem0types.h"
#include "page0types.h"
#include "row0log.h"
#include "que0types.h"
/***********************************************************************//**
Copies the undo record to the heap.
@param undo_rec record in an undo log page
@param heap memory heap
@return copy of undo_rec
@retval nullptr if the undo log record is corrupted */
inline trx_undo_rec_t* trx_undo_rec_copy(const trx_undo_rec_t *undo_rec,
mem_heap_t *heap)
{
const size_t offset= ut_align_offset(undo_rec, srv_page_size);
const size_t end= mach_read_from_2(undo_rec);
if (end <= offset || end >= srv_page_size - FIL_PAGE_DATA_END)
return nullptr;
const size_t len= end - offset;
trx_undo_rec_t *rec= static_cast<trx_undo_rec_t*>
(mem_heap_dup(heap, undo_rec, len));
mach_write_to_2(rec, len);
return rec;
}
/**********************************************************************//**
Reads the undo log record number.
@return undo no */
inline undo_no_t trx_undo_rec_get_undo_no(const trx_undo_rec_t *undo_rec)
{
return mach_u64_read_much_compressed(undo_rec + 3);
}
/**********************************************************************//**
Returns the start of the undo record data area. */
#define trx_undo_rec_get_ptr(undo_rec, undo_no) \
((undo_rec) + trx_undo_rec_get_offset(undo_no))
/**********************************************************************//**
Reads from an undo log record the general parameters.
@return remaining part of undo log record after reading these values */
const byte*
trx_undo_rec_get_pars(
/*==================*/
const trx_undo_rec_t* undo_rec, /*!< in: undo log record */
ulint* type, /*!< out: undo record type:
TRX_UNDO_INSERT_REC, ... */
ulint* cmpl_info, /*!< out: compiler info, relevant only
for update type records */
bool* updated_extern, /*!< out: true if we updated an
externally stored fild */
undo_no_t* undo_no, /*!< out: undo log record number */
table_id_t* table_id) /*!< out: table id */
MY_ATTRIBUTE((nonnull));
/*******************************************************************//**
Builds a row reference from an undo log record.
@return pointer to remaining part of undo record */
const byte*
trx_undo_rec_get_row_ref(
/*=====================*/
const byte* ptr, /*!< in: remaining part of a copy of an undo log
record, at the start of the row reference;
NOTE that this copy of the undo log record must
be preserved as long as the row reference is
used, as we do NOT copy the data in the
record! */
dict_index_t* index, /*!< in: clustered index */
const dtuple_t**ref, /*!< out, own: row reference */
mem_heap_t* heap) /*!< in: memory heap from which the memory
needed is allocated */
MY_ATTRIBUTE((nonnull));
/**********************************************************************//**
Reads from an undo log update record the system field values of the old
version.
@return remaining part of undo log record after reading these values */
byte*
trx_undo_update_rec_get_sys_cols(
/*=============================*/
const byte* ptr, /*!< in: remaining part of undo
log record after reading
general parameters */
trx_id_t* trx_id, /*!< out: trx id */
roll_ptr_t* roll_ptr, /*!< out: roll ptr */
byte* info_bits); /*!< out: info bits state */
/*******************************************************************//**
Builds an update vector based on a remaining part of an undo log record.
@return remaining part of the record, NULL if an error detected, which
means that the record is corrupted */
byte*
trx_undo_update_rec_get_update(
/*===========================*/
const byte* ptr, /*!< in: remaining part in update undo log
record, after reading the row reference
NOTE that this copy of the undo log record must
be preserved as long as the update vector is
used, as we do NOT copy the data in the
record! */
dict_index_t* index, /*!< in: clustered index */
ulint type, /*!< in: TRX_UNDO_UPD_EXIST_REC,
TRX_UNDO_UPD_DEL_REC, or
TRX_UNDO_DEL_MARK_REC; in the last case,
only trx id and roll ptr fields are added to
the update vector */
trx_id_t trx_id, /*!< in: transaction id from this undorecord */
roll_ptr_t roll_ptr,/*!< in: roll pointer from this undo record */
byte info_bits,/*!< in: info bits from this undo record */
mem_heap_t* heap, /*!< in: memory heap from which the memory
needed is allocated */
upd_t** upd); /*!< out, own: update vector */
/** Report a RENAME TABLE operation.
@param[in,out] trx transaction
@param[in] table table that is being renamed
@return DB_SUCCESS or error code */
dberr_t trx_undo_report_rename(trx_t* trx, const dict_table_t* table)
MY_ATTRIBUTE((nonnull, warn_unused_result));
/***********************************************************************//**
Writes information to an undo log about an insert, update, or a delete marking
of a clustered index record. This information is used in a rollback of the
transaction and in consistent reads that must look to the history of this
transaction.
@return DB_SUCCESS or error code */
dberr_t
trx_undo_report_row_operation(
/*==========================*/
que_thr_t* thr, /*!< in: query thread */
dict_index_t* index, /*!< in: clustered index */
const dtuple_t* clust_entry, /*!< in: in the case of an insert,
index entry to insert into the
clustered index; in updates,
may contain a clustered index
record tuple that also contains
virtual columns of the table;
otherwise, NULL */
const upd_t* update, /*!< in: in the case of an update,
the update vector, otherwise NULL */
ulint cmpl_info, /*!< in: compiler info on secondary
index updates */
const rec_t* rec, /*!< in: case of an update or delete
marking, the record in the clustered
index; NULL if insert */
const rec_offs* offsets, /*!< in: rec_get_offsets(rec) */
roll_ptr_t* roll_ptr) /*!< out: DB_ROLL_PTR to the
undo log record */
MY_ATTRIBUTE((nonnull(1,2,8), warn_unused_result));
/** status bit used for trx_undo_prev_version_build() */
/** TRX_UNDO_PREV_IN_PURGE tells trx_undo_prev_version_build() that it
is being called purge view and we would like to get the purge record
even it is in the purge view (in normal case, it will return without
fetching the purge record */
static constexpr ulint TRX_UNDO_PREV_IN_PURGE = 1;
/** This tells trx_undo_prev_version_build() to fetch the old value in
the undo log (which is the after image for an update) */
static constexpr ulint TRX_UNDO_GET_OLD_V_VALUE = 2;
/** indicate a call from row_vers_old_has_index_entry() */
static constexpr ulint TRX_UNDO_CHECK_PURGEABILITY = 4;
/** Build a previous version of a clustered index record. The caller
must hold a latch on the index page of the clustered index record.
@param rec version of a clustered index record
@param index clustered index
@param offsets rec_get_offsets(rec, index)
@param heap memory heap from which the memory needed is
allocated
@param old_vers previous version or NULL if rec is the
first inserted version, or if history data
has been deleted (an error), or if the purge
could have removed the version
though it has not yet done so
@param v_heap memory heap used to create vrow
dtuple if it is not yet created. This heap
diffs from "heap" above in that it could be
prebuilt->old_vers_heap for selection
@param vrow virtual column info, if any
@param v_status status determine if it is going into this
function by purge thread or not.
And if we read "after image" of undo log
@return error code
@retval DB_SUCCESS if previous version was successfully built,
or if it was an insert or the undo record refers to the table before rebuild
@retval DB_MISSING_HISTORY if the history is missing */
dberr_t
trx_undo_prev_version_build(
const rec_t *rec,
dict_index_t *index,
rec_offs *offsets,
mem_heap_t *heap,
rec_t **old_vers,
mem_heap_t *v_heap,
dtuple_t **vrow,
ulint v_status);
/** Read from an undo log record a non-virtual column value.
@param ptr pointer to remaining part of the undo record
@param field stored field
@param len length of the field, or UNIV_SQL_NULL
@param orig_len original length of the locally stored part
of an externally stored column, or 0
@return remaining part of undo log record after reading these values */
const byte *trx_undo_rec_get_col_val(const byte *ptr, const byte **field,
uint32_t *len, uint32_t *orig_len);
/** Read virtual column value from undo log
@param[in] table the table
@param[in] ptr undo log pointer
@param[in,out] row the dtuple to fill
@param[in] in_purge whether this is called by purge */
void
trx_undo_read_v_cols(
const dict_table_t* table,
const byte* ptr,
dtuple_t* row,
bool in_purge);
/** Read virtual column index from undo log if the undo log contains such
info, and verify the column is still indexed, and output its position
@param[in] table the table
@param[in] ptr undo log pointer
@param[in] first_v_col if this is the first virtual column, which
has the version marker
@param[in,out] is_undo_log his function is used to parse both undo log,
and online log for virtual columns. So
check to see if this is undo log
@param[out] field_no the column number, or FIL_NULL if not indexed
@return remaining part of undo log record after reading these values */
const byte*
trx_undo_read_v_idx(
const dict_table_t* table,
const byte* ptr,
bool first_v_col,
bool* is_undo_log,
uint32_t* field_no);
/* Types of an undo log record: these have to be smaller than 16, as the
compilation info multiplied by 16 is ORed to this value in an undo log
record */
/** Undo log records for DDL operations
Note: special rollback and purge triggers exist for SYS_INDEXES records:
@see dict_drop_index_tree() */
enum trx_undo_ddl_type
{
/** RENAME TABLE (logging the old table name).
Because SYS_TABLES has PRIMARY KEY(NAME), the row-level undo log records
for SYS_TABLES cannot be distinguished from DROP TABLE, CREATE TABLE. */
TRX_UNDO_RENAME_TABLE= 9,
/** insert a metadata pseudo-record for instant ALTER TABLE */
TRX_UNDO_INSERT_METADATA= 10
};
/* DML operations */
#define TRX_UNDO_INSERT_REC 11 /* fresh insert into clustered index */
#define TRX_UNDO_UPD_EXIST_REC 12 /* update of a non-delete-marked
record */
#define TRX_UNDO_UPD_DEL_REC 13 /* update of a delete marked record to
a not delete marked record; also the
fields of the record can change */
#define TRX_UNDO_DEL_MARK_REC 14 /* delete marking of a record; fields
do not change */
/** Bulk insert operation. It is written only when the table is
under exclusive lock and the clustered index root page latch is being held,
and the clustered index is empty. Rollback will empty the table and
free the leaf segment of all indexes, re-create the new
leaf segment and re-initialize the root page alone. */
#define TRX_UNDO_EMPTY 15
#define TRX_UNDO_CMPL_INFO_MULT 16U /* compilation info is multiplied by
this and ORed to the type above */
#define TRX_UNDO_UPD_EXTERN 128U /* This bit can be ORed to type_cmpl
to denote that we updated external
storage fields: used by purge to
free the external storage */
/** The search tuple corresponding to TRX_UNDO_INSERT_METADATA */
extern const dtuple_t trx_undo_metadata;
/** Read the table id from an undo log record.
@param[in] rec Undo log record
@return table id stored as a part of undo log record */
inline table_id_t trx_undo_rec_get_table_id(const trx_undo_rec_t *rec)
{
rec+= 3;
mach_read_next_much_compressed(&rec);
return mach_read_next_much_compressed(&rec);
}