mirror of
https://github.com/MariaDB/server.git
synced 2025-01-16 20:12:31 +01:00
ab0190101b
Until now, the attribute EXTENDED of CHECK TABLE was ignored by InnoDB, and InnoDB only counted the records in each index according to the current read view. Unless the attribute QUICK was specified, the function btr_validate_index() would be invoked to validate the B-tree structure (the sibling and child links between index pages).

The EXTENDED check will not only count all index records according to the current read view, but also ensure that any delete-marked records in the clustered index are waiting for the purge of history, and that all secondary index records point to a version of the clustered index record that is waiting for the purge of history. In other words, no index may contain orphan records. Normal MVCC reads and the non-EXTENDED version of CHECK TABLE would ignore these orphans.

Unpurged records merely result in warnings (at most one per index), not errors, and no indexes will be flagged as corrupted due to such garbage. It will remain possible to SELECT data from such indexes or tables (which will skip such records) or to rebuild the table to reclaim some space.

We introduce purge_sys.end_view that will be (almost) a copy of purge_sys.view at the end of a batch of purging committed transaction history. It is not an exact copy, because if the size of a purge batch is limited by innodb_purge_batch_size, some records that purge_sys.view would allow to be purged will be left over for subsequent batches.

The purge_sys.view is relevant in the purge of committed transaction history, to determine if records are safe to remove. The new purge_sys.end_view is relevant in MVCC operations and in CHECK TABLE ... EXTENDED. It tells which undo log records are safe to access (have not been discarded at the end of a purge batch).

purge_sys.clone_oldest_view<true>(): In trx_lists_init_at_db_start(), clone the oldest read view similar to purge_sys_t::clone_end_view() so that CHECK TABLE ... EXTENDED will not report bogus failures between InnoDB restart and the completed purge of committed transaction history.

purge_sys_t::is_purgeable(): Replaces purge_sys_t::changes_visible() in the case that purge_sys.latch will not be held by the caller. Among other things, this guards access to BLOBs. It is not safe to dereference any BLOBs of a delete-marked purgeable record, because they may have already been freed.

purge_sys_t::view_guard::view(): Return a reference to purge_sys.view that will be protected by purge_sys.latch, held by purge_sys_t::view_guard.

purge_sys_t::end_view_guard::view(): Return a reference to purge_sys.end_view while it is protected by purge_sys.end_latch.

Whenever a thread needs to retrieve an older version of a clustered index record, it will hold a page latch on the clustered index page and potentially also on a secondary index page that points to the clustered index page. If these pages contain purgeable records that would be accessed by a currently running purge batch, the progress of the purge batch would be blocked by the page latches. Hence, it is safe to make a copy of purge_sys.end_view while holding an index page latch, and consult the copy of the view to determine whether a record should already have been purged.

btr_validate_index(): Remove a redundant check.

row_check_index_match(): Check if a secondary index record and a version of a clustered index record match each other.

row_check_index(): Replaces row_scan_index_for_mysql(). Count the records in each index directly, duplicating the relevant logic from row_search_mvcc(). Initialize check_table_extended_view for CHECK ... EXTENDED while holding an index leaf page latch. If we encounter an orphan record, the copy of purge_sys.end_view that we make is safe for visibility checks, and trx_undo_get_undo_rec() will check for the safety to access each undo log record. Should that check fail, we should return DB_MISSING_HISTORY to report a corrupted index.

The EXTENDED check tries to match each secondary index record with every available clustered index record version, by duplicating the logic of row_vers_build_for_consistent_read() and invoking trx_undo_prev_version_build() directly. Before invoking row_check_index_match() on delete-marked clustered index record versions, we will consult purge_sys.is_purgeable() in order to avoid accessing freed BLOBs.

We will always check that the DB_TRX_ID or PAGE_MAX_TRX_ID does not exceed the global maximum. Orphan secondary index records will be flagged only if everything up to PAGE_MAX_TRX_ID has been purged. We warn also about clustered index records whose nonzero DB_TRX_ID should have been reset in purge or rollback.

trx_set_rw_mode(): Move an assertion from ReadView::set_creator_trx_id().

trx_undo_prev_version_build(): Remove two debug-only parameters, and return an error code instead of a Boolean.

trx_undo_get_undo_rec(): Return a pointer to the undo log record, or nullptr if one cannot be retrieved. Instead of consulting the purge_sys.view, consult the purge_sys.end_view to determine which records can be accessed.

trx_undo_get_rec_if_purgeable(): A variant of trx_undo_get_undo_rec() that will consult purge_sys.view instead of purge_sys.end_view.

TRX_UNDO_CHECK_PURGEABILITY: A new parameter to trx_undo_prev_version_build(), passed by row_vers_old_has_index_entry() so that purge_sys.view instead of purge_sys.end_view will be consulted to determine whether a secondary index record may be safely purged.

row_upd_changes_disowned_external(): Remove. This should be more expensive than briefly latching purge_sys in trx_undo_prev_version_build() (which may make use of transactional memory).

row_sel_reset_old_vers_heap(): New function, split from row_sel_build_prev_vers_for_mysql().

row_sel_build_prev_vers_for_mysql(): Reorder some parameters to simplify the call to row_sel_reset_old_vers_heap().

row_search_for_mysql(): Replaced with direct calls to row_search_mvcc().

sel_node_get_nth_plan(): Define inline in row0sel.h.

open_step(): Define at the call site, in simplified form.

sel_node_reset_cursor(): Merged with the only caller open_step().

---

ReadViewBase::check_trx_id_sanity(): Remove. Let us handle "future" DB_TRX_ID in a more meaningful way:

row_sel_clust_sees(): Return DB_SUCCESS if the record is visible, DB_SUCCESS_LOCKED_REC if it is invisible, and DB_CORRUPTION if the DB_TRX_ID is in the future.

row_undo_mod_must_purge(), row_undo_mod_clust(): Silently ignore corrupted DB_TRX_ID. We are in ROLLBACK, and we should have noticed that corruption when we were about to modify the record in the first place (leading us to refuse the operation).

row_vers_build_for_consistent_read(): Return DB_CORRUPTION if DB_TRX_ID is in the future.

Tested by: Matthias Leich
Reviewed by: Vladislav Lesin
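The three-way result that the commit message describes for row_sel_clust_sees() can be pictured with a toy model. This is only a minimal sketch, not the InnoDB implementation: toy_view, toy_result and toy_clust_sees() are hypothetical names, the read view is reduced to a single limit, and max_trx_id is assumed to mean "the next transaction ID that will be assigned".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for DB_SUCCESS, DB_SUCCESS_LOCKED_REC and
// DB_CORRUPTION as used by row_sel_clust_sees() in the commit message.
enum class toy_result { visible, invisible, corruption };

// Toy read view: every transaction ID below low_limit counts as visible.
// A real read view also tracks transactions that were active at creation.
struct toy_view { uint64_t low_limit; };

// Classify the DB_TRX_ID of a clustered index record.
// max_trx_id: assumed to be the next transaction ID to be assigned.
inline toy_result toy_clust_sees(const toy_view &view, uint64_t db_trx_id,
                                 uint64_t max_trx_id)
{
  assert(view.low_limit <= max_trx_id);
  if (db_trx_id >= max_trx_id)
    return toy_result::corruption;  // an ID "in the future" cannot be valid
  return db_trx_id < view.low_limit
    ? toy_result::visible : toy_result::invisible;
}
```

The point of the third outcome is that a caller can tell "try an older version" (invisible) apart from "stop and report the index" (corruption), instead of treating both as a failed visibility check.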
322 lines
13 KiB
C
/*****************************************************************************

Copyright (c) 1996, 2016, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2017, 2022, MariaDB Corporation.

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA

*****************************************************************************/

/**************************************************//**
@file include/trx0rec.h
Transaction undo log record

Created 3/26/1996 Heikki Tuuri
*******************************************************/

#pragma once

#include "trx0types.h"
#include "row0types.h"
#include "mtr0mtr.h"
#include "rem0types.h"
#include "page0types.h"
#include "row0log.h"
#include "que0types.h"

/***********************************************************************//**
Copies the undo record to the heap.
@param undo_rec record in an undo log page
@param heap     memory heap
@return copy of undo_rec
@retval nullptr if the undo log record is corrupted */
inline trx_undo_rec_t* trx_undo_rec_copy(const trx_undo_rec_t *undo_rec,
                                         mem_heap_t *heap)
{
  const size_t offset= ut_align_offset(undo_rec, srv_page_size);
  const size_t end= mach_read_from_2(undo_rec);
  if (end <= offset || end >= srv_page_size - FIL_PAGE_DATA_END)
    return nullptr;
  const size_t len= end - offset;
  trx_undo_rec_t *rec= static_cast<trx_undo_rec_t*>
    (mem_heap_dup(heap, undo_rec, len));
  mach_write_to_2(rec, len);
  return rec;
}

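The bounds check in trx_undo_rec_copy() can be tried out on a plain byte buffer. The following is a minimal sketch under stated assumptions, not the InnoDB code: toy_page_size, toy_page_trailer, toy_read_2() and toy_undo_rec_copy() are hypothetical names, the page size and trailer size are fixed illustrative values, and the check is slightly stricter than the original (`end < offset + 2` instead of `end <= offset`) so that the two length bytes always fit in the copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative values only; the real code uses srv_page_size and
// FIL_PAGE_DATA_END.
constexpr size_t toy_page_size= 16384;
constexpr size_t toy_page_trailer= 8;

// Read a 2-byte big-endian integer, in the spirit of mach_read_from_2().
inline size_t toy_read_2(const uint8_t *b)
{
  return (size_t{b[0]} << 8) | b[1];
}

// Copy the record starting at byte `offset` of `page`. As in the function
// above, the first two bytes of the record hold the page offset at which
// the record ends.
inline std::optional<std::vector<uint8_t>>
toy_undo_rec_copy(const std::vector<uint8_t> &page, size_t offset)
{
  assert(page.size() == toy_page_size);
  assert(offset + 2 <= toy_page_size);
  const size_t end= toy_read_2(&page[offset]);
  if (end < offset + 2 || end >= toy_page_size - toy_page_trailer)
    return std::nullopt;  // corrupted: the record has no valid extent
  const size_t len= end - offset;
  std::vector<uint8_t> rec(page.begin() + offset, page.begin() + end);
  // Like the original, replace the end offset with the record length,
  // because the copy no longer lives at a page offset.
  rec[0]= uint8_t(len >> 8);
  rec[1]= uint8_t(len);
  return rec;
}
```

Returning an empty optional plays the role of the nullptr return value: the caller has to handle a corrupted record explicitly rather than copy an arbitrary number of bytes.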
/**********************************************************************//**
Reads the undo log record number.
@return undo no */
inline undo_no_t trx_undo_rec_get_undo_no(const trx_undo_rec_t *undo_rec)
{
  return mach_u64_read_much_compressed(undo_rec + 3);
}

/**********************************************************************//**
Returns the start of the undo record data area. */
#define trx_undo_rec_get_ptr(undo_rec, undo_no) \
  ((undo_rec) + trx_undo_rec_get_offset(undo_no))

/**********************************************************************//**
Reads from an undo log record the general parameters.
@return remaining part of undo log record after reading these values */
const byte*
trx_undo_rec_get_pars(
/*==================*/
        const trx_undo_rec_t*   undo_rec,       /*!< in: undo log record */
        ulint*                  type,           /*!< out: undo record type:
                                                TRX_UNDO_INSERT_REC, ... */
        ulint*                  cmpl_info,      /*!< out: compilation info,
                                                relevant only for update type
                                                records */
        bool*                   updated_extern, /*!< out: true if we updated an
                                                externally stored field */
        undo_no_t*              undo_no,        /*!< out: undo log record number */
        table_id_t*             table_id)       /*!< out: table id */
        MY_ATTRIBUTE((nonnull));

/*******************************************************************//**
Builds a row reference from an undo log record.
@return pointer to remaining part of undo record */
const byte*
trx_undo_rec_get_row_ref(
/*=====================*/
        const byte*     ptr,    /*!< in: remaining part of a copy of an undo log
                                record, at the start of the row reference;
                                NOTE that this copy of the undo log record must
                                be preserved as long as the row reference is
                                used, as we do NOT copy the data in the
                                record! */
        dict_index_t*   index,  /*!< in: clustered index */
        const dtuple_t**ref,    /*!< out, own: row reference */
        mem_heap_t*     heap)   /*!< in: memory heap from which the memory
                                needed is allocated */
        MY_ATTRIBUTE((nonnull));
/**********************************************************************//**
Reads from an undo log update record the system field values of the old
version.
@return remaining part of undo log record after reading these values */
byte*
trx_undo_update_rec_get_sys_cols(
/*=============================*/
        const byte*     ptr,            /*!< in: remaining part of undo
                                        log record after reading
                                        general parameters */
        trx_id_t*       trx_id,         /*!< out: trx id */
        roll_ptr_t*     roll_ptr,       /*!< out: roll ptr */
        byte*           info_bits);     /*!< out: info bits state */
/*******************************************************************//**
Builds an update vector based on a remaining part of an undo log record.
@return remaining part of the record, NULL if an error was detected, which
means that the record is corrupted */
byte*
trx_undo_update_rec_get_update(
/*===========================*/
        const byte*     ptr,    /*!< in: remaining part in update undo log
                                record, after reading the row reference
                                NOTE that this copy of the undo log record must
                                be preserved as long as the update vector is
                                used, as we do NOT copy the data in the
                                record! */
        dict_index_t*   index,  /*!< in: clustered index */
        ulint           type,   /*!< in: TRX_UNDO_UPD_EXIST_REC,
                                TRX_UNDO_UPD_DEL_REC, or
                                TRX_UNDO_DEL_MARK_REC; in the last case,
                                only trx id and roll ptr fields are added to
                                the update vector */
        trx_id_t        trx_id, /*!< in: transaction id from this undo record */
        roll_ptr_t      roll_ptr,/*!< in: roll pointer from this undo record */
        byte            info_bits,/*!< in: info bits from this undo record */
        mem_heap_t*     heap,   /*!< in: memory heap from which the memory
                                needed is allocated */
        upd_t**         upd);   /*!< out, own: update vector */
/** Report a RENAME TABLE operation.
@param[in,out]  trx     transaction
@param[in]      table   table that is being renamed
@return DB_SUCCESS or error code */
dberr_t trx_undo_report_rename(trx_t* trx, const dict_table_t* table)
        MY_ATTRIBUTE((nonnull, warn_unused_result));
/***********************************************************************//**
Writes information to an undo log about an insert, update, or a delete marking
of a clustered index record. This information is used in a rollback of the
transaction and in consistent reads that must look to the history of this
transaction.
@return DB_SUCCESS or error code */
dberr_t
trx_undo_report_row_operation(
/*==========================*/
        que_thr_t*      thr,            /*!< in: query thread */
        dict_index_t*   index,          /*!< in: clustered index */
        const dtuple_t* clust_entry,    /*!< in: in the case of an insert,
                                        index entry to insert into the
                                        clustered index; in updates,
                                        may contain a clustered index
                                        record tuple that also contains
                                        virtual columns of the table;
                                        otherwise, NULL */
        const upd_t*    update,         /*!< in: in the case of an update,
                                        the update vector, otherwise NULL */
        ulint           cmpl_info,      /*!< in: compilation info on secondary
                                        index updates */
        const rec_t*    rec,            /*!< in: in the case of an update or
                                        delete marking, the record in the
                                        clustered index; NULL if insert */
        const rec_offs* offsets,        /*!< in: rec_get_offsets(rec) */
        roll_ptr_t*     roll_ptr)       /*!< out: DB_ROLL_PTR to the
                                        undo log record */
        MY_ATTRIBUTE((nonnull(1,2,8), warn_unused_result));

/** Status bits used for trx_undo_prev_version_build() */

/** TRX_UNDO_PREV_IN_PURGE tells trx_undo_prev_version_build() that it
is being called by the purge thread and that we would like to get the
purge record even if it is in the purge view (in the normal case, it will
return without fetching the purge record) */
static constexpr ulint TRX_UNDO_PREV_IN_PURGE = 1;

/** This tells trx_undo_prev_version_build() to fetch the old value in
the undo log (which is the after image for an update) */
static constexpr ulint TRX_UNDO_GET_OLD_V_VALUE = 2;

/** Indicates a call from row_vers_old_has_index_entry() */
static constexpr ulint TRX_UNDO_CHECK_PURGEABILITY = 4;

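The three constants above are independent bits, so a caller can combine them with bitwise OR and the callee can test each one separately. The sketch below illustrates that, together with the rule stated in the commit message at the top of this page (TRX_UNDO_CHECK_PURGEABILITY selects purge_sys.view instead of purge_sys.end_view). The enum toy_view_choice and the function toy_choose_view() are hypothetical; the real decision is made inside trx_undo_prev_version_build() and its helpers, whose code is not shown here.

```cpp
#include <cassert>

typedef unsigned long ulint;  // stand-in for the InnoDB type of the same name

// Same values as in the header above.
static constexpr ulint TRX_UNDO_PREV_IN_PURGE = 1;
static constexpr ulint TRX_UNDO_GET_OLD_V_VALUE = 2;
static constexpr ulint TRX_UNDO_CHECK_PURGEABILITY = 4;

// Hypothetical: which read view is consulted for an undo log access.
enum class toy_view_choice { view, end_view };

// Pick the view according to the v_status flags (illustration only).
inline toy_view_choice toy_choose_view(ulint v_status)
{
  assert(v_status < 8);  // only the three bits above are defined
  return (v_status & TRX_UNDO_CHECK_PURGEABILITY)
    ? toy_view_choice::view       // purge decides whether removal is safe
    : toy_view_choice::end_view;  // readers ask whether access is safe
}
```

Because each flag occupies its own bit, adding TRX_UNDO_GET_OLD_V_VALUE to a call does not change which view is chosen.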
/** Build a previous version of a clustered index record. The caller
must hold a latch on the index page of the clustered index record.
@param rec      version of a clustered index record
@param index    clustered index
@param offsets  rec_get_offsets(rec, index)
@param heap     memory heap from which the memory needed is
                allocated
@param old_vers previous version or NULL if rec is the
                first inserted version, or if history data
                has been deleted (an error), or if the purge
                could have removed the version
                though it has not yet done so
@param v_heap   memory heap used to create vrow
                dtuple if it is not yet created. This heap
                differs from "heap" above in that it could be
                prebuilt->old_vers_heap for selection
@param vrow     virtual column info, if any
@param v_status status flags: whether the function is being
                called by the purge thread, and whether
                to read the "after image" of the undo log
@return error code
@retval DB_SUCCESS if previous version was successfully built,
or if it was an insert or the undo record refers to the table before rebuild
@retval DB_MISSING_HISTORY if the history is missing */
dberr_t
trx_undo_prev_version_build(
        const rec_t     *rec,
        dict_index_t    *index,
        rec_offs        *offsets,
        mem_heap_t      *heap,
        rec_t           **old_vers,
        mem_heap_t      *v_heap,
        dtuple_t        **vrow,
        ulint           v_status);

/** Read from an undo log record a non-virtual column value.
@param ptr      pointer to remaining part of the undo record
@param field    stored field
@param len      length of the field, or UNIV_SQL_NULL
@param orig_len original length of the locally stored part
                of an externally stored column, or 0
@return remaining part of undo log record after reading these values */
const byte *trx_undo_rec_get_col_val(const byte *ptr, const byte **field,
                                     uint32_t *len, uint32_t *orig_len);

/** Read virtual column value from undo log
@param[in]      table           the table
@param[in]      ptr             undo log pointer
@param[in,out]  row             the dtuple to fill
@param[in]      in_purge        whether this is called by purge */
void
trx_undo_read_v_cols(
        const dict_table_t*     table,
        const byte*             ptr,
        dtuple_t*               row,
        bool                    in_purge);

/** Read virtual column index from undo log if the undo log contains such
info, and verify the column is still indexed, and output its position
@param[in]      table           the table
@param[in]      ptr             undo log pointer
@param[in]      first_v_col     if this is the first virtual column, which
                                has the version marker
@param[in,out]  is_undo_log     this function is used to parse both the undo
                                log and the online log for virtual columns,
                                so check to see if this is the undo log
@param[out]     field_no        the column number, or FIL_NULL if not indexed
@return remaining part of undo log record after reading these values */
const byte*
trx_undo_read_v_idx(
        const dict_table_t*     table,
        const byte*             ptr,
        bool                    first_v_col,
        bool*                   is_undo_log,
        uint32_t*               field_no);

/* Types of an undo log record: these have to be smaller than 16, as the
compilation info multiplied by 16 is ORed to this value in an undo log
record */

/** Undo log records for DDL operations

Note: special rollback and purge triggers exist for SYS_INDEXES records:
@see dict_drop_index_tree() */
enum trx_undo_ddl_type
{
  /** RENAME TABLE (logging the old table name).

  Because SYS_TABLES has PRIMARY KEY(NAME), the row-level undo log records
  for SYS_TABLES cannot be distinguished from DROP TABLE, CREATE TABLE. */
  TRX_UNDO_RENAME_TABLE= 9,
  /** insert a metadata pseudo-record for instant ALTER TABLE */
  TRX_UNDO_INSERT_METADATA= 10
};

/* DML operations */
#define TRX_UNDO_INSERT_REC     11      /* fresh insert into clustered index */
#define TRX_UNDO_UPD_EXIST_REC  12      /* update of a non-delete-marked
                                        record */
#define TRX_UNDO_UPD_DEL_REC    13      /* update of a delete marked record to
                                        a not delete marked record; also the
                                        fields of the record can change */
#define TRX_UNDO_DEL_MARK_REC   14      /* delete marking of a record; fields
                                        do not change */
/** Bulk insert operation. It is written only when the table is
under exclusive lock and the clustered index root page latch is being held,
and the clustered index is empty. Rollback will empty the table and
free the leaf segment of all indexes, re-create the new
leaf segment and re-initialize the root page alone. */
#define TRX_UNDO_EMPTY          15

#define TRX_UNDO_CMPL_INFO_MULT 16U     /* compilation info is multiplied by
                                        this and ORed to the type above */
#define TRX_UNDO_UPD_EXTERN     128U    /* This bit can be ORed to type_cmpl
                                        to denote that we updated external
                                        storage fields: used by purge to
                                        free the external storage */

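The comments above describe how the record type, the compilation info and the "external storage updated" bit share one value (type_cmpl). The following is a minimal sketch of that packing, assuming the combined value fits in a single byte; toy_type_cmpl, toy_encode() and toy_decode() are hypothetical names and not part of InnoDB.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned TOY_CMPL_INFO_MULT= 16;  // mirrors TRX_UNDO_CMPL_INFO_MULT
constexpr unsigned TOY_UPD_EXTERN= 128;     // mirrors TRX_UNDO_UPD_EXTERN

struct toy_type_cmpl
{
  unsigned type;        // record type, smaller than 16
  unsigned cmpl_info;   // compilation info
  bool updated_extern;  // whether externally stored fields were updated
};

// Pack the three values: type in the low 4 bits, cmpl_info multiplied
// by 16 above it, and the "extern" flag in the top bit.
inline uint8_t toy_encode(unsigned type, unsigned cmpl_info,
                          bool updated_extern)
{
  assert(type < TOY_CMPL_INFO_MULT);
  assert(cmpl_info * TOY_CMPL_INFO_MULT < TOY_UPD_EXTERN);
  return uint8_t(type | (cmpl_info * TOY_CMPL_INFO_MULT)
                 | (updated_extern ? TOY_UPD_EXTERN : 0U));
}

// Unpack in the reverse order: strip the flag bit first, then split.
inline toy_type_cmpl toy_decode(uint8_t b)
{
  const bool ext= (b & TOY_UPD_EXTERN) != 0;
  const unsigned rest= b & ~TOY_UPD_EXTERN;
  return {rest % TOY_CMPL_INFO_MULT, rest / TOY_CMPL_INFO_MULT, ext};
}
```

This also shows why the record types must stay below 16: any larger value would spill into the bits reserved for the compilation info.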
/** The search tuple corresponding to TRX_UNDO_INSERT_METADATA */
extern const dtuple_t trx_undo_metadata;

/** Read the table id from an undo log record.
@param[in]      rec     Undo log record
@return table id stored as a part of undo log record */
inline table_id_t trx_undo_rec_get_table_id(const trx_undo_rec_t *rec)
{
  rec+= 3;
  mach_read_next_much_compressed(&rec);
  return mach_read_next_much_compressed(&rec);
}