mariadb/storage/innobase/include/trx0rseg.h
Marko Mäkelä 0b47c126e3 MDEV-13542: Crashing on corrupted page is unhelpful
The approach to handling corruption that was chosen by Oracle in
commit 177d8b0c12
is not really useful. Not only did it actually fail to prevent InnoDB
from crashing, but it is making things worse by blocking attempts to
rescue data from or rebuild a partially readable table.

We will try to prevent crashes in a different way: by propagating
errors up the call stack. We will never mark the clustered index
persistently corrupted, so that data recovery may be attempted by
reading from the table, or by rebuilding the table.

This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
it was extensively tested with innodb_file_per_table=0 and a
non-autoextend system tablespace.

We should now avoid crashes in many cases, such as when a page
cannot be read or allocated, or an inconsistency is detected when
attempting to update multiple pages. We will not crash on double-free,
such as on the recovery of DDL in system tablespace in case something
was corrupted.

Crashes on corrupted data are still possible. The fault injection mechanism
that is introduced in the subsequent commit may help catch more of them.

buf_page_import_corrupt_failure: Remove the fault injection, and instead
corrupt some pages using Perl code in the tests.

btr_cur_pessimistic_insert(): Always reserve extents (except for the
change buffer), in order to prevent a subsequent allocation failure.

btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().

btr_assert_not_corrupted(), btr_corruption_report(): Remove.
Similar checks are already part of btr_block_get().

FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.

dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
trx_undo_page_get_s_latched(): Replaced with error-checking calls.

trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().

trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.

trx_sys_create_sys_pages(): Merged with trx_sysf_create().

dict_check_tablespaces_and_store_max_id(): Do not access
DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
Merge dict_check_sys_tables() with this function.

dir_pathname(): Replaces os_file_make_new_pathname().

row_undo_ins_remove_sec(): Do not modify the undo page by adding
a terminating NUL byte to the record.

btr_decryption_failed(): Report decryption failures

dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
dict_set_corrupted_index_cache_only(): Remove.

dict_set_corrupted(): Remove the constant parameter dict_locked=false.
Never flag the clustered index corrupted in SYS_INDEXES, because
that would deny further access to the table. It might be possible to
repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
no B-tree leaf page is corrupted.

dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
the callers to read dict_index_t::type only once.

dict_table_is_corrupted(): Remove.

dict_index_t::is_btree(): Determine if the index is a valid B-tree.

BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.

UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
assertion failures, but error codes being returned.

buf_corrupt_page_release(): Replaced with a direct call to
buf_pool.corrupted_evict().

fil_invalid_page_access_msg(): Never crash on an invalid read;
let the caller of buf_page_get_gen() decide.

btr_pcur_t::restore_position(): Propagate failure status to the caller
by returning CORRUPTED.

opt_search_plan_for_table(): Simplify the code.

row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
when no secondary indexes exist.

row_undo_mod_upd_exist_sec(): Simplify the code.

row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
if the clustered index (and therefore the table) is corrupted, similar
to what we do in row_insert_for_mysql().

fut_get_ptr(): Replace with buf_page_get_gen() calls.

buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
if the page is marked as freed. For other modes than
BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
we will return nullptr for freed pages, so that the callers
can be simplified. The purge of transaction history will be
a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
corrupted data.

buf_page_get_low(): Never crash on a corrupted page, but simply
return nullptr.

fseg_page_is_allocated(): Replaces fseg_page_is_free().

fts_drop_common_tables(): Return an error if the transaction
was rolled back.

fil_space_t::set_corrupted(): Report a tablespace as corrupted if
it was not reported already.

fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
out-of-bounds page access or other errors.

Clean up mtr_t::page_lock()

buf_page_get_low(): Validate the page identifier (to check for
recently read corrupted pages) after acquiring the page latch.

buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.

mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().

recv_sys_t::free_corrupted_page(): Only set_corrupt_fs()
if any log records exist for the page. We do not mind if read-ahead
produces corrupted (or all-zero) pages that were not actually needed
during recovery.

recv_recover_page(): Return whether the operation succeeded.

recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.

Thanks to Matthias Leich for testing this extensively and to the
authors of https://rr-project.org for making it easy to diagnose
and fix any failures that were found during the testing.
2022-06-06 14:03:22 +03:00

312 lines
12 KiB
C++

/*****************************************************************************
Copyright (c) 1996, 2016, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2017, 2022, MariaDB Corporation.
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA
*****************************************************************************/
/**************************************************//**
@file include/trx0rseg.h
Rollback segment
Created 3/26/1996 Heikki Tuuri
*******************************************************/
#pragma once
#include "trx0types.h"
#include "fut0lst.h"
/** Create a rollback segment header.
@param[in,out] space system, undo, or temporary tablespace
@param[in] rseg_id rollback segment identifier
@param[in] max_trx_id new value of TRX_RSEG_MAX_TRX_ID
@param[in,out] mtr mini-transaction
@param[out] err error code
@return the created rollback segment
@retval nullptr on failure */
buf_block_t *trx_rseg_header_create(fil_space_t *space, ulint rseg_id,
trx_id_t max_trx_id, mtr_t *mtr,
dberr_t *err)
MY_ATTRIBUTE((nonnull, warn_unused_result));
/** Initialize or recover the rollback segments at startup. */
dberr_t trx_rseg_array_init();
/** Create the temporary rollback segments. */
dberr_t trx_temp_rseg_create(mtr_t *mtr);
/* Number of undo log slots in a rollback segment file copy */
#define TRX_RSEG_N_SLOTS (srv_page_size / 16)
/* Maximum number of transactions supported by a single rollback segment */
#define TRX_RSEG_MAX_N_TRXS (TRX_RSEG_N_SLOTS / 2)
/** The rollback segment memory object */
struct alignas(CPU_LEVEL1_DCACHE_LINESIZE) trx_rseg_t
{
/** tablespace containing the rollback segment; constant after init() */
fil_space_t *space;
/** latch protecting everything except page_no, space */
srw_spin_lock latch;
/** rollback segment header page number; constant after init() */
uint32_t page_no;
/** length of the TRX_RSEG_HISTORY list (number of transactions) */
uint32_t history_size;
private:
/** Reference counter to track rseg allocated transactions,
with SKIP and NEEDS_PURGE flags. */
std::atomic<uint32_t> ref;
/** Whether undo tablespace truncation is pending */
static constexpr uint32_t SKIP= 1;
/** Whether the log segment needs purge */
static constexpr uint32_t NEEDS_PURGE= 2;
/** Transaction reference count multiplier */
static constexpr uint32_t REF= 4;
uint32_t ref_load() const { return ref.load(std::memory_order_relaxed); }
/** Set a bit in ref */
template<bool needs_purge> void ref_set()
{
static_assert(SKIP == 1U << 0, "compatibility");
static_assert(NEEDS_PURGE == 1U << 1, "compatibility");
#if defined __GNUC__ && (defined __i386__ || defined __x86_64__)
if (needs_purge)
__asm__ __volatile__("lock btsl $1, %0" : "+m" (ref));
else
__asm__ __volatile__("lock btsl $0, %0" : "+m" (ref));
#elif defined _MSC_VER && (defined _M_IX86 || defined _M_X64)
_interlockedbittestandset(reinterpret_cast<volatile long*>(&ref),
needs_purge);
#else
ref.fetch_or(needs_purge ? NEEDS_PURGE : SKIP, std::memory_order_relaxed);
#endif
}
/** Clear a bit in ref */
template<bool needs_purge> void ref_reset()
{
static_assert(SKIP == 1U << 0, "compatibility");
static_assert(NEEDS_PURGE == 1U << 1, "compatibility");
#if defined __GNUC__ && (defined __i386__ || defined __x86_64__)
if (needs_purge)
__asm__ __volatile__("lock btrl $1, %0" : "+m" (ref));
else
__asm__ __volatile__("lock btrl $0, %0" : "+m" (ref));
#elif defined _MSC_VER && (defined _M_IX86 || defined _M_X64)
_interlockedbittestandreset(reinterpret_cast<volatile long*>(&ref),
needs_purge);
#else
ref.fetch_and(needs_purge ? ~NEEDS_PURGE : ~SKIP,
std::memory_order_relaxed);
#endif
}
public:
/** Initialize the fields that are not zero-initialized. */
void init(fil_space_t *space, uint32_t page);
/** Reinitialize the fields on undo tablespace truncation. */
void reinit(uint32_t page);
/** Clean up. */
void destroy();
/** Note that undo tablespace truncation was started. */
void set_skip_allocation() { ut_ad(is_persistent()); ref_set<false>(); }
/** Note that undo tablespace truncation was completed. */
void clear_skip_allocation()
{
ut_ad(is_persistent());
#if defined DBUG_OFF
ref_reset<false>();
#else
ut_d(auto r=) ref.fetch_and(~SKIP, std::memory_order_relaxed);
ut_ad(r == SKIP);
#endif
}
/** Note that the rollback segment requires purge. */
void set_needs_purge() { ref_set<true>(); }
/** Note that the rollback segment will not require purge. */
void clear_needs_purge() { ref_reset<true>(); }
/** @return whether the segment is marked for undo truncation */
bool skip_allocation() const { return ref_load() & SKIP; }
/** @return whether the segment needs purge */
bool needs_purge() const { return ref_load() & NEEDS_PURGE; }
/** Increment the reference count */
void acquire()
{ ut_d(auto r=) ref.fetch_add(REF); ut_ad(!(r & SKIP)); }
/** Increment the reference count if possible
@retval true if the reference count was incremented
@retval false if skip_allocation() holds */
bool acquire_if_available()
{
uint32_t r= 0;
while (!ref.compare_exchange_weak(r, r + REF,
std::memory_order_relaxed,
std::memory_order_relaxed))
if (r & SKIP)
return false;
return true;
}
/** Decrement the reference count */
void release()
{
ut_d(const auto r=)
ref.fetch_sub(REF, std::memory_order_relaxed);
ut_ad(r >= REF);
}
/** @return whether references exist */
bool is_referenced() const { return ref_load() >= REF; }
/** current size in pages */
uint32_t curr_size;
/** List of undo logs (transactions) */
UT_LIST_BASE_NODE_T(trx_undo_t) undo_list;
/** List of undo log segments cached for fast reuse */
UT_LIST_BASE_NODE_T(trx_undo_t) undo_cached;
/** Last not yet purged undo log header; FIL_NULL if all purged */
uint32_t last_page_no;
/** trx_t::no | last_offset << 48 */
uint64_t last_commit_and_offset;
/** @return the commit ID of the last committed transaction */
trx_id_t last_trx_no() const
{ return last_commit_and_offset & ((1ULL << 48) - 1); }
/** @return header offset of the last committed transaction */
uint16_t last_offset() const
{ return static_cast<uint16_t>(last_commit_and_offset >> 48); }
void set_last_commit(uint16_t last_offset, trx_id_t trx_no)
{
last_commit_and_offset= static_cast<uint64_t>(last_offset) << 48 | trx_no;
}
/** @return the page identifier */
page_id_t page_id() const { return page_id_t{space->id, page_no}; }
/** @return the rollback segment header page, exclusively latched */
buf_block_t *get(mtr_t *mtr, dberr_t *err) const;
/** @return whether the rollback segment is persistent */
bool is_persistent() const
{
ut_ad(space == fil_system.temp_space || space == fil_system.sys_space ||
(srv_undo_space_id_start > 0 &&
space->id >= srv_undo_space_id_start &&
space->id <= srv_undo_space_id_start + TRX_SYS_MAX_UNDO_SPACES));
ut_ad(space == fil_system.temp_space || space == fil_system.sys_space ||
!srv_was_started ||
(srv_undo_space_id_start > 0 &&
space->id >= srv_undo_space_id_start
&& space->id <= srv_undo_space_id_start +
srv_undo_tablespaces_open));
return space->id != SRV_TMP_SPACE_ID;
}
};
/* Undo log segment slot in a rollback segment header */
/*-------------------------------------------------------------*/
#define TRX_RSEG_SLOT_PAGE_NO 0 /* Page number of the header page of
an undo log segment */
/*-------------------------------------------------------------*/
/* Slot size */
#define TRX_RSEG_SLOT_SIZE 4
/* The offset of the rollback segment header on its page */
#define TRX_RSEG FSEG_PAGE_DATA
/* Transaction rollback segment header */
/*-------------------------------------------------------------*/
/** 0xfffffffe = pre-MariaDB 10.3.5 format; 0=MariaDB 10.3.5 or later */
#define TRX_RSEG_FORMAT 0
/** Number of pages in the TRX_RSEG_HISTORY list */
#define TRX_RSEG_HISTORY_SIZE 4
/** Committed transaction logs that have not been purged yet */
#define TRX_RSEG_HISTORY 8
#define TRX_RSEG_FSEG_HEADER (8 + FLST_BASE_NODE_SIZE)
/* Header for the file segment where
this page is placed */
#define TRX_RSEG_UNDO_SLOTS (8 + FLST_BASE_NODE_SIZE + FSEG_HEADER_SIZE)
/* Undo log segment slots */
/** Maximum transaction ID (valid only if TRX_RSEG_FORMAT is 0) */
#define TRX_RSEG_MAX_TRX_ID (TRX_RSEG_UNDO_SLOTS + TRX_RSEG_N_SLOTS \
* TRX_RSEG_SLOT_SIZE)
/** 8 bytes offset within the binlog file */
#define TRX_RSEG_BINLOG_OFFSET TRX_RSEG_MAX_TRX_ID + 8
/** MySQL log file name, 512 bytes, including terminating NUL
(valid only if TRX_RSEG_FORMAT is 0).
If no binlog information is present, the first byte is NUL. */
#define TRX_RSEG_BINLOG_NAME TRX_RSEG_MAX_TRX_ID + 16
/** Maximum length of binlog file name, including terminating NUL, in bytes */
#define TRX_RSEG_BINLOG_NAME_LEN 512
#ifdef WITH_WSREP
# include "trx0xa.h"
/** Update the WSREP XID information in rollback segment header.
@param[in,out] rseg_header rollback segment header
@param[in] xid WSREP XID
@param[in,out] mtr mini-transaction */
void
trx_rseg_update_wsrep_checkpoint(
buf_block_t* rseg_header,
const XID* xid,
mtr_t* mtr);
/** Update WSREP checkpoint XID in first rollback segment header
as part of wsrep_set_SE_checkpoint() when it is guaranteed that there
are no wsrep transactions committing.
If the UUID part of the WSREP XID does not match to the UUIDs of XIDs already
stored into rollback segments, the WSREP XID in all the remaining rollback
segments will be reset.
@param[in] xid WSREP XID */
void trx_rseg_update_wsrep_checkpoint(const XID* xid);
/** Recover the latest WSREP checkpoint XID.
@param[out] xid WSREP XID
@return whether the WSREP XID was found */
bool trx_rseg_read_wsrep_checkpoint(XID& xid);
#endif /* WITH_WSREP */
/** Read the page number of an undo log slot.
@param[in] rseg_header rollback segment header
@param[in] n slot number */
inline uint32_t trx_rsegf_get_nth_undo(const buf_block_t *rseg_header, ulint n)
{
ut_ad(n < TRX_RSEG_N_SLOTS);
return mach_read_from_4(TRX_RSEG + TRX_RSEG_UNDO_SLOTS +
n * TRX_RSEG_SLOT_SIZE + rseg_header->page.frame);
}
/** Upgrade a rollback segment header page to MariaDB 10.3 format.
@param[in,out] rseg_header rollback segment header page
@param[in,out] mtr mini-transaction */
void trx_rseg_format_upgrade(buf_block_t *rseg_header, mtr_t *mtr);
/** Update the offset information about the end of the binlog entry
which corresponds to the transaction just being committed.
In a replication slave, this updates the master binlog position
up to which replication has proceeded.
@param[in,out] rseg_header rollback segment header
@param[in] trx committing transaction
@param[in,out] mtr mini-transaction */
void trx_rseg_update_binlog_offset(buf_block_t *rseg_header, const trx_t *trx,
mtr_t *mtr);