mariadb/storage/innodb_plugin/include/fsp0fsp.h
Marko Mäkelä ae309bd336 Bug#13721257 RACE CONDITION IN UPDATES OR INSERTS OF WIDE RECORDS
This bug was originally filed and fixed as Bug#12612184. The original
fix was buggy, and it was patched by Bug#12704861. Also that patch was
buggy (potentially breaking crash recovery), and both fixes were
reverted.

This fix was not ported to the built-in InnoDB of MySQL 5.1, because
the function signatures of many core functions are different from
InnoDB Plugin and later versions. The block allocation routines and
their callers would have to changed so that they handle block
descriptors instead of page frames.

When a record is updated so that its size grows, non-updated columns
can be selected for external (off-page) storage. The bug is that the
initially inserted updated record contains an all-zero BLOB pointer to
the field that was not updated. Only after the BLOB pages have been
allocated and written, the valid pointer can be written to the record.

Between the release of the page latch in mtr_commit(mtr) after
btr_cur_pessimistic_update() and the re-latching of the page in
btr_pcur_restore_position(), other threads can see the invalid BLOB
pointer consisting of 20 zero bytes. Moreover, if the system crashes
at this point, the situation could persist after crash recovery, and
the contents of the non-updated column would be permanently lost.

The problem is amplified by the ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPRESSED that were introduced in
innodb_file_format=barracuda in InnoDB Plugin, but the bug does exist
in all InnoDB versions.

The fix is as follows. After a pessimistic B-tree operation that needs
to write out off-page columns, allocate the pages for these columns in
the mini-transaction that performed the B-tree operation (btr_mtr),
but write the pages in a separate mini-transaction (blob_mtr). Do
mtr_commit(blob_mtr) before mtr_commit(btr_mtr). A quirk: Do not reuse
pages that were previously freed in btr_mtr. Only write the off-page
columns to 'fresh' pages.

In this way, crash recovery will see redo log entries for blob_mtr
before any redo log entry for btr_mtr. It will apply the BLOB page
writes to pages that were marked free at that point. If crash recovery
fails to see all of the btr_mtr redo log, there will be some
unreachable BLOB data in free pages, but the B-tree will be in a
consistent state.

btr_page_alloc_low(): Renamed from btr_page_alloc(). Add the parameter
init_mtr. Return an allocated block, or NULL. If init_mtr!=mtr but
the page was already X-latched in mtr, do not initialize the page.

btr_page_alloc(): Wrapper for btr_page_alloc_for_ibuf() and
btr_page_alloc_low().

btr_page_free(): Add a debug assertion that the page was a B-tree page.

btr_lift_page_up(): Return the father block.

btr_compress(), btr_cur_compress_if_useful(): Add the parameter ibool
adjust, for adjusting the cursor position.

btr_cur_pessimistic_update(): Preserve the cursor position when
big_rec will be written and the new flag BTR_KEEP_POS_FLAG is defined.
Remove a duplicate rec_get_offsets() call. Keep the X-latch on
index->lock when big_rec is needed.

btr_store_big_rec_extern_fields(): Replace update_inplace with
an operation code, and local_mtr with btr_mtr. When not doing a
fresh insert and btr_mtr has freed pages, put aside any pages that
were previously X-latched in btr_mtr, and free the pages after
writing out all data. The data must be written to 'fresh' pages,
because btr_mtr will be committed and written to the redo log after
the BLOB writes have been written to the redo log.

btr_blob_op_is_update(): Check if an operation passed to
btr_store_big_rec_extern_fields() is an update or insert-by-update.

fseg_alloc_free_page_low(), fsp_alloc_free_page(),
fseg_alloc_free_extent(), fseg_alloc_free_page_general(): Add the
parameter init_mtr. Return an allocated block, or NULL. If
init_mtr!=mtr but the page was already X-latched in mtr, do not
initialize the page.

xdes_get_descriptor_with_space_hdr(): Assert that the file space
header is being X-latched.

fsp_alloc_from_free_frag(): Refactored from fsp_alloc_free_page().

fsp_page_create(): New function, for allocating, X-latching and
potentially initializing a page. If init_mtr!=mtr but the page was
already X-latched in mtr, do not initialize the page.

fsp_free_page(): Add ut_ad(0) to the error outcomes.

fsp_free_page(), fseg_free_page_low(): Increment mtr->n_freed_pages.

fsp_alloc_seg_inode_page(), fseg_create_general(): Assert that the
page was not previously X-latched in the mini-transaction. A file
segment or inode page should never be allocated in the middle of an
mini-transaction that frees pages, such as btr_cur_pessimistic_delete().

fseg_alloc_free_page_low(): If the hinted page was allocated, skip the
check if the tablespace should be extended. Return NULL instead of
FIL_NULL on failure. Remove the flag frag_page_allocated. Instead,
return directly, because the page would already have been initialized.

fseg_find_free_frag_page_slot() would return ULINT_UNDEFINED on error,
not FIL_NULL. Correct a bogus assertion.

fseg_alloc_free_page(): Redefine as a wrapper macro around
fseg_alloc_free_page_general().

buf_block_buf_fix_inc(): Move the definition from the buf0buf.ic to
buf0buf.h, so that it can be called from other modules.

mtr_t: Add n_freed_pages (number of pages that have been freed).

page_rec_get_nth_const(), page_rec_get_nth(): The inverse function of
page_rec_get_n_recs_before(), get the nth record of the record
list. This is faster than iterating the linked list. Refactored from
page_get_middle_rec().

trx_undo_rec_copy(): Add a debug assertion for the length.

trx_undo_add_page(): Return a block descriptor or NULL instead of a
page number or FIL_NULL.

trx_undo_report_row_operation(): Add debug assertions.

trx_sys_create_doublewrite_buf(): Assert that each page was not
previously X-latched.

page_cur_insert_rec_zip_reorg(): Make use of page_rec_get_nth().

row_ins_clust_index_entry_by_modify(): Pass BTR_KEEP_POS_FLAG, so that
the repositioning of the cursor can be avoided.

row_ins_index_entry_low(): Add DEBUG_SYNC points before and after
writing off-page columns. If inserting by updating a delete-marked
record, do not reposition the cursor or commit the mini-transaction
before writing the off-page columns.

row_build(): Tighten a debug assertion about null BLOB pointers.

row_upd_clust_rec(): Add DEBUG_SYNC points before and after writing
off-page columns. Do not reposition the cursor or commit the
mini-transaction before writing the off-page columns.

rb:939 approved by Jimmy Yang
2012-02-17 11:42:04 +02:00

367 lines
14 KiB
C

/*****************************************************************************
Copyright (c) 1995, 2012, Oracle and/or its affiliates. All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Suite 500, Boston, MA 02110-1335 USA
*****************************************************************************/
/**************************************************//**
@file include/fsp0fsp.h
File space management
Created 12/18/1995 Heikki Tuuri
*******************************************************/
#ifndef fsp0fsp_h
#define fsp0fsp_h
#include "univ.i"
#include "mtr0mtr.h"
#include "fut0lst.h"
#include "ut0byte.h"
#include "page0types.h"
#include "fsp0types.h"
/**********************************************************************//**
Initializes the file space system. */
UNIV_INTERN
void
fsp_init(void);
/*==========*/
/**********************************************************************//**
Gets the current free limit of the system tablespace. The free limit
means the place of the first page which has never been put to the
free list for allocation. The space above that address is initialized
to zero. Sets also the global variable log_fsp_current_free_limit.
@return free limit in megabytes */
UNIV_INTERN
ulint
fsp_header_get_free_limit(void);
/*===========================*/
/**********************************************************************//**
Gets the size of the system tablespace from the tablespace header. If
we do not have an auto-extending data file, this should be equal to
the size of the data files. If there is an auto-extending data file,
this can be smaller.
@return size in pages */
UNIV_INTERN
ulint
fsp_header_get_tablespace_size(void);
/*================================*/
/**********************************************************************//**
Reads the file space size stored in the header page.
@return tablespace size stored in the space header */
UNIV_INTERN
ulint
fsp_get_size_low(
/*=============*/
page_t* page); /*!< in: header page (page 0 in the tablespace) */
/**********************************************************************//**
Reads the space id from the first page of a tablespace.
@return space id, ULINT UNDEFINED if error */
UNIV_INTERN
ulint
fsp_header_get_space_id(
/*====================*/
const page_t* page); /*!< in: first page of a tablespace */
/**********************************************************************//**
Reads the space flags from the first page of a tablespace.
@return flags */
UNIV_INTERN
ulint
fsp_header_get_flags(
/*=================*/
const page_t* page); /*!< in: first page of a tablespace */
/**********************************************************************//**
Reads the compressed page size from the first page of a tablespace.
@return compressed page size in bytes, or 0 if uncompressed */
UNIV_INTERN
ulint
fsp_header_get_zip_size(
/*====================*/
const page_t* page); /*!< in: first page of a tablespace */
/**********************************************************************//**
Writes the space id and compressed page size to a tablespace header.
This function is used past the buffer pool when we in fil0fil.c create
a new single-table tablespace. */
UNIV_INTERN
void
fsp_header_init_fields(
/*===================*/
page_t* page, /*!< in/out: first page in the space */
ulint space_id, /*!< in: space id */
ulint flags); /*!< in: tablespace flags (FSP_SPACE_FLAGS):
0, or table->flags if newer than COMPACT */
/**********************************************************************//**
Initializes the space header of a new created space and creates also the
insert buffer tree root if space == 0. */
UNIV_INTERN
void
fsp_header_init(
/*============*/
ulint space, /*!< in: space id */
ulint size, /*!< in: current size in blocks */
mtr_t* mtr); /*!< in: mini-transaction handle */
/**********************************************************************//**
Increases the space size field of a space. */
UNIV_INTERN
void
fsp_header_inc_size(
/*================*/
ulint space, /*!< in: space id */
ulint size_inc,/*!< in: size increment in pages */
mtr_t* mtr); /*!< in: mini-transaction handle */
/**********************************************************************//**
Creates a new segment.
@return the block where the segment header is placed, x-latched, NULL
if could not create segment because of lack of space */
UNIV_INTERN
buf_block_t*
fseg_create(
/*========*/
ulint space, /*!< in: space id */
ulint page, /*!< in: page where the segment header is placed: if
this is != 0, the page must belong to another segment,
if this is 0, a new page will be allocated and it
will belong to the created segment */
ulint byte_offset, /*!< in: byte offset of the created segment header
on the page */
mtr_t* mtr); /*!< in: mtr */
/**********************************************************************//**
Creates a new segment.
@return the block where the segment header is placed, x-latched, NULL
if could not create segment because of lack of space */
UNIV_INTERN
buf_block_t*
fseg_create_general(
/*================*/
ulint space, /*!< in: space id */
ulint page, /*!< in: page where the segment header is placed: if
this is != 0, the page must belong to another segment,
if this is 0, a new page will be allocated and it
will belong to the created segment */
ulint byte_offset, /*!< in: byte offset of the created segment header
on the page */
ibool has_done_reservation, /*!< in: TRUE if the caller has already
done the reservation for the pages with
fsp_reserve_free_extents (at least 2 extents: one for
the inode and the other for the segment) then there is
no need to do the check for this individual
operation */
mtr_t* mtr); /*!< in: mtr */
/**********************************************************************//**
Calculates the number of pages reserved by a segment, and how many pages are
currently used.
@return number of reserved pages */
UNIV_INTERN
ulint
fseg_n_reserved_pages(
/*==================*/
fseg_header_t* header, /*!< in: segment header */
ulint* used, /*!< out: number of pages used (<= reserved) */
mtr_t* mtr); /*!< in: mtr handle */
/**********************************************************************//**
Allocates a single free page from a segment. This function implements
the intelligent allocation strategy which tries to minimize
file space fragmentation.
@param[in/out] seg_header segment header
@param[in] hint hint of which page would be desirable
@param[in] direction if the new page is needed because
of an index page split, and records are
inserted there in order, into which
direction they go alphabetically: FSP_DOWN,
FSP_UP, FSP_NO_DIR
@param[in/out] mtr mini-transaction
@return X-latched block, or NULL if no page could be allocated */
#define fseg_alloc_free_page(seg_header, hint, direction, mtr) \
fseg_alloc_free_page_general(seg_header, hint, direction, \
FALSE, mtr, mtr)
/**********************************************************************//**
Allocates a single free page from a segment. This function implements
the intelligent allocation strategy which tries to minimize file space
fragmentation.
@retval NULL if no page could be allocated
@retval block, rw_lock_x_lock_count(&block->lock) == 1 if allocation succeeded
(init_mtr == mtr, or the page was not previously freed in mtr)
@retval block (not allocated or initialized) otherwise */
UNIV_INTERN
buf_block_t*
fseg_alloc_free_page_general(
/*=========================*/
fseg_header_t* seg_header,/*!< in/out: segment header */
ulint hint, /*!< in: hint of which page would be
desirable */
byte direction,/*!< in: if the new page is needed because
of an index page split, and records are
inserted there in order, into which
direction they go alphabetically: FSP_DOWN,
FSP_UP, FSP_NO_DIR */
ibool has_done_reservation, /*!< in: TRUE if the caller has
already done the reservation for the page
with fsp_reserve_free_extents, then there
is no need to do the check for this individual
page */
mtr_t* mtr, /*!< in/out: mini-transaction */
mtr_t* init_mtr)/*!< in/out: mtr or another mini-transaction
in which the page should be initialized.
If init_mtr!=mtr, but the page is already
latched in mtr, do not initialize the page. */
__attribute__((warn_unused_result, nonnull));
/**********************************************************************//**
Reserves free pages from a tablespace. All mini-transactions which may
use several pages from the tablespace should call this function beforehand
and reserve enough free extents so that they certainly will be able
to do their operation, like a B-tree page split, fully. Reservations
must be released with function fil_space_release_free_extents!
The alloc_type below has the following meaning: FSP_NORMAL means an
operation which will probably result in more space usage, like an
insert in a B-tree; FSP_UNDO means allocation to undo logs: if we are
deleting rows, then this allocation will in the long run result in
less space usage (after a purge); FSP_CLEANING means allocation done
in a physical record delete (like in a purge) or other cleaning operation
which will result in less space usage in the long run. We prefer the latter
two types of allocation: when space is scarce, FSP_NORMAL allocations
will not succeed, but the latter two allocations will succeed, if possible.
The purpose is to avoid dead end where the database is full but the
user cannot free any space because these freeing operations temporarily
reserve some space.
Single-table tablespaces whose size is < 32 pages are a special case. In this
function we would liberally reserve several 64 page extents for every page
split or merge in a B-tree. But we do not want to waste disk space if the table
only occupies < 32 pages. That is why we apply different rules in that special
case, just ensuring that there are 3 free pages available.
@return TRUE if we were able to make the reservation */
UNIV_INTERN
ibool
fsp_reserve_free_extents(
/*=====================*/
ulint* n_reserved,/*!< out: number of extents actually reserved; if we
return TRUE and the tablespace size is < 64 pages,
then this can be 0, otherwise it is n_ext */
ulint space, /*!< in: space id */
ulint n_ext, /*!< in: number of extents to reserve */
ulint alloc_type,/*!< in: FSP_NORMAL, FSP_UNDO, or FSP_CLEANING */
mtr_t* mtr); /*!< in: mtr */
/**********************************************************************//**
This function should be used to get information on how much we still
will be able to insert new data to the database without running out the
tablespace. Only free extents are taken into account and we also subtract
the safety margin required by the above function fsp_reserve_free_extents.
@return available space in kB */
UNIV_INTERN
ullint
fsp_get_available_space_in_free_extents(
/*====================================*/
ulint space); /*!< in: space id */
/**********************************************************************//**
Frees a single page of a segment. */
UNIV_INTERN
void
fseg_free_page(
/*===========*/
fseg_header_t* seg_header, /*!< in: segment header */
ulint space, /*!< in: space id */
ulint page, /*!< in: page offset */
mtr_t* mtr); /*!< in: mtr handle */
/**********************************************************************//**
Frees part of a segment. This function can be used to free a segment
by repeatedly calling this function in different mini-transactions.
Doing the freeing in a single mini-transaction might result in
too big a mini-transaction.
@return TRUE if freeing completed */
UNIV_INTERN
ibool
fseg_free_step(
/*===========*/
fseg_header_t* header, /*!< in, own: segment header; NOTE: if the header
resides on the first page of the frag list
of the segment, this pointer becomes obsolete
after the last freeing step */
mtr_t* mtr); /*!< in: mtr */
/**********************************************************************//**
Frees part of a segment. Differs from fseg_free_step because this function
leaves the header page unfreed.
@return TRUE if freeing completed, except the header page */
UNIV_INTERN
ibool
fseg_free_step_not_header(
/*======================*/
fseg_header_t* header, /*!< in: segment header which must reside on
the first fragment page of the segment */
mtr_t* mtr); /*!< in: mtr */
/***********************************************************************//**
Checks if a page address is an extent descriptor page address.
@return TRUE if a descriptor page */
UNIV_INLINE
ibool
fsp_descr_page(
/*===========*/
ulint zip_size,/*!< in: compressed page size in bytes;
0 for uncompressed pages */
ulint page_no);/*!< in: page number */
/***********************************************************//**
Parses a redo log record of a file page init.
@return end of log record or NULL */
UNIV_INTERN
byte*
fsp_parse_init_file_page(
/*=====================*/
byte* ptr, /*!< in: buffer */
byte* end_ptr, /*!< in: buffer end */
buf_block_t* block); /*!< in: block or NULL */
/*******************************************************************//**
Validates the file space system and its segments.
@return TRUE if ok */
UNIV_INTERN
ibool
fsp_validate(
/*=========*/
ulint space); /*!< in: space id */
/*******************************************************************//**
Prints info of a file space. */
UNIV_INTERN
void
fsp_print(
/*======*/
ulint space); /*!< in: space id */
#ifdef UNIV_DEBUG
/*******************************************************************//**
Validates a segment.
@return TRUE if ok */
UNIV_INTERN
ibool
fseg_validate(
/*==========*/
fseg_header_t* header, /*!< in: segment header */
mtr_t* mtr); /*!< in: mtr */
#endif /* UNIV_DEBUG */
#ifdef UNIV_BTR_PRINT
/*******************************************************************//**
Writes info of a segment. */
UNIV_INTERN
void
fseg_print(
/*=======*/
fseg_header_t* header, /*!< in: segment header */
mtr_t* mtr); /*!< in: mtr */
#endif /* UNIV_BTR_PRINT */
#ifndef UNIV_NONINL
#include "fsp0fsp.ic"
#endif
#endif