/*****************************************************************************

Copyright (c) 1994, 2016, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2012, Facebook Inc.
Copyright (c) 2017, 2019, MariaDB Corporation.

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA

*****************************************************************************/

/**************************************************//**
@file page/page0page.cc
Index page routines

Created 2/2/1994 Heikki Tuuri
*******************************************************/

#include "page0page.h"
#include "page0cur.h"
#include "page0zip.h"
#include "buf0buf.h"
#include "buf0checksum.h"
#include "btr0btr.h"
#include "srv0srv.h"
#include "lock0lock.h"
#include "fut0lst.h"
#include "btr0sea.h"
#include "trx0sys.h"

/* THE INDEX PAGE
==============

The index page consists of a page header which contains the page's
id and other information. On top of it are the index records
in a heap linked into a one way linear list according to alphabetic order.

Just below page end is an array of pointers which we call page directory,
to about every sixth record in the list. The pointers are placed in
the directory in the alphabetical order of the records pointed to,
enabling us to make binary search using the array. Each slot n:o I
in the directory points to a record, where a 4-bit field contains a count
of those records which are in the linear list between pointer I and
the pointer I - 1 in the directory, including the record
pointed to by pointer I and not including the record pointed to by I - 1.
We say that the record pointed to by slot I, or that slot I, owns
these records. The count is always kept in the range 4 to 8, with
the exception that it is 1 for the first slot, and 1--8 for the second slot.

An essentially binary search can be performed in the list of index
records, like we could do if we had a pointer to every record in the
page directory. The data structure is, however, more efficient when
we are doing inserts, because most inserts are just pushed on a heap.
Only every 8th insert requires a block move in the directory pointer
table, which itself is quite small. A record is deleted from the page
by just taking it off the linear list and updating the number of owned
records-field of the record which owns it, and updating the page directory,
if necessary. A special case is the one when the record owns itself.
Because the overhead of inserts is so small, we may also increase the
page size from the projected default of 8 kB to 64 kB without too
much loss of efficiency in inserts. A bigger page becomes attractive
when the disk transfer rate rises compared to seek and latency time.
On the present system, the page size is set so that the page transfer
time (3 ms) is 20 % of the disk random access time (15 ms).

When the page is split, merged, or becomes full but contains deleted
records, we have to reorganize the page.

Assuming a page size of 8 kB, a typical index page of a secondary
index contains 300 index entries, and the size of the page directory
is 50 x 4 bytes = 200 bytes. */

/***************************************************************//**
Looks for the directory slot which owns the given record.
@return the directory slot number */
ulint
page_dir_find_owner_slot(
/*=====================*/
	const rec_t*	rec)	/*!< in: the physical record */
{
	ut_ad(page_rec_check(rec));

	const page_t* page = page_align(rec);
	const page_dir_slot_t* first_slot = page_dir_get_nth_slot(page, 0);
	const page_dir_slot_t* slot = page_dir_get_nth_slot(
		page, ulint(page_dir_get_n_slots(page)) - 1);
	const rec_t* r = rec;

	if (page_is_comp(page)) {
		while (rec_get_n_owned_new(r) == 0) {
			r = rec_get_next_ptr_const(r, TRUE);
			ut_ad(r >= page + PAGE_NEW_SUPREMUM);
			ut_ad(r < page + (srv_page_size - PAGE_DIR));
		}
	} else {
		while (rec_get_n_owned_old(r) == 0) {
			r = rec_get_next_ptr_const(r, FALSE);
			ut_ad(r >= page + PAGE_OLD_SUPREMUM);
			ut_ad(r < page + (srv_page_size - PAGE_DIR));
		}
	}

	uint16	rec_offs_bytes = mach_encode_2(ulint(r - page));

	while (UNIV_LIKELY(*(uint16*) slot != rec_offs_bytes)) {

		if (UNIV_UNLIKELY(slot == first_slot)) {
			ib::error() << "Probable data corruption on page "
				<< page_get_page_no(page)
				<< ". Original record on that page;";

			if (page_is_comp(page)) {
				fputs("(compact record)", stderr);
			} else {
				rec_print_old(stderr, rec);
			}

			ib::error() << "Cannot find the dir slot for this"
				" record on that page;";

			if (page_is_comp(page)) {
				fputs("(compact record)", stderr);
			} else {
				rec_print_old(stderr, page
					      + mach_decode_2(rec_offs_bytes));
			}

			ut_error;
		}

		slot += PAGE_DIR_SLOT_SIZE;
	}

	return(((ulint) (first_slot - slot)) / PAGE_DIR_SLOT_SIZE);
}
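
/* The owner-finding walk in the function above can be illustrated with a
toy linked list: records whose n_owned count is 0 are owned by the next
record (in list order) carrying a nonzero count. The struct and function
names below are illustrative only, not InnoDB's. */

```cpp
#include <cassert>
#include <vector>

struct ToyRec {
	int	next;		/* index of next record, -1 at list end */
	int	n_owned;	/* 0 unless this record owns a group */
};

// Follow next-pointers from rec until a record with n_owned != 0;
// that record is the one a directory slot would point to.
int toy_find_owner(const std::vector<ToyRec>& recs, int rec)
{
	while (recs[rec].n_owned == 0) {
		rec = recs[rec].next;
	}
	return rec;
}
```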

/**************************************************************//**
Used to check the consistency of a directory slot.
@return TRUE if succeed */
static
ibool
page_dir_slot_check(
/*================*/
	const page_dir_slot_t*	slot)	/*!< in: slot */
{
	const page_t*	page;
	ulint		n_slots;
	ulint		n_owned;

	ut_a(slot);

	page = page_align(slot);

	n_slots = page_dir_get_n_slots(page);

	ut_a(slot <= page_dir_get_nth_slot(page, 0));
	ut_a(slot >= page_dir_get_nth_slot(page, n_slots - 1));

	ut_a(page_rec_check(page_dir_slot_get_rec(slot)));

	if (page_is_comp(page)) {
		n_owned = rec_get_n_owned_new(page_dir_slot_get_rec(slot));
	} else {
		n_owned = rec_get_n_owned_old(page_dir_slot_get_rec(slot));
	}

	if (slot == page_dir_get_nth_slot(page, 0)) {
		ut_a(n_owned == 1);
	} else if (slot == page_dir_get_nth_slot(page, n_slots - 1)) {
		ut_a(n_owned >= 1);
		ut_a(n_owned <= PAGE_DIR_SLOT_MAX_N_OWNED);
	} else {
		ut_a(n_owned >= PAGE_DIR_SLOT_MIN_N_OWNED);
		ut_a(n_owned <= PAGE_DIR_SLOT_MAX_N_OWNED);
	}

	return(TRUE);
}

/*************************************************************//**
Sets the max trx id field value. */
void
page_set_max_trx_id(
/*================*/
	buf_block_t*	block,	/*!< in/out: page */
	page_zip_des_t*	page_zip,/*!< in/out: compressed page, or NULL */
	trx_id_t	trx_id,	/*!< in: transaction id */
	mtr_t*		mtr)	/*!< in/out: mini-transaction, or NULL */
{
	page_t*		page = buf_block_get_frame(block);
	ut_ad(!mtr || mtr_memo_contains(mtr, block, MTR_MEMO_PAGE_X_FIX));

	/* It is not necessary to write this change to the redo log, as
	during a database recovery we assume that the max trx id of every
	page is the maximum trx id assigned before the crash. */

	if (page_zip) {
		mach_write_to_8(page + (PAGE_HEADER + PAGE_MAX_TRX_ID), trx_id);
		page_zip_write_header(page_zip,
				      page + (PAGE_HEADER + PAGE_MAX_TRX_ID),
				      8, mtr);
	} else if (mtr) {
		mlog_write_ull(page + (PAGE_HEADER + PAGE_MAX_TRX_ID),
			       trx_id, mtr);
	} else {
		mach_write_to_8(page + (PAGE_HEADER + PAGE_MAX_TRX_ID), trx_id);
	}
}
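
/* The 8-byte header writes above store the 64-bit id in big-endian byte
order. This is a sketch of what a mach_write_to_8()-style helper does;
the name toy_write_to_8 is illustrative, not the real InnoDB function. */

```cpp
#include <cassert>
#include <cstdint>

// Write n into b[0..7] most-significant byte first (big-endian).
void toy_write_to_8(uint8_t* b, uint64_t n)
{
	for (int i = 7; i >= 0; i--) {
		b[i] = uint8_t(n);	/* least significant byte last */
		n >>= 8;
	}
}
```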

/** Persist the AUTO_INCREMENT value on a clustered index root page.
@param[in,out]	block	clustered index root page
@param[in]	index	clustered index
@param[in]	autoinc	next available AUTO_INCREMENT value
@param[in,out]	mtr	mini-transaction
@param[in]	reset	whether to reset the AUTO_INCREMENT
			to a possibly smaller value than currently
			exists in the page */
void
page_set_autoinc(
	buf_block_t*		block,
	const dict_index_t*	index MY_ATTRIBUTE((unused)),
	ib_uint64_t		autoinc,
	mtr_t*			mtr,
	bool			reset)
{
	ut_ad(mtr_memo_contains_flagged(
		      mtr, block, MTR_MEMO_PAGE_X_FIX | MTR_MEMO_PAGE_SX_FIX));
	ut_ad(index->is_primary());
	ut_ad(index->page == block->page.id.page_no());
	ut_ad(index->table->space_id == block->page.id.space());

	byte*	field = PAGE_HEADER + PAGE_ROOT_AUTO_INC
		+ buf_block_get_frame(block);
	if (!reset && mach_read_from_8(field) >= autoinc) {
		/* nothing to update */
	} else if (page_zip_des_t* page_zip = buf_block_get_page_zip(block)) {
		mach_write_to_8(field, autoinc);
		page_zip_write_header(page_zip, field, 8, mtr);
	} else {
		mlog_write_ull(field, autoinc, mtr);
	}
}
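
/* The update rule in page_set_autoinc() is monotonic: the stored value
only moves forward unless the caller explicitly asks for a reset. A
minimal stand-in for the page-header field (toy name, not InnoDB code): */

```cpp
#include <cassert>
#include <cstdint>

// Return the value that should end up in the header field, given the
// currently stored value, the candidate, and the reset flag.
uint64_t toy_set_autoinc(uint64_t stored, uint64_t autoinc, bool reset)
{
	if (!reset && stored >= autoinc) {
		return stored;	/* nothing to update */
	}
	return autoinc;
}
```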

/************************************************************//**
Allocates a block of memory from the heap of an index page.
@return pointer to start of allocated buffer, or NULL if allocation fails */
byte*
page_mem_alloc_heap(
/*================*/
	page_t*		page,	/*!< in/out: index page */
	page_zip_des_t*	page_zip,/*!< in/out: compressed page with enough
				space available for inserting the record,
				or NULL */
	ulint		need,	/*!< in: total number of bytes needed */
	ulint*		heap_no)/*!< out: this contains the heap number
				of the allocated record
				if allocation succeeds */
{
	byte*	block;
	ulint	avl_space;

	ut_ad(page && heap_no);

	avl_space = page_get_max_insert_size(page, 1);

	if (avl_space >= need) {
		block = page_header_get_ptr(page, PAGE_HEAP_TOP);

		page_header_set_ptr(page, page_zip, PAGE_HEAP_TOP,
				    block + need);
		*heap_no = page_dir_get_n_heap(page);

		page_dir_set_n_heap(page, page_zip, 1 + *heap_no);

		return(block);
	}

	return(NULL);
}
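
/* The allocation above is a simple bump of the heap top: if enough
contiguous space remains before the page directory, hand out the current
heap top, advance it by the requested size, and assign the next heap
number. A hypothetical sketch with toy offsets, not the real page layout: */

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct ToyPageHeap {
	size_t	heap_top;	/* offset of first free byte */
	size_t	dir_start;	/* offset where the page directory begins */
	size_t	n_heap;		/* number of records allocated so far */
};

// Returns the offset of the allocated block, or SIZE_MAX on failure.
size_t toy_mem_alloc_heap(ToyPageHeap* h, size_t need, size_t* heap_no)
{
	if (h->dir_start - h->heap_top >= need) {
		size_t block = h->heap_top;
		h->heap_top += need;
		*heap_no = h->n_heap++;
		return block;
	}
	return SIZE_MAX;
}
```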

/**********************************************************//**
Writes a log record of page creation. */
UNIV_INLINE
void
page_create_write_log(
/*==================*/
	buf_frame_t*	frame,	/*!< in: a buffer frame where the page is
				created */
	mtr_t*		mtr,	/*!< in: mini-transaction handle */
	ibool		comp,	/*!< in: TRUE=compact page format */
	bool		is_rtree) /*!< in: whether it is R-tree */
{
	mlog_id_t	type;

	if (is_rtree) {
		type = comp ? MLOG_COMP_PAGE_CREATE_RTREE
			: MLOG_PAGE_CREATE_RTREE;
	} else {
		type = comp ? MLOG_COMP_PAGE_CREATE : MLOG_PAGE_CREATE;
	}

	mlog_write_initial_log_record(frame, type, mtr);
}

/** The page infimum and supremum of an empty page in ROW_FORMAT=REDUNDANT */
static const byte infimum_supremum_redundant[] = {
	/* the infimum record */
	0x08/*end offset*/,
	0x01/*n_owned*/,
	0x00, 0x00/*heap_no=0*/,
	0x03/*n_fields=1, 1-byte offsets*/,
	0x00, 0x74/* pointer to supremum */,
	'i', 'n', 'f', 'i', 'm', 'u', 'm', 0,
	/* the supremum record */
	0x09/*end offset*/,
	0x01/*n_owned*/,
	0x00, 0x08/*heap_no=1*/,
	0x03/*n_fields=1, 1-byte offsets*/,
	0x00, 0x00/* end of record list */,
	's', 'u', 'p', 'r', 'e', 'm', 'u', 'm', 0
};

/** The page infimum and supremum of an empty page in ROW_FORMAT=COMPACT */
static const byte infimum_supremum_compact[] = {
	/* the infimum record */
	0x01/*n_owned=1*/,
	0x00, 0x02/* heap_no=0, REC_STATUS_INFIMUM */,
	0x00, 0x0d/* pointer to supremum */,
	'i', 'n', 'f', 'i', 'm', 'u', 'm', 0,
	/* the supremum record */
	0x01/*n_owned=1*/,
	0x00, 0x0b/* heap_no=1, REC_STATUS_SUPREMUM */,
	0x00, 0x00/* end of record list */,
	's', 'u', 'p', 'r', 'e', 'm', 'u', 'm'
};

/**********************************************************//**
The index page creation function.
@return pointer to the page */
static
page_t*
page_create_low(
/*============*/
	buf_block_t*	block,		/*!< in: a buffer block where the
					page is created */
	ulint		comp,		/*!< in: nonzero=compact page format */
	bool		is_rtree)	/*!< in: if it is an R-Tree page */
{
	page_t*		page;

	compile_time_assert(PAGE_BTR_IBUF_FREE_LIST + FLST_BASE_NODE_SIZE
			    <= PAGE_DATA);
	compile_time_assert(PAGE_BTR_IBUF_FREE_LIST_NODE + FLST_NODE_SIZE
			    <= PAGE_DATA);

	buf_block_modify_clock_inc(block);

	page = buf_block_get_frame(block);

	if (is_rtree) {
		fil_page_set_type(page, FIL_PAGE_RTREE);
	} else {
		fil_page_set_type(page, FIL_PAGE_INDEX);
	}

	memset(page + PAGE_HEADER, 0, PAGE_HEADER_PRIV_END);
	page[PAGE_HEADER + PAGE_N_DIR_SLOTS + 1] = 2;
MDEV-11369 Instant ADD COLUMN for InnoDB
For InnoDB tables, adding, dropping and reordering columns has
required a rebuild of the table and all its indexes. Since MySQL 5.6
(and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing
concurrent modification of the tables.
This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT
and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously,
with only minor changes performed to the table structure. The counter
innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS
is incremented whenever a table rebuild operation is converted into
an instant ADD COLUMN operation.
ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.
Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY
and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE
The format of the clustered index (PRIMARY KEY) is changed as follows:
(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT,
and a new field PAGE_INSTANT will contain the original number of fields
in the clustered index ('core' fields).
If instant ADD COLUMN has not been used or the table becomes empty,
or the very first instant ADD COLUMN operation is rolled back,
the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset
to 0 and FIL_PAGE_INDEX.
(2) A special 'default row' record is inserted into the leftmost leaf,
between the page infimum and the first user record. This record is
distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the
same format as records that contain values for the instantly added
columns. This 'default row' always has the same number of fields as
the clustered index according to the table definition. The values of
'core' fields are to be ignored. For other fields, the 'default row'
will contain the default values as they were during the ALTER TABLE
statement. (If the column default values are changed later, those
values will only be stored in the .frm file. The 'default row' will
contain the original evaluated values, which must be the same for
every row.) The 'default row' must be completely hidden from
higher-level access routines. Assertions have been added to ensure
that no 'default row' is ever present in the adaptive hash index
or in locked records. The 'default row' is never delete-marked.
(3) In clustered index leaf page records, the number of fields must
reside between the number of 'core' fields (dict_index_t::n_core_fields
introduced in this work) and dict_index_t::n_fields. If the number
of fields is less than dict_index_t::n_fields, the missing fields
are replaced with the column value of the 'default row'.
Note: The number of fields in the record may shrink if some of the
last instantly added columns are updated to the value that is
in the 'default row'. The function btr_cur_trim() implements this
'compression' on update and rollback; dtuple::trim() implements it
on insert.
(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new
status value REC_STATUS_COLUMNS_ADDED will indicate the presence of
a new record header that will encode n_fields-n_core_fields-1 in
1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header
always explicitly encodes the number of fields.)
We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for
covering the insert of the 'default row' record when instant ADD COLUMN
is used for the first time. Subsequent instant ADD COLUMN can use
TRX_UNDO_UPD_EXIST_REC.
This is joint work with Vin Chen (陈福荣) from Tencent. The design
that was discussed in April 2017 would not have allowed import or
export of data files, because instead of the 'default row' it would
have introduced a data dictionary table. The test
rpl.rpl_alter_instant is exactly as contributed in pull request #408.
The test innodb.instant_alter is based on a contributed test.
The redo log record format changes for ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPACT are as contributed. (With this change present,
crash recovery from MariaDB 10.3.1 will fail in spectacular ways!)
Also the semantics of higher-level redo log records that modify the
PAGE_INSTANT field is changed. The redo log format version identifier
was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.
Everything else has been rewritten by me. Thanks to Elena Stepanova,
the code has been tested extensively.
When rolling back an instant ADD COLUMN operation, we must empty the
PAGE_FREE list after deleting or shortening the 'default row' record,
by calling either btr_page_empty() or btr_page_reorganize(). We must
know the size of each entry in the PAGE_FREE list. If rollback left a
freed copy of the 'default row' in the PAGE_FREE list, we would be
unable to determine its size (if it is in ROW_FORMAT=COMPACT or
ROW_FORMAT=DYNAMIC) because it would contain more fields than the
rolled-back definition of the clustered index.
UNIV_SQL_DEFAULT: A new special constant that designates an instantly
added column that is not present in the clustered index record.
len_is_stored(): Check if a length is an actual length. There are
two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
dict_col_t::def_val: The 'default row' value of the column. If the
column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(),
instant_value().
dict_col_t::remove_instant(): Remove the 'instant ADD' status of
a column.
dict_col_t::name(const dict_table_t& table): Replaces
dict_table_get_col_name().
dict_index_t::n_core_fields: The original number of fields.
For secondary indexes and if instant ADD COLUMN has not been used,
this will be equal to dict_index_t::n_fields.
dict_index_t::n_core_null_bytes: Number of bytes needed to
represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
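The UT_BITS_IN_BYTES() computation is just a ceiling division: b null-flag bits need ceil(b / 8) bytes. A stand-in function, purely for illustration (the real macro is in ut0ut.h):

```cpp
#include <cassert>

// Number of bytes needed to store b bits, i.e. ceil(b / 8).
static unsigned bits_in_bytes(unsigned b)
{
	return (b + 7) / 8;
}
```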
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that
n_core_null_bytes was not initialized yet from the clustered index
root page.
dict_index_t: Add the accessors is_instant(), is_clust(),
get_n_nullable(), instant_field_value().
dict_index_t::instant_add_field(): Adjust clustered index metadata
for instant ADD COLUMN.
dict_index_t::remove_instant(): Remove the 'instant ADD' status
of a clustered index when the table becomes empty, or the very first
instant ADD COLUMN operation is rolled back.
dict_table_t: Add the accessors is_instant(), is_temporary(),
supports_instant().
dict_table_t::instant_add_column(): Adjust metadata for
instant ADD COLUMN.
dict_table_t::rollback_instant(): Adjust metadata on the rollback
of instant ADD COLUMN.
prepare_inplace_alter_table_dict(): First create the ctx->new_table,
and only then decide if the table really needs to be rebuilt.
We must split the creation of table or index metadata from the
creation of the dictionary table records and the creation of
the data. In this way, we can transform a table-rebuilding operation
into an instant ADD COLUMN operation. Dictionary objects will only
be added to cache when table rebuilding or index creation is needed.
The ctx->instant_table will never be added to cache.
dict_table_t::add_to_cache(): Modified and renamed from
dict_table_add_to_cache(). Do not modify the table metadata.
Let the callers invoke dict_table_add_system_columns() and if needed,
set can_be_evicted.
dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the
system columns (which will now exist in the dict_table_t object
already at this point).
dict_create_table_step(): Expect the callers to invoke
dict_table_add_system_columns().
pars_create_table(): Before creating the table creation execution
graph, invoke dict_table_add_system_columns().
row_create_table_for_mysql(): Expect all callers to invoke
dict_table_add_system_columns().
create_index_dict(): Replaces row_merge_create_index_graph().
innodb_update_n_cols(): Renamed from innobase_update_n_virtual().
Call my_error() if an error occurs.
btr_cur_instant_init(), btr_cur_instant_init_low(),
btr_cur_instant_root_init():
Load additional metadata from the clustered index and set
dict_index_t::n_core_null_bytes. This is invoked
when table metadata is first loaded into the data dictionary.
dict_boot(): Initialize n_core_null_bytes for the four hard-coded
dictionary tables.
dict_create_index_step(): Initialize n_core_null_bytes. This is
executed as part of CREATE TABLE.
dict_index_build_internal_clust(): Initialize n_core_null_bytes to
NO_CORE_NULL_BYTES if table->supports_instant().
row_create_index_for_mysql(): Initialize n_core_null_bytes for
CREATE TEMPORARY TABLE.
commit_cache_norebuild(): Call the code to rename or enlarge columns
in the cache only if instant ADD COLUMN is not being used.
(Instant ADD COLUMN would copy all column metadata from
instant_table to old_table, including the names and lengths.)
PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields.
This is repurposing the 16-bit field PAGE_DIRECTION, of which only the
least significant 3 bits were used. The original byte containing
PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
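A sketch of how a 13-bit PAGE_INSTANT value and the 3-bit direction can share the old 16-bit PAGE_DIRECTION field: the direction stays in the low 3 bits and n_core_fields is stored above them. The real accessors are page_get_instant()/page_set_instant() in page0page.h; the exact packing below is an assumption for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Store n_core_fields (13 bits) above the low 3 direction bits.
static uint16_t set_instant(uint16_t field, uint16_t n_core_fields)
{
	assert(n_core_fields < (1U << 13));
	return uint16_t((field & 7) | (n_core_fields << 3));
}

static uint16_t get_instant(uint16_t field)
{
	return uint16_t(field >> 3);
}

static uint16_t get_direction(uint16_t field)
{
	return uint16_t(field & 7);
}
```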
page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.
page_ptr_get_direction(), page_get_direction(),
page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
page_direction_increment(): Increment PAGE_N_DIRECTION
and set PAGE_DIRECTION.
rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes,
and assume that heap_no is always set.
Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records,
even if the record contains fewer fields.
rec_offs_make_valid(): Add the parameter 'leaf'.
rec_copy_prefix_to_dtuple(): Assert that the tuple is only built
on the core fields. Instant ADD COLUMN only applies to the
clustered index, and we should never build a search key that has
more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR.
All these columns are always present.
dict_index_build_data_tuple(): Remove assertions that would be
duplicated in rec_copy_prefix_to_dtuple().
rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose
number of fields is between n_core_fields and n_fields.
cmp_rec_rec_with_match(): Implement the comparison between two
MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance.
A dictionary transaction must be flagged as such before it has generated
any undo log records. This is because trx_undo_assign_undo() will mark
the transaction as a dictionary transaction in the undo log header
right before the very first undo log record is being written.
btr_index_rec_validate(): Account for instant ADD COLUMN.
row_undo_ins_remove_clust_rec(): On the rollback of an insert into
SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
trx_undo_update_rec_get_update(): Handle the 'default row'
as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
before insert or update. After instant ADD COLUMN, if the last fields
of a clustered index tuple match the 'default row', there is no
need to store them. While trimming the entry, we must hold a page latch,
so that the table cannot be emptied and the 'default row' be deleted.
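A toy analogue of dtuple_t::trim(): drop a redundant suffix of fields whose values match the 'default row', but never drop the core fields. This is purely illustrative — the real function operates on dfield_t arrays inside a dtuple_t, not on strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Trim trailing fields that equal the corresponding 'default row' value,
// stopping at n_core (the fields that must always be stored).
static void trim(std::vector<std::string>& tuple,
		 const std::vector<std::string>& defaults,
		 size_t n_core)
{
	size_t n = tuple.size();
	while (n > n_core && n <= defaults.size()
	       && tuple[n - 1] == defaults[n - 1]) {
		n--;
	}
	tuple.resize(n);
}
```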
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
of fields to be between n_core_fields and n_fields. Do not support
infimum,supremum. They are never supposed to be stored in dtuple_t,
because page creation nowadays uses a lower-level method for initializing
them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the
'default row', handle rollback specially. For user records, omit
fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Skip locking and adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
In the temporary file that is applied by row_log_table_apply(),
we must identify whether the records contain the extra header for
instantly added columns. For now, we will allocate an additional byte
for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
fine, as they will be converted and will only contain 'core' columns
(PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(),
rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
An info_bits constant for distinguishing the 'default row' record.
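The constant is a simple bitwise OR of an info bit and a status value. The numeric values below (REC_INFO_MIN_REC_FLAG = 0x10, REC_STATUS_COLUMNS_ADDED = 4) match the conventional rem0rec.h encoding, but are reproduced here as assumptions of this sketch:

```cpp
#include <cassert>

// Assumed values mirroring rem0rec.h; the 3-bit status field uses
// 0..3 for ordinary/node-pointer/infimum/supremum, so 4 is the new value.
static const unsigned REC_INFO_MIN_REC_FLAG     = 0x10;
static const unsigned REC_STATUS_COLUMNS_ADDED  = 4;
static const unsigned REC_INFO_DEFAULT_ROW
	= REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED;
```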
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of
rec_init_offsets_comp_ordinary().
	page[PAGE_HEADER + PAGE_INSTANT] = 0;
	page[PAGE_HEADER + PAGE_DIRECTION_B] = PAGE_NO_DIRECTION;

	if (comp) {
		page[PAGE_HEADER + PAGE_N_HEAP] = 0x80;/*page_is_comp()*/
		page[PAGE_HEADER + PAGE_N_HEAP + 1] = PAGE_HEAP_NO_USER_LOW;
		page[PAGE_HEADER + PAGE_HEAP_TOP + 1] = PAGE_NEW_SUPREMUM_END;
		memcpy(page + PAGE_DATA, infimum_supremum_compact,
		       sizeof infimum_supremum_compact);
		memset(page
		       + PAGE_NEW_SUPREMUM_END, 0,
		       srv_page_size - PAGE_DIR - PAGE_NEW_SUPREMUM_END);
		page[srv_page_size - PAGE_DIR - PAGE_DIR_SLOT_SIZE * 2 + 1]
			= PAGE_NEW_SUPREMUM;
		page[srv_page_size - PAGE_DIR - PAGE_DIR_SLOT_SIZE + 1]
			= PAGE_NEW_INFIMUM;
	} else {
		page[PAGE_HEADER + PAGE_N_HEAP + 1] = PAGE_HEAP_NO_USER_LOW;
		page[PAGE_HEADER + PAGE_HEAP_TOP + 1] = PAGE_OLD_SUPREMUM_END;
		memcpy(page + PAGE_DATA, infimum_supremum_redundant,
		       sizeof infimum_supremum_redundant);
		memset(page
		       + PAGE_OLD_SUPREMUM_END, 0,
		       srv_page_size - PAGE_DIR - PAGE_OLD_SUPREMUM_END);
		page[srv_page_size - PAGE_DIR - PAGE_DIR_SLOT_SIZE * 2 + 1]
			= PAGE_OLD_SUPREMUM;
		page[srv_page_size - PAGE_DIR - PAGE_DIR_SLOT_SIZE + 1]
			= PAGE_OLD_INFIMUM;
	}

	return(page);
}

/** Parses a redo log record of creating a page.
@param[in,out]	block		buffer block, or NULL
@param[in]	comp		nonzero=compact page format
@param[in]	is_rtree	whether it is rtree page */
void
page_parse_create(
	buf_block_t*	block,
	ulint		comp,
	bool		is_rtree)
{
	if (block != NULL) {
		page_create_low(block, comp, is_rtree);
	}
}
/**********************************************************//**
Create an uncompressed B-tree or R-tree index page.
@return pointer to the page */
page_t*
page_create(
/*========*/
	buf_block_t*	block,		/*!< in: a buffer block where the
					page is created */
	mtr_t*		mtr,		/*!< in: mini-transaction handle */
	ulint		comp,		/*!< in: nonzero=compact page format */
	bool		is_rtree)	/*!< in: whether it is a R-Tree page */
{
	ut_ad(mtr->is_named_space(block->page.id.space()));
	page_create_write_log(buf_block_get_frame(block), mtr, comp, is_rtree);
	return(page_create_low(block, comp, is_rtree));
}

/**********************************************************//**
Create a compressed B-tree index page.
@return pointer to the page */
page_t*
page_create_zip(
/*============*/
	buf_block_t*	block,		/*!< in/out: a buffer frame
					where the page is created */
	dict_index_t*	index,		/*!< in: the index of the
					page, or NULL when applying
					TRUNCATE log
					record during recovery */
	ulint		level,		/*!< in: the B-tree level
					of the page */
	trx_id_t	max_trx_id,	/*!< in: PAGE_MAX_TRX_ID */
	mtr_t*		mtr)		/*!< in/out: mini-transaction
					handle */
{
	page_t*		page;
	page_zip_des_t*	page_zip = buf_block_get_page_zip(block);

	ut_ad(block);
	ut_ad(page_zip);
	ut_ad(dict_table_is_comp(index->table));

	/* PAGE_MAX_TRX_ID or PAGE_ROOT_AUTO_INC are always 0 for
	temporary tables. */
	ut_ad(max_trx_id == 0 || !index->table->is_temporary());
	/* In secondary indexes and the change buffer, PAGE_MAX_TRX_ID
	must be zero on non-leaf pages. max_trx_id can be 0 when the
	index consists of an empty root (leaf) page. */
	ut_ad(max_trx_id == 0
	      || level == 0
	      || !dict_index_is_sec_or_ibuf(index)
	      || index->table->is_temporary());
	/* In the clustered index, PAGE_ROOT_AUTOINC or
	PAGE_MAX_TRX_ID must be 0 on other pages than the root. */
	ut_ad(level == 0 || max_trx_id == 0
	      || !dict_index_is_sec_or_ibuf(index)
	      || index->table->is_temporary());

	page = page_create_low(block, TRUE, dict_index_is_spatial(index));
	mach_write_to_2(PAGE_HEADER + PAGE_LEVEL + page, level);
	mach_write_to_8(PAGE_HEADER + PAGE_MAX_TRX_ID + page, max_trx_id);

	if (!page_zip_compress(page_zip, page, index, page_zip_level, mtr)) {
		/* The compression of a newly created
		page should always succeed. */
		ut_error;
	}

	return(page);
}

/**********************************************************//**
Empty a previously created B-tree index page. */
void
page_create_empty(
/*==============*/
	buf_block_t*	block,	/*!< in/out: B-tree block */
	dict_index_t*	index,	/*!< in: the index of the page */
	mtr_t*		mtr)	/*!< in/out: mini-transaction */
{
	trx_id_t	max_trx_id;
	page_t*		page	= buf_block_get_frame(block);
	page_zip_des_t*	page_zip= buf_block_get_page_zip(block);

	ut_ad(fil_page_index_page_check(page));
	ut_ad(!index->is_dummy);
	ut_ad(block->page.id.space() == index->table->space->id);

	/* Multiple transactions cannot simultaneously operate on the
	same temp-table in parallel.
	max_trx_id is ignored for temp tables because it is not required
	for MVCC. */
	if (dict_index_is_sec_or_ibuf(index)
	    && !index->table->is_temporary()
	    && page_is_leaf(page)) {
		max_trx_id = page_get_max_trx_id(page);
		ut_ad(max_trx_id);
	} else if (block->page.id.page_no() == index->page) {
		/* Preserve PAGE_ROOT_AUTO_INC. */
		max_trx_id = page_get_max_trx_id(page);
	} else {
		max_trx_id = 0;
	}

	if (page_zip) {
		ut_ad(!index->table->is_temporary());
		page_create_zip(block, index,
				page_header_get_field(page, PAGE_LEVEL),
				max_trx_id, mtr);
	} else {
		page_create(block, mtr, page_is_comp(page),
			    dict_index_is_spatial(index));

		if (max_trx_id) {
			mlog_write_ull(PAGE_HEADER + PAGE_MAX_TRX_ID + page,
				       max_trx_id, mtr);
		}
	}
}

/*************************************************************//**
Differs from page_copy_rec_list_end, because this function does not
touch the lock table and max trx id on page or compress the page.

IMPORTANT: The caller will have to update IBUF_BITMAP_FREE
if new_block is a compressed leaf page in a secondary index.
This has to be done either within the same mini-transaction,
or by invoking ibuf_reset_free_bits() before mtr_commit(). */
void
page_copy_rec_list_end_no_locks(
/*============================*/
	buf_block_t*	new_block,	/*!< in: index page to copy to */
	buf_block_t*	block,		/*!< in: index page of rec */
	rec_t*		rec,		/*!< in: record on page */
	dict_index_t*	index,		/*!< in: record descriptor */
	mtr_t*		mtr)		/*!< in: mtr */
{
	page_t*		new_page	= buf_block_get_frame(new_block);
	page_cur_t	cur1;
	rec_t*		cur2;
	mem_heap_t*	heap		= NULL;
	ulint		offsets_[REC_OFFS_NORMAL_SIZE];
	ulint*		offsets		= offsets_;
	rec_offs_init(offsets_);

	page_cur_position(rec, block, &cur1);

	if (page_cur_is_before_first(&cur1)) {

		page_cur_move_to_next(&cur1);
	}

	btr_assert_not_corrupted(new_block, index);
	ut_a(page_is_comp(new_page) == page_rec_is_comp(rec));
	ut_a(mach_read_from_2(new_page + srv_page_size - 10) == (ulint)
	     (page_is_comp(new_page) ? PAGE_NEW_INFIMUM : PAGE_OLD_INFIMUM));
	const bool is_leaf = page_is_leaf(block->frame);

	cur2 = page_get_infimum_rec(buf_block_get_frame(new_block));

	/* Copy records from the original page to the new page */

	while (!page_cur_is_after_last(&cur1)) {
		rec_t*	cur1_rec = page_cur_get_rec(&cur1);
		rec_t*	ins_rec;
		offsets = rec_get_offsets(cur1_rec, index, offsets, is_leaf,
					  ULINT_UNDEFINED, &heap);
		ins_rec = page_cur_insert_rec_low(cur2, index,
						  cur1_rec, offsets, mtr);
		if (UNIV_UNLIKELY(!ins_rec)) {
			ib::fatal() << "Rec offset " << page_offset(rec)
				<< ", cur1 offset "
				<< page_offset(page_cur_get_rec(&cur1))
				<< ", cur2 offset " << page_offset(cur2);
		}

		page_cur_move_to_next(&cur1);
		cur2 = ins_rec;
	}

	if (UNIV_LIKELY_NULL(heap)) {
		mem_heap_free(heap);
	}
}

/*************************************************************//**
Copies records from page to new_page, from a given record onward,
including that record. Infimum and supremum records are not copied.
The records are copied to the start of the record list on new_page.

IMPORTANT: The caller will have to update IBUF_BITMAP_FREE
if new_block is a compressed leaf page in a secondary index.
This has to be done either within the same mini-transaction,
or by invoking ibuf_reset_free_bits() before mtr_commit().

@return pointer to the original successor of the infimum record on
new_page, or NULL on zip overflow (new_block will be decompressed) */
rec_t*
page_copy_rec_list_end(
/*===================*/
	buf_block_t*	new_block,	/*!< in/out: index page to copy to */
	buf_block_t*	block,		/*!< in: index page containing rec */
	rec_t*		rec,		/*!< in: record on page */
	dict_index_t*	index,		/*!< in: record descriptor */
	mtr_t*		mtr)		/*!< in: mtr */
{
	page_t*		new_page	= buf_block_get_frame(new_block);
	page_zip_des_t*	new_page_zip	= buf_block_get_page_zip(new_block);
	page_t*		page		= page_align(rec);
	rec_t*		ret		= page_rec_get_next(
		page_get_infimum_rec(new_page));
	ulint		num_moved	= 0;
	rtr_rec_move_t*	rec_move	= NULL;
	mem_heap_t*	heap		= NULL;

#ifdef UNIV_ZIP_DEBUG
	if (new_page_zip) {
		page_zip_des_t*	page_zip = buf_block_get_page_zip(block);
		ut_a(page_zip);

		/* Strict page_zip_validate() may fail here.
		Furthermore, btr_compress() may set FIL_PAGE_PREV to
		FIL_NULL on new_page while leaving it intact on
		new_page_zip. So, we cannot validate new_page_zip. */
		ut_a(page_zip_validate_low(page_zip, page, index, TRUE));
	}
#endif /* UNIV_ZIP_DEBUG */
	ut_ad(buf_block_get_frame(block) == page);
	ut_ad(page_is_leaf(page) == page_is_leaf(new_page));
	ut_ad(page_is_comp(page) == page_is_comp(new_page));
	/* Here, "ret" may be pointing to a user record or the
	predefined supremum record. */

	mtr_log_t	log_mode = MTR_LOG_NONE;

	if (new_page_zip) {
		log_mode = mtr_set_log_mode(mtr, MTR_LOG_NONE);
	}

	if (page_dir_get_n_heap(new_page) == PAGE_HEAP_NO_USER_LOW) {
		page_copy_rec_list_end_to_created_page(new_page, rec,
						       index, mtr);
	} else {
		if (dict_index_is_spatial(index)) {
			ulint	max_to_move = page_get_n_recs(
						buf_block_get_frame(block));
			heap = mem_heap_create(256);

			rec_move = static_cast<rtr_rec_move_t*>(mem_heap_alloc(
					heap,
					sizeof (*rec_move) * max_to_move));

			/* For spatial index, we need to insert recs one by one
			to keep recs ordered. */
			rtr_page_copy_rec_list_end_no_locks(new_block,
							    block, rec, index,
							    heap, rec_move,
							    max_to_move,
							    &num_moved,
							    mtr);
		} else {
			page_copy_rec_list_end_no_locks(new_block, block, rec,
							index, mtr);
		}
	}

	/* Update PAGE_MAX_TRX_ID on the uncompressed page.
	Modifications will be redo logged and copied to the compressed
	page in page_zip_compress() or page_zip_reorganize() below.
	Multiple transactions cannot simultaneously operate on the
	same temp-table in parallel. max_trx_id is ignored for temp
	tables because it is not required for MVCC. */
	if (dict_index_is_sec_or_ibuf(index)
	    && page_is_leaf(page)
	    && !index->table->is_temporary()) {
		page_update_max_trx_id(new_block, NULL,
				       page_get_max_trx_id(page), mtr);
	}

	if (new_page_zip) {
		mtr_set_log_mode(mtr, log_mode);

		if (!page_zip_compress(new_page_zip, new_page, index,
				       page_zip_level, mtr)) {
			/* Before trying to reorganize the page,
			store the number of preceding records on the page. */
			ulint	ret_pos
				= page_rec_get_n_recs_before(ret);
			/* Before copying, "ret" was the successor of
			the predefined infimum record. It must still
			have at least one predecessor (the predefined
			infimum record, or a freshly copied record
			that is smaller than "ret"). */
			ut_a(ret_pos > 0);

			if (!page_zip_reorganize(new_block, index, mtr)) {

				if (!page_zip_decompress(new_page_zip,
							 new_page, FALSE)) {
					ut_error;
				}
				ut_ad(page_validate(new_page, index));

				if (heap) {
					mem_heap_free(heap);
				}

				return(NULL);
			} else {
				/* The page was reorganized:
				Seek to ret_pos. */
				ret = new_page + PAGE_NEW_INFIMUM;

				do {
					ret = rec_get_next_ptr(ret, TRUE);
				} while (--ret_pos);
			}
		}
	}

	/* Update the lock table and possible hash index */
	if (dict_table_is_locking_disabled(index->table)) {
	} else if (rec_move && dict_index_is_spatial(index)) {
		lock_rtr_move_rec_list(new_block, block, rec_move, num_moved);
MDEV-11369 Instant ADD COLUMN for InnoDB
For InnoDB tables, adding, dropping and reordering columns has
required a rebuild of the table and all its indexes. Since MySQL 5.6
(and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing
concurrent modification of the tables.
This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT
and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously,
with only minor changes performed to the table structure. The counter
innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS
is incremented whenever a table rebuild operation is converted into
an instant ADD COLUMN operation.
ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.
Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY
and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE
The format of the clustered index (PRIMARY KEY) is changed as follows:
(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT,
and a new field PAGE_INSTANT will contain the original number of fields
in the clustered index ('core' fields).
If instant ADD COLUMN has not been used or the table becomes empty,
or the very first instant ADD COLUMN operation is rolled back,
the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset
to 0 and FIL_PAGE_INDEX.
(2) A special 'default row' record is inserted into the leftmost leaf,
between the page infimum and the first user record. This record is
distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the
same format as records that contain values for the instantly added
columns. This 'default row' always has the same number of fields as
the clustered index according to the table definition. The values of
'core' fields are to be ignored. For other fields, the 'default row'
will contain the default values as they were during the ALTER TABLE
statement. (If the column default values are changed later, those
values will only be stored in the .frm file. The 'default row' will
contain the original evaluated values, which must be the same for
every row.) The 'default row' must be completely hidden from
higher-level access routines. Assertions have been added to ensure
that no 'default row' is ever present in the adaptive hash index
or in locked records. The 'default row' is never delete-marked.
(3) In clustered index leaf page records, the number of fields must
reside between the number of 'core' fields (dict_index_t::n_core_fields
introduced in this work) and dict_index_t::n_fields. If the number
of fields is less than dict_index_t::n_fields, the missing fields
are replaced with the column value of the 'default row'.
Note: The number of fields in the record may shrink if some of the
last instantly added columns are updated to the value that is
in the 'default row'. The function btr_cur_trim() implements this
'compression' on update and rollback; dtuple::trim() implements it
on insert.
(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new
status value REC_STATUS_COLUMNS_ADDED will indicate the presence of
a new record header that will encode n_fields-n_core_fields-1 in
1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header
always explicitly encodes the number of fields.)
We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for
covering the insert of the 'default row' record when instant ADD COLUMN
is used for the first time. Subsequent instant ADD COLUMN can use
TRX_UNDO_UPD_EXIST_REC.
This is joint work with Vin Chen (陈福荣) from Tencent. The design
that was discussed in April 2017 would not have allowed import or
export of data files, because instead of the 'default row' it would
have introduced a data dictionary table. The test
rpl.rpl_alter_instant is exactly as contributed in pull request #408.
The test innodb.instant_alter is based on a contributed test.
The redo log record format changes for ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPACT are as contributed. (With this change present,
crash recovery from MariaDB 10.3.1 will fail in spectacular ways!)
Also the semantics of higher-level redo log records that modify the
PAGE_INSTANT field is changed. The redo log format version identifier
was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.
Everything else has been rewritten by me. Thanks to Elena Stepanova,
the code has been tested extensively.
When rolling back an instant ADD COLUMN operation, we must empty the
PAGE_FREE list after deleting or shortening the 'default row' record,
by calling either btr_page_empty() or btr_page_reorganize(). We must
know the size of each entry in the PAGE_FREE list. If rollback left a
freed copy of the 'default row' in the PAGE_FREE list, we would be
unable to determine its size (if it is in ROW_FORMAT=COMPACT or
ROW_FORMAT=DYNAMIC) because it would contain more fields than the
rolled-back definition of the clustered index.
UNIV_SQL_DEFAULT: A new special constant that designates an instantly
added column that is not present in the clustered index record.
len_is_stored(): Check if a length is an actual length. There are
two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
dict_col_t::def_val: The 'default row' value of the column. If the
column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(),
instant_value().
dict_col_t::remove_instant(): Remove the 'instant ADD' status of
a column.
dict_col_t::name(const dict_table_t& table): Replaces
dict_table_get_col_name().
dict_index_t::n_core_fields: The original number of fields.
For secondary indexes and if instant ADD COLUMN has not been used,
this will be equal to dict_index_t::n_fields.
dict_index_t::n_core_null_bytes: Number of bytes needed to
represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that
n_core_null_bytes was not initialized yet from the clustered index
root page.
dict_index_t: Add the accessors is_instant(), is_clust(),
get_n_nullable(), instant_field_value().
dict_index_t::instant_add_field(): Adjust clustered index metadata
for instant ADD COLUMN.
dict_index_t::remove_instant(): Remove the 'instant ADD' status
of a clustered index when the table becomes empty, or the very first
instant ADD COLUMN operation is rolled back.
dict_table_t: Add the accessors is_instant(), is_temporary(),
supports_instant().
dict_table_t::instant_add_column(): Adjust metadata for
instant ADD COLUMN.
dict_table_t::rollback_instant(): Adjust metadata on the rollback
of instant ADD COLUMN.
prepare_inplace_alter_table_dict(): First create the ctx->new_table,
and only then decide if the table really needs to be rebuilt.
We must split the creation of table or index metadata from the
creation of the dictionary table records and the creation of
the data. In this way, we can transform a table-rebuilding operation
into an instant ADD COLUMN operation. Dictionary objects will only
be added to cache when table rebuilding or index creation is needed.
The ctx->instant_table will never be added to cache.
dict_table_t::add_to_cache(): Modified and renamed from
dict_table_add_to_cache(). Do not modify the table metadata.
Let the callers invoke dict_table_add_system_columns() and if needed,
set can_be_evicted.
dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the
system columns (which will now exist in the dict_table_t object
already at this point).
dict_create_table_step(): Expect the callers to invoke
dict_table_add_system_columns().
pars_create_table(): Before creating the table creation execution
graph, invoke dict_table_add_system_columns().
row_create_table_for_mysql(): Expect all callers to invoke
dict_table_add_system_columns().
create_index_dict(): Replaces row_merge_create_index_graph().
innodb_update_n_cols(): Renamed from innobase_update_n_virtual().
Call my_error() if an error occurs.
btr_cur_instant_init(), btr_cur_instant_init_low(),
btr_cur_instant_root_init():
Load additional metadata from the clustered index and set
dict_index_t::n_core_null_bytes. This is invoked
when table metadata is first loaded into the data dictionary.
dict_boot(): Initialize n_core_null_bytes for the four hard-coded
dictionary tables.
dict_create_index_step(): Initialize n_core_null_bytes. This is
executed as part of CREATE TABLE.
dict_index_build_internal_clust(): Initialize n_core_null_bytes to
NO_CORE_NULL_BYTES if table->supports_instant().
row_create_index_for_mysql(): Initialize n_core_null_bytes for
CREATE TEMPORARY TABLE.
commit_cache_norebuild(): Call the code to rename or enlarge columns
in the cache only if instant ADD COLUMN is not being used.
(Instant ADD COLUMN would copy all column metadata from
instant_table to old_table, including the names and lengths.)
PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields.
This is repurposing the 16-bit field PAGE_DIRECTION, of which only the
least significant 3 bits were used. The original byte containing
PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.
page_ptr_get_direction(), page_get_direction(),
page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
page_direction_increment(): Increment PAGE_N_DIRECTION
and set PAGE_DIRECTION.
rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes,
and assume that heap_no is always set.
Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records,
even if the record contains fewer fields.
rec_offs_make_valid(): Add the parameter 'leaf'.
rec_copy_prefix_to_dtuple(): Assert that the tuple is only built
on the core fields. Instant ADD COLUMN only applies to the
clustered index, and we should never build a search key that has
more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR.
All these columns are always present.
dict_index_build_data_tuple(): Remove assertions that would be
duplicated in rec_copy_prefix_to_dtuple().
rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose
number of fields is between n_core_fields and n_fields.
cmp_rec_rec_with_match(): Implement the comparison between two
MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance.
A dictionary transaction must be flagged as such before it has generated
any undo log records. This is because trx_undo_assign_undo() will mark
the transaction as a dictionary transaction in the undo log header
right before the very first undo log record is being written.
btr_index_rec_validate(): Account for instant ADD COLUMN
row_undo_ins_remove_clust_rec(): On the rollback of an insert into
SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
trx_undo_update_rec_get_update(): Handle the 'default row'
as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
before insert or update. After instant ADD COLUMN, if the last fields
of a clustered index tuple match the 'default row', there is no
need to store them. While trimming the entry, we must hold a page latch,
so that the table cannot be emptied and the 'default row' be deleted.
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
of fields to be between n_core_fields and n_fields. Do not support
infimum,supremum. They are never supposed to be stored in dtuple_t,
because page creation nowadays uses a lower-level method for initializing
them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the
'default row', handle rollback specially. For user records, omit
fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Skip locking and adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
In the temporary file that is applied by row_log_table_apply(),
we must identify whether the records contain the extra header for
instantly added columns. For now, we will allocate an additional byte
for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
fine, as they will be converted and will only contain 'core' columns
(PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(),
rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
An info_bits constant for distinguishing the 'default row' record.
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of
rec_init_offsets_comp_ordinary().
	} else {
		lock_move_rec_list_end(new_block, block, rec);
	}

	if (heap) {
		mem_heap_free(heap);
	}

	btr_search_move_or_delete_hash_entries(new_block, block);

	return(ret);
}

/*************************************************************//**
Copies records from page to new_page, up to the given record,
NOT including that record. Infimum and supremum records are not copied.
The records are copied to the end of the record list on new_page.

IMPORTANT: The caller will have to update IBUF_BITMAP_FREE
if new_block is a compressed leaf page in a secondary index.
This has to be done either within the same mini-transaction,
or by invoking ibuf_reset_free_bits() before mtr_commit().

@return pointer to the original predecessor of the supremum record on
new_page, or NULL on zip overflow (new_block will be decompressed) */
rec_t*
page_copy_rec_list_start(
/*=====================*/
	buf_block_t*	new_block,	/*!< in/out: index page to copy to */
	buf_block_t*	block,		/*!< in: index page containing rec */
	rec_t*		rec,		/*!< in: record on page */
	dict_index_t*	index,		/*!< in: record descriptor */
	mtr_t*		mtr)		/*!< in: mtr */
{
	page_t*		new_page	= buf_block_get_frame(new_block);
	page_zip_des_t*	new_page_zip	= buf_block_get_page_zip(new_block);
	page_cur_t	cur1;
	rec_t*		cur2;
	mem_heap_t*	heap		= NULL;
	ulint		num_moved	= 0;
	rtr_rec_move_t*	rec_move	= NULL;
	rec_t*		ret
		= page_rec_get_prev(page_get_supremum_rec(new_page));
	ulint		offsets_[REC_OFFS_NORMAL_SIZE];
	ulint*		offsets		= offsets_;
	rec_offs_init(offsets_);

	/* Here, "ret" may be pointing to a user record or the
	predefined infimum record. */

	if (page_rec_is_infimum(rec)) {

		return(ret);
	}

	mtr_log_t	log_mode = MTR_LOG_NONE;

	if (new_page_zip) {
		log_mode = mtr_set_log_mode(mtr, MTR_LOG_NONE);
	}

	page_cur_set_before_first(block, &cur1);
	page_cur_move_to_next(&cur1);

	cur2 = ret;

	const bool is_leaf = page_rec_is_leaf(rec);

	/* Copy records from the original page to the new page */
	if (dict_index_is_spatial(index)) {
		ulint		max_to_move = page_get_n_recs(
					buf_block_get_frame(block));
		heap = mem_heap_create(256);

		rec_move = static_cast<rtr_rec_move_t*>(mem_heap_alloc(
				heap,
				sizeof (*rec_move) * max_to_move));

		/* For spatial index, we need to insert recs one by one
		to keep recs ordered. */
		rtr_page_copy_rec_list_start_no_locks(new_block,
						      block, rec, index, heap,
						      rec_move, max_to_move,
						      &num_moved, mtr);
	} else {
		while (page_cur_get_rec(&cur1) != rec) {
			rec_t*	cur1_rec = page_cur_get_rec(&cur1);
			offsets = rec_get_offsets(cur1_rec, index, offsets,
						  is_leaf,
						  ULINT_UNDEFINED, &heap);
			cur2 = page_cur_insert_rec_low(cur2, index,
						       cur1_rec, offsets, mtr);
			ut_a(cur2);

			page_cur_move_to_next(&cur1);
		}
	}

	/* Update PAGE_MAX_TRX_ID on the uncompressed page.
	Modifications will be redo logged and copied to the compressed
	page in page_zip_compress() or page_zip_reorganize() below.
	Multiple transactions cannot simultaneously operate on the
	same temp-table in parallel.
	max_trx_id is ignored for temp tables because it is not required
	for MVCC. */
	if (is_leaf && dict_index_is_sec_or_ibuf(index)
	    && !index->table->is_temporary()) {
		page_update_max_trx_id(new_block, NULL,
				       page_get_max_trx_id(page_align(rec)),
				       mtr);
	}

	if (new_page_zip) {
		mtr_set_log_mode(mtr, log_mode);

		DBUG_EXECUTE_IF("page_copy_rec_list_start_compress_fail",
				goto zip_reorganize;);

		if (!page_zip_compress(new_page_zip, new_page, index,
				       page_zip_level, mtr)) {
			ulint	ret_pos;
#ifndef DBUG_OFF
zip_reorganize:
#endif /* DBUG_OFF */
			/* Before trying to reorganize the page,
			store the number of preceding records on the page. */
			ret_pos = page_rec_get_n_recs_before(ret);
			/* Before copying, "ret" was the predecessor
			of the predefined supremum record. If it was
			the predefined infimum record, then it would
			still be the infimum, and we would have
			ret_pos == 0. */

			if (UNIV_UNLIKELY
			    (!page_zip_reorganize(new_block, index, mtr))) {

				if (UNIV_UNLIKELY
				    (!page_zip_decompress(new_page_zip,
							  new_page, FALSE))) {
					ut_error;
				}
				ut_ad(page_validate(new_page, index));

				if (UNIV_LIKELY_NULL(heap)) {
					mem_heap_free(heap);
				}

				return(NULL);
			}

			/* The page was reorganized: Seek to ret_pos. */
			ret = page_rec_get_nth(new_page, ret_pos);
		}
	}

	/* Update the lock table and possible hash index */

	if (dict_table_is_locking_disabled(index->table)) {
	} else if (dict_index_is_spatial(index)) {
		lock_rtr_move_rec_list(new_block, block, rec_move, num_moved);
	} else {
		lock_move_rec_list_start(new_block, block, rec, ret);
	}

	if (heap) {
		mem_heap_free(heap);
	}

	btr_search_move_or_delete_hash_entries(new_block, block);

	return(ret);
}
/**********************************************************//**
Writes a log record of a record list end or start deletion. */
UNIV_INLINE
void
page_delete_rec_list_write_log(
/*===========================*/
	rec_t*		rec,	/*!< in: record on page */
	dict_index_t*	index,	/*!< in: record descriptor */
	mlog_id_t	type,	/*!< in: operation type:
				MLOG_LIST_END_DELETE, ... */
	mtr_t*		mtr)	/*!< in: mtr */
{
	byte*	log_ptr;
	ut_ad(type == MLOG_LIST_END_DELETE
	      || type == MLOG_LIST_START_DELETE
	      || type == MLOG_COMP_LIST_END_DELETE
	      || type == MLOG_COMP_LIST_START_DELETE);

	log_ptr = mlog_open_and_write_index(mtr, rec, index, type, 2);
	if (log_ptr) {
		/* Write the parameter as a 2-byte ulint */
		mach_write_to_2(log_ptr, page_offset(rec));
		mlog_close(mtr, log_ptr + 2);
	}
}

/**********************************************************//**
Parses a log record of a record list end or start deletion.
@return end of log record or NULL */
byte*
page_parse_delete_rec_list(
/*=======================*/
	mlog_id_t	type,	/*!< in: MLOG_LIST_END_DELETE,
				MLOG_LIST_START_DELETE,
				MLOG_COMP_LIST_END_DELETE or
				MLOG_COMP_LIST_START_DELETE */
	byte*		ptr,	/*!< in: buffer */
	byte*		end_ptr,/*!< in: buffer end */
	buf_block_t*	block,	/*!< in/out: buffer block or NULL */
	dict_index_t*	index,	/*!< in: record descriptor */
	mtr_t*		mtr)	/*!< in: mtr or NULL */
{
	page_t*	page;
	ulint	offset;

	ut_ad(type == MLOG_LIST_END_DELETE
	      || type == MLOG_LIST_START_DELETE
	      || type == MLOG_COMP_LIST_END_DELETE
	      || type == MLOG_COMP_LIST_START_DELETE);

	/* Read the record offset as a 2-byte ulint */

	if (end_ptr < ptr + 2) {

		return(NULL);
	}

	offset = mach_read_from_2(ptr);
	ptr += 2;

	if (!block) {

		return(ptr);
	}

	page = buf_block_get_frame(block);

	ut_ad(!!page_is_comp(page) == dict_table_is_comp(index->table));

	if (type == MLOG_LIST_END_DELETE
	    || type == MLOG_COMP_LIST_END_DELETE) {
		page_delete_rec_list_end(page + offset, block, index,
					 ULINT_UNDEFINED, ULINT_UNDEFINED,
					 mtr);
	} else {
		page_delete_rec_list_start(page + offset, block, index, mtr);
	}

	return(ptr);
}
|
|
|
|
|
|
|
|
/*************************************************************//**
Deletes records from a page from a given record onward, including that record.
The infimum and supremum records are not deleted. */
void
page_delete_rec_list_end(
/*=====================*/
	rec_t*		rec,	/*!< in: pointer to record on page */
	buf_block_t*	block,	/*!< in: buffer block of the page */
	dict_index_t*	index,	/*!< in: record descriptor */
	ulint		n_recs,	/*!< in: number of records to delete,
				or ULINT_UNDEFINED if not known */
	ulint		size,	/*!< in: the sum of the sizes of the
				records in the end of the chain to
				delete, or ULINT_UNDEFINED if not known */
	mtr_t*		mtr)	/*!< in: mtr */
{
	page_dir_slot_t*slot;
	ulint		slot_index;
	rec_t*		last_rec;
	rec_t*		prev_rec;
	ulint		n_owned;
	page_zip_des_t*	page_zip	= buf_block_get_page_zip(block);
	page_t*		page		= page_align(rec);
	mem_heap_t*	heap		= NULL;
	ulint		offsets_[REC_OFFS_NORMAL_SIZE];
	ulint*		offsets		= offsets_;
	rec_offs_init(offsets_);

	ut_ad(size == ULINT_UNDEFINED || size < srv_page_size);
	ut_ad(!page_zip || page_rec_is_comp(rec));
#ifdef UNIV_ZIP_DEBUG
	ut_a(!page_zip || page_zip_validate(page_zip, page, index));
#endif /* UNIV_ZIP_DEBUG */

	if (page_rec_is_supremum(rec)) {
		ut_ad(n_recs == 0 || n_recs == ULINT_UNDEFINED);
		/* Nothing to do, there are no records bigger than the
		page supremum. */
		return;
	}

	if (recv_recovery_is_on()) {
		/* If we are replaying a redo log record, we must
		replay it exactly. Since MySQL 5.6.11, we should be
		generating a redo log record for page creation if
		the page would become empty. Thus, this branch should
		only be executed when applying redo log that was
		generated by an older version of MySQL. */
	} else if (page_rec_is_infimum(rec)
		   || n_recs == page_get_n_recs(page)) {
delete_all:
		/* We are deleting all records. */
		page_create_empty(block, index, mtr);
		return;
	} else if (page_is_comp(page)) {
		if (page_rec_get_next_low(page + PAGE_NEW_INFIMUM, 1) == rec) {
			/* We are deleting everything from the first
			user record onwards. */
			goto delete_all;
		}
	} else {
		if (page_rec_get_next_low(page + PAGE_OLD_INFIMUM, 0) == rec) {
			/* We are deleting everything from the first
			user record onwards. */
			goto delete_all;
		}
	}

	/* Reset the last insert info in the page header and increment
	the modify clock for the frame */

	page_header_set_ptr(page, page_zip, PAGE_LAST_INSERT, NULL);

	/* The page gets invalid for optimistic searches: increment the
	frame modify clock */

	buf_block_modify_clock_inc(block);

	page_delete_rec_list_write_log(rec, index, page_is_comp(page)
				       ? MLOG_COMP_LIST_END_DELETE
				       : MLOG_LIST_END_DELETE, mtr);
	const bool is_leaf = page_is_leaf(page);

	if (page_zip) {
		mtr_log_t	log_mode;

		ut_a(page_is_comp(page));
		/* Individual deletes are not logged */

		log_mode = mtr_set_log_mode(mtr, MTR_LOG_NONE);

		do {
			page_cur_t	cur;
			page_cur_position(rec, block, &cur);

			offsets = rec_get_offsets(rec, index, offsets, is_leaf,
						  ULINT_UNDEFINED, &heap);
			rec = rec_get_next_ptr(rec, TRUE);
#ifdef UNIV_ZIP_DEBUG
			ut_a(page_zip_validate(page_zip, page, index));
#endif /* UNIV_ZIP_DEBUG */
			page_cur_delete_rec(&cur, index, offsets, mtr);
		} while (page_offset(rec) != PAGE_NEW_SUPREMUM);

		if (UNIV_LIKELY_NULL(heap)) {
			mem_heap_free(heap);
		}

		/* Restore log mode */

		mtr_set_log_mode(mtr, log_mode);
		return;
	}

	prev_rec = page_rec_get_prev(rec);

	last_rec = page_rec_get_prev(page_get_supremum_rec(page));

	bool scrub = srv_immediate_scrub_data_uncompressed;

	if ((size == ULINT_UNDEFINED) || (n_recs == ULINT_UNDEFINED)
	    || scrub) {
		rec_t*		rec2	= rec;
		/* Calculate the sum of sizes and the number of records */
		size = 0;
		n_recs = 0;

		do {
			ulint	s;
			offsets = rec_get_offsets(rec2, index, offsets,
						  is_leaf,
						  ULINT_UNDEFINED, &heap);
			s = rec_offs_size(offsets);
			ut_ad(ulint(rec2 - page) + s
			      - rec_offs_extra_size(offsets)
			      < srv_page_size);
			ut_ad(size + s < srv_page_size);
			size += s;
			n_recs++;

			if (scrub) {
				/* scrub record */
				memset(rec2, 0, rec_offs_data_size(offsets));
			}

			rec2 = page_rec_get_next(rec2);
		} while (!page_rec_is_supremum(rec2));

		if (UNIV_LIKELY_NULL(heap)) {
			mem_heap_free(heap);
		}
	}

	ut_ad(size < srv_page_size);

	/* Update the page directory; there is no need to balance the number
	of the records owned by the supremum record, as it is allowed to be
	less than PAGE_DIR_SLOT_MIN_N_OWNED */

	if (page_is_comp(page)) {
		rec_t*	rec2	= rec;
		ulint	count	= 0;

		while (rec_get_n_owned_new(rec2) == 0) {
			count++;

			rec2 = rec_get_next_ptr(rec2, TRUE);
		}

		ut_ad(rec_get_n_owned_new(rec2) > count);

		n_owned = rec_get_n_owned_new(rec2) - count;
		slot_index = page_dir_find_owner_slot(rec2);
		ut_ad(slot_index > 0);
		slot = page_dir_get_nth_slot(page, slot_index);
	} else {
		rec_t*	rec2	= rec;
		ulint	count	= 0;

		while (rec_get_n_owned_old(rec2) == 0) {
			count++;

			rec2 = rec_get_next_ptr(rec2, FALSE);
		}

		ut_ad(rec_get_n_owned_old(rec2) > count);

		n_owned = rec_get_n_owned_old(rec2) - count;
		slot_index = page_dir_find_owner_slot(rec2);
		ut_ad(slot_index > 0);
		slot = page_dir_get_nth_slot(page, slot_index);
	}

	page_dir_slot_set_rec(slot, page_get_supremum_rec(page));
	page_dir_slot_set_n_owned(slot, NULL, n_owned);

	page_dir_set_n_slots(page, NULL, slot_index + 1);

	/* Remove the record chain segment from the record chain */
	page_rec_set_next(prev_rec, page_get_supremum_rec(page));

	/* Catenate the deleted chain segment to the page free list */

	page_rec_set_next(last_rec, page_header_get_ptr(page, PAGE_FREE));
	page_header_set_ptr(page, NULL, PAGE_FREE, rec);

	page_header_set_field(page, NULL, PAGE_GARBAGE, size
			      + page_header_get_field(page, PAGE_GARBAGE));

	ut_ad(page_get_n_recs(page) > n_recs);
	page_header_set_field(page, NULL, PAGE_N_RECS,
			      (ulint)(page_get_n_recs(page) - n_recs));
}

/*************************************************************//**
Deletes records from page, up to the given record, NOT including
that record. Infimum and supremum records are not deleted. */
void
page_delete_rec_list_start(
/*=======================*/
	rec_t*		rec,	/*!< in: record on page */
	buf_block_t*	block,	/*!< in: buffer block of the page */
	dict_index_t*	index,	/*!< in: record descriptor */
	mtr_t*		mtr)	/*!< in: mtr */
{
	page_cur_t	cur1;
	ulint		offsets_[REC_OFFS_NORMAL_SIZE];
	ulint*		offsets		= offsets_;
	mem_heap_t*	heap		= NULL;

	rec_offs_init(offsets_);

	ut_ad((ibool) !!page_rec_is_comp(rec)
	      == dict_table_is_comp(index->table));
#ifdef UNIV_ZIP_DEBUG
	{
		page_zip_des_t*	page_zip= buf_block_get_page_zip(block);
		page_t*		page	= buf_block_get_frame(block);

		/* page_zip_validate() would detect a min_rec_mark mismatch
		in btr_page_split_and_insert()
		between btr_attach_half_pages() and insert_page = ...
		when btr_page_get_split_rec_to_left() holds
		(direction == FSP_DOWN). */
		ut_a(!page_zip
		     || page_zip_validate_low(page_zip, page, index, TRUE));
	}
#endif /* UNIV_ZIP_DEBUG */

	if (page_rec_is_infimum(rec)) {
		return;
	}

	if (page_rec_is_supremum(rec)) {
		/* We are deleting all records. */
		page_create_empty(block, index, mtr);
		return;
	}

	mlog_id_t	type;

	if (page_rec_is_comp(rec)) {
		type = MLOG_COMP_LIST_START_DELETE;
	} else {
		type = MLOG_LIST_START_DELETE;
	}

	page_delete_rec_list_write_log(rec, index, type, mtr);

	page_cur_set_before_first(block, &cur1);
	page_cur_move_to_next(&cur1);

	/* Individual deletes are not logged */

	mtr_log_t	log_mode = mtr_set_log_mode(mtr, MTR_LOG_NONE);
MDEV-11369 Instant ADD COLUMN for InnoDB
For InnoDB tables, adding, dropping and reordering columns has
required a rebuild of the table and all its indexes. Since MySQL 5.6
(and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing
concurrent modification of the tables.
This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT
and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously,
with only minor changes performed to the table structure. The counter
innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS
is incremented whenever a table rebuild operation is converted into
an instant ADD COLUMN operation.
ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.
Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY
and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE
The format of the clustered index (PRIMARY KEY) is changed as follows:
(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT,
and a new field PAGE_INSTANT will contain the original number of fields
in the clustered index ('core' fields).
If instant ADD COLUMN has not been used or the table becomes empty,
or the very first instant ADD COLUMN operation is rolled back,
the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset
to 0 and FIL_PAGE_INDEX.
(2) A special 'default row' record is inserted into the leftmost leaf,
between the page infimum and the first user record. This record is
distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the
same format as records that contain values for the instantly added
columns. This 'default row' always has the same number of fields as
the clustered index according to the table definition. The values of
'core' fields are to be ignored. For other fields, the 'default row'
will contain the default values as they were during the ALTER TABLE
statement. (If the column default values are changed later, those
values will only be stored in the .frm file. The 'default row' will
contain the original evaluated values, which must be the same for
every row.) The 'default row' must be completely hidden from
higher-level access routines. Assertions have been added to ensure
that no 'default row' is ever present in the adaptive hash index
or in locked records. The 'default row' is never delete-marked.
(3) In clustered index leaf page records, the number of fields must
reside between the number of 'core' fields (dict_index_t::n_core_fields
introduced in this work) and dict_index_t::n_fields. If the number
of fields is less than dict_index_t::n_fields, the missing fields
are replaced with the column value of the 'default row'.
Note: The number of fields in the record may shrink if some of the
last instantly added columns are updated to the value that is
in the 'default row'. The function btr_cur_trim() implements this
'compression' on update and rollback; dtuple::trim() implements it
on insert.
(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new
status value REC_STATUS_COLUMNS_ADDED will indicate the presence of
a new record header that will encode n_fields-n_core_fields-1 in
1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header
always explicitly encodes the number of fields.)
We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for
covering the insert of the 'default row' record when instant ADD COLUMN
is used for the first time. Subsequent instant ADD COLUMN can use
TRX_UNDO_UPD_EXIST_REC.
This is joint work with Vin Chen (陈福荣) from Tencent. The design
that was discussed in April 2017 would not have allowed import or
export of data files, because instead of the 'default row' it would
have introduced a data dictionary table. The test
rpl.rpl_alter_instant is exactly as contributed in pull request #408.
The test innodb.instant_alter is based on a contributed test.
The redo log record format changes for ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPACT are as contributed. (With this change present,
crash recovery from MariaDB 10.3.1 will fail in spectacular ways!)
Also the semantics of higher-level redo log records that modify the
PAGE_INSTANT field is changed. The redo log format version identifier
was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.
Everything else has been rewritten by me. Thanks to Elena Stepanova,
the code has been tested extensively.
When rolling back an instant ADD COLUMN operation, we must empty the
PAGE_FREE list after deleting or shortening the 'default row' record,
by calling either btr_page_empty() or btr_page_reorganize(). We must
know the size of each entry in the PAGE_FREE list. If rollback left a
freed copy of the 'default row' in the PAGE_FREE list, we would be
unable to determine its size (if it is in ROW_FORMAT=COMPACT or
ROW_FORMAT=DYNAMIC) because it would contain more fields than the
rolled-back definition of the clustered index.
UNIV_SQL_DEFAULT: A new special constant that designates an instantly
added column that is not present in the clustered index record.
len_is_stored(): Check if a length is an actual length. There are
two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
dict_col_t::def_val: The 'default row' value of the column. If the
column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(),
instant_value().
dict_col_t::remove_instant(): Remove the 'instant ADD' status of
a column.
dict_col_t::name(const dict_table_t& table): Replaces
dict_table_get_col_name().
dict_index_t::n_core_fields: The original number of fields.
For secondary indexes and if instant ADD COLUMN has not been used,
this will be equal to dict_index_t::n_fields.
dict_index_t::n_core_null_bytes: Number of bytes needed to
represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that
n_core_null_bytes was not initialized yet from the clustered index
root page.
dict_index_t: Add the accessors is_instant(), is_clust(),
get_n_nullable(), instant_field_value().
dict_index_t::instant_add_field(): Adjust clustered index metadata
for instant ADD COLUMN.
dict_index_t::remove_instant(): Remove the 'instant ADD' status
of a clustered index when the table becomes empty, or the very first
instant ADD COLUMN operation is rolled back.
dict_table_t: Add the accessors is_instant(), is_temporary(),
supports_instant().
dict_table_t::instant_add_column(): Adjust metadata for
instant ADD COLUMN.
dict_table_t::rollback_instant(): Adjust metadata on the rollback
of instant ADD COLUMN.
prepare_inplace_alter_table_dict(): First create the ctx->new_table,
and only then decide if the table really needs to be rebuilt.
We must split the creation of table or index metadata from the
creation of the dictionary table records and the creation of
the data. In this way, we can transform a table-rebuilding operation
into an instant ADD COLUMN operation. Dictionary objects will only
be added to cache when table rebuilding or index creation is needed.
The ctx->instant_table will never be added to cache.
dict_table_t::add_to_cache(): Modified and renamed from
dict_table_add_to_cache(). Do not modify the table metadata.
Let the callers invoke dict_table_add_system_columns() and if needed,
set can_be_evicted.
dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the
system columns (which will now exist in the dict_table_t object
already at this point).
dict_create_table_step(): Expect the callers to invoke
dict_table_add_system_columns().
pars_create_table(): Before creating the table creation execution
graph, invoke dict_table_add_system_columns().
row_create_table_for_mysql(): Expect all callers to invoke
dict_table_add_system_columns().
create_index_dict(): Replaces row_merge_create_index_graph().
innodb_update_n_cols(): Renamed from innobase_update_n_virtual().
Call my_error() if an error occurs.
btr_cur_instant_init(), btr_cur_instant_init_low(),
btr_cur_instant_root_init():
Load additional metadata from the clustered index and set
dict_index_t::n_core_null_bytes. This is invoked
when table metadata is first loaded into the data dictionary.
dict_boot(): Initialize n_core_null_bytes for the four hard-coded
dictionary tables.
dict_create_index_step(): Initialize n_core_null_bytes. This is
executed as part of CREATE TABLE.
dict_index_build_internal_clust(): Initialize n_core_null_bytes to
NO_CORE_NULL_BYTES if table->supports_instant().
row_create_index_for_mysql(): Initialize n_core_null_bytes for
CREATE TEMPORARY TABLE.
commit_cache_norebuild(): Call the code to rename or enlarge columns
in the cache only if instant ADD COLUMN is not being used.
(Instant ADD COLUMN would copy all column metadata from
instant_table to old_table, including the names and lengths.)
PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields.
This is repurposing the 16-bit field PAGE_DIRECTION, of which only the
least significant 3 bits were used. The original byte containing
PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.
page_ptr_get_direction(), page_get_direction(),
page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
page_direction_increment(): Increment PAGE_N_DIRECTION
and set PAGE_DIRECTION.
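The bit layout described above can be illustrated with a small standalone sketch. This is not the InnoDB implementation; the helper names are invented, but it follows the stated layout: the low 3 bits of the 16-bit word keep the PAGE_DIRECTION value, and the upper 13 bits hold the PAGE_INSTANT field count.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch only (helper names invented): pack a 13-bit field
// count into the upper bits of the 16-bit header word whose least
// significant 3 bits remain the direction value.
inline uint16_t set_instant(uint16_t word, uint16_t n_core_fields)
{
        assert(n_core_fields < (1U << 13));     // must fit in 13 bits
        return uint16_t((word & 7U) | (n_core_fields << 3));
}

inline uint16_t get_instant(uint16_t word)   { return uint16_t(word >> 3); }
inline uint16_t get_direction(uint16_t word) { return uint16_t(word & 7U); }
```

Writing the instant field count leaves the direction bits untouched, which is why the repurposing is backward compatible with pages that only ever used the low 3 bits.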
rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes,
and assume that heap_no is always set.
Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records,
even if the record contains fewer fields.
rec_offs_make_valid(): Add the parameter 'leaf'.
rec_copy_prefix_to_dtuple(): Assert that the tuple is only built
on the core fields. Instant ADD COLUMN only applies to the
clustered index, and we should never build a search key that has
more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR.
All these columns are always present.
dict_index_build_data_tuple(): Remove assertions that would be
duplicated in rec_copy_prefix_to_dtuple().
rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose
number of fields is between n_core_fields and n_fields.
cmp_rec_rec_with_match(): Implement the comparison between two
MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance.
A dictionary transaction must be flagged as such before it has generated
any undo log records. This is because trx_undo_assign_undo() will mark
the transaction as a dictionary transaction in the undo log header
right before the very first undo log record is being written.
btr_index_rec_validate(): Account for instant ADD COLUMN.
row_undo_ins_remove_clust_rec(): On the rollback of an insert into
SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
trx_undo_update_rec_get_update(): Handle the 'default row'
as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
before insert or update. After instant ADD COLUMN, if the last fields
of a clustered index tuple match the 'default row', there is no
need to store them. While trimming the entry, we must hold a page latch,
so that the table cannot be emptied and the 'default row' be deleted.
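The trimming rule can be sketched in isolation. This is a toy model, not dtuple_t::trim() itself (the function name and field representation are invented): a trailing run of fields equal to the 'default row' values is omitted, but the first n_core fields (PRIMARY KEY and system columns) are always kept.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal sketch (names invented) of the trim idea: return how many
// leading fields of the index entry actually need to be stored.
inline size_t trimmed_n_fields(const std::vector<std::string>& entry,
                               const std::vector<std::string>& defaults,
                               size_t n_core)
{
        size_t n = entry.size();
        // Omit a suffix of fields whose values equal the 'default row'.
        while (n > n_core && entry[n - 1] == defaults[n - 1]) {
                n--;
        }
        return n;
}
```

Note that a non-matching field stops the scan: only a contiguous suffix may be dropped, because the record format has no way to mark a hole in the middle of the field list.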
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
of fields to be between n_core_fields and n_fields. Do not support
infimum,supremum. They are never supposed to be stored in dtuple_t,
because page creation nowadays uses a lower-level method for initializing
them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the
'default row', handle rollback specially. For user records, omit
fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Skip locking and adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
In the temporary file that is applied by row_log_table_apply(),
we must identify whether the records contain the extra header for
instantly added columns. For now, we will allocate an additional byte
for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
fine, as they will be converted and will only contain 'core' columns
(PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(),
rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
An info_bits constant for distinguishing the 'default row' record.
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of
rec_init_offsets_comp_ordinary().
	const bool is_leaf = page_rec_is_leaf(rec);

	while (page_cur_get_rec(&cur1) != rec) {
		offsets = rec_get_offsets(page_cur_get_rec(&cur1), index,
					  offsets, is_leaf,
					  ULINT_UNDEFINED, &heap);
		page_cur_delete_rec(&cur1, index, offsets, mtr);
	}

	if (UNIV_LIKELY_NULL(heap)) {
		mem_heap_free(heap);
	}

	/* Restore log mode */

	mtr_set_log_mode(mtr, log_mode);
}

/*************************************************************//**
Moves record list end to another page. Moved records include
split_rec.

IMPORTANT: The caller will have to update IBUF_BITMAP_FREE
if new_block is a compressed leaf page in a secondary index.
This has to be done either within the same mini-transaction,
or by invoking ibuf_reset_free_bits() before mtr_commit().

@return TRUE on success; FALSE on compression failure (new_block will
be decompressed) */
ibool
page_move_rec_list_end(
/*===================*/
	buf_block_t*	new_block,	/*!< in/out: index page where to move */
	buf_block_t*	block,		/*!< in: index page from where to move */
	rec_t*		split_rec,	/*!< in: first record to move */
	dict_index_t*	index,		/*!< in: record descriptor */
	mtr_t*		mtr)		/*!< in: mtr */
{
	page_t*		new_page	= buf_block_get_frame(new_block);
	ulint		old_data_size;
	ulint		new_data_size;
	ulint		old_n_recs;
	ulint		new_n_recs;

	ut_ad(!dict_index_is_spatial(index));

	old_data_size = page_get_data_size(new_page);
	old_n_recs = page_get_n_recs(new_page);
#ifdef UNIV_ZIP_DEBUG
	{
		page_zip_des_t*	new_page_zip
			= buf_block_get_page_zip(new_block);
		page_zip_des_t*	page_zip
			= buf_block_get_page_zip(block);
		ut_a(!new_page_zip == !page_zip);
		ut_a(!new_page_zip
		     || page_zip_validate(new_page_zip, new_page, index));
		ut_a(!page_zip
		     || page_zip_validate(page_zip, page_align(split_rec),
					  index));
	}
#endif /* UNIV_ZIP_DEBUG */

	if (UNIV_UNLIKELY(!page_copy_rec_list_end(new_block, block,
						  split_rec, index, mtr))) {
		return(FALSE);
	}

	new_data_size = page_get_data_size(new_page);
	new_n_recs = page_get_n_recs(new_page);

	ut_ad(new_data_size >= old_data_size);

	page_delete_rec_list_end(split_rec, block, index,
				 new_n_recs - old_n_recs,
				 new_data_size - old_data_size, mtr);

	return(TRUE);
}

/*************************************************************//**
Moves record list start to another page. Moved records do not include
split_rec.

IMPORTANT: The caller will have to update IBUF_BITMAP_FREE
if new_block is a compressed leaf page in a secondary index.
This has to be done either within the same mini-transaction,
or by invoking ibuf_reset_free_bits() before mtr_commit().

@return TRUE on success; FALSE on compression failure */
ibool
page_move_rec_list_start(
/*=====================*/
	buf_block_t*	new_block,	/*!< in/out: index page where to move */
	buf_block_t*	block,		/*!< in/out: page containing split_rec */
	rec_t*		split_rec,	/*!< in: first record not to move */
	dict_index_t*	index,		/*!< in: record descriptor */
	mtr_t*		mtr)		/*!< in: mtr */
{
	if (UNIV_UNLIKELY(!page_copy_rec_list_start(new_block, block,
						    split_rec, index, mtr))) {
		return(FALSE);
	}

	page_delete_rec_list_start(split_rec, block, index, mtr);

	return(TRUE);
}

/**************************************************************//**
Used to delete n slots from the directory. This function updates
also n_owned fields in the records, so that the first slot after
the deleted ones inherits the records of the deleted slots. */
UNIV_INLINE
void
page_dir_delete_slot(
/*=================*/
	page_t*		page,	/*!< in/out: the index page */
	page_zip_des_t*	page_zip,/*!< in/out: compressed page, or NULL */
	ulint		slot_no)/*!< in: slot to be deleted */
{
	page_dir_slot_t*	slot;
	ulint			n_owned;
	ulint			i;
	ulint			n_slots;

	ut_ad(!page_zip || page_is_comp(page));
	ut_ad(slot_no > 0);
	ut_ad(slot_no + 1 < page_dir_get_n_slots(page));

	n_slots = page_dir_get_n_slots(page);

	/* 1. Reset the n_owned fields of the slots to be
	deleted */
	slot = page_dir_get_nth_slot(page, slot_no);
	n_owned = page_dir_slot_get_n_owned(slot);
	page_dir_slot_set_n_owned(slot, page_zip, 0);

	/* 2. Update the n_owned value of the first non-deleted slot */

	slot = page_dir_get_nth_slot(page, slot_no + 1);
	page_dir_slot_set_n_owned(slot, page_zip,
				  n_owned + page_dir_slot_get_n_owned(slot));

	/* 3. Destroy the slot by copying slots */
	for (i = slot_no + 1; i < n_slots; i++) {
		rec_t*	rec = (rec_t*)
			page_dir_slot_get_rec(page_dir_get_nth_slot(page, i));
		page_dir_slot_set_rec(page_dir_get_nth_slot(page, i - 1), rec);
	}

	/* 4. Zero out the last slot, which will be removed */
	mach_write_to_2(page_dir_get_nth_slot(page, n_slots - 1), 0);

	/* 5. Update the page header */
	page_header_set_field(page, page_zip, PAGE_N_DIR_SLOTS, n_slots - 1);
}

/**************************************************************//**
Used to add n slots to the directory. Does not set the record pointers
in the added slots or update n_owned values: this is the responsibility
of the caller. */
UNIV_INLINE
void
page_dir_add_slot(
/*==============*/
	page_t*		page,	/*!< in/out: the index page */
	page_zip_des_t*	page_zip,/*!< in/out: compressed page, or NULL */
	ulint		start)	/*!< in: the slot above which the new slots
				are added */
{
	page_dir_slot_t*	slot;
	ulint			n_slots;

	n_slots = page_dir_get_n_slots(page);

	ut_ad(start < n_slots - 1);

	/* Update the page header */
	page_dir_set_n_slots(page, page_zip, n_slots + 1);

	/* Move slots up */
	slot = page_dir_get_nth_slot(page, n_slots);
	memmove(slot, slot + PAGE_DIR_SLOT_SIZE,
		(n_slots - 1 - start) * PAGE_DIR_SLOT_SIZE);
}

/****************************************************************//**
Splits a directory slot which owns too many records. */
void
page_dir_split_slot(
/*================*/
	page_t*		page,	/*!< in/out: index page */
	page_zip_des_t*	page_zip,/*!< in/out: compressed page whose
				uncompressed part will be written, or NULL */
	ulint		slot_no)/*!< in: the directory slot */
{
	rec_t*			rec;
	page_dir_slot_t*	new_slot;
	page_dir_slot_t*	prev_slot;
	page_dir_slot_t*	slot;
	ulint			i;
	ulint			n_owned;

	ut_ad(!page_zip || page_is_comp(page));
	ut_ad(slot_no > 0);

	slot = page_dir_get_nth_slot(page, slot_no);

	n_owned = page_dir_slot_get_n_owned(slot);
	ut_ad(n_owned == PAGE_DIR_SLOT_MAX_N_OWNED + 1);

	/* 1. We loop to find a record approximately in the middle of the
	records owned by the slot. */

	prev_slot = page_dir_get_nth_slot(page, slot_no - 1);
	rec = (rec_t*) page_dir_slot_get_rec(prev_slot);

	for (i = 0; i < n_owned / 2; i++) {
		rec = page_rec_get_next(rec);
	}

	ut_ad(n_owned / 2 >= PAGE_DIR_SLOT_MIN_N_OWNED);

	/* 2. We add one directory slot immediately below the slot to be
	split. */

	page_dir_add_slot(page, page_zip, slot_no - 1);

	/* The added slot is now number slot_no, and the old slot is
	now number slot_no + 1 */

	new_slot = page_dir_get_nth_slot(page, slot_no);
	slot = page_dir_get_nth_slot(page, slot_no + 1);

	/* 3. We store the appropriate values to the new slot. */

	page_dir_slot_set_rec(new_slot, rec);
	page_dir_slot_set_n_owned(new_slot, page_zip, n_owned / 2);

	/* 4. Finally, we update the number of records field of the
	original slot */

	page_dir_slot_set_n_owned(slot, page_zip, n_owned - (n_owned / 2));
}

/*************************************************************//**
Tries to balance the given directory slot with too few records with the upper
neighbor, so that there are at least the minimum number of records owned by
the slot; this may result in the merging of two slots. */
void
page_dir_balance_slot(
/*==================*/
	page_t*		page,	/*!< in/out: index page */
	page_zip_des_t*	page_zip,/*!< in/out: compressed page, or NULL */
	ulint		slot_no)/*!< in: the directory slot */
{
	page_dir_slot_t*	slot;
	page_dir_slot_t*	up_slot;
	ulint			n_owned;
	ulint			up_n_owned;
	rec_t*			old_rec;
	rec_t*			new_rec;

	ut_ad(!page_zip || page_is_comp(page));
	ut_ad(slot_no > 0);

	slot = page_dir_get_nth_slot(page, slot_no);

	/* The last directory slot cannot be balanced with the upper
	neighbor, as there is none. */

	if (UNIV_UNLIKELY(slot_no + 1 == page_dir_get_n_slots(page))) {
		return;
	}

	up_slot = page_dir_get_nth_slot(page, slot_no + 1);

	n_owned = page_dir_slot_get_n_owned(slot);
	up_n_owned = page_dir_slot_get_n_owned(up_slot);

	ut_ad(n_owned == PAGE_DIR_SLOT_MIN_N_OWNED - 1);

	/* If the upper slot has the minimum value of n_owned, we will merge
	the two slots, therefore we assert: */
	ut_ad(2 * PAGE_DIR_SLOT_MIN_N_OWNED - 1 <= PAGE_DIR_SLOT_MAX_N_OWNED);

	if (up_n_owned > PAGE_DIR_SLOT_MIN_N_OWNED) {

		/* In this case we can just transfer one record owned
		by the upper slot to the property of the lower slot */
		old_rec = (rec_t*) page_dir_slot_get_rec(slot);

		if (page_is_comp(page)) {
			new_rec = rec_get_next_ptr(old_rec, TRUE);

			rec_set_n_owned_new(old_rec, page_zip, 0);
			rec_set_n_owned_new(new_rec, page_zip, n_owned + 1);
		} else {
			new_rec = rec_get_next_ptr(old_rec, FALSE);

			rec_set_n_owned_old(old_rec, 0);
			rec_set_n_owned_old(new_rec, n_owned + 1);
		}

		page_dir_slot_set_rec(slot, new_rec);

		page_dir_slot_set_n_owned(up_slot, page_zip, up_n_owned - 1);
	} else {
		/* In this case we may merge the two slots */
		page_dir_delete_slot(page, page_zip, slot_no);
	}
}

/************************************************************//**
Returns the nth record of the record list.
This is the inverse function of page_rec_get_n_recs_before().
@return nth record */
const rec_t*
page_rec_get_nth_const(
/*===================*/
	const page_t*	page,	/*!< in: page */
	ulint		nth)	/*!< in: nth record */
{
	const page_dir_slot_t*	slot;
	ulint			i;
	ulint			n_owned;
	const rec_t*		rec;

	if (nth == 0) {
		return(page_get_infimum_rec(page));
	}

	ut_ad(nth < srv_page_size / (REC_N_NEW_EXTRA_BYTES + 1));

	for (i = 0;; i++) {

		slot = page_dir_get_nth_slot(page, i);
		n_owned = page_dir_slot_get_n_owned(slot);

		if (n_owned > nth) {
			break;
		} else {
			nth -= n_owned;
		}
	}

	ut_ad(i > 0);
	slot = page_dir_get_nth_slot(page, i - 1);
	rec = page_dir_slot_get_rec(slot);

	if (page_is_comp(page)) {
		do {
			rec = page_rec_get_next_low(rec, TRUE);
			ut_ad(rec);
		} while (nth--);
	} else {
		do {
			rec = page_rec_get_next_low(rec, FALSE);
			ut_ad(rec);
		} while (nth--);
	}

	return(rec);
}

/***************************************************************//**
Returns the number of records before the given record in chain.
The number includes infimum and supremum records.
@return number of records */
ulint
page_rec_get_n_recs_before(
/*=======================*/
	const rec_t*	rec)	/*!< in: the physical record */
{
	const page_dir_slot_t*	slot;
	const rec_t*		slot_rec;
	const page_t*		page;
	ulint			i;
	lint			n	= 0;

	ut_ad(page_rec_check(rec));

	page = page_align(rec);
	if (page_is_comp(page)) {
		while (rec_get_n_owned_new(rec) == 0) {

			rec = rec_get_next_ptr_const(rec, TRUE);
			n--;
		}

		for (i = 0; ; i++) {
			slot = page_dir_get_nth_slot(page, i);
			slot_rec = page_dir_slot_get_rec(slot);

			n += lint(rec_get_n_owned_new(slot_rec));

			if (rec == slot_rec) {

				break;
			}
		}
	} else {
		while (rec_get_n_owned_old(rec) == 0) {

			rec = rec_get_next_ptr_const(rec, FALSE);
			n--;
		}

		for (i = 0; ; i++) {
			slot = page_dir_get_nth_slot(page, i);
			slot_rec = page_dir_slot_get_rec(slot);

			n += lint(rec_get_n_owned_old(slot_rec));

			if (rec == slot_rec) {

				break;
			}
		}
	}

	n--;

	ut_ad(n >= 0);
	ut_ad((ulong) n < srv_page_size / (REC_N_NEW_EXTRA_BYTES + 1));

	return((ulint) n);
}

/************************************************************//**
Prints record contents including the data relevant only in
the index page context. */
void
page_rec_print(
/*===========*/
	const rec_t*	rec,	/*!< in: physical record */
	const ulint*	offsets)/*!< in: record descriptor */
{
	ut_a(!page_rec_is_comp(rec) == !rec_offs_comp(offsets));
	rec_print_new(stderr, rec, offsets);
	if (page_rec_is_comp(rec)) {
		ib::info() << "n_owned: " << rec_get_n_owned_new(rec)
			<< "; heap_no: " << rec_get_heap_no_new(rec)
			<< "; next rec: " << rec_get_next_offs(rec, TRUE);
	} else {
		ib::info() << "n_owned: " << rec_get_n_owned_old(rec)
			<< "; heap_no: " << rec_get_heap_no_old(rec)
			<< "; next rec: " << rec_get_next_offs(rec, FALSE);
	}

	page_rec_check(rec);
	rec_validate(rec, offsets);
}

#ifdef UNIV_BTR_PRINT
/***************************************************************//**
This is used to print the contents of the directory for
debugging purposes. */
void
page_dir_print(
/*===========*/
	page_t*	page,	/*!< in: index page */
	ulint	pr_n)	/*!< in: print n first and n last entries */
{
	ulint			n;
	ulint			i;
	page_dir_slot_t*	slot;

	n = page_dir_get_n_slots(page);

	fprintf(stderr, "--------------------------------\n"
		"PAGE DIRECTORY\n"
		"Page address %p\n"
		"Directory stack top at offs: %lu; number of slots: %lu\n",
		page, (ulong) page_offset(page_dir_get_nth_slot(page, n - 1)),
		(ulong) n);
	for (i = 0; i < n; i++) {
		slot = page_dir_get_nth_slot(page, i);
		if ((i == pr_n) && (i < n - pr_n)) {
			fputs("    ...   \n", stderr);
		}
		if ((i < pr_n) || (i >= n - pr_n)) {
			fprintf(stderr,
				"Contents of slot: %lu: n_owned: %lu,"
				" rec offs: %lu\n",
				(ulong) i,
				(ulong) page_dir_slot_get_n_owned(slot),
				(ulong)
				page_offset(page_dir_slot_get_rec(slot)));
		}
	}
	fprintf(stderr, "Total of %lu records\n"
		"--------------------------------\n",
		(ulong) (PAGE_HEAP_NO_USER_LOW + page_get_n_recs(page)));
}

/***************************************************************//**
This is used to print the contents of the page record list for
debugging purposes. */
void
page_print_list(
/*============*/
	buf_block_t*	block,	/*!< in: index page */
	dict_index_t*	index,	/*!< in: dictionary index of the page */
	ulint		pr_n)	/*!< in: print n first and n last entries */
{
	page_t*		page		= block->frame;
	page_cur_t	cur;
	ulint		count;
	ulint		n_recs;
	mem_heap_t*	heap		= NULL;
	ulint		offsets_[REC_OFFS_NORMAL_SIZE];
	ulint*		offsets		= offsets_;
	rec_offs_init(offsets_);

	ut_a((ibool)!!page_is_comp(page) == dict_table_is_comp(index->table));

	fprintf(stderr,
		"--------------------------------\n"
		"PAGE RECORD LIST\n"
		"Page address %p\n", page);

	n_recs = page_get_n_recs(page);

	page_cur_set_before_first(block, &cur);
	count = 0;
	for (;;) {
		offsets = rec_get_offsets(cur.rec, index, offsets,
MDEV-15662 Instant DROP COLUMN or changing the order of columns
Allow ADD COLUMN anywhere in a table, not only adding as the
last column.
Allow instant DROP COLUMN and instant changing the order of columns.
The added columns will always be added last in clustered index records.
In new records, instantly dropped columns will be stored as NULL or
empty when possible.
Information about dropped and reordered columns will be written in
a metadata BLOB (mblob), which is stored before the first 'user' field
in the hidden metadata record at the start of the clustered index.
The presence of mblob is indicated by setting the delete-mark flag in
the metadata record.
The metadata BLOB stores the number of clustered index fields,
followed by an array of column information for each field.
For dropped columns, we store the NOT NULL flag, the fixed length,
and for variable-length columns, whether the maximum length exceeded
255 bytes. For non-dropped columns, we store the column position.
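A self-contained sketch of such a per-field encoding may make the description concrete. The bit assignments below are invented for illustration, not the actual mblob format: one word per field, with a dropped flag, a NOT NULL flag for dropped columns, and the fixed length (dropped) or column position (surviving) in the low bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-field metadata word (layout invented for illustration):
// bit 15 = dropped, bit 14 = NOT NULL (only meaningful when dropped),
// bits 0..13 = fixed length if dropped, else column position.
struct FieldMeta {
        bool     dropped;
        bool     not_null;
        uint16_t len_or_pos;
};

inline uint16_t encode_field(const FieldMeta& f)
{
        uint16_t w = uint16_t(f.len_or_pos & 0x3FFF);
        if (f.dropped)  w |= 0x8000;
        if (f.not_null) w |= 0x4000;
        return w;
}

inline FieldMeta decode_field(uint16_t w)
{
        FieldMeta f;
        f.dropped    = (w & 0x8000) != 0;
        f.not_null   = (w & 0x4000) != 0;
        f.len_or_pos = uint16_t(w & 0x3FFF);
        return f;
}
```

The point of the round-trippable encoding is that a reader which only has the serialized array can still compute record offsets for dropped columns (NOT NULL flag plus fixed length) without consulting the data dictionary.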
Unlike with MDEV-11369, when a table becomes empty, it cannot
be converted back to the canonical format. The reason for this is
that other threads may hold cached objects such as
row_prebuilt_t::ins_node that could refer to dropped or reordered
index fields.
For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC,
we must store the n_core_null_bytes in the root page, so that the
chain of node pointer records can be followed in order to reach the
leftmost leaf page where the metadata record is located.
If the mblob is present, we will zero-initialize the strings
"infimum" and "supremum" in the root page, and use the last byte of
"supremum" for storing the number of null bytes (which are allocated
but useless on node pointer pages). This is necessary for
btr_cur_instant_init_metadata() to be able to navigate to the mblob.
If the PRIMARY KEY contains any variable-length column and some
nullable columns were instantly dropped, the dict_index_t::n_nullable
in the data dictionary could be smaller than it actually is in the
non-leaf pages. Because of this, the non-leaf pages could use more
bytes for the null flags than the data dictionary expects, and we
could be reading the lengths of the variable-length columns from the
wrong offset, and thus reading the child page number from wrong place.
This is the result of two design mistakes that involve unnecessary
storage of data: First, it is nonsense to store any data fields for
the leftmost node pointer records, because the comparisons would be
resolved by the MIN_REC_FLAG alone. Second, there cannot be any null
fields in the clustered index node pointer fields, but we nevertheless
reserve space for all the null flags.
Limitations (future work):
MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists
MDEV-17468 Avoid table rebuild on operations on generated columns
MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large
btr_page_reorganize_low(): Preserve any metadata in the root page.
Call lock_move_reorganize_page() only after restoring the "infimum"
and "supremum" records, to avoid a memcmp() assertion failure.
dict_col_t::DROPPED: Magic value for dict_col_t::ind.
dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant().
Do not assert that the column was instantly added, because we
sometimes call this unconditionally for all columns.
Convert an instantly added column to a "core column". The old name
remove_instant() could be mistaken to refer to "instant DROP COLUMN".
dict_col_t::is_added(): Rename from dict_col_t::is_instant().
dtype_t::metadata_blob_init(): Initialize the mblob data type.
dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(),
upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits
refer to a metadata record.
dict_table_t::instant: Metadata about dropped or reordered columns.
dict_table_t::prepare_instant(): Prepare
ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE.
innobase_instant_try() will pass this to dict_table_t::instant_column().
On rollback, dict_table_t::rollback_instant() will be called.
dict_table_t::instant_column(): Renamed from instant_add_column().
Add the parameter col_map so that columns can be reordered.
Copy and adjust v_cols[] as well.
dict_table_t::find(): Find an old column based on a new column number.
dict_table_t::serialise_columns(), dict_table_t::deserialise_columns():
Convert the mblob.
dict_index_t::instant_metadata(): Create the metadata record
for instant ALTER TABLE. Invoke dict_table_t::serialise_columns().
dict_index_t::reconstruct_fields(): Invoked by
dict_table_t::deserialise_columns().
dict_index_t::clear_instant_alter(): Move the fields for the
dropped columns to the end, and sort the surviving index fields
in ascending order of column position.
ha_innobase::check_if_supported_inplace_alter(): Do not allow
adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists
due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.)
instant_alter_column_possible(): Add a parameter for InnoDB table,
to check for additional conditions, such as the maximum number of
index fields.
ha_innobase_inplace_ctx::first_alter_pos: The first column whose position
is affected by instant ADD, DROP, or changing the order of columns.
innobase_build_col_map(): Skip added virtual columns.
prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol.
Remove some unnecessary code. Note that the call to
innodb_base_col_setup() should be executed later.
commit_try_norebuild(): If ctx->is_instant(), let the virtual
columns be added or dropped by innobase_instant_try().
innobase_instant_try(): Fill in a zero default value for the
hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459).
If any columns were dropped or reordered (or added not last),
delete any SYS_COLUMNS records for the following columns, and
insert SYS_COLUMNS records for all subsequent stored columns as well
as for all virtual columns. If any virtual column is dropped, rewrite
all virtual column metadata. Use a shortcut only for adding
virtual columns. This is because innobase_drop_virtual_try()
assumes that the dropped virtual columns still exist in ctx->old_table.
innodb_update_cols(): Renamed from innodb_update_n_cols().
innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change
the return type to bool, and invoke my_error() when detecting an error.
innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS.
Refactored from innobase_add_one_virtual() and innobase_instant_add_col().
innobase_instant_add_col(): Replace the parameter dfield with type.
innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS
and all columns from SYS_VIRTUAL.
innobase_add_virtual_try(), innobase_drop_virtual_try(): Let
the caller invoke innodb_update_cols().
innobase_rename_column_try(): Skip dropped columns.
commit_cache_norebuild(): Update table->fts->doc_col.
dict_mem_table_col_rename_low(): Skip dropped columns.
trx_undo_rec_get_partial_row(): Skip dropped columns.
			  page_rec_is_leaf(cur.rec),
			  ULINT_UNDEFINED, &heap);
		page_rec_print(cur.rec, offsets);

		if (count == pr_n) {
			break;
		}

		if (page_cur_is_after_last(&cur)) {
			break;
		}

		page_cur_move_to_next(&cur);
		count++;
	}

	if (n_recs > 2 * pr_n) {
		fputs(" ... \n", stderr);
	}

	while (!page_cur_is_after_last(&cur)) {
		page_cur_move_to_next(&cur);

		if (count + pr_n >= n_recs) {
			offsets = rec_get_offsets(cur.rec, index, offsets,
MDEV-15662 Instant DROP COLUMN or changing the order of columns
Allow ADD COLUMN anywhere in a table, not only adding as the
last column.
Allow instant DROP COLUMN and instant changing the order of columns.
The added columns will always be added last in clustered index records.
In new records, instantly dropped columns will be stored as NULL or
empty when possible.
Information about dropped and reordered columns will be written in
a metadata BLOB (mblob), which is stored before the first 'user' field
in the hidden metadata record at the start of the clustered index.
The presence of mblob is indicated by setting the delete-mark flag in
the metadata record.
The metadata BLOB stores the number of clustered index fields,
followed by an array of column information for each field.
For dropped columns, we store the NOT NULL flag, the fixed length,
and for variable-length columns, whether the maximum length exceeded
255 bytes. For non-dropped columns, we store the column position.
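The per-field payload can be pictured with a small sketch. The packing below is purely hypothetical (marker bit positions and field widths are assumptions for illustration); the authoritative on-disk encoding is produced by dict_table_t::serialise_columns().

```cpp
#include <cstdint>

// Hypothetical 16-bit per-field packing (illustration only; the real
// format is defined by dict_table_t::serialise_columns()).
struct field_meta {
	bool		dropped;	// instantly dropped column?
	bool		not_null;	// NOT NULL flag (dropped columns only)
	bool		len_gt_255;	// maximum length may exceed 255 bytes
	uint16_t	fixed_len;	// fixed length; 0 = variable (dropped only)
	uint16_t	col_pos;	// column position (surviving columns only)
};

static uint16_t pack_field(const field_meta& f)
{
	if (!f.dropped) {
		return f.col_pos;	// surviving column: store the position
	}
	uint16_t v = 0x8000;		// assumed "dropped" marker bit
	if (f.not_null) {
		v |= 0x4000;
	}
	if (f.len_gt_255) {
		v |= 0x2000;
	}
	return uint16_t(v | (f.fixed_len & 0x1fff));
}
```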
Unlike with MDEV-11369, when a table becomes empty, it cannot
be converted back to the canonical format. The reason for this is
that other threads may hold cached objects such as
row_prebuilt_t::ins_node that could refer to dropped or reordered
index fields.
For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC,
we must store the n_core_null_bytes in the root page, so that the
chain of node pointer records can be followed in order to reach the
leftmost leaf page where the metadata record is located.
If the mblob is present, we will zero-initialize the strings
"infimum" and "supremum" in the root page, and use the last byte of
"supremum" for storing the number of null bytes (which are allocated
but useless on node pointer pages). This is necessary for
btr_cur_instant_init_metadata() to be able to navigate to the mblob.
If the PRIMARY KEY contains any variable-length column and some
nullable columns were instantly dropped, the dict_index_t::n_nullable
in the data dictionary could be smaller than it actually is in the
non-leaf pages. Because of this, the non-leaf pages could use more
bytes for the null flags than the data dictionary expects, and we
could be reading the lengths of the variable-length columns from the
wrong offset, and thus reading the child page number from wrong place.
This is the result of two design mistakes that involve unnecessary
storage of data: First, it is nonsense to store any data fields for
the leftmost node pointer records, because the comparisons would be
resolved by the MIN_REC_FLAG alone. Second, there cannot be any null
fields in the clustered index node pointer fields, but we nevertheless
reserve space for all the null flags.
Limitations (future work):
MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists
MDEV-17468 Avoid table rebuild on operations on generated columns
MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large
btr_page_reorganize_low(): Preserve any metadata in the root page.
Call lock_move_reorganize_page() only after restoring the "infimum"
and "supremum" records, to avoid a memcmp() assertion failure.
dict_col_t::DROPPED: Magic value for dict_col_t::ind.
dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant().
Do not assert that the column was instantly added, because we
sometimes call this unconditionally for all columns.
Convert an instantly added column to a "core column". The old name
remove_instant() could be mistaken to refer to "instant DROP COLUMN".
dict_col_t::is_added(): Rename from dict_col_t::is_instant().
dtype_t::metadata_blob_init(): Initialize the mblob data type.
dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(),
upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits
refer to a metadata record.
dict_table_t::instant: Metadata about dropped or reordered columns.
dict_table_t::prepare_instant(): Prepare
ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE.
innobase_instant_try() will pass this to dict_table_t::instant_column().
On rollback, dict_table_t::rollback_instant() will be called.
dict_table_t::instant_column(): Renamed from instant_add_column().
Add the parameter col_map so that columns can be reordered.
Copy and adjust v_cols[] as well.
dict_table_t::find(): Find an old column based on a new column number.
dict_table_t::serialise_columns(), dict_table_t::deserialise_columns():
Convert the mblob.
dict_index_t::instant_metadata(): Create the metadata record
for instant ALTER TABLE. Invoke dict_table_t::serialise_columns().
dict_index_t::reconstruct_fields(): Invoked by
dict_table_t::deserialise_columns().
dict_index_t::clear_instant_alter(): Move the fields for the
dropped columns to the end, and sort the surviving index fields
in ascending order of column position.
ha_innobase::check_if_supported_inplace_alter(): Do not allow
adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists
due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.)
instant_alter_column_possible(): Add a parameter for InnoDB table,
to check for additional conditions, such as the maximum number of
index fields.
ha_innobase_inplace_ctx::first_alter_pos: The first column whose position
is affected by instant ADD, DROP, or changing the order of columns.
innobase_build_col_map(): Skip added virtual columns.
prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol.
Remove some unnecessary code. Note that the call to
innodb_base_col_setup() should be executed later.
commit_try_norebuild(): If ctx->is_instant(), let the virtual
columns be added or dropped by innobase_instant_try().
innobase_instant_try(): Fill in a zero default value for the
hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459).
If any columns were dropped or reordered (or added not last),
delete any SYS_COLUMNS records for the following columns, and
insert SYS_COLUMNS records for all subsequent stored columns as well
as for all virtual columns. If any virtual column is dropped, rewrite
all virtual column metadata. Use a shortcut only for adding
virtual columns. This is because innobase_drop_virtual_try()
assumes that the dropped virtual columns still exist in ctx->old_table.
innodb_update_cols(): Renamed from innodb_update_n_cols().
innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change
the return type to bool, and invoke my_error() when detecting an error.
innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS.
Refactored from innobase_add_one_virtual() and innobase_instant_add_col().
innobase_instant_add_col(): Replace the parameter dfield with type.
innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS
and all columns from SYS_VIRTUAL.
innobase_add_virtual_try(), innobase_drop_virtual_try(): Let
the caller invoke innodb_update_cols().
innobase_rename_column_try(): Skip dropped columns.
commit_cache_norebuild(): Update table->fts->doc_col.
dict_mem_table_col_rename_low(): Skip dropped columns.
trx_undo_rec_get_partial_row(): Skip dropped columns.
trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly.
trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields.
Log metadata records consistently.
Apparently, the first fields of a clustered index may be updated
in an update_undo vector when the index is ID_IND of SYS_FOREIGN,
as part of renaming the table during ALTER TABLE. Normally, updates of
the PRIMARY KEY should be logged as delete-mark and an insert.
row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec():
Use trx_undo_metadata.
row_undo_mod_clust_low(): On metadata rollback, roll back the root page too.
row_undo_mod_clust(): Relax an assertion. The delete-mark flag was
repurposed for ALTER TABLE metadata records.
row_rec_to_index_entry_impl(): Add the template parameter mblob
and the optional parameter info_bits for specifying the desired new
info bits. For the metadata tuple, allow conversion between the original
format (ADD COLUMN only) and the generic format (with hidden BLOB).
Add the optional parameter "pad" to determine whether the tuple should
be padded to the index fields (on ALTER TABLE it should), or whether
it should remain at its original size (on rollback).
row_build_index_entry_low(): Clean up the code, removing
redundant variables and conditions. For instantly dropped columns,
generate a dummy value that is NULL, the empty string, or a
fixed length of NUL bytes, depending on the type of the dropped column.
row_upd_clust_rec_by_insert_inherit_func(): On the update of PRIMARY KEY
of a record that contained a dropped column whose value was stored
externally, we will be inserting a dummy NULL or empty string value
to the field of the dropped column. The externally stored column would
eventually be dropped when purge removes the delete-marked record for
the old PRIMARY KEY value.
btr_index_rec_validate(): Recognize the metadata record.
btr_discard_only_page_on_level(): Preserve the generic instant
ALTER TABLE metadata.
btr_set_instant(): Replaces page_set_instant(). This sets a clustered
index root page to the appropriate format, or upgrades from
the MDEV-11369 instant ADD COLUMN to generic ALTER TABLE format.
btr_cur_instant_init_low(): Read and validate the metadata BLOB page
before reconstructing the dictionary information based on it.
btr_cur_instant_init_metadata(): Do not read any lengths from the
metadata record header before reading the BLOB. At this point, we
would not actually know how many nullable fields the metadata record
contains.
btr_cur_instant_root_init(): Initialize n_core_null_bytes in one
of two possible ways.
btr_cur_trim(): Handle the mblob record.
row_metadata_to_tuple(): Convert a metadata record to a data tuple,
based on the new info_bits of the metadata record.
btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed.
Invoke dtuple_convert_big_rec() for metadata records if the record is
too large, or if the mblob is not yet marked as externally stored.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
When the last user record is deleted, do not delete the
generic instant ALTER TABLE metadata record. Only delete
MDEV-11369 instant ADD COLUMN metadata records.
btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size.
btr_pcur_store_position(): Allow a logically empty page to contain
a metadata record for generic ALTER TABLE.
REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW.
This is for the old instant ADD COLUMN (MDEV-11369) only.
REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record,
with additional information for dropped or reordered columns.
rec_info_bits_valid(): Remove. The only case when this would fail
is when the record is the generic ALTER TABLE metadata record.
rec_is_alter_metadata(): Check if a record is the metadata record
for instant ALTER TABLE (other than ADD COLUMN). NOTE: This function
must not be invoked on node pointer records, because the delete-mark
flag in those records may be set (it is garbage), and then a debug
assertion could fail because index->is_instant() does not necessarily
hold.
rec_is_add_metadata(): Check if a record is MDEV-11369 ADD COLUMN metadata
record (not more generic instant ALTER TABLE).
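The two metadata flavours can be told apart by the info bits alone. A minimal model, using the flag values from InnoDB's rem0rec.h and omitting the status-bit and clustered-index checks that the real predicates also perform:

```cpp
#include <cstdint>

static const uint8_t REC_INFO_MIN_REC_FLAG = 0x10;
static const uint8_t REC_INFO_DELETED_FLAG = 0x20;

// Both metadata flavours carry REC_INFO_MIN_REC_FLAG; only the generic
// ALTER TABLE flavour (with the mblob) also carries the repurposed
// delete-mark flag.
static bool is_add_metadata(uint8_t info_bits)
{
	return (info_bits & (REC_INFO_MIN_REC_FLAG | REC_INFO_DELETED_FLAG))
		== REC_INFO_MIN_REC_FLAG;
}

static bool is_alter_metadata(uint8_t info_bits)
{
	return (info_bits & (REC_INFO_MIN_REC_FLAG | REC_INFO_DELETED_FLAG))
		== (REC_INFO_MIN_REC_FLAG | REC_INFO_DELETED_FLAG);
}
```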
rec_get_converted_size_comp_prefix_low(): Assume that the metadata
field will be stored externally. In dtuple_convert_big_rec() during
the rec_get_converted_size() call, it would not be there yet.
rec_get_converted_size_comp(): Replace status,fields,n_fields with tuple.
rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(),
rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>.
With mblob=true, process a record with a metadata BLOB.
rec_copy_prefix_to_buf(): Assert that no fields beyond the key and
system columns are being copied. Exclude the metadata BLOB field.
rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple
into a record.
row_upd_index_replace_metadata(): Apply an update vector to an
alter_metadata tuple.
row_log_allocate(): Replace dict_index_t::is_instant()
with a more appropriate condition that ignores dict_table_t::instant.
Only a table on which the MDEV-11369 ADD COLUMN was performed
can "lose its instantness" when it becomes empty. After
instant DROP COLUMN or reordering columns, we cannot simply
convert the table to the canonical format, because the data
dictionary cache and all possibly existing references to it
from other client connection threads would have to be adjusted.
row_quiesce_write_index_fields(): Do not crash when the table contains
an instantly dropped column.
Thanks to Thirunarayanan Balathandayuthapani for discussing the design
and implementing an initial prototype of this.
Thanks to Matthias Leich for testing.
2018-10-19 16:49:54 +03:00
						   page_rec_is_leaf(cur.rec),
						   ULINT_UNDEFINED, &heap);
			page_rec_print(cur.rec, offsets);
		}
		count++;
	}

	fprintf(stderr,
		"Total of %lu records \n"
		"--------------------------------\n",
		(ulong) (count + 1));

	if (UNIV_LIKELY_NULL(heap)) {
		mem_heap_free(heap);
	}
}

/***************************************************************//**
Prints the info in a page header. */
void
page_header_print(
/*==============*/
	const page_t*	page)
{
	fprintf(stderr,
		"--------------------------------\n"
		"PAGE HEADER INFO\n"
MDEV-11369 Instant ADD COLUMN for InnoDB
For InnoDB tables, adding, dropping and reordering columns has
required a rebuild of the table and all its indexes. Since MySQL 5.6
(and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing
concurrent modification of the tables.
This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT
and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously,
with only minor changes performed to the table structure. The counter
innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS
is incremented whenever a table rebuild operation is converted into
an instant ADD COLUMN operation.
ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.
Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY
and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE
The format of the clustered index (PRIMARY KEY) is changed as follows:
(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT,
and a new field PAGE_INSTANT will contain the original number of fields
in the clustered index ('core' fields).
If instant ADD COLUMN has not been used or the table becomes empty,
or the very first instant ADD COLUMN operation is rolled back,
the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset
to 0 and FIL_PAGE_INDEX.
(2) A special 'default row' record is inserted into the leftmost leaf,
between the page infimum and the first user record. This record is
distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the
same format as records that contain values for the instantly added
columns. This 'default row' always has the same number of fields as
the clustered index according to the table definition. The values of
'core' fields are to be ignored. For other fields, the 'default row'
will contain the default values as they were during the ALTER TABLE
statement. (If the column default values are changed later, those
values will only be stored in the .frm file. The 'default row' will
contain the original evaluated values, which must be the same for
every row.) The 'default row' must be completely hidden from
higher-level access routines. Assertions have been added to ensure
that no 'default row' is ever present in the adaptive hash index
or in locked records. The 'default row' is never delete-marked.
(3) In clustered index leaf page records, the number of fields must
reside between the number of 'core' fields (dict_index_t::n_core_fields
introduced in this work) and dict_index_t::n_fields. If the number
of fields is less than dict_index_t::n_fields, the missing fields
are replaced with the column value of the 'default row'.
Note: The number of fields in the record may shrink if some of the
last instantly added columns are updated to the value that is
in the 'default row'. The function btr_cur_trim() implements this
'compression' on update and rollback; dtuple::trim() implements it
on insert.
(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new
status value REC_STATUS_COLUMNS_ADDED will indicate the presence of
a new record header that will encode n_fields-n_core_fields-1 in
1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header
always explicitly encodes the number of fields.)
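The 1-or-2-byte count in the new record header amounts to a simple variable-length encoding. The byte layout below is an assumption for illustration (the authoritative code lives in the record header accessors in rem0rec.h): one byte when the value fits in 7 bits, otherwise two bytes with the high bit of the first byte set.

```cpp
#include <cstdint>
#include <vector>

// Sketch (assumed layout) of encoding n = n_fields - n_core_fields - 1.
static std::vector<uint8_t> encode_n_add(unsigned n)
{
	if (n < 0x80) {
		return { uint8_t(n) };			// 1 byte, high bit clear
	}
	return { uint8_t(0x80 | (n >> 8)),		// 2 bytes, high bit set
		 uint8_t(n & 0xff) };
}
```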
We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for
covering the insert of the 'default row' record when instant ADD COLUMN
is used for the first time. Subsequent instant ADD COLUMN can use
TRX_UNDO_UPD_EXIST_REC.
This is joint work with Vin Chen (陈福荣) from Tencent. The design
that was discussed in April 2017 would not have allowed import or
export of data files, because instead of the 'default row' it would
have introduced a data dictionary table. The test
rpl.rpl_alter_instant is exactly as contributed in pull request #408.
The test innodb.instant_alter is based on a contributed test.
The redo log record format changes for ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPACT are as contributed. (With this change present,
crash recovery from MariaDB 10.3.1 will fail in spectacular ways!)
Also the semantics of higher-level redo log records that modify the
PAGE_INSTANT field is changed. The redo log format version identifier
was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.
Everything else has been rewritten by me. Thanks to Elena Stepanova,
the code has been tested extensively.
When rolling back an instant ADD COLUMN operation, we must empty the
PAGE_FREE list after deleting or shortening the 'default row' record,
by calling either btr_page_empty() or btr_page_reorganize(). We must
know the size of each entry in the PAGE_FREE list. If rollback left a
freed copy of the 'default row' in the PAGE_FREE list, we would be
unable to determine its size (if it is in ROW_FORMAT=COMPACT or
ROW_FORMAT=DYNAMIC) because it would contain more fields than the
rolled-back definition of the clustered index.
UNIV_SQL_DEFAULT: A new special constant that designates an instantly
added column that is not present in the clustered index record.
len_is_stored(): Check if a length is an actual length. There are
two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
dict_col_t::def_val: The 'default row' value of the column. If the
column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(),
instant_value().
dict_col_t::remove_instant(): Remove the 'instant ADD' status of
a column.
dict_col_t::name(const dict_table_t& table): Replaces
dict_table_get_col_name().
dict_index_t::n_core_fields: The original number of fields.
For secondary indexes and if instant ADD COLUMN has not been used,
this will be equal to dict_index_t::n_fields.
dict_index_t::n_core_null_bytes: Number of bytes needed to
represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
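UT_BITS_IN_BYTES simply rounds a bit count up to whole bytes, so for example 9 nullable fields need 2 null-flag bytes:

```cpp
// UT_BITS_IN_BYTES(b) in InnoDB expands to ((b + 7) / 8): the number
// of bytes needed to hold b null-flag bits.
constexpr unsigned ut_bits_in_bytes(unsigned b)
{
	return (b + 7) / 8;
}
```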
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that
n_core_null_bytes was not initialized yet from the clustered index
root page.
dict_index_t: Add the accessors is_instant(), is_clust(),
get_n_nullable(), instant_field_value().
dict_index_t::instant_add_field(): Adjust clustered index metadata
for instant ADD COLUMN.
dict_index_t::remove_instant(): Remove the 'instant ADD' status
of a clustered index when the table becomes empty, or the very first
instant ADD COLUMN operation is rolled back.
dict_table_t: Add the accessors is_instant(), is_temporary(),
supports_instant().
dict_table_t::instant_add_column(): Adjust metadata for
instant ADD COLUMN.
dict_table_t::rollback_instant(): Adjust metadata on the rollback
of instant ADD COLUMN.
prepare_inplace_alter_table_dict(): First create the ctx->new_table,
and only then decide if the table really needs to be rebuilt.
We must split the creation of table or index metadata from the
creation of the dictionary table records and the creation of
the data. In this way, we can transform a table-rebuilding operation
into an instant ADD COLUMN operation. Dictionary objects will only
be added to cache when table rebuilding or index creation is needed.
The ctx->instant_table will never be added to cache.
dict_table_t::add_to_cache(): Modified and renamed from
dict_table_add_to_cache(). Do not modify the table metadata.
Let the callers invoke dict_table_add_system_columns() and if needed,
set can_be_evicted.
dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the
system columns (which will now exist in the dict_table_t object
already at this point).
dict_create_table_step(): Expect the callers to invoke
dict_table_add_system_columns().
pars_create_table(): Before creating the table creation execution
graph, invoke dict_table_add_system_columns().
row_create_table_for_mysql(): Expect all callers to invoke
dict_table_add_system_columns().
create_index_dict(): Replaces row_merge_create_index_graph().
innodb_update_n_cols(): Renamed from innobase_update_n_virtual().
Call my_error() if an error occurs.
btr_cur_instant_init(), btr_cur_instant_init_low(),
btr_cur_instant_root_init():
Load additional metadata from the clustered index and set
dict_index_t::n_core_null_bytes. This is invoked
when table metadata is first loaded into the data dictionary.
dict_boot(): Initialize n_core_null_bytes for the four hard-coded
dictionary tables.
dict_create_index_step(): Initialize n_core_null_bytes. This is
executed as part of CREATE TABLE.
dict_index_build_internal_clust(): Initialize n_core_null_bytes to
NO_CORE_NULL_BYTES if table->supports_instant().
row_create_index_for_mysql(): Initialize n_core_null_bytes for
CREATE TEMPORARY TABLE.
commit_cache_norebuild(): Call the code to rename or enlarge columns
in the cache only if instant ADD COLUMN is not being used.
(Instant ADD COLUMN would copy all column metadata from
instant_table to old_table, including the names and lengths.)
PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields.
This is repurposing the 16-bit field PAGE_DIRECTION, of which only the
least significant 3 bits were used. The original byte containing
PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.
page_ptr_get_direction(), page_get_direction(),
page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
page_direction_increment(): Increment PAGE_N_DIRECTION
and set PAGE_DIRECTION.
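The split of that 16-bit word can be modelled directly: the low 3 bits keep the direction, and the upper 13 bits hold n_core_fields (page_get_instant() likewise shifts the header field right by 3). A minimal sketch, not the actual accessor signatures:

```cpp
#include <cstdint>

// Model of the 16-bit header word at PAGE_INSTANT: upper 13 bits carry
// dict_index_t::n_core_fields, low 3 bits keep PAGE_DIRECTION.
static uint16_t set_instant(uint16_t word, uint16_t n_core_fields)
{
	return uint16_t((word & 7) | (n_core_fields << 3));
}

static uint16_t get_instant(uint16_t word)
{
	return uint16_t(word >> 3);
}

static uint16_t get_direction(uint16_t word)
{
	return uint16_t(word & 7);
}
```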
rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes,
and assume that heap_no is always set.
Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records,
even if the record contains fewer fields.
rec_offs_make_valid(): Add the parameter 'leaf'.
rec_copy_prefix_to_dtuple(): Assert that the tuple is only built
on the core fields. Instant ADD COLUMN only applies to the
clustered index, and we should never build a search key that has
more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR.
All these columns are always present.
dict_index_build_data_tuple(): Remove assertions that would be
duplicated in rec_copy_prefix_to_dtuple().
rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose
number of fields is between n_core_fields and n_fields.
cmp_rec_rec_with_match(): Implement the comparison between two
MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance.
A dictionary transaction must be flagged as such before it has generated
any undo log records. This is because trx_undo_assign_undo() will mark
the transaction as a dictionary transaction in the undo log header
right before the very first undo log record is being written.
btr_index_rec_validate(): Account for instant ADD COLUMN.
row_undo_ins_remove_clust_rec(): On the rollback of an insert into
SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
trx_undo_update_rec_get_update(): Handle the 'default row'
as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
before insert or update. After instant ADD COLUMN, if the last fields
of a clustered index tuple match the 'default row', there is no
need to store them. While trimming the entry, we must hold a page latch,
so that the table cannot be emptied and the 'default row' be deleted.
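The trimming rule itself can be modelled in a few lines (a sketch, not the actual dtuple_t::trim() signature): drop trailing fields, down to n_core_fields, whose values match the 'default row'.

```cpp
#include <string>
#include <vector>

// Minimal model of dtuple_t::trim(): remove a redundant suffix of
// fields whose values equal the corresponding 'default row' values.
// Fields are represented here as strings purely for illustration.
static void trim_entry(std::vector<std::string>& entry,
		       const std::vector<std::string>& default_row,
		       size_t n_core_fields)
{
	size_t n = entry.size();
	while (n > n_core_fields && entry[n - 1] == default_row[n - 1]) {
		--n;		// trailing field matches the default: omit it
	}
	entry.resize(n);
}
```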
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
of fields to be between n_core_fields and n_fields. Do not support
infimum,supremum. They are never supposed to be stored in dtuple_t,
because page creation nowadays uses a lower-level method for initializing
them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the
'default row', handle rollback specially. For user records, omit
fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Skip locking and adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
In the temporary file that is applied by row_log_table_apply(),
we must identify whether the records contain the extra header for
instantly added columns. For now, we will allocate an additional byte
for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
fine, as they will be converted and will only contain 'core' columns
(PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(),
rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
An info_bits constant for distinguishing the 'default row' record.
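With the values from rem0rec.h (REC_INFO_MIN_REC_FLAG = 0x10, REC_STATUS_COLUMNS_ADDED = 4), the combined constant works out as follows; the check is a simplified model, since the real code reads the flag and status bits from the record header rather than a single byte:

```cpp
#include <cstdint>

// Constants as in InnoDB's rem0rec.h.
static const uint8_t REC_INFO_MIN_REC_FLAG    = 0x10;
static const uint8_t REC_STATUS_COLUMNS_ADDED = 4;
static const uint8_t REC_INFO_DEFAULT_ROW =
	REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED;

static bool rec_is_default_row(uint8_t info_and_status_bits)
{
	return info_and_status_bits == REC_INFO_DEFAULT_ROW;
}
```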
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of
rec_init_offsets_comp_ordinary().
2017-10-06 07:00:05 +03:00
		"Page address %p, n records %u (%s)\n"
		"n dir slots %u, heap top %u\n"
		"Page n heap %u, free %u, garbage %u\n"
		"Page last insert %u, direction %u, n direction %u\n",
		page, page_header_get_field(page, PAGE_N_RECS),
		page_is_comp(page) ? "compact format" : "original format",
reside between the number of 'core' fields (dict_index_t::n_core_fields
introduced in this work) and dict_index_t::n_fields. If the number
of fields is less than dict_index_t::n_fields, the missing fields
are replaced with the column value of the 'default row'.
Note: The number of fields in the record may shrink if some of the
last instantly added columns are updated to the value that is
in the 'default row'. The function btr_cur_trim() implements this
'compression' on update and rollback; dtuple::trim() implements it
on insert.
(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new
status value REC_STATUS_COLUMNS_ADDED will indicate the presence of
a new record header that will encode n_fields-n_core_fields-1 in
1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header
always explicitly encodes the number of fields.)
We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for
covering the insert of the 'default row' record when instant ADD COLUMN
is used for the first time. Subsequent instant ADD COLUMN can use
TRX_UNDO_UPD_EXIST_REC.
This is joint work with Vin Chen (陈福荣) from Tencent. The design
that was discussed in April 2017 would not have allowed import or
export of data files, because instead of the 'default row' it would
have introduced a data dictionary table. The test
rpl.rpl_alter_instant is exactly as contributed in pull request #408.
The test innodb.instant_alter is based on a contributed test.
The redo log record format changes for ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPACT are as contributed. (With this change present,
crash recovery from MariaDB 10.3.1 will fail in spectacular ways!)
Also the semantics of higher-level redo log records that modify the
PAGE_INSTANT field are changed.
was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.
Everything else has been rewritten by me. Thanks to Elena Stepanova,
the code has been tested extensively.
When rolling back an instant ADD COLUMN operation, we must empty the
PAGE_FREE list after deleting or shortening the 'default row' record,
by calling either btr_page_empty() or btr_page_reorganize(). We must
know the size of each entry in the PAGE_FREE list. If rollback left a
freed copy of the 'default row' in the PAGE_FREE list, we would be
unable to determine its size (if it is in ROW_FORMAT=COMPACT or
ROW_FORMAT=DYNAMIC) because it would contain more fields than the
rolled-back definition of the clustered index.
UNIV_SQL_DEFAULT: A new special constant that designates an instantly
added column that is not present in the clustered index record.
len_is_stored(): Check if a length is an actual length. There are
two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
dict_col_t::def_val: The 'default row' value of the column. If the
column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(),
instant_value().
dict_col_t::remove_instant(): Remove the 'instant ADD' status of
a column.
dict_col_t::name(const dict_table_t& table): Replaces
dict_table_get_col_name().
dict_index_t::n_core_fields: The original number of fields.
For secondary indexes and if instant ADD COLUMN has not been used,
this will be equal to dict_index_t::n_fields.
dict_index_t::n_core_null_bytes: Number of bytes needed to
represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that
n_core_null_bytes was not initialized yet from the clustered index
root page.
dict_index_t: Add the accessors is_instant(), is_clust(),
get_n_nullable(), instant_field_value().
dict_index_t::instant_add_field(): Adjust clustered index metadata
for instant ADD COLUMN.
dict_index_t::remove_instant(): Remove the 'instant ADD' status
of a clustered index when the table becomes empty, or the very first
instant ADD COLUMN operation is rolled back.
dict_table_t: Add the accessors is_instant(), is_temporary(),
supports_instant().
dict_table_t::instant_add_column(): Adjust metadata for
instant ADD COLUMN.
dict_table_t::rollback_instant(): Adjust metadata on the rollback
of instant ADD COLUMN.
prepare_inplace_alter_table_dict(): First create the ctx->new_table,
and only then decide if the table really needs to be rebuilt.
We must split the creation of table or index metadata from the
creation of the dictionary table records and the creation of
the data. In this way, we can transform a table-rebuilding operation
into an instant ADD COLUMN operation. Dictionary objects will only
be added to cache when table rebuilding or index creation is needed.
The ctx->instant_table will never be added to cache.
dict_table_t::add_to_cache(): Modified and renamed from
dict_table_add_to_cache(). Do not modify the table metadata.
Let the callers invoke dict_table_add_system_columns() and if needed,
set can_be_evicted.
dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the
system columns (which will now exist in the dict_table_t object
already at this point).
dict_create_table_step(): Expect the callers to invoke
dict_table_add_system_columns().
pars_create_table(): Before creating the table creation execution
graph, invoke dict_table_add_system_columns().
row_create_table_for_mysql(): Expect all callers to invoke
dict_table_add_system_columns().
create_index_dict(): Replaces row_merge_create_index_graph().
innodb_update_n_cols(): Renamed from innobase_update_n_virtual().
Call my_error() if an error occurs.
btr_cur_instant_init(), btr_cur_instant_init_low(),
btr_cur_instant_root_init():
Load additional metadata from the clustered index and set
dict_index_t::n_core_null_bytes. This is invoked
when table metadata is first loaded into the data dictionary.
dict_boot(): Initialize n_core_null_bytes for the four hard-coded
dictionary tables.
dict_create_index_step(): Initialize n_core_null_bytes. This is
executed as part of CREATE TABLE.
dict_index_build_internal_clust(): Initialize n_core_null_bytes to
NO_CORE_NULL_BYTES if table->supports_instant().
row_create_index_for_mysql(): Initialize n_core_null_bytes for
CREATE TEMPORARY TABLE.
commit_cache_norebuild(): Call the code to rename or enlarge columns
in the cache only if instant ADD COLUMN is not being used.
(Instant ADD COLUMN would copy all column metadata from
instant_table to old_table, including the names and lengths.)
PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields.
This is repurposing the 16-bit field PAGE_DIRECTION, of which only the
least significant 3 bits were used. The original byte containing
PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.
page_ptr_get_direction(), page_get_direction(),
page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
page_direction_increment(): Increment PAGE_N_DIRECTION
and set PAGE_DIRECTION.
rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes,
and assume that heap_no is always set.
Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records,
even if the record contains fewer fields.
rec_offs_make_valid(): Add the parameter 'leaf'.
rec_copy_prefix_to_dtuple(): Assert that the tuple is only built
on the core fields. Instant ADD COLUMN only applies to the
clustered index, and we should never build a search key that has
more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR.
All these columns are always present.
dict_index_build_data_tuple(): Remove assertions that would be
duplicated in rec_copy_prefix_to_dtuple().
rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose
number of fields is between n_core_fields and n_fields.
cmp_rec_rec_with_match(): Implement the comparison between two
MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance.
A dictionary transaction must be flagged as such before it has generated
any undo log records. This is because trx_undo_assign_undo() will mark
the transaction as a dictionary transaction in the undo log header
right before the very first undo log record is being written.
btr_index_rec_validate(): Account for instant ADD COLUMN.
row_undo_ins_remove_clust_rec(): On the rollback of an insert into
SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
trx_undo_update_rec_get_update(): Handle the 'default row'
as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
before insert or update. After instant ADD COLUMN, if the last fields
of a clustered index tuple match the 'default row', there is no
need to store them. While trimming the entry, we must hold a page latch,
so that the table cannot be emptied and the 'default row' be deleted.
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
of fields to be between n_core_fields and n_fields. Do not support
infimum,supremum. They are never supposed to be stored in dtuple_t,
because page creation nowadays uses a lower-level method for initializing
them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the
'default row', handle rollback specially. For user records, omit
fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Skip locking and adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
In the temporary file that is applied by row_log_table_apply(),
we must identify whether the records contain the extra header for
instantly added columns. For now, we will allocate an additional byte
for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
fine, as they will be converted and will only contain 'core' columns
(PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(),
rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
An info_bits constant for distinguishing the 'default row' record.
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of
rec_init_offsets_comp_ordinary().
		page_header_get_field(page, PAGE_N_DIR_SLOTS),
		page_header_get_field(page, PAGE_HEAP_TOP),
		page_dir_get_n_heap(page),
		page_header_get_field(page, PAGE_FREE),
		page_header_get_field(page, PAGE_GARBAGE),
		page_header_get_field(page, PAGE_LAST_INSERT),
		page_get_direction(page),
		page_header_get_field(page, PAGE_N_DIRECTION));
}

/***************************************************************//**
This is used to print the contents of the page for
debugging purposes. */
void
page_print(
/*=======*/
	buf_block_t*	block,	/*!< in: index page */
	dict_index_t*	index,	/*!< in: dictionary index of the page */
	ulint		dn,	/*!< in: print dn first and last entries
				in directory */
	ulint		rn)	/*!< in: print rn first and last records
				in directory */
{
	page_t*	page = block->frame;

	page_header_print(page);
	page_dir_print(page, dn);
	page_print_list(block, index, rn);
}
|
2016-12-30 15:04:10 +02:00
|
|
|
#endif /* UNIV_BTR_PRINT */
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
/***************************************************************//**
The following is used to validate a record on a page. This function
differs from rec_validate as it can also check the n_owned field and
the heap_no field.
@return TRUE if ok */
ibool
page_rec_validate(
/*==============*/
	const rec_t*	rec,	/*!< in: physical record */
	const ulint*	offsets)/*!< in: array returned by rec_get_offsets() */
{
	ulint		n_owned;
	ulint		heap_no;
	const page_t*	page;

	page = page_align(rec);
	ut_a(!page_is_comp(page) == !rec_offs_comp(offsets));

	page_rec_check(rec);
	rec_validate(rec, offsets);

	if (page_rec_is_comp(rec)) {
		n_owned = rec_get_n_owned_new(rec);
		heap_no = rec_get_heap_no_new(rec);
	} else {
		n_owned = rec_get_n_owned_old(rec);
		heap_no = rec_get_heap_no_old(rec);
	}

	if (UNIV_UNLIKELY(!(n_owned <= PAGE_DIR_SLOT_MAX_N_OWNED))) {
		ib::warn() << "Dir slot of rec " << page_offset(rec)
			<< ", n owned too big " << n_owned;
		return(FALSE);
	}

	if (UNIV_UNLIKELY(!(heap_no < page_dir_get_n_heap(page)))) {
		ib::warn() << "Heap no of rec " << page_offset(rec)
			<< " too big " << heap_no << " "
			<< page_dir_get_n_heap(page);
		return(FALSE);
	}

	return(TRUE);
}

#ifdef UNIV_DEBUG
/***************************************************************//**
Checks that the first directory slot points to the infimum record and
the last to the supremum. This function is intended to track if the
bug fixed in 4.0.14 has caused corruption to users' databases. */
void
page_check_dir(
/*===========*/
	const page_t*	page)	/*!< in: index page */
{
	ulint	n_slots;
	ulint	infimum_offs;
	ulint	supremum_offs;

	n_slots = page_dir_get_n_slots(page);
	infimum_offs = mach_read_from_2(page_dir_get_nth_slot(page, 0));
	supremum_offs = mach_read_from_2(page_dir_get_nth_slot(page,
							       n_slots - 1));

	if (UNIV_UNLIKELY(!page_rec_is_infimum_low(infimum_offs))) {

		ib::fatal() << "Page directory corruption: infimum not"
			" pointed to";
	}

	if (UNIV_UNLIKELY(!page_rec_is_supremum_low(supremum_offs))) {

		ib::fatal() << "Page directory corruption: supremum not"
			" pointed to";
	}
}
#endif /* UNIV_DEBUG */

/***************************************************************//**
This function checks the consistency of an index page when we do not
know the index. This is also resilient so that this should never crash
even if the page is total garbage.
@return TRUE if ok */
ibool
page_simple_validate_old(
/*=====================*/
	const page_t*	page)	/*!< in: index page in ROW_FORMAT=REDUNDANT */
{
	const page_dir_slot_t*	slot;
	ulint			slot_no;
	ulint			n_slots;
	const rec_t*		rec;
	const byte*		rec_heap_top;
	ulint			count;
	ulint			own_count;
	ibool			ret	= FALSE;

	ut_a(!page_is_comp(page));

	/* Check first that the record heap and the directory do not
	overlap. */

	n_slots = page_dir_get_n_slots(page);

	if (UNIV_UNLIKELY(n_slots > srv_page_size / 4)) {
		ib::error() << "Nonsensical number " << n_slots
			<< " of page dir slots";

		goto func_exit;
	}

	rec_heap_top = page_header_get_ptr(page, PAGE_HEAP_TOP);

	if (UNIV_UNLIKELY(rec_heap_top
			  > page_dir_get_nth_slot(page, n_slots - 1))) {
		ib::error()
			<< "Record heap and dir overlap on a page, heap top "
			<< page_header_get_field(page, PAGE_HEAP_TOP)
			<< ", dir "
			<< page_offset(page_dir_get_nth_slot(page,
							     n_slots - 1));

		goto func_exit;
	}

	/* Validate the record list in a loop checking also that it is
	consistent with the page record directory. */

	count = 0;
	own_count = 1;
	slot_no = 0;
	slot = page_dir_get_nth_slot(page, slot_no);

	rec = page_get_infimum_rec(page);

	for (;;) {
		if (UNIV_UNLIKELY(rec > rec_heap_top)) {
			ib::error() << "Record " << (rec - page)
				<< " is above rec heap top "
				<< (rec_heap_top - page);

			goto func_exit;
		}

		if (UNIV_UNLIKELY(rec_get_n_owned_old(rec) != 0)) {
			/* This is a record pointed to by a dir slot */
			if (UNIV_UNLIKELY(rec_get_n_owned_old(rec)
					  != own_count)) {

				ib::error() << "Wrong owned count "
					<< rec_get_n_owned_old(rec)
					<< ", " << own_count << ", rec "
					<< (rec - page);

				goto func_exit;
			}

			if (UNIV_UNLIKELY
			    (page_dir_slot_get_rec(slot) != rec)) {
				ib::error() << "Dir slot does not point"
					" to right rec " << (rec - page);

				goto func_exit;
			}

			own_count = 0;

			if (!page_rec_is_supremum(rec)) {
				slot_no++;
				slot = page_dir_get_nth_slot(page, slot_no);
			}
		}

		if (page_rec_is_supremum(rec)) {

			break;
		}

		if (UNIV_UNLIKELY
		    (rec_get_next_offs(rec, FALSE) < FIL_PAGE_DATA
		     || rec_get_next_offs(rec, FALSE) >= srv_page_size)) {

			ib::error() << "Next record offset nonsensical "
				<< rec_get_next_offs(rec, FALSE) << " for rec "
				<< (rec - page);

			goto func_exit;
		}

		count++;

		if (UNIV_UNLIKELY(count > srv_page_size)) {
			ib::error() << "Page record list appears"
				" to be circular " << count;
			goto func_exit;
		}

		rec = page_rec_get_next_const(rec);
		own_count++;
	}

	if (UNIV_UNLIKELY(rec_get_n_owned_old(rec) == 0)) {
		ib::error() << "n owned is zero in a supremum rec";

		goto func_exit;
	}

	if (UNIV_UNLIKELY(slot_no != n_slots - 1)) {
		ib::error() << "n slots wrong "
			<< slot_no << ", " << (n_slots - 1);
		goto func_exit;
	}

	if (UNIV_UNLIKELY(ulint(page_header_get_field(page, PAGE_N_RECS))
			  + PAGE_HEAP_NO_USER_LOW
			  != count + 1)) {
		ib::error() << "n recs wrong "
			<< page_header_get_field(page, PAGE_N_RECS)
			+ PAGE_HEAP_NO_USER_LOW << " " << (count + 1);

		goto func_exit;
	}

	/* Check then the free list */
	rec = page_header_get_ptr(page, PAGE_FREE);

	while (rec != NULL) {
		if (UNIV_UNLIKELY(rec < page + FIL_PAGE_DATA
				  || rec >= page + srv_page_size)) {
			ib::error() << "Free list record has"
				" a nonsensical offset " << (rec - page);

			goto func_exit;
		}

		if (UNIV_UNLIKELY(rec > rec_heap_top)) {
			ib::error() << "Free list record " << (rec - page)
				<< " is above rec heap top "
				<< (rec_heap_top - page);

			goto func_exit;
		}

		count++;

		if (UNIV_UNLIKELY(count > srv_page_size)) {
			ib::error() << "Page free list appears"
				" to be circular " << count;
			goto func_exit;
		}

		rec = page_rec_get_next_const(rec);
	}

	if (UNIV_UNLIKELY(page_dir_get_n_heap(page) != count + 1)) {

		ib::error() << "N heap is wrong "
			<< page_dir_get_n_heap(page) << ", " << (count + 1);

		goto func_exit;
	}

	ret = TRUE;

func_exit:
	return(ret);
}

/***************************************************************//**
This function checks the consistency of an index page when we do not
know the index. This is also resilient so that this should never crash
even if the page is total garbage.
@return TRUE if ok */
ibool
page_simple_validate_new(
/*=====================*/
	const page_t*	page)	/*!< in: index page in ROW_FORMAT!=REDUNDANT */
{
	const page_dir_slot_t*	slot;
	ulint			slot_no;
	ulint			n_slots;
	const rec_t*		rec;
	const byte*		rec_heap_top;
	ulint			count;
	ulint			own_count;
	ibool			ret	= FALSE;

	ut_a(page_is_comp(page));

	/* Check first that the record heap and the directory do not
	overlap. */

	n_slots = page_dir_get_n_slots(page);

	if (UNIV_UNLIKELY(n_slots > srv_page_size / 4)) {
		ib::error() << "Nonsensical number " << n_slots
			<< " of page dir slots";

		goto func_exit;
	}

	rec_heap_top = page_header_get_ptr(page, PAGE_HEAP_TOP);

	if (UNIV_UNLIKELY(rec_heap_top
			  > page_dir_get_nth_slot(page, n_slots - 1))) {

		ib::error() << "Record heap and dir overlap on a page,"
			" heap top "
			<< page_header_get_field(page, PAGE_HEAP_TOP)
			<< ", dir " << page_offset(
				page_dir_get_nth_slot(page, n_slots - 1));

		goto func_exit;
	}

	/* Validate the record list in a loop checking also that it is
	consistent with the page record directory. */

	count = 0;
	own_count = 1;
	slot_no = 0;
	slot = page_dir_get_nth_slot(page, slot_no);

	rec = page_get_infimum_rec(page);

	for (;;) {
		if (UNIV_UNLIKELY(rec > rec_heap_top)) {

			ib::error() << "Record " << page_offset(rec)
				<< " is above rec heap top "
				<< page_offset(rec_heap_top);

			goto func_exit;
		}

		if (UNIV_UNLIKELY(rec_get_n_owned_new(rec) != 0)) {
			/* This is a record pointed to by a dir slot */
			if (UNIV_UNLIKELY(rec_get_n_owned_new(rec)
					  != own_count)) {

				ib::error() << "Wrong owned count "
					<< rec_get_n_owned_new(rec) << ", "
					<< own_count << ", rec "
					<< page_offset(rec);

				goto func_exit;
			}

			if (UNIV_UNLIKELY
			    (page_dir_slot_get_rec(slot) != rec)) {
				ib::error() << "Dir slot does not point"
					" to right rec " << page_offset(rec);

				goto func_exit;
			}

			own_count = 0;

			if (!page_rec_is_supremum(rec)) {
				slot_no++;
				slot = page_dir_get_nth_slot(page, slot_no);
			}
		}

		if (page_rec_is_supremum(rec)) {

			break;
		}

		if (UNIV_UNLIKELY
		    (rec_get_next_offs(rec, TRUE) < FIL_PAGE_DATA
		     || rec_get_next_offs(rec, TRUE) >= srv_page_size)) {

			ib::error() << "Next record offset nonsensical "
				<< rec_get_next_offs(rec, TRUE)
				<< " for rec " << page_offset(rec);

			goto func_exit;
		}

		count++;

		if (UNIV_UNLIKELY(count > srv_page_size)) {
			ib::error() << "Page record list appears to be"
				" circular " << count;
			goto func_exit;
		}

		rec = page_rec_get_next_const(rec);
		own_count++;
	}

	if (UNIV_UNLIKELY(rec_get_n_owned_new(rec) == 0)) {
		ib::error() << "n owned is zero in a supremum rec";

		goto func_exit;
	}

	if (UNIV_UNLIKELY(slot_no != n_slots - 1)) {
		ib::error() << "n slots wrong " << slot_no << ", "
			<< (n_slots - 1);
		goto func_exit;
	}

	if (UNIV_UNLIKELY(ulint(page_header_get_field(page, PAGE_N_RECS))
			  + PAGE_HEAP_NO_USER_LOW
			  != count + 1)) {
		ib::error() << "n recs wrong "
			<< page_header_get_field(page, PAGE_N_RECS)
			+ PAGE_HEAP_NO_USER_LOW << " " << (count + 1);

		goto func_exit;
	}

	/* Check then the free list */
	rec = page_header_get_ptr(page, PAGE_FREE);

	while (rec != NULL) {
		if (UNIV_UNLIKELY(rec < page + FIL_PAGE_DATA
				  || rec >= page + srv_page_size)) {

			ib::error() << "Free list record has"
				" a nonsensical offset " << page_offset(rec);

			goto func_exit;
		}

		if (UNIV_UNLIKELY(rec > rec_heap_top)) {
			ib::error() << "Free list record " << page_offset(rec)
				<< " is above rec heap top "
				<< page_offset(rec_heap_top);

			goto func_exit;
		}

		count++;

		if (UNIV_UNLIKELY(count > srv_page_size)) {
			ib::error() << "Page free list appears to be"
				" circular " << count;
			goto func_exit;
		}

		rec = page_rec_get_next_const(rec);
	}

	if (UNIV_UNLIKELY(page_dir_get_n_heap(page) != count + 1)) {

		ib::error() << "N heap is wrong "
			<< page_dir_get_n_heap(page) << ", " << (count + 1);

		goto func_exit;
	}

	ret = TRUE;

func_exit:
	return(ret);
}

/***************************************************************//**
|
|
|
|
This function checks the consistency of an index page.
|
2016-08-12 11:17:45 +03:00
|
|
|
@return TRUE if ok */
|
2014-02-26 19:11:54 +01:00
|
|
|
ibool
|
|
|
|
page_validate(
|
|
|
|
/*==========*/
|
|
|
|
const page_t* page, /*!< in: index page */
|
|
|
|
dict_index_t* index) /*!< in: data dictionary index containing
|
|
|
|
the page record type definition */
|
|
|
|
{
|
|
|
|
const page_dir_slot_t* slot;
|
|
|
|
const rec_t* rec;
|
|
|
|
const rec_t* old_rec = NULL;
|
|
|
|
ulint offs;
|
|
|
|
ulint n_slots;
|
2019-07-01 18:24:35 +03:00
|
|
|
ibool ret = TRUE;
|
2014-02-26 19:11:54 +01:00
|
|
|
ulint i;
|
|
|
|
ulint* offsets = NULL;
|
|
|
|
ulint* old_offsets = NULL;
|
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
#ifdef UNIV_GIS_DEBUG
|
|
|
|
if (dict_index_is_spatial(index)) {
|
|
|
|
fprintf(stderr, "Page no: %lu\n", page_get_page_no(page));
|
|
|
|
}
|
|
|
|
#endif /* UNIV_DEBUG */
|
|
|
|
|
2014-02-26 19:11:54 +01:00
|
|
|
	if (UNIV_UNLIKELY((ibool) !!page_is_comp(page)
			  != dict_table_is_comp(index->table))) {
		ib::error() << "'compact format' flag mismatch";
func_exit2:
		ib::error() << "Apparent corruption in space "
			<< page_get_space_id(page) << " page "
			<< page_get_page_no(page)
			<< " of index " << index->name
			<< " of table " << index->table->name;
		return FALSE;
	}

	if (page_is_comp(page)) {
		if (UNIV_UNLIKELY(!page_simple_validate_new(page))) {
			goto func_exit2;
		}
	} else {
		if (UNIV_UNLIKELY(!page_simple_validate_old(page))) {
			goto func_exit2;
		}
	}

	/* Multiple transactions cannot simultaneously operate on the
	same temp-table in parallel.
	max_trx_id is ignored for temp tables because it is not required
	for MVCC. */
	if (!page_is_leaf(page) || page_is_empty(page)
	    || !dict_index_is_sec_or_ibuf(index)
	    || index->table->is_temporary()) {
	} else if (trx_id_t sys_max_trx_id = trx_sys.get_max_trx_id()) {
		trx_id_t	max_trx_id = page_get_max_trx_id(page);

		if (max_trx_id == 0 || max_trx_id > sys_max_trx_id) {
			ib::error() << "PAGE_MAX_TRX_ID out of bounds: "
				<< max_trx_id << ", " << sys_max_trx_id;
			ret = FALSE;
		}
	} else {
		ut_ad(srv_force_recovery >= SRV_FORCE_NO_UNDO_LOG_SCAN);
	}

	/* Check first that the record heap and the directory do not
	overlap. */

	n_slots = page_dir_get_n_slots(page);

	if (UNIV_UNLIKELY(!(page_header_get_ptr(page, PAGE_HEAP_TOP)
			    <= page_dir_get_nth_slot(page, n_slots - 1)))) {

		ib::warn() << "Record heap and directory overlap";
		goto func_exit2;
	}

	switch (uint16_t type = fil_page_get_type(page)) {
	case FIL_PAGE_RTREE:
		if (!index->is_spatial()) {
wrong_page_type:
			ib::warn() << "Wrong page type " << type;
			ret = FALSE;
		}
		break;
	case FIL_PAGE_TYPE_INSTANT:
		if (index->is_instant()
		    && page_get_page_no(page) == index->page) {
			break;
		}
		goto wrong_page_type;
	case FIL_PAGE_INDEX:
		if (index->is_spatial()) {
			goto wrong_page_type;
		}
		if (index->is_instant()
		    && page_get_page_no(page) == index->page) {
			goto wrong_page_type;
		}
		break;
	default:
		goto wrong_page_type;
	}

	/* The following buffer is used to check that the
	records in the page record heap do not overlap */
	mem_heap_t* heap = mem_heap_create(srv_page_size + 200);
	byte* buf = static_cast<byte*>(mem_heap_zalloc(heap, srv_page_size));

	/* Validate the record list in a loop checking also that
	it is consistent with the directory. */
	ulint count = 0, data_size = 0, own_count = 1, slot_no = 0;
	slot = page_dir_get_nth_slot(page, slot_no);

	rec = page_get_infimum_rec(page);

	for (;;) {
		offsets = rec_get_offsets(rec, index, offsets,
					  page_is_leaf(page),
					  ULINT_UNDEFINED, &heap);

		if (page_is_comp(page) && page_rec_is_user_rec(rec)
		    && UNIV_UNLIKELY(rec_get_node_ptr_flag(rec)
				     == page_is_leaf(page))) {
			ib::error() << "'node_ptr' flag mismatch";
			ret = FALSE;
			goto next_rec;
		}

		if (UNIV_UNLIKELY(!page_rec_validate(rec, offsets))) {
			ret = FALSE;
			goto next_rec;
		}

		/* Check that the records are in ascending order */
		if (count >= PAGE_HEAP_NO_USER_LOW
		    && !page_rec_is_supremum(rec)) {

			/* Do not name this 'ret': it would shadow the
			function's result and the error below would be
			silently lost. */
			int	cmp = cmp_rec_rec(
				rec, old_rec, offsets, old_offsets, index);

			/* For spatial index, at the non-leaf level, we
			allow recs to be equal. */
			if (cmp <= 0 && !(cmp == 0 && index->is_spatial()
					  && !page_is_leaf(page))) {

				ib::error() << "Records in wrong order";

				fputs("\nInnoDB: previous record ", stderr);
				/* For spatial index, print the mbr info. */
				if (index->type & DICT_SPATIAL) {
					putc('\n', stderr);
					rec_print_mbr_rec(stderr,
							  old_rec, old_offsets);
					fputs("\nInnoDB: record ", stderr);
					putc('\n', stderr);
					rec_print_mbr_rec(stderr, rec, offsets);
					putc('\n', stderr);
					putc('\n', stderr);
				} else {
					rec_print_new(stderr, old_rec,
						      old_offsets);
					fputs("\nInnoDB: record ", stderr);
					rec_print_new(stderr, rec, offsets);
					putc('\n', stderr);
				}

				ret = FALSE;
			}
		}

		if (page_rec_is_user_rec(rec)) {

			data_size += rec_offs_size(offsets);

#if defined(UNIV_GIS_DEBUG)
			/* For spatial index, print the mbr info. */
			if (index->type & DICT_SPATIAL) {
				rec_print_mbr_rec(stderr, rec, offsets);
				putc('\n', stderr);
			}
#endif /* UNIV_GIS_DEBUG */
		}

		offs = page_offset(rec_get_start(rec, offsets));
		i = rec_offs_size(offsets);
		if (UNIV_UNLIKELY(offs + i >= srv_page_size)) {
			ib::error() << "Record offset out of bounds: "
				<< offs << '+' << i;
			ret = FALSE;
			goto next_rec;
		}
		while (i--) {
			if (UNIV_UNLIKELY(buf[offs + i])) {
				ib::error() << "Record overlaps another: "
					<< offs << '+' << i;
				ret = FALSE;
				break;
			}
			buf[offs + i] = 1;
		}

		if (ulint rec_own_count = page_is_comp(page)
		    ? rec_get_n_owned_new(rec)
		    : rec_get_n_owned_old(rec)) {
			/* This is a record pointed to by a dir slot */
			if (UNIV_UNLIKELY(rec_own_count != own_count)) {
				ib::error() << "Wrong owned count at " << offs
					<< ": " << rec_own_count
					<< ", " << own_count;
				ret = FALSE;
			}

			if (page_dir_slot_get_rec(slot) != rec) {
				ib::error() << "Dir slot does not"
					" point to right rec at " << offs;
				ret = FALSE;
			}

			if (ret) {
				page_dir_slot_check(slot);
			}

			own_count = 0;

			if (!page_rec_is_supremum(rec)) {
				slot_no++;
				slot = page_dir_get_nth_slot(page, slot_no);
			}
		}

next_rec:
		if (page_rec_is_supremum(rec)) {
			break;
		}

		count++;
		own_count++;
		old_rec = rec;
		rec = page_rec_get_next_const(rec);

		/* set old_offsets to offsets; recycle offsets */
		{
			ulint* offs = old_offsets;
			old_offsets = offsets;
			offsets = offs;
		}
	}

	if (page_is_comp(page)) {
		if (UNIV_UNLIKELY(rec_get_n_owned_new(rec) == 0)) {

			goto n_owned_zero;
		}
	} else if (UNIV_UNLIKELY(rec_get_n_owned_old(rec) == 0)) {
n_owned_zero:
		ib::error() << "n owned is zero at " << offs;
		ret = FALSE;
	}

	if (UNIV_UNLIKELY(slot_no != n_slots - 1)) {
		ib::error() << "n slots wrong " << slot_no << " "
			<< (n_slots - 1);
		ret = FALSE;
	}

	if (UNIV_UNLIKELY(ulint(page_header_get_field(page, PAGE_N_RECS))
			  + PAGE_HEAP_NO_USER_LOW
			  != count + 1)) {
		ib::error() << "n recs wrong "
			<< page_header_get_field(page, PAGE_N_RECS)
			+ PAGE_HEAP_NO_USER_LOW << " " << (count + 1);
		ret = FALSE;
	}

	if (UNIV_UNLIKELY(data_size != page_get_data_size(page))) {
		ib::error() << "Summed data size " << data_size
			<< ", returned by func " << page_get_data_size(page);
		ret = FALSE;
	}

	/* Check then the free list */
	for (rec = page_header_get_ptr(page, PAGE_FREE);
	     rec;
	     rec = page_rec_get_next_const(rec)) {
		offsets = rec_get_offsets(rec, index, offsets,
					  page_is_leaf(page),
					  ULINT_UNDEFINED, &heap);
		if (UNIV_UNLIKELY(!page_rec_validate(rec, offsets))) {
			ret = FALSE;
			continue;
		}

		count++;
		offs = page_offset(rec_get_start(rec, offsets));
		i = rec_offs_size(offsets);
		if (UNIV_UNLIKELY(offs + i >= srv_page_size)) {
			ib::error() << "Free record offset out of bounds: "
				<< offs << '+' << i;
			ret = FALSE;
			continue;
		}
		while (i--) {
			if (UNIV_UNLIKELY(buf[offs + i])) {
				ib::error() << "Free record overlaps another: "
					<< offs << '+' << i;
				ret = FALSE;
				break;
			}
			buf[offs + i] = 1;
		}
	}

	if (UNIV_UNLIKELY(page_dir_get_n_heap(page) != count + 1)) {
		ib::error() << "N heap is wrong "
			<< page_dir_get_n_heap(page) << " " << (count + 1);
		ret = FALSE;
	}

	mem_heap_free(heap);

	if (UNIV_UNLIKELY(!ret)) {
		goto func_exit2;
	}

	return(ret);
}

/***************************************************************//**
Looks in the page record list for a record with the given heap number.
@return record, NULL if not found */
const rec_t*
page_find_rec_with_heap_no(
/*=======================*/
	const page_t*	page,	/*!< in: index page */
	ulint		heap_no)/*!< in: heap number */
{
	const rec_t*	rec;

	if (page_is_comp(page)) {
		rec = page + PAGE_NEW_INFIMUM;

		for (;;) {
			ulint	rec_heap_no = rec_get_heap_no_new(rec);

			if (rec_heap_no == heap_no) {

				return(rec);
			} else if (rec_heap_no == PAGE_HEAP_NO_SUPREMUM) {

				return(NULL);
			}

			rec = page + rec_get_next_offs(rec, TRUE);
		}
	} else {
		rec = page + PAGE_OLD_INFIMUM;

		for (;;) {
			ulint	rec_heap_no = rec_get_heap_no_old(rec);

			if (rec_heap_no == heap_no) {

				return(rec);
			} else if (rec_heap_no == PAGE_HEAP_NO_SUPREMUM) {

				return(NULL);
			}

			rec = page + rec_get_next_offs(rec, FALSE);
		}
	}
}

/*******************************************************//**
Removes the record from a leaf page. This function does not log
any changes. It is used by the IMPORT tablespace functions.
The cursor is moved to the next record after the deleted one.
@return true if success, i.e., the page did not become too empty */
bool
page_delete_rec(
/*============*/
	const dict_index_t*	index,	/*!< in: The index that the record
					belongs to */
	page_cur_t*		pcur,	/*!< in/out: page cursor on record
					to delete */
	page_zip_des_t*
#ifdef UNIV_ZIP_DEBUG
	page_zip/*!< in: compressed page descriptor */
#endif
	,
	const ulint*		offsets)/*!< in: offsets for record */
{
	bool		no_compress_needed;
	buf_block_t*	block = pcur->block;
	page_t*		page = buf_block_get_frame(block);

	ut_ad(page_is_leaf(page));

	if (!rec_offs_any_extern(offsets)
	    && ((page_get_data_size(page) - rec_offs_size(offsets)
		 < BTR_CUR_PAGE_COMPRESS_LIMIT(index))
		|| !page_has_siblings(page)
		|| (page_get_n_recs(page) < 2))) {

		ulint	root_page_no = dict_index_get_page(index);

		/* The page fillfactor will drop below a predefined
		minimum value, OR the level in the B-tree contains just
		one page, OR the page will become empty: we recommend
		compression if this is not the root page. */

		no_compress_needed = page_get_page_no(page) == root_page_no;
	} else {
		no_compress_needed = true;
	}

	if (no_compress_needed) {
#ifdef UNIV_ZIP_DEBUG
		ut_a(!page_zip || page_zip_validate(page_zip, page, index));
#endif /* UNIV_ZIP_DEBUG */

		page_cur_delete_rec(pcur, index, offsets, 0);

#ifdef UNIV_ZIP_DEBUG
		ut_a(!page_zip || page_zip_validate(page_zip, page, index));
#endif /* UNIV_ZIP_DEBUG */
	}

	return(no_compress_needed);
}

/** Get the last non-delete-marked record on a page.
@param[in]	page	index tree leaf page
@return the last record, not delete-marked
@retval infimum record if all records are delete-marked */
const rec_t*
page_find_rec_max_not_deleted(
	const page_t*	page)
{
	const rec_t*	rec = page_get_infimum_rec(page);
	const rec_t*	prev_rec = NULL; // remove warning
MDEV-11369 Instant ADD COLUMN for InnoDB
For InnoDB tables, adding, dropping and reordering columns has
required a rebuild of the table and all its indexes. Since MySQL 5.6
(and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing
concurrent modification of the tables.
This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT
and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously,
with only minor changes performed to the table structure. The counter
innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS
is incremented whenever a table rebuild operation is converted into
an instant ADD COLUMN operation.
ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.
Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY
and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE
The format of the clustered index (PRIMARY KEY) is changed as follows:
(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT,
and a new field PAGE_INSTANT will contain the original number of fields
in the clustered index ('core' fields).
If instant ADD COLUMN has not been used or the table becomes empty,
or the very first instant ADD COLUMN operation is rolled back,
the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset
to 0 and FIL_PAGE_INDEX.
(2) A special 'default row' record is inserted into the leftmost leaf,
between the page infimum and the first user record. This record is
distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the
same format as records that contain values for the instantly added
columns. This 'default row' always has the same number of fields as
the clustered index according to the table definition. The values of
'core' fields are to be ignored. For other fields, the 'default row'
will contain the default values as they were during the ALTER TABLE
statement. (If the column default values are changed later, those
values will only be stored in the .frm file. The 'default row' will
contain the original evaluated values, which must be the same for
every row.) The 'default row' must be completely hidden from
higher-level access routines. Assertions have been added to ensure
that no 'default row' is ever present in the adaptive hash index
or in locked records. The 'default row' is never delete-marked.
(3) In clustered index leaf page records, the number of fields must
reside between the number of 'core' fields (dict_index_t::n_core_fields
introduced in this work) and dict_index_t::n_fields. If the number
of fields is less than dict_index_t::n_fields, the missing fields
are replaced with the column value of the 'default row'.
Note: The number of fields in the record may shrink if some of the
last instantly added columns are updated to the value that is
in the 'default row'. The function btr_cur_trim() implements this
'compression' on update and rollback; dtuple::trim() implements it
on insert.
(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new
status value REC_STATUS_COLUMNS_ADDED will indicate the presence of
a new record header that will encode n_fields-n_core_fields-1 in
1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header
always explicitly encodes the number of fields.)
We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for
covering the insert of the 'default row' record when instant ADD COLUMN
is used for the first time. Subsequent instant ADD COLUMN can use
TRX_UNDO_UPD_EXIST_REC.
This is joint work with Vin Chen (陈福荣) from Tencent. The design
that was discussed in April 2017 would not have allowed import or
export of data files, because instead of the 'default row' it would
have introduced a data dictionary table. The test
rpl.rpl_alter_instant is exactly as contributed in pull request #408.
The test innodb.instant_alter is based on a contributed test.
The redo log record format changes for ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPACT are as contributed. (With this change present,
crash recovery from MariaDB 10.3.1 will fail in spectacular ways!)
Also the semantics of higher-level redo log records that modify the
PAGE_INSTANT field is changed. The redo log format version identifier
was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.
Everything else has been rewritten by me. Thanks to Elena Stepanova,
the code has been tested extensively.
When rolling back an instant ADD COLUMN operation, we must empty the
PAGE_FREE list after deleting or shortening the 'default row' record,
by calling either btr_page_empty() or btr_page_reorganize(). We must
know the size of each entry in the PAGE_FREE list. If rollback left a
freed copy of the 'default row' in the PAGE_FREE list, we would be
unable to determine its size (if it is in ROW_FORMAT=COMPACT or
ROW_FORMAT=DYNAMIC) because it would contain more fields than the
rolled-back definition of the clustered index.
UNIV_SQL_DEFAULT: A new special constant that designates an instantly
added column that is not present in the clustered index record.
len_is_stored(): Check if a length is an actual length. There are
two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
dict_col_t::def_val: The 'default row' value of the column. If the
column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(),
instant_value().
dict_col_t::remove_instant(): Remove the 'instant ADD' status of
a column.
dict_col_t::name(const dict_table_t& table): Replaces
dict_table_get_col_name().
dict_index_t::n_core_fields: The original number of fields.
For secondary indexes and if instant ADD COLUMN has not been used,
this will be equal to dict_index_t::n_fields.
dict_index_t::n_core_null_bytes: Number of bytes needed to
represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that
n_core_null_bytes was not initialized yet from the clustered index
root page.
dict_index_t: Add the accessors is_instant(), is_clust(),
get_n_nullable(), instant_field_value().
dict_index_t::instant_add_field(): Adjust clustered index metadata
for instant ADD COLUMN.
dict_index_t::remove_instant(): Remove the 'instant ADD' status
of a clustered index when the table becomes empty, or the very first
instant ADD COLUMN operation is rolled back.
dict_table_t: Add the accessors is_instant(), is_temporary(),
supports_instant().
dict_table_t::instant_add_column(): Adjust metadata for
instant ADD COLUMN.
dict_table_t::rollback_instant(): Adjust metadata on the rollback
of instant ADD COLUMN.
prepare_inplace_alter_table_dict(): First create the ctx->new_table,
and only then decide if the table really needs to be rebuilt.
We must split the creation of table or index metadata from the
creation of the dictionary table records and the creation of
the data. In this way, we can transform a table-rebuilding operation
into an instant ADD COLUMN operation. Dictionary objects will only
be added to cache when table rebuilding or index creation is needed.
The ctx->instant_table will never be added to cache.
dict_table_t::add_to_cache(): Modified and renamed from
dict_table_add_to_cache(). Do not modify the table metadata.
Let the callers invoke dict_table_add_system_columns() and if needed,
set can_be_evicted.
dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the
system columns (which will now exist in the dict_table_t object
already at this point).
dict_create_table_step(): Expect the callers to invoke
dict_table_add_system_columns().
pars_create_table(): Before creating the table creation execution
graph, invoke dict_table_add_system_columns().
row_create_table_for_mysql(): Expect all callers to invoke
dict_table_add_system_columns().
create_index_dict(): Replaces row_merge_create_index_graph().
innodb_update_n_cols(): Renamed from innobase_update_n_virtual().
Call my_error() if an error occurs.
btr_cur_instant_init(), btr_cur_instant_init_low(),
btr_cur_instant_root_init():
Load additional metadata from the clustered index and set
dict_index_t::n_core_null_bytes. This is invoked
when table metadata is first loaded into the data dictionary.
dict_boot(): Initialize n_core_null_bytes for the four hard-coded
dictionary tables.
dict_create_index_step(): Initialize n_core_null_bytes. This is
executed as part of CREATE TABLE.
dict_index_build_internal_clust(): Initialize n_core_null_bytes to
NO_CORE_NULL_BYTES if table->supports_instant().
row_create_index_for_mysql(): Initialize n_core_null_bytes for
CREATE TEMPORARY TABLE.
commit_cache_norebuild(): Call the code to rename or enlarge columns
in the cache only if instant ADD COLUMN is not being used.
(Instant ADD COLUMN would copy all column metadata from
instant_table to old_table, including the names and lengths.)
PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields.
This is repurposing the 16-bit field PAGE_DIRECTION, of which only the
least significant 3 bits were used. The original byte containing
PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.
page_ptr_get_direction(), page_get_direction(),
page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
page_direction_increment(): Increment PAGE_N_DIRECTION
and set PAGE_DIRECTION.
rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes,
and assume that heap_no is always set.
Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records,
even if the record contains fewer fields.
rec_offs_make_valid(): Add the parameter 'leaf'.
rec_copy_prefix_to_dtuple(): Assert that the tuple is only built
on the core fields. Instant ADD COLUMN only applies to the
clustered index, and we should never build a search key that has
more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR.
All these columns are always present.
dict_index_build_data_tuple(): Remove assertions that would be
duplicated in rec_copy_prefix_to_dtuple().
rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose
number of fields is between n_core_fields and n_fields.
cmp_rec_rec_with_match(): Implement the comparison between two
MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance.
A dictionary transaction must be flagged as such before it has generated
any undo log records. This is because trx_undo_assign_undo() will mark
the transaction as a dictionary transaction in the undo log header
right before the very first undo log record is being written.
btr_index_rec_validate(): Account for instant ADD COLUMN.
row_undo_ins_remove_clust_rec(): On the rollback of an insert into
SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
trx_undo_update_rec_get_update(): Handle the 'default row'
as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
before insert or update. After instant ADD COLUMN, if the last fields
of a clustered index tuple match the 'default row', there is no
need to store them. While trimming the entry, we must hold a page latch,
so that the table cannot be emptied and the 'default row' be deleted.
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
of fields to be between n_core_fields and n_fields. Do not support
infimum, supremum. They are never supposed to be stored in dtuple_t,
because page creation nowadays uses a lower-level method for initializing
them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the
'default row', handle rollback specially. For user records, omit
fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Skip locking and adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
In the temporary file that is applied by row_log_table_apply(),
we must identify whether the records contain the extra header for
instantly added columns. For now, we will allocate an additional byte
for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
fine, as they will be converted and will only contain 'core' columns
(PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(),
rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
An info_bits constant for distinguishing the 'default row' record.
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of
rec_init_offsets_comp_ordinary().
	/* Because the page infimum is never delete-marked
	and never the metadata pseudo-record (MIN_REC_FLAG),
	prev_rec will always be assigned to it first. */
	ut_ad(!rec_get_info_bits(rec, page_rec_is_comp(rec)));
	ut_ad(page_is_leaf(page));

	if (page_is_comp(page)) {
		do {
			if (!(rec[-REC_NEW_INFO_BITS]
			      & (REC_INFO_DELETED_FLAG
				 | REC_INFO_MIN_REC_FLAG))) {
				prev_rec = rec;
			}
			rec = page_rec_get_next_low(rec, true);
		} while (rec != page + PAGE_NEW_SUPREMUM);
	} else {
		do {
MDEV-11369 Instant ADD COLUMN for InnoDB
For InnoDB tables, adding, dropping and reordering columns has
required a rebuild of the table and all its indexes. Since MySQL 5.6
(and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing
concurrent modification of the tables.
This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT
and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously,
with only minor changes performed to the table structure. The counter
innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS
is incremented whenever a table rebuild operation is converted into
an instant ADD COLUMN operation.
ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.
Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY
and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE
The format of the clustered index (PRIMARY KEY) is changed as follows:
(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT,
and a new field PAGE_INSTANT will contain the original number of fields
in the clustered index ('core' fields).
If instant ADD COLUMN has not been used or the table becomes empty,
or the very first instant ADD COLUMN operation is rolled back,
the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset
to 0 and FIL_PAGE_INDEX.
(2) A special 'default row' record is inserted into the leftmost leaf,
between the page infimum and the first user record. This record is
distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the
same format as records that contain values for the instantly added
columns. This 'default row' always has the same number of fields as
the clustered index according to the table definition. The values of
'core' fields are to be ignored. For other fields, the 'default row'
will contain the default values as they were during the ALTER TABLE
statement. (If the column default values are changed later, those
values will only be stored in the .frm file. The 'default row' will
contain the original evaluated values, which must be the same for
every row.) The 'default row' must be completely hidden from
higher-level access routines. Assertions have been added to ensure
that no 'default row' is ever present in the adaptive hash index
or in locked records. The 'default row' is never delete-marked.
(3) In clustered index leaf page records, the number of fields must
be between the number of 'core' fields (dict_index_t::n_core_fields,
introduced in this work) and dict_index_t::n_fields. If the number
of fields is less than dict_index_t::n_fields, the missing fields
are filled in from the corresponding column values of the 'default row'.
Note: The number of fields in the record may shrink if some of the
last instantly added columns are updated to the value that is
in the 'default row'. The function btr_cur_trim() implements this
'compression' on update and rollback; dtuple::trim() implements it
on insert.
(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new
status value REC_STATUS_COLUMNS_ADDED will indicate the presence of
a new record header that will encode n_fields-n_core_fields-1 in
1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header
always explicitly encodes the number of fields.)
We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for
covering the insert of the 'default row' record when instant ADD COLUMN
is used for the first time. Subsequent instant ADD COLUMN can use
TRX_UNDO_UPD_EXIST_REC.
This is joint work with Vin Chen (陈福荣) from Tencent. The design
that was discussed in April 2017 would not have allowed import or
export of data files, because instead of the 'default row' it would
have introduced a data dictionary table. The test
rpl.rpl_alter_instant is exactly as contributed in pull request #408.
The test innodb.instant_alter is based on a contributed test.
The redo log record format changes for ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPACT are as contributed. (With this change present,
crash recovery from MariaDB 10.3.1 will fail in spectacular ways!)
Also the semantics of higher-level redo log records that modify the
PAGE_INSTANT field is changed. The redo log format version identifier
was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.
Everything else has been rewritten by me. Thanks to Elena Stepanova,
the code has been tested extensively.
When rolling back an instant ADD COLUMN operation, we must empty the
PAGE_FREE list after deleting or shortening the 'default row' record,
by calling either btr_page_empty() or btr_page_reorganize(). We must
know the size of each entry in the PAGE_FREE list. If rollback left a
freed copy of the 'default row' in the PAGE_FREE list, we would be
unable to determine its size (if it is in ROW_FORMAT=COMPACT or
ROW_FORMAT=DYNAMIC) because it would contain more fields than the
rolled-back definition of the clustered index.
UNIV_SQL_DEFAULT: A new special constant that designates an instantly
added column that is not present in the clustered index record.
len_is_stored(): Check if a length is an actual length. There are
two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
dict_col_t::def_val: The 'default row' value of the column. If the
column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(),
instant_value().
dict_col_t::remove_instant(): Remove the 'instant ADD' status of
a column.
dict_col_t::name(const dict_table_t& table): Replaces
dict_table_get_col_name().
dict_index_t::n_core_fields: The original number of fields.
For secondary indexes and if instant ADD COLUMN has not been used,
this will be equal to dict_index_t::n_fields.
dict_index_t::n_core_null_bytes: Number of bytes needed to
represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that
n_core_null_bytes was not initialized yet from the clustered index
root page.
dict_index_t: Add the accessors is_instant(), is_clust(),
get_n_nullable(), instant_field_value().
dict_index_t::instant_add_field(): Adjust clustered index metadata
for instant ADD COLUMN.
dict_index_t::remove_instant(): Remove the 'instant ADD' status
of a clustered index when the table becomes empty, or the very first
instant ADD COLUMN operation is rolled back.
dict_table_t: Add the accessors is_instant(), is_temporary(),
supports_instant().
dict_table_t::instant_add_column(): Adjust metadata for
instant ADD COLUMN.
dict_table_t::rollback_instant(): Adjust metadata on the rollback
of instant ADD COLUMN.
prepare_inplace_alter_table_dict(): First create the ctx->new_table,
and only then decide if the table really needs to be rebuilt.
We must split the creation of table or index metadata from the
creation of the dictionary table records and the creation of
the data. In this way, we can transform a table-rebuilding operation
into an instant ADD COLUMN operation. Dictionary objects will only
be added to cache when table rebuilding or index creation is needed.
The ctx->instant_table will never be added to cache.
dict_table_t::add_to_cache(): Modified and renamed from
dict_table_add_to_cache(). Do not modify the table metadata.
Let the callers invoke dict_table_add_system_columns() and if needed,
set can_be_evicted.
dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the
system columns (which will now exist in the dict_table_t object
already at this point).
dict_create_table_step(): Expect the callers to invoke
dict_table_add_system_columns().
pars_create_table(): Before creating the table creation execution
graph, invoke dict_table_add_system_columns().
row_create_table_for_mysql(): Expect all callers to invoke
dict_table_add_system_columns().
create_index_dict(): Replaces row_merge_create_index_graph().
innodb_update_n_cols(): Renamed from innobase_update_n_virtual().
Call my_error() if an error occurs.
btr_cur_instant_init(), btr_cur_instant_init_low(),
btr_cur_instant_root_init():
Load additional metadata from the clustered index and set
dict_index_t::n_core_null_bytes. This is invoked
when table metadata is first loaded into the data dictionary.
dict_boot(): Initialize n_core_null_bytes for the four hard-coded
dictionary tables.
dict_create_index_step(): Initialize n_core_null_bytes. This is
executed as part of CREATE TABLE.
dict_index_build_internal_clust(): Initialize n_core_null_bytes to
NO_CORE_NULL_BYTES if table->supports_instant().
row_create_index_for_mysql(): Initialize n_core_null_bytes for
CREATE TEMPORARY TABLE.
commit_cache_norebuild(): Call the code to rename or enlarge columns
in the cache only if instant ADD COLUMN is not being used.
(Instant ADD COLUMN would copy all column metadata from
instant_table to old_table, including the names and lengths.)
PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields.
This repurposes the 16-bit field PAGE_DIRECTION, of which only the
least significant 3 bits were used. The original byte containing
PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.
page_ptr_get_direction(), page_get_direction(),
page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
page_direction_increment(): Increment PAGE_N_DIRECTION
and set PAGE_DIRECTION.
rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes,
and assume that heap_no is always set.
Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records,
even if the record contains fewer fields.
rec_offs_make_valid(): Add the parameter 'leaf'.
rec_copy_prefix_to_dtuple(): Assert that the tuple is only built
on the core fields. Instant ADD COLUMN only applies to the
clustered index, and we should never build a search key that has
more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR.
All these columns are always present.
dict_index_build_data_tuple(): Remove assertions that would be
duplicated in rec_copy_prefix_to_dtuple().
rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose
number of fields is between n_core_fields and n_fields.
cmp_rec_rec_with_match(): Implement the comparison between two
MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance.
A dictionary transaction must be flagged as such before it has generated
any undo log records. This is because trx_undo_assign_undo() will mark
the transaction as a dictionary transaction in the undo log header
right before the very first undo log record is being written.
btr_index_rec_validate(): Account for instant ADD COLUMN.
row_undo_ins_remove_clust_rec(): On the rollback of an insert into
SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
trx_undo_update_rec_get_update(): Handle the 'default row'
as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
before insert or update. After instant ADD COLUMN, if the last fields
of a clustered index tuple match the 'default row', there is no
need to store them. While trimming the entry, we must hold a page latch,
so that the table cannot be emptied and the 'default row' be deleted.
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
of fields to be between n_core_fields and n_fields. Do not support
infimum,supremum. They are never supposed to be stored in dtuple_t,
because page creation nowadays uses a lower-level method for initializing
them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the
'default row', handle rollback specially. For user records, omit
fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Skip locking and adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
In the temporary file that is applied by row_log_table_apply(),
we must identify whether the records contain the extra header for
instantly added columns. For now, we will allocate an additional byte
for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
fine, as they will be converted and will only contain 'core' columns
(PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(),
rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
An info_bits constant for distinguishing the 'default row' record.
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of
rec_init_offsets_comp_ordinary().
			if (!(rec[-REC_OLD_INFO_BITS]
			      & (REC_INFO_DELETED_FLAG
				 | REC_INFO_MIN_REC_FLAG))) {
				prev_rec = rec;
			}
			rec = page_rec_get_next_low(rec, false);
		} while (rec != page + PAGE_OLD_SUPREMUM);
	}

	return(prev_rec);
}