/*****************************************************************************

Copyright (c) 1994, 2016, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2012, Facebook Inc.
Copyright (c) 2017, 2021, MariaDB Corporation.

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; version 2 of the License.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1335 USA

*****************************************************************************/

/**************************************************//**
@file page/page0page.cc
Index page routines

Created 2/2/1994 Heikki Tuuri
*******************************************************/

#include "page0page.h"
#include "page0cur.h"
#include "page0zip.h"
#include "buf0buf.h"
#include "buf0checksum.h"
#include "btr0btr.h"
#include "srv0srv.h"
#include "lock0lock.h"
#include "fut0lst.h"
#include "btr0sea.h"
#include "trx0sys.h"

#include <algorithm>

/*			THE INDEX PAGE
			==============

The index page consists of a page header which contains the page's
id and other information. On top of it are the index records
in a heap linked into a one way linear list according to alphabetic order.

Just below page end is an array of pointers which we call page directory,
to about every sixth record in the list. The pointers are placed in
the directory in the alphabetical order of the records pointed to,
enabling us to make binary search using the array. Each slot n:o I
in the directory points to a record, where a 4-bit field contains a count
of those records which are in the linear list between pointer I and
the pointer I - 1 in the directory, including the record
pointed to by pointer I and not including the record pointed to by I - 1.
We say that the record pointed to by slot I, or that slot I, owns
these records. The count is always kept in the range 4 to 8, with
the exception that it is 1 for the first slot, and 1--8 for the second slot.

An essentially binary search can be performed in the list of index
records, like we could do if we had pointer to every record in the
page directory. The data structure is, however, more efficient when
we are doing inserts, because most inserts are just pushed on a heap.
Only every 8th insert requires block move in the directory pointer
table, which itself is quite small. A record is deleted from the page
by just taking it off the linear list and updating the number of owned
records-field of the record which owns it, and updating the page directory,
if necessary. A special case is the one when the record owns itself.
Because the overhead of inserts is so small, we may also increase the
page size from the projected default of 8 kB to 64 kB without too
much loss of efficiency in inserts. Bigger page becomes actual
when the disk transfer rate compared to seek and latency time rises.
On the present system, the page size is set so that the page transfer
time (3 ms) is 20 % of the disk random access time (15 ms).

When the page is split, merged, or becomes full but contains deleted
records, we have to reorganize the page.

Assuming a page size of 8 kB, a typical index page of a secondary
index contains 300 index entries, and the size of the page directory
is 50 x 4 bytes = 200 bytes. */
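
/* Illustrative arithmetic only, based on the figures quoted above: with
300 records and roughly one directory slot per 6 records, the directory
holds about 300 / 6 = 50 slots. A lookup then needs about log2(50), i.e.
roughly 6 slot comparisons in the binary search, followed by a linear scan
of at most 8 records owned by the located slot. */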

/***************************************************************//**
Looks for the directory slot which owns the given record.
@return the directory slot number */
ulint
page_dir_find_owner_slot(
/*=====================*/
	const rec_t*	rec)	/*!< in: the physical record */
{
	ut_ad(page_rec_check(rec));

	const page_t* page = page_align(rec);
	const page_dir_slot_t* first_slot = page_dir_get_nth_slot(page, 0);
	const page_dir_slot_t* slot = page_dir_get_nth_slot(
		page, ulint(page_dir_get_n_slots(page)) - 1);
	const rec_t*	r = rec;

	if (page_is_comp(page)) {
		while (rec_get_n_owned_new(r) == 0) {
			r = rec_get_next_ptr_const(r, TRUE);
			ut_ad(r >= page + PAGE_NEW_SUPREMUM);
			ut_ad(r < page + (srv_page_size - PAGE_DIR));
		}
	} else {
		while (rec_get_n_owned_old(r) == 0) {
			r = rec_get_next_ptr_const(r, FALSE);
			ut_ad(r >= page + PAGE_OLD_SUPREMUM);
			ut_ad(r < page + (srv_page_size - PAGE_DIR));
		}
	}

	uint16 rec_offs_bytes = mach_encode_2(ulint(r - page));
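
	/* Directory slots are stored just below the end of the page in
	reverse order: slot 0 occupies the highest address. Scan from the
	last slot towards slot 0 (first_slot) until the slot that stores
	the offset of the owning record is found. */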

	while (UNIV_LIKELY(*(uint16*) slot != rec_offs_bytes)) {

		if (UNIV_UNLIKELY(slot == first_slot)) {
			ib::error() << "Probable data corruption on page "
				<< page_get_page_no(page)
				<< ". Original record on that page;";

			if (page_is_comp(page)) {
				fputs("(compact record)", stderr);
			} else {
				rec_print_old(stderr, rec);
			}

			ib::error() << "Cannot find the dir slot for this"
				" record on that page;";

			if (page_is_comp(page)) {
				fputs("(compact record)", stderr);
			} else {
				rec_print_old(stderr, page
					      + mach_decode_2(rec_offs_bytes));
			}

			ut_error;
		}

		slot += PAGE_DIR_SLOT_SIZE;
	}

	return(((ulint) (first_slot - slot)) / PAGE_DIR_SLOT_SIZE);
}

/**************************************************************//**
Used to check the consistency of a directory slot.
@return TRUE if succeed */
static
ibool
page_dir_slot_check(
/*================*/
	const page_dir_slot_t*	slot)	/*!< in: slot */
{
	const page_t*	page;
	ulint		n_slots;
	ulint		n_owned;

	ut_a(slot);

	page = page_align(slot);

	n_slots = page_dir_get_n_slots(page);

	ut_a(slot <= page_dir_get_nth_slot(page, 0));
	ut_a(slot >= page_dir_get_nth_slot(page, n_slots - 1));

	ut_a(page_rec_check(page_dir_slot_get_rec(slot)));

	if (page_is_comp(page)) {
		n_owned = rec_get_n_owned_new(page_dir_slot_get_rec(slot));
	} else {
		n_owned = rec_get_n_owned_old(page_dir_slot_get_rec(slot));
	}

	if (slot == page_dir_get_nth_slot(page, 0)) {
		ut_a(n_owned == 1);
	} else if (slot == page_dir_get_nth_slot(page, n_slots - 1)) {
		ut_a(n_owned >= 1);
		ut_a(n_owned <= PAGE_DIR_SLOT_MAX_N_OWNED);
	} else {
		ut_a(n_owned >= PAGE_DIR_SLOT_MIN_N_OWNED);
		ut_a(n_owned <= PAGE_DIR_SLOT_MAX_N_OWNED);
	}

	return(TRUE);
}

/*************************************************************//**
Sets the max trx id field value. */
void
page_set_max_trx_id(
/*================*/
	buf_block_t*	block,	/*!< in/out: page */
	page_zip_des_t*	page_zip,/*!< in/out: compressed page, or NULL */
	trx_id_t	trx_id,	/*!< in: transaction id */
	mtr_t*		mtr)	/*!< in/out: mini-transaction, or NULL */
{
  ut_ad(!mtr || mtr->memo_contains_flagged(block, MTR_MEMO_PAGE_X_FIX));
  ut_ad(!page_zip || page_zip == &block->page.zip);
  static_assert((PAGE_HEADER + PAGE_MAX_TRX_ID) % 8 == 0, "alignment");
  byte *max_trx_id= my_assume_aligned<8>(PAGE_MAX_TRX_ID +
                                         PAGE_HEADER + block->frame);
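
  /* mtr_t::write() updates the field in the uncompressed page frame and
  writes a redo log record for the change; for ROW_FORMAT=COMPRESSED pages
  the same eight bytes are then mirrored into the compressed page frame. */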

  mtr->write<8>(*block, max_trx_id, trx_id);
  if (UNIV_LIKELY_NULL(page_zip))
    memcpy_aligned<8>(&page_zip->data[PAGE_MAX_TRX_ID + PAGE_HEADER],
                      max_trx_id, 8);
}

/** Persist the AUTO_INCREMENT value on a clustered index root page.
@param[in,out]	block	clustered index root page
@param[in]	autoinc	next available AUTO_INCREMENT value
@param[in,out]	mtr	mini-transaction
@param[in]	reset	whether to reset the AUTO_INCREMENT
			to a possibly smaller value than currently
			exists in the page */
void
page_set_autoinc(
	buf_block_t*		block,
	ib_uint64_t		autoinc,
	mtr_t*			mtr,
	bool			reset)
{
  ut_ad(mtr->memo_contains_flagged(block, MTR_MEMO_PAGE_X_FIX |
                                   MTR_MEMO_PAGE_SX_FIX));

  byte *field= my_assume_aligned<8>(PAGE_HEADER + PAGE_ROOT_AUTO_INC +
                                    block->frame);
  ib_uint64_t old= mach_read_from_8(field);
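  /* Unless the caller explicitly asked for a reset, PAGE_ROOT_AUTO_INC is
  only ever moved forwards: a smaller or equal value leaves the page (and
  the redo log) untouched. */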
  if (old == autoinc || (old > autoinc && !reset))
    return; /* nothing to update */

  mtr->write<8>(*block, field, autoinc);
  if (UNIV_LIKELY_NULL(block->page.zip.data))
    memcpy_aligned<8>(PAGE_HEADER + PAGE_ROOT_AUTO_INC + block->page.zip.data,
                      field, 8);
}

/** The page infimum and supremum of an empty page in ROW_FORMAT=REDUNDANT */
static const byte infimum_supremum_redundant[] = {
	/* the infimum record */
	0x08/*end offset*/,
	0x01/*n_owned*/,
	0x00, 0x00/*heap_no=0*/,
	0x03/*n_fields=1, 1-byte offsets*/,
	0x00, 0x74/* pointer to supremum */,
	'i', 'n', 'f', 'i', 'm', 'u', 'm', 0,
	/* the supremum record */
	0x09/*end offset*/,
	0x01/*n_owned*/,
	0x00, 0x08/*heap_no=1*/,
	0x03/*n_fields=1, 1-byte offsets*/,
	0x00, 0x00/* end of record list */,
	's', 'u', 'p', 'r', 'e', 'm', 'u', 'm', 0
};

/** The page infimum and supremum of an empty page in ROW_FORMAT=COMPACT */
static const byte infimum_supremum_compact[] = {
	/* the infimum record */
	0x01/*n_owned=1*/,
	0x00, 0x02/* heap_no=0, REC_STATUS_INFIMUM */,
	0x00, 0x0d/* pointer to supremum */,
	'i', 'n', 'f', 'i', 'm', 'u', 'm', 0,
	/* the supremum record */
	0x01/*n_owned=1*/,
	0x00, 0x0b/* heap_no=1, REC_STATUS_SUPREMUM */,
	0x00, 0x00/* end of record list */,
	's', 'u', 'p', 'r', 'e', 'm', 'u', 'm'
};
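
/* These byte patterns are copied verbatim to the start of the page data
area when an empty index page is created, so that every page begins life
with its infimum and supremum records already in place. */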

/** Create an index page.
@param[in,out]	block	buffer block
@param[in]	comp	nonzero=compact page format */
void page_create_low(const buf_block_t* block, bool comp)
{
	page_t*		page;

	compile_time_assert(PAGE_BTR_IBUF_FREE_LIST + FLST_BASE_NODE_SIZE
			    <= PAGE_DATA);
	compile_time_assert(PAGE_BTR_IBUF_FREE_LIST_NODE + FLST_NODE_SIZE
			    <= PAGE_DATA);

	page = block->frame;

	fil_page_set_type(page, FIL_PAGE_INDEX);

	memset(page + PAGE_HEADER, 0, PAGE_HEADER_PRIV_END);
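	/* Initialize PAGE_N_DIR_SLOTS to 2: the infimum and the supremum
	record each own one directory slot on an empty page. */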
	page[PAGE_HEADER + PAGE_N_DIR_SLOTS + 1] = 2;
|
MDEV-11369 Instant ADD COLUMN for InnoDB
For InnoDB tables, adding, dropping and reordering columns has
required a rebuild of the table and all its indexes. Since MySQL 5.6
(and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing
concurrent modification of the tables.
This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT
and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously,
with only minor changes performed to the table structure. The counter
innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS
is incremented whenever a table rebuild operation is converted into
an instant ADD COLUMN operation.
ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.
Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY
and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE
The format of the clustered index (PRIMARY KEY) is changed as follows:
(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT,
and a new field PAGE_INSTANT will contain the original number of fields
in the clustered index ('core' fields).
If instant ADD COLUMN has not been used or the table becomes empty,
or the very first instant ADD COLUMN operation is rolled back,
the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset
to 0 and FIL_PAGE_INDEX.
(2) A special 'default row' record is inserted into the leftmost leaf,
between the page infimum and the first user record. This record is
distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the
same format as records that contain values for the instantly added
columns. This 'default row' always has the same number of fields as
the clustered index according to the table definition. The values of
'core' fields are to be ignored. For other fields, the 'default row'
will contain the default values as they were during the ALTER TABLE
statement. (If the column default values are changed later, those
values will only be stored in the .frm file. The 'default row' will
contain the original evaluated values, which must be the same for
every row.) The 'default row' must be completely hidden from
higher-level access routines. Assertions have been added to ensure
that no 'default row' is ever present in the adaptive hash index
or in locked records. The 'default row' is never delete-marked.
(3) In clustered index leaf page records, the number of fields must
reside between the number of 'core' fields (dict_index_t::n_core_fields
introduced in this work) and dict_index_t::n_fields. If the number
of fields is less than dict_index_t::n_fields, the missing fields
are replaced with the column value of the 'default row'.
Note: The number of fields in the record may shrink if some of the
last instantly added columns are updated to the value that is
in the 'default row'. The function btr_cur_trim() implements this
'compression' on update and rollback; dtuple::trim() implements it
on insert.
(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new
status value REC_STATUS_COLUMNS_ADDED will indicate the presence of
a new record header that will encode n_fields-n_core_fields-1 in
1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header
always explicitly encodes the number of fields.)
We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for
covering the insert of the 'default row' record when instant ADD COLUMN
is used for the first time. Subsequent instant ADD COLUMN can use
TRX_UNDO_UPD_EXIST_REC.
This is joint work with Vin Chen (陈福荣) from Tencent. The design
that was discussed in April 2017 would not have allowed import or
export of data files, because instead of the 'default row' it would
have introduced a data dictionary table. The test
rpl.rpl_alter_instant is exactly as contributed in pull request #408.
The test innodb.instant_alter is based on a contributed test.
The redo log record format changes for ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPACT are as contributed. (With this change present,
crash recovery from MariaDB 10.3.1 will fail in spectacular ways!)
Also the semantics of higher-level redo log records that modify the
PAGE_INSTANT field is changed. The redo log format version identifier
was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.
Everything else has been rewritten by me. Thanks to Elena Stepanova,
the code has been tested extensively.
When rolling back an instant ADD COLUMN operation, we must empty the
PAGE_FREE list after deleting or shortening the 'default row' record,
by calling either btr_page_empty() or btr_page_reorganize(). We must
know the size of each entry in the PAGE_FREE list. If rollback left a
freed copy of the 'default row' in the PAGE_FREE list, we would be
unable to determine its size (if it is in ROW_FORMAT=COMPACT or
ROW_FORMAT=DYNAMIC) because it would contain more fields than the
rolled-back definition of the clustered index.
UNIV_SQL_DEFAULT: A new special constant that designates an instantly
added column that is not present in the clustered index record.
len_is_stored(): Check if a length is an actual length. There are
two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
dict_col_t::def_val: The 'default row' value of the column. If the
column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(),
instant_value().
dict_col_t::remove_instant(): Remove the 'instant ADD' status of
a column.
dict_col_t::name(const dict_table_t& table): Replaces
dict_table_get_col_name().
dict_index_t::n_core_fields: The original number of fields.
For secondary indexes and if instant ADD COLUMN has not been used,
this will be equal to dict_index_t::n_fields.
dict_index_t::n_core_null_bytes: Number of bytes needed to
represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that
n_core_null_bytes was not initialized yet from the clustered index
root page.
dict_index_t: Add the accessors is_instant(), is_clust(),
get_n_nullable(), instant_field_value().
dict_index_t::instant_add_field(): Adjust clustered index metadata
for instant ADD COLUMN.
dict_index_t::remove_instant(): Remove the 'instant ADD' status
of a clustered index when the table becomes empty, or the very first
instant ADD COLUMN operation is rolled back.
dict_table_t: Add the accessors is_instant(), is_temporary(),
supports_instant().
dict_table_t::instant_add_column(): Adjust metadata for
instant ADD COLUMN.
dict_table_t::rollback_instant(): Adjust metadata on the rollback
of instant ADD COLUMN.
prepare_inplace_alter_table_dict(): First create the ctx->new_table,
and only then decide if the table really needs to be rebuilt.
We must split the creation of table or index metadata from the
creation of the dictionary table records and the creation of
the data. In this way, we can transform a table-rebuilding operation
into an instant ADD COLUMN operation. Dictionary objects will only
be added to cache when table rebuilding or index creation is needed.
The ctx->instant_table will never be added to cache.
dict_table_t::add_to_cache(): Modified and renamed from
dict_table_add_to_cache(). Do not modify the table metadata.
Let the callers invoke dict_table_add_system_columns() and if needed,
set can_be_evicted.
dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the
system columns (which will now exist in the dict_table_t object
already at this point).
dict_create_table_step(): Expect the callers to invoke
dict_table_add_system_columns().
pars_create_table(): Before creating the table creation execution
graph, invoke dict_table_add_system_columns().
row_create_table_for_mysql(): Expect all callers to invoke
dict_table_add_system_columns().
create_index_dict(): Replaces row_merge_create_index_graph().
innodb_update_n_cols(): Renamed from innobase_update_n_virtual().
Call my_error() if an error occurs.
btr_cur_instant_init(), btr_cur_instant_init_low(),
btr_cur_instant_root_init():
Load additional metadata from the clustered index and set
dict_index_t::n_core_null_bytes. This is invoked
when table metadata is first loaded into the data dictionary.
dict_boot(): Initialize n_core_null_bytes for the four hard-coded
dictionary tables.
dict_create_index_step(): Initialize n_core_null_bytes. This is
executed as part of CREATE TABLE.
dict_index_build_internal_clust(): Initialize n_core_null_bytes to
NO_CORE_NULL_BYTES if table->supports_instant().
row_create_index_for_mysql(): Initialize n_core_null_bytes for
CREATE TEMPORARY TABLE.
commit_cache_norebuild(): Call the code to rename or enlarge columns
in the cache only if instant ADD COLUMN is not being used.
(Instant ADD COLUMN would copy all column metadata from
instant_table to old_table, including the names and lengths.)
PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields.
This repurposes the 16-bit field PAGE_DIRECTION, of which only the
least significant 3 bits were used. The original byte containing
PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
page_get_instant(), page_set_instant(): Accessors for PAGE_INSTANT
(see the encoding sketch after this list of changes).
page_ptr_get_direction(), page_get_direction(),
page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
page_direction_increment(): Increment PAGE_N_DIRECTION
and set PAGE_DIRECTION.
rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes,
and assume that heap_no is always set.
Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records,
even if the record contains fewer fields.
rec_offs_make_valid(): Add the parameter 'leaf'.
rec_copy_prefix_to_dtuple(): Assert that the tuple is only built
on the core fields. Instant ADD COLUMN only applies to the
clustered index, and we should never build a search key that has
more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR.
All these columns are always present.
dict_index_build_data_tuple(): Remove assertions that would be
duplicated in rec_copy_prefix_to_dtuple().
rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose
number of fields is between n_core_fields and n_fields.
cmp_rec_rec_with_match(): Implement the comparison between two
MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance.
A dictionary transaction must be flagged as such before it has generated
any undo log records. This is because trx_undo_assign_undo() will mark
the transaction as a dictionary transaction in the undo log header
right before the very first undo log record is written.
btr_index_rec_validate(): Account for instant ADD COLUMN.
row_undo_ins_remove_clust_rec(): On the rollback of an insert into
SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
trx_undo_update_rec_get_update(): Handle the 'default row'
as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
before insert or update. After instant ADD COLUMN, if the last fields
of a clustered index tuple match the 'default row', there is no
need to store them. While trimming the entry, we must hold a page latch,
so that the table cannot be emptied and the 'default row' be deleted.
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
of fields to be between n_core_fields and n_fields. Do not support the
infimum and supremum records; they are never supposed to be stored in
dtuple_t, because page creation nowadays uses a lower-level method for
initializing them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the
'default row', handle rollback specially. For user records, omit
fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Skip locking and adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
In the temporary file that is applied by row_log_table_apply(),
we must identify whether the records contain the extra header for
instantly added columns. For now, we will allocate an additional byte
for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
fine, as they will be converted and will only contain 'core' columns
(PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(),
rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
An info_bits constant for distinguishing the 'default row' record.
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of
rec_init_offsets_comp_ordinary().
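
As an editorial illustration (a sketch, not part of the original change
description), the PAGE_INSTANT encoding referred to above can be read back
with a one-liner; the helper name is hypothetical, while
page_header_get_field() and the constants are the existing ones:

        /* PAGE_INSTANT shares its two bytes with PAGE_DIRECTION: the upper
        13 bits store dict_index_t::n_core_fields, and the 3 least
        significant bits of the last byte (PAGE_DIRECTION_B) keep the
        insert direction. */
        inline uint16_t example_page_get_instant(const page_t* page)
        {
                uint16_t i = page_header_get_field(page, PAGE_INSTANT);
                return i >> 3;
        }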

        page[PAGE_HEADER + PAGE_INSTANT] = 0;
        page[PAGE_HEADER + PAGE_DIRECTION_B] = PAGE_NO_DIRECTION;

        if (comp) {
                page[PAGE_HEADER + PAGE_N_HEAP] = 0x80;/*page_is_comp()*/
                page[PAGE_HEADER + PAGE_N_HEAP + 1] = PAGE_HEAP_NO_USER_LOW;
                page[PAGE_HEADER + PAGE_HEAP_TOP + 1] = PAGE_NEW_SUPREMUM_END;
                memcpy(page + PAGE_DATA, infimum_supremum_compact,
                       sizeof infimum_supremum_compact);
                memset(page + PAGE_NEW_SUPREMUM_END, 0,
                       srv_page_size - PAGE_DIR - PAGE_NEW_SUPREMUM_END);
                page[srv_page_size - PAGE_DIR - PAGE_DIR_SLOT_SIZE * 2 + 1]
                        = PAGE_NEW_SUPREMUM;
                page[srv_page_size - PAGE_DIR - PAGE_DIR_SLOT_SIZE + 1]
                        = PAGE_NEW_INFIMUM;
        } else {
                page[PAGE_HEADER + PAGE_N_HEAP + 1] = PAGE_HEAP_NO_USER_LOW;
                page[PAGE_HEADER + PAGE_HEAP_TOP + 1] = PAGE_OLD_SUPREMUM_END;
                memcpy(page + PAGE_DATA, infimum_supremum_redundant,
                       sizeof infimum_supremum_redundant);
                memset(page + PAGE_OLD_SUPREMUM_END, 0,
                       srv_page_size - PAGE_DIR - PAGE_OLD_SUPREMUM_END);
                page[srv_page_size - PAGE_DIR - PAGE_DIR_SLOT_SIZE * 2 + 1]
                        = PAGE_OLD_SUPREMUM;
                page[srv_page_size - PAGE_DIR - PAGE_DIR_SLOT_SIZE + 1]
                        = PAGE_OLD_INFIMUM;
        }
}
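
/* Editorial note (sketch, not part of the original source): after the code
above, an empty page contains only the infimum and supremum records and a
two-slot sparse page directory that grows downwards from the end of the page:

        page + PAGE_DATA                        infimum record (heap no 0)
        ...                                     supremum record (heap no 1)
        page + srv_page_size - PAGE_DIR
             - 2 * PAGE_DIR_SLOT_SIZE           slot 1, owning the supremum
        page + srv_page_size - PAGE_DIR
             - PAGE_DIR_SLOT_SIZE               slot 0, owning the infimum

Writing only the least significant byte (the "+ 1") of each big-endian
2-byte slot suffices, because PAGE_NEW_INFIMUM, PAGE_NEW_SUPREMUM,
PAGE_OLD_INFIMUM and PAGE_OLD_SUPREMUM are all below 256 and the preceding
memset() has zeroed the most significant byte. */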

/** Create an uncompressed index page.
@param[in,out]	block	buffer block
@param[in,out]	mtr	mini-transaction
@param[in]	comp	set unless ROW_FORMAT=REDUNDANT */
void page_create(buf_block_t *block, mtr_t *mtr, bool comp)
{
        mtr->page_create(*block, comp);
        buf_block_modify_clock_inc(block);
        page_create_low(block, comp);
}
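
/* Editorial usage sketch (hypothetical caller, not part of this file):
once a block has been allocated and X-latched in the same mini-transaction,
a new uncompressed index page is set up with

        page_create(block, mtr, dict_table_is_comp(index->table));

mtr->page_create() above logs the initialization as one INIT_INDEX_PAGE
redo record, so recovery can rebuild the page without replaying the
individual header and infimum/supremum writes that page_create_low()
performs directly on the frame. */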

/**********************************************************//**
Create a compressed B-tree index page. */
void
page_create_zip(
/*============*/
        buf_block_t*            block,          /*!< in/out: a buffer frame
                                                where the page is created */
        dict_index_t*           index,          /*!< in: the index of the
                                                page */
        ulint                   level,          /*!< in: the B-tree level
                                                of the page */
        trx_id_t                max_trx_id,     /*!< in: PAGE_MAX_TRX_ID */
        mtr_t*                  mtr)            /*!< in/out: mini-transaction
                                                handle */
{
        ut_ad(block);
        ut_ad(buf_block_get_page_zip(block));
        ut_ad(dict_table_is_comp(index->table));

        /* PAGE_MAX_TRX_ID or PAGE_ROOT_AUTO_INC are always 0 for
        temporary tables. */
        ut_ad(max_trx_id == 0 || !index->table->is_temporary());
        /* In secondary indexes and the change buffer, PAGE_MAX_TRX_ID
        must be zero on non-leaf pages. max_trx_id can be 0 when the
        index consists of an empty root (leaf) page. */
        ut_ad(max_trx_id == 0
              || level == 0
              || !dict_index_is_sec_or_ibuf(index)
              || index->table->is_temporary());
        /* In the clustered index, PAGE_ROOT_AUTO_INC or
        PAGE_MAX_TRX_ID must be 0 on other pages than the root. */
        ut_ad(level == 0 || max_trx_id == 0
              || !dict_index_is_sec_or_ibuf(index)
              || index->table->is_temporary());

        buf_block_modify_clock_inc(block);
        page_create_low(block, true);

        if (index->is_spatial()) {
                mach_write_to_2(FIL_PAGE_TYPE + block->frame, FIL_PAGE_RTREE);
                memset(block->frame + FIL_RTREE_SPLIT_SEQ_NUM, 0, 8);
                memset(block->page.zip.data + FIL_RTREE_SPLIT_SEQ_NUM, 0, 8);
        }

        mach_write_to_2(PAGE_HEADER + PAGE_LEVEL + block->frame, level);
        mach_write_to_8(PAGE_HEADER + PAGE_MAX_TRX_ID + block->frame,
                        max_trx_id);

        if (!page_zip_compress(block, index, page_zip_level, mtr)) {
                /* The compression of a newly created
                page should always succeed. */
                ut_error;
        }
}
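
/* Editorial note (sketch): the choice between the compressed and the
uncompressed variant is normally made on the page_zip descriptor of the
block, as page_create_empty() below does:

        if (buf_block_get_page_zip(block)) {
                page_create_zip(block, index, level, 0, mtr);
        } else {
                page_create(block, mtr, dict_table_is_comp(index->table));
        }

(level and the zero PAGE_MAX_TRX_ID are placeholders here.)  A freshly
created page contains only the infimum and supremum records, so
page_zip_compress() cannot run out of space; that is why a failure above
is treated as impossible (ut_error). */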

/**********************************************************//**
Empty a previously created B-tree index page. */
void
page_create_empty(
/*==============*/
        buf_block_t*    block,  /*!< in/out: B-tree block */
        dict_index_t*   index,  /*!< in: the index of the page */
        mtr_t*          mtr)    /*!< in/out: mini-transaction */
{
        trx_id_t        max_trx_id;
        page_zip_des_t* page_zip = buf_block_get_page_zip(block);

        ut_ad(fil_page_index_page_check(block->frame));
        ut_ad(!index->is_dummy);
        ut_ad(block->page.id().space() == index->table->space->id);

        /* Multiple transactions cannot simultaneously operate on the
        same temp-table in parallel. max_trx_id is ignored for temp
        tables because it is not required for MVCC. */
        if (dict_index_is_sec_or_ibuf(index)
            && !index->table->is_temporary()
            && page_is_leaf(block->frame)) {
                max_trx_id = page_get_max_trx_id(block->frame);
                ut_ad(max_trx_id);
        } else if (block->page.id().page_no() == index->page) {
                /* Preserve PAGE_ROOT_AUTO_INC. */
                max_trx_id = page_get_max_trx_id(block->frame);
        } else {
                max_trx_id = 0;
        }

        if (page_zip) {
                ut_ad(!index->table->is_temporary());
                page_create_zip(block, index,
                                page_header_get_field(block->frame,
                                                      PAGE_LEVEL),
                                max_trx_id, mtr);
        } else {
                page_create(block, mtr, index->table->not_redundant());
                if (index->is_spatial()) {
                        static_assert(((FIL_PAGE_INDEX & 0xff00)
                                       | byte(FIL_PAGE_RTREE))
                                      == FIL_PAGE_RTREE, "compatibility");
                        mtr->write<1>(*block, FIL_PAGE_TYPE + 1 + block->frame,
                                      byte(FIL_PAGE_RTREE));
                        if (mach_read_from_8(block->frame
                                             + FIL_RTREE_SPLIT_SEQ_NUM)) {
                                mtr->memset(block, FIL_RTREE_SPLIT_SEQ_NUM,
                                            8, 0);
                        }
                }

                if (max_trx_id) {
                        mtr->write<8>(*block, PAGE_HEADER + PAGE_MAX_TRX_ID
                                      + block->frame, max_trx_id);
                }
        }
}
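
/* Editorial note (sketch): emptying a page amounts to re-creating it; the
only subtlety is that the 8 bytes at PAGE_MAX_TRX_ID must survive, because
on a secondary-index leaf page they hold the maximum trx id and on a
clustered-index root page the same bytes store PAGE_ROOT_AUTO_INC.  A
hypothetical caller that wants to wipe a page therefore only needs

        page_create_empty(block, index, mtr);

and does not have to save and restore those header bytes itself. */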

/*************************************************************//**
Differs from page_copy_rec_list_end, because this function does not
touch the lock table and max trx id on page or compress the page.

IMPORTANT: The caller will have to update IBUF_BITMAP_FREE
if new_block is a compressed leaf page in a secondary index.
This has to be done either within the same mini-transaction,
or by invoking ibuf_reset_free_bits() before mtr_commit(). */
void
page_copy_rec_list_end_no_locks(
/*============================*/
        buf_block_t*    new_block,      /*!< in: index page to copy to */
        buf_block_t*    block,          /*!< in: index page of rec */
        rec_t*          rec,            /*!< in: record on page */
        dict_index_t*   index,          /*!< in: record descriptor */
        mtr_t*          mtr)            /*!< in: mtr */
{
        page_t*         new_page        = buf_block_get_frame(new_block);
        page_cur_t      cur1;
        page_cur_t      cur2;
        mem_heap_t*     heap            = NULL;
        rec_offs        offsets_[REC_OFFS_NORMAL_SIZE];
        rec_offs*       offsets         = offsets_;
        rec_offs_init(offsets_);

        page_cur_position(rec, block, &cur1);

        if (page_cur_is_before_first(&cur1)) {
                page_cur_move_to_next(&cur1);
        }

        btr_assert_not_corrupted(new_block, index);
        ut_a(page_is_comp(new_page) == page_rec_is_comp(rec));
        ut_a(mach_read_from_2(new_page + srv_page_size - 10) == (ulint)
             (page_is_comp(new_page) ? PAGE_NEW_INFIMUM : PAGE_OLD_INFIMUM));
        const ulint n_core = page_is_leaf(block->frame)
                ? index->n_core_fields : 0;

        page_cur_set_before_first(new_block, &cur2);

        /* Copy records from the original page to the new page */

        while (!page_cur_is_after_last(&cur1)) {
                rec_t*  ins_rec;
                offsets = rec_get_offsets(cur1.rec, index, offsets, n_core,
                                          ULINT_UNDEFINED, &heap);
                ins_rec = page_cur_insert_rec_low(&cur2, index,
                                                  cur1.rec, offsets, mtr);
                if (UNIV_UNLIKELY(!ins_rec)) {
                        ib::fatal() << "Rec offset " << page_offset(rec)
                                << ", cur1 offset " << page_offset(cur1.rec)
                                << ", cur2 offset " << page_offset(cur2.rec);
                }

                page_cur_move_to_next(&cur1);
                ut_ad(!(rec_get_info_bits(cur1.rec, page_is_comp(new_page))
                        & REC_INFO_MIN_REC_FLAG));
                cur2.rec = ins_rec;
        }

        if (UNIV_LIKELY_NULL(heap)) {
                mem_heap_free(heap);
        }
}
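
/* Editorial note: the loop above keeps cur2 positioned on the record that
was just inserted, so each page_cur_insert_rec_low() call appends right
after the previous copy and the relative order of the records on the source
page is preserved without any extra sorting. */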

/*************************************************************//**
Copies records from page to new_page, from a given record onward,
including that record. Infimum and supremum records are not copied.
The records are copied to the start of the record list on new_page.

IMPORTANT: The caller will have to update IBUF_BITMAP_FREE
if new_block is a compressed leaf page in a secondary index.
This has to be done either within the same mini-transaction,
or by invoking ibuf_reset_free_bits() before mtr_commit().

@return pointer to the original successor of the infimum record on
new_page, or NULL on zip overflow (new_block will be decompressed) */
rec_t*
page_copy_rec_list_end(
/*===================*/
        buf_block_t*    new_block,      /*!< in/out: index page to copy to */
        buf_block_t*    block,          /*!< in: index page containing rec */
        rec_t*          rec,            /*!< in: record on page */
        dict_index_t*   index,          /*!< in: record descriptor */
        mtr_t*          mtr)            /*!< in: mtr */
{
        page_t*         new_page        = buf_block_get_frame(new_block);
        page_zip_des_t* new_page_zip    = buf_block_get_page_zip(new_block);
        page_t*         page            = block->frame;
        rec_t*          ret             = page_rec_get_next(
                page_get_infimum_rec(new_page));
        ulint           num_moved       = 0;
        rtr_rec_move_t* rec_move        = NULL;
        mem_heap_t*     heap            = NULL;
        ut_ad(page_align(rec) == page);

#ifdef UNIV_ZIP_DEBUG
        if (new_page_zip) {
                page_zip_des_t* page_zip = buf_block_get_page_zip(block);
                ut_a(page_zip);

                /* Strict page_zip_validate() may fail here.
                Furthermore, btr_compress() may set FIL_PAGE_PREV to
                FIL_NULL on new_page while leaving it intact on
                new_page_zip.  So, we cannot validate new_page_zip. */
                ut_a(page_zip_validate_low(page_zip, page, index, TRUE));
        }
#endif /* UNIV_ZIP_DEBUG */
        ut_ad(buf_block_get_frame(block) == page);
        ut_ad(page_is_leaf(page) == page_is_leaf(new_page));
        ut_ad(page_is_comp(page) == page_is_comp(new_page));
        /* Here, "ret" may be pointing to a user record or the
        predefined supremum record. */

        const mtr_log_t log_mode = new_page_zip
                ? mtr->set_log_mode(MTR_LOG_NONE) : MTR_LOG_NONE;
        const bool was_empty = page_dir_get_n_heap(new_page)
                == PAGE_HEAP_NO_USER_LOW;
        alignas(2) byte h[PAGE_N_DIRECTION + 2 - PAGE_LAST_INSERT];
        memcpy_aligned<2>(h, PAGE_HEADER + PAGE_LAST_INSERT + new_page,
                          sizeof h);

        if (index->is_spatial()) {
                ulint   max_to_move = page_get_n_recs(
                        buf_block_get_frame(block));
                heap = mem_heap_create(256);

                rec_move = static_cast<rtr_rec_move_t*>(
                        mem_heap_alloc(heap, max_to_move * sizeof *rec_move));

                /* For spatial index, we need to insert recs one by one
                to keep recs ordered. */
                rtr_page_copy_rec_list_end_no_locks(new_block,
                                                    block, rec, index,
                                                    heap, rec_move,
                                                    max_to_move,
                                                    &num_moved,
                                                    mtr);
        } else {
                page_copy_rec_list_end_no_locks(new_block, block, rec,
                                                index, mtr);
                if (was_empty) {
                        mtr->memcpy<mtr_t::MAYBE_NOP>(*new_block, PAGE_HEADER
                                                      + PAGE_LAST_INSERT
                                                      + new_page, h, sizeof h);
                }
        }

        /* Update PAGE_MAX_TRX_ID on the uncompressed page.
        Modifications will be redo logged and copied to the compressed
        page in page_zip_compress() or page_zip_reorganize() below.
        Multiple transactions cannot simultaneously operate on the
        same temp-table in parallel. max_trx_id is ignored for temp
        tables because it is not required for MVCC. */
        if (dict_index_is_sec_or_ibuf(index)
            && page_is_leaf(page)
            && !index->table->is_temporary()) {
                ut_ad(!was_empty || page_dir_get_n_heap(new_page)
                      == PAGE_HEAP_NO_USER_LOW
                      + page_header_get_field(new_page, PAGE_N_RECS));
                page_update_max_trx_id(new_block, NULL,
                                       page_get_max_trx_id(page), mtr);
        }

        if (new_page_zip) {
                mtr_set_log_mode(mtr, log_mode);

                if (!page_zip_compress(new_block, index,
                                       page_zip_level, mtr)) {
                        /* Before trying to reorganize the page,
                        store the number of preceding records on the page. */
                        ulint   ret_pos
                                = page_rec_get_n_recs_before(ret);
                        /* Before copying, "ret" was the successor of
                        the predefined infimum record.  It must still
                        have at least one predecessor (the predefined
                        infimum record, or a freshly copied record
                        that is smaller than "ret"). */
                        ut_a(ret_pos > 0);

                        if (!page_zip_reorganize(new_block, index,
                                                 page_zip_level, mtr)) {

                                if (!page_zip_decompress(new_page_zip,
                                                         new_page, FALSE)) {
                                        ut_error;
                                }
                                ut_ad(page_validate(new_page, index));

                                if (heap) {
                                        mem_heap_free(heap);
                                }

                                return(NULL);
                        } else {
                                /* The page was reorganized:
                                Seek to ret_pos. */
                                ret = page_rec_get_nth(new_page, ret_pos);
                        }
                }
        }

        /* Update the lock table and possible hash index */

	if (dict_table_is_locking_disabled(index->table)) {
	} else if (rec_move && dict_index_is_spatial(index)) {
		lock_rtr_move_rec_list(new_block, block, rec_move, num_moved);
	} else {
		lock_move_rec_list_end(new_block, block, rec);
	}

	if (heap) {
		mem_heap_free(heap);
	}
	btr_search_move_or_delete_hash_entries(new_block, block);

	return(ret);
}

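/* A minimal caller sketch (illustrative only; new_block, block, split_rec,
index and mtr are hypothetical local names, not part of this file) showing
the responsibility stated in the comment of page_copy_rec_list_start()
below: after copying onto a compressed leaf page of a secondary index,
the caller must settle IBUF_BITMAP_FREE within the same mini-transaction,
or reset it before mtr_commit().

	if (rec_t* ret = page_copy_rec_list_start(new_block, block,
						  split_rec, index, &mtr)) {
		if (buf_block_get_page_zip(new_block)
		    && page_is_leaf(new_block->frame)
		    && !dict_index_is_clust(index)) {
			ibuf_reset_free_bits(new_block);
		}
	} else {
		// NULL: compression overflow; new_block was decompressed
	}
*/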
/*************************************************************//**
Copies records from page to new_page, up to the given record,
NOT including that record. Infimum and supremum records are not copied.
The records are copied to the end of the record list on new_page.

IMPORTANT: The caller will have to update IBUF_BITMAP_FREE
if new_block is a compressed leaf page in a secondary index.
This has to be done either within the same mini-transaction,
or by invoking ibuf_reset_free_bits() before mtr_commit().

@return pointer to the original predecessor of the supremum record on
new_page, or NULL on zip overflow (new_block will be decompressed) */
rec_t*
page_copy_rec_list_start(
/*=====================*/
	buf_block_t*	new_block,	/*!< in/out: index page to copy to */
	buf_block_t*	block,		/*!< in: index page containing rec */
	rec_t*		rec,		/*!< in: record on page */
	dict_index_t*	index,		/*!< in: record descriptor */
	mtr_t*		mtr)		/*!< in: mtr */
{
	ut_ad(page_align(rec) == block->frame);

	page_t*		new_page	= buf_block_get_frame(new_block);
	page_zip_des_t*	new_page_zip	= buf_block_get_page_zip(new_block);
	page_cur_t	cur1;
	page_cur_t	cur2;
	mem_heap_t*	heap		= NULL;
	ulint		num_moved	= 0;
	rtr_rec_move_t*	rec_move	= NULL;
	rec_t*		ret
		= page_rec_get_prev(page_get_supremum_rec(new_page));
	rec_offs	offsets_[REC_OFFS_NORMAL_SIZE];
	rec_offs*	offsets		= offsets_;
	rec_offs_init(offsets_);

	/* Here, "ret" may be pointing to a user record or the
	predefined infimum record. */

	if (page_rec_is_infimum(rec)) {
		return(ret);
	}

	mtr_log_t	log_mode = MTR_LOG_NONE;

	if (new_page_zip) {
		log_mode = mtr_set_log_mode(mtr, MTR_LOG_NONE);
	}

	page_cur_set_before_first(block, &cur1);
	page_cur_move_to_next(&cur1);

	page_cur_position(ret, new_block, &cur2);

	const ulint n_core = page_rec_is_leaf(rec) ? index->n_core_fields : 0;

	/* Copy records from the original page to the new page */
	if (index->is_spatial()) {
		ut_ad(!index->is_instant());
		ulint		max_to_move = page_get_n_recs(
			buf_block_get_frame(block));
		heap = mem_heap_create(256);

		rec_move = static_cast<rtr_rec_move_t*>(mem_heap_alloc(
				heap,
				sizeof (*rec_move) * max_to_move));

		/* For spatial index, we need to insert recs one by one
		to keep recs ordered. */
		rtr_page_copy_rec_list_start_no_locks(new_block,
						      block, rec, index, heap,
						      rec_move, max_to_move,
						      &num_moved, mtr);
	} else {
		while (page_cur_get_rec(&cur1) != rec) {
			offsets = rec_get_offsets(cur1.rec, index, offsets,
						  n_core,
						  ULINT_UNDEFINED, &heap);
			cur2.rec = page_cur_insert_rec_low(&cur2, index,
							   cur1.rec, offsets,
							   mtr);
			ut_a(cur2.rec);

			page_cur_move_to_next(&cur1);
			ut_ad(!(rec_get_info_bits(cur1.rec,
						  page_is_comp(new_page))
				& REC_INFO_MIN_REC_FLAG));
		}
	}

	/* Update PAGE_MAX_TRX_ID on the uncompressed page.
	Modifications will be redo logged and copied to the compressed
	page in page_zip_compress() or page_zip_reorganize() below.
	Multiple transactions cannot simultaneously operate on the
	same temp-table in parallel.
	max_trx_id is ignored for temp tables because it is not required
	for MVCC. */
	if (n_core && dict_index_is_sec_or_ibuf(index)
	    && !index->table->is_temporary()) {
		page_update_max_trx_id(new_block,
				       new_page_zip,
				       page_get_max_trx_id(block->frame),
				       mtr);
	}

	if (new_page_zip) {
		mtr_set_log_mode(mtr, log_mode);

		DBUG_EXECUTE_IF("page_copy_rec_list_start_compress_fail",
				goto zip_reorganize;);
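		/* Note: in a debug build this injection point can be
		exercised from a test, typically with something like
		SET debug_dbug = '+d,page_copy_rec_list_start_compress_fail';
		(a hypothetical test statement), forcing the
		page_zip_reorganize() fallback below even when
		page_zip_compress() would have succeeded. */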

		if (!page_zip_compress(new_block, index,
				       page_zip_level, mtr)) {
			ulint	ret_pos;
#ifndef DBUG_OFF
zip_reorganize:
#endif /* DBUG_OFF */
			/* Before trying to reorganize the page,
			store the number of preceding records on the page. */
			ret_pos = page_rec_get_n_recs_before(ret);
			/* Before copying, "ret" was the predecessor
			of the predefined supremum record.  If it was
			the predefined infimum record, then it would
			still be the infimum, and we would have
			ret_pos == 0. */

			if (UNIV_UNLIKELY
			    (!page_zip_reorganize(new_block, index,
						  page_zip_level, mtr))) {

				if (UNIV_UNLIKELY
				    (!page_zip_decompress(new_page_zip,
							  new_page, FALSE))) {
					ut_error;
				}
				ut_ad(page_validate(new_page, index));

				if (UNIV_LIKELY_NULL(heap)) {
					mem_heap_free(heap);
				}

				return(NULL);
			}

			/* The page was reorganized: Seek to ret_pos. */
			ret = page_rec_get_nth(new_page, ret_pos);
		}
	}

	/* Update the lock table and possible hash index */

cmp_rec_rec_with_match(): Implement the comparison between two
MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance.
A dictionary transaction must be flagged as such before it has generated
any undo log records. This is because trx_undo_assign_undo() will mark
the transaction as a dictionary transaction in the undo log header
right before the very first undo log record is being written.
btr_index_rec_validate(): Account for instant ADD COLUMN.
row_undo_ins_remove_clust_rec(): On the rollback of an insert into
SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
trx_undo_update_rec_get_update(): Handle the 'default row'
as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
before insert or update. After instant ADD COLUMN, if the last fields
of a clustered index tuple match the 'default row', there is no
need to store them. While trimming the entry, we must hold a page latch,
so that the table cannot be emptied and the 'default row' be deleted.
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
of fields to be between n_core_fields and n_fields. Do not support
infimum,supremum. They are never supposed to be stored in dtuple_t,
because page creation nowadays uses a lower-level method for initializing
them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the
'default row', handle rollback specially. For user records, omit
fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Skip locking and adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
In the temporary file that is applied by row_log_table_apply(),
we must identify whether the records contain the extra header for
instantly added columns. For now, we will allocate an additional byte
for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
fine, as they will be converted and will only contain 'core' columns
(PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(),
rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
An info_bits constant for distinguishing the 'default row' record.
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of
rec_init_offsets_comp_ordinary().
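As an illustration of the PAGE_INSTANT layout described above, here is a
minimal, self-contained sketch. This is not the actual MariaDB code (the real
accessors are page_get_instant() and page_set_instant(), and the real field
and constant names differ); it only assumes what the message states, namely
that the 16-bit word keeps the old 3-bit direction value in its least
significant bits and devotes the remaining 13 bits to the core field count.

#include <cstdint>

/* Hypothetical helpers, for illustration only; the exact bit layout follows
the description above, not the real page header definitions. */
static inline uint16_t sketch_page_get_instant(uint16_t direction_word)
{
  return uint16_t(direction_word >> 3);  /* upper 13 bits: n_core_fields */
}

static inline uint16_t sketch_page_get_direction(uint16_t direction_word)
{
  return uint16_t(direction_word & 7);   /* lower 3 bits: PAGE_DIRECTION */
}

static inline uint16_t sketch_page_set_instant(uint16_t direction_word,
                                               uint16_t n_core_fields)
{
  return uint16_t((n_core_fields << 3) | (direction_word & 7));
}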
	if (dict_table_is_locking_disabled(index->table)) {
	} else if (dict_index_is_spatial(index)) {
		lock_rtr_move_rec_list(new_block, block, rec_move, num_moved);
	} else {
		lock_move_rec_list_start(new_block, block, rec, ret);
	}

	if (heap) {
		mem_heap_free(heap);
	}

	btr_search_move_or_delete_hash_entries(new_block, block);

	return(ret);
}

/*************************************************************//**
Deletes records from a page from a given record onward, including that record.
The infimum and supremum records are not deleted. */
void
page_delete_rec_list_end(
/*=====================*/
	rec_t*		rec,	/*!< in: pointer to record on page */
	buf_block_t*	block,	/*!< in: buffer block of the page */
	dict_index_t*	index,	/*!< in: record descriptor */
	ulint		n_recs,	/*!< in: number of records to delete,
				or ULINT_UNDEFINED if not known */
	ulint		size,	/*!< in: the sum of the sizes of the
				records in the end of the chain to
				delete, or ULINT_UNDEFINED if not known */
	mtr_t*		mtr)	/*!< in: mtr */
{
  ut_ad(size == ULINT_UNDEFINED || size < srv_page_size);
  ut_ad(page_align(rec) == block->frame);
  ut_ad(index->table->not_redundant() == !!page_is_comp(block->frame));
#ifdef UNIV_ZIP_DEBUG
  ut_a(!block->page.zip.data ||
       page_zip_validate(&block->page.zip, block->frame, index));
#endif /* UNIV_ZIP_DEBUG */

  if (page_rec_is_supremum(rec))
  {
    ut_ad(n_recs == 0 || n_recs == ULINT_UNDEFINED);
    /* Nothing to do, there are no records bigger than the page supremum. */
    return;
  }

  if (page_rec_is_infimum(rec) || n_recs == page_get_n_recs(block->frame) ||
      rec == (page_is_comp(block->frame)
              ? page_rec_get_next_low(block->frame + PAGE_NEW_INFIMUM, 1)
              : page_rec_get_next_low(block->frame + PAGE_OLD_INFIMUM, 0)))
  {
    /* We are deleting all records. */
    page_create_empty(block, index, mtr);
    return;
  }

#if 0 // FIXME: consider deleting the last record as a special case
  if (page_rec_is_last(rec))
  {
    page_cur_t cursor= { index, rec, offsets, block };
    page_cur_delete_rec(&cursor, index, offsets, mtr);
    return;
  }
#endif

  /* The page becomes invalid for optimistic searches */
  buf_block_modify_clock_inc(block);

  const ulint n_core= page_is_leaf(block->frame) ? index->n_core_fields : 0;
  mem_heap_t *heap= nullptr;
  rec_offs offsets_[REC_OFFS_NORMAL_SIZE];
  rec_offs *offsets= offsets_;
  rec_offs_init(offsets_);

#if 1 // FIXME: remove this, and write minimal amount of log! */
  if (UNIV_LIKELY_NULL(block->page.zip.data))
  {
    ut_ad(page_is_comp(block->frame));
    do
    {
      page_cur_t cur;
      page_cur_position(rec, block, &cur);
      offsets= rec_get_offsets(rec, index, offsets, n_core,
                               ULINT_UNDEFINED, &heap);
      rec= rec_get_next_ptr(rec, TRUE);
#ifdef UNIV_ZIP_DEBUG
      ut_a(page_zip_validate(&block->page.zip, block->frame, index));
#endif /* UNIV_ZIP_DEBUG */
      page_cur_delete_rec(&cur, index, offsets, mtr);
    }
    while (page_offset(rec) != PAGE_NEW_SUPREMUM);

    if (UNIV_LIKELY_NULL(heap))
      mem_heap_free(heap);
    return;
  }
#endif

  byte *prev_rec= page_rec_get_prev(rec);
  byte *last_rec= page_rec_get_prev(page_get_supremum_rec(block->frame));

  // FIXME: consider a special case of shrinking PAGE_HEAP_TOP

  const bool scrub= srv_immediate_scrub_data_uncompressed;
  if (scrub || size == ULINT_UNDEFINED || n_recs == ULINT_UNDEFINED)
  {
    rec_t *rec2= rec;
    /* Calculate the sum of sizes and the number of records */
    size= 0;
    n_recs= 0;

    do
    {
      offsets = rec_get_offsets(rec2, index, offsets, n_core,
                                ULINT_UNDEFINED, &heap);
      ulint s= rec_offs_size(offsets);
      ut_ad(ulint(rec2 - block->frame) + s - rec_offs_extra_size(offsets) <
            srv_page_size);
      ut_ad(size + s < srv_page_size);
      size+= s;
      n_recs++;

      if (scrub)
        mtr->memset(block, page_offset(rec2), rec_offs_data_size(offsets), 0);

      rec2 = page_rec_get_next(rec2);
    }
    while (!page_rec_is_supremum(rec2));

    if (UNIV_LIKELY_NULL(heap))
      mem_heap_free(heap);
  }

  ut_ad(size < srv_page_size);

  ulint slot_index, n_owned;
  {
    const rec_t *owner_rec= rec;
    ulint count= 0;

    if (page_is_comp(block->frame))
      while (!(n_owned= rec_get_n_owned_new(owner_rec)))
      {
        count++;
        owner_rec= rec_get_next_ptr_const(owner_rec, TRUE);
      }
    else
      while (!(n_owned= rec_get_n_owned_old(owner_rec)))
      {
        count++;
        owner_rec= rec_get_next_ptr_const(owner_rec, FALSE);
      }

    ut_ad(n_owned > count);
    n_owned-= count;
    slot_index= page_dir_find_owner_slot(owner_rec);
    ut_ad(slot_index > 0);
  }

  mtr->write<2,mtr_t::MAYBE_NOP>(*block, my_assume_aligned<2>
                                 (PAGE_N_DIR_SLOTS + PAGE_HEADER +
                                  block->frame), slot_index + 1);
  mtr->write<2,mtr_t::MAYBE_NOP>(*block, my_assume_aligned<2>
                                 (PAGE_LAST_INSERT + PAGE_HEADER +
                                  block->frame), 0U);
  /* Catenate the deleted chain segment to the page free list */
  alignas(4) byte page_header[4];
  byte *page_free= my_assume_aligned<4>(PAGE_HEADER + PAGE_FREE +
                                        block->frame);
  const uint16_t free= page_header_get_field(block->frame, PAGE_FREE);
  static_assert(PAGE_FREE + 2 == PAGE_GARBAGE, "compatibility");

  mach_write_to_2(page_header, page_offset(rec));
  mach_write_to_2(my_assume_aligned<2>(page_header + 2),
                  mach_read_from_2(my_assume_aligned<2>(page_free + 2)) +
                  size);
  mtr->memcpy(*block, page_free, page_header, 4);

  byte *page_n_recs= my_assume_aligned<2>(PAGE_N_RECS + PAGE_HEADER +
                                          block->frame);
  mtr->write<2>(*block, page_n_recs,
                ulint{mach_read_from_2(page_n_recs)} - n_recs);

  /* Update the page directory; there is no need to balance the number
  of the records owned by the supremum record, as it is allowed to be
  less than PAGE_DIR_SLOT_MIN_N_OWNED */
  page_dir_slot_t *slot= page_dir_get_nth_slot(block->frame, slot_index);

  if (page_is_comp(block->frame))
  {
    mtr->write<2,mtr_t::MAYBE_NOP>(*block, slot, PAGE_NEW_SUPREMUM);
    byte *owned= PAGE_NEW_SUPREMUM - REC_NEW_N_OWNED + block->frame;
MDEV-21907: InnoDB: Enable -Wconversion on clang and GCC
The -Wconversion in GCC seems to be stricter than in clang.
GCC at least since version 4.4.7 issues truncation warnings for
assignments to bitfields, while clang 10 appears to only issue
warnings when the sizes in bytes rounded to the nearest integer
powers of 2 are different.
Before GCC 10.0.0, -Wconversion required more casts and would not
allow some operations, such as x<<=1 or x+=1 on a data type that
is narrower than int.
GCC 5 (but not GCC 4, GCC 6, or any later version) is complaining
about x|=y even when x and y are compatible types that are narrower
than int. Hence, we must rewrite some x|=y as
x=static_cast<byte>(x|y) or similar, or we must disable -Wconversion.
In GCC 6 and later, the warning for assigning wider to bitfields
that are narrower than 8, 16, or 32 bits can be suppressed by
applying a bitwise & with the exact bitmask of the bitfield.
For older GCC, we must disable -Wconversion for GCC 4 or 5 in such
cases.
The bitwise negation operator appears to promote short integers
to a wider type, and hence we must add explicit truncation casts
around them. Microsoft Visual C does not allow a static_cast to
truncate a constant, such as static_cast<byte>(1) truncating int.
Hence, we will use the constructor-style cast byte(~1) for such cases.
This has been tested at least with GCC 4.8.5, 5.4.0, 7.4.0, 9.2.1, 10.0.0,
clang 9.0.1, 10.0.0, and MSVC 14.22.27905 (Microsoft Visual Studio 2019)
on 64-bit and 32-bit targets (IA-32, AMD64, POWER 8, POWER 9, ARMv8).
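    /* Note (added for clarity): the explicit static_cast<byte> below narrows
    the result of the bitwise OR back to byte, following the -Wconversion
    rationale described above. */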
    byte new_owned= static_cast<byte>((*owned & ~REC_N_OWNED_MASK) |
                                      n_owned << REC_N_OWNED_SHIFT);
#if 0 // FIXME: implement minimal logging for ROW_FORMAT=COMPRESSED
    if (UNIV_LIKELY_NULL(block->page.zip.data))
    {
      *owned= new_owned;
      memcpy_aligned<2>(PAGE_N_DIR_SLOTS + PAGE_HEADER + block->page.zip.data,
                        PAGE_N_DIR_SLOTS + PAGE_HEADER + block->frame,
                        PAGE_N_RECS + 2 - PAGE_N_DIR_SLOTS);
      // TODO: the equivalent of page_zip_dir_delete() for all records
      mach_write_to_2(prev_rec - REC_NEXT, static_cast<uint16_t>
                      (PAGE_NEW_SUPREMUM - page_offset(prev_rec)));
      mach_write_to_2(last_rec - REC_NEXT, free
                      ? static_cast<uint16_t>(free - page_offset(last_rec))
                      : 0U);
      return;
    }
#endif
    mtr->write<1,mtr_t::MAYBE_NOP>(*block, owned, new_owned);
    mtr->write<2>(*block, prev_rec - REC_NEXT, static_cast<uint16_t>
                  (PAGE_NEW_SUPREMUM - page_offset(prev_rec)));
    mtr->write<2>(*block, last_rec - REC_NEXT, free
                  ? static_cast<uint16_t>(free - page_offset(last_rec))
                  : 0U);
  }
  else
  {
    mtr->write<2,mtr_t::MAYBE_NOP>(*block, slot, PAGE_OLD_SUPREMUM);
    byte *owned= PAGE_OLD_SUPREMUM - REC_OLD_N_OWNED + block->frame;
    byte new_owned= static_cast<byte>((*owned & ~REC_N_OWNED_MASK) |
                                      n_owned << REC_N_OWNED_SHIFT);
    mtr->write<1,mtr_t::MAYBE_NOP>(*block, owned, new_owned);
    mtr->write<2>(*block, prev_rec - REC_NEXT, PAGE_OLD_SUPREMUM);
    mtr->write<2>(*block, last_rec - REC_NEXT, free);
  }
}
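
/* Illustrative usage sketch (not part of the original file; the helper name
is hypothetical): when neither the number of records nor their total size is
known in advance, both may be passed as ULINT_UNDEFINED and
page_delete_rec_list_end() will compute them itself while walking the chain,
as the size-calculation loop above shows. */
static void delete_tail_example(rec_t* rec, buf_block_t* block,
                                dict_index_t* index, mtr_t* mtr)
{
	page_delete_rec_list_end(rec, block, index,
				 ULINT_UNDEFINED, ULINT_UNDEFINED, mtr);
}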

/*************************************************************//**
Deletes records from page, up to the given record, NOT including
that record. Infimum and supremum records are not deleted. */
void
page_delete_rec_list_start(
/*=======================*/
	rec_t*		rec,	/*!< in: record on page */
	buf_block_t*	block,	/*!< in: buffer block of the page */
	dict_index_t*	index,	/*!< in: record descriptor */
	mtr_t*		mtr)	/*!< in: mtr */
{
	page_cur_t	cur1;
	rec_offs	offsets_[REC_OFFS_NORMAL_SIZE];
	rec_offs*	offsets		= offsets_;
	mem_heap_t*	heap		= NULL;

	rec_offs_init(offsets_);

MDEV-20788: Bogus assertion failure for PAGE_FREE list
In MDEV-11369 (instant ADD COLUMN) in MariaDB Server 10.3,
we introduced the hidden metadata record that must be the
first record in the clustered index if and only if
index->is_instant() holds.
To catch MDEV-19783, in
commit ed0793e096a17955c5a03844b248bcf8303dd335 and
commit 99dc40d6ac2234fa4c990665cc1914c1925cd641
we added some assertions to find cases where
the metadata record is missing while it should not be, or a
record exists when it should not. Those assertions were invalid
when traversing the PAGE_FREE list. That list can contain anything;
we must only be able to determine the successor and the size of
each garbage record in it.
page_validate(), page_simple_validate_old(), page_simple_validate_new():
Do not invoke page_rec_get_next_const() for traversing the PAGE_FREE
list, but instead use a lower-level accessor that does not attempt to
validate the REC_INFO_MIN_REC_FLAG.
page_copy_rec_list_end_no_locks(),
page_copy_rec_list_start(), page_delete_rec_list_start():
Add assertions.
btr_page_get_split_rec_to_left(): Remove a redundant return value,
and make the output parameter the return value.
btr_page_get_split_rec_to_right(), btr_page_split_and_insert(): Clean up.
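	/* Note (added for clarity): the page_align() assertion below is one
	of the assertions referred to in the MDEV-20788 message above; it
	checks that 'rec' indeed resides on the frame of 'block'. */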
	ut_ad(page_align(rec) == block->frame);
	ut_ad((ibool) !!page_rec_is_comp(rec)
	      == dict_table_is_comp(index->table));
#ifdef UNIV_ZIP_DEBUG
	{
		page_zip_des_t*	page_zip= buf_block_get_page_zip(block);
		page_t*		page = buf_block_get_frame(block);

		/* page_zip_validate() would detect a min_rec_mark mismatch
		in btr_page_split_and_insert()
		between btr_attach_half_pages() and insert_page = ...
		when btr_page_get_split_rec_to_left() holds
		(direction == FSP_DOWN). */
		ut_a(!page_zip
		     || page_zip_validate_low(page_zip, page, index, TRUE));
	}
#endif /* UNIV_ZIP_DEBUG */

	if (page_rec_is_infimum(rec)) {
		return;
	}

	if (page_rec_is_supremum(rec)) {
		/* We are deleting all records. */
		page_create_empty(block, index, mtr);
		return;
	}

	page_cur_set_before_first(block, &cur1);
	page_cur_move_to_next(&cur1);

	const ulint	n_core = page_rec_is_leaf(rec)
		? index->n_core_fields : 0;

	while (page_cur_get_rec(&cur1) != rec) {
		offsets = rec_get_offsets(page_cur_get_rec(&cur1), index,
					  offsets, n_core,
					  ULINT_UNDEFINED, &heap);
		page_cur_delete_rec(&cur1, index, offsets, mtr);
	}

	if (UNIV_LIKELY_NULL(heap)) {
		mem_heap_free(heap);
	}
}

/*************************************************************//**
Moves record list end to another page. Moved records include
split_rec.

IMPORTANT: The caller will have to update IBUF_BITMAP_FREE
if new_block is a compressed leaf page in a secondary index.
This has to be done either within the same mini-transaction,
or by invoking ibuf_reset_free_bits() before mtr_commit().

@return TRUE on success; FALSE on compression failure (new_block will
be decompressed) */
ibool
page_move_rec_list_end(
/*===================*/
	buf_block_t*	new_block,	/*!< in/out: index page where to move */
	buf_block_t*	block,		/*!< in: index page from where to move */
	rec_t*		split_rec,	/*!< in: first record to move */
	dict_index_t*	index,		/*!< in: record descriptor */
	mtr_t*		mtr)		/*!< in: mtr */
{
	page_t*		new_page	= buf_block_get_frame(new_block);
	ulint		old_data_size;
	ulint		new_data_size;
	ulint		old_n_recs;
	ulint		new_n_recs;

	ut_ad(!dict_index_is_spatial(index));

	old_data_size = page_get_data_size(new_page);
	old_n_recs = page_get_n_recs(new_page);
#ifdef UNIV_ZIP_DEBUG
	{
		page_zip_des_t*	new_page_zip
			= buf_block_get_page_zip(new_block);
		page_zip_des_t*	page_zip
			= buf_block_get_page_zip(block);
		ut_a(!new_page_zip == !page_zip);
		ut_a(!new_page_zip
		     || page_zip_validate(new_page_zip, new_page, index));
		ut_a(!page_zip
		     || page_zip_validate(page_zip, page_align(split_rec),
					  index));
	}
#endif /* UNIV_ZIP_DEBUG */

	if (UNIV_UNLIKELY(!page_copy_rec_list_end(new_block, block,
						  split_rec, index, mtr))) {
		return(FALSE);
	}

	new_data_size = page_get_data_size(new_page);
	new_n_recs = page_get_n_recs(new_page);

	ut_ad(new_data_size >= old_data_size);

	page_delete_rec_list_end(split_rec, block, index,
				 new_n_recs - old_n_recs,
				 new_data_size - old_data_size, mtr);

	return(TRUE);
}
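
/* Illustrative sketch only (the helper name and structure are hypothetical):
one way for a caller to honour the IBUF_BITMAP_FREE requirement stated in the
comment above is to reset the insert buffer free bits after a successful move
when new_block is a compressed leaf page of a secondary index. */
static ibool move_end_and_reset_free_bits(buf_block_t* new_block,
					  buf_block_t* block,
					  rec_t* split_rec,
					  dict_index_t* index,
					  mtr_t* mtr)
{
	if (!page_move_rec_list_end(new_block, block, split_rec,
				    index, mtr)) {
		return(FALSE);
	}

	if (buf_block_get_page_zip(new_block)
	    && !dict_index_is_clust(index)
	    && page_is_leaf(buf_block_get_frame(new_block))) {
		/* Reset the free bits before the mini-transaction commits. */
		ibuf_reset_free_bits(new_block);
	}

	return(TRUE);
}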

/*************************************************************//**
Moves record list start to another page. Moved records do not include
split_rec.

IMPORTANT: The caller will have to update IBUF_BITMAP_FREE
if new_block is a compressed leaf page in a secondary index.
This has to be done either within the same mini-transaction,
or by invoking ibuf_reset_free_bits() before mtr_commit().

@return TRUE on success; FALSE on compression failure */
ibool
page_move_rec_list_start(
/*=====================*/
	buf_block_t*	new_block,	/*!< in/out: index page where to move */
	buf_block_t*	block,		/*!< in/out: page containing split_rec */
	rec_t*		split_rec,	/*!< in: first record not to move */
	dict_index_t*	index,		/*!< in: record descriptor */
	mtr_t*		mtr)		/*!< in: mtr */
{
	if (UNIV_UNLIKELY(!page_copy_rec_list_start(new_block, block,
						    split_rec, index, mtr))) {
		return(FALSE);
	}

	page_delete_rec_list_start(split_rec, block, index, mtr);

	return(TRUE);
}

/************************************************************//**
Returns the nth record of the record list.
This is the inverse function of page_rec_get_n_recs_before().
@return nth record */
const rec_t*
page_rec_get_nth_const(
/*===================*/
	const page_t*	page,	/*!< in: page */
	ulint		nth)	/*!< in: nth record */
{
	const page_dir_slot_t*	slot;
	ulint			i;
	ulint			n_owned;
	const rec_t*		rec;

	if (nth == 0) {
		return(page_get_infimum_rec(page));
	}

	ut_ad(nth < srv_page_size / (REC_N_NEW_EXTRA_BYTES + 1));

	for (i = 0;; i++) {
		slot = page_dir_get_nth_slot(page, i);
		n_owned = page_dir_slot_get_n_owned(slot);

		if (n_owned > nth) {
			break;
		} else {
			nth -= n_owned;
		}
	}

	ut_ad(i > 0);
	slot = page_dir_get_nth_slot(page, i - 1);
	rec = page_dir_slot_get_rec(slot);

	if (page_is_comp(page)) {
		do {
			rec = page_rec_get_next_low(rec, TRUE);
			ut_ad(rec);
		} while (nth--);
	} else {
		do {
			rec = page_rec_get_next_low(rec, FALSE);
			ut_ad(rec);
		} while (nth--);
	}

	return(rec);
}
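
/* Illustrative sketch (the helper name is hypothetical, not in the original
file): the "inverse function" relationship stated in the comment above can be
expressed as a debug round-trip check for any valid record count on the page. */
static void page_rec_nth_roundtrip_check(const page_t* page, ulint nth)
{
	ut_ad(page_rec_get_n_recs_before(page_rec_get_nth_const(page, nth))
	      == nth);
}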
|
|
|
|
|
|
|
|
/***************************************************************//**
|
|
|
|
Returns the number of records before the given record in chain.
|
|
|
|
The number includes infimum and supremum records.
|
2016-08-12 11:17:45 +03:00
|
|
|
@return number of records */
|
2014-02-26 19:11:54 +01:00
|
|
|
ulint
|
|
|
|
page_rec_get_n_recs_before(
|
|
|
|
/*=======================*/
|
|
|
|
const rec_t* rec) /*!< in: the physical record */
|
|
|
|
{
|
|
|
|
const page_dir_slot_t* slot;
|
|
|
|
const rec_t* slot_rec;
|
|
|
|
const page_t* page;
|
|
|
|
ulint i;
|
|
|
|
lint n = 0;
|
|
|
|
|
|
|
|
ut_ad(page_rec_check(rec));
|
|
|
|
|
|
|
|
page = page_align(rec);
	if (page_is_comp(page)) {
		while (rec_get_n_owned_new(rec) == 0) {

			rec = rec_get_next_ptr_const(rec, TRUE);
			n--;
		}

		for (i = 0; ; i++) {
			slot = page_dir_get_nth_slot(page, i);
			slot_rec = page_dir_slot_get_rec(slot);

			n += lint(rec_get_n_owned_new(slot_rec));

			if (rec == slot_rec) {

				break;
			}
		}
	} else {
		while (rec_get_n_owned_old(rec) == 0) {

			rec = rec_get_next_ptr_const(rec, FALSE);
			n--;
		}

		for (i = 0; ; i++) {
			slot = page_dir_get_nth_slot(page, i);
			slot_rec = page_dir_slot_get_rec(slot);

			n += lint(rec_get_n_owned_old(slot_rec));

			if (rec == slot_rec) {

				break;
			}
		}
	}

	n--;

	ut_ad(n >= 0);
	ut_ad((ulong) n < srv_page_size / (REC_N_NEW_EXTRA_BYTES + 1));

	return((ulint) n);
}
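
/* A minimal round-trip sketch for the two functions above (illustrative
only; assumes the page is latched and rec resides on it):

	ulint		pos = page_rec_get_n_recs_before(rec);
	const rec_t*	same = page_rec_get_nth_const(page_align(rec), pos);
	ut_ad(same == rec);

Position 0 is the page infimum, so the identity holds for the infimum,
the supremum and every user record. */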

/************************************************************//**
Prints record contents including the data relevant only in
the index page context. */
void
page_rec_print(
/*===========*/
	const rec_t*	rec,	/*!< in: physical record */
	const rec_offs*	offsets)/*!< in: record descriptor */
{
	ut_a(!page_rec_is_comp(rec) == !rec_offs_comp(offsets));
	rec_print_new(stderr, rec, offsets);
	if (page_rec_is_comp(rec)) {
		ib::info() << "n_owned: " << rec_get_n_owned_new(rec)
			<< "; heap_no: " << rec_get_heap_no_new(rec)
			<< "; next rec: " << rec_get_next_offs(rec, TRUE);
	} else {
		ib::info() << "n_owned: " << rec_get_n_owned_old(rec)
			<< "; heap_no: " << rec_get_heap_no_old(rec)
			<< "; next rec: " << rec_get_next_offs(rec, FALSE);
	}

	page_rec_check(rec);
	rec_validate(rec, offsets);
}
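
/* Hedged usage sketch (not part of the original code): the offsets
array is typically produced by rec_get_offsets(), as page_print_list()
does below. For a single record rec of index 'index':

	mem_heap_t*	heap = NULL;
	rec_offs	offsets_[REC_OFFS_NORMAL_SIZE];
	rec_offs*	offsets = offsets_;
	rec_offs_init(offsets_);

	offsets = rec_get_offsets(rec, index, offsets,
				  page_rec_is_leaf(rec),
				  ULINT_UNDEFINED, &heap);
	page_rec_print(rec, offsets);

	if (UNIV_LIKELY_NULL(heap)) {
		mem_heap_free(heap);
	}
*/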

#ifdef UNIV_BTR_PRINT
/***************************************************************//**
This is used to print the contents of the directory for
debugging purposes. */
void
page_dir_print(
/*===========*/
	page_t*	page,	/*!< in: index page */
	ulint	pr_n)	/*!< in: print n first and n last entries */
{
	ulint			n;
	ulint			i;
	page_dir_slot_t*	slot;

	n = page_dir_get_n_slots(page);

	fprintf(stderr, "--------------------------------\n"
		"PAGE DIRECTORY\n"
		"Page address %p\n"
		"Directory stack top at offs: %lu; number of slots: %lu\n",
		page, (ulong) page_offset(page_dir_get_nth_slot(page, n - 1)),
		(ulong) n);
	for (i = 0; i < n; i++) {
		slot = page_dir_get_nth_slot(page, i);
		if ((i == pr_n) && (i < n - pr_n)) {
			fputs(" ... \n", stderr);
		}
		if ((i < pr_n) || (i >= n - pr_n)) {
			fprintf(stderr,
				"Contents of slot: %lu: n_owned: %lu,"
				" rec offs: %lu\n",
				(ulong) i,
				(ulong) page_dir_slot_get_n_owned(slot),
				(ulong)
				page_offset(page_dir_slot_get_rec(slot)));
		}
	}
	fprintf(stderr, "Total of %lu records\n"
		"--------------------------------\n",
		(ulong) (PAGE_HEAP_NO_USER_LOW + page_get_n_recs(page)));
}
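
/* Illustrative call (an assumption, not from the original source):
print the first and last five directory slots, eliding the middle:

	page_dir_print(page, 5);
*/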

/***************************************************************//**
This is used to print the contents of the page record list for
debugging purposes. */
void
page_print_list(
/*============*/
	buf_block_t*	block,	/*!< in: index page */
	dict_index_t*	index,	/*!< in: dictionary index of the page */
	ulint		pr_n)	/*!< in: print n first and n last entries */
{
	page_t*		page		= block->frame;
	page_cur_t	cur;
	ulint		count;
	ulint		n_recs;
	mem_heap_t*	heap		= NULL;
	rec_offs	offsets_[REC_OFFS_NORMAL_SIZE];
	rec_offs*	offsets		= offsets_;
	rec_offs_init(offsets_);

	ut_a((ibool)!!page_is_comp(page) == dict_table_is_comp(index->table));

	fprintf(stderr,
		"--------------------------------\n"
		"PAGE RECORD LIST\n"
		"Page address %p\n", page);

	n_recs = page_get_n_recs(page);

	page_cur_set_before_first(block, &cur);
	count = 0;
	for (;;) {
		offsets = rec_get_offsets(cur.rec, index, offsets,
					  page_rec_is_leaf(cur.rec),
					  ULINT_UNDEFINED, &heap);
		page_rec_print(cur.rec, offsets);

		if (count == pr_n) {
			break;
		}
		if (page_cur_is_after_last(&cur)) {
			break;
		}
		page_cur_move_to_next(&cur);
		count++;
	}

	if (n_recs > 2 * pr_n) {
		fputs(" ... \n", stderr);
	}

	while (!page_cur_is_after_last(&cur)) {
		page_cur_move_to_next(&cur);

		if (count + pr_n >= n_recs) {
			offsets = rec_get_offsets(cur.rec, index, offsets,
						  page_rec_is_leaf(cur.rec),
						  ULINT_UNDEFINED, &heap);
			page_rec_print(cur.rec, offsets);
		}
		count++;
	}

	fprintf(stderr,
		"Total of %lu records \n"
		"--------------------------------\n",
		(ulong) (count + 1));

	if (UNIV_LIKELY_NULL(heap)) {
		mem_heap_free(heap);
	}
}
|
|
|
|
|
|
|
|
/***************************************************************//**
|
|
|
|
Prints the info in a page header. */
|
|
|
|
void
|
|
|
|
page_header_print(
|
|
|
|
/*==============*/
|
|
|
|
const page_t* page)
|
|
|
|
{
|
|
|
|
fprintf(stderr,
|
|
|
|
"--------------------------------\n"
|
|
|
|
"PAGE HEADER INFO\n"
|
MDEV-11369 Instant ADD COLUMN for InnoDB
For InnoDB tables, adding, dropping and reordering columns has
required a rebuild of the table and all its indexes. Since MySQL 5.6
(and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing
concurrent modification of the tables.
This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT
and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously,
with only minor changes performed to the table structure. The counter
innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS
is incremented whenever a table rebuild operation is converted into
an instant ADD COLUMN operation.
ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.
Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY
and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE
The format of the clustered index (PRIMARY KEY) is changed as follows:
(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT,
and a new field PAGE_INSTANT will contain the original number of fields
in the clustered index ('core' fields).
If instant ADD COLUMN has not been used or the table becomes empty,
or the very first instant ADD COLUMN operation is rolled back,
the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset
to 0 and FIL_PAGE_INDEX.
(2) A special 'default row' record is inserted into the leftmost leaf,
between the page infimum and the first user record. This record is
distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the
same format as records that contain values for the instantly added
columns. This 'default row' always has the same number of fields as
the clustered index according to the table definition. The values of
'core' fields are to be ignored. For other fields, the 'default row'
will contain the default values as they were during the ALTER TABLE
statement. (If the column default values are changed later, those
values will only be stored in the .frm file. The 'default row' will
contain the original evaluated values, which must be the same for
every row.) The 'default row' must be completely hidden from
higher-level access routines. Assertions have been added to ensure
that no 'default row' is ever present in the adaptive hash index
or in locked records. The 'default row' is never delete-marked.
(3) In clustered index leaf page records, the number of fields must
reside between the number of 'core' fields (dict_index_t::n_core_fields
introduced in this work) and dict_index_t::n_fields. If the number
of fields is less than dict_index_t::n_fields, the missing fields
are replaced with the column value of the 'default row'.
Note: The number of fields in the record may shrink if some of the
last instantly added columns are updated to the value that is
in the 'default row'. The function btr_cur_trim() implements this
'compression' on update and rollback; dtuple::trim() implements it
on insert.
(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new
status value REC_STATUS_COLUMNS_ADDED will indicate the presence of
a new record header that will encode n_fields-n_core_fields-1 in
1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header
always explicitly encodes the number of fields.)
We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for
covering the insert of the 'default row' record when instant ADD COLUMN
is used for the first time. Subsequent instant ADD COLUMN can use
TRX_UNDO_UPD_EXIST_REC.
This is joint work with Vin Chen (陈福荣) from Tencent. The design
that was discussed in April 2017 would not have allowed import or
export of data files, because instead of the 'default row' it would
have introduced a data dictionary table. The test
rpl.rpl_alter_instant is exactly as contributed in pull request #408.
The test innodb.instant_alter is based on a contributed test.
The redo log record format changes for ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPACT are as contributed. (With this change present,
crash recovery from MariaDB 10.3.1 will fail in spectacular ways!)
Also the semantics of higher-level redo log records that modify the
PAGE_INSTANT field is changed. The redo log format version identifier
was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.
Everything else has been rewritten by me. Thanks to Elena Stepanova,
the code has been tested extensively.
When rolling back an instant ADD COLUMN operation, we must empty the
PAGE_FREE list after deleting or shortening the 'default row' record,
by calling either btr_page_empty() or btr_page_reorganize(). We must
know the size of each entry in the PAGE_FREE list. If rollback left a
freed copy of the 'default row' in the PAGE_FREE list, we would be
unable to determine its size (if it is in ROW_FORMAT=COMPACT or
ROW_FORMAT=DYNAMIC) because it would contain more fields than the
rolled-back definition of the clustered index.
UNIV_SQL_DEFAULT: A new special constant that designates an instantly
added column that is not present in the clustered index record.
len_is_stored(): Check if a length is an actual length. There are
two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
dict_col_t::def_val: The 'default row' value of the column. If the
column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(),
instant_value().
dict_col_t::remove_instant(): Remove the 'instant ADD' status of
a column.
dict_col_t::name(const dict_table_t& table): Replaces
dict_table_get_col_name().
dict_index_t::n_core_fields: The original number of fields.
For secondary indexes and if instant ADD COLUMN has not been used,
this will be equal to dict_index_t::n_fields.
dict_index_t::n_core_null_bytes: Number of bytes needed to
represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that
n_core_null_bytes was not initialized yet from the clustered index
root page.
dict_index_t: Add the accessors is_instant(), is_clust(),
get_n_nullable(), instant_field_value().
dict_index_t::instant_add_field(): Adjust clustered index metadata
for instant ADD COLUMN.
dict_index_t::remove_instant(): Remove the 'instant ADD' status
of a clustered index when the table becomes empty, or the very first
instant ADD COLUMN operation is rolled back.
dict_table_t: Add the accessors is_instant(), is_temporary(),
supports_instant().
dict_table_t::instant_add_column(): Adjust metadata for
instant ADD COLUMN.
dict_table_t::rollback_instant(): Adjust metadata on the rollback
of instant ADD COLUMN.
prepare_inplace_alter_table_dict(): First create the ctx->new_table,
and only then decide if the table really needs to be rebuilt.
We must split the creation of table or index metadata from the
creation of the dictionary table records and the creation of
the data. In this way, we can transform a table-rebuilding operation
into an instant ADD COLUMN operation. Dictionary objects will only
be added to cache when table rebuilding or index creation is needed.
The ctx->instant_table will never be added to cache.
dict_table_t::add_to_cache(): Modified and renamed from
dict_table_add_to_cache(). Do not modify the table metadata.
Let the callers invoke dict_table_add_system_columns() and if needed,
set can_be_evicted.
dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the
system columns (which will now exist in the dict_table_t object
already at this point).
dict_create_table_step(): Expect the callers to invoke
dict_table_add_system_columns().
pars_create_table(): Before creating the table creation execution
graph, invoke dict_table_add_system_columns().
row_create_table_for_mysql(): Expect all callers to invoke
dict_table_add_system_columns().
create_index_dict(): Replaces row_merge_create_index_graph().
innodb_update_n_cols(): Renamed from innobase_update_n_virtual().
Call my_error() if an error occurs.
btr_cur_instant_init(), btr_cur_instant_init_low(),
btr_cur_instant_root_init():
Load additional metadata from the clustered index and set
dict_index_t::n_core_null_bytes. This is invoked
when table metadata is first loaded into the data dictionary.
dict_boot(): Initialize n_core_null_bytes for the four hard-coded
dictionary tables.
dict_create_index_step(): Initialize n_core_null_bytes. This is
executed as part of CREATE TABLE.
dict_index_build_internal_clust(): Initialize n_core_null_bytes to
NO_CORE_NULL_BYTES if table->supports_instant().
row_create_index_for_mysql(): Initialize n_core_null_bytes for
CREATE TEMPORARY TABLE.
commit_cache_norebuild(): Call the code to rename or enlarge columns
in the cache only if instant ADD COLUMN is not being used.
(Instant ADD COLUMN would copy all column metadata from
instant_table to old_table, including the names and lengths.)
PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields.
This is repurposing the 16-bit field PAGE_DIRECTION, of which only the
least significant 3 bits were used. The original byte containing
PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.
page_ptr_get_direction(), page_get_direction(),
page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
page_direction_increment(): Increment PAGE_N_DIRECTION
and set PAGE_DIRECTION.
rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes,
and assume that heap_no is always set.
Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records,
even if the record contains fewer fields.
rec_offs_make_valid(): Add the parameter 'leaf'.
rec_copy_prefix_to_dtuple(): Assert that the tuple is only built
on the core fields. Instant ADD COLUMN only applies to the
clustered index, and we should never build a search key that has
more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR.
All these columns are always present.
dict_index_build_data_tuple(): Remove assertions that would be
duplicated in rec_copy_prefix_to_dtuple().
rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose
number of fields is between n_core_fields and n_fields.
cmp_rec_rec_with_match(): Implement the comparison between two
MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance.
A dictionary transaction must be flagged as such before it has generated
any undo log records. This is because trx_undo_assign_undo() will mark
the transaction as a dictionary transaction in the undo log header
right before the very first undo log record is being written.
btr_index_rec_validate(): Account for instant ADD COLUMN
row_undo_ins_remove_clust_rec(): On the rollback of an insert into
SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
trx_undo_update_rec_get_update(): Handle the 'default row'
as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
before insert or update. After instant ADD COLUMN, if the last fields
of a clustered index tuple match the 'default row', there is no
need to store them. While trimming the entry, we must hold a page latch,
so that the table cannot be emptied and the 'default row' be deleted.
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
of fields to be between n_core_fields and n_fields. Do not support
infimum,supremum. They are never supposed to be stored in dtuple_t,
because page creation nowadays uses a lower-level method for initializing
them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the
'default row', handle rollback specially. For user records, omit
fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Skip locking and adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
In the temporary file that is applied by row_log_table_apply(),
we must identify whether the records contain the extra header for
instantly added columns. For now, we will allocate an additional byte
for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
fine, as they will be converted and will only contain 'core' columns
(PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(),
rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
An info_bits constant for distinguishing the 'default row' record.
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of
rec_init_offsets_comp_ordinary().
2017-10-06 07:00:05 +03:00
|
|
|
"Page address %p, n records %u (%s)\n"
|
|
|
|
"n dir slots %u, heap top %u\n"
|
|
|
|
"Page n heap %u, free %u, garbage %u\n"
|
|
|
|
"Page last insert %u, direction %u, n direction %u\n",
|
|
|
|
page, page_header_get_field(page, PAGE_N_RECS),
|
2014-02-26 19:11:54 +01:00
|
|
|
page_is_comp(page) ? "compact format" : "original format",
|
MDEV-11369 Instant ADD COLUMN for InnoDB
For InnoDB tables, adding, dropping and reordering columns has
required a rebuild of the table and all its indexes. Since MySQL 5.6
(and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing
concurrent modification of the tables.
This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT
and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously,
with only minor changes performed to the table structure. The counter
innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS
is incremented whenever a table rebuild operation is converted into
an instant ADD COLUMN operation.
ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.
Some usability limitations will be addressed in subsequent work:
MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY
and ALGORITHM=INSTANT
MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE
The format of the clustered index (PRIMARY KEY) is changed as follows:
(1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT,
and a new field PAGE_INSTANT will contain the original number of fields
in the clustered index ('core' fields).
If instant ADD COLUMN has not been used or the table becomes empty,
or the very first instant ADD COLUMN operation is rolled back,
the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset
to 0 and FIL_PAGE_INDEX.
(2) A special 'default row' record is inserted into the leftmost leaf,
between the page infimum and the first user record. This record is
distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the
same format as records that contain values for the instantly added
columns. This 'default row' always has the same number of fields as
the clustered index according to the table definition. The values of
'core' fields are to be ignored. For other fields, the 'default row'
will contain the default values as they were during the ALTER TABLE
statement. (If the column default values are changed later, those
values will only be stored in the .frm file. The 'default row' will
contain the original evaluated values, which must be the same for
every row.) The 'default row' must be completely hidden from
higher-level access routines. Assertions have been added to ensure
that no 'default row' is ever present in the adaptive hash index
or in locked records. The 'default row' is never delete-marked.
(3) In clustered index leaf page records, the number of fields must
reside between the number of 'core' fields (dict_index_t::n_core_fields
introduced in this work) and dict_index_t::n_fields. If the number
of fields is less than dict_index_t::n_fields, the missing fields
are replaced with the column value of the 'default row'.
Note: The number of fields in the record may shrink if some of the
last instantly added columns are updated to the value that is
in the 'default row'. The function btr_cur_trim() implements this
'compression' on update and rollback; dtuple::trim() implements it
on insert.
(4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new
status value REC_STATUS_COLUMNS_ADDED will indicate the presence of
a new record header that will encode n_fields-n_core_fields-1 in
1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header
always explicitly encodes the number of fields.)
We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for
covering the insert of the 'default row' record when instant ADD COLUMN
is used for the first time. Subsequent instant ADD COLUMN can use
TRX_UNDO_UPD_EXIST_REC.
This is joint work with Vin Chen (陈福荣) from Tencent. The design
that was discussed in April 2017 would not have allowed import or
export of data files, because instead of the 'default row' it would
have introduced a data dictionary table. The test
rpl.rpl_alter_instant is exactly as contributed in pull request #408.
The test innodb.instant_alter is based on a contributed test.
The redo log record format changes for ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPACT are as contributed. (With this change present,
crash recovery from MariaDB 10.3.1 will fail in spectacular ways!)
Also the semantics of higher-level redo log records that modify the
PAGE_INSTANT field is changed. The redo log format version identifier
was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.
Everything else has been rewritten by me. Thanks to Elena Stepanova,
the code has been tested extensively.
When rolling back an instant ADD COLUMN operation, we must empty the
PAGE_FREE list after deleting or shortening the 'default row' record,
by calling either btr_page_empty() or btr_page_reorganize(). We must
know the size of each entry in the PAGE_FREE list. If rollback left a
freed copy of the 'default row' in the PAGE_FREE list, we would be
unable to determine its size (if it is in ROW_FORMAT=COMPACT or
ROW_FORMAT=DYNAMIC) because it would contain more fields than the
rolled-back definition of the clustered index.
UNIV_SQL_DEFAULT: A new special constant that designates an instantly
added column that is not present in the clustered index record.
len_is_stored(): Check if a length is an actual length. There are
two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
dict_col_t::def_val: The 'default row' value of the column. If the
column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(),
instant_value().
dict_col_t::remove_instant(): Remove the 'instant ADD' status of
a column.
dict_col_t::name(const dict_table_t& table): Replaces
dict_table_get_col_name().
dict_index_t::n_core_fields: The original number of fields.
For secondary indexes and if instant ADD COLUMN has not been used,
this will be equal to dict_index_t::n_fields.
dict_index_t::n_core_null_bytes: Number of bytes needed to
represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that
n_core_null_bytes was not initialized yet from the clustered index
root page.
dict_index_t: Add the accessors is_instant(), is_clust(),
get_n_nullable(), instant_field_value().
dict_index_t::instant_add_field(): Adjust clustered index metadata
for instant ADD COLUMN.
dict_index_t::remove_instant(): Remove the 'instant ADD' status
of a clustered index when the table becomes empty, or the very first
instant ADD COLUMN operation is rolled back.
dict_table_t: Add the accessors is_instant(), is_temporary(),
supports_instant().
dict_table_t::instant_add_column(): Adjust metadata for
instant ADD COLUMN.
dict_table_t::rollback_instant(): Adjust metadata on the rollback
of instant ADD COLUMN.
prepare_inplace_alter_table_dict(): First create the ctx->new_table,
and only then decide if the table really needs to be rebuilt.
We must split the creation of table or index metadata from the
creation of the dictionary table records and the creation of
the data. In this way, we can transform a table-rebuilding operation
into an instant ADD COLUMN operation. Dictionary objects will only
be added to cache when table rebuilding or index creation is needed.
The ctx->instant_table will never be added to cache.
dict_table_t::add_to_cache(): Modified and renamed from
dict_table_add_to_cache(). Do not modify the table metadata.
Let the callers invoke dict_table_add_system_columns() and if needed,
set can_be_evicted.
dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the
system columns (which will now exist in the dict_table_t object
already at this point).
dict_create_table_step(): Expect the callers to invoke
dict_table_add_system_columns().
pars_create_table(): Before creating the table creation execution
graph, invoke dict_table_add_system_columns().
row_create_table_for_mysql(): Expect all callers to invoke
dict_table_add_system_columns().
create_index_dict(): Replaces row_merge_create_index_graph().
innodb_update_n_cols(): Renamed from innobase_update_n_virtual().
Call my_error() if an error occurs.
btr_cur_instant_init(), btr_cur_instant_init_low(),
btr_cur_instant_root_init():
Load additional metadata from the clustered index and set
dict_index_t::n_core_null_bytes. This is invoked
when table metadata is first loaded into the data dictionary.
dict_boot(): Initialize n_core_null_bytes for the four hard-coded
dictionary tables.
dict_create_index_step(): Initialize n_core_null_bytes. This is
executed as part of CREATE TABLE.
dict_index_build_internal_clust(): Initialize n_core_null_bytes to
NO_CORE_NULL_BYTES if table->supports_instant().
row_create_index_for_mysql(): Initialize n_core_null_bytes for
CREATE TEMPORARY TABLE.
commit_cache_norebuild(): Call the code to rename or enlarge columns
in the cache only if instant ADD COLUMN is not being used.
(Instant ADD COLUMN would copy all column metadata from
instant_table to old_table, including the names and lengths.)
PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields.
This is repurposing the 16-bit field PAGE_DIRECTION, of which only the
least significant 3 bits were used. The original byte containing
PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.
page_ptr_get_direction(), page_get_direction(),
page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
page_direction_increment(): Increment PAGE_N_DIRECTION
and set PAGE_DIRECTION.
rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes,
and assume that heap_no is always set.
Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records,
even if the record contains fewer fields.
rec_offs_make_valid(): Add the parameter 'leaf'.
rec_copy_prefix_to_dtuple(): Assert that the tuple is only built
on the core fields. Instant ADD COLUMN only applies to the
clustered index, and we should never build a search key that has
more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR.
All these columns are always present.
dict_index_build_data_tuple(): Remove assertions that would be
duplicated in rec_copy_prefix_to_dtuple().
rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose
number of fields is between n_core_fields and n_fields.
cmp_rec_rec_with_match(): Implement the comparison between two
MIN_REC_FLAG records.
trx_t::in_rollback: Make the field available in non-debug builds.
trx_start_for_ddl_low(): Remove dangerous error-tolerance.
A dictionary transaction must be flagged as such before it has generated
any undo log records. This is because trx_undo_assign_undo() will mark
the transaction as a dictionary transaction in the undo log header
right before the very first undo log record is being written.
btr_index_rec_validate(): Account for instant ADD COLUMN
row_undo_ins_remove_clust_rec(): On the rollback of an insert into
SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
last column from the table and the clustered index.
row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
trx_undo_update_rec_get_update(): Handle the 'default row'
as a special case.
dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
before insert or update. After instant ADD COLUMN, if the last fields
of a clustered index tuple match the 'default row', there is no
need to store them. While trimming the entry, we must hold a page latch,
so that the table cannot be emptied and the 'default row' be deleted.
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
Invoke dtuple_t::trim() if needed.
row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
row_ins_clust_index_entry_low().
rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
of fields to be between n_core_fields and n_fields. Do not support
infimum,supremum. They are never supposed to be stored in dtuple_t,
because page creation nowadays uses a lower-level method for initializing
them.
rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
number of fields.
btr_cur_trim(): In an update, trim the index entry as needed. For the
'default row', handle rollback specially. For user records, omit
fields that match the 'default row'.
btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
Skip locking and adaptive hash index for the 'default row'.
row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
In the temporary file that is applied by row_log_table_apply(),
we must identify whether the records contain the extra header for
instantly added columns. For now, we will allocate an additional byte
for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
fine, as they will be converted and will only contain 'core' columns
(PRIMARY KEY and some system columns) that are converted from dtuple_t.
rec_get_converted_size_temp(), rec_init_offsets_temp(),
rec_convert_dtuple_to_temp(): Add the parameter 'status'.
REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
An info_bits constant for distinguishing the 'default row' record.
rec_comp_status_t: An enum of the status bit values.
rec_leaf_format: An enum that replaces the bool parameter of
rec_init_offsets_comp_ordinary().
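For illustration only (not part of the original change), the new accessors could
be sketched roughly as follows, assuming that PAGE_INSTANT names the 16-bit
header field that formerly held PAGE_DIRECTION and that the insert direction
stays in its 3 least significant bits:

	/* Sketch: read dict_index_t::n_core_fields from the upper 13 bits. */
	inline ulint page_get_instant_sketch(const page_t* page)
	{
		return page_header_get_field(page, PAGE_INSTANT) >> 3;
	}

	/* Sketch: the insert direction remains in the low 3 bits. */
	inline ulint page_get_direction_sketch(const page_t* page)
	{
		return page_header_get_field(page, PAGE_INSTANT) & 7;
	}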
2017-10-06 07:00:05 +03:00
|
|
|
page_header_get_field(page, PAGE_N_DIR_SLOTS),
|
|
|
|
page_header_get_field(page, PAGE_HEAP_TOP),
|
|
|
|
page_dir_get_n_heap(page),
|
|
|
|
page_header_get_field(page, PAGE_FREE),
|
|
|
|
page_header_get_field(page, PAGE_GARBAGE),
|
|
|
|
page_header_get_field(page, PAGE_LAST_INSERT),
|
|
|
|
page_get_direction(page),
|
|
|
|
page_header_get_field(page, PAGE_N_DIRECTION));
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/***************************************************************//**
|
|
|
|
This is used to print the contents of the page for
|
|
|
|
debugging purposes. */
|
|
|
|
void
|
|
|
|
page_print(
|
|
|
|
/*=======*/
|
|
|
|
buf_block_t* block, /*!< in: index page */
|
|
|
|
dict_index_t* index, /*!< in: dictionary index of the page */
|
|
|
|
ulint dn, /*!< in: print dn first and last entries
|
|
|
|
in directory */
|
|
|
|
ulint rn) /*!< in: print rn first and last records
|
|
|
|
in the record list */
|
|
|
|
{
|
|
|
|
page_t* page = block->frame;
|
|
|
|
|
|
|
|
page_header_print(page);
|
|
|
|
page_dir_print(page, dn);
|
|
|
|
page_print_list(block, index, rn);
|
|
|
|
}
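/* Illustrative usage (not part of the original file): from a debugging
session one could dump a page, printing for example 2 directory entries and
5 records from each end, assuming a buf_block_t* block and its index:

	page_print(block, index, 2, 5);
*/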
|
2016-12-30 15:04:10 +02:00
|
|
|
#endif /* UNIV_BTR_PRINT */
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
/***************************************************************//**
|
|
|
|
The following is used to validate a record on a page. This function
|
|
|
|
differs from rec_validate as it can also check the n_owned field and
|
|
|
|
the heap_no field.
|
2016-08-12 11:17:45 +03:00
|
|
|
@return TRUE if ok */
|
2014-02-26 19:11:54 +01:00
|
|
|
ibool
|
|
|
|
page_rec_validate(
|
|
|
|
/*==============*/
|
|
|
|
const rec_t* rec, /*!< in: physical record */
|
2020-04-28 10:46:51 +10:00
|
|
|
const rec_offs* offsets)/*!< in: array returned by rec_get_offsets() */
|
2014-02-26 19:11:54 +01:00
|
|
|
{
|
|
|
|
ulint n_owned;
|
|
|
|
ulint heap_no;
|
|
|
|
const page_t* page;
|
|
|
|
|
|
|
|
page = page_align(rec);
|
|
|
|
ut_a(!page_is_comp(page) == !rec_offs_comp(offsets));
|
|
|
|
|
|
|
|
page_rec_check(rec);
|
|
|
|
rec_validate(rec, offsets);
|
|
|
|
|
|
|
|
if (page_rec_is_comp(rec)) {
|
|
|
|
n_owned = rec_get_n_owned_new(rec);
|
|
|
|
heap_no = rec_get_heap_no_new(rec);
|
|
|
|
} else {
|
|
|
|
n_owned = rec_get_n_owned_old(rec);
|
|
|
|
heap_no = rec_get_heap_no_old(rec);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(!(n_owned <= PAGE_DIR_SLOT_MAX_N_OWNED))) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::warn() << "Dir slot of rec " << page_offset(rec)
|
|
|
|
<< ", n owned too big " << n_owned;
|
2014-02-26 19:11:54 +01:00
|
|
|
return(FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(!(heap_no < page_dir_get_n_heap(page)))) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::warn() << "Heap no of rec " << page_offset(rec)
|
|
|
|
<< " too big " << heap_no << " "
|
|
|
|
<< page_dir_get_n_heap(page);
|
2014-02-26 19:11:54 +01:00
|
|
|
return(FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
return(TRUE);
|
|
|
|
}
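/* Illustrative usage (not part of the original file): page_validate()
below checks each record roughly like this, assuming 'heap', 'offsets'
and 'n_core' are set up as in that function:

	offsets = rec_get_offsets(rec, index, offsets, n_core,
				  ULINT_UNDEFINED, &heap);
	if (UNIV_UNLIKELY(!page_rec_validate(rec, offsets))) {
		ret = FALSE;
	}
*/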
|
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
#ifdef UNIV_DEBUG
|
2014-02-26 19:11:54 +01:00
|
|
|
/***************************************************************//**
|
|
|
|
Checks that the first directory slot points to the infimum record and
|
|
|
|
the last to the supremum. This function is intended to track if the
|
|
|
|
bug fixed in 4.0.14 has caused corruption to users' databases. */
|
|
|
|
void
|
|
|
|
page_check_dir(
|
|
|
|
/*===========*/
|
|
|
|
const page_t* page) /*!< in: index page */
|
|
|
|
{
|
|
|
|
ulint n_slots;
|
|
|
|
ulint infimum_offs;
|
|
|
|
ulint supremum_offs;
|
|
|
|
|
|
|
|
n_slots = page_dir_get_n_slots(page);
|
|
|
|
infimum_offs = mach_read_from_2(page_dir_get_nth_slot(page, 0));
|
|
|
|
supremum_offs = mach_read_from_2(page_dir_get_nth_slot(page,
|
|
|
|
n_slots - 1));
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(!page_rec_is_infimum_low(infimum_offs))) {
|
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::fatal() << "Page directory corruption: infimum not"
|
|
|
|
" pointed to";
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(!page_rec_is_supremum_low(supremum_offs))) {
|
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::fatal() << "Page directory corruption: supremum not"
|
|
|
|
" pointed to";
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
}
|
2016-08-12 11:17:45 +03:00
|
|
|
#endif /* UNIV_DEBUG */
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
/***************************************************************//**
|
|
|
|
This function checks the consistency of an index page when we do not
|
|
|
|
know the index. It is also resilient, so it should never crash
|
|
|
|
even if the page is total garbage.
|
2016-08-12 11:17:45 +03:00
|
|
|
@return TRUE if ok */
|
2014-02-26 19:11:54 +01:00
|
|
|
ibool
|
|
|
|
page_simple_validate_old(
|
|
|
|
/*=====================*/
|
|
|
|
const page_t* page) /*!< in: index page in ROW_FORMAT=REDUNDANT */
|
|
|
|
{
|
|
|
|
const page_dir_slot_t* slot;
|
|
|
|
ulint slot_no;
|
|
|
|
ulint n_slots;
|
|
|
|
const rec_t* rec;
|
|
|
|
const byte* rec_heap_top;
|
|
|
|
ulint count;
|
|
|
|
ulint own_count;
|
|
|
|
ibool ret = FALSE;
|
|
|
|
|
|
|
|
ut_a(!page_is_comp(page));
|
|
|
|
|
|
|
|
/* Check first that the record heap and the directory do not
|
|
|
|
overlap. */
|
|
|
|
|
|
|
|
n_slots = page_dir_get_n_slots(page);
|
|
|
|
|
2020-07-16 06:55:23 +03:00
|
|
|
if (UNIV_UNLIKELY(n_slots < 2 || n_slots > srv_page_size / 4)) {
|
2020-07-15 19:26:20 +03:00
|
|
|
ib::error() << "Nonsensical number of page dir slots: "
|
|
|
|
<< n_slots;
|
2014-02-26 19:11:54 +01:00
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
rec_heap_top = page_header_get_ptr(page, PAGE_HEAP_TOP);
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(rec_heap_top
|
|
|
|
> page_dir_get_nth_slot(page, n_slots - 1))) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error()
|
|
|
|
<< "Record heap and dir overlap on a page, heap top "
|
|
|
|
<< page_header_get_field(page, PAGE_HEAP_TOP)
|
|
|
|
<< ", dir "
|
|
|
|
<< page_offset(page_dir_get_nth_slot(page,
|
|
|
|
n_slots - 1));
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Validate the record list in a loop checking also that it is
|
|
|
|
consistent with the page record directory. */
|
|
|
|
|
|
|
|
count = 0;
|
|
|
|
own_count = 1;
|
|
|
|
slot_no = 0;
|
|
|
|
slot = page_dir_get_nth_slot(page, slot_no);
|
|
|
|
|
|
|
|
rec = page_get_infimum_rec(page);
|
|
|
|
|
|
|
|
for (;;) {
|
|
|
|
if (UNIV_UNLIKELY(rec > rec_heap_top)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Record " << (rec - page)
|
|
|
|
<< " is above rec heap top "
|
|
|
|
<< (rec_heap_top - page);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
2018-04-28 15:49:09 +03:00
|
|
|
if (UNIV_UNLIKELY(rec_get_n_owned_old(rec) != 0)) {
|
2014-02-26 19:11:54 +01:00
|
|
|
/* This is a record pointed to by a dir slot */
|
|
|
|
if (UNIV_UNLIKELY(rec_get_n_owned_old(rec)
|
|
|
|
!= own_count)) {
|
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Wrong owned count "
|
|
|
|
<< rec_get_n_owned_old(rec)
|
|
|
|
<< ", " << own_count << ", rec "
|
|
|
|
<< (rec - page);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY
|
|
|
|
(page_dir_slot_get_rec(slot) != rec)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Dir slot does not point"
|
|
|
|
" to right rec " << (rec - page);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
own_count = 0;
|
|
|
|
|
|
|
|
if (!page_rec_is_supremum(rec)) {
|
|
|
|
slot_no++;
|
|
|
|
slot = page_dir_get_nth_slot(page, slot_no);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (page_rec_is_supremum(rec)) {
|
|
|
|
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY
|
|
|
|
(rec_get_next_offs(rec, FALSE) < FIL_PAGE_DATA
|
2018-04-27 13:49:25 +03:00
|
|
|
|| rec_get_next_offs(rec, FALSE) >= srv_page_size)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
|
|
|
|
ib::error() << "Next record offset nonsensical "
|
|
|
|
<< rec_get_next_offs(rec, FALSE) << " for rec "
|
|
|
|
<< (rec - page);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
count++;
|
|
|
|
|
2018-04-27 13:49:25 +03:00
|
|
|
if (UNIV_UNLIKELY(count > srv_page_size)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Page record list appears"
|
|
|
|
" to be circular " << count;
|
2014-02-26 19:11:54 +01:00
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
rec = page_rec_get_next_const(rec);
|
|
|
|
own_count++;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(rec_get_n_owned_old(rec) == 0)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "n owned is zero in a supremum rec";
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(slot_no != n_slots - 1)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "n slots wrong "
|
|
|
|
<< slot_no << ", " << (n_slots - 1);
|
2014-02-26 19:11:54 +01:00
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
2017-09-14 08:00:28 +03:00
|
|
|
if (UNIV_UNLIKELY(ulint(page_header_get_field(page, PAGE_N_RECS))
|
2014-02-26 19:11:54 +01:00
|
|
|
+ PAGE_HEAP_NO_USER_LOW
|
|
|
|
!= count + 1)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "n recs wrong "
|
|
|
|
<< page_header_get_field(page, PAGE_N_RECS)
|
|
|
|
+ PAGE_HEAP_NO_USER_LOW << " " << (count + 1);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Check then the free list */
|
|
|
|
rec = page_header_get_ptr(page, PAGE_FREE);
|
|
|
|
|
|
|
|
while (rec != NULL) {
|
|
|
|
if (UNIV_UNLIKELY(rec < page + FIL_PAGE_DATA
|
2018-04-27 13:49:25 +03:00
|
|
|
|| rec >= page + srv_page_size)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Free list record has"
|
|
|
|
" a nonsensical offset " << (rec - page);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(rec > rec_heap_top)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Free list record " << (rec - page)
|
|
|
|
<< " is above rec heap top "
|
|
|
|
<< (rec_heap_top - page);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
count++;
|
|
|
|
|
2018-04-27 13:49:25 +03:00
|
|
|
if (UNIV_UNLIKELY(count > srv_page_size)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Page free list appears"
|
|
|
|
" to be circular " << count;
|
2014-02-26 19:11:54 +01:00
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
2019-10-10 20:29:30 +03:00
|
|
|
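/* Traverse the PAGE_FREE list via the raw next-record offset: the list
may contain arbitrary garbage records, so only their successor and size
can be relied upon, and flags such as REC_INFO_MIN_REC_FLAG are not
validated here (see MDEV-20788). */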
ulint offs = rec_get_next_offs(rec, FALSE);
|
|
|
|
if (!offs) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (UNIV_UNLIKELY(offs < PAGE_OLD_INFIMUM
|
|
|
|
|| offs >= srv_page_size)) {
|
|
|
|
ib::error() << "Page free list is corrupted " << count;
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
rec = page + offs;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(page_dir_get_n_heap(page) != count + 1)) {
|
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "N heap is wrong "
|
|
|
|
<< page_dir_get_n_heap(page) << ", " << (count + 1);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = TRUE;
|
|
|
|
|
|
|
|
func_exit:
|
|
|
|
return(ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
/***************************************************************//**
|
|
|
|
This function checks the consistency of an index page when we do not
|
|
|
|
know the index. It is also resilient, so it should never crash
|
|
|
|
even if the page is total garbage.
|
2016-08-12 11:17:45 +03:00
|
|
|
@return TRUE if ok */
|
2014-02-26 19:11:54 +01:00
|
|
|
ibool
|
|
|
|
page_simple_validate_new(
|
|
|
|
/*=====================*/
|
|
|
|
const page_t* page) /*!< in: index page in ROW_FORMAT!=REDUNDANT */
|
|
|
|
{
|
|
|
|
const page_dir_slot_t* slot;
|
|
|
|
ulint slot_no;
|
|
|
|
ulint n_slots;
|
|
|
|
const rec_t* rec;
|
|
|
|
const byte* rec_heap_top;
|
|
|
|
ulint count;
|
|
|
|
ulint own_count;
|
|
|
|
ibool ret = FALSE;
|
|
|
|
|
|
|
|
ut_a(page_is_comp(page));
|
|
|
|
|
|
|
|
/* Check first that the record heap and the directory do not
|
|
|
|
overlap. */
|
|
|
|
|
|
|
|
n_slots = page_dir_get_n_slots(page);
|
|
|
|
|
2020-07-15 19:26:20 +03:00
|
|
|
if (UNIV_UNLIKELY(n_slots < 2 || n_slots > srv_page_size / 4)) {
|
|
|
|
ib::error() << "Nonsensical number of page dir slots: "
|
|
|
|
<< n_slots;
|
2014-02-26 19:11:54 +01:00
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
rec_heap_top = page_header_get_ptr(page, PAGE_HEAP_TOP);
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(rec_heap_top
|
|
|
|
> page_dir_get_nth_slot(page, n_slots - 1))) {
|
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Record heap and dir overlap on a page,"
|
|
|
|
" heap top "
|
|
|
|
<< page_header_get_field(page, PAGE_HEAP_TOP)
|
|
|
|
<< ", dir " << page_offset(
|
|
|
|
page_dir_get_nth_slot(page, n_slots - 1));
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Validate the record list in a loop checking also that it is
|
|
|
|
consistent with the page record directory. */
|
|
|
|
|
|
|
|
count = 0;
|
|
|
|
own_count = 1;
|
|
|
|
slot_no = 0;
|
|
|
|
slot = page_dir_get_nth_slot(page, slot_no);
|
|
|
|
|
|
|
|
rec = page_get_infimum_rec(page);
|
|
|
|
|
|
|
|
for (;;) {
|
|
|
|
if (UNIV_UNLIKELY(rec > rec_heap_top)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
|
|
|
|
ib::error() << "Record " << page_offset(rec)
|
|
|
|
<< " is above rec heap top "
|
|
|
|
<< page_offset(rec_heap_top);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
2018-04-28 15:49:09 +03:00
|
|
|
if (UNIV_UNLIKELY(rec_get_n_owned_new(rec) != 0)) {
|
2014-02-26 19:11:54 +01:00
|
|
|
/* This is a record pointed to by a dir slot */
|
|
|
|
if (UNIV_UNLIKELY(rec_get_n_owned_new(rec)
|
|
|
|
!= own_count)) {
|
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Wrong owned count "
|
|
|
|
<< rec_get_n_owned_new(rec) << ", "
|
|
|
|
<< own_count << ", rec "
|
|
|
|
<< page_offset(rec);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY
|
|
|
|
(page_dir_slot_get_rec(slot) != rec)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Dir slot does not point"
|
|
|
|
" to right rec " << page_offset(rec);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
own_count = 0;
|
|
|
|
|
|
|
|
if (!page_rec_is_supremum(rec)) {
|
|
|
|
slot_no++;
|
|
|
|
slot = page_dir_get_nth_slot(page, slot_no);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (page_rec_is_supremum(rec)) {
|
|
|
|
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY
|
|
|
|
(rec_get_next_offs(rec, TRUE) < FIL_PAGE_DATA
|
2018-04-27 13:49:25 +03:00
|
|
|
|| rec_get_next_offs(rec, TRUE) >= srv_page_size)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
|
|
|
|
ib::error() << "Next record offset nonsensical "
|
|
|
|
<< rec_get_next_offs(rec, TRUE)
|
|
|
|
<< " for rec " << page_offset(rec);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
count++;
|
|
|
|
|
2018-04-27 13:49:25 +03:00
|
|
|
if (UNIV_UNLIKELY(count > srv_page_size)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Page record list appears to be"
|
|
|
|
" circular " << count;
|
2014-02-26 19:11:54 +01:00
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
rec = page_rec_get_next_const(rec);
|
|
|
|
own_count++;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(rec_get_n_owned_new(rec) == 0)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "n owned is zero in a supremum rec";
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(slot_no != n_slots - 1)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "n slots wrong " << slot_no << ", "
|
|
|
|
<< (n_slots - 1);
|
2014-02-26 19:11:54 +01:00
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
2017-09-14 08:00:28 +03:00
|
|
|
if (UNIV_UNLIKELY(ulint(page_header_get_field(page, PAGE_N_RECS))
|
2014-02-26 19:11:54 +01:00
|
|
|
+ PAGE_HEAP_NO_USER_LOW
|
|
|
|
!= count + 1)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "n recs wrong "
|
|
|
|
<< page_header_get_field(page, PAGE_N_RECS)
|
|
|
|
+ PAGE_HEAP_NO_USER_LOW << " " << (count + 1);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Check then the free list */
|
|
|
|
rec = page_header_get_ptr(page, PAGE_FREE);
|
|
|
|
|
|
|
|
while (rec != NULL) {
|
|
|
|
if (UNIV_UNLIKELY(rec < page + FIL_PAGE_DATA
|
2018-04-27 13:49:25 +03:00
|
|
|
|| rec >= page + srv_page_size)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
|
|
|
|
ib::error() << "Free list record has"
|
|
|
|
" a nonsensical offset " << page_offset(rec);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(rec > rec_heap_top)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Free list record " << page_offset(rec)
|
|
|
|
<< " is above rec heap top "
|
|
|
|
<< page_offset(rec_heap_top);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
count++;
|
|
|
|
|
2018-04-27 13:49:25 +03:00
|
|
|
if (UNIV_UNLIKELY(count > srv_page_size)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Page free list appears to be"
|
|
|
|
" circular " << count;
|
2014-02-26 19:11:54 +01:00
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
2019-10-10 20:29:30 +03:00
|
|
|
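/* Follow the raw next-record offset; PAGE_FREE records may be garbage
(see the corresponding note in page_simple_validate_old()). */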
const ulint offs = rec_get_next_offs(rec, TRUE);
|
|
|
|
if (!offs) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (UNIV_UNLIKELY(offs < PAGE_OLD_INFIMUM
|
|
|
|
|| offs >= srv_page_size)) {
|
|
|
|
ib::error() << "Page free list is corrupted " << count;
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
rec = page + offs;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(page_dir_get_n_heap(page) != count + 1)) {
|
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "N heap is wrong "
|
|
|
|
<< page_dir_get_n_heap(page) << ", " << (count + 1);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
goto func_exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = TRUE;
|
|
|
|
|
|
|
|
func_exit:
|
|
|
|
return(ret);
|
|
|
|
}
|
|
|
|
|
2019-10-09 13:25:11 +03:00
|
|
|
/** Check the consistency of an index page.
|
|
|
|
@param[in] page index page
|
|
|
|
@param[in] index B-tree or R-tree index
|
|
|
|
@return whether the page is valid */
|
|
|
|
bool page_validate(const page_t* page, const dict_index_t* index)
|
2014-02-26 19:11:54 +01:00
|
|
|
{
|
|
|
|
const page_dir_slot_t* slot;
|
|
|
|
const rec_t* rec;
|
|
|
|
const rec_t* old_rec = NULL;
|
2019-09-30 20:58:50 +03:00
|
|
|
const rec_t* first_rec = NULL;
|
2020-06-05 17:45:27 +03:00
|
|
|
ulint offs = 0;
|
2014-02-26 19:11:54 +01:00
|
|
|
ulint n_slots;
|
2019-07-01 18:24:35 +03:00
|
|
|
ibool ret = TRUE;
|
2014-02-26 19:11:54 +01:00
|
|
|
ulint i;
|
2020-04-28 10:46:51 +10:00
|
|
|
rec_offs offsets_1[REC_OFFS_NORMAL_SIZE];
|
|
|
|
rec_offs offsets_2[REC_OFFS_NORMAL_SIZE];
|
|
|
|
rec_offs* offsets = offsets_1;
|
|
|
|
rec_offs* old_offsets = offsets_2;
|
2019-11-04 22:30:12 +03:00
|
|
|
|
|
|
|
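/* The offset arrays are allocated on the stack (REC_OFFS_NORMAL_SIZE
elements each) so that rec_get_offsets() normally does not need to
allocate from the heap (see MDEV-20950). */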
rec_offs_init(offsets_1);
|
|
|
|
rec_offs_init(offsets_2);
|
2014-02-26 19:11:54 +01:00
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
#ifdef UNIV_GIS_DEBUG
|
|
|
|
if (dict_index_is_spatial(index)) {
|
|
|
|
fprintf(stderr, "Page no: %lu\n", page_get_page_no(page));
|
|
|
|
}
|
|
|
|
#endif /* UNIV_GIS_DEBUG */
|
|
|
|
|
2014-02-26 19:11:54 +01:00
|
|
|
if (UNIV_UNLIKELY((ibool) !!page_is_comp(page)
|
|
|
|
!= dict_table_is_comp(index->table))) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "'compact format' flag mismatch";
|
2019-07-01 18:24:35 +03:00
|
|
|
func_exit2:
|
|
|
|
ib::error() << "Apparent corruption in space "
|
|
|
|
<< page_get_space_id(page) << " page "
|
|
|
|
<< page_get_page_no(page)
|
|
|
|
<< " of index " << index->name
|
|
|
|
<< " of table " << index->table->name;
|
|
|
|
return FALSE;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
2020-07-15 19:25:24 +03:00
|
|
|
|
2014-02-26 19:11:54 +01:00
|
|
|
if (page_is_comp(page)) {
|
|
|
|
if (UNIV_UNLIKELY(!page_simple_validate_new(page))) {
|
|
|
|
goto func_exit2;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
if (UNIV_UNLIKELY(!page_simple_validate_old(page))) {
|
|
|
|
goto func_exit2;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
/* Multiple transactions cannot operate on the
|
|
|
|
same temp-table in parallel.
|
|
|
|
max_trx_id is ignored for temp tables because it is not required
|
|
|
|
for MVCC. */
|
2018-01-31 10:24:19 +02:00
|
|
|
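/* With innodb_force_recovery >= 5 the system may recover with
trx_sys.get_max_trx_id() == 0; in that case the PAGE_MAX_TRX_ID
sanity check below is skipped (see MDEV-15132). */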
if (!page_is_leaf(page) || page_is_empty(page)
|
|
|
|
|| !dict_index_is_sec_or_ibuf(index)
|
|
|
|
|| index->table->is_temporary()) {
|
|
|
|
} else if (trx_id_t sys_max_trx_id = trx_sys.get_max_trx_id()) {
|
2014-02-26 19:11:54 +01:00
|
|
|
trx_id_t max_trx_id = page_get_max_trx_id(page);
|
|
|
|
|
|
|
|
if (max_trx_id == 0 || max_trx_id > sys_max_trx_id) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "PAGE_MAX_TRX_ID out of bounds: "
|
|
|
|
<< max_trx_id << ", " << sys_max_trx_id;
|
2019-07-01 18:24:35 +03:00
|
|
|
ret = FALSE;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
2018-01-31 10:24:19 +02:00
|
|
|
} else {
|
|
|
|
ut_ad(srv_force_recovery >= SRV_FORCE_NO_UNDO_LOG_SCAN);
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Check first that the record heap and the directory do not
|
|
|
|
overlap. */
|
|
|
|
|
|
|
|
n_slots = page_dir_get_n_slots(page);
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(!(page_header_get_ptr(page, PAGE_HEAP_TOP)
|
|
|
|
<= page_dir_get_nth_slot(page, n_slots - 1)))) {
|
|
|
|
|
2019-07-01 18:24:35 +03:00
|
|
|
ib::warn() << "Record heap and directory overlap";
|
|
|
|
goto func_exit2;
|
|
|
|
}
|
2014-02-26 19:11:54 +01:00
|
|
|
|
2019-07-01 18:24:35 +03:00
|
|
|
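/* Expected FIL_PAGE_TYPE values:
   FIL_PAGE_RTREE        only for spatial indexes;
   FIL_PAGE_TYPE_INSTANT only for the root page of an instant index;
   FIL_PAGE_INDEX        for everything else. */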
switch (uint16_t type = fil_page_get_type(page)) {
|
|
|
|
case FIL_PAGE_RTREE:
|
|
|
|
if (!index->is_spatial()) {
|
|
|
|
wrong_page_type:
|
|
|
|
ib::warn() << "Wrong page type " << type;
|
|
|
|
ret = FALSE;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case FIL_PAGE_TYPE_INSTANT:
|
|
|
|
if (index->is_instant()
|
|
|
|
&& page_get_page_no(page) == index->page) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
goto wrong_page_type;
|
|
|
|
case FIL_PAGE_INDEX:
|
|
|
|
if (index->is_spatial()) {
|
|
|
|
goto wrong_page_type;
|
|
|
|
}
|
|
|
|
if (index->is_instant()
|
|
|
|
&& page_get_page_no(page) == index->page) {
|
|
|
|
goto wrong_page_type;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
goto wrong_page_type;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
2019-07-01 18:24:35 +03:00
|
|
|
/* The following buffer is used to check that the
|
|
|
|
records in the page record heap do not overlap */
|
|
|
|
mem_heap_t* heap = mem_heap_create(srv_page_size + 200);
|
|
|
|
byte* buf = static_cast<byte*>(mem_heap_zalloc(heap, srv_page_size));
|
|
|
|
|
2014-02-26 19:11:54 +01:00
|
|
|
/* Validate the record list in a loop checking also that
|
|
|
|
it is consistent with the directory. */
|
2019-07-01 18:24:35 +03:00
|
|
|
ulint count = 0, data_size = 0, own_count = 1, slot_no = 0;
|
2020-08-04 10:13:35 +03:00
|
|
|
ulint info_bits;
|
2014-02-26 19:11:54 +01:00
|
|
|
slot_no = 0;
|
|
|
|
slot = page_dir_get_nth_slot(page, slot_no);
|
|
|
|
|
|
|
|
rec = page_get_infimum_rec(page);
|
|
|
|
|
2021-04-13 10:28:13 +03:00
|
|
|
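/* Instantly added columns can only be present in leaf-page records of
the clustered index; node pointer pages never contain them, so the
'core field' count is only needed for leaf pages. */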
const ulint n_core = page_is_leaf(page) ? index->n_core_fields : 0;
|
|
|
|
|
2014-02-26 19:11:54 +01:00
|
|
|
for (;;) {
|
2021-04-13 10:28:13 +03:00
|
|
|
offsets = rec_get_offsets(rec, index, offsets, n_core,
|
2014-02-26 19:11:54 +01:00
|
|
|
ULINT_UNDEFINED, &heap);
|
|
|
|
|
|
|
|
if (page_is_comp(page) && page_rec_is_user_rec(rec)
|
|
|
|
&& UNIV_UNLIKELY(rec_get_node_ptr_flag(rec)
|
|
|
|
== page_is_leaf(page))) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "'node_ptr' flag mismatch";
|
2019-07-01 18:24:35 +03:00
|
|
|
ret = FALSE;
|
|
|
|
goto next_rec;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(!page_rec_validate(rec, offsets))) {
|
2019-07-01 18:24:35 +03:00
|
|
|
ret = FALSE;
|
|
|
|
goto next_rec;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
2020-08-04 10:13:35 +03:00
|
|
|
info_bits = rec_get_info_bits(rec, page_is_comp(page));
|
|
|
|
if (info_bits
|
|
|
|
& ~(REC_INFO_MIN_REC_FLAG | REC_INFO_DELETED_FLAG)) {
|
|
|
|
ib::error() << "info_bits has an incorrect value "
|
|
|
|
<< info_bits;
|
|
|
|
ret = false;
|
|
|
|
}
|
|
|
|
|
2019-09-30 20:58:50 +03:00
|
|
|
if (rec == first_rec) {
|
2020-08-04 10:13:35 +03:00
|
|
|
if (info_bits & REC_INFO_MIN_REC_FLAG) {
|
2019-10-09 13:25:11 +03:00
|
|
|
if (page_has_prev(page)) {
|
|
|
|
ib::error() << "REC_INFO_MIN_REC_FLAG "
|
2019-10-10 20:40:26 +03:00
|
|
|
"is set on non-left page";
|
2019-10-09 13:25:11 +03:00
|
|
|
ret = false;
|
|
|
|
} else if (!page_is_leaf(page)) {
|
|
|
|
/* leftmost node pointer page */
|
|
|
|
} else if (!index->is_instant()) {
|
|
|
|
ib::error() << "REC_INFO_MIN_REC_FLAG "
|
|
|
|
"is set in a leaf-page record";
|
|
|
|
ret = false;
|
2020-08-20 11:01:47 +03:00
|
|
|
} else if (!(info_bits & REC_INFO_DELETED_FLAG)
|
2019-10-10 11:19:25 +03:00
|
|
|
!= !index->table->instant) {
|
|
|
|
ib::error() << (index->table->instant
|
|
|
|
? "Metadata record "
|
|
|
|
"is not delete-marked"
|
|
|
|
: "Metadata record "
|
|
|
|
"is delete-marked");
|
2019-10-09 13:25:11 +03:00
|
|
|
ret = false;
|
|
|
|
}
|
|
|
|
} else if (!page_has_prev(page)
|
|
|
|
&& index->is_instant()) {
|
|
|
|
ib::error() << "Metadata record is missing";
|
2019-09-30 20:58:50 +03:00
|
|
|
ret = false;
|
|
|
|
}
|
2020-08-04 10:13:35 +03:00
|
|
|
} else if (info_bits & REC_INFO_MIN_REC_FLAG) {
|
2019-09-30 20:58:50 +03:00
|
|
|
ib::error() << "REC_INFO_MIN_REC_FLAG record is not "
|
|
|
|
"first in page";
|
|
|
|
ret = false;
|
|
|
|
}
|
|
|
|
|
2020-08-04 10:13:35 +03:00
|
|
|
if (page_is_comp(page)) {
|
|
|
|
const rec_comp_status_t status = rec_get_status(rec);
|
|
|
|
if (status != REC_STATUS_ORDINARY
|
|
|
|
&& status != REC_STATUS_NODE_PTR
|
|
|
|
&& status != REC_STATUS_INFIMUM
|
|
|
|
&& status != REC_STATUS_SUPREMUM
|
2020-08-20 11:01:47 +03:00
|
|
|
&& status != REC_STATUS_INSTANT) {
|
2020-08-04 10:13:35 +03:00
|
|
|
ib::error() << "impossible record status "
|
|
|
|
<< status;
|
|
|
|
ret = false;
|
|
|
|
} else if (page_rec_is_infimum(rec)) {
|
|
|
|
if (status != REC_STATUS_INFIMUM) {
|
|
|
|
ib::error()
|
|
|
|
<< "infimum record has status "
|
|
|
|
<< status;
|
|
|
|
ret = false;
|
|
|
|
}
|
|
|
|
} else if (page_rec_is_supremum(rec)) {
|
|
|
|
if (status != REC_STATUS_SUPREMUM) {
|
|
|
|
ib::error() << "supremum record has "
|
|
|
|
"status "
|
|
|
|
<< status;
|
|
|
|
ret = false;
|
|
|
|
}
|
|
|
|
} else if (!page_is_leaf(page)) {
|
|
|
|
if (status != REC_STATUS_NODE_PTR) {
|
|
|
|
ib::error() << "node ptr record has "
|
|
|
|
"status "
|
|
|
|
<< status;
|
|
|
|
ret = false;
|
|
|
|
}
|
|
|
|
} else if (!index->is_instant()
|
2020-08-20 11:01:47 +03:00
|
|
|
&& status == REC_STATUS_INSTANT) {
|
2020-08-04 10:13:35 +03:00
|
|
|
ib::error() << "instantly added record in a "
|
|
|
|
"non-instant index";
|
|
|
|
ret = false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-02-26 19:11:54 +01:00
|
|
|
/* Check that the records are in the ascending order */
|
2016-08-12 11:17:45 +03:00
|
|
|
if (count >= PAGE_HEAP_NO_USER_LOW
|
2014-02-26 19:11:54 +01:00
|
|
|
&& !page_rec_is_supremum(rec)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
|
|
|
|
int cmp = cmp_rec_rec(
|
|
|
|
rec, old_rec, offsets, old_offsets, index);
|
|
|
|
|
|
|
|
/* For a spatial index, on a non-leaf level, we
|
|
|
|
allow recs to be equal. */
|
2019-07-01 18:24:35 +03:00
|
|
|
if (cmp <= 0 && !(cmp == 0 && index->is_spatial()
|
|
|
|
&& !page_is_leaf(page))) {
|
2016-08-12 11:17:45 +03:00
|
|
|
|
2019-07-01 18:24:35 +03:00
|
|
|
ib::error() << "Records in wrong order";
|
2016-08-12 11:17:45 +03:00
|
|
|
|
2014-02-26 19:11:54 +01:00
|
|
|
fputs("\nInnoDB: previous record ", stderr);
|
2016-08-12 11:17:45 +03:00
|
|
|
/* For spatial index, print the mbr info.*/
|
|
|
|
if (index->type & DICT_SPATIAL) {
|
|
|
|
putc('\n', stderr);
|
|
|
|
rec_print_mbr_rec(stderr,
|
|
|
|
old_rec, old_offsets);
|
|
|
|
fputs("\nInnoDB: record ", stderr);
|
|
|
|
putc('\n', stderr);
|
|
|
|
rec_print_mbr_rec(stderr, rec, offsets);
|
|
|
|
putc('\n', stderr);
|
|
|
|
putc('\n', stderr);
|
|
|
|
|
|
|
|
} else {
|
|
|
|
rec_print_new(stderr, old_rec, old_offsets);
|
|
|
|
fputs("\nInnoDB: record ", stderr);
|
|
|
|
rec_print_new(stderr, rec, offsets);
|
|
|
|
putc('\n', stderr);
|
|
|
|
}
|
2014-02-26 19:11:54 +01:00
|
|
|
|
2019-07-01 18:24:35 +03:00
|
|
|
ret = FALSE;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (page_rec_is_user_rec(rec)) {
|
|
|
|
|
|
|
|
data_size += rec_offs_size(offsets);
|
2016-08-12 11:17:45 +03:00
|
|
|
|
2018-04-26 16:33:05 +03:00
|
|
|
#if defined(UNIV_GIS_DEBUG)
|
2016-08-12 11:17:45 +03:00
|
|
|
/* For spatial index, print the mbr info.*/
|
|
|
|
if (index->type & DICT_SPATIAL) {
|
|
|
|
rec_print_mbr_rec(stderr, rec, offsets);
|
|
|
|
putc('\n', stderr);
|
|
|
|
}
|
|
|
|
#endif /* UNIV_GIS_DEBUG */
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
offs = page_offset(rec_get_start(rec, offsets));
|
|
|
|
i = rec_offs_size(offsets);
|
2018-04-27 13:49:25 +03:00
|
|
|
if (UNIV_UNLIKELY(offs + i >= srv_page_size)) {
|
2019-07-01 18:24:35 +03:00
|
|
|
ib::error() << "Record offset out of bounds: "
|
|
|
|
<< offs << '+' << i;
|
|
|
|
ret = FALSE;
|
|
|
|
goto next_rec;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
while (i--) {
|
|
|
|
if (UNIV_UNLIKELY(buf[offs + i])) {
|
2019-07-01 18:24:35 +03:00
|
|
|
ib::error() << "Record overlaps another: "
|
|
|
|
<< offs << '+' << i;
|
|
|
|
ret = FALSE;
|
|
|
|
break;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
buf[offs + i] = 1;
|
|
|
|
}
|
|
|
|
|
2019-07-01 18:24:35 +03:00
|
|
|
if (ulint rec_own_count = page_is_comp(page)
|
|
|
|
? rec_get_n_owned_new(rec)
|
|
|
|
: rec_get_n_owned_old(rec)) {
|
2014-02-26 19:11:54 +01:00
|
|
|
/* This is a record pointed to by a dir slot */
|
|
|
|
if (UNIV_UNLIKELY(rec_own_count != own_count)) {
|
2019-07-01 18:24:35 +03:00
|
|
|
ib::error() << "Wrong owned count at " << offs
|
|
|
|
<< ": " << rec_own_count
|
|
|
|
<< ", " << own_count;
|
|
|
|
ret = FALSE;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
if (page_dir_slot_get_rec(slot) != rec) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Dir slot does not"
|
2019-07-01 18:24:35 +03:00
|
|
|
" point to right rec at " << offs;
|
|
|
|
ret = FALSE;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
2019-07-01 18:24:35 +03:00
|
|
|
if (ret) {
|
|
|
|
page_dir_slot_check(slot);
|
|
|
|
}
|
2014-02-26 19:11:54 +01:00
|
|
|
|
|
|
|
own_count = 0;
|
|
|
|
if (!page_rec_is_supremum(rec)) {
|
|
|
|
slot_no++;
|
|
|
|
slot = page_dir_get_nth_slot(page, slot_no);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-07-01 18:24:35 +03:00
|
|
|
next_rec:
|
2014-02-26 19:11:54 +01:00
|
|
|
if (page_rec_is_supremum(rec)) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
count++;
|
|
|
|
own_count++;
|
|
|
|
old_rec = rec;
|
|
|
|
rec = page_rec_get_next_const(rec);
|
|
|
|
|
2019-09-30 20:58:50 +03:00
|
|
|
if (page_rec_is_infimum(old_rec)
|
|
|
|
&& page_rec_is_user_rec(rec)) {
|
|
|
|
first_rec = rec;
|
|
|
|
}
|
|
|
|
|
2014-02-26 19:11:54 +01:00
|
|
|
/* set old_offsets to offsets; recycle offsets */
|
2019-11-04 22:30:12 +03:00
|
|
|
std::swap(old_offsets, offsets);
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
if (page_is_comp(page)) {
|
|
|
|
if (UNIV_UNLIKELY(rec_get_n_owned_new(rec) == 0)) {
|
|
|
|
|
|
|
|
goto n_owned_zero;
|
|
|
|
}
|
|
|
|
} else if (UNIV_UNLIKELY(rec_get_n_owned_old(rec) == 0)) {
|
|
|
|
n_owned_zero:
|
2019-07-01 18:24:35 +03:00
|
|
|
ib::error() << "n owned is zero at " << offs;
|
|
|
|
ret = FALSE;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(slot_no != n_slots - 1)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "n slots wrong " << slot_no << " "
|
|
|
|
<< (n_slots - 1);
|
2019-07-01 18:24:35 +03:00
|
|
|
ret = FALSE;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
2017-09-14 08:06:40 +03:00
|
|
|
if (UNIV_UNLIKELY(ulint(page_header_get_field(page, PAGE_N_RECS))
|
2014-02-26 19:11:54 +01:00
|
|
|
+ PAGE_HEAP_NO_USER_LOW
|
|
|
|
!= count + 1)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "n recs wrong "
|
|
|
|
<< page_header_get_field(page, PAGE_N_RECS)
|
|
|
|
+ PAGE_HEAP_NO_USER_LOW << " " << (count + 1);
|
2019-07-01 18:24:35 +03:00
|
|
|
ret = FALSE;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(data_size != page_get_data_size(page))) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "Summed data size " << data_size
|
|
|
|
<< ", returned by func " << page_get_data_size(page);
|
2019-07-01 18:24:35 +03:00
|
|
|
ret = FALSE;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Check then the free list */
|
|
|
|
rec = page_header_get_ptr(page, PAGE_FREE);
|
|
|
|
|
|
|
|
while (rec != NULL) {
|
2021-04-13 10:28:13 +03:00
|
|
|
offsets = rec_get_offsets(rec, index, offsets, n_core,
|
2014-02-26 19:11:54 +01:00
|
|
|
ULINT_UNDEFINED, &heap);
|
|
|
|
if (UNIV_UNLIKELY(!page_rec_validate(rec, offsets))) {
|
2019-07-01 18:24:35 +03:00
|
|
|
ret = FALSE;
|
2019-10-10 20:29:30 +03:00
|
|
|
next_free:
|
|
|
|
const ulint offs = rec_get_next_offs(
|
|
|
|
rec, page_is_comp(page));
|
|
|
|
if (!offs) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (UNIV_UNLIKELY(offs < PAGE_OLD_INFIMUM
|
|
|
|
|| offs >= srv_page_size)) {
|
|
|
|
ib::error() << "Page free list is corrupted";
|
|
|
|
ret = FALSE;
|
|
|
|
break;
|
|
|
|
}
|
2014-02-26 19:11:54 +01:00
|
|
|
|
2019-10-10 20:29:30 +03:00
|
|
|
rec = page + offs;
|
2019-07-01 18:24:35 +03:00
|
|
|
continue;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
count++;
|
|
|
|
offs = page_offset(rec_get_start(rec, offsets));
|
|
|
|
i = rec_offs_size(offsets);
|
2018-04-27 13:49:25 +03:00
|
|
|
if (UNIV_UNLIKELY(offs + i >= srv_page_size)) {
|
2019-07-01 18:24:35 +03:00
|
|
|
ib::error() << "Free record offset out of bounds: "
|
|
|
|
<< offs << '+' << i;
|
|
|
|
ret = FALSE;
|
2019-10-10 20:29:30 +03:00
|
|
|
goto next_free;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
while (i--) {
|
|
|
|
if (UNIV_UNLIKELY(buf[offs + i])) {
|
2019-07-01 18:24:35 +03:00
|
|
|
ib::error() << "Free record overlaps another: "
|
|
|
|
<< offs << '+' << i;
|
|
|
|
ret = FALSE;
|
|
|
|
break;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
buf[offs + i] = 1;
|
|
|
|
}
|
|
|
|
|
2019-10-10 20:29:30 +03:00
|
|
|
goto next_free;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
if (UNIV_UNLIKELY(page_dir_get_n_heap(page) != count + 1)) {
|
2016-08-12 11:17:45 +03:00
|
|
|
ib::error() << "N heap is wrong "
|
|
|
|
<< page_dir_get_n_heap(page) << " " << count + 1;
|
2019-07-01 18:24:35 +03:00
|
|
|
ret = FALSE;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
mem_heap_free(heap);
|
|
|
|
|
2019-07-01 18:24:35 +03:00
|
|
|
if (UNIV_UNLIKELY(!ret)) {
|
|
|
|
goto func_exit2;
|
2014-02-26 19:11:54 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
return(ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
/***************************************************************//**
|
|
|
|
Looks in the page record list for a record with the given heap number.
|
2016-08-12 11:17:45 +03:00
|
|
|
@return record, NULL if not found */
|
2014-02-26 19:11:54 +01:00
|
|
|
const rec_t*
|
|
|
|
page_find_rec_with_heap_no(
|
|
|
|
/*=======================*/
|
|
|
|
const page_t* page, /*!< in: index page */
|
|
|
|
ulint heap_no)/*!< in: heap number */
|
|
|
|
{
|
|
|
|
const rec_t* rec;
|
|
|
|
|
|
|
|
if (page_is_comp(page)) {
|
|
|
|
rec = page + PAGE_NEW_INFIMUM;
|
|
|
|
|
2016-08-12 11:17:45 +03:00
|
|
|
for (;;) {
|
2014-02-26 19:11:54 +01:00
|
|
|
ulint rec_heap_no = rec_get_heap_no_new(rec);
|
|
|
|
|
|
|
|
if (rec_heap_no == heap_no) {
|
|
|
|
|
|
|
|
return(rec);
|
|
|
|
} else if (rec_heap_no == PAGE_HEAP_NO_SUPREMUM) {
|
|
|
|
|
|
|
|
return(NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
rec = page + rec_get_next_offs(rec, TRUE);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
rec = page + PAGE_OLD_INFIMUM;
|
|
|
|
|
|
|
|
for (;;) {
|
|
|
|
ulint rec_heap_no = rec_get_heap_no_old(rec);
|
|
|
|
|
|
|
|
if (rec_heap_no == heap_no) {
|
|
|
|
|
|
|
|
return(rec);
|
|
|
|
} else if (rec_heap_no == PAGE_HEAP_NO_SUPREMUM) {
|
|
|
|
|
|
|
|
return(NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
rec = page + rec_get_next_offs(rec, FALSE);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
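/* Illustrative usage (not part of the original file): a caller such as
the lock system can map a heap number back to its record roughly like
this, assuming a buf_block_t* block and a heap number heap_no:

	const rec_t* rec = page_find_rec_with_heap_no(block->frame,
						       heap_no);
	if (rec == NULL) {
		... no record with this heap number exists on the page ...
	}
*/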
|
|
|
|
|
2014-05-05 18:20:28 +02:00
|
|
|
/** Get the last non-delete-marked record on a page.
|
|
|
|
@param[in] page index tree leaf page
|
|
|
|
@return the last record, not delete-marked
|
|
|
|
@retval infimum record if all records are delete-marked */
|
|
|
|
const rec_t*
|
|
|
|
page_find_rec_max_not_deleted(
|
|
|
|
const page_t* page)
|
|
|
|
{
|
|
|
|
const rec_t* rec = page_get_infimum_rec(page);
|
|
|
|
const rec_t* prev_rec = NULL; // remove warning
|
|
|
|
|

	/* Because the page infimum is never delete-marked
	and never the metadata pseudo-record (MIN_REC_FLAG),
	prev_rec will always be assigned to it first. */
	ut_ad(!rec_get_info_bits(rec, page_rec_is_comp(rec)));
	ut_ad(page_is_leaf(page));

	if (page_is_comp(page)) {
		do {
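			/* The info bits are stored in the record
			header byte at rec[-REC_NEW_INFO_BITS];
			testing the flag bits on that byte directly
			avoids a call to rec_get_info_bits(). Skip
			delete-marked records and the metadata
			pseudo-record (REC_INFO_MIN_REC_FLAG). */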
			if (!(rec[-REC_NEW_INFO_BITS]
			      & (REC_INFO_DELETED_FLAG
				 | REC_INFO_MIN_REC_FLAG))) {
				prev_rec = rec;
			}
			rec = page_rec_get_next_low(rec, true);
		} while (rec != page + PAGE_NEW_SUPREMUM);
	} else {
		do {
			if (!(rec[-REC_OLD_INFO_BITS]
			      & (REC_INFO_DELETED_FLAG
				 | REC_INFO_MIN_REC_FLAG))) {
				prev_rec = rec;
			}
			rec = page_rec_get_next_low(rec, false);
		} while (rec != page + PAGE_OLD_SUPREMUM);
	}

	return(prev_rec);
}
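
/* A minimal usage sketch (illustrative only, assuming 'block' is a
buffer block holding a leaf page): the infimum is returned when every
user record is delete-marked, so a caller would test the result with
page_rec_is_infimum() before using it.

	const rec_t*	last = page_find_rec_max_not_deleted(
		buf_block_get_frame(block));

	if (!page_rec_is_infimum(last)) {
		// 'last' is the last record that is not delete-marked
	}
*/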