Pass buf_block_t* to more functions that write redo log.
page_zip_write_node_ptr(), page_zip_write_blob_ptr(),
page_zip_compress_write_log_no_data():
Take buf_block_t* as parameter, and do not tolerate mtr=NULL.
page_zip_compress(): Do not tolerate mtr=NULL.
page_zip_dir_insert(): Take page_cur_t* as parameter.
mlog_write_initial_log_record(): Remove. This function was unused.
RecIterator::remove(): Remove the redundant page_zip parameter.
PageConverter::m_page_zip_ptr: Remove.
page_cur_delete_rec(): Do not tolerate mtr=NULL.
page_delete_rec(): Merge with the only caller, RecIterator::remove().
RecIterator::m_mtr: New data member: a dummy mini-transaction.
offset_t: this is a type which represents one record offset.
It's unsigned short int.
a lot of functions: replace ulint with offset_t
btr_pcur_restore_position_func(),
page_validate(),
row_ins_scan_sec_index_for_duplicate(),
row_upd_clust_rec_by_insert_inherit_func(),
row_vers_impl_x_locked_low(),
trx_undo_prev_version_build():
allocate record offsets on the stack instead of waiting for rec_get_offsets()
to allocate it from mem_heap_t. So, reducing memory allocations.
RECORD_OFFSET, INDEX_OFFSET:
now it's less convenient to store pointers in offset_t*
array. One pointer occupies now several offset_t. And those constant are start
indexes into array to places where to store pointer values
REC_OFFS_HEADER_SIZE: adjusted for the new reality
REC_OFFS_NORMAL_SIZE:
increase size from 100 to 300 which means less heap allocations.
And sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) now is 600 bytes which
is smaller than previous 800 bytes.
REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality
rem0rec.h, rem0rec.ic, rem0rec.cc:
various arguments, return values and local variables types were changed to
fix numerous integer conversions issues.
enum field_type_t:
offset types concept was introduces which replaces old offset flags stuff.
Like in earlier version, 2 upper bits are used to store offset type.
And this enum represents those types.
REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed
get_type(), set_type(), get_value(), combine():
these are convenience functions to work with offsets and it's types
rec_offs_base()[0]:
still uses an old scheme with flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL
rec_offs_base()[i]:
these have type offset_t now. Two upper bits contains type.
recv_dblwr_t::list is used for appending to the beginning and iterating
through its elements. std::deque fits better for that purpose because
it does less allocations than std::forward_list and provides better memory
locality.
Use std::queue backed by std::deque instead of list because it does less
allocations which still having O(1) push and pop operations.
Also store trx_purge_rec_t directly, because its only 16 bytes and allocating
it is to wasteful. This should be faster.
purge_node_t::purge_node_t: container is already empty after creation
purge_node_t::end(): replace clearing container with assertion that it's clear
InnoDB RNG maintains global state, causing otherwise unnecessary bus
traffic. Even worse, this is cross-mutex traffic. That is, different
mutexes suffer from contention.
Fixed delay of 4 was verified to give best throughput by OLTP update
index and read-write benchmarks on Intel Broadwell (2/20/40) and
ARM (1/46/46).
This is a backport of ce04790065 from
MariaDB Server 10.3.
We should not need anywhere near 32 bits of entropy, so we might
just limit ourselves to a 32-bit random number generator.
Also, it might be cheaper to use exclusive-or, bit shifting and
conditional jumps, instead of multiplication and addition.
We use relaxed atomic operations on the global random number generator
state in order in an attempt to silence any warnings about race conditions.
There is an obvious race condition between the load and store in
ut_rnd_gen(), but we do not think that it matters much that the
state of the random number generator could 'stutter'.
This change seems makes the 'uncompress_ops' nondeterministic
in innodb_zip.cmp_per_index after the restart. It looks like
there is an inherent race condition in the test, because the
table could be opened for InnoDB statistics recalculation
already before innodb_cmp_per_index_enabled was set. We might
end up having uncompress_ops anywhere between 0 and 9, or perhaps
even more. Let us remove that part of the test.
ut_rnd_interval(): Remove the first parameter, which was mostly
passed as 0. Implement as a simple wrapper around ut_rnd_gen().
Trivially return 0 if the size of the interval is smaller than 2.
ut_rnd_ulint_counter, ut_rnd_gen_next_ulint(), ut_rnd_gen_ulint(): Remove.
ut_rnd_set_seed(): Unused function; remove.
ut_rnd_gen(): Renamed from page_cur_lcg_prng().
ut_rnd_current: The internal state of ut_rnd_gen().
page_cur_open_on_rnd_user_rec(): Replace linear search with
page_rec_get_nth().
This is joint work with Thirunarayanan Balathandayuthapani.
The MDL interface between InnoDB and the rest of the server
(in storage/innobase/dict/dict0dict.cc and in include/)
is my work, while most everything else is Thiru's.
The collection of InnoDB persistent statistics and the
defragmentation were not refactored to use MDL. They will
keep relying on lower-level interlocking with
fil_check_pending_operations().
The purge of transaction history and the background operations on
fulltext indexes will use MDL. We will revert
commit 2c4844c9e7
(MDEV-17813) because thanks to MDL, purge cannot conflict
with DDL operations anymore. For a similar reason, we will remove
the MDEV-16222 test case from gcol.innodb_virtual_debug_purge.
Purge is essentially replacing all use of the global dict_sys.latch
with MDL. Purge will skip the undo log records for tables whose names
start with #sql-ib or #sql2. Theoretically, such tables might
be renamed back to visible table names if TRUNCATE fails to
create a new table, or the final rename in ALTER TABLE...ALGORITHM=COPY
fails. In that case, purge could permanently leave some garbage
in the table. Such garbage will be tolerated; the table would not
be considered corrupted.
To avoid repeated MDL releases and acquisitions,
trx_purge_attach_undo_recs() will sort undo log records by table_id,
and purge_node_t will keep the MDL and table handle open for multiple
successive undo log records.
get_purge_table(): A new accessor, used during the purge of
history for indexed virtual columns. This interface should ideally
not exist at all.
thd_mdl_context(): Accessor of THD::mdl_context.
Wrapped in a new thd_mdl_service.
dict_get_db_name_len(): Define inline.
dict_acquire_mdl_shared(): Acquire explicit shared MDL on a table name
if needed.
dict_table_open_on_id(): Return MDL_ticket, if requested.
dict_table_close(): Release MDL ticket, if requested.
dict_fts_index_syncing(), dict_index_t::index_fts_syncing: Remove.
row_drop_table_for_mysql() no longer needs to check these, because
MDL guarantees that a fulltext index sync will not be in progress
while MDL_EXCLUSIVE is protecting a DDL operation.
dict_table_t::parse_name(): Parse the table name for acquiring MDL.
purge_node_t::undo_recs: Change the type to std::list<trx_purge_rec_t*>
(different container, and storing also roll_ptr).
purge_node_t: Add mdl_ticket, last_table_id, purge_thd, mdl_hold_recs
for acquiring MDL and for keeping the table open across multiple
undo log records.
purge_vcol_info_t, row_purge_store_vsec_cur(), row_purge_restore_vsec_cur():
Remove. We will acquire the MDL earlier.
purge_sys_t::heap: Added, for reading undo log records.
fts_sync_during_ddl(): Invoked during ALGORITHM=INPLACE operations
to ensure that fts_sync_table() will not conflict with MDL_EXCLUSIVE.
Uses fts_t::sync_message for bookkeeping.
The InnoDB internal SQL parser, which is used for updating the InnoDB
data dictionary tables (to be removed in MDEV-11655), persistent
statistics (to be refactored in MDEV-15020) and fulltext indexes,
implements some unused keywords and built-in functions:
OUT BINARY BLOB INTEGER FLOAT SUM DISTINCT READ
COMPACT BLOCK_SIZE
TO_CHAR TO_NUMBER BINARY_TO_NUMBER REPLSTR SYSDATE PRINTF ASSERT
RND RND_STR ROW_PRINTF UNSIGNED
Also, procedures are never declared with parameters. Only one top-level
procedure is declared and invoked at a time, and parameters are being
passed via pars_info_t.
Before commit 90c52e5291 introduced
aligned_malloc(), InnoDB always used a pattern of over-allocating
memory and invoking ut_align() to guarantee the desired alignment.
It is cleaner to invoke aligned_malloc() and aligned_free() directly.
ut_align(): Remove. In assertions, ut_align_down() can be used instead.
The function fsp_header_get_space_id() returns ulint instead of
uint32_t, only to be able to complain that the two adjacent
tablespace ID fields in the page differ. Remove the function,
and merge the check to the callers.
Also, make some more use of aligned_malloc().
InnoDB startup was discovering undo tablespaces in a dirty way.
It was reading a possibly stale copy of the TRX_SYS page before
processing any redo log records.
srv_start(): Do not call buf_pool_invalidate(). Invoke
trx_rseg_get_n_undo_tablespaces() after the recovery has been initiated.
recv_recovery_from_checkpoint_start(): Assert that the buffer pool is
empty. This used to be guaranteed by the buf_pool_invalidate() call.
trx_rseg_get_n_undo_tablespaces(): Move to the calling compilation unit,
and reimplement in a simpler way.
srv_undo_tablespace_create(): Remove the constant parameter
size=SRV_UNDO_TABLESPACE_SIZE_IN_PAGES.
srv_undo_tablespace_open(): Reimplement in a cleaner way, with
more robust error handling.
srv_all_undo_tablespaces_open(): Split from srv_undo_tablespaces_init().
srv_undo_tablespaces_init(): Read all "undo001","undo002" tablespace
files directly, without consulting the TRX_SYS page via calling
trx_rseg_get_n_undo_tablespaces().
This is joint work with Thirunarayanan Balathandayuthapani.
In commit af5947f433
the function btr_discard_page() is invoking btr_set_min_rec_mark()
with the wrong buf_block_t* object. node_ptr is on merge_block,
not block.
btr_discard_page(): Remove the variables merge_page, page, and
always refer to block->frame or merge_block->frame instead.
Also, limit the scope of node_ptr and avoid duplicated conditions.
btr_set_min_rec_mark(): Add a template parameter, so that the
caller can specify whether the page is supposed to have a left sibling.
Otherwise, the assertion (which was introduced in the same commit)
would fail in btr_discard_page().
mtr_t::memcpy(): Replaces mlog_write_string(), mlog_log_string().
The buf_block_t is passed a parameter, so that
mlog_write_initial_log_record_low() can be used instead of
mlog_write_initial_log_record_fast().
fil_space_crypt_t::write_page0(): Remove the fil_space_t* parameter.
Passing buf_block_t helps us avoid calling
mlog_write_initial_log_record_fast() and page_get_page_no(),
and allows us to implement more debug checks, such as
that on ROW_FORMAT=COMPRESSED index pages, only the page header
may be modified by MLOG_MEMSET records.
fseg_n_reserved_pages(): Add a buf_block_t parameter.
mtr_t::write(): Replaces mlog_write_ulint(), mlog_write_ull().
Optimize away writes if the page contents does not change,
except when a dummy write has been explicitly requested.
Because the member function template takes a block descriptor as a
parameter, it is possible to introduce better consistency checks.
Due to this, the code for handling file-based lists, undo logs
and user transactions was refactored to pass around buf_block_t.
page_create_write_log(), mlog_write_initial_log_record():
Merge to the only caller, and use
mlog_write_initial_log_record_low() for writing the log record.
btr_set_min_rec_mark(): Write MLOG_1BYTE instead of
MLOG_REC_MIN_MARK or MLOG_COMP_REC_MIN_MARK.
On ROW_FORMAT=COMPRESSED pages, the minimum record flag is not stored
at all. The flag is computed for the uncompressed page by
page_zip_decompress(). Hence, nothing needs to be logged for
ROW_FORMAT=COMPRESSED tables for this operation.
To facilitate crash-upgrade and hot backup from older versions,
we will retain the code to parse and apply the old log record types
MLOG_REC_MIN_MARK and MLOG_COMP_REC_MIN_MARK.
The MLOG_FILE_WRITE_CRYPT_DATA record was completely redundant.
It can be replaced with a single MLOG_WRITE_STRING record.
To facilitate upgrade from older versions, we will retain
fil_parse_write_crypt_data().
fil_crypt_parse(): Recover fil_space_crypt_t::write_page0().
fil_space_crypt_t::write_page0(): Write everything in a single
MLOG_WRITE_STRING for easy parsing.
fil_space_crypt_t::page0_offset: Remove.
The function get_wkb_of_default_point() should never have been
added, and the cleanup in commit 56ff6f1b0b
should have removed it.
The unnecessary code was added in
mysql/mysql-server@0a27b72171
and mostly disabled in
mysql/mysql-server@8f11af7734.
In MariaDB, the types DATA_POINT and DATA_VAR_POINT are never used.
Instead, DATA_GEOMETRY continues to be used since 10.2.2, introduced by
mysql/mysql-server@673bad7c7e
in MySQL 5.7.1.
Before commit c0f47a4a58 (MDEV-12026)
introduced the innodb_checksum_algorithm=full_crc32 format,
it was impossible to tell if InnoDB data files contained garbage in
the FIL_PAGE_TYPE header field (and possibly other fields).
This is because before commit 3926673ce7
in MySQL 5.1.48, InnoDB would write uninitialized data to some fields,
and because there was no way to tell with which InnoDB version a data
file was created.
If fil_space_t::full_crc32() holds, the data file cannot contain
uninitialized garbage or invalid FIL_PAGE_TYPE, and thus
fil_block_check_type() should not be invoked to correct anything.
Introduce memcpy_aligned<N>(), memcmp_aligned<N>(), memset_aligned<N>()
and use them for accessing InnoDB page header fields that are known
to be aligned.
MY_ASSUME_ALIGNED(): Wrapper for the GCC/clang __builtin_assume_aligned().
Nothing similar seems to exist in Microsoft Visual Studio, and the
C++20 std::assume_aligned is not available to us yet.
Explicitly specified alignment guarantees allow compilers to generate
faster code on platforms with strict alignment rules, instead of
emitting calls to potentially unaligned memcpy(), memcmp(), or memset().
We must relax too strict debug assertions. For latin1_swedish_ci,
mtype=DATA_CHAR or mtype=DATA_VARCHAR will be used instead of
mtype=DATA_MYSQL or mtype=DATA_VARMYSQL. Likewise, some changes of
dtype_get_charset_coll() do not affect the data type encoding,
but only any indexes that are defined on the column.
Charset::same_encoding(): Check whether two charset-collations have
the same character set encoding.
dict_col_t::same_encoding(): Check whether two character columns
have the same character set encoding.
dict_col_t::same_type(): Check whether two columns have a compatible
data type encoding.
dict_col_t::same_format(), dict_table_t::instant_column(): Do not
compare mtype or the charset-collation of prtype directly.
Rely on dict_col_t::same_type() instead.
dtype_get_charset_coll(): Narrow the return type to uint16_t.
This is a refined version of a fix that was developed by
Thirunarayanan Balathandayuthapani.
At each mini-transaction commit, the log sequence number of the
mini-transaction must be written to each modified page, so that
it will be available in the FIL_PAGE_LSN field when the page is
being read in crash recovery.
InnoDB was unnecessarily allocating redundant storage for the
field, in buf_page_t::newest_modification. Let us access
FIL_PAGE_LSN directly.
Furthermore, on ALTER TABLE...IMPORT TABLESPACE, let us write
0 to FIL_PAGE_LSN instead of using log_sys.lsn.
buf_flush_init_for_writing(), buf_flush_update_zip_checksum(),
fil_encrypt_buf_for_full_crc32(), fil_encrypt_buf(),
fil_space_encrypt(): Remove the parameter lsn.
buf_page_get_newest_modification(): Merge with the only caller.
buf_tmp_reserve_compression_buf(), buf_tmp_page_encrypt(),
buf_page_encrypt(): Define static in the same compilation unit
with the only caller.
PageConverter::m_current_lsn: Remove. Write 0 to FIL_PAGE_LSN
on ALTER TABLE...IMPORT TABLESPACE.
Currently InnoDB uses internal parser for adding foreign keys. Remove
internal parser and use data parsed by SQL parser (sql_yacc) for
adding foreign keys.
- create_table_info_t::create_foreign_keys() replacement for
dict_create_foreign_constraints_low();
- Pass constraint name via Foreign_key object.
Temporary until MDEV-20865:
- Pass alter_info as part of create_info.