page_mem_alloc_free(), page_dir_set_n_heap(), page_ptr_set_direction():
Merge with the callers.
page_direction_reset(), page_direction_increment(),
page_zip_dir_insert(), page_zip_write_rec_ext(), page_zip_write_rec():
Add the parameter mtr, and write log.
PageBulk::insert(), PageBulk::finish(): Write log for all changes.
page_cur_rec_insert(), page_cur_insert_rec_write_log(),
page_cur_insert_rec_write_log(): Remove.
page_rec_set_next(), page_header_set_field(), page_header_set_ptr():
Remove. Use lower-level operations with or without logging.
page_zip_dir_add_slot(): Move to the same compilation unit with
its only caller, page_cur_insert_rec_zip().
page_cur_insert_rec_zip(): Mark pieces of code that must be skipped
once this task is completed.
btr_defragment_chunk(): Before starting a mini-transaction that
is writing (a lot), invoke log_free_check(). This should allow
the test innodb.innodb_defrag_concurrent to pass with the
mtr default_mysqld.cnf setting of innodb_log_file_size=10M.
MLOG_BUF_MARGIN: Remove.
No longer write the following redo log records:
MLOG_COMP_LIST_END_DELETE, MLOG_LIST_END_DELETE,
MLOG_COMP_LIST_START_DELETE, MLOG_LIST_START_DELETE,
MLOG_REC_DELETE,MLOG_COMP_REC_DELETE.
Each individual deleted record will be logged separately
using physical log records.
page_dir_slot_set_n_owned(),
page_zip_rec_set_owned(), page_zip_dir_delete(), page_zip_clear_rec():
Add the parameter mtr, and write redo log.
page_dir_slot_set_rec(): Remove. Replaced with lower-level operations
that write redo log when necessary.
page_rec_set_n_owned(): Replaces rec_set_n_owned_old(),
rec_set_n_owned_new().
rec_set_heap_no(): Replaces rec_set_heap_no_old(), rec_set_heap_no_new().
page_mem_free(), page_dir_split_slot(), page_dir_balance_slot():
Add the parameter mtr.
page_dir_set_n_slots(): Merge with the caller page_dir_split_slot().
page_dir_slot_set_rec(): Merge with the callers page_dir_split_slot()
and page_dir_balance_slot().
page_cur_insert_rec_low(), page_cur_insert_rec_zip():
Suppress the logging of lower-level operations.
page_cur_delete_rec_write_log(): Remove.
page_cur_delete_rec(): Do not tolerate mtr=NULL.
rec_convert_dtuple_to_rec_old(), rec_convert_dtuple_to_rec_comp():
Replace rec_set_heap_no_old() and rec_set_heap_no_new() with direct
access that does not involve redo logging.
mtr_t::memcpy(): Do allow non-redo-logged writes to uncompressed pages
of ROW_FORMAT=COMPRESSED pages.
buf_page_io_complete(): Evict the uncompressed page of
a ROW_FORMAT=COMPRESSED page after recovery. Because we no longer
write logical log records for deleting index records, but instead
write physical records that may refer directly to the compressed
page frame of a ROW_FORMAT=COMPRESSED page, and because on recovery
we will only apply the changes to the ROW_FORMAT=COMPRESSED page,
the uncompressed page frame can be stale until page_zip_decompress()
is executed.
recv_parse_or_apply_log_rec_body(): After applying MLOG_ZIP_WRITE_STRING,
ensure that the FIL_PAGE_TYPE of the uncompressed page matches the
compressed page, because buf_flush_init_for_writing() assumes that
field to be valid.
mlog_init_t::mark_ibuf_exist(): Invoke page_zip_decompress(), because
the uncompressed page after buf_page_create() is not necessarily
up to date.
buf_LRU_block_remove_hashed(): Bypass a page_zip_validate() check
during redo log apply.
recv_apply_hashed_log_recs(): Invoke mlog_init.mark_ibuf_exist()
also for the last batch, to ensure that page_zip_decompress() will
be called for freshly initialized pages.
btr_cur_upd_rec_sys(): Replaces row_upd_rec_sys_fields() and
implements redo logging.
row_upd_rec_sys_fields_in_recovery(): Remove, and merge to the
only remaining caller btr_cur_parse_update_in_place().
btr_cur_del_mark_set_clust_rec_log(),
btr_cur_del_mark_set_sec_rec_log(),
btr_cur_set_deleted_flag_for_ibuf():
Remove, and replace with btr_rec_set_deleted<bool>().
page_zip_rec_set_deleted(): Add the parameter mtr, and write a
MLOG_ZIP_WRITE_STRING record to the log.
Log the low-level operations for ROW_FORMAT=COMPRESSED index pages
using a new record, MLOG_ZIP_WRITE_STRING. We will still use
MLOG_1BYTE,..., MLOG_8BYTES or MLOG_WRITE_STRING for operations
on other than index pages (such as the page allocation bitmap pages).
We will stop writing the record MLOG_ZIP_PAGE_COMPRESS later, after
replacing all MLOG_REC_ and MLOG_COMP_REC_ that update index pages.
page_zip_reorganize(): Restore the page on failure.
In callers, omit now-redundant calls to page_zip_decompress().
btr_page_reorganize_low(): Define in static scope only, and
remove the z_level parameter. Assert that ROW_FORMAT is not COMPRESSED.
btr_page_reorganize_block(), btr_page_reorganize(): Invoke
page_zip_reorganize() for ROW_FORMAT=COMPRESSED.
page_zip_compress_write_log_no_data(): Remove.
We no longer write the MLOG_ZIP_PAGE_COMPRESS_NO_DATA record.
Instead, we will write MLOG_ZIP_PAGE_COMPRESS records.
Pass buf_block_t* to more functions that write redo log.
page_zip_write_node_ptr(), page_zip_write_blob_ptr(),
page_zip_compress_write_log_no_data():
Take buf_block_t* as parameter, and do not tolerate mtr=NULL.
page_zip_compress(): Do not tolerate mtr=NULL.
page_zip_dir_insert(): Take page_cur_t* as parameter.
mlog_write_initial_log_record(): Remove. This function was unused.
RecIterator::remove(): Remove the redundant page_zip parameter.
PageConverter::m_page_zip_ptr: Remove.
offset_t: this is a type which represents one record offset.
It's unsigned short int.
a lot of functions: replace ulint with offset_t
btr_pcur_restore_position_func(),
page_validate(),
row_ins_scan_sec_index_for_duplicate(),
row_upd_clust_rec_by_insert_inherit_func(),
row_vers_impl_x_locked_low(),
trx_undo_prev_version_build():
allocate record offsets on the stack instead of waiting for rec_get_offsets()
to allocate it from mem_heap_t. So, reducing memory allocations.
RECORD_OFFSET, INDEX_OFFSET:
now it's less convenient to store pointers in offset_t*
array. One pointer occupies now several offset_t. And those constant are start
indexes into array to places where to store pointer values
REC_OFFS_HEADER_SIZE: adjusted for the new reality
REC_OFFS_NORMAL_SIZE:
increase size from 100 to 300 which means less heap allocations.
And sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) now is 600 bytes which
is smaller than previous 800 bytes.
REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality
rem0rec.h, rem0rec.ic, rem0rec.cc:
various arguments, return values and local variables types were changed to
fix numerous integer conversions issues.
enum field_type_t:
offset types concept was introduces which replaces old offset flags stuff.
Like in earlier version, 2 upper bits are used to store offset type.
And this enum represents those types.
REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed
get_type(), set_type(), get_value(), combine():
these are convenience functions to work with offsets and it's types
rec_offs_base()[0]:
still uses an old scheme with flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL
rec_offs_base()[i]:
these have type offset_t now. Two upper bits contains type.
Except for fil_name_process(), which invokes os_normalize_path(),
the redo log record parser will not modify the redo log records.
Add const qualifiers accordingly.
page_zip_compress(), page_zip_compress_write_log(),
page_zip_copy_recs(): Replace the parameters page,page_zip with block,
and set buf_page_t::init_on_flush on success
if innodb_log_optimize_ddl=OFF.
page_zip_parse_compress_no_data(): Merge with the only caller
recv_parse_or_apply_log_rec_body().
The record MLOG_ZIP_PAGE_COMPRESS is similar to MLOG_INIT_FILE_PAGE2
that it contains all the information needed to initialize the page.
Like for the other record, do initialize the entire page on recovery.
This is a follow-up task to MDEV-12026, which introduced
innodb_checksum_algorithm=full_crc32 and a simpler page format.
MDEV-12026 did not enable full_crc32 for page_compressed tables,
which we will be doing now.
This is joint work with Thirunarayanan Balathandayuthapani.
For innodb_checksum_algorithm=full_crc32 we change the
page_compressed format as follows:
FIL_PAGE_TYPE: The most significant bit will be set to indicate
page_compressed format. The least significant bits will contain
the compressed page size, rounded up to a multiple of 256 bytes.
The checksum will be stored in the last 4 bytes of the page
(whether it is the full page or a page_compressed page whose
size is determined by FIL_PAGE_TYPE), covering all preceding
bytes of the page. If encryption is used, then the page will
be encrypted between compression and computing the checksum.
For page_compressed, FIL_PAGE_LSN will not be repeated at
the end of the page.
FSP_SPACE_FLAGS (already implemented as part of MDEV-12026):
We will store the innodb_compression_algorithm that may be used
to compress pages. Previously, the choice of algorithm was written
to each compressed data page separately, and one would be unable
to know in advance which compression algorithm(s) are used.
fil_space_t::full_crc32_page_compressed_len(): Determine if the
page_compressed algorithm of the tablespace needs to know the
exact length of the compressed data. If yes, we will reserve and
write an extra byte for this right before the checksum.
buf_page_is_compressed(): Determine if a page uses page_compressed
(in any innodb_checksum_algorithm).
fil_page_decompress(): Pass also fil_space_t::flags so that the
format can be determined.
buf_page_is_zeroes(): Check if a page is full of zero bytes.
buf_page_full_crc32_is_corrupted(): Renamed from
buf_encrypted_full_crc32_page_is_corrupted(). For full_crc32,
we always simply validate the checksum to the page contents,
while the physical page size is explicitly specified by an
unencrypted part of the page header.
buf_page_full_crc32_size(): Determine the size of a full_crc32 page.
buf_dblwr_check_page_lsn(): Make this a debug-only function, because
it involves potentially costly lookups of fil_space_t.
create_table_info_t::check_table_options(),
ha_innobase::check_if_supported_inplace_alter(): Do allow the creation
of SPATIAL INDEX with full_crc32 also when page_compressed is used.
commit_cache_norebuild(): Preserve the compression algorithm when
updating the page_compression_level.
dict_tf_to_fsp_flags(): Set the flags for page compression algorithm.
FIXME: Maybe there should be a table option page_compression_algorithm
and a session variable to back it?
MySQL 5.7 introduced the class page_size_t and increased the size of
buffer pool page descriptors by introducing this object to them.
Maybe the intention of this exercise was to prepare for a future
where the buffer pool could accommodate multiple page sizes.
But that future never arrived, not even in MySQL 8.0. It is much
easier to manage a pool of a single page size, and typically all
storage devices of an InnoDB instance benefit from using the same
page size.
Let us remove page_size_t from MariaDB Server. This will make it
easier to remove support for ROW_FORMAT=COMPRESSED (or make it a
compile-time option) in the future, just by removing various
occurrences of zip_size.
Remove the bug-compatible crc32 algorithm variant that was added
to allow an upgrade from data files from big-endian systems where
innodb_checksum_algorithm=crc32 was used on MySQL 5.6
or MariaDB 10.0 or 10.1.
Affected users should be able to recompute page checksums using
innochecksum.
In MySQL 5.7, it was noticed that files are not portable between
big-endian and little-endian processor architectures
(such as SPARC and x86), because the original implementation of
innodb_checksum_algorithm=crc32 was not byte order agnostic.
A byte order agnostic implementation of innodb_checksum_algorithm=crc32
was only added to MySQL 5.7, not backported to 5.6. Consequently,
MariaDB Server versions 10.0 and 10.1 only contain the CRC-32C
implementation that works incorrectly on big-endian architectures,
and MariaDB Server 10.2.2 got the byte-order agnostic CRC-32C
implementation from MySQL 5.7.
MySQL 5.7 introduced a "legacy crc32" variant that is functionally
equivalent to the big-endian version of the original crc32 implementation.
Thanks to this variant, old data files can be transferred from big-endian
systems to newer versions.
Introducing new variants of checksum algorithms (without introducing
new names for them, or something on the pages themselves to identify
the algorithm) generally is a bad idea, because each checksum algorithm
is like a lottery ticket. The more algorithms you try, the more likely
it will be for the checksum to match on a corrupted page.
So, essentially MySQL 5.7 weakened innodb_checksum_algorithm=crc32,
and MariaDB 10.2.2 inherited this weakening.
We introduce a build option that together with MDEV-17957
makes innodb_checksum_algorithm=strict_crc32 strict again
by only allowing one variant of the checksum to match.
WITH_INNODB_BUG_ENDIAN_CRC32: A new cmake option for enabling the
bug-compatible "legacy crc32" checksum. This is only enabled on
big-endian systems by default, to facilitate an upgrade from
MariaDB 10.0 or 10.1. Checked by #ifdef INNODB_BUG_ENDIAN_CRC32.
ut_crc32_byte_by_byte: Remove (unused function).
legacy_big_endian_checksum: Remove. This variable seems to have
unnecessarily complicated the logic. When the weakening is enabled,
we must always fall back to the buggy checksum.
buf_page_check_crc32(): A helper function to compute one or
two CRC-32C variants.
Problem:
Innodb_checksum_algorithm checks for all checksum algorithm to
validate the page checksum even though the algorithm is specified as
strict_crc32, strict_innodb, strict_none.
Fix:
Remove the checks for all checksum algorithm to validate the page
checksum if the algo is specified as strict_* values.
Also, related to MDEV-15522, MDEV-17304, MDEV-17835,
remove the Galera xtrabackup tests, because xtrabackup never worked
with MariaDB Server 10.3 due to InnoDB redo log format changes.
Stop supporting the additional *trunc.log files that were
introduced via MySQL 5.7 to MariaDB Server 10.2 and 10.3.
DB_TABLESPACE_TRUNCATED: Remove.
purge_sys.truncate: A new structure to track undo tablespace
file truncation.
srv_start(): Remove the call to buf_pool_invalidate(). It is
no longer necessary, given that we no longer access things in
ways that violate the ARIES protocol. This call was originally
added for innodb_file_format, and it may later have been necessary
for the proper function of the MySQL 5.7 TRUNCATE recovery, which
we are now removing.
trx_purge_cleanse_purge_queue(): Take the undo tablespace as a
parameter.
trx_purge_truncate_history(): Rewrite everything mostly in a
single function, replacing references to undo::Truncate.
recv_apply_hashed_log_recs(): If any redo log is to be applied,
and if the log_sys.log.subformat indicates that separately
logged truncate may have been used, refuse to proceed except if
innodb_force_recovery is set. We will still refuse crash-upgrade
if TRUNCATE TABLE was logged. Undo tablespace truncation would
only be logged in undo*trunc.log files, which we are no longer
checking for.
Let InnoDB purge reset DB_TRX_ID,DB_ROLL_PTR when the history is removed.
[TODO: It appears that the resetting is not taking place as often as
it could be. We should test that a simple INSERT should eventually
cause row_purge_reset_trx_id() to be invoked unless DROP TABLE is
invoked soon enough.]
The InnoDB clustered index record system columns DB_TRX_ID,DB_ROLL_PTR
are used by multi-versioning. After the history is no longer needed, these
columns can safely be reset to 0 and 1<<55 (to indicate a fresh insert).
When a reader sees 0 in the DB_TRX_ID column, it can instantly determine
that the record is present the read view. There is no need to acquire
the transaction system mutex to check if the transaction exists, because
writes can never be conducted by a transaction whose ID is 0.
The persistent InnoDB undo log used to be split into two parts:
insert_undo and update_undo. The insert_undo log was discarded at
transaction commit or rollback, and the update_undo log was processed
by the purge subsystem. As part of this change, we will only generate
a single undo log for new transactions, and the purge subsystem will
reset the DB_TRX_ID whenever a clustered index record is touched.
That is, all persistent undo log will be preserved at transaction commit
or rollback, to be removed by purge.
The InnoDB redo log format is changed in two ways:
We remove the redo log record type MLOG_UNDO_HDR_REUSE, and
we introduce the MLOG_ZIP_WRITE_TRX_ID record for updating the
DB_TRX_ID,DB_ROLL_PTR in a ROW_FORMAT=COMPRESSED table.
This is also changing the format of persistent InnoDB data files:
undo log and clustered index leaf page records. It will still be
possible via import and export to exchange data files with earlier
versions of MariaDB. The change to clustered index leaf page records
is simple: we allow DB_TRX_ID to be 0.
When it comes to the undo log, we must be able to upgrade from earlier
MariaDB versions after a clean shutdown (no redo log to apply).
While it would be nice to perform a slow shutdown (innodb_fast_shutdown=0)
before an upgrade, to empty the undo logs, we cannot assume that this
has been done. So, separate insert_undo log may exist for recovered
uncommitted transactions. These transactions may be automatically
rolled back, or they may be in XA PREPARE state, in which case InnoDB
will preserve the transaction until an explicit XA COMMIT or XA ROLLBACK.
Upgrade has been tested by starting up MariaDB 10.2 with
./mysql-test-run --manual-gdb innodb.read_only_recovery
and then starting up this patched server with
and without --innodb-read-only.
trx_undo_ptr_t::undo: Renamed from update_undo.
trx_undo_ptr_t::old_insert: Renamed from insert_undo.
trx_rseg_t::undo_list: Renamed from update_undo_list.
trx_rseg_t::undo_cached: Merged from update_undo_cached
and insert_undo_cached.
trx_rseg_t::old_insert_list: Renamed from insert_undo_list.
row_purge_reset_trx_id(): New function to reset the columns.
This will be called for all undo processing in purge
that does not remove the clustered index record.
trx_undo_update_rec_get_update(): Allow trx_id=0 when copying the
old DB_TRX_ID of the record to the undo log.
ReadView::changes_visible(): Allow id==0. (Return true for it.
This is what speeds up the MVCC.)
row_vers_impl_x_locked_low(), row_vers_build_for_semi_consistent_read():
Implement a fast path for DB_TRX_ID=0.
Always initialize the TRX_UNDO_PAGE_TYPE to 0. Remove undo->type.
MLOG_UNDO_HDR_REUSE: Remove. This changes the redo log format!
innobase_start_or_create_for_mysql(): Set srv_undo_sources before
starting any transactions.
The parsing of the MLOG_ZIP_WRITE_TRX_ID record was successfully
tested by running the following:
./mtr --parallel=auto --mysqld=--debug=d,ib_log innodb_zip.bug56680
grep MLOG_ZIP_WRITE_TRX_ID var/*/log/mysqld.1.err
innochecksum uses global variables. great, let's use them all the
way down, instead of passing them as arguments to innodb internals,
conditionally modifying function prototypes with #ifdefs
Also, remove empty .ic files that were not removed by my MySQL commit.
Problem:
InnoDB used to support a compilation mode that allowed to choose
whether the function definitions in .ic files are to be inlined or not.
This stopped making sense when InnoDB moved to C++ in MySQL 5.6
(and ha_innodb.cc started to #include .ic files), and more so in
MySQL 5.7 when inline methods and functions were introduced
in .h files.
Solution:
Remove all references to UNIV_NONINL and UNIV_MUST_NOT_INLINE from
all files, assuming that the symbols are never defined.
Remove the files fut0fut.cc and ut0byte.cc which only mattered when
UNIV_NONINL was defined.
Define my_thread_id as an unsigned type, to avoid mismatch with
ulonglong. Change some parameters to this type.
Use size_t in a few more places.
Declare many flag constants as unsigned to avoid sign mismatch
when shifting bits or applying the unary ~ operator.
When applying the unary ~ operator to enum constants, explictly
cast the result to an unsigned type, because enum constants can
be treated as signed.
In InnoDB, change the source code line number parameters from
ulint to unsigned type. Also, make some InnoDB functions return
a narrower type (unsigned or uint32_t instead of ulint;
bool instead of ibool).
The InnoDB source code contains quite a few references to a closed-source
hot backup tool which was originally called InnoDB Hot Backup (ibbackup)
and later incorporated in MySQL Enterprise Backup.
The open source backup tool XtraBackup uses the full database for recovery.
So, the references to UNIV_HOTBACKUP are only cluttering the source code.
Contains also:
MDEV-10549 mysqld: sql/handler.cc:2692: int handler::ha_index_first(uchar*): Assertion `table_share->tmp_table != NO_TMP_TABLE || m_lock_type != 2' failed. (branch bb-10.2-jan)
Unlike MySQL, InnoDB still uses THR_LOCK in MariaDB
MDEV-10548 Some of the debug sync waits do not work with InnoDB 5.7 (branch bb-10.2-jan)
enable tests that were fixed in MDEV-10549
MDEV-10548 Some of the debug sync waits do not work with InnoDB 5.7 (branch bb-10.2-jan)
fix main.innodb_mysql_sync - re-enable online alter for partitioned innodb tables
Contains also
MDEV-10547: Test multi_update_innodb fails with InnoDB 5.7
The failure happened because 5.7 has changed the signature of
the bool handler::primary_key_is_clustered() const
virtual function ("const" was added). InnoDB was using the old
signature which caused the function not to be used.
MDEV-10550: Parallel replication lock waits/deadlock handling does not work with InnoDB 5.7
Fixed mutexing problem on lock_trx_handle_wait. Note that
rpl_parallel and rpl_optimistic_parallel tests still
fail.
MDEV-10156 : Group commit tests fail on 10.2 InnoDB (branch bb-10.2-jan)
Reason: incorrect merge
MDEV-10550: Parallel replication can't sync with master in InnoDB 5.7 (branch bb-10.2-jan)
Reason: incorrect merge
This patch ports the work that facebook has performed
to make innochecksum handle compressed tables.
the basic idea is to use actual innodb-code to perform
checksum verification rather than duplicating in innochecksum.cc.
to make this work, innodb code has been annotated with
lots of #ifndef UNIV_INNOCHECKSUM so that it can be
compiled outside of storage/innobase.
A new testcase is also added that verifies that innochecksum
works on compressed/non-compressed tables.
Merged from commit fabc79d2ea976c4ff5b79bfe913e6bc03ef69d42
from https://code.google.com/p/google-mysql/
The actual steps to produce this patch are:
take innochecksum from 5.6.14
apply changes in innodb from facebook patches needed to make innochecksum compile
apply changes in innochecksum from facebook patches
add handcrafted testcase
The referenced facebook patches used are:
91e25120e7847fe76ea51135628a5a4dbf7c240c
Update InnoDB to 5.6.14
Apply MySQL-5.6 hack for MySQL Bug#16434374
Move Aria-only HA_RTREE_INDEX from my_base.h to maria_def.h (breaks an assert in InnoDB)
Fix InnoDB memory leak