This is basically port of WL6045:Improve Innochecksum with some
code refactoring on innochecksum.
Added page0size.h include from 10.2 to make 10.1 vrs 10.2 innochecksum
as identical as possible.
Added page 0 checksum checking and if that fails whole test fails.
Always read full page 0 to determine does tablespace contain
encryption metadata. Tablespaces that are page compressed or
page compressed and encrypted do not compare checksum as
it does not exists. For encrypted tables use checksum
verification written for encrypted tables and normal tables
use normal method.
buf_page_is_checksum_valid_crc32
buf_page_is_checksum_valid_innodb
buf_page_is_checksum_valid_none
Add Innochecksum logging to file
buf_page_is_corrupted
Remove ib_logf and page_warn_strict_checksum
calls in innochecksum compilation. Add innochecksum
logging to file.
fil0crypt.cc fil0crypt.h
Modify to be able to use in innochecksum compilation and
move fil_space_verify_crypt_checksum to end of the file.
Add innochecksum logging to file.
univ.i
Add innochecksum strict_verify, log_file and cur_page_num
variables as extern.
page_zip_verify_checksum
Add innochecksum logging to file.
innochecksum.cc
Lot of changes most notable able to read encryption
metadata from page 0 of the tablespace.
Added test case where we corrupt intentionally
FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION (encryption key version)
FIL_PAGE_FILE_FLUSH_LSN_OR_KEY_VERSION+4 (post encryption checksum)
FIL_DATA+10 (data)
Crashes with innodb_page_size=64K. Does not crash at <= 32K.
Problem was that when blob record that was earlier < 16k is
enlarged at update wo that length > 16K it should be stored
externally. However, that was not enforced when page-size = 64K
(note that 16K+1 < 64K/2 i.e. half of the btree leaf page).
btr_cur_optimistic_update: limit max record size to 16K
or in REDUNDANT row format to 16K-1.
In all InnoDB row formats, the pointers or lengths stored in the record
header can be at most 14 bits, that is, count up to 16383.
In ROW_FORMAT=REDUNDANT, this limits the maximum possible record length
to 16383 bytes. In other ROW_FORMAT, it could merely limit the maximum
length of variable-length fields.
When MySQL 5.7 introduced innodb_page_size=32k and 64k, the maximum
record length was limited to 16383 bytes (I hope 16383, not 16384,
to be able to distinguish from a record whose length is 0 bytes).
This change is present in MariaDB Server 10.2.
btr_cur_optimistic_update(): Restrict maximum record size to 16K-1
for REDUNDANT and 64K page size.
dict_index_too_big_for_tree(): The maximum allowed record size
is half a B-tree page or 16K(-1 for REDUNDANT) for 64K page size.
convert_error_code_to_mysql(): Fix error message to print
correct limits.
my_error_innodb(): Fix error message to print correct limits.
page_zip_rec_needs_ext() : record size was already restricted to 16K.
Restrict REDUNDANT to 16K-1.
rem0rec.h: Introduce REDUNDANT_REC_MAX_DATA_SIZE (16K-1)
and COMPRESSED_REC_MAX_DATA_SIZE (16K).
The option innodb_log_compressed_pages was contributed by
Facebook to MySQL 5.6. It was disabled in the 5.6.10 GA release
due to problems that were fixed in 5.6.11, which is when the
option was enabled.
The option was set to innodb_log_compressed_pages=ON by default
(disabling the feature), because safety was considered more
important than speed. The option innodb_log_compressed_pages=OFF
can *CORRUPT* ROW_FORMAT=COMPRESSED tables on crash recovery
if the zlib deflate function is behaving differently (producing
a different amount of compressed data) from how it behaved
when the redo log records were written (prior to the crash recovery).
In MDEV-6935, the default value was changed to
innodb_log_compressed_pages=OFF. This is inherently unsafe, because
there are very many different environments where MariaDB can be
running, using different zlib versions. While zlib can decompress
data just fine, there are no guarantees that different versions will
always compress the same data to the exactly same size. To avoid
problems related to zlib upgrades or version mismatch, we must
use a safe default setting.
This will reduce the write performance for users of
ROW_FORMAT=COMPRESSED tables. If you configure
innodb_log_compressed_pages=ON, please make sure that you will
always cleanly shut down InnoDB before upgrading the server
or zlib.
When using innodb_page_size=16k, InnoDB tables
that were created in MariaDB 10.1.0 to 10.1.20 with
PAGE_COMPRESSED=1 and
PAGE_COMPRESSION_LEVEL=2 or PAGE_COMPRESSION_LEVEL=3
would fail to load.
fsp_flags_is_valid(): When using innodb_page_size=16k, use a
more strict check for .ibd files, with the assumption that
nobody would try to use different-page-size files.
Comment from Codership:-
To fix the problem, we changed the certification logic in galera to treat insert
on child table row as exclusive to prevent any operation on referenced
parent table row. At the same time, update and delete on
child table row were demoted to "shared", which makes it possible to
update/delete referenced parent table row, but only in a later transaction.
This change allows somewhat more concurrency for foreign key constrained
transactions, but is still safe for correct certification end result.
log_calc_max_ages(): Use the requested size in the check, instead of
the detected redo log size. The redo log will be resized at startup
if it differs from what has been requested.
dict_table_t::thd: Remove. This was only used by btr_root_block_get()
for reporting decryption failures, and it was only assigned by
ha_innobase::open(), and never cleared. This could mean that if a
connection is closed, the pointer would become stale, and the server
could crash while trying to report the error. It could also mean
that an error is being reported to the wrong client. It is better
to use current_thd in this case, even though it could mean that if
the code is invoked from an InnoDB background operation, there would
be no connection to which to send the error message.
Remove dict_table_t::crypt_data and dict_table_t::page_0_read.
These fields were never read.
fil_open_single_table_tablespace(): Remove the parameter "table".
Cover innodb.table_flags with the new innodb_page_size.combinations
32k and 64k.
dict_sys_tables_type_validate(): Remove an assertion that made a
check in the function redundant. Remove the excessive output to
the error log, as the invalid SYS_TABLES.TYPE value is already being
output.
When a slow shutdown is performed soon after spawning some work for
background threads that can create or commit transactions, it is possible
that new transactions are started or committed after the purge has finished.
This is violating the specification of innodb_fast_shutdown=0, namely that
the purge must be completed. (None of the history of the recent transactions
would be purged.)
Also, it is possible that the purge threads would exit in slow shutdown
while there exist active transactions, such as recovered incomplete
transactions that are being rolled back. Thus, the slow shutdown could
fail to purge some undo log that becomes purgeable after the transaction
commit or rollback.
srv_undo_sources: A flag that indicates if undo log can be generated
or the persistent, whether by background threads or by user SQL.
Even when this flag is clear, active transactions that already exist
in the system may be committed or rolled back.
innodb_shutdown(): Renamed from innobase_shutdown_for_mysql().
Do not return an error code; the operation never fails.
Clear the srv_undo_sources flag, and also ensure that the background
DROP TABLE queue is empty.
srv_purge_should_exit(): Do not allow the purge to exit if
srv_undo_sources are active or the background DROP TABLE queue is not
empty, or in slow shutdown, if any active transactions exist
(and are being rolled back).
srv_purge_coordinator_thread(): Remove some previous workarounds
for this bug.
innobase_start_or_create_for_mysql(): Set buf_page_cleaner_is_active
and srv_dict_stats_thread_active directly. Set srv_undo_sources before
starting the purge subsystem, to prevent immediate shutdown of the purge.
Create dict_stats_thread and fts_optimize_thread immediately
after setting srv_undo_sources, so that shutdown can use this flag to
determine if these subsystems were started.
dict_stats_shutdown(): Shut down dict_stats_thread. Backported from 10.2.
srv_shutdown_table_bg_threads(): Remove (unused).
Problem appears to be that the function fsp_flags_try_adjust()
is being unconditionally invoked on every .ibd file on startup.
Based on performance investigation also the top function
fsp_header_get_crypt_offset() needs to addressed.
Ported implementation of fsp_header_get_encryption_offset()
function from 10.2 to fsp_header_get_crypt_offset().
Introduced a new function fil_crypt_read_crypt_data()
to read page 0 if it is not yet read.
fil_crypt_find_space_to_rotate(): Now that page 0 for every .ibd
file is not read on startup we need to check has page 0 read
from space that we investigate for key rotation, if it is not read
we read it.
fil_space_crypt_get_status(): Now that page 0 for every .ibd
file is not read on startup here also we need to read page 0
if it is not yet read it. This is needed
as tests use IS query to wait until background encryption
or decryption has finished and this function is used to
produce results.
fil_crypt_thread(): Add is_stopping condition for tablespace
so that we do not rotate pages if usage of tablespace should
be stopped. This was needed for failure seen on regression
testing.
fil_space_create: Remove page_0_crypt_read and extra
unnecessary info output.
fil_open_single_table_tablespace(): We call fsp_flags_try_adjust
only when when no errors has happened and server was not started
on read only mode and tablespace validation was requested or
flags contain other table options except low order bits to
FSP_FLAGS_POS_PAGE_SSIZE position.
fil_space_t::page_0_crypt_read removed.
Added test case innodb-first-page-read to test startup when
encryption is on and when encryption is off to check that not
for all tables page 0 is read on startup.
The doublewrite buffer pages must fit in the first InnoDB system
tablespace data file. The checks that were added in the initial patch
(commit 112b21da37)
were at too high level and did not cover all cases.
innodb.log_data_file_size: Test all innodb_page_size combinations.
fsp_header_init(): Never return an error. Move the change buffer creation
to the only caller that needs to do it.
btr_create(): Clean up the logic. Remove the error log messages.
buf_dblwr_create(): Try to return an error on non-fatal failure.
Check that the first data file is big enough for creating the
doublewrite buffers.
buf_dblwr_process(): Check if the doublewrite buffer is available.
Display the message only if it is available.
recv_recovery_from_checkpoint_start_func(): Remove a redundant message
about FIL_PAGE_FILE_FLUSH_LSN mismatch when crash recovery has already
been initiated.
fil_report_invalid_page_access(): Simplify the message.
fseg_create_general(): Do not emit messages to the error log.
innobase_init(): Revert the changes.
trx_rseg_create(): Refactor (no functional change).
commit 1af8bf39ca added unnecessary
calls to fil_write_flushed_lsn() during redo log resizing at
InnoDB server startup.
Because fil_write_flushed_lsn() is neither redo-logged nor doublewrite
buffered, the call is risky and should be avoided, because if the
server killed during the write call, the whole InnoDB instance can
become inaccessible (corrupted page 0 in the system tablespace).
In the best case, this call might prevent a diagnostic message from
being emitted to the error log on the next startup.
Rewrite the test encryption.innodb-checksum-algorithm not to
require any restarts or re-bootstrapping, and to cover all
innodb_page_size combinations.
Test innodb.101_compatibility with all innodb_page_size combinations.
Problem was that all doublewrite buffer pages must fit to first
system datafile.
Ported commit 27a34df7882b1f8ed283f22bf83e8bfc523cbfde
Author: Shaohua Wang <shaohua.wang@oracle.com>
Date: Wed Aug 12 15:55:19 2015 +0800
BUG#21551464 - SEGFAULT WHILE INITIALIZING DATABASE WHEN
INNODB_DATA_FILE SIZE IS SMALL
To 10.1 (with extended error printout).
btr_create(): If ibuf header page allocation fails report error and
return FIL_NULL. Similarly if root page allocation fails return a error.
dict_build_table_def_step: If fsp_header_init fails return
error code.
fsp_header_init: returns true if header initialization succeeds
and false if not.
fseg_create_general: report error if segment or page allocation fails.
innobase_init: If first datafile is smaller than 3M and could not
contain all doublewrite buffer pages report error and fail to
initialize InnoDB plugin.
row_truncate_table_for_mysql: report error if fsp header init
fails.
srv_init_abort: New function to report database initialization errors.
srv_undo_tablespaces_init, innobase_start_or_create_for_mysql: If
database initialization fails report error and abort.
trx_rseg_create: If segment header creation fails return.
Problem was that checksum check resulted false positives that page is
both not encrypted and encryted when checksum_algorithm was
strict_none.
Encrypton checksum will use only crc32 regardless of setting.
buf_zip_decompress: If compression fails report a error message
containing the space name if available (not available during import).
And note if space could be encrypted.
buf_page_get_gen: Do not assert if decompression fails,
instead unfix the page and return NULL to upper layer.
fil_crypt_calculate_checksum: Use only crc32 method.
fil_space_verify_crypt_checksum: Here we need to check
crc32, innodb and none method for old datafiles.
fil_space_release_for_io: Allow null space.
encryption.innodb-compressed-blob is now run with crc32 and none
combinations.
Note that with none and strict_none method there is not really
a way to detect page corruptions and page corruptions after
decrypting the page with incorrect key.
New test innodb-checksum-algorithm to test different checksum
algorithms with encrypted, row compressed and page compressed
tables.
Problem was that FIL_PAGE_FLUSH_LSN_OR_KEY_VERSION field that for
encrypted pages even in system datafiles should contain key_version
except very first page (0:0) is after encryption overwritten with
flush lsn.
Ported WL#7990 Repurpose FIL_PAGE_FLUSH_LSN to 10.1
The field FIL_PAGE_FLUSH_LSN_OR_KEY_VERSION is consulted during
InnoDB startup.
At startup, InnoDB reads the FIL_PAGE_FLUSH_LSN_OR_KEY_VERSION
from the first page of each file in the InnoDB system tablespace.
If there are multiple files, the minimum and maximum LSN can differ.
These numbers are passed to InnoDB startup.
Having the number in other files than the first file of the InnoDB
system tablespace is not providing much additional value. It is
conflicting with other use of the field, such as on InnoDB R-tree
index pages and encryption key_version.
This worklog will stop writing FIL_PAGE_FLUSH_LSN_OR_KEY_VERSION to
other files than the first file of the InnoDB system tablespace
(page number 0:0) when system tablespace is encrypted. If tablespace
is not encrypted we continue writing FIL_PAGE_FLUSH_LSN_OR_KEY_VERSION
to all first pages of system tablespace to avoid unnecessary
warnings on downgrade.
open_or_create_data_files(): pass only one flushed_lsn parameter
xb_load_tablespaces(): pass only one flushed_lsn parameter.
buf_page_create(): Improve comment about where
FIL_PAGE_FIL_FLUSH_LSN_OR_KEY_VERSION is set.
fil_write_flushed_lsn(): A new function, merged from
fil_write_lsn_and_arch_no_to_file() and
fil_write_flushed_lsn_to_data_files().
Only write to the first page of the system tablespace (page 0:0)
if tablespace is encrypted, or write all first pages of system
tablespace and invoke fil_flush_file_spaces(FIL_TYPE_TABLESPACE)
afterwards.
fil_read_first_page(): read flush_lsn and crypt_data only from
first datafile.
fil_open_single_table_tablespace(): Remove output of LSN, because it
was only valid for the system tablespace and the undo tablespaces, not
user tablespaces.
fil_validate_single_table_tablespace(): Remove output of LSN.
checkpoint_now_set(): Use fil_write_flushed_lsn and output
a error if operation fails.
Remove lsn variable from fsp_open_info.
recv_recovery_from_checkpoint_start(): Remove unnecessary second
flush_lsn parameter.
log_empty_and_mark_files_at_shutdown(): Use fil_writte_flushed_lsn
and output error if it fails.
open_or_create_data_files(): Pass only one flushed_lsn variable.
row_merge_write(): Pass the correct (possibly encrypted) buffer
to os_file_write_int_fd().
This bug was introduced in commit 65e1399e64
which included a commit to merge changes from MySQL 5.6.36 to
MariaDB Server 10.0.
btr_defragment_thread(): Create the thread in the same place as other
threads. Do not invoke btr_defragment_shutdown(), because
row_drop_tables_for_mysql_in_background() in the master thread can still
keep invoking btr_defragment_remove_table().
logs_empty_and_mark_files_at_shutdown(): Wait for btr_defragment_thread()
to exit.
innobase_start_or_create_for_mysql(), innobase_shutdown_for_mysql():
Skip encryption and scrubbing in innodb_read_only_mode.
srv_export_innodb_status(): Do not export encryption or scrubbing
statistics in innodb_read_only mode, because the threads will not
be running.
InnoDB shutdown assumes that once the server has entered
SRV_SHUTDOWN_FLUSH_PHASE, no change to persistent data is allowed.
It was possible for the master thread to wake up while shutdown
is executing in SRV_SHUTDOWN_FLUSH_PHASE or
even in SRV_SHUTDOWN_LAST_PHASE.
We do not yet know if further crashes at shutdown are possible.
Also, we do not know if all the observed crashes could be explained
by the race conditions that we are now fixing.
srv_shutdown_print_master_pending(): Remove a redundant ut_time() call.
srv_shutdown(): Renamed from srv_master_do_shutdown_tasks().
srv_master_thread(): Do not resume after shutdown has been initiated.
Before MDEV-6812, it did not matter that merge_files[].offset was
uninitialized when no files were created.
This problem was introduced in MDEV-6812. There could be a user-visible
impact that the progress reports spit into the error log are bogus.
row_merge_build_indexes(): Initialize merge_files[i].offset.
Snappy compression method require that output buffer
used for compression is bigger than input buffer.
Similarly lzo require additional work memory buffer.
Increase the allocated buffer accordingly.
buf_tmp_buffer_t: removed unnecessary lzo_mem, crypt_buf_free and
comp_buf_free.
buf_pool_reserve_tmp_slot: use alligned_alloc and if snappy
available allocate size based on snappy_max_compressed_length and
if lzo is available increase buffer by LZO1X_1_15_MEM_COMPRESS.
fil_compress_page: Remove unneeded lzo mem (we use same buffer)
and if output buffer is not yet allocated allocate based similarly
as above.
Decompression does not require additional work area.
Modify test to use same test as other compression method tests.
In commit 360a4a0372
some debug assertions were introduced to the page flushing code
in XtraDB. Add these assertions to InnoDB as well, and adjust
the InnoDB shutdown so that these assertions will not fail.
logs_empty_and_mark_files_at_shutdown(): Advance
srv_shutdown_state from the first phase SRV_SHUTDOWN_CLEANUP
only after no page-dirtying activity is possible
(well, except by srv_master_do_shutdown_tasks(), which will be
fixed separately in MDEV-12052).
rotate_thread_t::should_shutdown(): Already exit the key rotation
threads at the first phase of shutdown (SRV_SHUTDOWN_CLEANUP).
page_cleaner_sleep_if_needed(): Do not sleep during shutdown.
This change is originally from XtraDB.
Significantly reduce the amount of InnoDB, XtraDB and Mariabackup
code changes by defining pfs_os_file_t as something that is
transparently compatible with os_file_t.
available
lz4.cmake: Check if shared or static lz4 library has LZ4_compress_default
function and if it has define HAVE_LZ4_COMPRESS_DEFAULT.
fil_compress_page: If HAVE_LZ4_COMPRESS_DEFAULT is defined use
LZ4_compress_default function for compression if not use
LZ4_compress_limitedOutput function.
Introduced a innodb-page-compression.inc file for page compression
tests that will also search .ibd file to verify that pages
are compressed (i.e. used search string is not found). Modified
page compression tests to use this file.
Note that snappy method is not included because of MDEV-12615
InnoDB page compression method snappy mostly does not compress pages
that will be fixed on different commit.
This fixes warnings that were emitted when running InnoDB test
suites on a debug server that was compiled with GCC 7.1.0 using
the flags -O3 -fsanitize=undefined.
thd_requested_durability(): XtraDB can call this with trx->mysql_thd=NULL.
Remove the function in InnoDB, because it is not used there.
calc_row_difference(): Do not call memcmp(o_ptr, NULL, 0).
innobase_index_name_is_reserved(): This can be called with
key_info=NULL, num_of_keys=0.
innobase_dropping_foreign(), innobase_check_foreigns_low(),
innobase_check_foreigns(): This can be called with
drop_fk=NULL, n_drop_fk=0.
rec_convert_dtuple_to_rec_comp(): Do not invoke memcpy(end, NULL, 0).
On 64-bit systems, the constant 1 would be 32-bit (int or unsigned)
by default. Cast the constant to ulint before shifting to avoid a
-fsanitize=undefined warning or any potential overflow.
Fix a -fsanitizer=undefined warning that trx_undo_report_row_operation()
was being passed thr=NULL when the BTR_NO_UNDO_LOG_FLAG flag was set.
trx_undo_report_row_operation(): Remove the first two parameters.
The parameter clust_entry!=NULL distinguishes inserts from updates.
This should be a non-functional change (no observable change in
behaviour; slightly smaller code).
Allocate srv_sys statically so that the desired alignment can be
guaranteed. This silences -fsanitize=undefined warnings.
There probably is no performance impact of this, because the
reason for the alignment to ensure the absence of false sharing
between counters. Even with the misalignment, each counter would
have been been aligned at 64 bits, and the counters would reside
in separate cache lines.
The parameter thr of the function btr_cur_optimistic_insert()
is not declared as nonnull, but GCC 7.1.0 with -O3 is wrongly
optimizing away the first part of the condition
UNIV_UNLIKELY(thr && thr_get_trx(thr)->fake_changes)
when the function is being called by row_merge_insert_index_tuples()
with thr==NULL.
The fake_changes is an XtraDB addition. This GCC bug only appears
to have an impact on XtraDB, not InnoDB.
We work around the problem by not attempting to dereference thr
when both BTR_NO_LOCKING_FLAG and BTR_NO_UNDO_LOG_FLAG are set
in the flags. Probably BTR_NO_LOCKING_FLAG alone should suffice.
btr_cur_optimistic_insert(), btr_cur_pessimistic_insert(),
btr_cur_pessimistic_update(): Correct comments that disagree with
usage and with nonnull attributes. No other parameter than thr can
actually be NULL.
row_ins_duplicate_error_in_clust(): Remove an unused parameter.
innobase_is_fake_change(): Unused function; remove.
ibuf_insert_low(), row_log_table_apply(), row_log_apply(),
row_undo_mod_clust_low():
Because we will be passing BTR_NO_LOCKING_FLAG | BTR_NO_UNDO_LOG_FLAG
in the flags, the trx->fake_changes flag will be treated as false,
which is the right thing to do at these low-level operations
(change buffer merge, ALTER TABLE…LOCK=NONE, or ROLLBACK).
This might be fixing actual XtraDB bugs.
Other callers that pass these two flags are also passing thr=NULL,
implying fake_changes=false. (Some callers in ROLLBACK are passing
BTR_NO_LOCKING_FLAG and a nonnull thr. In these callers, fake_changes
better be false, to avoid corruption.)
This merge reverts commit 6ca4f693c1ce472e2b1bf7392607c2d1124b4293
from current 5.6.36 innodb.
Bug #23481444 OPTIMISER CALL ROW_SEARCH_MVCC() AND READ THE
INDEX APPLIED BY UNCOMMITTED ROW
Problem:
========
row_search_for_mysql() does whole table traversal for range query
even though the end range is passed. Whole table traversal happens
when the record is not with in transaction read view.
Solution:
=========
Convert the innodb last record of page to mysql format and compare
with end range if the traversal of row_search_mvcc() exceeds 100,
no ICP involved. If it is out of range then InnoDB can avoid the
whole table traversal. Need to refactor the code little bit to
make it compile.
Reviewed-by: Jimmy Yang <jimmy.yang@oracle.com>
Reviewed-by: Knut Hatlen <knut.hatlen@oracle.com>
Reviewed-by: Dmitry Shulga <dmitry.shulga@oracle.com>
RB: 14660
The macro UT_LIST_INIT() zero-initializes the UT_LIST_NODE.
There is no need to call this macro on a buffer that has
already been zero-initialized by mem_zalloc() or mem_heap_zalloc()
or similar.
For some reason, the statement UT_LIST_INIT(srv_sys->tasks) in
srv_init() caused a SIGSEGV on server startup when compiling with
GCC 7.1.0 for AMD64 using -O3. The zero-initialization was attempted
by the instruction movaps %xmm0,0x50(%rax), while the proper offset
of srv_sys->tasks would seem to have been 0x48.
Do not silence uncertain cases, or fix any bugs.
The only functional change should be that ha_federated::extra()
is not calling DBUG_PRINT to report an unhandled case for
HA_EXTRA_PREPARE_FOR_DROP.