Problem: We created more than 5 encryption keys for redo-logs.
Idea was that we do not anymore create more than one encryption
key for redo-logs but if existing checkpoint from earlier
MariaDB contains more keys, we should read all of them.
Fix: Add new encryption key to memory structure only if there
currently has none or if we are reading checkpoint from the log.
Checkpoint from older MariaDB version could contain more than
one key.
Using __ppc_get_timebase will translate to mfspr instruction
The mfspr instruction will block FXU1 until complete but the other
Pipelines are available for execution of instructions from other
SMT threads on the same core.
The latency time to read the timebase SPR is ~10 cycles.
So any impact on other threads is limited other FXU1 only instructions
(basically other mfspr/mtspr ops).
Suggested by Steven J. Munroe, Linux on Power Toolchain Architect,
Linux Technology Center
IBM Corporation
Bug#18842925 : SET THREAD PRIORITY IN INNODB MUTEX SPINLOOP
Like "pause" instruction for hyper-threading at Intel CPUs,
POWER has special instructions only for hinting priority of hardware-threads.
Approved by Sunny in rb#6256
Backport of the 5.7 fix - c92102a6ef
(excluded cache line size patch)
Suggestion by Stewart Smith
UT_RELAX_CPU(): Use a compiler barrier.
ut_delay(): Remove the dummy global variable ut_always_false.
RB: 11399
Reviewed-by: Jimmy Yang <jimmy.yang@oracle.com>
Backported from MySQL-5.7 - patch 5e3efb0396
Suggestion by Stewart Smith
Make sure that we read all possible encryption keys from checkpoint
and if log block checksum does not match, print all found
checkpoint encryption keys.
Problem was that link file (.isl) is also opened using O_DIRECT
mode and if this fails the whole create table fails on internal
error.
Fixed by not using O_DIRECT on link files as they are used only
on create table and startup and do not contain real data.
O_DIRECT failures are successfully ignored for data files
if O_DIRECT is not supported by file system on used
data directory.
Analysis:
-- InnoDB has n (>0) redo-log files.
-- In the first page of redo-log there is 2 checkpoint records on fixed location (checkpoint is not encrypted)
-- On every checkpoint record there is up to 5 crypt_keys containing the keys used for encryption/decryption
-- On crash recovery we read all checkpoints on every file
-- Recovery starts by reading from the latest checkpoint forward
-- Problem is that latest checkpoint might not always contain the key we need to decrypt all the
redo-log blocks (see MDEV-9422 for one example)
-- Furthermore, there is no way to identify is the log block corrupted or encrypted
For example checkpoint can contain following keys :
write chk: 4 [ chk key ]: [ 5 1 ] [ 4 1 ] [ 3 1 ] [ 2 1 ] [ 1 1 ]
so over time we could have a checkpoint
write chk: 13 [ chk key ]: [ 14 1 ] [ 13 1 ] [ 12 1 ] [ 11 1 ] [ 10 1 ]
killall -9 mysqld causes crash recovery and on crash recovery we read as
many checkpoints as there is log files, e.g.
read [ chk key ]: [ 13 1 ] [ 12 1 ] [ 11 1 ] [ 10 1 ] [ 9 1 ]
read [ chk key ]: [ 14 1 ] [ 13 1 ] [ 12 1 ] [ 11 1 ] [ 10 1 ] [ 9 1 ]
This is problematic, as we could still scan log blocks e.g. from checkpoint 4 and we do
not know anymore the correct key.
CRYPT INFO: for checkpoint 14 search 4
CRYPT INFO: for checkpoint 13 search 4
CRYPT INFO: for checkpoint 12 search 4
CRYPT INFO: for checkpoint 11 search 4
CRYPT INFO: for checkpoint 10 search 4
CRYPT INFO: for checkpoint 9 search 4 (NOTE: NOT FOUND)
For every checkpoint, code generated a new encrypted key based on key
from encryption plugin and random numbers. Only random numbers are
stored on checkpoint.
Fix: Generate only one key for every log file. If checkpoint contains only
one key, use that key to encrypt/decrypt all log blocks. If checkpoint
contains more than one key (this is case for databases created
using MariaDB server version 10.1.0 - 10.1.12 if log encryption was
used). If looked checkpoint_no is found from keys on checkpoint we use
that key to decrypt the log block. For encryption we use always the
first key. If the looked checkpoint_no is not found from keys on checkpoint
we use the first key.
Modified code also so that if log is not encrypted, we do not generate
any empty keys. If we have a log block and no keys is found from
checkpoint we assume that log block is unencrypted. Log corruption or
missing keys is found by comparing log block checksums. If we have
a keys but current log block checksum is correct we again assume
log block to be unencrypted. This is because current implementation
stores checksum only before encryption and new checksum after
encryption but before disk write is not stored anywhere.
Make sure that on decrypt we do not try to reference
NULL pointer and if page contains undefined
FIL_PAGE_FILE_FLUSH_LSN field on when page is not
the first page or page is not in system tablespace,
clear it.
In row_search_for_mysql function on XtraDB there was a old logic
where null bytes were inited. This caused server to think that
key value is null and continue on incorrect path.
There was two problems. Firstly, if page in ibuf is encrypted but
decrypt failed we should not allow InnoDB to start because
this means that system tablespace is encrypted and not usable.
Secondly, if page decrypt is detected we should return false
from buf_page_decrypt_after_read.
MDEV-9469: 'Incorrect key file' on ALTER TABLE
InnoDB needs to rebuild table if column name is changed and
added index (or foreign key) is created based on this new
name in same alter table.
Fix the doubly questional fix for MySQL Bug#17250787:
* it detected autoinc index by looking for the first index
that starts from autoinc column. never mind one column
can be part of many indexes.
* it used autoinc_field->field_index to look up into internal
innodb dictionary. But field_index accounts for virtual
columns too, while innodb dictionary ignores them.
Find the index by its name, like elsewhere in ha_innobase.
At alter table when server renames the table to temporal name,
old name uses normal partioned table naming rules. However,
if tables are created on Windows and then transfered to Linux
and lower-case-table-names=1 we should modify the old name
on rename to lower case to be able to find it from the
InnoDB dictionary.
At alter table when server renames the table to temporal name,
old name uses normal partioned table naming rules. However,
if tables are created on Windows and then transfered to Linux
and lower-case-table-names=1 we should modify the old name
on rename to lower case to be able to find it from the
InnoDB dictionary.
Provided IBM System Z have outdated compiler version, which supports gcc sync
builtins but not gcc atomic builtins. It also has weak memory model.
InnoDB attempted to verify if __sync_lock_test_and_set() is available by
checking IB_STRONG_MEMORY_MODEL. This macro has nothing to do with availability
of __sync_lock_test_and_set(), the right one is HAVE_ATOMIC_BUILTINS.
Provided IBM System Z have outdated compiler version, which supports gcc sync
builtins but not gcc atomic builtins. It also has weak memory model.
InnoDB attempted to verify if __sync_lock_test_and_set() is available by
checking IB_STRONG_MEMORY_MODEL. This macro has nothing to do with availability
of __sync_lock_test_and_set(), the right one is HAVE_ATOMIC_BUILTINS.
In wsrep BF we have already took lock_sys and trx
mutex either on wsrep_abort_transaction() or
before wsrep_kill_victim(). In replication we
could own lock_sys mutex taken in
lock_deadlock_check_and_resolve().
Analysis: We have reserved ROW_MERGE_RESERVE_SIZE ( = 4) for
encryption key_version. When calculating is there more
space on sort buffer, this value needs to be substracted
from current available space.
Backport pull request #125 from grooverdan/MDEV-8923_innodb_buffer_pool_dump_pct to 10.0
WL#6504 InnoDB buffer pool dump/load enchantments
This patch consists of two parts:
1. Dump only the hottest N% of the buffer pool(s)
2. Prevent hogging the server duing BP load
From MySQL - commit b409342c43ce2edb68807100a77001367c7e6b8e
Add testcases for innodb_buffer_pool_dump_pct_basic.
Part of the code authored by Daniel Black
- Make accelerated checksum available to InnoDB and XtraDB.
- Fall back to slice-by-eight if not available. The mode used is printed on startup.
- Will only build on POWER systems at the moment until CMakeLists are modified
to only add the crc32_power8/ files when building on POWER.
running MySQL-5.7 unittest/gunit/innodb/ut0crc32-t
Before:
1..2
Using software crc32 implementation, CPU is little-endian
ok 1
Using software crc32 implementation, CPU is little-endian
normal CRC32: real 0.148006 sec
normal CRC32: user 0.148000 sec
normal CRC32: sys 0.000000 sec
big endian CRC32: real 0.144293 sec
big endian CRC32: user 0.144000 sec
big endian CRC32: sys 0.000000 sec
ok 2
After:
1..2
Using POWER8 crc32 implementation, CPU is little-endian
ok 1
Using POWER8 crc32 implementation, CPU is little-endian
normal CRC32: real 0.008097 sec
normal CRC32: user 0.008000 sec
normal CRC32: sys 0.000000 sec
big endian CRC32: real 0.147043 sec
big endian CRC32: user 0.144000 sec
big endian CRC32: sys 0.000000 sec
ok 2
Author CRC32 ASM code: Anton Blanchard <anton@au.ibm.com>
ref: https://github.com/antonblanchard/crc32-vpmsum
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
As galera node (slave) received query log events from an async
replication master, it partially wrote the updates made to replication
state table (mysql.gtid_slave_pos) to galera transaction writeset post
TOI. As a result, the transaction handle, thus created within galera,
was never freed/purged as the corresponding trx did not commit.
Thus, it kept piling up for every query log event and was only reclaimed
upon server shutdown when the transaction map object got destructed.
Fixed by making sure that updates in replication slave state table
are not written to galera transaction writeset and thus, not replicated
to other nodes.
while according to Storage Engine API column names should be compared
case insensitively. This can cause FRM and InnoDB data dictionary to
go out of sync.
Called thd_progress_init() from several threads used for FT-index
creation. For FT-indexes, need better way to report progress,
remove current one for them.
Correct InnoDB calls to progress report API:
- second argument of thd_progress_init() is number of stages, one stage is
enough for the row merge
- third argument of thd_progress_report() is some value indicating threshold,
for the row merge it is file->offset
- second argument of thd_progress_report() is some value indicating current
state, for the row merge it is file->offset - num_runs.
fix innodb auto-increment handling
three bugs:
1. innobase_next_autoinc treated the case of current<offset incorrectly
2. ha_innobase::get_auto_increment didn't recalculate current when increment changed
3. ha_innobase::get_auto_increment didn't pass offset down to innobase_next_autoinc
Analysis: There were two problems. (1) if partition table was
created using lower_case_tables = 1 on windows we did find the
correct table but we did not set share->ib_table correctly.
(2) we did open table on dictionary but did not increase
mysql_open_tables.
Fix: In xtradb allow access to tables with incorrect
lower case names (warning is printed to error log). If
table is opened increase mysql_open_tables count to avoid
crash on flush tables.
Analysis: debug only assertion I_S function (IS is XtraDB feature) is calling
buf_block_get_frame on any page it reads, which debug-asserts that the page is
buffer-fixed, which is not the case in I_S query.
Fixed by holding the buffer page mutex while the fields are read directly.
WL#6504 InnoDB buffer pool dump/load enchantments
This patch consists of two parts:
1. Dump only the hottest N% of the buffer pool(s)
2. Prevent hogging the server duing BP load
From MySQL - commit b409342c43ce2edb68807100a77001367c7e6b8e
- Added some extra command to rpl_start_stop to ensure that the
IO thread has connected to the master before we shut down the server.
- if signal returns signalhandler_t, use this with the alarm code
- Added missing tests to sys_vars
- Fixed some possible overflow bugs in tabxml.cpp
Analysis: Lengths which are not UNIV_SQL_NULL, but bigger than the following
number indicate that a field contains a reference to an externally
stored part of the field in the tablespace. The length field then
contains the sum of the following flag and the locally stored len.
This was incorrectly set to
define UNIV_EXTERN_STORAGE_FIELD (UNIV_SQL_NULL - UNIV_PAGE_SIZE_MAX)
When it should be
define UNIV_EXTERN_STORAGE_FIELD (UNIV_SQL_NULL - UNIV_PAGE_SIZE_DEF)
Additionally, we need to disable support for > 16K page size for
row compressed tables because a compressed page directory entry
reserves 14 bits for the start offset and 2 bits for flags.
This limits the uncompressed page size to 16k. To support
larger pages page directory entry needs to be larger.
Analysis: Current implementation will write and read at least one block
(sort_buffer_size bytes) from disk / index even if that block does not
contain any records.
Fix: Avoid writing / reading empty blocks to temporary files (disk).
Analysis: We are alreading holing lock_sys mutex when we call thd::awake.
This could lead mutex deadlock if trx->current_lock_mutex_owner is not
correctly set.
Fix: Make sure that trx->current_lock_mutex_owner is correctly set.
Analysis: Problem is that punch hole does not know the actual page size
of the page and does the page belong to an data file or to a log file.
Fix: Pass down the file type and page size to os layer to be used
when trim is called. Also fix unsafe null pointer access to
actual write_size.
Analysis: When a page is read from encrypted table and page can't be
decrypted because of bad key (or incorrect encryption algorithm or
method) page was incorrectly left on buffer pool.
Fix: Remove page from buffer pool and from pending IO.
Folloup: Made encryption rules too strict (and incorrect). Allow creating
table with ENCRYPTED=OFF with all values of ENCRYPTION_KEY_ID but create
warning that nondefault values are ignored. Allow creating table with
ENCRYPTED=DEFAULT if used key_id is found from key file (there was
bug on this) and give error if key_id is not found.
Analysis: Problem sees to be the fact that we allow creating or altering
table to use encryption_key_id that does not exists in case where
original table is not encrypted currently. Secondly we should not
do key rotation to tables that are not encrypted or tablespaces
that can't be found from tablespace cache.
Fix: Do not allow creating unencrypted table with nondefault encryption key
and do not rotate tablespaces that are not encrypted (FIL_SPACE_ENCRYPTION_OFF)
or can't be found from tablespace cache.
Added encryption support for online alter table where InnoDB temporary
files are used. Added similar support also for tables containing
full text-indexes.
Made sure that table remains encrypted during discard and import
tablespace.
Analysis: Server tried to continue reading tablespace using a cursor after
we had resolved that pages in the tablespace can't be decrypted.
Fixed by addind check is tablespace still encrypted.
Analysis: Problem was that in fil_read_first_page we do find that
table has encryption information and that encryption service
or used key_id is not available. But, then we just printed
fatal error message that causes above assertion.
Fix: When we open single table tablespace if it has encryption
information (crypt_data) store this crypt data to the table
structure. When we open a table and we find out that tablespace
is not available, check has table a encryption information
and from there is encryption service or used key_id is not available.
If it is, add additional warning for SQL-layer.
Analysis: Problem was that in fil_read_first_page we do find that
table has encryption information and that encryption service
or used key_id is not available. But, then we just printed
fatal error message that causes above assertion.
Fix: When we open single table tablespace if it has encryption
information (crypt_data) store this crypt data to the table
structure. When we open a table and we find out that tablespace
is not available, check has table a encryption information
and from there is encryption service or used key_id is not available.
If it is, add additional warning for SQL-layer.
Instead of encrypt(src, dst, key, iv) that encrypts all
data in one go, now we have encrypt_init(key,iv),
encrypt_update(src,dst), and encrypt_finish(dst).
This also causes collateral changes in the internal my_crypt.cc
encryption functions and in the encryption service.
There are wrappers to provide the old all-at-once encryption
functionality. But binlog events are often written piecewise,
they'll need the new api.
Fix all cmake tests (including plugin) to use
MY_CHECK_AND_SET_COMPILER_FLAG. And fix that function
to be compatible with cmake 3.0. This way flag checks
are correctly cached (even in cmake 3.0) and properly reused.
When wsrep is enabled, for any update on innodb tables, the
corresponding keys are appended to galera's transaction writeset
(wsrep_append_keys()). However, for LOAD DATA, this got skipped
if binary logging was disabled or it was non-ROW based.
As a result, while the updates from LOAD DATA on non-partitioned
tables replicated fine as wsrep implicitly enables binary logging
(if not enabled, explicitly), the same did not work on partitioned
tables as for partitioned tables the binary logging gets disabled
temporarily (ha_partition::write_row()).
Fixed by removing the unwanted conditions from the check.
Also backported some changes from 10.0-galera to make sure
wsrep_load_data_splitting affects LOAD DATA commands only.