InnoDB divides the allocation of undo logs into rollback segments.
The DB_ROLL_PTR system column of clustered indexes can address up to
128 rollback segments (TRX_SYS_N_RSEGS). Originally, InnoDB only
created one rollback segment. In MySQL 5.5 or in the InnoDB Plugin
for MySQL 5.1, all 128 rollback segments were created.
MySQL 5.7 hard-codes the rollback segment IDs 1..32 for temporary undo logs.
On upgrade, unless a slow shutdown (innodb_fast_shutdown=0)
was performed on the old server instance, these rollback segments
could be in use by transactions that are in XA PREPARE state or
transactions that were left behind by a server kill followed by a
normal shutdown immediately after restart.
Persistent tables cannot refer to temporary undo logs or vice versa.
Therefore, we should keep two distinct sets of rollback segments:
one for persistent tables and another for temporary tables. In this way,
all 128 rollback segments will be available for both types of tables,
which could improve performance. Also, MariaDB 10.2 will remain more
compatible than MySQL 5.7 with data files from earlier versions of
MySQL or MariaDB.
trx_sys_t::temp_rsegs[TRX_SYS_N_RSEGS]: A new array of temporary
rollback segments. The trx_sys_t::rseg_array[TRX_SYS_N_RSEGS] will
be solely for persistent undo logs.
srv_tmp_undo_logs. Remove. Use the constant TRX_SYS_N_RSEGS.
srv_available_undo_logs: Change the type to ulong.
trx_rseg_get_on_id(): Remove. Instead, let the callers refer to
trx_sys directly.
trx_rseg_create(), trx_sysf_rseg_find_free(): Remove unneeded parameters.
These functions only deal with persistent undo logs.
trx_temp_rseg_create(): New function, to create all temporary rollback
segments at server startup.
trx_rseg_t::is_persistent(): Determine if the rollback segment is for
persistent tables.
trx_sys_is_noredo_rseg_slot(): Remove. The callers must know based on
context (such as table handle) whether the DB_ROLL_PTR is referring to
a persistent undo log.
trx_sys_create_rsegs(): Remove all parameters, which were always passed
as global variables. Instead, modify the global variables directly.
enum trx_rseg_type_t: Remove.
trx_t::get_temp_rseg(): A method to ensure that a temporary
rollback segment has been assigned for the transaction.
trx_t::assign_temp_rseg(): Replaces trx_assign_rseg().
trx_purge_free_segment(), trx_purge_truncate_rseg_history():
Remove the redundant variable noredo=false.
Temporary undo logs are discarded immediately at transaction commit
or rollback, not lazily by purge.
trx_purge_mark_undo_for_truncate(): Remove references to the
temporary rollback segments.
trx_purge_mark_undo_for_truncate(): Remove a check for temporary
rollback segments. Only the dedicated persistent undo log tablespaces
can be truncated.
trx_undo_get_undo_rec_low(), trx_undo_get_undo_rec(): Add the
parameter is_temp.
trx_rseg_mem_restore(): Split from trx_rseg_mem_create().
Initialize the undo log and the rollback segment from the file
data structures.
trx_sysf_get_n_rseg_slots(): Renamed from
trx_sysf_used_slots_for_redo_rseg(). Count the persistent
rollback segment headers that have been initialized.
trx_sys_close(): Also free trx_sys->temp_rsegs[].
get_next_redo_rseg(): Merged to trx_assign_rseg_low().
trx_assign_rseg_low(): Remove the parameters and access the
global variables directly. Revert to simple round-robin, now that
the whole trx_sys->rseg_array[] is for persistent undo log again.
get_next_noredo_rseg(): Moved to trx_t::assign_temp_rseg().
srv_undo_tablespaces_init(): Remove some parameters and use the
global variables directly. Clarify some error messages.
Adjust the test innodb.log_file. Apparently, before these changes,
InnoDB somehow ignored missing dedicated undo tablespace files that
are pointed by the TRX_SYS header page, possibly losing part of
essential transaction system state.
MDEV-11581: Mariadb starts InnoDB encryption threads
when key has not changed or data scrubbing turned off
Background: Key rotation is based on background threads
(innodb-encryption-threads) periodically going through
all tablespaces on fil_system. For each tablespace
current used key version is compared to max key age
(innodb-encryption-rotate-key-age). This process
naturally takes CPU. Similarly, in same time need for
scrubbing is investigated. Currently, key rotation
is fully supported on Amazon AWS key management plugin
only but InnoDB does not have knowledge what key
management plugin is used.
This patch re-purposes innodb-encryption-rotate-key-age=0
to disable key rotation and background data scrubbing.
All new tables are added to special list for key rotation
and key rotation is based on sending a event to
background encryption threads instead of using periodic
checking (i.e. timeout).
fil0fil.cc: Added functions fil_space_acquire_low()
to acquire a tablespace when it could be dropped concurrently.
This function is used from fil_space_acquire() or
fil_space_acquire_silent() that will not print
any messages if we try to acquire space that does not exist.
fil_space_release() to release a acquired tablespace.
fil_space_next() to iterate tablespaces in fil_system
using fil_space_acquire() and fil_space_release().
Similarly, fil_space_keyrotation_next() to iterate new
list fil_system->rotation_list where new tables.
are added if key rotation is disabled.
Removed unnecessary functions fil_get_first_space_safe()
fil_get_next_space_safe()
fil_node_open_file(): After page 0 is read read also
crypt_info if it is not yet read.
btr_scrub_lock_dict_func()
buf_page_check_corrupt()
buf_page_encrypt_before_write()
buf_merge_or_delete_for_page()
lock_print_info_all_transactions()
row_fts_psort_info_init()
row_truncate_table_for_mysql()
row_drop_table_for_mysql()
Use fil_space_acquire()/release() to access fil_space_t.
buf_page_decrypt_after_read():
Use fil_space_get_crypt_data() because at this point
we might not yet have read page 0.
fil0crypt.cc/fil0fil.h: Lot of changes. Pass fil_space_t* directly
to functions needing it and store fil_space_t* to rotation state.
Use fil_space_acquire()/release() when iterating tablespaces
and removed unnecessary is_closing from fil_crypt_t. Use
fil_space_t::is_stopping() to detect when access to
tablespace should be stopped. Removed unnecessary
fil_space_get_crypt_data().
fil_space_create(): Inform key rotation that there could
be something to do if key rotation is disabled and new
table with encryption enabled is created.
Remove unnecessary functions fil_get_first_space_safe()
and fil_get_next_space_safe(). fil_space_acquire()
and fil_space_release() are used instead. Moved
fil_space_get_crypt_data() and fil_space_set_crypt_data()
to fil0crypt.cc.
fsp_header_init(): Acquire fil_space_t*, write crypt_data
and release space.
check_table_options()
Renamed FIL_SPACE_ENCRYPTION_* TO FIL_ENCRYPTION_*
i_s.cc: Added ROTATING_OR_FLUSHING field to
information_schema.innodb_tablespace_encryption
to show current status of key rotation.
Remove srv_win_file_flush_method
- Rename srv_unix_file_flush_method to srv_file_flush_method, and
rename constants to remove UNIX from them, i.e SRV_UNIX_FSYNC=>SRV_FSYNC
- Add SRV_ALL_O_DIRECT_FSYNC corresponding to current Windows default
(no buffering for either log or data, flush on both log and data)
- change os_file_open on Windows to behave identically to Unix wrt
O_DIRECT and O_DSYNC settings. map O_DIRECT to FILE_FLAG_NO_BUFFERING and
O_DSYNC to FILE_FLAG_WRITE_THROUGH
- remove various #ifdef _WIN32
MDEV-7618 introduced configuration parameter innodb_instrument_semaphores
in MariaDB Server 10.1. The parameter seems to only affect the rw-lock
X-latch acquisition. Extra fields are added to rw_lock_t to remember one
X-latch holder or waiter. These fields are not being consulted or reported
anywhere. This is basically only adding code bloat.
If the intention is to debug hangs or deadlocks, we have better tools for
that in the debug server, and for the non-debug server, core dumps can
reveal a lot. For example, the mini-transaction memo records the
currently held buffer block or index rw-locks, to be released at
mtr_t::commit().
The configuration parameter innodb_instrument_semaphores will be
deprecated in 10.2.5 and removed in 10.3.0.
rw_lock_t: Remove the members lock_name, file_name, line, thread_id
which did not affect any output.
The function trx_purge_stop() was calling os_event_reset(purge_sys->event)
before calling rw_lock_x_lock(&purge_sys->latch). The os_event_set()
call in srv_purge_coordinator_suspend() is protected by that X-latch.
It would seem a good idea to consistently protect both os_event_set()
and os_event_reset() calls with a common mutex or rw-lock in those
cases where os_event_set() and os_event_reset() are used
like condition variables, tied to changes of shared state.
For each os_event_t, we try to document the mutex or rw-lock that is
being used. For some events, frequent calls to os_event_set() seem to
try to avoid hangs. Some events are never waited for infinitely, only
timed waits, and os_event_set() is used for early termination of these
waits.
os_aio_simulated_put_read_threads_to_sleep(): Define as a null macro
on other systems than Windows. TODO: remove this altogether and disable
innodb_use_native_aio on Windows.
os_aio_segment_wait_events[]: Initialize only if innodb_use_native_aio=0.
log_write_flush_to_disk_low(): Invoke log_mutex_enter() at the end, to
avoid race conditions when changing the system state. (No potential
race condition existed before MySQL 5.7.)
The function trx_purge_stop() was calling os_event_reset(purge_sys->event)
before calling rw_lock_x_lock(&purge_sys->latch). The os_event_set()
call in srv_purge_coordinator_suspend() is protected by that X-latch.
It would seem a good idea to consistently protect both os_event_set()
and os_event_reset() calls with a common mutex or rw-lock in those
cases where os_event_set() and os_event_reset() are used
like condition variables, tied to changes of shared state.
For each os_event_t, we try to document the mutex or rw-lock that is
being used. For some events, frequent calls to os_event_set() seem to
try to avoid hangs. Some events are never waited for infinitely, only
timed waits, and os_event_set() is used for early termination of these
waits.
os_aio_simulated_put_read_threads_to_sleep(): Define as a null macro
on other systems than Windows. TODO: remove this altogether and disable
innodb_use_native_aio on Windows.
os_aio_segment_wait_events[]: Initialize only if innodb_use_native_aio=0.
Ever since MDEV-5800 enabled indexed virtual columns for InnoDB,
the InnoDB shutdown relied on close_connections() that would set
thd->killed for the InnoDB purge threads. Alas, the embedded server
shutdown is not invoking close_connections(), and thus InnoDB purge
threads fail to initiate shutdown, causing a hang.
innodb_inited: Remove. Use srv_was_started instead.
innobase_fast_shutdown: Remove. Use srv_fast_shutdown instead.
srv_running: Renamed from thd_destructor_myvar, and made global.
The value NULL means that shutdown was requested or the purge threads
should not be running because of innodb_read_only_mode=1.
innobase_init(): Set srv_was_started after ensuring that srv_running
was initialized. (In innodb_read_only mode, the purge threads are not
started and we do not care if srv_running==NULL.)
innobase_start_or_create_for_mysql(): Do not set srv_was_started.
Let it be set by the only caller innobase_init().
srv_purge_should_exit(): Check also srv_was_started and srv_running
when evaluating thd->killed.
Remove the debug parameter innodb_force_recovery_crash that was
introduced into MySQL 5.6 by me in WL#6494 which allowed InnoDB
to resize the redo log on startup.
Let innodb.log_file_size actually start up the server, but ensure
that the InnoDB storage engine refuses to start up in each of the
scenarios.
srv_release_threads(): Actually wait for the threads to resume
from suspension. On CentOS 5 and possibly other platforms,
os_event_set() may be lost.
srv_resume_thread(): A counterpart of srv_suspend_thread().
Optionally wait for the event to be set, optionally with a timeout,
and then release the thread from suspension.
srv_free_slot(): Unconditionally suspend the thread. It is always
in resumed state when this function is entered.
srv_active_wake_master_thread_low(): Only call os_event_set().
srv_purge_coordinator_suspend(): Use srv_resume_thread() instead
of the complicated logic.
srv_release_threads(): Actually wait for the threads to resume
from suspension. On CentOS 5 and possibly other platforms,
os_event_set() may be lost.
srv_resume_thread(): A counterpart of srv_suspend_thread().
Optionally wait for the event to be set, optionally with a timeout,
and then release the thread from suspension.
srv_free_slot(): Unconditionally suspend the thread. It is always
in resumed state when this function is entered.
srv_active_wake_master_thread_low(): Only call os_event_set().
srv_purge_coordinator_suspend(): Use srv_resume_thread() instead
of the complicated logic.
This fixes memory leaks in tests that cause InnoDB startup to fail.
buf_pool_free_instance(): Also free buf_pool->flush_rbt, which would
normally be freed when crash recovery finishes.
fil_node_close_file(), fil_space_free_low(), fil_close_all_files():
Relax some debug assertions to tolerate !srv_was_started.
innodb_shutdown(): Renamed from innobase_shutdown_for_mysql().
Changed the return type to void. Do not assume that all subsystems
were started.
que_init(), que_close(): Remove (empty functions).
srv_init(), srv_general_init(): Remove as global functions.
srv_free(): Allow srv_sys=NULL.
srv_get_active_thread_type(): Only return SRV_PURGE if purge really
is running.
srv_shutdown_all_bg_threads(): Do not reset srv_start_state. It will
be needed by innodb_shutdown().
innobase_start_or_create_for_mysql(): Always call srv_boot() so that
innodb_shutdown() can assume that it was called. Make more subsystems
dependent on SRV_START_STATE_STAT.
srv_shutdown_bg_undo_sources(): Require SRV_START_STATE_STAT.
trx_sys_close(): Do not assume purge_sys!=NULL. Do not call
buf_dblwr_free(), because the doublewrite buffer can exist while
the transaction system does not.
logs_empty_and_mark_files_at_shutdown(): Do a faster shutdown if
!srv_was_started.
recv_sys_close(): Invoke dblwr.pages.clear() which would normally
be invoked by buf_dblwr_process().
recv_recovery_from_checkpoint_start(): Always release log_sys->mutex.
row_mysql_close(): Allow the subsystem not to exist.
Remove the debug parameter innodb_force_recovery_crash that was
introduced into MySQL 5.6 by me in WL#6494 which allowed InnoDB
to resize the redo log on startup.
Let innodb.log_file_size actually start up the server, but ensure
that the InnoDB storage engine refuses to start up in each of the
scenarios.
Problem was that implementation merged from 10.1 was incompatible
with InnoDB 5.7.
buf0buf.cc: Add functions to return should we punch hole and
how big.
buf0flu.cc: Add written page to IORequest
fil0fil.cc: Remove unneeded status call and add test is
sparse files and punch hole supported by file system when
tablespace is created. Add call to get file system
block size. Used file node is added to IORequest. Added
functions to check is punch hole supported and setting
punch hole.
ha_innodb.cc: Remove unneeded status variables (trim512-32768)
and trim_op_saved. Deprecate innodb_use_trim and
set it ON by default. Add function to set innodb-use-trim
dynamically.
dberr.h: Add error code DB_IO_NO_PUNCH_HOLE
if punch hole operation fails.
fil0fil.h: Add punch_hole variable to fil_space_t and
block size to fil_node_t.
os0api.h: Header to helper functions on buf0buf.cc and
fil0fil.cc for os0file.h
os0file.h: Remove unneeded m_block_size from IORequest
and add bpage to IORequest to know actual size of
the block and m_fil_node to know tablespace file
system block size and does it support punch hole.
os0file.cc: Add function punch_hole() to IORequest
to do punch_hole operation,
get the file system block size and determine
does file system support sparse files (for punch hole).
page0size.h: remove implicit copy disable and
use this implicit copy to implement copy_from()
function.
buf0dblwr.cc, buf0flu.cc, buf0rea.cc, fil0fil.cc, fil0fil.h,
os0file.h, os0file.cc, log0log.cc, log0recv.cc:
Remove unneeded write_size parameter from fil_io
calls.
srv0mon.h, srv0srv.h, srv0mon.cc: Remove unneeded
trim512-trim32678 status variables. Removed
these from monitor tests.
The MariaDB 10.1 page_compression is incompatible with the Oracle
implementation that was introduced in MySQL 5.7 later.
Remove the Oracle implementation. Also remove the remaining traces of
MYSQL_ENCRYPTION.
This will also remove traces of PUNCH_HOLE until it is implemented
better. The only effective call to os_file_punch_hole() was in
fil_node_create_low() to test if the operation is supported for the file.
In other words, it looks like page_compression is not working in
MariaDB 10.2, because no code equivalent to the 10.1 os_file_trim()
is enabled.
Sometimes innodb_data_file_size_debug was reported as INT UNSIGNED
instead of BIGINT UNSIGNED. Make it uint instead of ulong to get
a more deterministic result.
InnoDB shutdown failed to properly take fil_crypt_thread() into account.
The encryption threads were signalled to shut down together with other
non-critical tasks. This could be much too early in case of slow shutdown,
which could need minutes to complete the purge. Furthermore, InnoDB
failed to wait for the fil_crypt_thread() to actually exit before
proceeding to the final steps of shutdown, causing the race conditions.
Furthermore, the log_scrub_thread() was shut down way too early.
Also it should remain until the SRV_SHUTDOWN_FLUSH_PHASE.
fil_crypt_threads_end(): Remove. This would cause the threads to
be terminated way too early.
srv_buf_dump_thread_active, srv_dict_stats_thread_active,
lock_sys->timeout_thread_active, log_scrub_thread_active,
srv_monitor_active, srv_error_monitor_active: Remove a race condition
between startup and shutdown, by setting these in the startup thread
that creates threads, not in each created thread. In this way, once the
flag is cleared, it will remain cleared during shutdown.
srv_n_fil_crypt_threads_started, fil_crypt_threads_event: Declare in
global rather than static scope.
log_scrub_event, srv_log_scrub_thread_active, log_scrub_thread():
Declare in static rather than global scope. Let these be created by
log_init() and freed by log_shutdown().
rotate_thread_t::should_shutdown(): Do not shut down before the
SRV_SHUTDOWN_FLUSH_PHASE.
srv_any_background_threads_are_active(): Remove. These checks now
exist in logs_empty_and_mark_files_at_shutdown().
logs_empty_and_mark_files_at_shutdown(): Shut down the threads in
the proper order. Keep fil_crypt_thread() and log_scrub_thread() alive
until SRV_SHUTDOWN_FLUSH_PHASE, and check that they actually terminate.
Problem was that log_scrub function did not take required log_sys mutex.
Background: Unused space in log blocks are padded with MLOG_DUMMY_RECORD if innodb-scrub-log
is enabled. As log files are written on circular fashion old log blocks can be reused
later for new redo-log entries. Scrubbing pads unused space in log blocks to avoid visibility
of the possible old redo-log contents.
log_scrub(): Take log_sys mutex
log_pad_current_log_block(): Increase srv_stats.n_log_scrubs if padding is done.
srv0srv.cc: Set srv_stats.n_log_scrubs to export vars innodb_scrub_log
ha_innodb.cc: Export innodb_scrub_log to global status.
The InnoDB source code contains quite a few references to a closed-source
hot backup tool which was originally called InnoDB Hot Backup (ibbackup)
and later incorporated in MySQL Enterprise Backup.
The open source backup tool XtraBackup uses the full database for recovery.
So, the references to UNIV_HOTBACKUP are only cluttering the source code.
The configuration parameter innodb_use_fallocate, which is mapped to
the variable srv_use_posix_fallocate, has no effect in MariaDB 10.2.2
or MariaDB 10.2.3.
Thus the configuration parameter and the variable should be removed.
fil_space_t::recv_size: New member: recovered tablespace size in pages;
0 if no size change was read from the redo log,
or if the size change was implemented.
fil_space_set_recv_size(): New function for setting space->recv_size.
innodb_data_file_size_debug: A debug parameter for setting the system
tablespace size in recovery even when the redo log does not contain
any size changes. It is hard to write a small test case that would
cause the system tablespace to be extended at the critical moment.
recv_parse_log_rec(): Note those tablespaces whose size is being changed
by the redo log, by invoking fil_space_set_recv_size().
innobase_init(): Correct an error message, and do not require a larger
innodb_buffer_pool_size when starting up with a smaller innodb_page_size.
innobase_start_or_create_for_mysql(): Allow startup with any initial
size of the ibdata1 file if the autoextend attribute is set. Require
the minimum size of fixed-size system tablespaces to be 640 pages,
not 10 megabytes. Implement innodb_data_file_size_debug.
open_or_create_data_files(): Round the system tablespace size down
to pages, not to full megabytes, (Our test truncates the system
tablespace to more than 800 pages with innodb_page_size=4k.
InnoDB should not imagine that it was truncated to 768 pages
and then overwrite good pages in the tablespace.)
fil_flush_low(): Refactored from fil_flush().
fil_space_extend_must_retry(): Refactored from
fil_extend_space_to_desired_size().
fil_mutex_enter_and_prepare_for_io(): Extend the tablespace if
fil_space_set_recv_size() was called.
The test case has been successfully run with all the
innodb_page_size values 4k, 8k, 16k, 32k, 64k.
Reduce the number of calls to encryption_get_key_get_latest_version
when doing key rotation with two different methods:
(1) We need to fetch key information when tablespace not yet
have a encryption information, invalid keys are handled now
differently (see below). There was extra call to detect
if key_id is not found on key rotation.
(2) If key_id is not found from encryption plugin, do not
try fetching new key_version for it as it will fail anyway.
We store return value from encryption_get_key_get_latest_version
call and if it returns ENCRYPTION_KEY_VERSION_INVALID there
is no need to call it again.
Analysis: By design InnoDB was reading first page of every .ibd file
at startup to find out is tablespace encrypted or not. This is
because tablespace could have been encrypted always, not
encrypted newer or encrypted based on configuration and this
information can be find realible only from first page of .ibd file.
Fix: Do not read first page of every .ibd file at startup. Instead
whenever tablespace is first time accedded we will read the first
page to find necessary information about tablespace encryption
status.
TODO: Add support for SYS_TABLEOPTIONS where all table options
encryption information included will be stored.
Contains also:
MDEV-10549 mysqld: sql/handler.cc:2692: int handler::ha_index_first(uchar*): Assertion `table_share->tmp_table != NO_TMP_TABLE || m_lock_type != 2' failed. (branch bb-10.2-jan)
Unlike MySQL, InnoDB still uses THR_LOCK in MariaDB
MDEV-10548 Some of the debug sync waits do not work with InnoDB 5.7 (branch bb-10.2-jan)
enable tests that were fixed in MDEV-10549
MDEV-10548 Some of the debug sync waits do not work with InnoDB 5.7 (branch bb-10.2-jan)
fix main.innodb_mysql_sync - re-enable online alter for partitioned innodb tables
Contains also
MDEV-10547: Test multi_update_innodb fails with InnoDB 5.7
The failure happened because 5.7 has changed the signature of
the bool handler::primary_key_is_clustered() const
virtual function ("const" was added). InnoDB was using the old
signature which caused the function not to be used.
MDEV-10550: Parallel replication lock waits/deadlock handling does not work with InnoDB 5.7
Fixed mutexing problem on lock_trx_handle_wait. Note that
rpl_parallel and rpl_optimistic_parallel tests still
fail.
MDEV-10156 : Group commit tests fail on 10.2 InnoDB (branch bb-10.2-jan)
Reason: incorrect merge
MDEV-10550: Parallel replication can't sync with master in InnoDB 5.7 (branch bb-10.2-jan)
Reason: incorrect merge
Backport pull request #125 from grooverdan/MDEV-8923_innodb_buffer_pool_dump_pct to 10.0
WL#6504 InnoDB buffer pool dump/load enchantments
This patch consists of two parts:
1. Dump only the hottest N% of the buffer pool(s)
2. Prevent hogging the server duing BP load
From MySQL - commit b409342c43ce2edb68807100a77001367c7e6b8e
Add testcases for innodb_buffer_pool_dump_pct_basic.
Part of the code authored by Daniel Black
WL#6504 InnoDB buffer pool dump/load enchantments
This patch consists of two parts:
1. Dump only the hottest N% of the buffer pool(s)
2. Prevent hogging the server duing BP load
From MySQL - commit b409342c43ce2edb68807100a77001367c7e6b8e
Analysis: Current implementation will write and read at least one block
(sort_buffer_size bytes) from disk / index even if that block does not
contain any records.
Fix: Avoid writing / reading empty blocks to temporary files (disk).