101 commits

Author SHA1 Message Date
Marko Mäkelä
c5856b0a68 MDEV-21351: Allocate aligned memory
recv_sys_t::ALIGNMENT: The recv_sys_t::alloc() alignment
2020-02-08 11:47:42 +02:00
Marko Mäkelä
6d214415c9 MDEV-21351: Free processed recv_sys_t::blocks
Release memory as soon as redo log records are processed.

Because the memory allocation and deallocation of parsed redo log
records must be protected by recv_sys.mutex, it is better to avoid
using a std::atomic field for bookkeeping.

buf_page_t::access_time: Keep track of the recv_sys.pages record
allocations. The most significant 16 bits will count allocated
blocks (which were previously counted by buf_page_t::buf_fix_count
in the debug version), and the least significant 16 bits indicate
the number of allocated bytes in the block (which was previously
managed in buf_block_t::modify_clock), which must be a positive
number, up to innodb_page_size. The byte offset 65536 is represented
as the value 0.
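
The packing described above can be sketched as follows. This is only an illustration of the bit layout; the helper names pack_usage(), alloc_count() and byte_count() are hypothetical and do not appear in the InnoDB source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers illustrating the 32-bit layout described above.
// Most significant 16 bits: number of allocations in the block.
// Least significant 16 bits: number of bytes used; 65536 is stored as 0.
constexpr uint32_t pack_usage(uint32_t n_allocs, uint32_t n_bytes)
{
  return (n_allocs << 16) | (n_bytes & 0xffff);
}

constexpr uint32_t alloc_count(uint32_t word)
{
  return word >> 16;
}

constexpr uint32_t byte_count(uint32_t word)
{
  // The byte count is always positive, so 0 can only mean a full 64KiB block.
  return (word & 0xffff) ? (word & 0xffff) : 65536;
}
```

Because the byte count must be positive, the stored value 0 is unambiguous: it can only stand for a block that is filled up to 65536 bytes.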

recv_recover_page(): Let the caller erase the log.

recv_validate_tablespace(): Acquire recv_sys_t::mutex.
2020-02-06 09:00:19 +02:00
Marko Mäkelä
50324ce624 MDEV-21351 Replace recv_sys.heap with list of buf_block_t
InnoDB crash recovery used a special type of mem_heap_t that
allocates backing store from the buffer pool. That incurred
a significant overhead, leading to underutilization of memory,
and limiting the maximum contiguous allocated size of a log record.

recv_sys_t::blocks: A linked list of buf_block_t that are allocated
by buf_block_alloc() for redo log records. Replaces recv_sys_t::heap.
We repurpose buf_block_t::unzip_LRU for linking the elements.

recv_sys_t::max_log_blocks: Renamed from recv_n_pool_free_frames.

recv_sys_t::max_blocks(): Accessor for max_log_blocks.

recv_sys_t::alloc(): Allocate memory from the current recv_sys_t::blocks
element, or allocate another block.  In debug builds, various free()
member functions must be invoked, because we repurpose
buf_page_t::buf_fix_count for tracking allocations.
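
A minimal sketch of the "use the current block, or allocate another" idea follows. Block and BlockAllocator are hypothetical stand-ins rather than buf_block_t or recv_sys_t, the backing store is a plain std::vector instead of the buffer pool, and the debug-build free() bookkeeping is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical stand-in for a buffer pool block (not the real buf_block_t).
struct Block
{
  std::vector<unsigned char> frame;  // backing store of the block
  size_t used = 0;                   // number of bytes handed out so far
  explicit Block(size_t size) : frame(size) {}
};

class BlockAllocator
{
  std::list<Block> blocks;  // the current block is at the front
  const size_t block_size;
public:
  explicit BlockAllocator(size_t size) : block_size(size) {}

  // Carve len bytes out of the current block, or start a new block
  // if the remaining space is too small.
  void *alloc(size_t len)
  {
    assert(len > 0 && len <= block_size);
    if (blocks.empty() || block_size - blocks.front().used < len)
      blocks.emplace_front(block_size);
    Block &b = blocks.front();
    void *p = b.frame.data() + b.used;
    b.used += len;
    return p;
  }

  size_t n_blocks() const { return blocks.size(); }
};
```

A std::list keeps the addresses of earlier blocks stable, so pointers returned by alloc() stay valid when another block is added.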

recv_sys_t::free_corrupted_page(): Renamed from recv_recover_corrupt_page()

recv_sys_t::is_memory_exhausted(): Renamed from recv_sys_heap_check()

recv_sys_t::pages and its elements are allocated directly by the
system memory allocator.

recv_parse_log_recs(): Remove the parameter available_memory.

We rename some variables 'store_to_hash' to 'store', because
recv_sys.pages is not actually a hash table.

This is joint work with Thirunarayanan Balathandayuthapani.
2020-01-29 12:53:39 +02:00
Marko Mäkelä
b36154a109 Cleanup log_rec_t 2019-12-27 21:20:03 +02:00
Marko Mäkelä
8cc15c036d Merge 10.4 into 10.5 2019-12-27 21:17:16 +02:00
Marko Mäkelä
4c25e75ce7 Merge 10.3 into 10.4 2019-12-27 18:20:28 +02:00
Marko Mäkelä
5ab70e7f68 Merge 10.2 into 10.3 2019-12-27 15:14:48 +02:00
Thirunarayanan Balathandayuthapani
bba59abb03 MDEV-19176 Reduce the memory usage during recovery
- Move the recv_sys->heap memory check inside recv_parse_log_recs(),
so that InnoDB can set the status to STORE_NO earlier.

- InnoDB uses one third of the buffer pool chunk size for reading the
redo log records. This avoids the scenario where the buffer pool runs
out of memory during recovery.
2019-12-23 15:51:02 +05:30
Marko Mäkelä
28c89b7151 Merge 10.4 into 10.5 2019-12-16 07:47:17 +02:00
Eugene Kosov
014e125830 optimize crash recovery
recv_dblwr_t::list is only used for prepending elements and for iterating
through them. std::deque fits that purpose better, because it performs
fewer allocations than std::forward_list and provides better memory
locality.
2019-12-12 22:19:41 +07:00
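The access pattern named in the commit above (prepend, then iterate) can be sketched with std::deque as follows. The int element type and the function name prepend_and_scan() are placeholders chosen for illustration; they are not taken from recv_dblwr_t.

```cpp
#include <deque>
#include <vector>

// Prepend every input value, then read the container front to back.
// std::deque allocates storage in chunks, while std::forward_list
// allocates one node per element.
std::vector<int> prepend_and_scan(const std::vector<int> &input)
{
  std::deque<int> list;
  for (int v : input)
    list.push_front(v);
  return std::vector<int>(list.begin(), list.end());
}
```

Iteration visits the most recently prepended element first, which is the same order that a std::forward_list filled with push_front() would produce.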
Marko Mäkelä
6b5cdd4ff7 MDEV-19514: Update stale comments 2019-12-04 15:35:58 +02:00
Marko Mäkelä
64a02e4fa2 MDEV-19586: Add const qualifiers
Except for fil_name_process(), which invokes os_normalize_path(),
the redo log record parser will not modify the redo log records.
Add const qualifiers accordingly.
2019-11-04 09:25:26 +02:00
Marko Mäkelä
b7fc2c899e Refactor recv_sys_t::recs_t into page_recv_t
page_recv_t: Replaces recv_sys_t::recs_t.
page_recv_t::state is not private, even though some accessors exist.

page_recv_t::log: A singly-linked list of log_rec_t* with STL decoration
and the custom operations trim() and append(). The list members are private.

recv_t::data_t: Replaces recv_data_t.

recv_t::data: Remove the pointer indirection for the first log chunk,
and copy the first chunk directly after the record. Adjust the
definition of RECV_DATA_BLOCK_SIZE accordingly.
2019-11-01 15:06:31 +02:00
Marko Mäkelä
2aa1f77ef1 MDEV-19586: Rename recv_sys.empty() to recv_sys.clear()
In the collections of Standard Template Library,
empty() is a predicate and clear() empties a collection.
Let us rename recv_sys.empty() to recv_sys.clear() to avoid confusion.
2019-11-01 14:12:55 +02:00
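The naming convention that the commit above refers to can be illustrated with a small wrapper. RecordStore is a hypothetical type for this sketch only, not recv_sys_t.

```cpp
#include <cassert>
#include <map>

// Hypothetical miniature of a record store that follows the STL naming.
struct RecordStore
{
  std::map<int, int> pages;

  // Predicate: reports whether anything is stored; modifies nothing.
  bool empty() const { return pages.empty(); }

  // Mutator: discards everything that is stored.
  void clear() { pages.clear(); }
};
```

With these names, a call such as store.empty() reads as a question, and a reader familiar with the standard containers will not expect it to discard data.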
Marko Mäkelä
624dd71b94 Merge 10.4 into 10.5 2019-08-13 18:57:00 +03:00
Marko Mäkelä
e9c1701e11 Merge 10.3 into 10.4 2019-07-25 18:42:06 +03:00
Marko Mäkelä
fdef9f9b89 Merge 10.2 into 10.3 2019-07-25 15:31:11 +03:00
Marko Mäkelä
b6ac67389d Merge 10.1 into 10.2 2019-07-25 12:14:27 +03:00
Marko Mäkelä
0c7c61019d Remove the wrappers ut_time(), ut_difftime(), ib_time_t 2019-07-24 21:59:26 +03:00
Marko Mäkelä
984d7100cd Merge 10.4 into 10.5 2019-06-13 18:36:09 +03:00
Marko Mäkelä
2fd82471ab Merge 10.3 into 10.4 2019-06-12 08:37:27 +03:00
Marko Mäkelä
b42dbdbccd Merge 10.2 into 10.3 2019-06-11 13:00:18 +03:00
Marko Mäkelä
177a571e01 MDEV-19586 Replace recv_sys_t::addr_hash with a std::map
InnoDB crash recovery buffers redo log records in a hash table.
The function recv_read_in_area() would pick a random hash bucket
and then try to submit read requests for a few nearby pages.
Let us replace the recv_sys.addr_hash with a std::map, which will
automatically be iterated in sorted order.
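
The benefit of sorted iteration can be sketched as follows. PageId and next_pages() are assumptions made for this example (a pair of tablespace ID and page number, and a helper that collects nearby pages); the real code uses page_id_t and recv_read_in_area().

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical page identifier: (tablespace ID, page number).
using PageId = std::pair<uint32_t, uint32_t>;

// Collect up to limit buffered pages, starting at the first key that is
// not less than start. Because std::map is ordered by key, the result
// consists of neighbouring pages in ascending order.
std::vector<PageId> next_pages(const std::map<PageId, int> &pages,
                               PageId start, size_t limit)
{
  std::vector<PageId> out;
  for (auto it = pages.lower_bound(start);
       it != pages.end() && out.size() < limit; ++it)
    out.push_back(it->first);
  return out;
}
```

With a hash table, finding the pages adjacent to a given page would require one lookup per candidate page number; with an ordered map, a single lower_bound() call followed by iteration is enough.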

recv_sys_t::pages: Replaces recv_sys_t::addr_hash, recv_sys_t::n_addrs.

recv_sys_t::recs: Replaces most of recv_addr_t.

recv_t: Encapsulate a raw singly-linked list of records. This reduces
overhead compared to std::forward_list. It reduces storage and cache
overhead, because the next-element pointer also points to the data
payload. It reduces processing overhead, because
recv_sys_t::recs_t::last will point to the last record, so that
recv_sys_t::add() can append straight to the end of the list.
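
The constant-time append via a pointer to the last element can be sketched like this. Rec and RecList are hypothetical simplifications for this example; the actual recv_t additionally stores the record payload right after the node.

```cpp
#include <cassert>

// Hypothetical list node; the real record also carries its payload.
struct Rec
{
  int value;
  Rec *next;
};

// Singly-linked list that remembers its last element.
struct RecList
{
  Rec *head = nullptr;
  Rec *last = nullptr;

  // Append in constant time, without walking the list.
  void append(Rec *r)
  {
    assert(r != nullptr);
    r->next = nullptr;
    if (last)
      last->next = r;
    else
      head = r;
    last = r;
  }
};
```

std::forward_list has no push_back(), so appending to it requires either walking the list or keeping a separate iterator to the last element.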

RECV_PROCESSED, RECV_DISCARDED: Remove. When a page is fully processed,
it will be deleted from recv_sys.pages.

recv_sys_t::trim(): Replaces recv_addr_trim().

recv_sys_t::add(): Use page_id_t for identifying pages.

recv_fold(), recv_hash(), recv_get_fil_addr_struct(): Remove.

recv_read_in_area(): Simplify the iteration.
2019-06-11 11:08:39 +03:00
Thirunarayanan Balathandayuthapani
b4287ec386 MDEV-19541 InnoDB crashes when trying to recover a corrupted page
- Use the ID of the corrupted page instead of the whole block after
the block has been released from the LRU list.
2019-06-05 16:36:51 +05:30
Marko Mäkelä
f98bb23168 Merge 10.3 into 10.4 2019-05-29 22:17:00 +03:00
Marko Mäkelä
90a9193685 Merge 10.2 into 10.3 2019-05-29 11:32:46 +03:00
Thirunarayanan Balathandayuthapani
79b46ab2a6 MDEV-19541 InnoDB crashes when trying to recover a corrupted page
- Do not apply the redo log to a corrupted page when
innodb_force_recovery > 0.
- Allow the table to be dropped when the index root page is
corrupted and innodb_force_recovery > 0.
2019-05-28 11:55:02 +03:00
Marko Mäkelä
50e79f604e MDEV-19606: Make recv_dblwr_t::list a forward_list 2019-05-28 08:01:49 +03:00
Marko Mäkelä
5d2619b693 MDEV-19584 Allocate recv_sys statically
There is only one InnoDB crash recovery subsystem.
Allocating recv_sys statically removes one level of pointer indirection
and makes code more readable, and removes the awkward initialization of
recv_sys->dblwr.

recv_sys_t::create(): Replaces recv_sys_init().

recv_sys_t::debug_free(): Replaces recv_sys_debug_free().

recv_sys_t::close(): Replaces recv_sys_close().

recv_sys_t::add(): Replaces recv_add_to_hash_table().

recv_sys_t::empty(): Replaces recv_sys_empty_hash().
2019-05-24 16:19:38 +03:00
Oleksandr Byelkin
c07325f932 Merge branch '10.3' into 10.4 2019-05-19 20:55:37 +02:00
Marko Mäkelä
be85d3e61b Merge 10.2 into 10.3 2019-05-14 17:18:46 +03:00
Marko Mäkelä
26a14ee130 Merge 10.1 into 10.2 2019-05-13 17:54:04 +03:00
Vicențiu Ciorbaru
c0ac0b8860 Update FSF address 2019-05-11 19:25:02 +03:00
Marko Mäkelä
d3dcec5d65 Merge 10.3 into 10.4 2019-05-05 15:06:44 +03:00
Marko Mäkelä
b6f4cccd19 Merge 10.2 into 10.3 2019-05-03 20:14:09 +03:00
Marko Mäkelä
3db94d2403 MDEV-19346: Remove dummy InnoDB log checkpoints
log_checkpoint(), log_make_checkpoint_at(): Remove the parameter
write_always. It seems that the primary purpose of this parameter
was to ensure in the function recv_reset_logs() that both checkpoint
header pages will be overwritten, when the function is called from
the never-enabled function recv_recovery_from_archive_start().

create_log_files(): Merge recv_reset_logs() to its only caller.

Debug instrumentation: Prefer to flush the redo log, instead of
triggering a redo log checkpoint.

page_header_set_field(): Disable a debug assertion that will
always fail due to MDEV-19344, now that we no longer initiate
a redo log checkpoint before an injected crash.

In recv_reset_logs() there used to be two calls to
log_make_checkpoint_at(). The apparent purpose of this was
to ensure that both InnoDB redo log checkpoint header pages
will be initialized or overwritten.
The second call was removed (without any explanation) in MySQL 5.6.3:
mysql/mysql-server@4ca37968da

In MySQL 5.6.8 WL#6494, starting with
mysql/mysql-server@00a0ba8ad9
the function recv_reset_logs() was not only invoked during
InnoDB data file initialization, but also during a regular
startup when the redo log is being resized.

mysql/mysql-server@45e9167983
in MySQL 5.7.2 removed the UNIV_LOG_ARCHIVE code, but still
did not remove the parameter write_always.
2019-05-03 20:02:11 +03:00
Marko Mäkelä
d8303c3ee7 Merge 10.3 into 10.4 2019-04-08 08:22:34 +03:00
Marko Mäkelä
cc492bfd4f Merge 10.2 into 10.3 2019-04-07 11:49:50 +03:00
Marko Mäkelä
1d30b7b1d2 MDEV-12699 preparation: Clean up recv_sys
The recv_sys data structures are accessed not only from the thread
that executes InnoDB plugin initialization, but also from the
InnoDB I/O threads, which can invoke recv_recover_page().

Assert that sufficient concurrency control is in place.
Some code was accessing recv_sys data structures without
holding recv_sys->mutex.

recv_recover_page(bpage): Refactor the call from buf_page_io_complete()
into a separate function that performs necessary steps. The
main thread was unnecessarily releasing and reacquiring recv_sys->mutex.

recv_recover_page(block,mtr,recv_addr): Pass more parameters from
the caller. Avoid redundant lookups and computations. Eliminate some
redundant variables.

recv_get_fil_addr_struct(): Assert that recv_sys->mutex is being held.
That was not always the case!

recv_scan_log_recs(): Acquire recv_sys->mutex for the whole duration
of the function. (While we are scanning and buffering redo log records,
no pages can be read in.)

recv_read_in_area(): Properly protect access with recv_sys->mutex.

recv_apply_hashed_log_recs(): Check recv_addr->state only once,
and continuously hold recv_sys->mutex. The mutex will be released
and reacquired inside recv_recover_page() and recv_read_in_area(),
allowing concurrent processing by buf_page_io_complete() in I/O threads.
2019-04-06 21:25:43 +03:00
Marko Mäkelä
71f9552fd8 recv_recovery_is_on(): Add UNIV_UNLIKELY
Normally, InnoDB is not in the process of executing crash recovery.
Provide a hint to the compiler that the recovery-related code paths
are rarely executed.
2019-04-06 21:25:43 +03:00
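A branch hint of the kind that the commit above describes is commonly built on the GCC and Clang built-in __builtin_expect(). The macro SKETCH_UNLIKELY and the function apply() below are hypothetical names for this sketch; the actual definition of UNIV_UNLIKELY is not shown in this log.

```cpp
#include <cassert>

// Hypothetical hint macro: tell the compiler that cond is usually false.
#if defined(__GNUC__) || defined(__clang__)
# define SKETCH_UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
# define SKETCH_UNLIKELY(cond) (cond)
#endif

// The hint changes only code layout, never the result of the test.
int apply(bool recovery_is_on, int page)
{
  if (SKETCH_UNLIKELY(recovery_is_on))
    return -page;  // rarely executed recovery path
  return page;     // normal path
}
```

The hint is purely an optimization: both branches still behave as written, and a wrong hint costs performance but never correctness.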
Marko Mäkelä
dde2ca4aa1 Merge 10.3 into 10.4 2018-11-19 20:22:33 +02:00
Marko Mäkelä
fd58bb71e2 Merge 10.2 into 10.3 2018-11-19 18:45:53 +02:00
Marko Mäkelä
ff88e4bb8a Remove many redundant #include from InnoDB 2018-11-19 11:42:14 +02:00
Marko Mäkelä
09af00cbde MDEV-13564: Remove old crash-upgrade logic in 10.4
Stop supporting the additional *trunc.log files that were
introduced via MySQL 5.7 to MariaDB Server 10.2 and 10.3.

DB_TABLESPACE_TRUNCATED: Remove.

purge_sys.truncate: A new structure to track undo tablespace
file truncation.

srv_start(): Remove the call to buf_pool_invalidate(). It is
no longer necessary, given that we no longer access things in
ways that violate the ARIES protocol. This call was originally
added for innodb_file_format, and it may later have been necessary
for the proper function of the MySQL 5.7 TRUNCATE recovery, which
we are now removing.

trx_purge_cleanse_purge_queue(): Take the undo tablespace as a
parameter.

trx_purge_truncate_history(): Rewrite everything mostly in a
single function, replacing references to undo::Truncate.

recv_apply_hashed_log_recs(): If any redo log is to be applied,
and if the log_sys.log.subformat indicates that separately
logged truncate may have been used, refuse to proceed except if
innodb_force_recovery is set. We will still refuse crash-upgrade
if TRUNCATE TABLE was logged. Undo tablespace truncation would
only be logged in undo*trunc.log files, which we are no longer
checking for.
2018-09-11 21:32:15 +03:00
Marko Mäkelä
5a1868b58d MDEV-13564 Mariabackup does not work with TRUNCATE
This is a merge from 10.2, but the 10.2 version of this will not
be pushed into 10.2 yet, because the 10.2 version would include
backports of MDEV-14717 and MDEV-14585, which would introduce
a crash recovery regression: Tables could be lost on
table-rebuilding DDL operations, such as ALTER TABLE,
OPTIMIZE TABLE or this new backup-friendly TRUNCATE TABLE.
The test innodb.truncate_crash occasionally loses the table due to
the following bug:

MDEV-17158 log_write_up_to() sometimes fails
2018-09-07 22:15:06 +03:00
Marko Mäkelä
055a3334ad MDEV-13564 Mariabackup does not work with TRUNCATE
Implement undo tablespace truncation via normal redo logging.

Implement TRUNCATE TABLE as a combination of RENAME to #sql-ib name,
CREATE, and DROP.

Note: Orphan #sql-ib*.ibd may be left behind if MariaDB Server 10.2
is killed before the DROP operation is committed. If MariaDB Server 10.2
is killed during TRUNCATE, it is also possible that the old table
was renamed to #sql-ib*.ibd but the data dictionary will refer to the
table using the original name.

In MariaDB Server 10.3, RENAME inside InnoDB is transactional,
and #sql-* tables will be dropped on startup. So, this new TRUNCATE
will be fully crash-safe in 10.3.

ha_mroonga::wrapper_truncate(): Pass table options to the underlying
storage engine, now that ha_innobase::truncate() will need them.

rpl_slave_state::truncate_state_table(): Before truncating
mysql.gtid_slave_pos, evict any cached table handles from
the table definition cache, so that there will be no stale
references to the old table after truncating.

== TRUNCATE TABLE ==

WL#6501 in MySQL 5.7 introduced separate log files for implementing
atomic and crash-safe TRUNCATE TABLE, instead of using the InnoDB
undo and redo log. Some convoluted logic was added to the InnoDB
crash recovery, and some extra synchronization (including a redo log
checkpoint) was introduced to make this work. This synchronization
has caused performance problems and race conditions, and the extra
log files cannot be copied or applied by external backup programs.

In order to support crash-upgrade from MariaDB 10.2, we will keep
the logic for parsing and applying the extra log files, but we will
no longer generate those files in TRUNCATE TABLE.

A prerequisite for crash-safe TRUNCATE is a crash-safe RENAME TABLE
(with full redo and undo logging and proper rollback). This will
be implemented in MDEV-14717.

ha_innobase::truncate(): Invoke RENAME, create(), delete_table().
Because RENAME cannot be fully rolled back before MariaDB 10.3
due to missing undo logging, add some explicit rename-back in
case the operation fails.

ha_innobase::delete(): Introduce a variant that takes sqlcom as
a parameter. In TRUNCATE TABLE, we do not want to touch any
FOREIGN KEY constraints.

ha_innobase::create(): Add the parameters file_per_table, trx.
In TRUNCATE, the new table must be created in the same transaction
that renames the old table.

create_table_info_t::create_table_info_t(): Add the parameters
file_per_table, trx.

row_drop_table_for_mysql(): Replace a bool parameter with sqlcom.

row_drop_table_after_create_fail(): New function, wrapping
row_drop_table_for_mysql().

dict_truncate_index_tree_in_mem(), fil_truncate_tablespace(),
fil_prepare_for_truncate(), fil_reinit_space_header_for_table(),
row_truncate_table_for_mysql(), TruncateLogger,
row_truncate_prepare(), row_truncate_rollback(),
row_truncate_complete(), row_truncate_fts(),
row_truncate_update_system_tables(),
row_truncate_foreign_key_checks(), row_truncate_sanity_checks():
Remove.

row_upd_check_references_constraints(): Remove a check for
TRUNCATE, now that the table is no longer truncated in place.

The new test innodb.truncate_foreign uses DEBUG_SYNC to cover some
race-condition-like scenarios. The test innodb-innodb.truncate does
not use any synchronization.

We add a redo log subformat to indicate backup-friendly format.
MariaDB 10.4 will remove support for the old TRUNCATE logging,
so crash-upgrade from old 10.2 or 10.3 to 10.4 will involve
limitations.

== Undo tablespace truncation ==

MySQL 5.7 implements undo tablespace truncation. It is only
possible when innodb_undo_tablespaces is set to at least 2.
The logging is implemented similarly to the WL#6501 TRUNCATE,
that is, using separate log files and a redo log checkpoint.

We can simply implement undo tablespace truncation within
a single mini-transaction that reinitializes the undo log
tablespace file. Unfortunately, due to the redo log format
of some operations, currently, the total redo log written by
undo tablespace truncation will be more than the combined size
of the truncated undo tablespace. It should be acceptable
to have a little more than 1 megabyte of log in a single
mini-transaction. This will be fixed in MDEV-17138 in
MariaDB Server 10.4.

recv_sys_t: Add truncated_undo_spaces[] to remember for which undo
tablespaces a MLOG_FILE_CREATE2 record was seen.

namespace undo: Remove some unnecessary declarations.

fil_space_t::is_being_truncated: Document that this flag now
only applies to undo tablespaces. Remove some references.

fil_space_t::is_stopping(): Do not refer to is_being_truncated.
This check is for tablespaces of tables. Potentially used
tablespaces are never truncated any more.

buf_dblwr_process(): Suppress the out-of-bounds warning
for undo tablespaces.

fil_truncate_log(): Write a MLOG_FILE_CREATE2 with a nonzero
page number (new size of the tablespace in pages) to inform
crash recovery that the undo tablespace size has been reduced.

fil_op_write_log(): Relax assertions, so that MLOG_FILE_CREATE2
can be written for undo tablespaces (without .ibd file suffix)
for a nonzero page number.

os_file_truncate(): Add the parameter allow_shrink=false
so that undo tablespaces can actually be shrunk using this function.

fil_name_parse(): For undo tablespace truncation,
buffer MLOG_FILE_CREATE2 in truncated_undo_spaces[].

recv_read_in_area(): Avoid reading pages for which no redo log
records remain buffered, after recv_addr_trim() removed them.

trx_rseg_header_create(): Add a FIXME comment that we could write
much less redo log.

trx_undo_truncate_tablespace(): Reinitialize the undo tablespace
in a single mini-transaction, which will be flushed to the redo log
before the file size is trimmed.

recv_addr_trim(): Discard any redo logs for pages that were
logged after the new end of a file, before the truncation LSN.
If the rec_list becomes empty, reduce n_addrs. After removing
any affected records, actually truncate the file.
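
The trimming rule can be sketched as follows. For brevity, this sketch assumes an ordered std::map keyed by a hypothetical PageId, rather than the hash table and n_addrs counter that this commit operated on, and it represents each buffered record by its LSN only. The final file truncation step is omitted.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Lsn = uint64_t;
using PageId = std::pair<uint32_t, uint32_t>;  // (tablespace ID, page number)
using LsnList = std::vector<Lsn>;              // LSNs of buffered records

// Discard records that were logged before truncate_lsn for pages at or
// beyond new_size in the given tablespace. Pages that have no records
// left are removed from the collection.
void trim(std::map<PageId, LsnList> &pages, uint32_t space,
          uint32_t new_size, Lsn truncate_lsn)
{
  auto it = pages.lower_bound(PageId(space, new_size));
  while (it != pages.end() && it->first.first == space)
  {
    LsnList &recs = it->second;
    recs.erase(std::remove_if(recs.begin(), recs.end(),
                              [truncate_lsn](Lsn l)
                              { return l < truncate_lsn; }),
               recs.end());
    if (recs.empty())
      it = pages.erase(it);
    else
      ++it;
  }
}
```

Records with an LSN at or after the truncation LSN are kept, because they describe pages that were written after the file had been reinitialized.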

recv_apply_hashed_log_recs(): Invoke recv_addr_trim() right before
applying any log records. The undo tablespace files must be open
at this point.

buf_flush_or_remove_pages(), buf_flush_dirty_pages(),
buf_LRU_flush_or_remove_pages(): Add a parameter for specifying
the number of the first page to flush or remove (default 0).

trx_purge_initiate_truncate(): Remove the log checkpoints, the
extra logging, and some unnecessary crash points. Merge the code
from trx_undo_truncate_tablespace(). First, flush all to-be-discarded
pages (beyond the new end of the file), then trim the space->size
to make the page allocation deterministic. At the only remaining
crash injection point, flush the redo log, so that the recovery
can be tested.
2018-09-07 22:10:02 +03:00
Marko Mäkelä
4901f31c13 Merge 10.2 into 10.3 2018-09-07 22:09:28 +03:00
Marko Mäkelä
0927332961 Make some declarations private
recv_addr_state, recv_addr_t: Define in log0recv.cc only.
2018-09-07 22:06:20 +03:00
Marko Mäkelä
7830fb7f45 Merge 10.2 into 10.3 2018-08-28 12:22:56 +03:00
Marko Mäkelä
d6f7fd6016 MDEV-13564: Refuse MLOG_TRUNCATE in mariabackup
The MySQL 5.7 TRUNCATE TABLE is inherently incompatible
with hot backup, because it is creating and deleting a separate
log file, and it is not writing redo log for all changes of the
InnoDB data dictionary tables. Refuse to create a corrupted backup
if the unsafe form of TRUNCATE was executed.

Note: Undo log tablespace truncation cannot be detected easily.
It is also incompatible with backup, for similar reasons.

xtrabackup_backup_func(): "Subscribe to" the log events before
the first invocation of xtrabackup_copy_logfile().

recv_parse_or_apply_log_rec_body(): If the function pointer
log_truncate is set, invoke it to report MLOG_TRUNCATE.
2018-08-16 16:10:18 +03:00
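The optional callback mentioned in the commit above can be sketched as a function pointer that is checked before use. All names below (RecType, on_truncate, parse_record(), note_truncate()) are hypothetical and chosen for illustration; only the idea of invoking a callback when it is set comes from the commit message.

```cpp
#include <cassert>

// Hypothetical record types for this sketch.
enum RecType { REC_OTHER, REC_TRUNCATE };

// Optional callback; a null pointer means that nobody subscribed.
typedef void (*truncate_callback)();
static truncate_callback on_truncate = nullptr;

// Example subscriber that counts the reported events.
static int truncate_seen = 0;
void note_truncate() { ++truncate_seen; }

// Returns true if a truncate record was reported to a subscriber.
bool parse_record(RecType type)
{
  if (type == REC_TRUNCATE && on_truncate)
  {
    on_truncate();
    return true;
  }
  return false;
}
```

A backup tool would set the pointer before it starts copying the log, so that no truncate record can pass unnoticed; the server itself would leave the pointer unset.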