Commit graph

205 commits

Author SHA1 Message Date
Marko Mäkelä
680463a8d9 Merge 10.2 into 10.3 2020-06-05 16:51:26 +03:00
Marko Mäkelä
efc70da5fd MDEV-22769 Shutdown hang or crash due to XA breaking locks
The background drop table queue in InnoDB is a work-around for
cases where the SQL layer is requesting DDL on tables on which
transactional locks exist.

One such case are XA transactions. Our test case exploits the
fact that the recovery of XA PREPARE transactions will
only resurrect InnoDB table locks, but not MDL that should
block any concurrent DDL.

srv_shutdown_t: Introduce the srv_shutdown_state=SRV_SHUTDOWN_INITIATED
for the initial part of shutdown, to wait for the background drop
table queue to be emptied.

srv_shutdown_bg_undo_sources(): Assign
srv_shutdown_state=SRV_SHUTDOWN_INITIATED
before waiting for the background drop table queue to be emptied.

row_drop_tables_for_mysql_in_background(): On slow shutdown, if
no active transactions exist (excluding ones that are in
XA PREPARE state), skip any tables on which locks exist.

row_drop_table_for_mysql(): Do not unnecessarily attempt to
drop InnoDB persistent statistics for tables that have
already been added to the background drop table queue.

row_mysql_close(): Relax an assertion, and free all memory
even if innodb_force_recovery=2 would prevent the background
drop table queue from being emptied.
2020-06-05 15:22:46 +03:00
Marko Mäkelä
3d0bb2b7f1 Merge 10.2 into 10.3 2020-05-15 19:11:57 +03:00
Marko Mäkelä
ad6171b91c MDEV-22456 Dropping the adaptive hash index may cause DDL to lock up InnoDB
If the InnoDB buffer pool contains many pages for a table or index
that is being dropped or rebuilt, and if many of such pages are
pointed to by the adaptive hash index, dropping the adaptive hash index
may consume a lot of time.

The time-consuming operation of dropping the adaptive hash index entries
is being executed while the InnoDB data dictionary cache dict_sys is
exclusively locked.

It is not actually necessary to drop all adaptive hash index entries
at the time a table or index is being dropped or rebuilt. We can let
the LRU replacement policy of the buffer pool take care of this gradually.
For this to work, we must detach the dict_table_t and dict_index_t
objects from the main dict_sys cache, and once the last
adaptive hash index entry for the detached table is removed
(when the garbage page is evicted from the buffer pool) we can free
the dict_table_t and dict_index_t object.

Related to this, in MDEV-16283, we made ALTER TABLE...DISCARD TABLESPACE
skip both the buffer pool eviction and the drop of the adaptive hash index.
We shifted the burden to ALTER TABLE...IMPORT TABLESPACE or DROP TABLE.
We can remove the eviction from DROP TABLE. We must retain the eviction
in the ALTER TABLE...IMPORT TABLESPACE code path, so that in case the
discarded table is being re-imported with the same tablespace identifier,
the fresh data from the imported tablespace will replace any stale pages
in the buffer pool.

rpl.rpl_failed_drop_tbl_binlog: Remove the test. DROP TABLE can
no longer be interrupted inside InnoDB.

fseg_free_page(), fseg_free_step(), fseg_free_step_not_header(),
fseg_free_page_low(), fseg_free_extent(): Remove the parameter
that specifies whether the adaptive hash index should be dropped.

btr_search_lazy_free(): Lazily free an index when the last
reference to it is dropped from the adaptive hash index.

buf_pool_clear_hash_index(): Declare static, and move to the
same compilation unit with the bulk of the adaptive hash index
code.

dict_index_t::clone(), dict_index_t::clone_if_needed():
Clone an index that is being rebuilt while adaptive hash index
entries exist. The original index will be inserted into
dict_table_t::freed_indexes and dict_index_t::set_freed()
will be called.

dict_index_t::set_freed(), dict_index_t::freed(): Note that
or check whether the index has been freed. We will use the
impossible page number 1 to denote this condition.

dict_index_t::n_ahi_pages(): Replaces btr_search_info_get_ref_count().

dict_index_t::detach_columns(): Move the assignment n_fields=0
to ha_innobase_inplace_ctx::clear_added_indexes().
We must have access to the columns when freeing the
adaptive hash index. Note: dict_table_t::v_cols[] will remain
valid. If virtual columns are dropped or added, the table
definition will be reloaded in ha_innobase::commit_inplace_alter_table().

buf_page_mtr_lock(): Drop a stale adaptive hash index if needed.

We will also reduce the number of btr_get_search_latch() calls
and enclose some more code inside #ifdef BTR_CUR_HASH_ADAPT
in order to benefit cmake -DWITH_INNODB_AHI=OFF.
2020-05-15 17:23:08 +03:00
Marko Mäkelä
15fa70b840 Merge 10.2 into 10.3 2020-05-13 11:45:05 +03:00
Marko Mäkelä
6bc4444d7c Merge 10.1 into 10.2 2020-05-13 11:12:31 +03:00
Marko Mäkelä
26aab96ecf MDEV-22497 [ERROR] InnoDB: Unable to purge a record
The InnoDB insert buffer was upgraded in MySQL 5.5 into a change
buffer that also covers delete-mark and delete (purge) operations.

There is an important constraint for delete operations: a B-tree
leaf page must not become empty unless the entire tree becomes empty,
consisting of an empty root page. Because change buffer merges only
occur on a single leaf page at a time, delete operations must not be
buffered if it is possible that the last record of the page could be
deleted. (In that case, we would refuse to use the change buffer, and
if we really delete the last record, we would shrink the index tree.)

The function ibuf_get_volume_buffered_hash() is part of our insurance
that the page would not become empty. It is supposed to map each
buffered INSERT or DELETE_MARK record payload into a hash value.
We will only count each such record as a distinct key if there is no
hash collision. DELETE operations will always decrement the predicted
number fo records in the page.

Due to a bug in the function, we would actually compute the hash value
not only on the record payload, but also on some following bytes,
in case the record contains NULL values. In MySQL Bug #61104, we had
some examples of this dating back to 2012. But back then, we failed to
reproduce the bug, and in commit d84c95579b
we simply demoted the hard assertion to a message printout and a debug
assertion failure.

ibuf_get_volume_buffered_hash(): Correctly compute the hash value
of the payload bytes only. Note: we will consider
('foo','bar'),(NULL,'foobar'),('foob','ar') to be equal, but this
is not a problem, because in case of a hash collision, we could
also consider ('boo','far') to be equal, and underestimate the number
of records in the page, leading to refusing to buffer a DELETE.
2020-05-07 20:44:33 +03:00
Oleksandr Byelkin
7fb73ed143 Merge branch '10.2' into 10.3 2020-05-04 16:47:11 +02:00
Daniel Black
ba2061da52 MDEV-21595: innodb offset_t rename to rec_offs
thanks to:

perl -i -pe 's/\boffset_t\b/rec_offs/g' $(git grep -lw offset_t storage/innobase)
2020-04-29 12:02:47 +03:00
Marko Mäkelä
1a9b6c4c7f Merge 10.2 into 10.3 2020-03-30 11:12:56 +03:00
Eugene Kosov
884d22f288 remove fishy reinterpret_cast from buf_page_is_zeroes()
In my micro-benchmarks memcmp(4196) 3 times faster than old
implementation. Also, it's generally better to use as less
reinterpret_casts<> as possible.

buf_is_zeroes(): renamed from buf_page_is_zeroes() and
argument changed to span<> for convenience.

st_::span<T>::const_iterator: fixed

page_zip-verify_checksum(): make argument byte* instead of void*
2020-03-20 21:35:42 +03:00
Marko Mäkelä
3466b47b0d Merge 10.2 into 10.3 2019-12-13 10:08:57 +02:00
Eugene Kosov
f0aa073f2b MDEV-20950 Reduce size of record offsets
offset_t: this is a type which represents one record offset.
It's unsigned short int.

a lot of functions: replace ulint with offset_t

btr_pcur_restore_position_func(),
page_validate(),
row_ins_scan_sec_index_for_duplicate(),
row_upd_clust_rec_by_insert_inherit_func(),
row_vers_impl_x_locked_low(),
trx_undo_prev_version_build():
  allocate record offsets on the stack instead of waiting for rec_get_offsets()
  to allocate it from mem_heap_t. So, reducing  memory allocations.

RECORD_OFFSET, INDEX_OFFSET:
  now it's less convenient to store pointers in offset_t*
  array. One pointer occupies now several offset_t. And those constant are start
  indexes into array to places where to store pointer values

REC_OFFS_HEADER_SIZE: adjusted for the new reality

REC_OFFS_NORMAL_SIZE:
  increase size from 100 to 300 which means less heap allocations.
  And sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) now is 600 bytes which
  is smaller than previous 800 bytes.

REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality

rem0rec.h, rem0rec.ic, rem0rec.cc:
  various arguments, return values and local variables types were changed to
  fix numerous integer conversions issues.

enum field_type_t:
  offset types concept was introduces which replaces old offset flags stuff.
  Like in earlier version, 2 upper bits are used to store offset type.
  And this enum represents those types.

REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed

get_type(), set_type(), get_value(), combine():
  these are convenience functions to work with offsets and it's types

rec_offs_base()[0]:
  still uses an old scheme with flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL

rec_offs_base()[i]:
  these have type offset_t now. Two upper bits contains type.
2019-12-13 00:26:50 +07:00
Jan Lindström
9d9a2253c6 Merge remote-tracking branch 10.2 into 10.3
Conflicts:
	mysql-test/suite/galera/t/galera_binlog_event_max_size_max-master.opt
	mysql-test/suite/innodb/r/innodb-mdev-7513.result
	mysql-test/suite/innodb/t/innodb-mdev-7513.test
	mysql-test/suite/wsrep/disabled.def
	storage/innobase/ibuf/ibuf0ibuf.cc
2019-12-02 14:35:10 +02:00
Marko Mäkelä
a8395853cc MDEV-21152 Bogus debug assertion btr_pcur_is_after_last_in_tree() in ibuf code
As noted in commit abd45cdc38
a search with PAGE_CUR_GE may land on the supremum record on
a leaf page that is not the rightmost leaf page. This could occur
when all keys on the current page are smaller than the search key,
and the smallest key on the successor page is larger than the search key.

Hence, after a failed PAGE_CUR_GE search, assertions
btr_pcur_is_after_last_in_tree() are bogus
and should be replaced with btr_pcur_is_after_last_on_page().
2019-11-26 16:32:51 +02:00
Marko Mäkelä
613e13072c Merge 10.2 into 10.3 2019-11-19 00:38:33 +02:00
Marko Mäkelä
b80df9eba2 MDEV-21069 Crash on DROP TABLE if the data file is corrupted
buf_read_ibuf_merge_pages(): Discard any page numbers that are
outside the current bounds of the tablespace, by invoking the
function ibuf_delete_recs() that was introduced in MDEV-20934.
This could avoid an infinite change buffer merge loop on
innodb_fast_shutdown=0, because normally the change buffer merge
would only be attempted if a page was successfully loaded into
the buffer pool.

dict_drop_index_tree(): Add the parameter trx_t*.
To prevent the DROP TABLE crash, do not invoke btr_free_if_exists()
if the entire .ibd file will be dropped. Thus, we will avoid a crash
if the BTR_SEG_LEAF or BTR_SEG_TOP of the index is corrupted,
and we will also avoid unnecessarily accessing the to-be-dropped
tablespace via the buffer pool.

In MariaDB 10.2, we disable the DROP TABLE fix if innodb_safe_truncate=0,
because the backup-unsafe MySQL 5.7 WL#6501 form of TRUNCATE TABLE
requires that the individual pages be freed inside the tablespace.
2019-11-19 00:07:06 +02:00
Marko Mäkelä
3d4a801533 MDEV-12353 preparation: Replace mtr_x_lock() and friends
Apart from page latches (buf_block_t::lock), mini-transactions
are keeping track of at most one dict_index_t::lock and
fil_space_t::latch at a time, and in a rare case, purge_sys.latch.

Let us introduce interfaces for acquiring an index latch
or a tablespace latch.

In a later version, we may want to introduce mtr_t members
for holding a latched dict_index_t* and fil_space_t*,
and replace the remaining use of mtr_t::m_memo
with std::set<buf_block_t*> or with a map<buf_block_t*,byte*>
pointing to log records.
2019-11-14 11:40:33 +02:00
Marko Mäkelä
abd45cdc38 MDEV-20934: Correct a debug assertion
A search with PAGE_CUR_GE may land on the supremum record on
a leaf page that is not the rightmost leaf page.
This could occur when all keys on the current page are
smaller than the search key, and the smallest key on the
successor page is larger than the search key.

ibuf_delete_recs(): Correct the debug assertion accordingly.
2019-11-13 09:26:10 +02:00
Marko Mäkelä
4fcfdb60e7 Merge 10.2 into 10.3 2019-11-11 14:56:51 +02:00
Marko Mäkelä
29d67d051a Cleanup btr_page_get_prev(), btr_page_get_next()
Remove the redundant parameter mtr_t*.

Make use of page_has_prev(), page_has_next() whenever possible.
2019-11-11 13:36:21 +02:00
Marko Mäkelä
908ca4668d Merge 10.2 into 10.3 2019-11-06 13:14:31 +02:00
Marko Mäkelä
d7a2401750 MDEV-20934 Infinite loop on innodb_fast_shutdown=0 with inconsistent change buffer
Due to a data corruption bug that may have occurred a long time earlier
(possibly involving physical backup and MySQL Bug #69122, which was
addressed in commit f166ec71b7)
it seems possible that the InnoDB change buffer might end up containing
entries, while no buffered changes exist according to the change buffer
bitmap pages in the .ibd files.

ibuf_delete_recs(): New function, to be invoked on slow shutdown only.
Remove all buffered changes for a specific page.

ibuf_merge_or_delete_for_page(): If the change buffer bitmap is clean
and a slow shutdown is in progress, invoke ibuf_delete_recs().
We do not want to do that during normal operation, due to the additional
overhead that is involved. The bitmap page should be consistent with
the change buffer in the first place.
2019-11-06 08:48:48 +02:00
Oleksandr Byelkin
55b2281a5d Merge branch '10.2' into 10.3 2019-10-31 10:58:06 +01:00
Marko Mäkelä
2809842031 MDEV-20864 Introduce debug option innodb_change_buffer_dump
To diagnose a hang in slow shutdown (innodb_fast_shutdown=0),
let us introduce a Boolean startup option in debug builds
that will cause the contents of the InnoDB change buffer
to be dumped to the server error log at startup.
2019-10-19 15:16:47 +03:00
Marko Mäkelä
8e3d85e112 Merge 10.2 into 10.3 2019-10-12 06:34:09 +03:00
Marko Mäkelä
966d97b5f9 Merge 10.1 into 10.2 2019-10-11 18:38:18 +03:00
Marko Mäkelä
cbfd6882f4 Merge 5.5 into 10.1 2019-10-11 15:19:55 +03:00
Marko Mäkelä
ea61b79694 MDEV-20805 ibuf_add_free_page() is not initializing FIL_PAGE_TYPE first
In the function recv_parse_or_apply_log_rec_body() there are debug checks
for validating the state of the page when redo log records are being
applied. Most notably, FIL_PAGE_TYPE should be set before anything else
is being written to the page.

ibuf_add_free_page(): Set FIL_PAGE_TYPE before performing any other changes.
2019-10-11 14:12:36 +03:00
Marko Mäkelä
892378fb9d Merge 10.2 into 10.3 2019-10-09 13:25:11 +03:00
Thirunarayanan Balathandayuthapani
c65cb244b3 MDEV-19335 Remove buf_page_t::encrypted
The field buf_page_t::encrypted was added in MDEV-8588.
It was made mostly redundant in MDEV-12699. Remove the field.
2019-10-09 13:13:12 +03:00
Marko Mäkelä
d480d28f4f Add page_has_prev(), page_has_next(), page_has_siblings()
Until now, InnoDB inefficiently compared the aligned fields
FIL_PAGE_PREV, FIL_PAGE_NEXT to the byte-order-agnostic value FIL_NULL.

This is a backport of 32170f8c6d
from MariaDB Server 10.3.
2019-10-09 08:29:26 +03:00
Marko Mäkelä
da9201dd5b Merge 10.2 into 10.3 2019-09-10 09:25:20 +03:00
Marko Mäkelä
43a6e81ccb MDEV-19514 preparation: Remove innodb_change_buffering_debug=2
The setting innodb_change_buffering_debug=2 was supposed to inject
a crash during change buffer merge. There is no public test for
that functionality, and even if there were, it would be better
to use DEBUG_SYNC to halt the thread that does change buffer merge,
force a redo log flush from another thread, and finally kill the
server externally.
2019-09-09 18:18:52 +03:00
Marko Mäkelä
e82fe21e3a Merge 10.2 into 10.3 2019-07-02 17:46:22 +03:00
Thirunarayanan Balathandayuthapani
723a4b1d78 MDEV-17228 Encrypted temporary tables are not encrypted
- Introduce a new variable called innodb_encrypt_temporary_tables which is
a boolean variable. It decides whether to encrypt the temporary tablespace.
- Encrypts the temporary tablespace based on full checksum format.
- Introduced a new counter to track encrypted and decrypted temporary
tablespace pages.
- Warnings issued if temporary table creation has conflict value with
innodb_encrypt_temporary_tables
- Added a new test case which reads and writes the pages from/to temporary
tablespace.
2019-06-28 19:07:59 +05:30
Marko Mäkelä
be85d3e61b Merge 10.2 into 10.3 2019-05-14 17:18:46 +03:00
Marko Mäkelä
26a14ee130 Merge 10.1 into 10.2 2019-05-13 17:54:04 +03:00
Vicențiu Ciorbaru
f177f125d4 Merge branch '5.5' into 10.1 2019-05-11 19:15:57 +03:00
Vicențiu Ciorbaru
15f1e03d46 Follow-up to changing FSF address
Some places didn't match the previous rules, making the Floor
address wrong.

Additional sed rules:

sed -i -e 's/Place.*Suite .*, Boston/Street, Fifth Floor, Boston/g'
sed -i -e 's/Suite .*, Boston/Fifth Floor, Boston/g'
2019-05-11 18:30:45 +03:00
Marko Mäkelä
acf6f92aa9 Merge 10.2 into 10.3 2019-04-25 09:05:52 +03:00
Marko Mäkelä
d315b4ff39 Remove IBUF_COUNT_DEBUG
The compile-time option IBUF_COUNT_DEBUG has not been used for years.
It would only work with up to 3 created .ibd files, with no buffered
changes existing while InnoDB is started up.
2019-04-19 12:44:46 +03:00
Marko Mäkelä
250799f961 Merge 10.2 into 10.3 2019-04-17 15:26:17 +03:00
Marko Mäkelä
169c00994b MDEV-12699 Improve crash recovery of corrupted data pages
InnoDB crash recovery used to read every data page for which
redo log exists. This is unnecessary for those pages that are
initialized by the redo log. If a newly created page is corrupted,
recovery could unnecessarily fail. It would suffice to reinitialize
the page based on the redo log records.

To add insult to injury, InnoDB crash recovery could hang if it
encountered a corrupted page. We will fix also that problem.
InnoDB would normally refuse to start up if it encounters a
corrupted page on recovery, but that can be overridden by
setting innodb_force_recovery=1.

Data pages are completely initialized by the records
MLOG_INIT_FILE_PAGE2 and MLOG_ZIP_PAGE_COMPRESS.
MariaDB 10.4 additionally recognizes MLOG_INIT_FREE_PAGE,
which notifies that a page has been freed and its contents
can be discarded (filled with zeroes).

The record MLOG_INDEX_LOAD notifies that redo logging has
been re-enabled after being disabled. We can avoid loading
the page if all buffered redo log records predate the
MLOG_INDEX_LOAD record.

For the internal tables of FULLTEXT INDEX, no MLOG_INDEX_LOAD
records were written before commit aa3f7a107c.
Hence, we will skip these optimizations for tables whose
name starts with FTS_.

This is joint work with Thirunarayanan Balathandayuthapani.

fil_space_t::enable_lsn, file_name_t::enable_lsn: The LSN of the
latest recovered MLOG_INDEX_LOAD record for a tablespace.

mlog_init: Page initialization operations discovered during
redo log scanning. FIXME: This really belongs in recv_sys->addr_hash,
and should be removed in MDEV-19176.

recv_addr_state: Add the new state RECV_WILL_NOT_READ to
indicate that according to mlog_init, the page will be
initialized based on redo log record contents.

recv_add_to_hash_table(): Set the RECV_WILL_NOT_READ state
if appropriate. For now, we do not treat MLOG_ZIP_PAGE_COMPRESS
as page initialization. This works around bugs in the crash
recovery of ROW_FORMAT=COMPRESSED tables.

recv_mark_log_index_load(): Process a MLOG_INDEX_LOAD record
by resetting the state to RECV_NOT_PROCESSED and by updating
the fil_name_t::enable_lsn.

recv_init_crash_recovery_spaces(): Copy fil_name_t::enable_lsn
to fil_space_t::enable_lsn.

recv_recover_page(): Add the parameter init_lsn, to ignore
any log records that precede the page initialization.
Add DBUG output about skipped operations.

buf_page_create(): Initialize FIL_PAGE_LSN, so that
recv_recover_page() will not wrongly skip applying
the page-initialization record due to the field containing
some newer LSN as a leftover from a different page.
Do not invoke ibuf_merge_or_delete_for_page() during
crash recovery.

recv_apply_hashed_log_recs(): Remove some unnecessary lookups.
Note if a corrupted page was found during recovery.
After invoking buf_page_create(), do invoke
ibuf_merge_or_delete_for_page() via mlog_init.ibuf_merge()
in the last recovery batch.

ibuf_merge_or_delete_for_page(): Relax a debug assertion.

innobase_start_or_create_for_mysql(): Abort startup if
a corrupted page was found during recovery. Corrupted pages
will not be flagged if innodb_force_recovery is set.
However, the recv_sys->found_corrupt_fs flag can be set
regardless of innodb_force_recovery if file names are found
to be incorrect (for example, multiple files with the same
tablespace ID).
2019-04-17 13:58:41 +03:00
Marko Mäkelä
ee7a4f4462 MDEV-12266: Pass fil_space_t* to fseg_free_page()
fseg_free_page_func(): Avoid an unnecessary tablespace ID lookup.
The callers should pass the tablespace that they already know.
2019-04-08 21:38:43 +03:00
Marko Mäkelä
45531949ae Merge 10.2 into 10.3 2018-12-18 09:15:41 +02:00
Marko Mäkelä
7d245083a4 Merge 10.1 into 10.2 2018-12-17 20:15:38 +02:00
Marko Mäkelä
06e5f28f9f MDEV-12266: Remove a level of pointer indirection
Replace table->space->id with table->space_id.
2018-11-22 17:10:26 +02:00
Marko Mäkelä
fd58bb71e2 Merge 10.2 into 10.3 2018-11-19 18:45:53 +02:00
Marko Mäkelä
ff88e4bb8a Remove many redundant #include from InnoDB 2018-11-19 11:42:14 +02:00