Commit graph

339 commits

Author SHA1 Message Date
Marko Mäkelä
745fd4b39f MDEV-21174: Remove some mlog_write_initial_log_record_fast()
Pass buf_block_t* to more functions that write redo log.

page_zip_write_node_ptr(), page_zip_write_blob_ptr(),
page_zip_compress_write_log_no_data():
Take buf_block_t* as parameter, and do not tolerate mtr=NULL.

page_zip_compress(): Do not tolerate mtr=NULL.

page_zip_dir_insert(): Take page_cur_t* as parameter.

mlog_write_initial_log_record(): Remove. This function was unused.

RecIterator::remove(): Remove the redundant page_zip parameter.

PageConverter::m_page_zip_ptr: Remove.
2019-12-13 18:15:51 +02:00
Marko Mäkelä
8fa759a576 Merge 10.3 into 10.4
We disable the MDEV-21189 test galera.galera_partition
because it times out.
2019-12-13 17:30:37 +02:00
Marko Mäkelä
3466b47b0d Merge 10.2 into 10.3 2019-12-13 10:08:57 +02:00
Eugene Kosov
f0aa073f2b MDEV-20950 Reduce size of record offsets
offset_t: a type that represents one record offset.
It is an unsigned short int.

Many functions: replace ulint with offset_t.

btr_pcur_restore_position_func(),
page_validate(),
row_ins_scan_sec_index_for_duplicate(),
row_upd_clust_rec_by_insert_inherit_func(),
row_vers_impl_x_locked_low(),
trx_undo_prev_version_build():
  allocate record offsets on the stack instead of waiting for rec_get_offsets()
  to allocate them from mem_heap_t, thus reducing memory allocations.

RECORD_OFFSET, INDEX_OFFSET:
  storing pointers in an offset_t* array is now less convenient, because one
  pointer now occupies several offset_t elements. These constants are the
  start indexes into the array where the pointer values are stored.

REC_OFFS_HEADER_SIZE: adjusted for the new reality

REC_OFFS_NORMAL_SIZE:
  increase size from 100 to 300, which means fewer heap allocations.
  sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) is now 600 bytes, which is
  smaller than the previous 800 bytes.

REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality

rem0rec.h, rem0rec.ic, rem0rec.cc:
  the types of various arguments, return values and local variables were
  changed to fix numerous integer conversion issues.

enum field_type_t:
  the concept of offset types was introduced, replacing the old offset flags.
  As in earlier versions, the 2 upper bits are used to store the offset type,
  and this enum represents those types.

REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed

get_type(), set_type(), get_value(), combine():
  convenience functions for working with offsets and their types

rec_offs_base()[0]:
  still uses the old scheme with the flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL

rec_offs_base()[i]:
  these now have type offset_t. The two upper bits contain the type.
2019-12-13 00:26:50 +07:00
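A minimal sketch of the offset_t encoding described above: a 16-bit value whose two upper bits carry the field type and whose lower 14 bits carry the offset. The enumerator values, DATA_MASK and TYPE_MASK are assumptions for illustration, not copied from rem0rec.h. The static_assert restates the 600-byte versus 800-byte arithmetic for REC_OFFS_NORMAL_SIZE.

```cpp
#include <cassert>

// Hypothetical sketch; values are assumptions, not InnoDB's definitions.
typedef unsigned short offset_t;

// The two upper bits of a 16-bit offset_t hold the field type.
enum field_type_t : offset_t {
  STORED_IN_RECORD = 0 << 14,  // ordinary field
  STORED_OFFPAGE = 1 << 14,    // externally stored (off-page) field
  SQL_NULL = 2 << 14,          // SQL NULL
  DEFAULT = 3 << 14            // instantly added column, default value
};

// The lower 14 bits hold the offset value itself.
const offset_t DATA_MASK = (1 << 14) - 1;
const offset_t TYPE_MASK = static_cast<offset_t>(~DATA_MASK);

inline field_type_t get_type(offset_t n) {
  return static_cast<field_type_t>(n & TYPE_MASK);
}

inline offset_t get_value(offset_t n) {
  return static_cast<offset_t>(n & DATA_MASK);
}

// Replace the type bits, keeping the offset value.
inline void set_type(offset_t& n, field_type_t type) {
  n = static_cast<offset_t>((n & DATA_MASK) | type);
}

// Build an offset_t from a value and a type.
inline offset_t combine(offset_t value, field_type_t type) {
  return static_cast<offset_t>(get_value(value) | type);
}

// REC_OFFS_NORMAL_SIZE went from 100 to 300 entries: 300 * 2 = 600 bytes,
// versus 100 * 8 = 800 bytes with a 64-bit ulint.
static_assert(sizeof(offset_t[300]) == 600, "assumes a 2-byte unsigned short");
```

Keeping the type in the same 16-bit word as the offset is what lets one offset_t replace the earlier ulint plus flag bits without growing the array.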
Marko Mäkelä
d3b2625ba0 MDEV-21259 Assertion failed in mtr_t::write()
btr_free_externally_stored_field(): Pass w=mtr_t::OPT to
note that the BTR_EXTERN_LEN is not necessarily changing
when a multi-page ROW_FORMAT=COMPRESSED off-page column
is being freed, and to allow redundant writes to the redo
log to be optimized away.

Ever since commit 56f6dab1d0
the refactored function mtr_t::write() asserts by default
that the page contents are being changed.
2019-12-09 21:11:08 +02:00
Marko Mäkelä
af5947f433 MDEV-21174: Replace mlog_write_string() with mtr_t::memcpy()
mtr_t::memcpy(): Replaces mlog_write_string(), mlog_log_string().
The buf_block_t is passed as a parameter, so that
mlog_write_initial_log_record_low() can be used instead of
mlog_write_initial_log_record_fast().

fil_space_crypt_t::write_page0(): Remove the fil_space_t* parameter.
2019-12-03 11:05:19 +02:00
Marko Mäkelä
87839258f8 MDEV-21174: Replace mlog_memset() with mtr_t::memset()
Passing buf_block_t helps us avoid calling
mlog_write_initial_log_record_fast() and page_get_page_no(),
and allows us to implement more debug checks, such as
that on ROW_FORMAT=COMPRESSED index pages, only the page header
may be modified by MLOG_MEMSET records.

fseg_n_reserved_pages(): Add a buf_block_t parameter.
2019-12-03 11:05:19 +02:00
Marko Mäkelä
caea64df18 Cleanup: Remove some page_get_page_no() calls
Refer to buf_page_t::id instead of parsing the tablespace identifier
or page number from the buffer pool page.
2019-12-03 11:05:19 +02:00
Marko Mäkelä
56f6dab1d0 MDEV-21174: Replace mlog_write_ulint() with mtr_t::write()
mtr_t::write(): Replaces mlog_write_ulint(), mlog_write_ull().
Optimize away writes if the page contents do not change,
except when a dummy write has been explicitly requested.

Because the member function template takes a block descriptor as a
parameter, it is possible to introduce better consistency checks.
Due to this, the code for handling file-based lists, undo logs
and user transactions was refactored to pass around buf_block_t.
2019-12-03 11:05:18 +02:00
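A sketch of the write-elision idea: compare the new big-endian bytes with the page and skip the log record when nothing changes, unless the caller forces a write. block_t, mini_transaction, log_record and the FORCED mode are hypothetical stand-ins (only the OPT name appears in this log), not the real mtr_t interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for buf_block_t and mtr_t; not InnoDB's real types.
struct block_t {
  uint8_t frame[64] = {};
};

struct log_record {
  size_t offset;
  unsigned len;
  uint64_t value;
};

struct mini_transaction {
  // NORMAL: the caller claims the bytes change (asserted).
  // OPT: a write that changes nothing may be skipped.
  // FORCED: always log, e.g. an explicitly requested dummy write.
  enum write_type { NORMAL, OPT, FORCED };

  std::vector<log_record> log;

  // Write a len-byte big-endian value; return whether a record was logged.
  template<unsigned len>
  bool write(block_t& block, size_t offset, uint64_t val, write_type w = NORMAL) {
    static_assert(len == 1 || len == 2 || len == 4 || len == 8, "bad length");
    uint8_t buf[len];
    for (unsigned i = 0; i < len; i++)
      buf[i] = static_cast<uint8_t>(val >> (8 * (len - 1 - i)));
    if (std::memcmp(block.frame + offset, buf, len) == 0) {
      assert(w != NORMAL);  // the default mode expects a change
      if (w == OPT)
        return false;       // unchanged: no redo log record
    }
    std::memcpy(block.frame + offset, buf, len);
    log.push_back(log_record{offset, len, val});
    return true;
  }
};
```

Because the block is passed in, such a function can inspect the page before logging, which is what makes both the elision and the extra consistency checks possible.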
Marko Mäkelä
cd92c6c83d MDEV-12353 preparation: Do not write MLOG_REC_MIN_MARK
btr_set_min_rec_mark(): Write MLOG_1BYTE instead of
MLOG_REC_MIN_MARK or MLOG_COMP_REC_MIN_MARK.

On ROW_FORMAT=COMPRESSED pages, the minimum record flag is not stored
at all. The flag is computed for the uncompressed page by
page_zip_decompress(). Hence, nothing needs to be logged for
ROW_FORMAT=COMPRESSED tables for this operation.

To facilitate crash-upgrade and hot backup from older versions,
we will retain the code to parse and apply the old log record types
MLOG_REC_MIN_MARK and MLOG_COMP_REC_MIN_MARK.
2019-12-03 11:05:18 +02:00
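Setting the minimum record flag only touches one byte of the record header, which is why a generic one-byte write (MLOG_1BYTE) can replace the dedicated record types. A sketch, assuming the info bits sit 5 bytes before the record origin in the compact formats and 6 bytes before it in ROW_FORMAT=REDUNDANT, with REC_INFO_MIN_REC_FLAG = 0x10. These constants are recalled assumptions, and the helper is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed constants; verify against rem0rec.h before relying on them.
const uint8_t REC_INFO_MIN_REC_FLAG = 0x10;  // one of the upper 4 info bits
const unsigned REC_NEW_INFO_BITS = 5;        // COMPACT/DYNAMIC
const unsigned REC_OLD_INFO_BITS = 6;        // REDUNDANT

// Hypothetical helper: the single byte that holds the info bits.
inline uint8_t* info_bits_byte(uint8_t* rec, bool comp) {
  return rec - (comp ? REC_NEW_INFO_BITS : REC_OLD_INFO_BITS);
}

// Set the flag; a caller would log this as one 1-byte write.
inline void set_min_rec_mark(uint8_t* rec, bool comp) {
  uint8_t* b = info_bits_byte(rec, comp);
  *b = static_cast<uint8_t>(*b | REC_INFO_MIN_REC_FLAG);
}
```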
Marko Mäkelä
ddbbf97670 Merge 10.4 into 10.5 2019-11-27 06:29:14 +02:00
Marko Mäkelä
3eda03d0fe MDEV-21148: Assertion index->n_core_fields + n_add >= index->n_fields
Revert part of commit 6cedb671e9
because it turns out to be theoretically impossible to parse a
ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC metadata record where
the variable-length fields in the PRIMARY KEY have been written
as nonempty strings.
2019-11-26 20:46:25 +02:00
Marko Mäkelä
5b686af2ec Merge 10.4 into 10.5 2019-11-20 15:47:16 +02:00
Marko Mäkelä
6cedb671e9 MDEV-21088 Table cannot be loaded after instant ADD/DROP COLUMN
btr_cur_instant_init_low(): Accurately parse the metadata record
header for ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPACT. CHAR columns
used to be unnecessarily written as nonempty strings of bytes.
2019-11-20 14:12:53 +08:00
Vladislav Vaintroub
5e62b6a5e0 MDEV-16264 Use threadpool for Innodb background work.
Almost all threads are gone:
- The "ticking" threads that sleep a while and then do some work
(srv_monitor_thread, srv_error_monitor_thread, srv_master_thread)
were replaced with timers. Some timers are periodic,
e.g. the "master" timer.

- The btr_defragment_thread is also replaced by a timer, which
reschedules itself when the current defragment "item" needs throttling.

- The buf_resize_thread and buf_dump_threads were replaced with tasks.
Ditto for the page cleaner workers.

- Purge worker threads are now tasks as well, and the purge
coordinator is a combination of a task and a timer.

- All AIO is outsourced to tpool. InnoDB just calls thread_pool::submit_io()
and provides the callback.

- srv_slot_t was removed, and innodb_debug_sync, which is used in purge,
currently does not work and needs to be reimplemented.
2019-11-15 18:09:30 +01:00
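To illustrate replacing "ticking" threads with timers, here is a minimal sketch in which one service thread drives several periodic callbacks from a deadline-ordered queue. timer_service and add_periodic are hypothetical names. This is not tpool's API, and a real pool would hand callbacks to worker threads rather than run them on the timer thread.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: one thread runs many periodic jobs.
class timer_service {
  using clock = std::chrono::steady_clock;
  struct entry {
    clock::time_point due;
    std::chrono::milliseconds period;
    std::function<void()> callback;
    bool operator>(const entry& other) const { return due > other.due; }
  };

public:
  timer_service() : m_thread([this] { run(); }) {}

  ~timer_service() {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_stop = true;
    }
    m_cv.notify_all();
    m_thread.join();
  }

  // Run callback every period, starting one period from now.
  void add_periodic(std::chrono::milliseconds period, std::function<void()> callback) {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_queue.push(entry{clock::now() + period, period, std::move(callback)});
    }
    m_cv.notify_all();
  }

private:
  void run() {
    std::unique_lock<std::mutex> lock(m_mutex);
    while (!m_stop) {
      if (m_queue.empty()) {
        m_cv.wait(lock);
        continue;
      }
      const clock::time_point due = m_queue.top().due;  // copy: queue may change
      if (clock::now() < due) {
        m_cv.wait_until(lock, due);  // wake at deadline, new timer, or shutdown
        continue;
      }
      entry e = m_queue.top();
      m_queue.pop();
      lock.unlock();
      e.callback();  // do the work outside the lock
      lock.lock();
      e.due += e.period;  // reschedule the periodic timer
      m_queue.push(std::move(e));
    }
  }

  std::mutex m_mutex;
  std::condition_variable m_cv;
  std::priority_queue<entry, std::vector<entry>, std::greater<entry>> m_queue;
  bool m_stop = false;
  std::thread m_thread;  // declared last so it starts after the other members
};
```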
Marko Mäkelä
786b004972 Cleanup: More use of mtr_memo_type_t 2019-11-15 14:55:38 +02:00
Marko Mäkelä
ae90f8431b Merge 10.4 into 10.5 2019-11-14 14:49:20 +02:00
Marko Mäkelä
89ae01fd00 Merge 10.3 into 10.4 2019-11-14 13:23:36 +02:00
Marko Mäkelä
3d4a801533 MDEV-12353 preparation: Replace mtr_x_lock() and friends
Apart from page latches (buf_block_t::lock), mini-transactions
keep track of at most one dict_index_t::lock and
fil_space_t::latch at a time, and in a rare case, purge_sys.latch.

Let us introduce interfaces for acquiring an index latch
or a tablespace latch.

In a later version, we may want to introduce mtr_t members
for holding a latched dict_index_t* and fil_space_t*,
and replace the remaining use of mtr_t::m_memo
with std::set<buf_block_t*> or with a map<buf_block_t*,byte*>
pointing to log records.
2019-11-14 11:40:33 +02:00
Marko Mäkelä
0117d0e65a Merge 10.4 into 10.5 2019-11-11 15:21:58 +02:00
Marko Mäkelä
3da895a736 Merge 10.3 into 10.4 2019-11-11 15:03:46 +02:00
Marko Mäkelä
4fcfdb60e7 Merge 10.2 into 10.3 2019-11-11 14:56:51 +02:00
Marko Mäkelä
98e1d603bf MDEV-21024: Optimize writing BTR_EXTERN_LEN
btr_store_big_rec_extern_fields(): Remove the redundant initialization
of the most significant 32 bits of BTR_EXTERN_LEN. InnoDB never supported
BLOBs that are longer than 4GiB. In fact, dtuple_convert_big_rec()
would emit an error message if a clustered index record tuple would
exceed 1,000,000,000 bytes in length.

The BTR_EXTERN_LEN in the BLOB pointers in clustered index leaf page
records is zero-initialized at least since
commit 41bb3537ba
2019-11-11 14:14:26 +02:00
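A sketch of why writing only the low half suffices: if the 8-byte big-endian BTR_EXTERN_LEN field starts out zero and BLOBs stay below 4 GiB, writing the least significant 32 bits yields the full length. The 20-byte pointer size and offset 12 for BTR_EXTERN_LEN are recalled assumptions, the helpers are hypothetical, and the sketch ignores any flag bits InnoDB may keep in that field.

```cpp
#include <cassert>
#include <cstdint>

// Assumed BLOB pointer layout; verify against btr0cur.h.
const unsigned BTR_EXTERN_FIELD_REF_SIZE = 20;
const unsigned BTR_EXTERN_LEN = 12;  // 8-byte big-endian length

inline void write_be32(uint8_t* b, uint32_t v) {
  b[0] = static_cast<uint8_t>(v >> 24);
  b[1] = static_cast<uint8_t>(v >> 16);
  b[2] = static_cast<uint8_t>(v >> 8);
  b[3] = static_cast<uint8_t>(v);
}

inline uint64_t read_be64(const uint8_t* b) {
  uint64_t v = 0;
  for (int i = 0; i < 8; i++)
    v = (v << 8) | b[i];
  return v;
}

// Hypothetical helper: the upper 32 bits are already zero, so only the
// lower 32 bits (bytes 4..7 of the field) are written.
inline void store_extern_len(uint8_t* field_ref, uint32_t len) {
  write_be32(field_ref + BTR_EXTERN_LEN + 4, len);
}
```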
Marko Mäkelä
29d67d051a Cleanup btr_page_get_prev(), btr_page_get_next()
Remove the redundant parameter mtr_t*.

Make use of page_has_prev(), page_has_next() whenever possible.
2019-11-11 13:36:21 +02:00
Marko Mäkelä
a6d614fb4a MDEV-12353 preparation: Remove redundant writes
fsp_alloc_seg_inode_page(): Ever since
commit 3926673ce7
all newly allocated pages are zero-initialized.
Assert that this is the case for the FSEG_ID fields.
(Side note: before that fix, other parts of the pages
could contain nonzero garbage.)

btr_store_big_rec_extern_fields(): Remove the redundant initialization
of the most significant 32 bits of BTR_EXTERN_LEN. InnoDB never supported
BLOBs that are longer than 4GiB. In fact, dtuple_convert_big_rec()
would emit an error message if a clustered index record tuple would
exceed 1,000,000,000 bytes in length.
2019-11-08 11:04:26 +02:00
Marko Mäkelä
52246dff2c Merge 10.4 into 10.5 2019-11-08 09:43:41 +02:00
Marko Mäkelä
8a5eb4141b MDEV-17138 follow-up: Use MLOG_MEMSET for writing FIL_NULL
Always use the MLOG_MEMSET record for writing FIL_NULL,
because it is more compact.
2019-11-08 09:00:10 +02:00
Oleksandr Byelkin
3ad37ed0eb Merge 10.4 into 10.5 2019-11-07 08:52:30 +01:00
Marko Mäkelä
64a02e4fa2 MDEV-19586: Add const qualifiers
Except for fil_name_process(), which invokes os_normalize_path(),
the redo log record parser will not modify the redo log records.
Add const qualifiers accordingly.
2019-11-04 09:25:26 +02:00
Marko Mäkelä
ec40980ddd Merge 10.3 into 10.4 2019-11-01 15:23:18 +02:00
Marko Mäkelä
0b9cee2cbf Merge 10.2 into 10.3 2019-10-18 09:05:27 +03:00
Marko Mäkelä
fa32d28f2f MDEV-20852 BtrBulk is unnecessarily holding dict_index_t::lock
The BtrBulk class, which was introduced in MySQL 5.7, is by design
the exclusive writer to an index. It is therefore unnecessary to
acquire the dict_index_t::lock in that code.

Holding the dict_index_t::lock would unnecessarily block other threads
(SQL connections and the InnoDB purge threads) from buffering concurrent
modifications to secondary indexes that are being created.

This fix is motivated by a change in MySQL 5.7.28:
Bug #29008298 MYSQLD CRASHES ITSELF WHEN CREATING INDEX
mysql/mysql-server@f9fb96c20f

PageBulk::init(), PageBulk::latch(): Never acquire m_index->lock.

PageBulk::storeExt(): Remove some pointer indirection, and improve
a debug assertion that seems to prove that some code is redundant.

BtrBulk::pageCommit(): Assert that m_index->lock is not being held.

btr_blob_log_check_t: Do not acquire m_index->lock if
m_op == BTR_STORE_INSERT_BULK. Add UNIV_UNLIKELY hints around
that condition.

btr_store_big_rec_extern_fields(): Allow index->lock not to be held
while op == BTR_STORE_INSERT_BULK. Add UNIV_UNLIKELY hints around
that condition.
2019-10-17 14:04:07 +03:00
Marko Mäkelä
b42294bc64 MDEV-19514 Defer change buffer merge until pages are requested
We will remove the InnoDB background operation of merging buffered
changes to secondary index leaf pages. Changes will only be merged as a
result of an operation that accesses a secondary index leaf page,
such as a SQL statement that performs a lookup via that index,
or modifies the index. Also, ROLLBACK and some background operations,
such as purging the history of committed transactions, or computing
index cardinality statistics, can cause change buffer merge.
Encryption key rotation will not perform change buffer merge.

The motivation of this change is to simplify the I/O logic and to
allow crash recovery to happen in the background (MDEV-14481).
We also hope that this will reduce the number of "mystery" crashes
due to corrupted data. Because change buffer merge will typically
take place as a result of executing SQL statements, there should be
a clearer connection between the crash and the SQL statements that
were executed when the server crashed.

In many cases, a slight performance improvement was observed.

This is joint work with Thirunarayanan Balathandayuthapani
and was tested by Axel Schwenke and Matthias Leich.

The InnoDB monitor counter innodb_ibuf_merge_usec will be removed.

On slow shutdown (innodb_fast_shutdown=0), we will continue to
merge all buffered changes (and purge all undo log history).

Two InnoDB configuration parameters will be changed as follows:

innodb_disable_background_merge: Removed.
This parameter existed only in debug builds.
All change buffer merges will use synchronous reads.

innodb_force_recovery will be changed as follows:
* innodb_force_recovery=4 will be the same as innodb_force_recovery=3
(the change buffer merge cannot be disabled; it can only happen as
a result of an operation that accesses a secondary index leaf page).
The option used to be capable of corrupting secondary index leaf pages.
Now that capability is removed, and innodb_force_recovery=4 becomes 'safe'.
* innodb_force_recovery=5 (which essentially hard-wires
SET GLOBAL TRANSACTION ISOLATION LEVEL READ UNCOMMITTED)
becomes safe to use. Bogus data can be returned to SQL, but
persistent InnoDB data files will not be corrupted further.
* innodb_force_recovery=6 (ignore the redo log files)
will be the only option that can potentially cause
persistent corruption of InnoDB data files.

Code changes:

buf_page_t::ibuf_exist: New flag, to indicate whether buffered
changes exist for a buffer pool page. Pages with pending changes
can be returned by buf_page_get_gen(). Previously, the changes
were always merged inside buf_page_get_gen() if needed.

ibuf_page_exists(const buf_page_t&): Check whether buffered changes
exist for an X-latched or read-fixed page.

buf_page_get_gen(): Add the parameter allow_ibuf_merge=false.
All callers that know that they may be accessing a secondary index
leaf page must pass this parameter as allow_ibuf_merge=true,
unless it does not matter for that caller whether all buffered
changes have been applied. Assert that whenever allow_ibuf_merge
holds, the page actually is a leaf page. Attempt change buffer
merge only to secondary B-tree index leaf pages.

btr_block_get(): Add parameter 'bool merge'.
All callers of btr_block_get() should know whether the page could be
a secondary index leaf page. If it is not, we should avoid consulting
the change buffer bitmap to even consider a merge. This is the main
interface to requesting index pages from the buffer pool.

ibuf_merge_or_delete_for_page(), recv_recover_page(): Replace
buf_page_get_known_nowait() with much simpler logic, because
it is now guaranteed that the block is x-latched or read-fixed.

mlog_init_t::mark_ibuf_exist(): Renamed from mlog_init_t::ibuf_merge().
On crash recovery, we will no longer merge any buffered changes
for the pages that we read into the buffer pool during the last batch
of applying log records.

buf_page_get_gen_known_nowait(), BUF_MAKE_YOUNG, BUF_KEEP_OLD: Remove.

btr_search_guess_on_hash(): Merge buf_page_get_gen_known_nowait()
to its only remaining caller.

buf_page_make_young_if_needed(): Define as an inline function.
Add the parameter buf_pool.

buf_page_peek_if_young(), buf_page_peek_if_too_old(): Add the
parameter buf_pool.

fil_space_validate_for_mtr_commit(): Remove a bogus comment
about background merge of the change buffer.

btr_cur_open_at_rnd_pos_func(), btr_cur_search_to_nth_level_func(),
btr_cur_open_at_index_side_func(): Use narrower data types and scopes.

ibuf_read_merge_pages(): Replaces buf_read_ibuf_merge_pages().
Merge the change buffer by invoking buf_page_get_gen().
2019-10-11 17:28:15 +03:00
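A toy model of merging on demand: buffered changes wait until a caller fetches a secondary index leaf page with allow_ibuf_merge=true, and there is no background merge. page_t, buffer_pool and get() are hypothetical. Only the names ibuf_exist and allow_ibuf_merge come from the commit message, and the sketch mimics only the control flow of buf_page_get_gen().

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical model, not InnoDB code.
struct page_t {
  bool is_leaf = false;
  bool is_secondary = false;
  bool ibuf_exist = false;  // buffered changes are pending
  std::vector<int> records;
};

struct buffer_pool {
  std::map<uint32_t, page_t> pages;
  std::map<uint32_t, std::vector<int>> change_buffer;  // page number -> pending inserts

  // Buffer an insert instead of reading the page.
  void buffer_insert(uint32_t page_no, int rec) {
    change_buffer[page_no].push_back(rec);
    pages[page_no].ibuf_exist = true;
  }

  // Fetch a page; merge pending changes only if the caller asks for it.
  page_t& get(uint32_t page_no, bool allow_ibuf_merge = false) {
    page_t& page = pages[page_no];
    if (allow_ibuf_merge) {
      assert(page.is_leaf);  // merges are requested only for leaf pages
      if (page.is_secondary && page.ibuf_exist) {
        auto it = change_buffer.find(page_no);
        if (it != change_buffer.end()) {
          page.records.insert(page.records.end(), it->second.begin(), it->second.end());
          change_buffer.erase(it);
        }
        page.ibuf_exist = false;
      }
    }
    return page;
  }
};
```

A caller that does not care whether buffered changes are applied can fetch with the default false and still see ibuf_exist set.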
Marko Mäkelä
d04f2de80a Merge 10.4 into 10.5 2019-10-11 08:41:36 +03:00
Marko Mäkelä
09afd3da1a Merge 10.3 into 10.4 2019-10-10 21:30:40 +03:00
Marko Mäkelä
7f84e3ad75 Merge 10.2 into 10.3 2019-10-10 20:38:44 +03:00
Marko Mäkelä
6d7a826953 MDEV-20788: Bogus assertion failure for PAGE_FREE list
In MDEV-11369 (instant ADD COLUMN) in MariaDB Server 10.3,
we introduced the hidden metadata record that must be the
first record in the clustered index if and only if
index->is_instant() holds.

To catch MDEV-19783, in
commit ed0793e096 and
commit 99dc40d6ac
we added some assertions to find cases where
the metadata record is missing while it should not be, or a
record exists when it should not. Those assertions were invalid
when traversing the PAGE_FREE list. That list can contain anything;
we only need to be able to determine the successor and the size of
each garbage record in it.

page_validate(), page_simple_validate_old(), page_simple_validate_new():
Do not invoke page_rec_get_next_const() for traversing the PAGE_FREE
list, but instead use a lower-level accessor that does not attempt to
validate the REC_INFO_MIN_REC_FLAG.

page_copy_rec_list_end_no_locks(),
page_copy_rec_list_start(), page_delete_rec_list_start():
Add assertions.

btr_page_get_split_rec_to_left(): Remove a redundant return value,
and make the output parameter the return value.

btr_page_get_split_rec_to_right(), btr_page_split_and_insert(): Clean up.
2019-10-10 20:29:30 +03:00
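The point about the PAGE_FREE list is that a traversal may rely only on each garbage record's successor and size. A sketch of such a walk over a hypothetical layout, not InnoDB's record format: each garbage record starts with a 2-byte big-endian absolute offset of the next one (0 ends the list), followed by a 2-byte big-endian size. No other record contents are interpreted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: [next offset (2 bytes)][record size (2 bytes)]...
inline uint16_t read_be16(const uint8_t* b) {
  return static_cast<uint16_t>(b[0] << 8 | b[1]);
}

// Sum the sizes of the garbage records, reading nothing but the
// successor pointer and the size of each one.
inline size_t free_list_bytes(const uint8_t* page, size_t page_size,
                              uint16_t head, size_t* n_recs) {
  size_t total = 0, n = 0;
  for (uint16_t rec = head; rec != 0; rec = read_be16(page + rec)) {
    assert(rec + 4u <= page_size);  // stay inside the page
    assert(n < page_size);          // crude guard against cycles
    total += read_be16(page + rec + 2);
    n++;
  }
  *n_recs = n;
  return total;
}
```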
Marko Mäkelä
c11e5cdd12 Merge 10.3 into 10.4 2019-10-10 11:19:25 +03:00
Marko Mäkelä
892378fb9d Merge 10.2 into 10.3 2019-10-09 13:25:11 +03:00
Eugene Kosov
ed0793e096 MDEV-19783: Add more REC_INFO_MIN_REC_FLAG checks
btr_cur_pessimistic_delete(): The code was changed in a way that allows
more REC_INFO_MIN_REC_FLAG assertions to be added inside btr_set_min_rec_mark().
Without that change, the tests innodb.innodb-table-online,
innodb.temp_table_savepoint and innodb_zip.prefix_index_liftedlimit fail.

Removed essentially duplicated page_zip_validate() calls,
which failed because of a temporary(!) invariant violation.
That fixed innodb_zip.wl5522_debug_zip and
innodb_zip.prefix_index_liftedlimit.
2019-10-09 08:29:26 +03:00
Marko Mäkelä
d480d28f4f Add page_has_prev(), page_has_next(), page_has_siblings()
Until now, InnoDB inefficiently compared the aligned fields
FIL_PAGE_PREV, FIL_PAGE_NEXT to the byte-order-agnostic value FIL_NULL.

This is a backport of 32170f8c6d
from MariaDB Server 10.3.
2019-10-09 08:29:26 +03:00
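Because FIL_NULL has all bits set, its byte representation is identical in either byte order, so a raw load can be compared against it without big-endian decoding. A sketch, assuming FIL_PAGE_PREV = 8 and FIL_PAGE_NEXT = 12 (recalled values, treat as assumptions). The page_has_siblings() variant relies on the two fields being adjacent.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed page header offsets; verify against fil0fil.h.
const unsigned FIL_PAGE_PREV = 8;   // 4-byte predecessor page number
const unsigned FIL_PAGE_NEXT = 12;  // 4-byte successor page number
const uint32_t FIL_NULL = 0xFFFFFFFF;

// No byte swapping is needed to compare against an all-ones value.
inline bool page_has_prev(const uint8_t* page) {
  uint32_t v;
  std::memcpy(&v, page + FIL_PAGE_PREV, sizeof v);  // memcpy avoids aliasing issues
  return v != FIL_NULL;
}

inline bool page_has_next(const uint8_t* page) {
  uint32_t v;
  std::memcpy(&v, page + FIL_PAGE_NEXT, sizeof v);
  return v != FIL_NULL;
}

// The two fields are adjacent, so one 8-byte load checks both.
inline bool page_has_siblings(const uint8_t* page) {
  uint64_t v;
  std::memcpy(&v, page + FIL_PAGE_PREV, sizeof v);
  return v != ~uint64_t(0);
}
```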
Marko Mäkelä
a340af9223 btr_block_get(): Remove redundant parameters 2019-09-25 16:08:48 +03:00
Marko Mäkelä
5d0bab47fc btr_block_get(), btr_block_get_func(): Change the parameter to
const dict_index_t&

btr_level_list_remove(): Clean up the parameters. Renamed from
btr_level_list_remove_func().
2019-09-25 13:34:49 +03:00
Marko Mäkelä
60c04be659 Merge 10.3 into 10.4 2019-09-12 12:16:40 +03:00
Marko Mäkelä
0fa5ad3acf Merge 10.2 into 10.3 2019-09-11 16:42:01 +03:00
Marko Mäkelä
0f950e53f0 MDEV-20562 btr_cur_open_at_rnd_pos() fails to return error for corrupted page
In mysql-server/commit@f46329044f
the InnoDB function btr_cur_open_at_rnd_pos() was corrected so that
it would return a status that indicates whether the cursor was
successfully positioned. But this change was not correctly merged to
MariaDB in 2e814d4702.

btr_cur_open_at_rnd_pos(): In the code path that was introduced in
MDEV-8588, properly return failure status.

No deterministic test case was found for this failure.
It was caught after removing the function
page_copy_rec_list_end_to_created_page() in a development branch.
As a result, the fill factor of index trees would improve, and
supposedly, so would the probability of btr_cur_open_at_rnd_pos()
reaching the intentionally corrupted page in the test
innodb.leaf_page_corrupted_during_recovery.
The wrong return value would cause
btr_estimate_number_of_different_key_vals() to wrongly invoke
btr_rec_get_externally_stored_len() on a non-leaf page and
trigger an assertion failure at the start of that function.
2019-09-11 15:30:19 +03:00
Eugene Kosov
4c7a743964 Merge 10.3 into 10.4 2019-07-26 15:22:31 +03:00
Eugene Kosov
29df1003d9 MDEV-20184 data race at global counter btr_cur_n_non_sea
Make all accesses to btr_cur_n_non_sea atomic.
2019-07-26 13:52:52 +03:00
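A plain counter incremented with ++ from several threads is a data race. Making every access atomic fixes that. Below is a sketch using std::atomic with relaxed ordering, which is typically sufficient for a statistics counter because no other memory accesses depend on it. The counter name mirrors the commit. The size_t type, the helpers and the memory ordering are assumptions, not necessarily what InnoDB uses.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch of an atomic global statistics counter.
std::atomic<size_t> btr_cur_n_non_sea{0};

// Each increment is indivisible; relaxed ordering adds no fences.
inline void count_non_sea() {
  btr_cur_n_non_sea.fetch_add(1, std::memory_order_relaxed);
}

inline size_t read_non_sea() {
  return btr_cur_n_non_sea.load(std::memory_order_relaxed);
}

// Start n_threads threads that each bump the counter n_per_thread times.
inline void hammer(unsigned n_threads, unsigned n_per_thread) {
  std::vector<std::thread> threads;
  for (unsigned t = 0; t < n_threads; t++)
    threads.emplace_back([n_per_thread] {
      for (unsigned i = 0; i < n_per_thread; i++)
        count_non_sea();
    });
  for (auto& th : threads)
    th.join();
}
```

With a plain non-atomic counter, the same workload could lose increments. With the atomic counter, hammer(4, 100000) always leaves exactly 400000.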
Marko Mäkelä
09e9f884f1 MDEV-20048 Assertion 'n < tuple->n_fields' on ROLLBACK after DROP COLUMN
btr_push_update_extern_fields(): Add a parameter for the original number
of fields in the record before btr_cur_trim(). Assume that this function
will only be called for the clustered index, which is the only index
that can contain off-page columns.

trx_undo_prev_version_build(), btr_cur_pessimistic_update():
Only invoke btr_push_update_extern_fields() for the clustered index.
2019-07-19 18:13:36 +03:00
Marko Mäkelä
7a3d34d645 Merge 10.3 into 10.4 2019-07-02 21:44:58 +03:00