In MySQL 5.7.8 an extra level of pointer indirection was added to
dict_operation_lock and some other rw_lock_t without solid justification,
in mysql/mysql-server@52720f1772.
Let us revert that change and remove the rather useless rw_lock_t
constructor and destructor and the magic_n field. In this way,
some unnecessary pointer dereferences and heap allocation will be avoided
and debugging might be a little easier.
Some places didn't match the previous rules, making the Floor
address wrong.
Additional sed rules:
sed -i -e 's/Place.*Suite .*, Boston/Street, Fifth Floor, Boston/g'
sed -i -e 's/Suite .*, Boston/Fifth Floor, Boston/g'
row_purge_upd_exist_or_extern_func(): Check for node->vcol_op_failed()
after row_purge_remove_sec_if_poss(), like row_purge_del_mark() did.
This avoids us dereferencing the node->table=NULL pointer.
The test case, submitted by Elena Stepanova, is not deterministic and
does not repeat the bug on 10.2. With the added loop, for me, it reliably
crashes 10.3 without the fix. I was unable to create a deterministic
test case for either 10.2 or 10.3.
Reviewed by Thirunarayanan Balathandayuthapani
buf_dblwr_process(): Remove the useless warning that a copy of a page
in the doublewrite buffer is corrupted. We already report an error if a
corrupted page cannot be recovered from the doublewrite buffer.
Note: In MariaDB 10.1, the original bug reported in MDEV-13893 could
still be easily repeatable. In MariaDB 10.2.24, MDEV-12699 should
have reduced the probability considerably.
fts_get_table_name(): Output to a caller-allocated buffer.
fts_get_table_name_prefix(): Use the lower-overhead allocation
ut_malloc() instead of mem_alloc().
This is based on mysql/mysql-server@d1584b9f38
in MySQL 5.7.4.
fts_table_t::parent: Remove the redundant field. Refer to
table->name.m_name instead.
fts_update_sync_doc_id(), fts_update_next_doc_id(): Remove
the redundant parameter table_name.
fts_get_table_name_prefix(): Access the dict_table_t::name.
FIXME: Ensure that this access is always covered by
dict_sys->mutex.
fts_state_t, fts_slot_t::state: Remove. Replaced by fts_slot_t::running
and fts_slot_t::table_id as follows.
FTS_STATE_SUSPENDED: Removed (unused).
FTS_STATE_EMPTY: Removed. table_id=0 will denote empty slots.
FTS_STATE_RUNNING: Equivalent to running=true.
FTS_STATE_LOADED, FTS_STATE_DONE: Equivalent to running=false.
fts_slot_t::table: Remove. Tables will be identified by table_id.
After opening a table, we will check fil_table_accessible() before
accessing the data.
fts_optimize_new_table(), fts_optimize_del_table(),
fts_optimize_how_many(), fts_is_sync_needed():
Remove the parameter tables, and use the static variable fts_slots
(which was introduced in MariaDB 10.2) instead.
No functional change.
Call my_timer_init() only once and then reuse it from InnoDB and
perfschema storage engines.
This patch speeds up empty test for me like this:
./mtr -mem innodb.kevg,xtradb 1.21s user 0.84s system 34% cpu 5.999 total
./mtr -mem innodb.kevg,xtradb 1.12s user 0.60s system 31% cpu 5.385 total
A sequel to 9180e86 and 149b754.
ALTER TABLE ... ADD FOREIGN KEY may crash if parent table is updated
concurrently.
Block FK parent table updates even earlier, before intermediate child
table is created.
Use proper charset info for my_casedn_str() and don't update original
identifiers so that lower_cast_table_names == 2 is honoured.
ReadView::copy_trx_ids(): Relax a debug check. It failed to account for
TRX_STATE_PREPARED_RECOVERED, which was introduced in MDEV-15772.
It was also reading trx->state twice and failed to tolerate
TRX_STATE_COMMITTED_IN_MEMORY, which could be concurrently assigned
in lock_trx_release_locks(), which is not holding trx_sys->mutex.
This bug is specific to the MariaDB 10.2 series. The ReadView was
introduced in MariaDB 10.2.2 by merging the code that had been
introduced in MySQL 5.7.2. In MariaDB 10.3, ReadView::snapshot()
would use the lock-free trx_sys.rw_trx_hash. MDEV-14638 moved the
corresponding assertion to trx_sys_t::find(), where it was duly
protected by trx->mutex, and later MDEV-14756 moved the check to
rw_trx_hash_t::validate_element(). This check was correctly adjusted
when MDEV-15772 was merged to 10.3.
The problem is in rtr_adjust_upper_level(), which allocates node_ptr
from heap, and then passes the same heap to btr_cur_pessimistic_insert().
The documentation of btr_cur_pessimistic_insert() says that the heap can
be emptied. If the heap is emptied and something else is allocated from
the heap, the node_ptr can become corrupted.
No functional change.
Call my_timer_init() only once and then reuse it from InnoDB and
perfschema storage engines.
This patch speeds up empty test for me like this:
./mtr -mem innodb.kevg,xtradb 1.21s user 0.84s system 34% cpu 5.999 total
./mtr -mem innodb.kevg,xtradb 1.12s user 0.60s system 31% cpu 5.385 total
The accessor dtuple_get_nth_v_field() was defined differently between
debug and release builds in MySQL 5.7.8 in
mysql/mysql-server@c47e1751b7
and a debug assertion to document or enforce the questionable assumption
tuple->v_fields == &tuple->fields[tuple->n_fields] was missing.
This was apparently no problem until MDEV-11369 introduced instant
ADD COLUMN to MariaDB Server 10.3. With that work present, in one
test case, trx_undo_report_insert_virtual() could in release builds
fetch the wrong value for a virtual column.
We replace many of the dtuple_t accessors with const-preserving
inline functions, and fix missing or misleadingly applied const
qualifiers accordingly.
log_checkpoint(), log_make_checkpoint_at(): Remove the parameter
write_always. It seems that the primary purpose of this parameter
was to ensure in the function recv_reset_logs() that both checkpoint
header pages will be overwritten, when the function is called from
the never-enabled function recv_recovery_from_archive_start().
create_log_files(): Merge recv_reset_logs() to its only caller.
Debug instrumentation: Prefer to flush the redo log, instead of
triggering a redo log checkpoint.
page_header_set_field(): Disable a debug assertion that will
always fail due to MDEV-19344, now that we no longer initiate
a redo log checkpoint before an injected crash.
In recv_reset_logs() there used to be two calls to
log_make_checkpoint_at(). The apparent purpose of this was
to ensure that both InnoDB redo log checkpoint header pages
will be initialized or overwritten.
The second call was removed (without any explanation) in MySQL 5.6.3:
mysql/mysql-server@4ca37968da
In MySQL 5.6.8 WL#6494, starting with
mysql/mysql-server@00a0ba8ad9
the function recv_reset_logs() was not only invoked during
InnoDB data file initialization, but also during a regular
startup when the redo log is being resized.
mysql/mysql-server@45e9167983
in MySQL 5.7.2 removed the UNIV_LOG_ARCHIVE code, but still
did not remove the parameter write_always.
The statement
SET GLOBAL innodb_encryption_rotate_key_age=0;
would have the unwanted side effect that ENCRYPTION=DEFAULT tablespaces
would no longer be encrypted or decrypted according to the setting of
innodb_encrypt_tables.
We implement a trigger, so that whenever one of the following is executed:
SET GLOBAL innodb_encrypt_tables=OFF;
SET GLOBAL innodb_encrypt_tables=ON;
SET GLOBAL innodb_encrypt_tables=FORCE;
all wrong-state ENCRYPTION=DEFAULT tablespaces will be added to
fil_system_t::rotation_list, so that the encryption will be added
or removed.
Note: This will *NOT* happen automatically after a server restart.
Before reading the first page of a data file, InnoDB cannot know
the encryption status of the data file. The statement
SET GLOBAL innodb_encrypt_tables will have the side effect that
all not-yet-read InnoDB data files will be accessed in order to
determine the encryption status.
innodb_encrypt_tables_validate(): Stop disallowing
SET GLOBAL innodb_encrypt_tables when innodb_encryption_rotate_key_age=0.
This reverts part of commit 50eb40a2a8
that addressed MDEV-11738 and MDEV-11581.
fil_system_t::read_page0(): Trigger a call to fil_node_t::read_page0().
Refactored from fil_space_get_space().
fil_crypt_rotation_list_fill(): If innodb_encryption_rotate_key_age=0,
initialize fil_system->rotation_list. This is invoked both on
SET GLOBAL innodb_encrypt_tables and
on SET GLOBAL innodb_encryption_rotate_key_age=0.
fil_space_set_crypt_data(): Remove.
fil_parse_write_crypt_data(): Simplify the logic.
This is joint work with Marko Mäkelä.
This reverts commit 61f370a3c9
and implements a simpler fix that is straightforward to merge to 10.3.
lock_print_info: Renamed from PrintNotStarted. Dump the
entire contents of trx_sys->mysql_trx_list.
lock_print_info_rw_recovered: Like lock_print_info, but dump
only recovered transactions in trx_sys->rw_trx_list.
lock_print_info_all_transactions(): Dump both trx_sys->mysql_trx_list
and trx_sys->rw_trx_list.
TrxLockIterator, TrxListIterator, lock_rec_fetch_page(): Remove.
This is a partial backport of the 10.3
commit a447980ff3
which removed the race-condition-prone ability of the InnoDB monitor
to read relevant pages into the buffer pool for some record locks.
ut_list_validate(), ut_list_map(): Add variants with const Functor&
so that these functions can be called with an rvalue.
Remove wrapper macros, and add #ifdef UNIV_DEBUG around debug-only code.
lock_print_info_all_transactions(): print transactions which are started
but not in RW or RO lists.
print_not_started(): Replaces PrintNotStarted. Collect the skipped
transactions.
fil_node_t::read_page0(): Do not replace up-to-date metadata with a
possibly old version of page 0 that is being reread from the file.
A more up-to-date page 0 could still exists in the buffer pool,
waiting to be written back to the file.
The field roll_node_t::partial holds if and only if
savept has been set. Make savept a pointer.
trx_rollback_start(): Use the semantic type undo_no_t for roll_limit.
PROBLEM
-------
Function innodb_base_col_setup_for_stored() was skipping to store
the base column information for a generated column if the base column
was a "STORED" generated column. This later causes a crash in function
innoabse_col_check_fk() where it says that a generated columns depends
upon two base columns ,but there is information on only one of them.
There was a explicit check barring the stored columns being stored,
which is wrong because the documentation says that a generated stored
column can be a part of a generated column.
FIX
----
Store the information of base column if it is a stored generated column.
#RB21247
Reviewed by: Debarun Banerjee <debarun.banerjee@oracle.com>
Problem:
io_getevents() - read asynchronous I/O events from the completion
queue. For each IO event, the res field in io_event tells whether IO
event is succeeded or not. To see if the IO actually succeeded we
always need to check event.res (negative=error,
positive=bytesread/written).
LinuxAIOHandler::collect() doesn't check event.res value for each event.
which leads to incorrect value in n_bytes for IO context (or IO Slot).
Fix:
Added a check for event.res negative value.
RB: 20871
Reviewed by : annamalai.gurusami@oracle.com
Normally, the InnoDB master thread executes InnoDB log checkpoints
so frequently that bugs in crash recovery or redo logging can be
hard to reproduce. This is because crash recovery would start replaying
the log only from the latest checkpoint. Because the InnoDB redo log
format only allows saving information for at most 2 latest checkpoints,
and because the log files are written in a circular fashion, it would
be challenging to implement a debug option that would start the redo
log apply from the very start of the redo log file.
It's a micro optimization. On most platforms CPUs has instructions to
compare with 0 fast. DB_SUCCESS is the most popular outcome of functions
and this patch optimized code like (err == DB_SUCCESS)
BtrBulk::finish(): bogus assertion fixed
fil_node_t::read_page0(): corrected usage of os_file_read()
que_eval_sql(): bugus assertion removed. Apparently it checked that
the field was assigned after having been zero-initialized at
object creation.
It turns out that the return type of os_file_read_func() was changed
in mysql/mysql-server@98909cefbc (MySQL 5.7)
from ibool to dberr_t. The reviewer (if there was any) failed to
point out that because of future merges, it could be a bad idea to
change the return type of a function without changing the function name.
This change was applied to MariaDB 10.2.2 in
commit 2e814d4702 but the
MariaDB-specific code was not fully adjusted accordingly,
e.g. in fil_node_open_file(). Essentially, code like
!os_file_read(...) became dead code in MariaDB and later
in Mariabackup 10.2, and we could be dealing with an uninitialized
buffer after a failed page read.
For partitioned table, ensure that the AUTO_INCREMENT values will
be assigned from the same sequence. This is based on the following
change in MySQL 5.6.44:
commit aaba359c13d9200747a609730dafafc3b63cd4d6
Author: Rahul Malik <rahul.m.malik@oracle.com>
Date: Mon Feb 4 13:31:41 2019 +0530
Bug#28573894 ALTER PARTITIONED TABLE ADD AUTO_INCREMENT DIFF RESULT DEPENDING ON ALGORITHM
Problem:
When a partition table is in-place altered to add an auto-increment column,
then its values are starting over for each partition.
Analysis:
In the case of in-place alter, InnoDB is creating a new sequence object
for each partition. It is default initialized. So auto-increment columns
start over for each partition.
Fix:
Assign old sequence of the partition to the sequence of next partition
so it won't start over.
RB#21148
Reviewed by Bin Su <bin.x.su@oracle.com>
Correctly document the usage of m_max_value. Remove the const
qualifier, so that the implicit assignment operator can be used.
Make all members of ib_sequence private, and add an accessor
member function max_value().
PROBLEM
=======
An add index doesn't update index length stats in information schema
TABLES table.
FIX
===
Update the dict_table_t variable with index length stats that is
actually calculated post alter . As this variable is used to populated
the information schema index length statistics.
Reviewed by: Bin su<bin.x.su@oracle.com>
RB: 21277
InnoDB could return the same list again and again if the buffer
passed to trx_recover_for_mysql() is smaller than the number of
transactions that InnoDB recovered in XA PREPARE state.
We introduce the transaction state TRX_PREPARED_RECOVERED, which
is like TRX_PREPARED, but will be set during trx_recover_for_mysql()
so that each transaction will only be returned once.
Because init_server_components() is invoking ha_recover() twice,
we must reset the state of the transactions back to TRX_PREPARED
after returning the complete list, so that repeated traversals
will see the complete list again, instead of seeing an empty list.
Without this tweak, the test main.tc_heuristic_recover would hang
in MariaDB 10.1.
dict_create_foreign_constraints_low(): Tolerate the keywords
IGNORE and ONLINE between the keywords ALTER and TABLE.
We should really remove the hacky FOREIGN KEY constraint parser
from InnoDB.
The compile-time option IBUF_COUNT_DEBUG has not been used for years.
It would only work with up to 3 created .ibd files, with no buffered
changes existing while InnoDB is started up.
InnoDB crash recovery used to read every data page for which
redo log exists. This is unnecessary for those pages that are
initialized by the redo log. If a newly created page is corrupted,
recovery could unnecessarily fail. It would suffice to reinitialize
the page based on the redo log records.
To add insult to injury, InnoDB crash recovery could hang if it
encountered a corrupted page. We will fix also that problem.
InnoDB would normally refuse to start up if it encounters a
corrupted page on recovery, but that can be overridden by
setting innodb_force_recovery=1.
Data pages are completely initialized by the records
MLOG_INIT_FILE_PAGE2 and MLOG_ZIP_PAGE_COMPRESS.
MariaDB 10.4 additionally recognizes MLOG_INIT_FREE_PAGE,
which notifies that a page has been freed and its contents
can be discarded (filled with zeroes).
The record MLOG_INDEX_LOAD notifies that redo logging has
been re-enabled after being disabled. We can avoid loading
the page if all buffered redo log records predate the
MLOG_INDEX_LOAD record.
For the internal tables of FULLTEXT INDEX, no MLOG_INDEX_LOAD
records were written before commit aa3f7a107c.
Hence, we will skip these optimizations for tables whose
name starts with FTS_.
This is joint work with Thirunarayanan Balathandayuthapani.
fil_space_t::enable_lsn, file_name_t::enable_lsn: The LSN of the
latest recovered MLOG_INDEX_LOAD record for a tablespace.
mlog_init: Page initialization operations discovered during
redo log scanning. FIXME: This really belongs in recv_sys->addr_hash,
and should be removed in MDEV-19176.
recv_addr_state: Add the new state RECV_WILL_NOT_READ to
indicate that according to mlog_init, the page will be
initialized based on redo log record contents.
recv_add_to_hash_table(): Set the RECV_WILL_NOT_READ state
if appropriate. For now, we do not treat MLOG_ZIP_PAGE_COMPRESS
as page initialization. This works around bugs in the crash
recovery of ROW_FORMAT=COMPRESSED tables.
recv_mark_log_index_load(): Process a MLOG_INDEX_LOAD record
by resetting the state to RECV_NOT_PROCESSED and by updating
the fil_name_t::enable_lsn.
recv_init_crash_recovery_spaces(): Copy fil_name_t::enable_lsn
to fil_space_t::enable_lsn.
recv_recover_page(): Add the parameter init_lsn, to ignore
any log records that precede the page initialization.
Add DBUG output about skipped operations.
buf_page_create(): Initialize FIL_PAGE_LSN, so that
recv_recover_page() will not wrongly skip applying
the page-initialization record due to the field containing
some newer LSN as a leftover from a different page.
Do not invoke ibuf_merge_or_delete_for_page() during
crash recovery.
recv_apply_hashed_log_recs(): Remove some unnecessary lookups.
Note if a corrupted page was found during recovery.
After invoking buf_page_create(), do invoke
ibuf_merge_or_delete_for_page() via mlog_init.ibuf_merge()
in the last recovery batch.
ibuf_merge_or_delete_for_page(): Relax a debug assertion.
innobase_start_or_create_for_mysql(): Abort startup if
a corrupted page was found during recovery. Corrupted pages
will not be flagged if innodb_force_recovery is set.
However, the recv_sys->found_corrupt_fs flag can be set
regardless of innodb_force_recovery if file names are found
to be incorrect (for example, multiple files with the same
tablespace ID).