MDEV-11415 Remove excessive undo logging during ALTER TABLE…ALGORITHM=COPY
Move a test from innodb.rename_table_debug to innodb.alter_copy.
ha_innobase::extra(HA_EXTRA_BEGIN_ALTER_COPY): Register id-versioned
tables so that mysql.transaction_registry will be updated, even for
empty tables that are subjected to ALTER TABLE…ALGORITHM=COPY.
If a crash occurs during ALTER TABLE…ALGORITHM=COPY, InnoDB would spend
a lot of time rolling back writes to the intermediate copy of the table.
To reduce the amount of busy work done, a work-around was introduced in
commit fd069e2bb3 in MySQL 4.1.8 and 5.0.2,
to commit the transaction after every 10,000 inserted rows.
A proper fix would have been to disable the undo logging altogether and
to simply drop the intermediate copy of the table on subsequent server
startup. This is what happens in MariaDB 10.3 with MDEV-14717,MDEV-14585.
In MariaDB 10.2, the intermediate copy of the table would be left behind
with a name starting with the string #sql.
This is a backport of a bug fix from MySQL 8.0.0 to MariaDB,
contributed by jixianliang <271365745@qq.com>.
Unlike recent MySQL, MariaDB supports ALTER IGNORE. For that operation
InnoDB must for now keep the undo logging enabled, so that the latest
row can be rolled back in case of an error.
In Galera cluster, the LOAD DATA statement will retain the existing
behaviour and commit the transaction after every 10,000 rows if
the parameter wsrep_load_data_splitting=ON is set. The logic to do
so (the wsrep_load_data_split() function and the call
handler::extra(HA_EXTRA_FAKE_START_STMT)) are joint work
by Ji Xianliang and Marko Mäkelä.
The original fix:
Author: Thirunarayanan Balathandayuthapani <thirunarayanan.balathandayuth@oracle.com>
Date: Wed Dec 2 16:09:15 2015 +0530
Bug#17479594 AVOID INTERMEDIATE COMMIT WHILE DOING ALTER TABLE ALGORITHM=COPY
Problem:
During ALTER TABLE, we commit and restart the transaction for every
10,000 rows, so that the rollback after recovery would not take so long.
Fix:
Suppress the undo logging during copy alter operation. If fts_index is
present then insert directly into fts auxiliary table rather
than doing at commit time.
ha_innobase::num_write_row: Remove the variable.
ha_innobase::write_row(): Remove the hack for committing every 10000 rows.
row_lock_table_for_mysql(): Remove the extra 2 parameters.
lock_get_src_table(), lock_is_table_exclusive(): Remove.
Reviewed-by: Marko Mäkelä <marko.makela@oracle.com>
Reviewed-by: Shaohua Wang <shaohua.wang@oracle.com>
Reviewed-by: Jon Olav Hauglid <jon.hauglid@oracle.com>
Remove unnecessary repeated lookups for undo pages.
trx_undo_assign(), trx_undo_assign_low(), trx_undo_seg_create(),
trx_undo_create(): Return the undo log block to the caller.
Inside InnoDB, each mini-transaction that generates any redo log records
will acquire log_sys->mutex during mtr_t::commit() in order to copy the
records into the global log_sys->buf for writing into the redo log file.
For single-row transactions, this incurs quite a bit of overhead.
We would use two mini-transactions for writing a record into a
freshly updated undo log page. (Only if the undo record will
not fit in that page, then we will have to commit and restart
the mini-transaction.)
trx_undo_assign(): Assign undo log for a persistent transaction,
or return the already assigned one.
trx_undo_assign_low(): Assign undo log for an operation on a
persistent or temporary table.
trx_undo_create(), trx_undo_reuse_cached(): Remove redundant parameters.
Merge the logic from trx_undo_mark_as_dict_operation().
Only invoke set_versioned() on trx_id versioned tables.
dict_table_t::versioned_by_id(): New accessor, to determine if
a table is system versioned by transaction ID.
When cloning oldest view, don't copy ReadView::m_creator_trx_id.
It means that the owner thread is now allowed to access this member
without trx_sys.mutex protection.
To achieve this we have to keep ReadView::m_creator_trx_id in
ReadView::m_ids. This is required to not let purge thread process
records owned by transaction associated with oldest view.
It is only required if trsanction entered read-write mode before it's
view was created.
If transaction entered read-write mode after it's view was created
(trx_set_rw_mode()), purge thread won't be allowed to touch it because
m_low_limit_id >= m_creator_trx_id holds. Thus we don't have to add
this transaction id to ReadView::m_ids.
Cleanups:
ReadView::ids_t: don't seem to make any sense, just complicate matters.
ReadView::copy_trx_ids(): doesn't make sense anymore, integrated into
caller.
ReadView::copy_complete(): not needed anymore.
ReadView copy constructores: don't seem to make any sense.
trx_purge_truncate_history(): removed view argument, access
purge_sys->view directly instead.
Moved mutex locking inside lock_rec_lock().
Moved monitor increment out of mutex.
Moved assertions that don't require protection out of mutex.
Removed duplicate assertions.
Moved duplicate debug injections into lock_rec_lock().
Let monitor updates use relaxed memory order.
Return directly without maintaining variables in lock_rec_lock_slow().
Moved lock_rec_lock_fast() body into lock_rec_lock(): saves at least one
trx_mutex_enter(), one switch() plus some code was moved out of mutex.
InnoDB RNG maintains global state, causing otherwise unnecessary bus
traffic. Even worse this is cross-mutex traffic. That is different
mutexes suffer from contention.
Fixed delay of 4 was verified to give best throughput by OLTP update
index and read-write benchmarks on Intel Broadwell (2/20/40) and
ARM (1/46/46).
Traditionally, DROP TABLE and TRUNCATE TABLE discarded any locks that
may have been held on the table. This feels like an ACID violation.
Probably most occurrences of it were prevented by meta-data locks (MDL)
which were introduced in MySQL 5.5.
dict_table_t::n_foreign_key_checks_running: Reduce the number of
non-debug checks.
lock_remove_all_on_table(), lock_remove_all_on_table_for_trx(): Remove.
ha_innobase::truncate(): Acquire an exclusive InnoDB table lock
before proceeding. DROP TABLE and DISCARD/IMPORT were already doing
this.
row_truncate_table_for_mysql(): Convert the already started transaction
into a dictionary operation, and do not invoke lock_remove_all_on_table().
row_mysql_table_id_reassign(): Do not call lock_remove_all_on_table().
This function is only used in ALTER TABLE...DISCARD/IMPORT TABLESPACE,
which is already holding an exclusive InnoDB table lock.
TODO: Make n_foreign_key_checks running a debug-only variable.
This would require two fixes:
(1) DROP TABLE: Exclusively lock the table beforehand, to prevent
the possibility of concurrently running foreign key checks (which
would acquire a table IS lock and then record S locks).
(2) RENAME TABLE: Find out if n_foreign_key_checks_running>0 actually
constitutes a potential problem.
It was supposed to be called out of mutex, but nevertheless was called
under trx_sys.mutex for normal threads adding one extra condtion in
critical section.
This will allow us to reduce critical section protected by
trx_sys.mutex:
- no need to maintain global m_free list
- eliminate if (trx->read_view == NULL) condition.
On x86_64 sizeof(Readview) is 144 mostly due to padding, sizeof(trx_t)
with ReadView is 1200.
Also don't close ReadView for read-write transactions, just mark it
closed similarly to read-only.
Clean-up: removed n_prepared_recovered_trx and n_prepared_trx, which
accidentally re-appeared after some rebase.
MDEV-14511 tried to avoid some consistency problems related to InnoDB
persistent statistics. The persistent statistics are being written by
an InnoDB internal SQL interpreter that requires the InnoDB data dictionary
cache to be locked.
Before MDEV-14511, the statistics were written during DDL in separate
transactions, which could unnecessarily reduce performance (each commit
would require a redo log flush) and break atomicity, because the statistics
would be updated separately from the dictionary transaction.
However, because it is unacceptable to hold the InnoDB data dictionary
cache locked while suspending the execution for waiting for a
transactional lock (in the mysql.innodb_index_stats or
mysql.innodb_table_stats tables) to be released, any lock conflict
was immediately be reported as "lock wait timeout".
To fix MDEV-14941, an attempt to reduce these lock conflicts by acquiring
transactional locks on the user tables in both the statistics and DDL
operations was made, but it would still not entirely prevent lock conflicts
on the mysql.innodb_index_stats and mysql.innodb_table_stats tables.
Fixing the remaining problems would require a change that is too intrusive
for a GA release series, such as MariaDB 10.2.
Thefefore, we revert the change MDEV-14511. To silence the
MDEV-13201 assertion, we use the pre-existing flag trx_t::internal.
trx->read_view|= 1 was done in a silly attempt to fix race condition
where trx->read_view was closed without trx_sys.mutex lock by read-only
trasnactions.
This just made the problem less likely to happen. In fact there was race
condition in const version of trx_get_read_view(): pointer may change to
garbage any moment after MVCC::is_view_active(trx->read_view) check and
before this function returns.
This patch doesn't fix this race condition, but rather makes it's
consequences less destructive.
trx_erase_lists(): trx->read_view is owned by current thread and thus
doesn't need trx_sys.mutex protection for reading it's value. Move
trx->read_view check out of mutex
trx_start_low(): moved assertion out of mutex.
Call ReadView::creator_trx_id() directly: allows to inline this one-line
method.
There is only one transaction system object in InnoDB.
Allocate the storage for it at link time, not at runtime.
lock_rec_fetch_page(): Use the correct fetch mode BUF_GET.
Pages may never be deallocated from a tablespace while
record locks are pointing to them.
Use atomic operations when accessing trx_sys_t::max_trx_id. We can't yet
move trx_sys_t::get_new_trx_id() out of mutex because it must be updated
atomically along with trx_sys_t::rw_trx_ids.
Let lock_print_info_all_transactions() iterate rw_trx_hash instead of
rw_trx_list.
When printing info of locks for transactions, InnoDB monitor doesn't
attempt to read relevant page from disk anymore. The code was prone
to race conditions.
Note that TrxListIterator didn't work as advertised: it iterated
rw_trx_list only.
Removed trx_sys_validate_trx_list(): with rw_trx_hash elements are not
required to be ordered by transaction id. Transaction state is now guarded
by asserts in rw_trx_hash_t.
Removed trx_sys_t::n_prepared_recovered_trx: never used.
Removed trx_sys_t::n_prepared_trx: used only at shutdown, we can perfectly
get this value from rw_trx_hash.
Determine minimum transaction id by iterating rw_trx_hash, not rw_trx_list.
It is more expensive than previous implementation since it does linear
search, especially if there're many concurrent transactions running. But in
such case mutex is much bigger evil. And since it doesn't require
trx_sys->mutex protection it scales better.
For low concurrency performance difference is neglible.
Replaced UT_LIST_GET_LEN(trx_sys->rw_trx_list) with
trx_sys->rw_trx_hash.size().
Moved freeing of trx objects at shutdown to rw_trx_hash destructor.
Small clean-up in trx_rollback_recovered().
Reduce divergence between trx_sys_t::rw_trx_hash and trx_sys_t::rw_trx_list
by not adding recovered COMMITTED transactions to trx_sys_t::rw_trx_list.
Such transactions are discarded immediately without creating trx object.
This also required to split rollback and cleanup phases of recovery. To
reflect these updates the following renames happened:
trx_rollback_or_clean_all_recovered() -> trx_rollback_all_recovered()
trx_rollback_or_clean_is_active -> trx_rollback_is_active
trx_rollback_or_clean_recovered() -> trx_rollback_recovered()
trx_cleanup_at_db_startup() -> trx_cleanup_recovered()
Also removed a hack from lock_trx_release_locks(). Instead let recovery
rollback thread to skip committed XA transactions.
Replace some !rw_lock_own() assertions with the stronger
!btr_search_own_any(). Remove some redundant btr_get_search_latch()
calls.
btr_search_update_hash_ref(): Remove a duplicated assertion.
btr_search_build_page_hash_index(): Remove a duplicated assertion.
rw_lock_s_lock() asserts that the latch is not being held.
btr_search_disable_ref_count(): Remove an assertion. The only caller
is acquiring all adaptive hash index latches.
btr_cur_search_to_nth_level(), btr_search_guess_on_hash(),
btr_pcur_open_with_no_init_func(), row_sel_open_pcur():
Replace the parameter has_search_latch with the ahi_latch
(passed as NULL if the caller does not hold the latch).
btr_search_update_hash_node_on_insert(),
btr_search_update_hash_on_insert(),
btr_search_build_page_hash_index(): Add the parameter ahi_latch.
btr_search_x_lock(), btr_search_x_unlock(),
btr_search_s_lock(), btr_search_s_unlock(): Remove.