Before we create an InnoDB data file, we must have persistently
started a DDL transaction and written a record in SYS_INDEXES
as well as a FILE_CREATE record for creating the file.
In that way, if InnoDB is killed before the DDL transaction is
committed, the rollback will be able to delete the file in
dict_drop_index_tree().
dict_build_table_def_step(): Do not create the tablespace.
At this point, we have not written any log, not even for
inserting the SYS_TABLES record.
dict_create_sys_indexes_tuple(): Relax an assertion to tolerate
a missing tablespace before the first index has been created in
dict_create_index_step().
dict_build_index_def_step(): Relax the dict_table_open_on_name()
parameter, because no tablespace may be available yet.
tab_create_graph_create(), row_create_table_for_mysql(), tab_node_t:
Remove key_id, mode.
ind_create_graph_create(), row_create_index_for_mysql(), ind_node_t:
Add key_id, mode.
dict_create_index_space(): New function, to create the tablespace
during clustered index creation.
dict_create_index_step(): After the SYS_INDEXES record has been
written, invoke dict_create_index_space() to create the tablespace
if needed.
fil_ibd_create(): Before creating the file, persistently write a
FILE_CREATE record. This will also ensure that an incomplete DDL
transaction will be recovered. After creating the file, invoke
fsp_header_init().
InnoDB used to support at most one CREATE TABLE or DROP TABLE
per transaction. This caused complications for DDL operations on
partitioned tables (where each partition is treated as a separate
table by InnoDB) and FULLTEXT INDEX (where each index is maintained
in a number of internal InnoDB tables).
dict_drop_index_tree(): Extend the MDEV-24589 logic and treat
the purge or rollback of SYS_INDEXES records of clustered indexes
specially: by dropping the tablespace if it exists. This is the only
form of recovery that we will need.
trx_undo_ddl_type: Document the DDL undo log record types better.
trx_t::dict_operation: Change the type to bool.
trx_t::ddl: Remove.
trx_t::table_id, trx_undo_t::table_id: Remove.
dict_build_table_def_step(): Remove trx_t::table_id logging.
dict_table_close_and_drop(), row_merge_drop_table(): Remove.
row_merge_lock_table(): Merged to the only callers, which can
call lock_table_for_trx() directly.
fts_aux_table_t, fts_aux_id, fts_space_set_t: Remove.
fts_drop_orphaned_tables(): Remove.
row_merge_rename_index_to_drop(): Remove. Thanks to MDEV-24589,
we can simply delete the to-be-dropped indexes from SYS_INDEXES,
while still being able to roll back the operation.
ha_innobase_inplace_ctx: Make a few data members const.
Preallocate trx.
prepare_inplace_alter_table_dict(): Simplify the logic. Let the
normal rollback take care of some cleanup.
row_undo_ins_remove_clust_rec(): Simplify the parsing of SYS_COLUMNS.
trx_rollback_active(): Remove the special DROP TABLE logic.
trx_undo_mem_create_at_db_start(), trx_undo_reuse_cached():
Always write TRX_UNDO_TABLE_ID as 0.
In commit 91599701d0 (MDEV-25312)
some recovery code for TRUNCATE TABLE was broken
causing a regression in a case where undo log for a RENAME TABLE
operation had been durably written but the tablespace had not been
renamed yet.
row_rename_table_for_mysql(): Add a DEBUG_SYNC point for the
test case, and simplify the logic and trim the error messages.
fil_space_t::rename(): Simplify the operation. Merge the necessary
part of fil_rename_tablespace_check(). If there is no change to
the file name, do nothing.
dict_table_t::rename_tablespace(): Refactored from
dict_table_rename_in_cache().
row_undo_ins_parse_undo_rec(): On rolling back TRX_UNDO_RENAME_TABLE,
invoke dict_table_t::rename_tablespace() even if the table name matches.
os_file_rename_func(): Temporarily relax an assertion that would
fail during the recovery in the test innodb.truncate_crash.
lock_discard_for_index(): New function, to discard locks for an
index whose index tree has been purged. By definition, such indexes
must be ones for which the MDL upgrade failed in inplace ALTER TABLE
and the ADD INDEX operation was never committed.
Note: Because we do not support online ADD SPATIAL INDEX, we only
have to traverse the lock_sys.rec_hash for B-trees and not the
hash tables for R-trees.
row_purge_remove_clust_if_poss_low(): Invoke lock_discard_for_index()
if necessary before dropping a B-tree for a SYS_INDEXES record.
btr_free_if_exists(): Always use the BUF_GET_POSSIBLY_FREED mode
when accessing pages, because due to MDEV-24589 the function
fil_space_t::set_stopping(true) can be called at any time during
the execution of this function.
mtr_t::m_freeing_tree: New data member for debugging purposes.
buf_page_get_low(): Assert that the BUF_GET mode is not being used
anywhere during the execution of btr_free_if_exists().
In all code related to freeing or allocating pages, we will add some
robustness, by making more use of BUF_GET_POSSIBLY_FREED and by
reporting an error instead of crashing in some cases of corruption.
fil_check_pending_ops(), fil_check_pending_io(): Remove.
These functions were actually duplicating each other ever since
commit 118e258aaa (MDEV-23855).
fil_space_t::check_pending_operations(): Replaces
fil_check_pending_operations() and incorporates the logic of
fil_check_pending_ops(). Avoid unnecessary lookups for the tablespace.
Just wait for the reference count to drop to zero.
fil_space_t::io(): Remove an unnecessary condition. We can (and
probably should) refuse asynchronous reads of undo tablespaces
that are being truncated.
fil_truncate_prepare(): Remove.
trx_purge_truncate_history(): Implement the necessary steps that used
to be in fil_truncate_prepare().
InnoDB startup hangs if a DDL transaction needs to be
rolled back and a recovered transaction on statistics
tables exists. In that case, InnoDB should roll back
the transaction which holds locks on innodb_table_stats
or innodb_index_stats during trx_rollback_or_clean_recovered().
innodb_debug_sync was introduced in commit
b393e2cb0c and reverted in
commit fc58c17216 due to memory leak reported
by valgrind, see MDEV-21336.
The leak is now fixed by adding `rw_lock_free(&slot->debug_sync_lock)`
after the background thread's working loop has finished, and the patch is
reapplied, with respect to the C++98 fixes by Marko.
The missing DEBUG_SYNC for MDEV-18546 in row0vers.cc is also reapplied.
In the SUX_LOCK_GENERIC implementation, we can remember at most
one pending exclusive lock request. If multiple exclusive lock
requests are pending, the WRITER_WAITING flag will be cleared when
the first waiting writer acquires the exclusive lock.
ssux_lock_low::update_lock(): If WRITER_WAITING is set, wake up
the writer even if the UPDATER flag is set, because the waiting
writer may be in the process of upgrading its U lock to X.
rw_lock::read_unlock(): Also indicate that an X lock waiter must
be woken up if a U lock exists.
This fix may cause unnecessary wake-ups and system calls, but this
is the best that we can do. Ideally we would use the MDEV-25404
idea of a separate 'writer' mutex, but there is no portable way to
request that a non-recursive mutex be created, and InnoDB requires
the ability to transfer buf_block_t::lock ownership to an I/O thread.
To allow problems like this to be caught more reliably in the future,
we add a unit test for srw_mutex, srw_lock, ssux_lock, sux_lock.
The U-to-X upgrade turned out to be incorrect. A debug assertion
failed in wr_wait(), called from mtr_defer_drop_ahi() in a stress
test with innodb_adaptive_hash_index=ON.
A correct upgrade procedure ought to be readers.fetch_add(WRITER-1)
to register ourselves as a WRITER (or waiting writer) and to release
the reference that was being held for the U lock.
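The arithmetic of that upgrade step can be illustrated with a minimal sketch. It assumes a 32-bit lock word whose most significant bit is a WRITER flag and whose remaining bits count references; the struct and member names are hypothetical and this is not the actual ssux_lock_low code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed layout: top bit = WRITER (holding or waiting),
// low 31 bits = number of references (S holders plus the U holder).
constexpr uint32_t WRITER = 1U << 31;

struct lock_word_sketch  // hypothetical name, for illustration only
{
  std::atomic<uint32_t> readers{0};

  // Register a shared reference.
  void s_ref() { readers.fetch_add(1, std::memory_order_acquire); }
  // Register the reference held on behalf of the U lock.
  void u_ref() { readers.fetch_add(1, std::memory_order_acquire); }

  // U-to-X upgrade: a single fetch_add(WRITER - 1) both sets the WRITER
  // flag and releases the reference that was held for the U lock.
  // Returns the number of remaining references the writer must wait for.
  uint32_t u_wr_upgrade()
  {
    uint32_t before = readers.fetch_add(WRITER - 1, std::memory_order_acquire);
    return (before - 1) & ~WRITER;
  }
};
```

Doing both updates in one atomic operation means no other thread can observe a state where the U reference has been dropped but the WRITER flag is not yet set.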
Thanks to Matthias Leich for catching the problem.
This reverts commit e731a28394.
A crash occurred during the test stress.ddl_innodb when
fil_delete_tablespace() for DROP TABLE was waiting in
fil_check_pending_operations() and a purge thread for handling
an earlier DROP INDEX was attempting to load the index root page
in btr_free_if_exists() and btr_free_root_check(). The function
buf_page_get_gen() would write out several times
"trying to read...being-dropped tablespace"
before giving up and committing suicide.
It turns out that during any page access in btr_free_if_exists(),
fil_space_t::set_stopping() could have been invoked by
fil_check_pending_operations(), as part of dropping the tablespace.
Preventing this race condition would require extensive changes
to the allocation code or some locking mechanism that would ensure
that we only set the flag if btr_free_if_exists() is not in progress.
Either way, that could be too risky a change in a GA release.
Because MDEV-24589 is not strictly necessary in the 10.5 release
series and it only is a requirement for MDEV-25180 in a later
major release, we will revert the change from 10.5.
The easiest way to compile and test the server with UBSAN is to run:
./BUILD/compile-pentium64-ubsan
and then run mysql-test-run.
After this commit, one should be able to run this without any UBSAN
warnings. There are still a few compiler warnings that should be fixed
at some point, but these do not expose any real bugs.
The 'special' cases where we disable, suppress or circumvent UBSAN are:
- ref10 source (as here we intentionally do some shifts that UBSAN
complains about).
- x86 version of optimized int#korr() methods. UBSAN does not like unaligned
memory access of integers. Fixed by using byte_order_generic.h when
compiling with UBSAN.
- We use smaller thread stack with ASAN and UBSAN, which forced me to
disable a few tests that print the thread stack size.
- Verifying class types does not work for shared libraries. I added
suppression in mysql-test-run.pl for this case.
- Added '#ifdef WITH_UBSAN' when using integer arithmetic where it is
safe to have overflows (two cases, in item_func.cc).
Things fixed:
- Don't left shift signed values
(byte_order_generic.h, mysqltest.c, item_sum.cc and many more)
- Don't assign nonexistent values to enum variables.
- Ensure that bool and enum values are properly initialized in
constructors. This was needed as UBSAN checks that these types have
correct values when one copies an object.
(gcalc_tools.h, ha_partition.cc, item_sum.cc, partition_element.h ...)
- Ensure we do not call handler functions on unallocated objects or
deleted objects.
(events.cc, sql_acl.cc).
- Fixed bugs in Item_sp::Item_sp() where we did not call constructor
on Query_arena object.
- Fixed several casts of objects to an incompatible class!
(Item.cc, Item_buff.cc, item_timefunc.cc, opt_subselect.cc, sql_acl.cc,
sql_select.cc ...)
- Ensure we do not do integer arithmetic that causes over or underflows.
This includes also ++ and -- of integers.
(Item_func.cc, Item_strfunc.cc, item_timefunc.cc, sql_base.cc ...)
- Added JSON_VALUE_UNITIALIZED to json_value_types and ensure that
value_type is initialized to this instead of to -1, which is not a valid
enum value for json_value_types.
- Ensure we do not call memcpy() when second argument could be null.
- Fixed that Item_func_str::make_empty_result() creates an empty string
instead of a null string (safer as it ensures we do not do arithmetic
on null strings).
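The "don't left shift signed values" item in the list above can be illustrated with a minimal sketch. The helper names are hypothetical and this is not the code from byte_order_generic.h; it only shows the general pattern of shifting in an unsigned type and converting back afterwards.

```cpp
#include <cassert>
#include <cstdint>

// Left-shifting a negative signed value is undefined behaviour before C++20,
// and UBSAN reports it. Shifting the unsigned representation is well defined.
// The conversion back to int32_t is implementation-defined before C++20, but
// is modular on the two's-complement platforms assumed here.
inline int32_t shift_left_8_sketch(int32_t v)
{
  return static_cast<int32_t>(static_cast<uint32_t>(v) << 8);
}

// Assemble a little-endian 32-bit signed integer byte by byte.
// Every shift operates on uint32_t, so no signed shift ever occurs.
inline int32_t sint4_read_sketch(const unsigned char *p)
{
  uint32_t u = static_cast<uint32_t>(p[0]) |
               (static_cast<uint32_t>(p[1]) << 8) |
               (static_cast<uint32_t>(p[2]) << 16) |
               (static_cast<uint32_t>(p[3]) << 24);
  return static_cast<int32_t>(u);
}
```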
Other things:
- Changed struct st_position to an OBJECT and added an initialization
function to it to ensure that we do not copy or use uninitialized
members. The change to a class was also motivated by the fact that we
used "struct st_position" and POSITION randomly throughout the code,
which was confusing.
- Notably big rewrite in sql_acl.cc to avoid using deleted objects.
- Changed in sql_partition to use '^' instead of '-'. This is safe as
the operand is either 0 or 0x8000000000000000ULL.
- Added check for select_nr < INT_MAX in JOIN::build_explain() to
avoid bug when get_select() could return NULL.
- Reordered elements in POSITION for better alignment.
- Changed sql_test.cc::print_plan() to use pointers instead of objects.
- Fixed bug in find_set() where we could execute '1 << -1'.
- Added variable have_sanitizer, used by mtr. (Previously, this variable
existed only in 10.5 and up.) It can now have one of two values:
ASAN or UBSAN.
- Moved ~Archive_share() from ha_archive.cc to ha_archive.h and marked
it virtual. This was an effort to get UBSAN to work with loaded storage
engines. I kept the change as the new place is better.
- Added in CONNECT engine COLBLK::SetName(), to get around a wrong cast
in tabutil.cpp.
- Added HAVE_REPLICATION around usage of rgi_slave, to get embedded
server to compile with UBSAN. (Patch from Marko).
- Added #ifdef for powerpc64 to avoid a bug in old gcc versions related
to integer arithmetic.
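The sql_partition item in the list above, which replaces '-' with '^', relies on a simple identity: for a value that is either 0 or 0x8000000000000000ULL, subtracting it modulo 2^64 gives the same bit pattern as XOR-ing it, because only the top bit can change and nothing can borrow from above it. The following sketch checks this on unsigned 64-bit values; the function names are hypothetical and the types used in the real code are not assumed here.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t SIGN_BIT = 0x8000000000000000ULL;

// Flip (or leave) the top bit using XOR; never overflows.
inline uint64_t adjust_xor_sketch(uint64_t x, uint64_t m) { return x ^ m; }

// The same adjustment written as a wrapping unsigned subtraction.
inline uint64_t adjust_sub_sketch(uint64_t x, uint64_t m) { return x - m; }

// True if both forms agree for m == 0 and m == SIGN_BIT.
inline bool xor_matches_sub_sketch(uint64_t x)
{
  return adjust_xor_sketch(x, 0) == adjust_sub_sketch(x, 0) &&
         adjust_xor_sketch(x, SIGN_BIT) == adjust_sub_sketch(x, SIGN_BIT);
}
```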
Changes that should not be needed but had to be done to suppress warnings
from UBSAN:
- Added static_cast<uint16_t> around shift to get rid of a LOT of
compiler warnings when using UBSAN.
- Had to change some divisions ('/') by powers of 2 to shifts to get
rid of some compile time warnings.
Reviewed by:
- Json changes: Alexey Botchkov
- Charset changes in ctype-uca.c: Alexander Barkov
- InnoDB changes & Embedded server: Marko Mäkelä
- sql_acl.cc changes: Vicențiu Ciorbaru
- build_explain() changes: Sergey Petrunia
Having both readers and writers use a single lock word in
futex system calls caused performance regression compared to
SRW_LOCK_DUMMY (mutex and 2 condition variables).
A contributing factor is that we did not accurately keep
track of the number of waiting threads and thus had to invoke
system calls to wake up any waiting threads.
SUX_LOCK_GENERIC: Renamed from SRW_LOCK_DUMMY. This is the
original implementation, with rw_lock (std::atomic<uint32_t>),
a mutex and two condition variables. Using a separate writer
mutex (as described below) is not possible, because the mutex ownership
in a buf_block_t::lock must be able to transfer from a write submitter
thread to an I/O completion thread, and pthread_mutex_lock() may assume
that the submitter thread is recursively acquiring the mutex that it
already holds, while in reality the I/O completion thread is the real
owner. POSIX does not define an interface for requesting a mutex to
be non-recursive.
On Microsoft Windows, srw_lock_low will remain a simple wrapper of
SRWLOCK. On 32-bit Microsoft Windows, sizeof(SRWLOCK)=4 while
sizeof(srw_lock_low)=8.
On other platforms, srw_lock_low is an alias of ssux_lock_low,
the Simple (non-recursive) Shared/Update/eXclusive lock.
In the futex-based implementation of ssux_lock_low (Linux, OpenBSD,
Microsoft Windows), we shall use a dedicated mutex for exclusive
requests (writer), and have a WRITER flag in the 'readers' lock word
to inform that a writer is holding the lock or waiting for the lock to
be granted. When the WRITER flag is set, all lock requests must acquire
the writer mutex. Normally, shared (S) lock requests simply perform a
compare-and-swap on the 'readers' word.
Update locks are implemented as a combination of writer mutex
and a normal counter in the 'readers' lock word. The conflict between
U and X locks is guaranteed by the writer mutex.
Unlike SUX_LOCK_GENERIC, wr_u_downgrade() will not wake up any pending
rd_lock() waits. They will wait until u_unlock() releases the writer mutex.
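The shared-lock fast path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual ssux_lock_low code: the WRITER flag is assumed to be the top bit of a 32-bit 'readers' word, all names are hypothetical, and the slow path (acquiring the writer mutex) is left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t WRITER_FLAG = 1U << 31;  // assumed position of the flag

struct readers_word_sketch  // hypothetical name
{
  std::atomic<uint32_t> readers{0};

  // Fast path for an S lock: compare-and-swap readers -> readers + 1,
  // but only while the WRITER flag is clear. On failure, the caller
  // would fall back to acquiring the writer mutex (not shown).
  bool rd_lock_try()
  {
    uint32_t r = readers.load(std::memory_order_relaxed);
    while (!(r & WRITER_FLAG))
      if (readers.compare_exchange_weak(r, r + 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed))
        return true;
    return false;
  }

  void rd_unlock() { readers.fetch_sub(1, std::memory_order_release); }

  // A writer announces itself (holding or waiting) by setting the flag.
  void set_writer() { readers.fetch_or(WRITER_FLAG, std::memory_order_acquire); }
};
```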
The ssux_lock_low is always wrapped by sux_lock (with a recursion count
of U and X locks), used for dict_index_t::lock and buf_block_t::lock.
Their memory footprint for the futex-based implementation will increase
by sizeof(srw_mutex), or 4 bytes.
This change addresses a performance regression in read-only benchmarks,
such as sysbench oltp_read_only. Also write performance was improved.
On 32-bit Linux and OpenBSD, lock_sys_t::hash_table will allocate
two hash table elements for each srw_lock (14 instead of 15 hash
table cells per 64-byte cache line on IA-32). On Microsoft Windows,
sizeof(SRWLOCK)==sizeof(void*) and there is no change.
Reviewed by: Vladislav Vaintroub
Tested by: Axel Schwenke and Vladislav Vaintroub
On Linux, OpenBSD and Microsoft Windows, srw_mutex was an alias for a
rw-lock while we only need mutex functionality. Let us implement a
futex-based mutex with one bit for HOLDER and 31 bits for counting
waiting requests.
srw_lock::wr_unlock() can avoid waking up a waiter when no waiting
requests exist. (Previously, we only had 1-bit rw_lock::WRITER_WAITING
flag that could be wrongly cleared if multiple waiting wr_lock() exist.
Now we have no problem with up to 2,147,483,648 conflicting threads.)
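A minimal sketch of such a lock word follows, assuming the HOLDER bit is the most significant bit and the low 31 bits count waiters. The names are hypothetical, and the blocking part (the futex wait and wake system calls) is deliberately omitted; only the bookkeeping that decides whether a wake-up is needed is shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t HOLDER = 1U << 31;  // assumed bit assignment

struct mutex_word_sketch  // hypothetical name
{
  std::atomic<uint32_t> lock{0};

  // Try to set HOLDER while preserving the waiter count.
  bool try_lock()
  {
    uint32_t l = lock.load(std::memory_order_relaxed);
    while (!(l & HOLDER))
      if (lock.compare_exchange_weak(l, l | HOLDER,
                                     std::memory_order_acquire,
                                     std::memory_order_relaxed))
        return true;
    return false;
  }

  // A thread that is about to block registers itself as a waiter.
  void add_waiter() { lock.fetch_add(1, std::memory_order_relaxed); }
  void remove_waiter() { lock.fetch_sub(1, std::memory_order_relaxed); }

  // Release the mutex. Returns true only if at least one waiter is
  // registered, that is, only if a wake-up system call would be needed.
  bool unlock()
  {
    uint32_t before = lock.fetch_and(~HOLDER, std::memory_order_release);
    return (before & ~HOLDER) != 0;
  }
};
```

Because the waiter count is exact, an uncontended unlock can skip the system call entirely, which a single WRITER_WAITING bit could not guarantee.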
On 64-bit Microsoft Windows, the advantage is that
sizeof(srw_mutex) is 4, while sizeof(SRWLOCK) would be 8.
Reviewed by: Vladislav Vaintroub
After the merging of MDEV-24915, 10.6 branch has regressions with handling of
concurrent write load against two or more cluster nodes. These regressions may
surface as cluster hanging, node crashes or data inconsistency. With some test
scenarios, the only visible symptom could be that the BF victim aborting happens
only by innodb lock wait timeout expiration. This would result only in poor
performance (by default 50 sec hang for each BF conflict), and could be somewhat
difficult to diagnose.
This pull request has the following fixes to handle concurrent write load from
multiple nodes:
In lock_wait_wsrep_kill(), the victim trx was expected to be only in
TRX_STATE_ACTIVE state. With the delayed BF conflict handling, it can happen
that victim has advanced into pre commit state. This was fixed by choosing
victim both in TRX_STATE_ACTIVE and TRX_STATE_PREPARED states.
Victim transaction may be in several different states at the time of detected
lock conflict, and due to delayed BF aborting practice in MDEV-24915, the victim
may advance further before the actual BF aborting takes place. The BF aborting
in MDEV-24915 did not wake the victim, if it was in the state of waiting for
some other lock (than the one that was blocking the high priority thread).
This anomaly caused the innodb lock wait timeout expiration delays and poor
performance symptom. To fix this, lock_wait_wsrep_kill() now looks if
victim is in lock waiting state, and uses lock_cancel_waiting_and_release()
to cancel this lock wait.
wsrep_bf_abort() checks if the victim has active transaction (in wsrep-lib),
and starts a new transaction if there was no active transaction before.
Due to late BF aborting, the victim may have e.g. failed in certification
and is already aborting or has aborted at this stage. This has caused
problems in testing where the BF aborter tries to BF abort itself.
The fix in wsrep_bf_abort() now skips the BF abort, if victim is aborting
or has aborted. Victim may not have started transaction yet in wsrep context,
but it may have acquired MDL locks (due to DDL execution), and this has
caused BF conflict. Such case does not require aborting in wsrep or
replication provider state.
BF aborting could cause BF-BF conflict scenario, if victim was already aborted
and changed to replayer having high priority as well. This BF-BF conflict
scenario is now avoided in lock_wait_wsrep(), where we now check if the blocking
lock holder is also high priority and is ordered before; the caller should wait
for the lock in this situation.
The natural innodb deadlock resolving algorithm could pick BF thread as
deadlock victim. This is fixed by giving max weight to BF threads in
Deadlock::report().
MDEV-24341 has changed execution paths in do_command() and this affects BF
aborted victim execution. This PR fixes one assert in do_command():
DBUG_ASSERT(!thd->async_state.pending_ops())
Which fired if the thd was BF aborted earlier. This assert is now changed
to allow pending_ops() if thd was BF aborted before.
With these fixes, long term highly conflicting write load could be run against
a two-node cluster. If binlogging is configured, log_slave_updates should
also be set.
Between btr_pcur_store_position() and btr_pcur_restore_position()
it is possible that purge empties a table and enlarges
index->n_core_fields and index->n_core_null_bytes.
Therefore, we must cache index->n_core_fields in
btr_pcur_t::old_n_core_fields so that btr_pcur_t::old_rec can be
parsed correctly.
Unfortunately, this is a huge change, because we will replace
"bool leaf" parameters with "ulint n_core"
(passing index->n_core_fields, or 0 for non-leaf pages).
For special cases where we know that index->is_instant() cannot hold,
we may also pass index->n_fields.
Problem:
========
InnoDB fails to clean the index stub if it fails to add the
virtual index which contains a new virtual column. But it clears
the newly added virtual column from the index in clear_added_indexes()
during inplace_alter_table. On commit, InnoDB evicts and
reloads the table. In case of rollback, this does not happen.
InnoDB clears the ABORTED index while opening the table
or doing the DDL. In the meantime, InnoDB can access
the dropped virtual index columns while creating prebuilt
or during rollback of concurrent DML.
Solution:
==========
(1) InnoDB should maintain the newly added virtual column while
rolling back the newly added virtual index.
(2) InnoDB must not defer the index removal
if the alter table is executed with LOCK=EXCLUSIVE.
(3) For LOCK=SHARED, InnoDB should check whether the table
has any transaction lock other than that of the alter transaction
before deferring the index stub.
Replaced has_new_v_col with dict_add_vcol_info in dict_index_t to
indicate whether the index has any new virtual column.
dict_index_t::has_new_v_col(): Return whether the index has a
newly added virtual column; it does not say which columns are
newly added virtual columns.
ha_innobase_inplace_ctx::is_new_vcol(): Return whether the
given column is added as a part of the current alter.
ha_innobase_inplace_ctx::clean_new_vcol_index(): Copy the newly
added virtual column to new_vcol_info in dict_index_t. Replace
the column in the index fields with virtual column stored
in new_vcol_info.
dict_index_t::assign_new_v_col(): Store the number of virtual
column added in index as a part of alter table.
dict_index_t::get_n_new_vcol(): Get the number of newly added
virtual column
dict_index_t::assign_drop_v_col(): Allocate the memory for
adding new virtual column in new_vcol_info.
dict_index_t::add_drop_v_col(): Add the newly added virtual
column in new_vcol_info.
dict_table_t::has_lock_for_other_trx(): Whether the table has
any other transaction lock than given transaction.
row_merge_drop_indexes(): Add parameter alter_trx and check
whether the table has any other lock than alter transaction.
In commit 8ea923f55b (MDEV-24818)
when we optimized multi-statement INSERT transactions into empty tables,
we would roll back the entire transaction on any error. But, we would
fail to invalidate any SAVEPOINT that had been requested in the past.
trx_t::savepoints_discard(): Renamed from trx_roll_savepoints_free().
row_mysql_handle_errors(): If we were in bulk insert, invoke
trx_t::savepoints_discard(). In this way, a future attempt of
ROLLBACK TO SAVEPOINT will return an error.
In commit 8ea923f55b (MDEV-24818)
when we optimized multi-statement INSERT into an empty table,
we would sometimes wrongly enable bulk insert into a table that
is actually already using row-level locking and undo logging.
trx_has_lock_x(): New predicate, to check if the transaction of
the current thread is holding an exclusive lock on a table.
trx_undo_report_row_operation(): Only invoke
trx_mod_table_time_t::start_bulk_insert() if
trx_has_lock_x() holds.
In commit e71e613353 (MDEV-24671),
lock_sys.wait_mutex was moved above lock_sys.mutex
(which was later replaced with lock_sys.latch) in the latching order.
In commit 7cf4419fc4 (MDEV-24789),
a potential hang was introduced to Galera. The function lock_wait()
would hold lock_sys.wait_mutex while invoking wsrep_is_BF_lock_timeout(),
which in turn could acquire LockMutexGuard for some diagnostic printout.
wsrep_is_BF_lock_timeout(): Do not invoke trx_print_latched() or
LockMutexGuard.
lock_sys_t::wr_lock(), lock_sys_t::rd_lock(): Assert that the current
thread is not holding lock_sys.wait_mutex.
Unfortunately, RW-locks are not covered by SAFE_MUTEX.
Reviewed by: Jan Lindström
Adds an implementation for SELECT ... FOR UPDATE SKIP LOCKED /
SELECT ... LOCK IN SHARE MODE SKIP LOCKED
This is implemented only in InnoDB at the moment, not in RocksDB yet.
This adds a new handler flag HA_CAN_SKIP_LOCKED that
will be used when the storage engine advertises the flag.
When a storage engine indicates this flag it will get
TL_WRITE_SKIP_LOCKED and TL_READ_SKIP_LOCKED lock types.
The Lex structure has been updated to store both the FOR UPDATE/LOCK IN
SHARE MODE as well as the SKIP LOCKED so the SHOW CREATE VIEW
implementation is simpler.
"SELECT FOR UPDATE ... SKIP LOCKED" combined with CREATE TABLE AS or
INSERT ... SELECT on the result set is not safe for STATEMENT based
replication. MIXED replication will replicate this as row based events.
Thanks to guidance from Facebook commit
193896c466.
This helped verify the basic test case and the components that needed
implementing (even though every part was implemented differently).
Thanks Marko for guidance on a simpler InnoDB implementation.
Reviewers: Marko, Monty
A consistency check for fil_space_t::name is causing recovery failures
in MDEV-25180 (Atomic ALTER TABLE). So, we'd better remove that field
altogether.
fil_space_t::name was more or less a copy of dict_table_t::name
(except for some special cases), and it was not being used for
anything useful.
There used to be a name_hash, but it had been removed already in
commit a75dbfd718 (MDEV-12266).
We will also remove os_normalize_path(), OS_PATH_SEPARATOR,
OS_PATH_SEPARATOR_ALT. On Microsoft Windows, we will treat \ and /
roughly in the same way. The intention is that for per-table
tablespaces, the filenames will always follow the pattern
prefix/databasename/tablename.ibd. (Any \ in the prefix must not
be converted.)
ut_basename_noext(): Remove (unused function).
read_link_file(): Replaces RemoteDatafile::read_link_file().
We will ensure that the last two path component separators are
forward slashes (converting up to 2 trailing backslashes on
Microsoft Windows), so that everywhere else we can
assume that data file names end in "/databasename/tablename.ibd".
Note: On Microsoft Windows, path names that start with \\?\ must
not contain / as path component separators. Previously, such paths
did work in the DATA DIRECTORY argument of InnoDB tables.
Reviewed by: Vladislav Vaintroub
Commit 76d2846a71 was for 10.5 only.
It caused some performance regression on 10.6 in some cases,
likely related to the removal of ib_mutex_t in MDEV-21452.
As pointed out by Andrei Elkin, the previous fix did not fix one
race condition that may have caused the observed hang.
innodb_log_flush_request(): If we are enqueueing the very first
request at the same time the log write is being completed,
we must ensure that a near-concurrent call to log_flush_notify()
will not result in a missed notification. We guarantee this by
release-acquire operations on log_requests.start and
log_sys.flushed_to_disk_lsn.
log_flush_notify_and_unlock(): Cleanup: Always release the mutex.
log_sys_t::get_flushed_lsn(): Use acquire memory order.
log_sys_t::set_flushed_lsn(): Use release memory order.
log_sys_t::set_lsn(): Use release memory order.
log_sys_t::get_lsn(): Use relaxed memory order by default, and
allow the caller to specify acquire memory order explicitly.
Whenever the log_sys.mutex is being held or when log writes are
prohibited during startup, we can use a relaxed load. Likewise,
in some assertions where reading a stale value of log_sys.lsn
should not matter, we can use a relaxed load.
This will cause some additional instructions to be emitted on
architectures that do not implement Total Store Ordering (TSO),
such as POWER, ARM, and RISC-V Weak Memory Ordering (RVWMO).
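The memory-order choices listed above can be sketched as a small wrapper around two atomic LSN fields. This is an illustrative sketch only: the struct name is hypothetical, it is not the real log_sys_t, and it does not reproduce the queueing logic in innodb_log_flush_request().

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

typedef uint64_t lsn_t;

struct log_sketch  // hypothetical stand-in for the relevant part of log_sys_t
{
  std::atomic<lsn_t> lsn{0};
  std::atomic<lsn_t> flushed_to_disk_lsn{0};

  // Relaxed by default; a caller that needs ordering passes
  // std::memory_order_acquire explicitly.
  lsn_t get_lsn(std::memory_order order = std::memory_order_relaxed) const
  { return lsn.load(order); }

  // Release: writes made before this store become visible to a thread
  // that later reads the value with acquire.
  void set_lsn(lsn_t l) { lsn.store(l, std::memory_order_release); }

  lsn_t get_flushed_lsn() const
  { return flushed_to_disk_lsn.load(std::memory_order_acquire); }

  void set_flushed_lsn(lsn_t l)
  { flushed_to_disk_lsn.store(l, std::memory_order_release); }
};
```

On x86 (a TSO architecture) the release stores and acquire loads compile to plain moves, which is why the extra cost is limited to weakly ordered architectures.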
track page-access counter
As part of MDEV-21212, n_page_gets, which is meant to track page accesses,
was ported to use a distributed counter that by default uses atomic
sub-counters. n_page_gets originally was a non-atomic counter that
represented an approximate count of page accesses. By that reasoning,
it does not need to be an atomic distributed counter.
This patch introduces an interface that allows the distributed counter to be
used with atomic or non-atomic sub-counters (through a template), and also
ports n_page_gets to use the non-atomic distributed counter through the
updated interface.
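A minimal sketch of a distributed counter whose sub-counter type is selected through a template parameter could look like the following. All names are hypothetical and this is not the interface that the patch introduces; in particular, the slot is chosen by a caller-supplied hint here, which is an assumption made to keep the sketch short.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Atomic sub-counter: exact under concurrency.
struct atomic_cell_sketch
{
  std::atomic<uint64_t> v{0};
  void add(uint64_t n) { v.fetch_add(n, std::memory_order_relaxed); }
  uint64_t get() const { return v.load(std::memory_order_relaxed); }
};

// Non-atomic sub-counter: cheaper, but only approximate if several threads
// hit the same slot (formally a data race in C++; this sketch is only
// exercised from a single thread).
struct plain_cell_sketch
{
  uint64_t v = 0;
  void add(uint64_t n) { v += n; }
  uint64_t get() const { return v; }
};

template<typename Cell, size_t N = 8>
class distributed_counter_sketch
{
  // One cache line per slot to avoid false sharing (64 bytes assumed).
  struct alignas(64) slot { Cell cell; };
  slot slots[N];
public:
  // 'hint' selects the slot, e.g. derived from a thread identifier.
  void add(size_t hint, uint64_t n = 1) { slots[hint % N].cell.add(n); }

  // The total is the sum over all slots.
  uint64_t sum() const
  {
    uint64_t s = 0;
    for (const slot &x : slots)
      s += x.cell.get();
    return s;
  }
};
```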
Starting with MariaDB 10.5, roughly after MDEV-23855 was fixed,
we are observing sporadic hangs during the execution of the
RESET MASTER statement. We are hoping to fix the hangs with these
changes, but due to the rather infrequent occurrence of the hangs
and our inability to reliably reproduce the hangs, we cannot be
sure of this.
What we do know is that innodb_force_recovery=2 (or a larger setting)
will prevent srv_master_callback (the former srv_master_thread) from
running. In that mode, periodic log flushes would never occur and
RESET MASTER could hang indefinitely. That is demonstrated by the new
test case that was developed by Andrei Elkin. We fix this case by
implementing a special case for it.
This also includes some code cleanup and renames of misleadingly
named code. The interface has nothing to do with log checkpoints in
the storage engine; it is only about requesting log writes to be
persistent.
handlerton::commit_checkpoint_request,
commit_checkpoint_notify_ha(): Remove the unused parameter hton.
log_requests.start: Replaces pending_checkpoint_list.
log_requests.end: Replaces pending_checkpoint_list_end.
log_requests.mutex: Replaces pending_checkpoint_mutex.
log_flush_notify_and_unlock(), log_flush_notify(): Replaces
innobase_mysql_log_notify(). The new implementation should be
functionally equivalent to the old one.
innodb_log_flush_request(): Replaces innobase_checkpoint_request().
Implement a fast path for common cases, and reduce the mutex hold time.
POSSIBLE FIX OF THE HANG: We will invoke commit_checkpoint_notify_ha()
for the current request if it is already satisfied, as well as invoke
log_flush_notify_and_unlock() for any satisfied requests.
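The idea of notifying immediately when a request is already satisfied, and otherwise queueing it until the flushed LSN catches up, can be sketched as below. This is a simplified illustration with hypothetical names; it is not the real innodb_log_flush_request(), it assumes requests arrive in ascending LSN order, and the `notified` vector merely stands in for the call to commit_checkpoint_notify_ha().

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

struct flush_requests_sketch  // hypothetical name
{
  std::mutex mutex;
  // Pending (lsn, cookie) pairs, assumed to be in ascending LSN order.
  std::deque<std::pair<uint64_t, int>> pending;
  // Stand-in for the notification callback; single-threaded in this sketch.
  std::vector<int> notified;

  // Fast path: if the request is already durable, notify without queueing
  // and without taking the mutex.
  void request(uint64_t lsn, uint64_t flushed_lsn, int cookie)
  {
    if (lsn <= flushed_lsn)
    {
      notified.push_back(cookie);
      return;
    }
    std::lock_guard<std::mutex> g(mutex);
    pending.emplace_back(lsn, cookie);
  }

  // Called after a log write became durable up to flushed_lsn:
  // notify and remove every request that is now satisfied.
  void notify(uint64_t flushed_lsn)
  {
    std::lock_guard<std::mutex> g(mutex);
    while (!pending.empty() && pending.front().first <= flushed_lsn)
    {
      notified.push_back(pending.front().second);
      pending.pop_front();
    }
  }
};
```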
log_write(): Invoke log_flush_notify() when the write is already durable.
This was missing WITH_PMEM when the log is in persistent memory.
Reviewed by: Vladislav Vaintroub
HAVE_valgrind_or_MSAN to HAVE_valgrind was incorrect in
af784385b4.
In my_valgrind.h, when clang is used (hence no __has_feature(memory_sanitizer))
with -DWITH_VALGRIND=1 but without memcheck.h, we end up with
MEM_CHECK_DEFINED being empty.
If we are also doing a CMAKE_BUILD_TYPE=Debug build, this results in a number of
[-Werror,-Wunused-variable] errors because MEM_CHECK_DEFINED is empty.
With MEM_CHECK_DEFINED empty, the field and InnoDB variables fixed in
this patch are no longer used.
So we stop using HAVE_valgrind as a catch-all and use the name
HAVE_CHECK_MEM to indicate that a MEM_CHECK_DEFINED function exists.
Reviewer: Monty
Corrects: af784385b4
row_upd_clust_step(): Remove the "trigger" on DELETE SYS_INDEXES
that would invoke dict_drop_index_tree(). Let us do it on purge.
row_purge_remove_clust_if_poss_low(): Invoke
dict_drop_index_tree() when purging a delete-marked SYS_INDEXES record.
The debug parameter innodb_simulate_comp_failures injected compression
failures for ROW_FORMAT=COMPRESSED tables, breaking the pre-existing
logic that I had implemented in the InnoDB Plugin for MySQL 5.1 to prevent
compressed page overflows. A much better check is already achieved by
defining UNIV_ZIP_COPY at the compilation time.
(Only UNIV_ZIP_DEBUG is part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON.)