rocksdb_checkpoint_request() should call FlushWAL(sync=true) (which does
write-out and sync), not just SyncWAL() (which just syncs without writing
out)
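A minimal sketch of the change, assuming a rocksdb::DB* handle named rdb (error handling elided):

  #include <cassert>
  #include <rocksdb/db.h>

  void rocksdb_checkpoint_request_sketch(rocksdb::DB *rdb)
  {
    // SyncWAL() only syncs WAL data that has already been written
    // out; data still buffered inside RocksDB stays non-durable:
    //   rocksdb::Status s= rdb->SyncWAL();
    // FlushWAL(true) writes out the buffered WAL data and syncs it.
    rocksdb::Status s= rdb->FlushWAL(/*sync=*/true);
    assert(s.ok());
  }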
Followup: the test requires debug sync facility
(This is a backport to 10.5)
The code compares two query plans with identical costs: the cost of the
plan with lateral is the same as that of the plan without it. Introduce
a small difference into the cost numbers so that the non-lateral plan
is preferred in this case.
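A minimal sketch of the tie-breaking idea, with hypothetical names and penalty value:

  // With equal raw costs the penalized lateral cost compares greater,
  // so the non-lateral plan deterministically wins the tie.
  static const double LATERAL_COST_PENALTY= 1.0000001;

  bool prefer_lateral_plan(double lateral_cost, double non_lateral_cost)
  {
    return lateral_cost * LATERAL_COST_PENALTY < non_lateral_cost;
  }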
Add a condition on trx->state == TRX_STATE_COMMITTED_IN_MEMORY in order
to avoid unnecessary work. If a transaction has already been committed
or rolled back, it will release its locks in lock_release() and let
the waiting thread(s) continue execution.
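A minimal sketch of the early-out, with simplified stand-in types (the real code lives in the InnoDB lock subsystem):

  enum trx_state_t { TRX_STATE_ACTIVE, TRX_STATE_PREPARED,
                     TRX_STATE_COMMITTED_IN_MEMORY };

  struct trx_t { trx_state_t state; };

  void handle_conflict(trx_t *trx)
  {
    // Already committing or rolling back: lock_release() will wake
    // the waiters, so there is nothing left to do here.
    if (trx->state == TRX_STATE_COMMITTED_IN_MEMORY)
      return;
    // ... otherwise perform the conflict-resolution work ...
  }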
Let BF wait on lock_rec_has_to_wait; if necessary, the other BF is
replayed.

wsrep_trx_order_before
  If BF is not even replicated yet, then they are ordered correctly.

bg_wsrep_kill_trx
  Make sure victim_trx is found, and also check its state. If the
  state is TRX_STATE_COMMITTED_IN_MEMORY, the transaction is already
  committed or rolled back and will release its locks soon.

wsrep_assert_no_bf_bf_wait
  A transaction requesting a new record lock should be in state
  TRX_STATE_ACTIVE. The conflicting transaction can be in state
  TRX_STATE_ACTIVE, TRX_STATE_COMMITTED_IN_MEMORY or
  TRX_STATE_PREPARED. If the conflicting transaction is already
  committed in memory or prepared, we should wait. When a transaction
  is committed in memory we hold the trx mutex, but not
  lock_sys->mutex. Therefore, we could end up here before the
  transaction has had time to do lock_release(), which is protected
  by lock_sys->mutex.

lock_rec_has_to_wait
  We can very well let BF wait normally, as the other BF will be
  replayed in case of conflict. For debug builds we do additional
  sanity checks to catch any unsupported BF wait (see the sketch
  below).

wsrep_kill_victim
  Check whether the victim is already in TRX_STATE_COMMITTED_IN_MEMORY
  state; if it is, we can return.

lock_rec_dequeue_from_page
lock_rec_unlock
  Remove unnecessary wsrep_assert_no_bf_bf_wait function calls. We can
  very well let BF wait here.
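A minimal debug-build sketch of the BF-BF wait rules described above, with simplified, hypothetical types (the real check is wsrep_assert_no_bf_bf_wait() in the InnoDB lock subsystem):

  #include <cassert>

  enum trx_state_t { TRX_STATE_ACTIVE, TRX_STATE_PREPARED,
                     TRX_STATE_COMMITTED_IN_MEMORY };

  struct trx_t { trx_state_t state; bool is_bf; };

  void assert_no_unsupported_bf_bf_wait(const trx_t *requestor,
                                        const trx_t *holder)
  {
    if (!requestor->is_bf || !holder->is_bf)
      return;                                 // not a BF-BF conflict
    // A transaction requesting a new record lock must be active.
    assert(requestor->state == TRX_STATE_ACTIVE);
    // If the lock holder is already committed in memory or prepared,
    // waiting is fine: it will do lock_release() shortly.
    if (holder->state == TRX_STATE_COMMITTED_IN_MEMORY ||
        holder->state == TRX_STATE_PREPARED)
      return;
    // Any other BF-BF wait is unsupported; the other BF will be
    // replayed on conflict, so flag it in debug builds only.
    assert(!"unsupported BF-BF lock wait");
  }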
Starting with MariaDB 10.5, roughly after MDEV-23855 was fixed,
we are observing sporadic hangs during the execution of the
RESET MASTER statement. We are hoping to fix the hangs with these
changes, but due to the rather infrequent occurrence of the hangs
and our inability to reliably reproduce the hangs, we cannot be
sure of this.
What we do know is that innodb_force_recovery=2 (or a larger setting)
will prevent srv_master_callback (the former srv_master_thread) from
running. In that mode, periodic log flushes would never occur and
RESET MASTER could hang indefinitely. That is demonstrated by the new
test case that was developed by Andrei Elkin. We fix this by handling
it as a special case.
This also includes some code cleanup and renames of misleadingly
named code. The interface has nothing to do with log checkpoints in
the storage engine; it is only about requesting log writes to be
persistent.
handlerton::commit_checkpoint_request,
commit_checkpoint_notify_ha(): Remove the unused parameter hton.
log_requests.start: Replaces pending_checkpoint_list.
log_requests.end: Replaces pending_checkpoint_list_end.
log_requests.mutex: Replaces pending_checkpoint_mutex.
log_flush_notify_and_unlock(), log_flush_notify(): Replace
innobase_mysql_log_notify(). The new implementation should be
functionally equivalent to the old one.
innodb_log_flush_request(): Replaces innobase_checkpoint_request().
Implement a fast path for common cases, and reduce the mutex hold time.
POSSIBLE FIX OF THE HANG: We will invoke commit_checkpoint_notify_ha()
for the current request if it is already satisfied, as well as invoke
log_flush_notify_and_unlock() for any satisfied requests.
log_write(): Invoke log_flush_notify() when the write is already durable.
This notification was missing in WITH_PMEM builds when the log is in
persistent memory.
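A simplified sketch of the request/notify protocol described above, with hypothetical standalone types (the real code uses an intrusive list, and notify_satisfied() stands in for commit_checkpoint_notify_ha(); requests are assumed to arrive in increasing LSN order):

  #include <atomic>
  #include <cstdint>
  #include <list>
  #include <mutex>

  struct log_request { uint64_t lsn; void *cookie; };

  static std::mutex log_requests_mutex;        // log_requests.mutex
  static std::list<log_request> log_requests;  // log_requests.start..end
  static std::atomic<uint64_t> flushed_to_disk_lsn{0};

  static void notify_satisfied(void *cookie) { (void) cookie; }

  void innodb_log_flush_request_sketch(uint64_t lsn, void *cookie)
  {
    // Fast path: the write is already durable; notify the current
    // request immediately, without touching the list.
    if (flushed_to_disk_lsn.load(std::memory_order_acquire) >= lsn)
    {
      notify_satisfied(cookie);
      return;
    }
    std::lock_guard<std::mutex> g(log_requests_mutex);
    // Re-check under the mutex; a log write may have completed
    // meanwhile, keeping the mutex hold time short.
    if (flushed_to_disk_lsn.load(std::memory_order_relaxed) >= lsn)
    {
      notify_satisfied(cookie);
      return;
    }
    log_requests.push_back({lsn, cookie});     // wait for a log write
  }

  // Called once a log write has become durable (also WITH_PMEM).
  void log_flush_notify_sketch(uint64_t flush_lsn)
  {
    flushed_to_disk_lsn.store(flush_lsn, std::memory_order_release);
    std::lock_guard<std::mutex> g(log_requests_mutex);
    while (!log_requests.empty() &&
           log_requests.front().lsn <= flush_lsn)
    {
      notify_satisfied(log_requests.front().cookie);
      log_requests.pop_front();
    }
  }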
Reviewed by: Vladislav Vaintroub
`mallinfo` is deprecated since glibc 2.33 and has been replaced by mallinfo2.
The deprecation causes building the server to fail if the glibc version is > 2.33.
Check whether mallinfo2 exists on the system and use it instead.
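A sketch of the conditional use, assuming HAVE_MALLINFO2 is set by a build-time check (e.g. CHECK_FUNCTION_EXISTS in CMake); struct mallinfo2 mirrors struct mallinfo but with size_t members:

  #include <malloc.h>
  #include <cstddef>

  size_t allocated_heap_bytes()
  {
  #ifdef HAVE_MALLINFO2
    struct mallinfo2 mi= mallinfo2();  // size_t fields, glibc >= 2.33
    return mi.uordblks;                // total allocated space
  #else
    struct mallinfo mi= mallinfo();    // int fields, deprecated
    return (size_t) mi.uordblks;
  #endif
  }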
The change of HAVE_valgrind_or_MSAN to HAVE_valgrind in af784385b4
was incorrect.
In my_valgrind.h, when compiling with clang (hence no
__has_feature(memory_sanitizer)) with -DWITH_VALGRIND=1 but without
memcheck.h, we end up with MEM_CHECK_DEFINED being empty.
If we are also building with CMAKE_BUILD_TYPE=Debug, this results in a
number of [-Werror,-Wunused-variable] errors, because with
MEM_CHECK_DEFINED empty, the fixed field and the InnoDB variables
touched by this patch have no remaining uses.
So we stop using HAVE_valgrind as a catch-all and use the name
HAVE_CHECK_MEM to indicate that a MEM_CHECK_DEFINED implementation
exists.
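A simplified sketch of the intended shape of my_valgrind.h under this scheme (the exact guards in the patch may differ):

  #include <cstddef>

  #ifdef HAVE_CHECK_MEM
  # include <valgrind/memcheck.h>
  # define MEM_CHECK_DEFINED(M, L) VALGRIND_CHECK_MEM_IS_DEFINED(M, L)
  #else
  # define MEM_CHECK_DEFINED(M, L) do {} while (0)
  #endif

  void check_buffer(const void *buf, size_t len)
  {
  #ifdef HAVE_CHECK_MEM
    // Variables that exist only to feed the check are guarded too;
    // otherwise they trip -Werror=unused-variable when the macro
    // expands to nothing.
    const size_t checked_len= len;
    MEM_CHECK_DEFINED(buf, checked_len);
  #else
    (void) buf;
    (void) len;
  #endif
  }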
Reviewer: Monty
Corrects: af784385b4
Binlog group commit could lead to a situation where the group commit
leader accesses a participant thd's wsrep client state concurrently
with the thread executing the participant thd.
This is because of a race condition in
MYSQL_BIN_LOG::write_transaction_to_binlog_events(),
and was fixed by moving wsrep_ordered_commit() to happen in
MYSQL_BIN_LOG::queue_for_group_commit() under the protection of the
LOCK_prepare_ordered mutex.
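A simplified sketch of the fix, with stub types standing in for THD and the wsrep call (the point is only where the call happens):

  #include <mutex>

  struct THD_stub { /* participant state, incl. wsrep client state */ };

  static std::mutex LOCK_prepare_ordered;

  static void wsrep_ordered_commit_stub(THD_stub *thd) { (void) thd; }

  void queue_for_group_commit_sketch(THD_stub *participant)
  {
    std::lock_guard<std::mutex> g(LOCK_prepare_ordered);
    // ... enqueue participant for group commit ...
    // Calling wsrep_ordered_commit() here, while LOCK_prepare_ordered
    // is held, removes the race with the thread that is still
    // executing the participant thd.
    wsrep_ordered_commit_stub(participant);
  }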
splittable derived
If one of the joined tables of the processed query is a materialized
derived table (or view or CTE) with a GROUP BY clause, then under some
conditions it can be subject to split optimization. With this
optimization new equalities are injected into the WHERE condition of
the SELECT that specifies this derived table. The injected equalities
are generated for all join orders with which the split optimization
can be employed. After the best join order has been chosen, only some
of these equalities are really needed; the others can be safely
removed. If this is not done and some of the injected equalities
involve expressions over semi-joins with look-up access, then the
query may return a wrong result set.
This patch effectively removes the equalities injected for split
optimization that are needed only at the optimization stage and not
for execution.
Approved by serg@mariadb.com
by compound index
This typo bug may lead to wrong result sets for equi-join queries where
the join operation is supported by a compound index such that the order of
its components differs from the order of the corresponding columns in
the table the index belongs to. The bug manifests itself only when usage
of the BNLH algorithm is forced.
The fix for the bug was provided by Chu Huaxing.
in ON condition
The fix of the bug MDEV-25002 for 10.4 turned out to be incomplete. It
caused crashes when executing CREATE VIEW, CREATE TABLE .. SELECT,
INSERT .. SELECT statements if their SELECTs contained references to
non-existing fields.
This patch complements the fix for MDEV-25002 in order to avoid such
crashes.
Approved by Oleksandr Byelkin <sanja@mariadb.com>
If a query with implicit grouping contains a MIN/MAX set function in
the select list over a column that is a part of an index, then the
query might be subject to MIN/MAX optimization. With this optimization
the server performs a look-up into the index, fetches the value of the
column C used in the MIN/MAX function and substitutes this value for
the MIN/MAX expression. This makes it possible to eliminate the table
containing C from further join processing. For the optimization to be
applied, the WHERE condition must be a conjunction of simple
equality/inequality predicates and/or BETWEEN predicates.
The bug fixed in this patch resulted in fetching a wrong value from
the index used for MIN/MAX optimization. It could happen when a
BETWEEN predicate containing the MIN/MAX value followed a strict
inequality.
Approved by dmitry.shulga@mariadb.com
Store the old value of the binlog format before the wsrep code, so
that if we bail out because wsrep is not ready for connections, we can
restore the binlog format correctly.
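A minimal sketch of the save/restore pattern, with hypothetical stand-in types (the real code stores thd->variables.binlog_format; the forced ROW format is an illustrative assumption):

  enum binlog_format_t { BINLOG_FORMAT_MIXED, BINLOG_FORMAT_STMT,
                         BINLOG_FORMAT_ROW };

  struct session_t { binlog_format_t binlog_format; };

  static bool wsrep_ready_for_connections() { return false; /* stub */ }

  int run_wsrep_section(session_t *thd)
  {
    const binlog_format_t saved= thd->binlog_format;  // save first
    thd->binlog_format= BINLOG_FORMAT_ROW;            // e.g. wsrep work
    if (!wsrep_ready_for_connections())
    {
      thd->binlog_format= saved;  // restore on the bail-out path too
      return 1;
    }
    // ... replication work ...
    thd->binlog_format= saved;
    return 0;
  }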
Note that key_or() may call tree_delete(), which will cause the weight
asserts to be checked. In order to keep them from firing, update the
key1 tree's weight after we've changed
key1->some_local_child->next_key_part.
Having done that, do we still need this at the function end:
/* Re-compute the result tree's weight. */
key1->update_weight_locally();
?
The query causing the issue here has implicit grouping, for which we
have to produce one row with special values for the aggregates
(depending on each aggregate function) and NULL values for all
non-aggregate fields.
For the subselect item where the implicit grouping was being done,
null_value was not being set in the case when the implicit grouping
produces NULL values for the items in the select list of the subquery.
This was leading to the crash.
The fix is to set null_value when all the values for the row column
are NULL.
Further changes are:
1) Setting null_value for Item_singlerow_subselect only after the
   val_* functions have been called.
2) Introducing a parameter null_value_inside to Item_cache that is set
   to TRUE if any of the arguments of the Item_cache are null (see the
   sketch after this list).
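A simplified sketch of the Item_cache change, with hypothetical stand-in types:

  #include <vector>

  struct cached_value { bool is_null; double val; };

  struct Item_cache_sketch
  {
    bool null_value_inside= false;   // new parameter described above
    std::vector<cached_value> args;

    void cache(const std::vector<cached_value> &src)
    {
      args= src;
      null_value_inside= false;
      for (const cached_value &v : args)
        if (v.is_null)
          null_value_inside= true;   // any NULL argument sets it
    }
  };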
Reviewed and co-authored by Monty
This bug manifested itself when executing queries with multiple
references to a CTE specified by a query expression with union and
having its column names explicitly declared. In this case the server
returned a bogus error message about an unknown column name. It
happened because, while for the first reference to the CTE the names
of the columns returned by the CTE specification were properly changed
to match the CTE definition, for the other references this was not
done. This was a consequence of incomplete code in the function
With_element::clone_parsed_spec(), which forgot to set the reference
to the CTE definition for the unit structures representing non-first
CTE references.
Approved by dmitry.shulga@mariadb.com
This bug could affect multi-way join queries with embedded outer joins
that contained a conjunctive IS NULL predicate over a non-nullable
column from the inner table of an outer join. The predicate could
occur in a WHERE condition or in an ON condition. Due to this bug a
wrong result set could be returned by the query. The bug manifested
itself only when join buffers were employed for the join operations.
The problem appeared because of
- a bug in the function JOIN_CACHE::get_match_flag_by_pos that did not
  always return proper match flags for embedding outer joins stored
  together with table rows put in a join buffer;
- a bug in the function JOIN_CACHE::join_matching_records that did not
  always correctly determine that a row from the buffer could be
  skipped due to the applied 'not_exists' optimization.
Example:
SELECT * FROM t1 LEFT JOIN ((t2 LEFT JOIN t3 ON c = d) JOIN t4) ON b = e
WHERE e IS NULL;
The patch introduces a new function that finds the match flag for a
record from a join buffer, specifying the buffer where this flag has
to be found.
The function is called JOIN_CACHE::get_match_flag_by_pos_from_join_buffer().
Now this function rather than JOIN_CACHE::get_match_flag_by_pos() is used
in JOIN_CACHE::skip_if_matched() to check whether a record from the join
buffer must be ignored when extending the record by null complements.
Also the code of the function JOIN_CACHE::skip_if_not_needed_match()
has been changed: the function now checks whether a record from the
join buffer may still produce some useful extensions.
Some clarifying comments have also been added.
Approved by monty@mariadb.com.
It was possible for a user to create an interlocked state which could
go on for a significant period of time. There is a tight loop in the
FTWRL code path that repeatedly tries to acquire a read lock. As the
weight of the FTWRL lock is the smallest among others, it is always
selected by the deadlock detector, but can never be killed.
Imagine the following sequence:
connection_0                     connection_1
GET_LOCK("l1", 0);
                                 LOCK TABLES t WRITE;
FLUSH TABLES WITH READ LOCK;
                                 GET_LOCK("l1", 1000);
The GET_LOCK statement in connection_1 triggers the deadlock detector,
which tries to select the lock in FTWRL as the victim, since its
weight is 0. However, since the loop in
Global_read_lock::lock_global_read_lock() always tries to win, it
acquires the lock again, which invokes the deadlock detector once
more, and that cycle continues until GET_LOCK in connection_1 times
out.
This patch resolves the live-lock by introducing a dynamic bonus to
the deadlock weight associated with every lock. Each lock gets a bonus
weight each time it is selected by the deadlock detector. In a
live-lock situation, the locks that cannot be killed gain additional
weight on each iteration. Eventually their weight becomes so high that
the deadlock detector shifts its attention to another lock, until it
finds one that can be killed.
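A minimal sketch of the bonus mechanism, with hypothetical types:

  #include <cassert>
  #include <climits>
  #include <vector>

  struct lock_info { unsigned base_weight; unsigned bonus; };

  lock_info *pick_victim(std::vector<lock_info> &cycle)
  {
    assert(!cycle.empty());          // a deadlock cycle was found
    lock_info *victim= nullptr;
    unsigned best= UINT_MAX;
    for (lock_info &l : cycle)
    {
      const unsigned w= l.base_weight + l.bonus;
      if (w < best) { best= w; victim= &l; }
    }
    // The live-lock escape: a victim that cannot actually be killed
    // will be selected again and again, gaining weight each time,
    // until the detector moves on to a killable lock.
    victim->bonus++;
    return victim;
  }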
The problem was that the CONNECT engine tries to open the .frm file
during drop_table(), which the code did not take into account.
Fixed by adding the HA_REUSES_FILE_NAMES table flag to CONNECT.
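A minimal sketch of the flag change (the flag's actual bit value comes from handler.h; the one below is a placeholder):

  #include <cstdint>

  static const uint64_t HA_REUSES_FILE_NAMES= 1ULL << 0;  // placeholder

  // The CONNECT handler's table_flags() now advertises that file
  // names are derived from the table name, so paths such as the .frm
  // remain meaningful during drop_table().
  uint64_t ha_connect_table_flags_sketch(uint64_t other_flags)
  {
    return other_flags | HA_REUSES_FILE_NAMES;
  }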
Other things:
- Fixed a wrong test of HA_REUSES_FILE_NAMES in mysql_alter_table()
  (the comment was correct, but not the code).
- Added a test in the CONNECT engine: if the .frm file it tries to use
  in delete is not made for CONNECT, it generates an error instead of
  crashing.