When one session SELECT ... FOR UPDATE and holds the lock, subsequent
sessions that SELECT ... FOR UPDATE will wait to get the lock.
Currently, that event is labeled as `wait/io/table/sql/handler`, which
is incorrect. Instead, it should have been
`wait/lock/table/sql/handler`.
Two factors contribute to this bug:
1. Instrumentation interface and the heavy usage of `TABLE_IO_WAIT` in
`sql/handler.cc` file. See interface [^1] for better understanding;
2. The balancing act [^2] of doing instrumentation aggregration _AND_
having good performance. For example, EVENTS_WAITS_SUMMARY... is
aggregated using EVENTS_WAITS_CURRENT. Aggregration needs to be based
on the same wait class, and the code was overly aggressive in label a
LOCK operation as an IO operation in this case.
The proposed fix is pretty simple, but understanding the bug took a
while. Hence the footnotes below. For future improvement and
refactoring, we may want to consider renaming `TABLE_IO_WAIT` and making
it less coarse and more targeted.
Note that newly added test case, events_waits_current_MDEV-29091,
initially didn't pass Buildbot CI for embedded build tests. Further
research showed that other impacted tests all included not_embedded.inc.
This oversight was fixed later.
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
[^1]: To understand `performance_schema` instrumentation interface, I
found this URL is the most helpful:
https://dev.mysql.com/doc/dev/mysql-server/latest/PAGE_PFS_PSI.html
[^2]: The best place to understand instrumentation projection,
composition, and aggregration is through the source file. Although I
prefer reading Doxygen produced html file, but for whatever reason, the
rendering is not ideal. Here is link to 10.6's pfs.cc:
https://github.com/MariaDB/server/blob/10.6/storage/perfschema/pfs.cc
MDEV-28227 added the error messages in simplified characters.
Lets use these for those running a zh_CN profile.
From Haidong Ji in the MDEV, Taiwan/Hong Kong (zh_TW/zh_HK)
would expect traditional characters so this is left for when
we have these.
- InnoDB tries to build the previous version of the record for
the virtual index, but the undo log record doesn't contain
virtual column information. This leads to assert failure while
building the tuple.
ha_partition doesn't forward Rowid Filter API calls to the storage
engines handling partitions.
An attempt to use rowid filtering with caused either
- Rowid Filter being shown in EXPLAIN but not actually used
- Assertion failure when subquery code tried to disable/enable rowid
filter, which was not present.
Fixed by returning correct flags from ha_partition::index_flags()
One of the constraints added in the MDEV-29639 patch, is that only
the first event after idling should update last_master_timestamp;
and as long as the replica has more events to execute, the variable
should not be updated. The corresponding test,
rpl_delayed_parallel_slave_sbm.test, aims to verify this; however,
if the IO thread takes too long to queue events, the SQL thread can
appear to catch up too fast.
This fix ensures that the relay log has been fully written before
executing the events.
Note that the underlying cause of this test failure needs to be
addressed as a bug-fix, this is a temporary fix to stop test
failures. To track work on the bug-fix for the underlying issue,
please see MDEV-30619.
The parser code for single-table DELETE missed the call of the function
LEX::check_main_unit_semantics(). As a result the the field nested level
of SELECT_LEX structures remained set 0 for all non-top level selects.
This could lead to different kind of problems. In particular this did not
allow to determine properly the selects where set functions had to be
aggregated when they were used in inner subqueries.
Approved by Oleksandr Byelkin <sanja@mariadb.com>
This patch allowed transformation of EXISTS subqueries into equivalent
IN predicands at the top level of WHERE conditions for multi-table UPDATE
and DELETE statements. There was no reason to prohibit the transformation
for such statements. The transformation provides more opportunities of
using semi-join optimizations.
Approved by Oleksandr Byelkin <sanja@mariadb.com>
Following tests do not test anymore what they intended to test
deleted: suite/galera/t/MDEV-24143.test
deleted: suite/galera/t/galera_bf_abort_get_lock.test
Enable use of Rowid Filter optimization with eq_ref access.
Use the following assumptions:
- Assume index-only access cost is 50% of non-index-only access cost.
- Take into account that "Eq_ref access cache" reduces the number of
lookups eq_ref access will make.
= This means the number of Rowid Filter checks is reduced also
= Eq_ref access cost is computed using that assumption (see
prev_record_reads() call), so we should use it in all cost '
computations.
ANALYZE was observed to race over a preceding in binlog order DML
in updating the binlog and slave gtid states.
Tagging ANALYZE and other admin class commands in binlog by the fixes
of MDEV-17515 left a flaw allowing such race leading to
the gtid mode out-of-order error.
This is fixed now to observe by ADMIN commands the ordered access to
the slave gtid status variables and binlog.
This commit merely adds is a Read-Committed version MDEV-30225 test
solely to prove the RC isolation yields ROW binlog format as it is
supposed to per docs.
This bug manifested itself in very rare situations when splitting
optimization was applied to a materialized derived table with group clause
by key over a constant meargeable derived table that was in inner part of
an outer join. In this case the used tables for the key to access the
split table incorrectly was evaluated to a not empty table map.
Approved by Oleksandr Byelkin <sanja@mariadb.com>
Problem
========
On a parallel, delayed replica, Seconds_Behind_Master will not be
calculated until after MASTER_DELAY seconds have passed and the
event has finished executing, resulting in potentially very large
values of Seconds_Behind_Master (which could be much larger than the
MASTER_DELAY parameter) for the entire duration the event is
delayed. This contradicts the documented MASTER_DELAY behavior,
which specifies how many seconds to withhold replicated events from
execution.
Solution
========
After a parallel replica idles, the first event after idling should
immediately update last_master_timestamp with the time that it began
execution on the primary.
Reviewed By
===========
Andrei Elkin <andrei.elkin@mariadb.com>
This patch fixes the patch for bug MDEV-30248 that unsatisfactorily
resolved the problem of resolution of references to CTE. In some cases
when such a reference has the same table name as the name of one of
CTEs containing this reference the reference could be resolved incorrectly
that led to an invalid select tree where units could be mutually dependent.
This in its turn could lead to an infinite sequence of recursive calls or
to falls into infinite loops.
The patch also removes LEX::resolve_references_to_cte_in_hanging_cte() as
with the new code for resolution of CTE references the call of this
function is not needed anymore.
Approved by Oleksandr Byelkin <sanja@mariadb.com>
(Initial patch by Varun Gupta. Amended and added comments).
When the query has both
1. Aggregate functions that require sorting data by group, and
2. Window functions
we need to use two temporary tables. The first temp.table will hold the
join output. Then it is passed to filesort(). Reading it in sorted
order allows to compute the aggregate functions.
Then, we need to write their values into the second temp. table. Then,
Window Function computation step can pass that to filesort() and read
them in the order it needs.
Failure to create the second temp. table would cause an assertion
failure: window function could would not find where to get the values
of the aggregate functions.
disable bulk insert optimization if long uniques are used, because they
need to read the table (index_read) after every inserted now. And bulk
insert optimization might disable indexes.
bulk insert is already disabled in other cases when there are chances
that the table will be read duing the bulk insert.
plugin_vars_free_values() was walking plugin sysvars and thus
did not free memory of plugin PLUGIN_VAR_NOSYSVAR vars.
* change it to walk all plugin vars
* add the pluginname_ prefix to NOSYSVARS var names too,
so that plugin_vars_free_values() would be able to find their
bookmarks
Use SELECT_LEX to save lists for ORDER BY and GROUP BY before parsing
WINDOW clauses / specifications. This is needed for proper parsing
of a nested WINDOW clause when a WINDOW clause is used in a subquery
contained in another WINDOW clause.
Fix assignment of empty SQL_I_List to another one (in case of empty list
next shoud point on first).
Item_singlerow_subselect may be converted to Item_cond during
optimization. So there is a possibility of constructing nested
Item_cond_and or Item_cond_or which is not allowed (such
conditions must be flattened).
This commit checks if such kind of optimization has been applied
and flattens the condition if needed
This commit contains only a mtr test for reproducing the issue in MDEV-29512
The actual fix will be pushed in wsrep-lib repository
The hanging in MDEV-29512 happens when binlog purging is attempted, and there is
one local BF aborted transaction waiting for commit monitor.
The test will launch two node cluster and enable binlogging with expire log days,
to force binlog purging to happen.
A local transaction is executed so that will become BF abort victim, and has advanced
to replication stage waiting for commit monitor for final cleanup (to mark position in innodb)
after that, applier is released to complete the BF abort and due to binlog configuration,
starting the binlog purging. This is where the hanging would occur, if code is buggy
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
Created mtr test for reproducing the crash
Developed actual fix for the issue.
Setting THD::system_thread_info.rpl_sql_info for replayer thread,
same way as it is handled for appliers.
Recorded test result, with the fix
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
node->is_delete was incorrectly set to NO_DELETE for a set of operations.
In general we shouldn't rely on sql_command and look for more abstract ways
to control the behavior.
trg_event_map seems to be a suitable way. To mind replica nodes, it is ORed
with slave_fk_event_map, which stores trg_event_map when replica has
triggers disabled.
Problem:
=======
Mysqlbinlog cannot show the type of a compressed
column when two levels of verbosity is provided.
Solution:
========
Extend the log event printing logic to handle and
tag compressed types.
Behavioral Changes:
==================
Old: When mysqlbinlog is called in verbose mode and
the database uses compressed columns, an error is
returned to the user.
New: The output will append “ COMPRESSED” on the
type of compressed columns
Reviewed By
===========
Andrei Elkin <andrei.elkin@mariadb.com>
(Variant 3, initial variant was by Rex Jonston)
A LEFT JOIN with a constant as a column of the inner table produced wrong
query result if the optimizer had to write the inner table column into a
temp table. Query pattern:
SELECT ...
FROM (SELECT /*non-mergeable select*/
FROM t1 LEFT JOIN (SELECT 'Y' as Val) t2 ON ...) as tbl
Fixed this by adding Item_direct_view_ref::save_in_field() which follows
the pattern of Item_direct_view_ref's save_org_in_field(),
save_in_result_field() and val_XXX() functions:
* call check_null_ref() and handle NULL value
* if we didn't get a NULL-complemented row, call Item_direct_ref's function.
it's incorrect to use change_item_tree() to replace arguments
of top-level AND/OR, because they (arguments) are stored in a List,
so a pointer to an argument is in the list_node, and individual
list_node's of top-level AND/OR can be deleted in Item_cond::build_equal_items().
In that case rollback_item_tree_changes() will modify the deleted object.
Luckily, it's not needed to use change_item_tree() for top-level
AND/OR, because the whole top-level item is copied and preserved
in prep_where and prep_on, and restored from there.
So, just don't.
Additionally to the test case in the commit it fixes
* ASAN failure of main.opt_tvc --ps
* ASAN failure of main.having_cond_pushdown --ps
when an internal temporary table field is created from a real field,
a new temp field should only copy a default from the source field
when the latter has it
when creating a temp table field from an actual table field,
these two fields are supposed to be mostly identical
(except for BIT field storage), in particular, temp field should
have the same default as the orig field, even if the sql_mode has
been changed meanwhile (e.g. to include NO_ZERO_DATE)
regression from MDEV-29540 / 8c38939369.
INSERT SELECT errors needed to be unconditionally ignored.
As this touches the CREATE .. SELECT functionality, show
the equalivent test there.
This bug affected queries with nested left joins having the same last inner
table such that not_exists optimization could be applied to the most inner
outer join when optimizer chose to use join buffers. The bug could lead to
producing wrong a result set.
If the WHERE condition a query contains a conjunctive IS NULL predicate
over a non-nullable column of an inner table of a not nested outer join
then not_exists optimization can be applied to tho the outer join. With
this optimization when looking for matches for a certain record from the
outer table of the join the records of the inner table can be ignored
right after the first match satisfying the ON condition is found.
In the case of nested outer joins having the same last inner table this
optimization still can be applied but only if all ON conditions of the
embedding outer joins are satisfied. Such check was missing in the code
that tried to apply not_exists optimization when join buffers were used
for outer join operations.
This problem has been already fixed in the patch for bug MDEV-7992. Yet
there it was resolved only for the cases when join buffers were not used
for outer joins.
Approved by Oleksandr Byelkin <sanja@mariadb.com>