MDEV-21810 MBR: Unexpected "Unsafe statement" warning for unsafe IODKU
The MDEV-17614 fixes for replication unsafety of INSERT ON DUP KEY UPDATE
on tables with two or more unique keys left a flaw. The fixes checked the
safety condition for each inserted record, with the idea of catching a
user-supplied value for an autoincrement column; when that is detected, the
autoincrement column becomes a source of unsafety too.
It was not expected that after a duplicate-key error the next record's
write_set may become different, so the unsafety decision computed for that
specific record could corrupt the query's binlogging state; when
@@binlog_format is MIXED, nothing got binlogged at all.
This case was already fixed in 10.5.2 by 91ab42a823, which
relocated/optimized THD::decide_logging_format_low() out of the record-insert
loop, so the safety decision is computed once, at the right time.
Pertinent parts of that commit are cherry-picked here.
Also, a spurious unsafety warning under @@binlog_format=MIXED is removed,
and the original MDEV-17614 test result is corrected.
The original test of MDEV-17614 is extended and made more readable.
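For illustration, a minimal sketch of the affected pattern (table and
column names are made up): a table with two unique keys, one of them
autoincrement, inserted with IODKU under @@binlog_format=MIXED:
CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY,
u INT, UNIQUE KEY(u)) ENGINE=InnoDB;
# A duplicate error on one row used to change the next row's write_set
# and corrupt the binlogging decision:
INSERT INTO t1 (u) VALUES (1), (1)
ON DUPLICATE KEY UPDATE u = u + 1;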
or slow query log when log_output=TABLE.
When this happens, we temporarily disable logging by changing log_output
until we've created the general_log and slow_log tables again.
In XML mode, move </database> until after the transaction_registry.
The general_log and slow_log tables were moved to be dumped first, so
that general/slow query logging is disabled for as short a time as
possible.
Previously the correct SQL mode for a stored routine or
package was only set before writing the CREATE part. This
worked out for PROCEDUREs and FUNCTIONs, but with ORACLE
mode specific PACKAGEs the DROP also only works in ORACLE
mode.
Moving the setting of the sql_mode a few lines up, so that it happens
right before the DROP statement is written, fixes this.
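A hedged sketch of the intended dump output for a package (the object
name pkg1 is made up; the point is that sql_mode is set before the DROP,
not only before the CREATE):
SET sql_mode = 'ORACLE';
DROP PACKAGE IF EXISTS pkg1;
CREATE PACKAGE pkg1 AS
PROCEDURE p1;
END;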
If an UPDATE/DELETE does not change data, it is skipped from
replication. We now force replication of such events when they trigger
partition auto-creation.
For ROLLBACK it is as simple as setting the OPTION_KEEP_LOG
flag; trans_cannot_safely_rollback() does the rest.
For UPDATE/DELETE .. LIMIT 0 we make additional binlog_query() calls
at the early points of return.
As a safety measure we also convert row format into statement format
when needed; the condition is decided by binlog_need_stmt_format().
Basically, if there are already some row events in the cache, we don't
need that: the table open of a row event will trigger auto-creation
anyway.
Multi-update/delete works via mysql_select(). There are no early points
of return, so binlogging is always checked by
send_eof()/abort_result_set(). But we must comply with the above
measure of converting into statement format.
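A hedged illustration (table t1 as in the examples below): statements
that change nothing are still binlogged when they trigger history
partition auto-creation:
# Matches no rows, but is binlogged anyway if it auto-creates
# a history partition:
UPDATE t1 SET x= x + 1 LIMIT 0;
DELETE FROM t1 LIMIT 0;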
:: Syntax change ::
Keyword AUTO enables history partition auto-creation.
Examples:
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO;
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 MONTH
STARTS '2021-01-01 00:00:00' AUTO PARTITIONS 12;
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME LIMIT 1000 AUTO;
Or with explicit partitions:
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO
(PARTITION p0 HISTORY, PARTITION pn CURRENT);
To disable or enable auto-creation one can use ALTER TABLE by adding
or removing AUTO from partitioning specification:
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO;
# Disables auto-creation:
ALTER TABLE t1 PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR;
# Enables auto-creation:
ALTER TABLE t1 PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO;
If the rest of partitioning specification is identical to CREATE TABLE
no repartitioning will be done (for details see MDEV-27328).
:: Description ::
Before executing a history-generating DML command (see the list of
commands below), add N history partitions, so that N is sufficient for
the potentially generated history. N > 1 may be required when history
partitions are switched by INTERVAL and current_timestamp is N intervals
past the interval boundary of the last history partition.
If the last history partition reaches or exceeds LIMIT records, a new
history partition is created and selected as the working partition.
According to MDEV-28411, partitions cannot be switched (or created) while
the command is running. Thus LIMIT is not a strict limitation, and the
history partition size must be planned as the LIMIT value plus the
average amount of history one DML command can generate.
Auto-creation is implemented by a synchronous fast_alter_partition_table()
call from the thread of the executed DML command, before the command itself
is run (via a fallback-and-retry mechanism similar to the Discovery feature,
see Open_table_context).
The names for newly added partitions are generated like default partition
names, with the extension of MDEV-22155 (which avoids name clashes by
advancing the assignment counter to the next sufficiently free gap).
These DML commands can trigger auto-creation:
DELETE (including multitable DELETE, excluding DELETE HISTORY)
UPDATE (including multitable UPDATE)
REPLACE (including REPLACE .. SELECT)
INSERT .. ON DUPLICATE KEY UPDATE (including INSERT .. SELECT .. ODKU)
LOAD DATA .. REPLACE
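For illustration, a hedged sketch using the INTERVAL table from the
examples above: a history-generating command may add partitions before
it runs, while a plain INSERT generates no history:
INSERT INTO t1 VALUES (1);
# Generates history; missing history partitions may be auto-created
# before the UPDATE runs:
UPDATE t1 SET x= 2;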
:: Bug fixes ::
MDEV-23642 Locking timeout caused by auto-creation affects original DML
The reasons for not failing the original DML in this case are:
- Do not disrupt the main business process (the history is an auxiliary
service);
- The consequences are non-fatal (history is not lost, but goes into the
wrong partition; fixed by a partitioning rebuild);
- There is more freedom for the application to decide whether to fail in
this case or not: it may read the warning info and find the corresponding
error number.
- While a non-failing command is easy for an application to handle (and
fail if desired), the opposite is hard to handle: there are no automatic
actions to fix a failed command and retry it, DBA intervention is
required, and until then the application is non-functioning.
MDEV-23639 Auto-create does not work under LOCK TABLES or inside triggers
Don't do tdc_remove_table() for OT_ADD_HISTORY_PARTITION because it is
not possible in locked tables mode.
LTM_LOCK_TABLES mode (and LTM_PRELOCKED_UNDER_LOCK_TABLES) works out
of the box, as fast_alter_partition_table() can reopen tables via
locked_tables_list.
In LTM_PRELOCKED we reopen and relock the table manually.
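A hedged sketch of the LOCK TABLES case that now works (table t1 as in
the examples above):
LOCK TABLES t1 WRITE;
# May auto-create a history partition; no tdc_remove_table() is done:
UPDATE t1 SET x= x + 1;
UNLOCK TABLES;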
:: More fixes ::
* some_table_marked_for_reopen flag fix
some_table_marked_for_reopen affects only the reopening of
m_locked_tables, i.e. Locked_tables_list::reopen_tables() reopens only
tables from m_locked_tables.
* Unused can_recover_from_failed_open() condition
Can recover_from_failed_open() really be used after
open_and_process_routine()?
:: Reviewed by ::
Sergei Golubchik <serg@mariadb.org>
When we need to add, remove, or change LIMIT, INTERVAL, or AUTO, we have
to recreate the partitioning from scratch (via data copy). Such
operations should be done fast. To remove options like LIMIT or INTERVAL
one should write:
alter table t1 partition by system_time;
The command checks whether it is a new or an existing SYSTEM_TIME
partitioning. In the case of a new one it behaves as CREATE would:
it adds the default number of partitions (2). If SYSTEM_TIME partitioning
already existed, it just changes its options: removes unspecified ones
and adds/changes those specified explicitly. When a partition list was
supplied, it behaves as usual: it does a full repartitioning.
Examples:
create or replace table t1 (x int) with system versioning
partition by system_time limit 100 partitions 4;
# Change LIMIT
alter table t1 partition by system_time limit 33;
# Remove LIMIT
alter table t1 partition by system_time;
# This does full repartitioning
alter table t1 partition by system_time limit 33 partitions 4;
# This does data copy as pruning will require records in correct partitions
alter table t1 partition by system_time interval 1 hour
starts '2000-01-01 00:00:00';
# But this works fast, LIMIT will apply to DML commands
alter table t1 partition by system_time limit 33;
To sum up, ALTER for SYSTEM_TIME partitioning does a full repartitioning
when:
- INTERVAL was added or changed;
- a partition list or partition number was specified.
Otherwise it does a fast alter table.
Cleaned up dead condition in set_up_default_partitions().
Reviewed by:
Oleksandr Byelkin <sanja@mariadb.com>
Nikita Malyavin <nikitamalyavin@gmail.com>
This is based on commit 20ae4816bb
with some adjustments for MDEV-12353.
row_ins_sec_index_entry_low(): If a separate mini-transaction is
needed to adjust the minimum bounding rectangle (MBR) in the parent
page, we must disable redo logging if the table is a temporary table.
For temporary tables, no log is supposed to be written, because
the temporary tablespace will be reinitialized on server restart.
rtr_update_mbr_field(), rtr_merge_and_update_mbr(): Changed the return
type to void and removed unreachable code. In older versions, these
used to return a different value for temporary tables.
page_id_t: Add constexpr to most member functions.
mtr_t::log_write(): Catch log writes to invalid tablespaces
so that the test case would crash without the fix to
row_ins_sec_index_entry_low().
rtr_update_mbr_field(): Plug a memory leak.
(This is the assert that was added in the fix for MDEV-26047.)
Table elimination may remove an ON expression from an outer join.
However, SELECT_LEX::update_used_tables() will still call
item->walk(&Item::eval_not_null_tables)
for eliminated expressions. If the subquery is constant and cheap,
Item_cond_and will attempt to evaluate it, which triggers the
assert.
The fix is to not call update_used_tables() or eval_not_null_tables()
for ON expressions that were eliminated.
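A hedged sketch of the pattern (schema is made up): t2 is joined on its
unique key and referenced nowhere else, so it can be eliminated together
with its ON expression, including the constant subquery:
CREATE TABLE t2 (id INT PRIMARY KEY, y INT);
SELECT t1.x
FROM t1 LEFT JOIN t2
ON t2.id = t1.x AND t2.y = (SELECT MAX(y) FROM t3);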
* FreeBSD returns errno 31 (EMLINK, Too many links),
not 40 (ELOOP, Too many levels of symbolic links)
* (`mysqlbinlog|mysql`) was just crazy, why did it ever work?
* socket_ipv6.inc check (that checked whether ipv6 is supported)
only worked correctly when ipv6 was supported
* perfschema.socket_summary_by_instance was changing global variables
and then skipping the test (because of missing ipv6)
- When arguments to a procedure contained a quote in the name, the procedure failed with a parsing error.
The reason was that additional quoting was done when testing for a TEMPORARY table with the same name.
- Reviewed by: <wlad@mariadb.com>
Window Functions code tries to minimize the number of times it
needs to sort the select's resultset by finding "compatible"
OVER (PARTITION BY ... ORDER BY ...) clauses.
This employs compare_order_elements(). That function assumed that
the order expressions are Item_field-derived objects (that refer
to a temp. table). But this is not always the case: one can
construct queries whose order expressions are arbitrary item expressions.
Add handling for such expressions: sort them according to the window
specification they appeared in.
This means we cannot detect that two compatible PARTITION BY clauses
that use expressions can share the sorting step.
But at least we won't crash.
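A hedged example of such a query (names are made up): the PARTITION BY
and ORDER BY elements are arbitrary expressions, not plain column
references:
SELECT x,
SUM(x) OVER (PARTITION BY x % 10 ORDER BY x + 1)
FROM t1;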
Grep is not safe, as it will get confused if there is more than one
matching line or if the progress report is slightly different; progress
reporting is affected by how fast IST is sent/received.
The fix is to use sed; we are only interested in there being at
least one progress report from IST.
Problem:
========
InnoDB DDL fails to remove the newly added table or index from the
dictionary, and the index stub from the table cache, if the alter
transaction encounters a DEADLOCK error in the commit phase.
Solution:
========
Restart the alter table transaction if it encounters a DEADLOCK in the
commit phase, so that the index stubs and the index/table removal from
the dictionary can be done in rollback_inplace_alter_table().
- Added one assert in rollback_inplace_alter_table() to indicate that
the online log for the old table shouldn't exist.
When "mariabackup --target-dir=$basedir --incremental-dir=$incremental_dir"
is running and is moving a new table file (e.g. `db1/t1.new`) from the
incremental directory to the base directory, it needs to verify that the base
backup database directory (e.g. `$basedir/db1`) really exists
(or create it otherwise).
The table `db1/t1` can come from a new database `db1` which
was created while the base mariabackup was running.
In such a case the directory `db1` exists only in the incremental
directory, but does not exist in the base directory.
This patch corrects the fix for MDEV-26412.
Note that when parsing an ON expression the pointer to the current select
is always in select_stack[select_stack_top - 1]. So the pointer to the
outer select (if any) is in select_stack[select_stack_top - 2].
The query manifesting this bug is added to the test case of MDEV-26412.
MDEV-28082 Crash when using HAVING with IS NULL predicate in an equality
These bugs have been fixed by the patch for MDEV-26402.
Only test cases are added.
Moved the LIMIT warning from vers_set_hist_part() to a new call,
vers_check_limit(), at the table unlock phase. At that point the
read_partitions bitmap is already pruned by the DML code (see
prune_partitions(), find_used_partitions()), so we have to set the
corresponding bits for the working history partition.
Also, we don't do my_error(ME_WARNING|ME_ERROR_LOG), because at that
point it doesn't update the warning count, so the command would report
0 warnings (though the warning list is still updated). Instead we do
push_warning_printf() and sql_print_warning() separately.
Under LOCK TABLES, external_lock(F_UNLCK) is not executed. There is
start_stmt(), but no corresponding "stop_stmt()", so for that mode we
call vers_check_limit() directly from close_thread_tables().
The test result has been changed according to the new LIMIT and
warning-printing algorithm. For convenience all LIMIT warnings are marked
with "You see warning above ^".
The TODO from MDEV-20345 is fixed. Now vers_history_generating() contains
a fine-grained list of the DML commands that can generate history (and
the TODO mechanism worked well).
Like in MDEV-27217, vers_set_hist_part() for LIMIT depends on all
partitions selected in read_partitions. That bugfix just disabled
partition selection for DELETE with this check:
if (table->pos_in_table_list &&
table->pos_in_table_list->partition_names)
{
return HA_ERR_PARTITION_LIST;
}
ALTER TABLE ... TRUNCATE PARTITION is a different story. First, it
doesn't update pos_in_table_list->partition_names, but
thd->lex->alter_info.partition_names, and we cannot depend on the latter
since alter_info will be stale for DML. Second, we should not disable
TRUNCATE PARTITION, to be consistent with TRUNCATE TABLE
behavior.
Now we don't do vers_set_hist_part() for ALTER TABLE, as this command
is not DML, so it does not produce history.
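For reference, a hedged example of the statement that stays allowed,
consistent with TRUNCATE TABLE:
ALTER TABLE t1 TRUNCATE PARTITION p0;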
In commit 1bd681c8b3 (MDEV-25506 part 3)
the way DDL transactions delete files was rewritten.
Only files that are actually attached to InnoDB tablespaces would be
deleted, and only after the DDL transaction was durably committed.
After a failed ALTER TABLE...IMPORT TABLESPACE, any data files that
the user might have moved to the data directory will not be attached
to the InnoDB data dictionary. Therefore, DROP TABLE would not
attempt to delete those files, and a subsequent CREATE TABLE would
fail. The logic was that the user who created the files outside the
DBMS is still the owner of those files, and InnoDB should not delete
those files because an "ownership transfer" (IMPORT TABLESPACE) was
not successfully completed.
However, not deleting those detached files could surprise users.
ha_innobase::delete_table(): Even if no tablespace exists, try to
delete any files that might match the table name.
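A hedged sketch of the user-visible scenario (names are made up):
ALTER TABLE t1 IMPORT TABLESPACE; # fails; the copied t1.ibd stays detached
DROP TABLE t1; # now also deletes such leftover files
CREATE TABLE t1 (a INT) ENGINE=InnoDB; # no longer fails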
Reviewed by: Thirunarayanan Balathandayuthapani
(cherry-pick into preview-10.9-MDEV-27021-explain tree)
An Expression_cache_tmptable object uses an Expression_cache_tracker
object to report the statistics.
In the common scenario, the Expression_cache_tmptable destructor sets
tracker->cache=NULL. The tracker object survives after the expression
cache is deleted, and one may call cache_tracker->fetch_current_stats()
on it with no harm.
However, a degenerate cache with no parameters does not set
tracker->cache=NULL in the Expression_cache_tmptable destructor, which
results in an attempt to use freed data in the
cache_tracker->fetch_current_stats() call.
Fixed by setting tracker->cache to NULL and wrapping the assignment in
a function.
- Describe the lifetime of EXPLAIN data structures in
sql_explain.h:ExplainDataStructureLifetime.
- Make Item_field::set_field() call set_refers_to_temp_table()
when it refers to a temp. table.
- Introduce the QT_DONT_ACCESS_TMP_TABLES flag for Item::print.
It directs Item_field::print to not try to access its
temp table.
- Introduce Explain_query::notify_tables_are_closed()
and call it right before the query closes its tables.
- Make the Explain data structures' print_explain_json() methods
accept a "no_tmp_tbl" parameter, which means: pass
QT_DONT_ACCESS_TMP_TABLES when printing items.
- Make Show_explain_request::call_in_target_thread() not call
set_current_thd(). This wasn't needed, as the code inside
lex->print_explain() uses output->thd anyway; output->thd
refers to the SHOW command's THD object.
SHOW EXPLAIN/ANALYZE FORMAT=JSON tries to access items that have already been
freed by a call to free_items() during THD::cleanup_after_query().
The solution is to disallow APC calls including SHOW EXPLAIN/ANALYZE
just before the call to free_items().
1. Add an explicit indication that the output is produced by the
SHOW EXPLAIN/ANALYZE FORMAT=JSON command.
2. Remove the useless "r_total_time_ms" field from SHOW ANALYZE
FORMAT=JSON output when no timed statistics have been gathered.
3. Add "r_query_time_in_progress_ms" to the output of SHOW ANALYZE FORMAT=JSON.
EXPLAIN FOR CONNECTION is a MySQL-compatible syntax for SHOW EXPLAIN.
This commit also adds support for FORMAT=JSON to SHOW EXPLAIN,
so the possible options to get JSON-formatted output are:
- SHOW EXPLAIN FORMAT=JSON FOR $con
- EXPLAIN FORMAT=JSON FOR CONNECTION $con
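A hedged usage sketch (the connection id 3 is made up):
SHOW EXPLAIN FORMAT=JSON FOR 3;
EXPLAIN FORMAT=JSON FOR CONNECTION 3;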
Problem:
========
The test logic checked for the wrong condition to validate that the
slave had caught up with the master. Specifically, it expected the
thread stages of the IO and SQL threads to be “Waiting for
master to send event” and “Slave has read all relay log; waiting for
more updates”, respectively. The problem exposed by this MDEV
is that these are also the initial slave states before reading
data from the primary (whereas the intended state was having already
read all available events from the primary and now waiting for new
events). This made the MTR test validate data that it had not yet
received, and thereby fail.
Solution:
========
Instead of using the IO/SQL thread states, use the existing helper
functions save_master_gtid.inc and sync_with_master_gtid.inc. Note
that the test result file also needed to be updated to reflect
this fix.
Special thanks to Angelique Sklavounos for pointing out that
--stop-position was not specified in any buildbot failures, as this
led to an IF block in the MTR test that was the source of the test
failure.
Reviewed By
============
Andrei Elkin <andrei.elkin@mariadb.com>
If an INSERT/REPLACE SELECT statement contained an ON expression in the
top level select, and this expression used a subquery with a column
reference that could not be resolved, then an attempt to resolve this
reference as an outer reference caused a crash of the server. This
happened because the outer context field in the Name_resolution_context
structure was not set to NULL for such references; rather, it pointed to
the first element in the select_stack.
Note that starting from 10.4 we cannot use the SELECT_LEX::outer_select()
method when parsing a SELECT construct.
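A hedged sketch of the crashing pattern (all names are made up; with the
fix, the unresolvable column yields a normal error instead of a crash):
INSERT INTO t1
SELECT t2.a
FROM t2 JOIN t3
ON t3.b = (SELECT no_such_col FROM t4);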
Approved by Oleksandr Byelkin <sanja@mariadb.com>
btr_insert_into_right_sibling(): Inherit any gap lock from the
left sibling to the right sibling before inserting the record
into the right sibling and updating the node pointer(s).
lock_update_node_pointer(): Update locks in case a node pointer
will move.
Based on mysql/mysql-server@c7d93c274f