If UPDATE/DELETE does not change data, it is skipped by
replication. We now force replication of such events when they trigger
partition auto-creation.
For ROLLBACK it is as simple as setting the OPTION_KEEP_LOG
flag; trans_cannot_safely_rollback() does the rest.
For UPDATE/DELETE .. LIMIT 0 we make additional binlog_query() calls
at the early points of return.
As a safety measure we also convert row format into statement if it is
needed. The condition is decided by
binlog_need_stmt_format(). Basically if there are some row events in
cache we don't need that: table open of row event will trigger
auto-creation anyway.
Multi-update/delete works via mysql_select(). There are no early points
of return, so binlogging is always checked by
send_eof()/abort_resultset(). But we must comply with the above
measure of converting into statement.
:: Syntax change ::
Keyword AUTO enables history partition auto-creation.
Examples:
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO;
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 MONTH
STARTS '2021-01-01 00:00:00' AUTO PARTITIONS 12;
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME LIMIT 1000 AUTO;
Or with explicit partitions:
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO
(PARTITION p0 HISTORY, PARTITION pn CURRENT);
To disable or enable auto-creation one can use ALTER TABLE, removing
AUTO from or adding it to the partitioning specification:
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO;
# Disables auto-creation:
ALTER TABLE t1 PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR;
# Enables auto-creation:
ALTER TABLE t1 PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO;
If the rest of partitioning specification is identical to CREATE TABLE
no repartitioning will be done (for details see MDEV-27328).
:: Description ::
Before executing history-generating DML command (see the list of commands below)
add N history partitions, so that N would be sufficient for potentially
generated history. N > 1 may be required when history partitions are switched
by INTERVAL and current_timestamp is N times further than the interval
boundary of the last history partition.
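A minimal sketch of how N could be computed for INTERVAL, as an
illustration only and not the server code. It assumes a fixed-length
interval and an exclusive partition boundary (a timestamp equal to the
boundary belongs to the next partition); partitions_to_add,
last_boundary and now are hypothetical names.

```python
from datetime import datetime, timedelta


def partitions_to_add(last_boundary: datetime, now: datetime,
                      interval: timedelta) -> int:
    """Number of history partitions to add so that `now` lies before
    the end boundary of the newest partition (model, not server code)."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if now < last_boundary:
        # The last history partition still covers the current time.
        return 0
    # Whole intervals already elapsed past the boundary, plus one
    # partition to hold the history generated right now.
    return (now - last_boundary) // interval + 1


boundary = datetime(2021, 1, 1, 0, 0, 0)
hour = timedelta(hours=1)
print(partitions_to_add(boundary, datetime(2021, 1, 1, 2, 30), hour))
```

With a one-hour interval and current_timestamp 2.5 hours past the last
boundary, three partitions are needed under these assumptions.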
If the last history partition equals or exceeds LIMIT records then new history
partition is created and selected as the working partition. According to
MDEV-28411 partitions cannot be switched (or created) while the command is
running. Thus LIMIT does not carry strict limitation and the history partition
size must be planned as LIMIT value plus average number of history one DML
command can generate.
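The sizing rule above can be illustrated with a small model. This is a
sketch under the assumption that the working partition is chosen only
before a command starts and never changes while it runs;
simulate_limit_switching and its arguments are hypothetical names, not
server functions.

```python
def simulate_limit_switching(limit, rows_per_command):
    """Return history row counts per partition after running commands
    that generate the given numbers of history rows (model only)."""
    partitions = [0]
    for rows in rows_per_command:
        # The switch is decided before the command runs ...
        if partitions[-1] >= limit:
            partitions.append(0)
        # ... and the whole command writes into that one partition.
        partitions[-1] += rows
    return partitions


sizes = simulate_limit_switching(1000, [600, 600, 600])
print(sizes)
```

In this model a partition can exceed LIMIT by up to the history volume
of a single command, which is why the planned size is LIMIT plus the
average history one command generates.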
Auto-creation is implemented by synchronous fast_alter_partition_table() call
from the thread of the executed DML command before the command itself is run
(by the fallback and retry mechanism similar to Discovery feature,
see Open_table_context).
The names of newly added partitions are generated like default partition names
with extension of MDEV-22155 (which avoids name clashes by extending assignment
counter to next free-enough gap).
These DML commands can trigger auto-creation:
DELETE (including multitable DELETE, excluding DELETE HISTORY)
UPDATE (including multitable UPDATE)
REPLACE (including REPLACE .. SELECT)
INSERT .. ON DUPLICATE KEY UPDATE (including INSERT .. SELECT .. ODKU)
LOAD DATA .. REPLACE
:: Bug fixes ::
MDEV-23642 Locking timeout caused by auto-creation affects original DML
The reasons for this are:
- Do not disrupt main business process (the history is auxiliary service);
- Consequences are non-fatal (history is not lost, but goes into the wrong
partition; fixed by partitioning rebuild);
- The application has more freedom to decide whether to fail in this case:
it may read the warning info and find the corresponding error number;
- While a non-failing command is easy for an application to detect and fail,
the opposite is hard to handle: there are no automatic actions to fix
a failed command and retry, DBA intervention is required and until then
the application is non-functioning.
MDEV-23639 Auto-create does not work under LOCK TABLES or inside triggers
Don't do tdc_remove_table() for OT_ADD_HISTORY_PARTITION because it is
not possible in locked tables mode.
LTM_LOCK_TABLES mode (and LTM_PRELOCKED_UNDER_LOCK_TABLES) works out
of the box as fast_alter_partition_table() can reopen tables via
locked_tables_list.
In LTM_PRELOCKED we reopen and relock table manually.
:: More fixes ::
* some_table_marked_for_reopen flag fix
some_table_marked_for_reopen affects only reopen of
m_locked_tables. I.e. Locked_tables_list::reopen_tables() reopens only
tables from m_locked_tables.
* Unused can_recover_from_failed_open() condition
Can recover_from_failed_open() really be used after
open_and_process_routine()?
:: Reviewed by ::
Sergei Golubchik <serg@mariadb.org>
When we need to add/remove or change LIMIT, INTERVAL, AUTO we have to
recreate partitioning from scratch (via data copy). Such operations
should be done fast. To remove options like LIMIT or INTERVAL one
should write:
alter table t1 partition by system_time;
The command checks whether it is new or existing SYSTEM_TIME
partitioning. And in the case of new it behaves as CREATE would do:
adds default number of partitions (2). If SYSTEM_TIME partitioning
already existed it just changes its options: removes unspecified ones
and adds/changes those specified explicitly. In case when partitions
list was supplied it behaves as usual: does full repartitioning.
Examples:
create or replace table t1 (x int) with system versioning
partition by system_time limit 100 partitions 4;
# Change LIMIT
alter table t1 partition by system_time limit 33;
# Remove LIMIT
alter table t1 partition by system_time;
# This does full repartitioning
alter table t1 partition by system_time limit 33 partitions 4;
# This does data copy as pruning will require records in correct partitions
alter table t1 partition by system_time interval 1 hour
starts '2000-01-01 00:00:00';
# But this works fast, LIMIT will apply to DML commands
alter table t1 partition by system_time limit 33;
To sum up, ALTER for SYSTEM_TIME partitioning does full repartitioning
when:
- INTERVAL was added or changed;
- partition list or partition number was specified.
Otherwise it does fast alter table.
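The summary can be written down as a small decision function. This is
only a restatement of the rules listed above; AlterSpec and its fields
are hypothetical names that describe what the ALTER statement
specifies, not server structures.

```python
from dataclasses import dataclass


@dataclass
class AlterSpec:
    """What an ALTER ... PARTITION BY SYSTEM_TIME statement specifies."""
    interval_added_or_changed: bool = False
    partition_list_given: bool = False
    partition_count_given: bool = False


def needs_full_repartitioning(spec: AlterSpec) -> bool:
    # Any of these forces a data copy; everything else (changing or
    # removing LIMIT, toggling AUTO) is a fast alter in this model.
    return (spec.interval_added_or_changed
            or spec.partition_list_given
            or spec.partition_count_given)


# alter table t1 partition by system_time limit 33;
print(needs_full_repartitioning(AlterSpec()))
# alter table t1 partition by system_time limit 33 partitions 4;
print(needs_full_repartitioning(AlterSpec(partition_count_given=True)))
```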
Cleaned up dead condition in set_up_default_partitions().
Reviewed by:
Oleksandr Byelkin <sanja@mariadb.com>
Nikita Malyavin <nikitamalyavin@gmail.com>
Moved LIMIT warning from vers_set_hist_part() to new call
vers_check_limit() at table unlock phase. At that point
read_partitions bitmap is already pruned by DML code (see
prune_partitions(), find_used_partitions()) so we have to set
corresponding bits for working history partition.
Also we don't do my_error(ME_WARNING|ME_ERROR_LOG), because at that
point it doesn't update warnings number, so command reports 0 warnings
(but warning list is still updated). Instead we do
push_warning_printf() and sql_print_warning() separately.
Under LOCK TABLES external_lock(F_UNLCK) is not executed. There is
start_stmt(), but no corresponding "stop_stmt()". So for that mode we
call vers_check_limit() directly from close_thread_tables().
Test result has been changed according to new LIMIT and warning
printing algorithm. For convenience all LIMIT warnings are marked with
"You see warning above ^".
TODO MDEV-20345 fixed. Now vers_history_generating() contains
fine-grained list of DML-commands that can generate history (and TODO
mechanism worked well).
Like in MDEV-27217 vers_set_hist_part() for LIMIT depends on all
partitions selected in read_partitions. That bugfix just disabled
partition selection for DELETE with this check:
  if (table->pos_in_table_list &&
      table->pos_in_table_list->partition_names)
  {
    return HA_ERR_PARTITION_LIST;
  }
ALTER TABLE TRUNCATE PARTITION is a different story. First, it doesn't
update pos_in_table_list->partition_names, but
thd->lex->alter_info.partition_names. But we cannot depend on that
since alter_info will be stale for DML. Second, we should not disable
TRUNCATE PARTITION for that to be consistent with TRUNCATE TABLE
behavior.
Now we don't do vers_set_hist_part() for ALTER TABLE as this command
is not DML, so it does not produce history.
Implicit system-versioned table does not contain system fields in SHOW
CREATE. Therefore after mysqldump recovery such a table has its system
fields in the last place in the frm image. The original table meanwhile
does not guarantee that the system fields are in the last place, because
adding new fields via ALTER TABLE places the new fields last. Thus the
order of fields may differ between master and slave, so row-based
replication may fail.
To fix this on ALTER TABLE we now place system-invisible fields always
last in frm image. If the table was created via old revision and has
an incorrect order of fields it can be fixed via any copy operation of
ALTER TABLE, e.g.:
ALTER TABLE t1 FORCE;
To check the order of fields in frm file one can use hexdump:
hexdump -C t1.frm
Note, the replication fails only when all 3 conditions are met:
1. row-based or mixed mode replication;
2. table has new fields added via ALTER TABLE;
3. table was rebuilt on some, but not all nodes via mysqldump image.
Otherwise it will operate properly even with incorrect order of
fields.
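A small model of the field order problem and of the fix. It is a
sketch only: fields are represented as (name, system_invisible) pairs
and system_fields_last is a hypothetical helper, not the server
routine that builds the frm image.

```python
def system_fields_last(fields):
    """Stable reorder: user fields first, system-invisible fields last."""
    user = [f for f in fields if not f[1]]
    system = [f for f in fields if f[1]]
    return user + system


# Old behaviour on the original node: ALTER TABLE ADD y appended y
# after the implicit system fields.
original = [("x", False), ("row_start", True), ("row_end", True),
            ("y", False)]
# Node rebuilt from a mysqldump image: system fields come last.
restored = [("x", False), ("y", False), ("row_start", True),
            ("row_end", True)]

print(original == restored)                      # orders differ
print(system_fields_last(original) == restored)  # same order after fix
```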
vers_info->hist_part retained stale value after ROLLBACK. The
algorithm in vers_set_hist_part() continued iteration from that value.
The simplest solution is to process partitions each time from start
for LIMIT in vers_set_hist_part().
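A simplified model of why continuing from a stale position is wrong.
It assumes the working partition is the first history partition with
fewer than LIMIT records (or the last one if all are full);
pick_hist_part is a hypothetical name and the real algorithm in
vers_set_hist_part() may differ in detail.

```python
def pick_hist_part(records, limit, start=0):
    """Index of the working history partition, scanning from `start`."""
    for i in range(start, len(records)):
        if records[i] < limit:
            return i
    # All scanned partitions are full: stay on the last one.
    return len(records) - 1


# Record counts after ROLLBACK: partition 1 is no longer full, but the
# stale value still points at partition 2.
records = [100, 40, 0]
print(pick_hist_part(records, limit=100, start=2))  # stale start
print(pick_hist_part(records, limit=100))           # scan from start
```

Scanning from the start selects partition 1 again in this model, while
the stale start skips it.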
records_are_comparable() requires this condition:
bitmap_is_subset(table->write_set, table->read_set)
On first iteration vers_update_fields() changes write_set and
read_set. On second iteration the above condition fails.
Added missing read bit for ROW_START. Also reorganized
bitmap_set_bit() so it is called only when needed.
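The failing condition can be modelled with integer bitmaps. This is an
illustration, not server code: the bit layout and
vers_update_fields_model are assumptions, and the only behaviour taken
from the text is that the write bit for ROW_START was set without the
matching read bit.

```python
X, ROW_START = 1 << 0, 1 << 1


def bitmap_is_subset(a: int, b: int) -> bool:
    """True if every bit set in `a` is also set in `b`."""
    return (a & ~b) == 0


def vers_update_fields_model(write_set, read_set, fixed):
    """Model: mark ROW_START for write, and for read only when fixed."""
    write_set |= ROW_START
    if fixed:
        read_set |= ROW_START
    return write_set, read_set


# UPDATE t1 SET x = ...: only x is read and written at first.
w, r = vers_update_fields_model(X, X, fixed=False)
print(bitmap_is_subset(w, r))  # condition for the second iteration
w, r = vers_update_fields_model(X, X, fixed=True)
print(bitmap_is_subset(w, r))
```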
Throw ER_NOT_FORM_FILE if this is wrong FRM data (warning with
ER_VERS_FIELD_WRONG_TYPE is still printed for deeper knowledge of what
happened).
Keep ER_VERS_FIELD_WRONG_TYPE for creating partitioned table with
trx-versioning. Tested by MDEV-15951 in trx_id.test
A few regression tests invoke heavy flushing of the buffer pool
and may trigger warnings that tablespaces could not be deleted
because of pending writes. Those warnings are to be expected
during the execution of such tests.
The warnings are also frequently seen with Valgrind or MemorySanitizer.
For those, the global suppression in have_innodb.inc does the trick.
MDEV-27025 allows records to be inserted before the record on which DELETE is
locked. As a result the DELETE misses those records, which causes a serious ACID
violation.
Revert MDEV-27025, MDEV-27550. The test which shows the scenario of ACID
violation is added.
The code was backported from 10.6 commit bd03c0e516.
See that commit message for details.
Apart from the above commit trx_lock_t::wait_trx was also backported from
MDEV-24738. trx_lock_t::wait_trx is protected with lock_sys.wait_mutex
in 10.6, but that mutex was implemented only in MDEV-24789. As there is no
need to backport MDEV-24789 for MDEV-27025,
trx_lock_t::wait_trx is protected with the same mutexes as
trx_lock_t::wait_lock.
This fix should not break innodb-lock-schedule-algorithm=VATS. This
algorithm uses an Eldest-Transaction-First (ETF) heuristic, which prefers
older transactions over new ones. In this fix we just insert a granted lock
just before the last granted lock of the same transaction, which does not
change the transaction execution order.
The changes in lock_rec_create_low() should not break Galera Cluster:
there is a big "if" branch for WSREP. This branch is necessary to provide
the correct transaction execution order, and should not be changed for
the current bug fix.
When lock is checked for conflict, ignore other locks on the record if
they wait for the requesting transaction.
lock_rec_has_to_wait_in_queue() does not iterate over all locks for
the page, but only over the locks located before the waiting lock in the
queue. So there is an invariant: any lock in the queue can wait only for
a lock which is located before it in the queue.
In the case when a conflicting lock waits for the transaction of the
requesting lock, we need to place the requesting lock before the waiting
lock in the queue to preserve the invariant. That is why we look for the
first lock waiting for the requesting transaction and place the new
lock just after the last granted lock of the requesting transaction that
precedes that first waiting lock.
Example:
trx1 waiting lock, trx1 granted lock, ..., trx2 lock - waiting for trx1
place new lock here -----------------^
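The placement rule from the example can be sketched as a search over a
simplified queue. This is a model, not InnoDB code: Lock and
insert_position are hypothetical names, appending when nothing waits
for the transaction and falling back to the position of the first
waiter when no granted lock precedes it are assumptions not stated in
the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lock:
    trx: str
    waiting: bool = False
    waits_for: Optional[str] = None  # transaction this lock waits for


def insert_position(queue: List[Lock], trx: str) -> int:
    """Index at which a new lock of `trx` is inserted (model only)."""
    first_waiter = next(
        (i for i, lock in enumerate(queue)
         if lock.waiting and lock.waits_for == trx),
        None,
    )
    if first_waiter is None:
        return len(queue)  # assumption: nothing waits for trx, append
    # Last granted lock of trx located before the first waiter.
    for i in range(first_waiter - 1, -1, -1):
        if queue[i].trx == trx and not queue[i].waiting:
            return i + 1
    return first_waiter  # assumption: place right before the waiter


queue = [Lock("trx1", waiting=True), Lock("trx1"), Lock("trx3"),
         Lock("trx2", waiting=True, waits_for="trx1")]
print(insert_position(queue, "trx1"))
```

For the queue from the example the new trx1 lock goes to index 2, right
after the granted trx1 lock and before the trx2 lock that waits for
trx1, so the invariant is preserved.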
There are also implicit locks which are lazily converted to explicit
ones, and we need to place the newly created explicit lock in the correct
place in the queue. All explicit locks converted from implicit ones are
placed just after the last non-waiting lock of the same transaction before
the first lock waiting for that transaction.
Code review and cleanup were done by Marko Mäkelä.
First, we do not add VERS_UPDATE_UNVERSIONED_FLAG for system field and
that fixes SHOW CREATE result.
Second, we have to call check_sys_fields() for any CREATE TABLE and
there correct type is checked for system fields.
Third, we update system_time like as_row structures for ALTER TABLE
and that makes check_sys_fields() happy for ALTER TABLE when we make
system fields hidden.
Update was skipped (need_update was false) because compare_record()
used HA_PARTIAL_COLUMN_READ branch and it skipped row_start check, as
has_explicit_value() was false. When we set bit for row_start in
has_value_set the row is updated with new row_start value.
The bug was caused by combination of MDEV-23446 and 3789692d17. The
latter one says:
... But generated columns that are written to the table are always
deterministic and cannot change unless normal non-generated columns
were changed. ...
Since MDEV-23446, generated row_start can change while non-generated
columns are not changed.
Explicit value flag came from HAS_EXPLICIT_DEFAULT which was used to
distinguish default-generated value from user-supplied one.
LIMIT history switching requires the number of history partitions to
be marked for read: from first to last non-empty plus one empty. The
least we can do is to fail with error message if the needed partition
was not marked for read. As this is handler interface we require new
handler error code to display user-friendly error message.
Switching by INTERVAL works out-of-the-box with
ER_ROW_DOES_NOT_MATCH_GIVEN_PARTITION_SET error.