Delete-marked record is on the secondary index and the clustered index
already purged the corresponding record. We cannot detect if such
record is historical and we should not: the algorithm of
row_ins_check_foreign_constraint() skips such record anyway.
MDEV-16026: Forbid global system_versioning_asof in non-default time zone
* store `system_versioning_asof` in unix time;
* both session and global vars are processed in session timezone;
* setting `default` does not copy global variable anymore. Instead, it sets
system_time to SYSTEM_TIME_UNSPECIFIED, which means that no 'AS OF' time
is applied and `now()` can be assumed
As a regression, we cannot assign values below 1970 (UTC) anymore
MDEV-16481: set global system_versioning_asof=sf() crashes in specific case
* sys_vars.h: add `MYSQL_TIME` field to `set_var::save_result`
* sys_vars.ic: get rid of calling `var->value->get_date()` from
`Sys_var_vers_asof::update()`
* versioning.sysvars: add test; remove double warning
refactor Sys_var_vers_asof
* inherit from sys_var rather than Sys_var_enum
* remove junk "DEFAULT" keyword. There is DEFAULT in SQL grammar for it.
* make all conversions in check() to avoid possible errors
* avoid double var->value evaluation, which could
consequence in undefined behavior
Historical query with AS OF point after the last history partition
must include last history partition because it can be overflown
(contain history rows out of right endpoint).
Move left point back to hist_part->id in that case.
The practical maximum value of the parameter innodb_lock_wait_timeout
is 100,000,000. Any value larger than that specifies an infinite timeout.
Therefore, we should make 100,000,000 the maximum value of the parameter.
Item_func_history (is_history()) is a bool function that checks if the
row is the history row by checking row_end->is_max(). The argument to
this function must be row_end system field.
Added the above function to conjunction with SYSTEM_TIME_BEFORE
versioning condition.
Cause: no table->update_handler cloned at the moment of
vers_insert_history_row(). update_handler is needed because there
can't be several inited indexes at once in the same handler. First
index is inited by QUICK_RANGE_SELECT::reset(). Then when history row
is inserted check_duplicate_long_entry_key() is done and it requires
another index.
We implement an idea that was suggested by Michael 'Monty' Widenius
in October 2017: When InnoDB is inserting into an empty table or partition,
we can write a single undo log record TRX_UNDO_EMPTY, which will cause
ROLLBACK to clear the table.
For this to work, the insert into an empty table or partition must be
covered by an exclusive table lock that will be held until the transaction
has been committed or rolled back, or the INSERT operation has been
rolled back (and the table is empty again), in lock_table_x_unlock().
Clustered index records that are covered by the TRX_UNDO_EMPTY record
will carry DB_TRX_ID=0 and DB_ROLL_PTR=1<<55, and thus they cannot
be distinguished from what MDEV-12288 leaves behind after purging the
history of row-logged operations.
Concurrent non-locking reads must be adjusted: If the read view was
created before the INSERT into an empty table, then we must continue
to imagine that the table is empty, and not try to read any records.
If the read view was created after the INSERT was committed, then
all records must be visible normally. To implement this, we introduce
the field dict_table_t::bulk_trx_id.
This special handling only applies to the very first INSERT statement
of a transaction for the empty table or partition. If a subsequent
statement in the transaction is modifying the initially empty table again,
we must enable row-level undo logging, so that we will be able to
roll back to the start of the statement in case of an error (such as
duplicate key).
INSERT IGNORE will continue to use row-level logging and locking, because
implementing it would require the ability to roll back the latest row.
Since the undo log that we write only allows us to roll back the entire
statement, we cannot support INSERT IGNORE. We will introduce a
handler::extra() parameter HA_EXTRA_IGNORE_INSERT to indicate to storage
engines that INSERT IGNORE is being executed.
In many test cases, we add an extra record to the table, so that during
the 'interesting' part of the test, row-level locking and logging will
be used.
Replicas will continue to use row-level logging and locking until
MDEV-24622 has been addressed. Likewise, this optimization will be
disabled in Galera cluster until MDEV-24623 enables it.
dict_table_t::bulk_trx_id: The latest active or committed transaction
that initiated an insert into an empty table or partition.
Protected by exclusive table lock and a clustered index leaf page latch.
ins_node_t::bulk_insert: Whether bulk insert was initiated.
trx_t::mod_tables: Use C++11 style accessors (emplace instead of insert).
Unlike earlier, this collection will cover also temporary tables.
trx_mod_table_time_t: Add start_bulk_insert(), end_bulk_insert(),
is_bulk_insert(), was_bulk_insert().
trx_undo_report_row_operation(): Before accessing any undo log pages,
invoke trx->mod_tables.emplace() in order to determine whether undo
logging was disabled, or whether this is the first INSERT and we are
supposed to write a TRX_UNDO_EMPTY record.
row_ins_clust_index_entry_low(): If we are inserting into an empty
clustered index leaf page, set the ins_node_t::bulk_insert flag for
the subsequent trx_undo_report_row_operation() call.
lock_rec_insert_check_and_lock(), lock_prdt_insert_check_and_lock():
Remove the redundant parameter 'flags' that can be checked in the caller.
btr_cur_ins_lock_and_undo(): Simplify the logic. Correctly write
DB_TRX_ID,DB_ROLL_PTR after invoking trx_undo_report_row_operation().
trx_mark_sql_stat_end(), ha_innobase::extra(HA_EXTRA_IGNORE_INSERT),
ha_innobase::external_lock(): Invoke trx_t::end_bulk_insert() so that
the next statement will not be covered by table-level undo logging.
ReadView::changes_visible(trx_id_t) const: New accessor for the case
where the trx_id_t is not read from a potentially corrupted index page
but directly from the memory. In this case, we can skip a sanity check.
row_sel(), row_sel_try_search_shortcut(), row_search_mvcc():
row_sel_try_search_shortcut_for_mysql(),
row_merge_read_clustered_index(): Check dict_table_t::bulk_trx_id.
row_sel_clust_sees(): Replaces lock_clust_rec_cons_read_sees().
lock_sec_rec_cons_read_sees(): Replaced with lower-level code.
btr_root_page_init(): Refactored from btr_create().
dict_index_t::clear(), dict_table_t::clear(): Empty an index or table,
for the ROLLBACK of an INSERT operation.
ROW_T_EMPTY, ROW_OP_EMPTY: Note a concurrent ROLLBACK of an INSERT
into an empty table.
This is joint work with Thirunarayanan Balathandayuthapani,
who created a working prototype.
Thanks to Matthias Leich for extensive testing.
Problem: Assertion `transactional_table || !changed ||
thd->transaction.stmt.modified_non_trans_table' failed due REPLACE into a
versioned table.
It is not specific to system versioning/pertitioning/heap, but this
combination makes it much easier to reproduce.
The thing is to make first ha_update_row call succeed to make
info->deleted != 0. And then make REPLACE fail by any reason.
In this scenario we overflow versioned partition, so next ha_update_row
succeeds, but corresponding ha_write_row fails to insert history record.
Fix: modified_non_trans_table is set in one missed place
Add history row outside of compare_record() check. For TRX_ID
versioning we have to fail can_compare_record to force InnoDB update
which adds history row; and there in ha_innobase::update_row() is
additional "row changed" check where we force history row anyway.
First part of the fix (row0mysql.cc) addresses external columns when adding history
row on referential action. The full data must be retrieved before the
row is inserted.
Second part of the fix (the rest) avoids duplicate primary key error between
the history row generated on referential action and the history row
generated by SQL command. Both command and referential action can
happen on same table since foreign key can be self-reference (parent
and child tables are same). Moreover, the self-reference can refer
multiple rows when the key is non-unique. In such case history is
generated by referential action occured on first row but processed all
rows by a matched key. The second round is when the next row is
processed by a command but history already exists. In such case we
check TRX_ID of existing history row and if it is the same we assume
the above situation and skip adding one more history row or failing
the command.
First part (row0mysql.cc) fixes ins_node_set_new_row() usage workflow
as it is designed to operate on empty row (see row_get_prebuilt_insert_row()
for example).
Second part (row0ins.cc) fixes duplicate key error in FTS_DOC_ID_INDEX
since history rows must not generate entries in that index. We detect
FTS_DOC_ID_INDEX by a number of attributes and skip it if the row is
historical.
Misc fixes:
row_build_index_entry_low() does not accept non-NULL tuple
for FTS index (subject assertion fails), assertion (index->type !=
DICT_FTS) adds code understanding.
Now as historical_row is copied in row_update_vers_insert() there is
no need to copy the row twice: ROW_COPY_POINTERS is used to build
historical_row initially.
dbug_print_rec() debug functions.
- Remove row_start/row_end from keys in fix_create_like();
- Disable manual adding of implicit row_start/row_end to indexes on
CREATE TABLE. INVISIBLE_SYSTEM fields are unoperable by user;
- Fix memory leak on allocation of Key_part_spec.
PARTITION clause in SELECT means query is non-versioned (see
WITH_PARTITION_STORAGE_ENGINE in vers_setup_conds()).
vers_setup_conds() expands such query to SYSTEM_TIME_ALL which is then
added to VIEW specification. When VIEW is queried both clauses
PARTITION and FOR SYSTEM_TIME ALL lead to ER_VERS_QUERY_IN_PARTITION
(same place WITH_PARTITION_STORAGE_ENGINE).
Fix removes FOR SYSTEM_TIME ALL from VIEW by accessing original
SYSTEM_TIME clause: the one specified in parser. As a side-effect
EXPLAIN SELECT displays SYSTEM_TIME specified in SELECT which is
user-friendly.
For join to work correctly versioning condition must be added to table
on_expr. Without that JOIN_CACHE gets expression (1)
trigcond(xtitle.row_end = TIMESTAMP'2038-01-19 06:14:07.999999') and
trigcond(xtitle.elementId = x.`id` and xtitle.pkey = 'title')
instead of (2)
trigcond(xtitle.elementId = x.`id` and xtitle.pkey = 'title')
for join_null_complements(). It is NULL-row of xtitle for
complementing the join and the above comparisons of course FALSE, but
trigcond (Item_func_trig_cond) makes them TRUE via its trig_var
property which is bound to some boolean properties of JOIN_TAB.
Expression (2) evaluated to TRUE because its trig_var is bound to
first_inner_tab->not_null_compl. The expression (1) does not evaluate
correctly because row_end comparison's trig_var is bound to
first_inner->found earlier. As a result JOIN_CACHE::check_match()
skipped the row for join_null_complements().
When we add versioning condition to table's on_expr the optimizer in
make_join_select() distributes conditions differently. tmp_cond
inherits on_expr value and in Good case it is full expression
xgender.elementId = x.`id` and xgender.pkey = 'gender' and
xgender.row_end = TIMESTAMP'2038-01-19 06:14:07.999999'
while in Bad case it is only
xgender.elementId = x.`id` and xgender.pkey = 'gender'.
Later in Good row_end condition is optimized out and we get one
trigcond in form of (2).