MDEV-20704 changed the rules of how the (HA_BINARY_PACK_KEY |
HA_VAR_LENGTH_KEY) flags are added. Older FRMs created before that fix
have these flags set for an index on a DOUBLE column. After that fix,
when ALTER sees such an old FRM it decides it cannot do an instant
alter because compare_keys_but_name() fails: it compares the flags
against the tmp table created by ALTER.
The MDEV-20704 fix was actually not about the DOUBLE type but about
FIELDFLAG_BLOB, which happened to affect DOUBLE. So there is no direct
knowledge that other types were not affected as well.
The proposed fix makes CHECK TABLE check whether the FRM has the
(HA_BINARY_PACK_KEY | HA_VAR_LENGTH_KEY) flags and was created prior
to MDEV-20704, and if so it reports "needs upgrade". When mysqlcheck
and mysql_upgrade see that status they issue ALTER TABLE FORCE and
upgrade the table to the current server version.
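For illustration, assuming t1 is such a pre-MDEV-20704 table with an
index on a DOUBLE column, the manual equivalent of what mysqlcheck and
mysql_upgrade do is roughly:
CHECK TABLE t1 FOR UPGRADE;   -- now reports "needs upgrade" for the old FRM
ALTER TABLE t1 FORCE;         -- rebuilds the table on the current server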
The handler for an existing partition was already index-inited at the
beginning of copy_partitions().
In the case of REORGANIZE PARTITION we fill the new partition by calling
its ha_write_row() (the handler is the storage engine of the new
partition). From there we go through the conditions below:
if (this->inited == RND)
table->clone_handler_for_update();
handler *h= table->update_handler ? table->update_handler : table->file;
First, the above misses the point of the this->inited check. Here it is
the new partition and its handler is not inited, so we assign
table->file, which is the ha_partition handler and is not really known
to be inited or not. The code assumes (this == table->file); otherwise
we are outside the logic for using update_handler. This patch adds a
DBUG_ASSERT for that.
Second, we call check_duplicate_long_entries() for table->file, and
that calls ha_partition::index_init(), which calls index_init() for
each partition's handler. But the existing partitions' handlers were
already inited in copy_partitions() and we fail on an assertion.
The fix implies that we don't need check_duplicate_long_entries()
per partition, as check_duplicate_long_entries() has already been done
for ha_partition. For REORGANIZE PARTITION that means an existing row
was already checked by previous INSERT/UPDATE commands, so there is no
need to check it again (see the NOTE in handler::ha_write_row()).
The fix also optimizes ha_update_row() so that
check_duplicate_long_entries_update() is not called per partition,
considering it was already called for ha_partition. Besides, a
per-partition duplicate check is not really usable.
Problem:
=======
This patch addresses two issues:
1. An incident event can be incorrectly reported for transactions
which are rolled back successfully. That is, an incident event
should only be generated for failed “non-transactional transactions”
(i.e., those which modify non-transactional tables) because they
cannot be rolled back.
2. When the MariaDB slave stops with an error upon receiving the
incident event, there is no description of what led to it, neither in
the event nor in the master's error log.
Solution:
========
Before reporting an incident event for a transaction, first validate
that it is “non-transactional” (i.e. cannot be safely rolled back).
To determine if a transaction is non-transactional,
lex->stmt_accessed_table(LEX::STMT_WRITES_NON_TRANS_TABLE)
is used because it is set previously in
THD::decide_logging_format().
Additionally, when an incident event is written, write an error
message to the server’s error log to indicate the underlying issue.
Reviewed by:
===========
Andrei Elkin <andrei.elkin@mariadb.com>
Also fixes
MDEV-27782 Wrong columns when using table level `CHARACTER SET utf8mb4 COLLATE DEFAULT`
MDEV-28644 Unexpected error on ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb3, DEFAULT CHARACTER SET utf8mb4
:: Syntax change ::
Keyword AUTO enables history partition auto-creation.
Examples:
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO;
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 MONTH
STARTS '2021-01-01 00:00:00' AUTO PARTITIONS 12;
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME LIMIT 1000 AUTO;
Or with explicit partitions:
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO
(PARTITION p0 HISTORY, PARTITION pn CURRENT);
To disable or enable auto-creation one can use ALTER TABLE by adding
or removing AUTO from partitioning specification:
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO;
# Disables auto-creation:
ALTER TABLE t1 PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR;
# Enables auto-creation:
ALTER TABLE t1 PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO;
If the rest of partitioning specification is identical to CREATE TABLE
no repartitioning will be done (for details see MDEV-27328).
:: Description ::
Before executing a history-generating DML command (see the list of
commands below) add N history partitions, so that N is sufficient for
the potentially generated history. N > 1 may be required when history
partitions are switched by INTERVAL and current_timestamp is N intervals
past the boundary of the last history partition.
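A worked example of the INTERVAL case (timestamps are made up for
illustration):
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR AUTO;
-- If the last history partition covers history up to 10:00 and an
-- UPDATE runs at 13:30, up to N=4 history partitions may be added
-- before the UPDATE is executed.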
If the last history partition holds LIMIT records or more, then a new
history partition is created and selected as the working partition.
According to MDEV-28411 partitions cannot be switched (or created) while
the command is running. Thus LIMIT is not a strict limit and the history
partition size must be planned as the LIMIT value plus the average
number of history rows one DML command can generate.
Auto-creation is implemented by a synchronous fast_alter_partition_table()
call from the thread of the executing DML command, before the command
itself is run (via a fallback-and-retry mechanism similar to the Discovery
feature; see Open_table_context).
The names for newly added partitions are generated like default partition
names, with the extension of MDEV-22155 (which avoids name clashes by
advancing the assignment counter to the next free-enough gap).
These DML commands can trigger auto-creation:
DELETE (including multitable DELETE, excluding DELETE HISTORY)
UPDATE (including multitable UPDATE)
REPLACE (including REPLACE .. SELECT)
INSERT .. ON DUPLICATE KEY UPDATE (including INSERT .. SELECT .. ODKU)
LOAD DATA .. REPLACE
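A minimal illustration (table and data are made up):
CREATE TABLE t1 (x int) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME LIMIT 1000 AUTO;
INSERT INTO t1 VALUES (1);   -- plain INSERT generates no history
UPDATE t1 SET x= 2;          -- history-generating DML; may auto-create
                             -- a new history partition
SELECT partition_name FROM information_schema.partitions
WHERE table_schema = DATABASE() AND table_name = 't1';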
:: Bug fixes ::
MDEV-23642 Locking timeout caused by auto-creation affects original DML
The reasons for not failing the original DML in this case are:
- Do not disrupt the main business process (the history is an auxiliary
service);
- Consequences are non-fatal (history is not lost, but goes into the wrong
partition; fixable by a partitioning rebuild);
- The application has more freedom to decide whether to fail in this case
or not: it may read the warning info and find the corresponding error
number;
- While a non-failing command is easy for an application to handle and
fail on its own, the opposite is hard to handle: there are no automatic
actions to fix a failed command and retry, DBA intervention is required,
and until then the application is non-functional.
MDEV-23639 Auto-create does not work under LOCK TABLES or inside triggers
Don't do tdc_remove_table() for OT_ADD_HISTORY_PARTITION because it is
not possible in locked tables mode.
LTM_LOCK_TABLES mode (and LTM_PRELOCKED_UNDER_LOCK_TABLES) works out
of the box as fast_alter_partition_table() can reopen tables via
locked_tables_list.
In LTM_PRELOCKED we reopen and relock table manually.
:: More fixes ::
* some_table_marked_for_reopen flag fix
some_table_marked_for_reopen affects only the reopen of
m_locked_tables, i.e. Locked_tables_list::reopen_tables() reopens only
tables from m_locked_tables.
* Unused can_recover_from_failed_open() condition
Can recover_from_failed_open() really be used after
open_and_process_routine()?
:: Reviewed by ::
Sergei Golubchik <serg@mariadb.org>
Analysis: There are two server variables, "old_mode" and "old". "old" is no
longer needed as "old_mode" has replaced it (however, it is still used in some
places in the code). "old_mode" and "old" have the same purpose: emulate
behavior from previous MariaDB versions. So they can be merged to avoid
confusion.
Fix: Deprecate the "old" variable and create another mode for @@old_mode to
mimic the behavior of the previous "old" variable. Create specific modes for
the specific tasks that the --old variable was covering earlier and use the
new modes instead.
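For example, after this change (warning text paraphrased):
SET @@global.old= 1;        -- still accepted, but reports a deprecation warning
SHOW WARNINGS;              -- points to @@old_mode as the replacement
SELECT @@global.old_mode;   -- the variable that carries the new mode(s)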
We will remove the parameter innodb_disallow_writes because it is badly
designed and implemented. The parameter was never allowed at startup.
It was only internally used by Galera snapshot transfer.
If a user executed
SET GLOBAL innodb_disallow_writes=ON;
the server could hang even on subsequent read operations.
During Galera snapshot transfer, we will block writes
to implement an rsync friendly snapshot, as follows:
sst_flush_tables() will acquire a global lock by executing
FLUSH TABLES WITH READ LOCK, which will block any writes
at the high level.
sst_disable_innodb_writes(), invoked via ha_disable_internal_writes(true),
will suspend or disable InnoDB background tasks or threads that could
initiate writes. As part of this, log_make_checkpoint() will be invoked
to ensure that anything in the InnoDB buf_pool.flush_list will be written
to the data files. This has the nice side effect that the Galera joiner
will avoid crash recovery.
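The SQL-visible part of this sequence is essentially (a sketch of what
the rsync SST script and sst_flush_tables() perform):
FLUSH TABLES WITH READ LOCK;   -- block writes at the high level
-- ... rsync the data directory to the joiner ...
UNLOCK TABLES;                 -- resume writes after the transfer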
The changes to sql/wsrep.cc and to the tests are based on a prototype
that was developed by Jan Lindström.
Reviewed by: Jan Lindström
Task 1:
If a table is added to the table list with the option TL_OPTION_SEQUENCE
(done when we have sequence functions), then we are dealing with a
sequence instead of a table, so the global table list will have
'sequence' set to true. This is used to check for and give the correct
error message about an unknown sequence instead of "table doesn't exist".
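For example (error wording approximate):
CREATE SEQUENCE s1;
SELECT NEXT VALUE FOR s1;   -- ok
SELECT NEXT VALUE FOR s2;   -- now reports that the SEQUENCE is unknown
                            -- instead of "table doesn't exist"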
Commit 6c39eaeb1 made crash recovery dependent on server_id.
Crash recovery could fail when restoring a new instance from the
original crashed data directory USING A NEW SERVER ID.
The issue doesn't exist in previous major versions before 10.6.
The root cause is that when generating the input XID to be searched in
the hash, the server id is populated with the current server id.
So if the server id changed when recovering, the XID couldn't be found
in the hash because the server id doesn't match.
The fix is to use the original server id when creating the input XID
object in the function `xarecover_do_commit_or_rollback`.
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
First, we do not add VERS_UPDATE_UNVERSIONED_FLAG for system fields,
which fixes the SHOW CREATE result.
Second, we have to call check_sys_fields() for any CREATE TABLE; there
the correct type of the system fields is checked.
Third, we update the system_time structure like the as_row structure
for ALTER TABLE, which makes check_sys_fields() happy for ALTER TABLE
when we make system fields hidden.
LIMIT history switching requires a number of history partitions to
be marked for read: from the first to the last non-empty one, plus one
empty. The least we can do is to fail with an error message if the
needed partition was not marked for read. As this is the handler
interface, we require a new handler error code to display a
user-friendly error message.
Switching by INTERVAL works out of the box with the
ER_ROW_DOES_NOT_MATCH_GIVEN_PARTITION_SET error.
MDEV-22617 Galera node crashes when trying to log to slow_log table in
streaming replication mode
Other things:
- Changed the name of wsrep_after_row() (two arguments) to
wsrep_after_row_internal() (one argument) to not depend on a
function signature with unused arguments.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
Added test case
The old code erroneously used default_charset_info to compare field names.
default_charset_info can point to any arbitrary collation,
including ucs2*, utf16*, utf32*, including those that do not
support strcasecmp().
my_charset_utf8mb4_unicode_ci, which is used in this scenario:
CREATE TABLE t1 ENGINE=InnoDB WITH SYSTEM VERSIONING AS SELECT 0;
does not support strcasecmp().
Fixing the code to use Lex_ident::streq(), which uses
system_charset_info instead of default_charset_info.
versioning_fields flag indicates that any columns were specified WITH
SYSTEM VERSIONING. In that case we imply WITH SYSTEM VERSIONING for
the whole table and WITHOUT SYSTEM VERSIONING for the other columns.
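For example, the following two definitions are treated the same way:
CREATE TABLE t1 (a int WITH SYSTEM VERSIONING, b int);
-- is interpreted as
CREATE TABLE t1 (a int, b int WITHOUT SYSTEM VERSIONING)
WITH SYSTEM VERSIONING;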
Mutex order violation when wsrep bf thread kills a conflicting trx,
the stack is
wsrep_thd_LOCK()
wsrep_kill_victim()
lock_rec_other_has_conflicting()
lock_clust_rec_read_check_and_lock()
row_search_mvcc()
ha_innobase::index_read()
ha_innobase::rnd_pos()
handler::ha_rnd_pos()
handler::rnd_pos_by_record()
handler::ha_rnd_pos_by_record()
Rows_log_event::find_row()
Update_rows_log_event::do_exec_row()
Rows_log_event::do_apply_event()
Log_event::apply_event()
wsrep_apply_events()
and mutexes are taken in the order
lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data
When a normal KILL statement is executed, the stack is
innobase_kill_query()
kill_handlerton()
plugin_foreach_with_mask()
ha_kill_query()
THD::awake()
kill_one_thread()
and mutexes are
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
This patch is the plan D variant for fixing the potential mutex locking
order violation exercised by BF aborting and KILL command execution.
In this approach, the KILL command is replicated as a TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting happening in parallel with KILL command execution
either. Potential mutex deadlocks between the different mutex
access paths of KILL command execution and BF aborting therefore
cannot happen.
TOI replication is used, in this approach, purely as a means
to provide isolated KILL command execution in the first node.
The KILL command should not (and must not) be applied in secondary
nodes. In this patch, we make sure of this by skipping KILL
execution in secondary nodes, in the applying phase, where we
bail out if an applier thread is trying to execute a KILL command.
This is effective, but skipping the applying of the KILL command
could happen much earlier as well.
This also fixes unprotected calls to wsrep_thd_abort that use
wsrep_abort_transaction; this is fixed by holding THD::LOCK_thd_data
while we abort the transaction.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
If plugin->deinit() returns a failure, it is no longer ignored; it
means that the plugin isn't ready to be unloaded from memory yet.
So it's marked "dying", deinitialized as much as possible,
but stays in memory until shutdown.
also:
* increment MARIA_PLUGIN_INTERFACE_VERSION
* rewrite ha_rocksdb to use the new approach, update the test
Dead code cleanup:
The part_info->num_parts usage was wrong and worked incorrectly in
mysql_drop_partitions() because num_parts is already updated in
prep_alter_part_table(). We don't have to update part_info->partitions
because part_info is destroyed in alter_partition_lock_handling().
Cleanups:
- DBUG_EVALUATE_IF() macro replaced by shorter form DBUG_IF();
- Typo in ER_KEY_COLUMN_DOES_NOT_EXITS.
Refactorings:
- Split write_log_replace_delete_frm() into write_log_delete_frm()
and write_log_replace_frm();
- partition_info via DDL_LOG_STATE;
- set_part_info_exec_log_entry() removed.
DBUG_EVALUATE removed
DBUG_EVALUATE was only added for consistency together with
DBUG_EVALUATE_IF. It is not used anywhere in the code.
DBUG_SUICIDE() fix on release build
On release builds DBUG_SUICIDE() was a statement. That was wrong, as
DBUG_SUICIDE() is used in expression context.
When inserting a number of rows into an empty table,
InnoDB will buffer and pre-sort the records for each index, and
build the indexes one page at a time.
For each index, a buffer of innodb_sort_buffer_size will be created.
If the buffer runs out of memory then we will create temporary files
for storing the data.
At the end of the statement, we will sort and apply the buffered
records. Ideally, we would do this at the end of the transaction
or only when starting to execute a non-INSERT statement on the table.
However, it could be awkward if duplicate keys or similar errors
would be reported during the execution of a later statement.
This will be addressed in MDEV-25036.
Any columns longer than 2000 bytes will be buffered in temporary files.
innodb_prepare_commit_versioned(): Apply all buffered bulk insert
operations at the end of each statement.
ha_commit_trans(): Handle errors from innodb_prepare_commit_versioned().
row_merge_buf_write(): This function should also accept a blob
file handle and should write the field data that is longer than
2000 bytes.
row_merge_bulk_t: Data structure to maintain the data during
bulk insert operation.
trx_mod_table_time_t::start_bulk_insert(): Notify the start of a
bulk insert operation and create a new buffer for the given table.
trx_mod_table_time_t::add_tuple(): Buffer a record.
trx_mod_table_time_t::write_bulk(): Do bulk insert operation
present in the transaction
trx_mod_table_time_t::bulk_buffer_exist(): Whether the buffer
storage exists for the bulk transaction.
trx_mod_table_time_t::write_bulk(): Write all buffered insert
operation for the transaction and the table.
row_ins_clust_index_entry_low(): Insert the data into the
bulk buffer if it already exists.
row_ins_sec_index_entry(): Insert the secondary tuple
if the bulk buffer already exists.
row_merge_bulk_buf_add(): Insert the tuple into the bulk insert
buffer.
row_merge_buf_blob(): Write the field data whose length is
more than 2000 bytes into blob temporary file. Write the
file offset and length into the tuple field.
row_merge_copy_blob_from_file(): Copy the blob from blob file
handler based on reference of the given tuple.
row_merge_insert_index_tuples(): Handle blob for bulk insert
operation.
row_merge_bulk_t::row_merge_bulk_t(): Constructor. Initialize
the buffer and file for all the indexes except the FTS index.
row_merge_bulk_t::create_tmp_file(): Create new temporary file
for the given index.
row_merge_bulk_t::write_to_tmp_file(): Write the content from
buffer to disk file for the given index.
row_merge_bulk_t::add_tuple(): Insert the tuple into the merge
buffer for the given index. If memory runs out then InnoDB
sorts the buffer and writes it into the file.
row_merge_bulk_t::write_to_index(): Do bulk insert operation
from merge file/merge buffer for the given index
row_merge_bulk_t::write_to_table(): Do bulk insert operation
for all the indexes.
dict_stats_update(): If a bulk insert transaction is in progress,
treat the table as empty. The index creation could hold latches
for extended amounts of time.
In a rebase of the merge, two preceding commits were accidentally reverted:
commit 112b23969a (MDEV-26308)
commit ac2857a5fb (MDEV-25717)
Thanks to Daniele Sciascia for noticing this.
Contains the following fixes:
* allow TOI commands to time out while trying to acquire TOI;
override lock_wait_timeout with LONG_TIMEOUT only after
successfully entering TOI
* only ignore lock_wait_timeout in TOI
* fix the galera_split_brain test as a TOI operation now returns ER_LOCK_WAIT_TIMEOUT after lock_wait_timeout
* explicitly test for TOI
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
Updates to the transaction registry table shouldn't be replicated in
the cluster, so there is no need to append wsrep keys.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
Because wsrep uses its own xid format for its recovery,
the xid hashing has to be refined.
When an xid object is not in the server "mysql" format,
the hash record is made to contain the xid also in the full format.
1. work around MDEV-25912 to not apply the assert
at wsrep run time;
2. handle the wsrep mode of server recovery;
3. convert hton calls to static binlog_commit ones;
4. satisfy an MSAN complaint about an uninitialized std::pair.
Problem:
=======
When the semisync master has crashed and is restarted as a slave, it
could recover transactions that former slaves may never have seen.
A known method to clear out all prepared transactions,
--tc-heuristic-recover=rollback, does not care to adjust the binlog
accordingly.
Fix:
===
The binlog-based recovery is made aware of the semisync-slave role of
the post-crash restarted server.
No change in behavior is made to the "normal" binlogging server
and the semisync master.
When the restarted server is configured with
--rpl-semi-sync-slave-enabled=1
the refined recovery attempts to roll back prepared transactions
and truncate the binlog accordingly.
A transaction that is partially committed (that is, committed in at
least one of the engine participants) gets committed.
It is guaranteed that no committed (even partially committed)
transactions exist beyond the truncate position.
In case a non-transactional replication event (being in a way a
committed transaction) exists past the computed truncate position,
the recovery ends with an error.
After a master crash and failover to a slave, the demoted-to-slave
ex-master must be ready to face and accept its own (self-generated)
events, without the normally required --replicate-same-server-id.
So the acceptance conditions are relaxed for the semisync slave
to accept its own events without that option.
While gtid_strict_mode=ON ensures no duplicate transaction can be
(re-)executed, a master_use_gtid=none slave has to be configured
with --replicate-same-server-id.
*NOTE* for reviewers:
This patch does not handle user XA, which is done in the next git
commit.
This is a complete rewrite of DROP TABLE, also as part of other DDL,
such as ALTER TABLE, CREATE TABLE...SELECT, TRUNCATE TABLE.
The background DROP TABLE queue hack is removed.
If a transaction needs to drop and create a table by the same name
(like TRUNCATE TABLE does), it must first rename the table to an
internal #sql-ib name. No committed version of the data dictionary
will include any #sql-ib tables, because whenever a transaction
renames a table to a #sql-ib name, it will also drop that table.
Either the rename will be rolled back, or the drop will be committed.
Data files will be unlinked after the transaction has been committed
and a FILE_RENAME record has been durably written. The file will
actually be deleted when the detached file handle returned by
fil_delete_tablespace() will be closed, after the latches have been
released. It is possible that a purge of the delete of the SYS_INDEXES
record for the clustered index will execute fil_delete_tablespace()
concurrently with the DDL transaction. In that case, the thread that
arrives later will wait for the other thread to finish.
HTON_TRUNCATE_REQUIRES_EXCLUSIVE_USE: A new handler flag.
ha_innobase::truncate() now requires that all other references to
the table be released in advance. This was implemented by Monty.
ha_innobase::delete_table(): If CREATE TABLE..SELECT is detected,
we will "hijack" the current transaction, drop the table in
the current transaction and commit the current transaction.
This essentially fixes MDEV-21602. There is a FIXME comment about
making the check less failure-prone.
ha_innobase::truncate(), ha_innobase::delete_table():
Implement a fast path for temporary tables. We will no longer allow
temporary tables to use the adaptive hash index.
dict_table_t::mdl_name: The original table name for the purpose of
acquiring MDL in purge, to prevent a race condition between a
DDL transaction that is dropping a table, and purge processing
undo log records of DML that had executed before the DDL operation.
For #sql-backup- tables during ALTER TABLE...ALGORITHM=COPY, the
dict_table_t::mdl_name will differ from dict_table_t::name.
dict_table_t::parse_name(): Use mdl_name instead of name.
dict_table_rename_in_cache(): Update mdl_name.
For the internal FTS_ tables of FULLTEXT INDEX, purge would
acquire MDL on the FTS_ table name, but not on the main table,
and therefore it would be able to run concurrently with a
DDL transaction that is dropping the table. Previously, the
DROP TABLE queue hack prevented a race between purge and DDL.
For now, we introduce purge_sys.stop_FTS() to prevent purge from
opening any table, while a DDL transaction that may drop FTS_
tables is in progress. The function fts_lock_table(), which will
be invoked before the dictionary is locked, will wait for
purge to release any table handles.
trx_t::drop_table_statistics(): Drop statistics for the table.
This replaces dict_stats_drop_index(). We will drop or rename
persistent statistics atomically as part of DDL transactions.
On lock conflict for dropping statistics, we will fail instantly
with DB_LOCK_WAIT_TIMEOUT, because we will be holding the
exclusive data dictionary latch.
trx_t::commit_cleanup(): Separated from trx_t::commit_in_memory().
Relax an assertion around fts_commit() and allow DB_LOCK_WAIT_TIMEOUT
in addition to DB_DUPLICATE_KEY. The call to fts_commit() is
entirely misplaced here and may obviously break the consistency
of transactions that affect FULLTEXT INDEX. It needs to be fixed
separately.
dict_table_t::n_foreign_key_checks_running: Remove (MDEV-21175).
The counter was a work-around for missing meta-data locking (MDL)
on the SQL layer, and not really needed in MariaDB.
ER_TABLE_IN_FK_CHECK: Replaced with ER_UNUSED_28.
HA_ERR_TABLE_IN_FK_CHECK: Remove.
row_ins_check_foreign_constraints(): Do not acquire
dict_sys.latch either. The SQL-layer MDL will protect us.
This was reviewed by Thirunarayanan Balathandayuthapani
and tested by Matthias Leich.
The underlying problem with MDEV-25551 turned out to be that
transactions having changes for tables with no primary key
were not safe to apply in parallel. This is due to excessive locking
on the InnoDB side, where even unrelated row modifications could end
up in a lock conflict during applying.
The fix for MDEV-25551 has disabled parallel applying for tables with
no PK.
This fix depends on a change in wsrep-lib, where a separate PR allows
the application to modify transaction flags in wsrep-lib.
This commit has also separate mtr test for verifying that transactions
modifying a table with no primary key, will not apply in parallel.
This test is a modified version of the initial test created by Gabor
Orosz, the reporter of MDEV-25551.
Another mtr test was added in the galera_sr suite, to test whether
modifying tables with no primary key would cause issues for streaming
replication use cases.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
MDEV-25604 Atomic DDL: Binlog event written upon recovery does not
have default database
The purpose of this task is to ensure that ALTER TABLE is atomic even if
the MariaDB server would be killed at any point of the alter table.
This means that either the ALTER TABLE succeeds (including that triggers,
the status tables and the binary log are updated) or things should be
reverted to their original state.
If the server crashes before the new version is fully up to date and
committed, it will revert to the original table and remove all
temporary files and tables.
If the new version is committed, crash recovery will use the new version,
and update triggers, the status tables and the binary log.
The one exception is ALTER TABLE .. RENAME .., where no changes are done
to the table definition. This one will work as RENAME and will be rolled
back unless the whole statement completed, including updating the binary
log (if enabled).
Other changes:
- Added handlerton->check_version() function to allow the ddl recovery
code to check, in case of inplace alter table, if the table in the
storage engine is of the new or old version.
- Added handler->table_version() so that an engine can report the current
version of the table. This should be changed each time the table
definition changes.
- Added ha_signal_ddl_recovery_done() and
handlerton::signal_ddl_recovery_done() to inform all handlers when
ddl recovery has been done. (Needed by InnoDB).
- Added handlerton call inplace_alter_table_committed, to signal engine
that ddl_log has been closed for the alter table query.
- Added a new handlerton flag
HTON_REQUIRES_NOTIFY_TABLEDEF_CHANGED_AFTER_COMMIT to signal when we
should call hton->notify_tabledef_changed() during
mysql_inplace_alter_table. This was required as MyRocks and InnoDB
needed the call at different times.
- Added function server_uuid_value() to be able to generate a temporary
xid when ddl recovery writes the query to the binary log. This is
needed to be able to handle crashes during ddl log recovery.
- Moved freeing of the frm definition to end of mysql_alter_table() to
remove duplicate code and have a common exit strategy.
-------
InnoDB part of atomic ALTER TABLE
(Implemented by Marko Mäkelä)
innodb_check_version(): Compare the saved dict_table_t::def_trx_id
to determine whether an ALTER TABLE operation was committed.
We must correctly recover dict_table_t::def_trx_id for this to work.
Before purge removes any trace of DB_TRX_ID from system tables, it
will make an effort to load the user table into the cache, so that
the dict_table_t::def_trx_id can be recovered.
ha_innobase::table_version(): return garbage, or the trx_id that would
be used for committing an ALTER TABLE operation.
In InnoDB, table names starting with #sql-ib will remain special:
they will be dropped on startup. This may be revisited later in
MDEV-18518 when we implement proper undo logging and rollback
for creating or dropping multiple tables in a transaction.
Table names starting with #sql will retain some special meaning:
dict_table_t::parse_name() will not consider such names for
MDL acquisition, and dict_table_rename_in_cache() will treat such
names specially when handling FOREIGN KEY constraints.
Simplify InnoDB DROP INDEX.
Prevent purge wakeup
To ensure that dict_table_t::def_trx_id will be recovered correctly
in case the server is killed before ddl_log_complete(), we will block
the purge of any history in SYS_TABLES, SYS_INDEXES, SYS_COLUMNS
between ha_innobase::commit_inplace_alter_table(commit=true)
(purge_sys.stop_SYS()) and purge_sys.resume_SYS().
The completion callback purge_sys.resume_SYS() must be between
ddl_log_complete() and MDL release.
--------
MyRocks support for atomic ALTER TABLE
(Implemented by Sergei Petrunia)
Implement these SE API functions:
- ha_rocksdb::table_version()
- hton->check_version = rocksdb_check_version
The MyRocks data dictionary now stores a table version for each table.
(Absence of the table version record is interpreted as table_version=0,
that is, no upgrade changes are needed.)
- For inplace alter table of a partitioned table, call the underlying
handlerton when checking if the table is ok. This assumes that the
partition engine commits all changes at once.
There are a few different cases to consider:
Logging of CREATE TABLE and CREATE TABLE ... LIKE
- If REPLACE is used and there was an existing table, DDL log the drop of
the table.
- If discovery of table is to be done
- DDL LOG create table
else
- DDL log create table (with engine type)
- create the table
- If table was created
- Log entry to binary log with xid
- Mark DDL log completed
Crash recovery:
- If query was in binary log do nothing and exit
- If it was a discovered table
- Delete the .frm file
- else
- Drop created table and frm file
- If table was dropped, write a DROP TABLE statement in binary log
CREATE TABLE ... SELECT required a little more work, as when one is
using statement logging the query is written to the binary log before
the commit is done.
This was fixed by adding a DROP TABLE to the binary log during crash
recovery if the ddl log entry was not closed. In this case the binary log
will contain:
CREATE TABLE xxx ... SELECT ....
DROP TABLE xxx;
Other things:
- Added debug_crash_here() functionality to Aria to be able to test
crash in create table between the creation of the .MAI and the .MAD files.
Description of how DROP DATABASE works after this patch
- Collect list of tables
- DDL log tables as they are dropped
- DDL log drop database
- Delete db.opt
- Delete data directory
- Log either DROP TABLE or DROP DATABASE to binary log
- Deactivate the ddl log entry
This is in line with how things were before (minus ddl logging), except
that we delete the db.opt file last to not lose it if DROP DATABASE
fails.
On recovery we have to ensure that all dropped tables are logged in
binary log and that they are properly dropped (as with atomic drop
table).
No new tables are to be dropped as part of recovery.
Recovery of active drop database ddl log entry:
- If drop database was logged to ddl log but was not found in the binary
log:
- drop the db.opt file and database directory.
- Log DROP DATABASE to binary log
- If drop database was not logged to ddl log
- Update binary log with DROP TABLE of the dropped tables. If table list
is longer than max_allowed_packet, then the query will be split into
multiple DROP TABLE/VIEW queries.
Other things:
- Added DDL_LOG_STATE and 'current database' as arguments to
mysql_rm_table_no_locks(). This was needed to be able to combine
ddl logging of DROP DATABASE and DROP TABLE and make the generated
DROP TABLE statements shorter.
- To make the DROP TABLE statement created by the ddl log shorter, I
changed the binlogged query to use the current database and omit the
database part for all tables in the current database.
- Merged some DROP TABLE and DROP VIEW code in the ddl logger. This was
done to be able to get separate DROP VIEW and DROP TABLE statements in
the binary log.
- Added a 'recovery_state' variable to remember the state of dropped
tables and views.
- Moved out code that drops database objects (stored procedures) from
mysql_rm_db_internal() to drop_database_objects() for better code reuse.
- Made mysql_rm_db_internal() global so that it could be used by the ddl
recovery code.
Logging logic:
- Log tables, just before they are dropped, to the ddl log
- After the last table for the statement is dropped, log an xid for the
whole ddl log event
In case of crash:
- First remove any active DROP TABLE events from the ddl log that match
xids found in the binary log (this means the drop was successful and
was properly logged).
- Loop over all active DROP TABLE events
- Ensure that the table is completely dropped
- Write a DROP TABLE entry to the binary log with the dropped tables.
Other things:
- Added code to ha_drop_table() to be able to tell the difference between
get_new_handler() failing because of out-of-memory and the engine
refusing/not being able to create a handler. This was needed
to get sequences to work, as sequences need a share object to be passed
to get_new_handler().
- TC_LOG_BINLOG::recover() was changed to always collect Xid's from the
binary log and always call ddl_log_close_binlogged_events(). This was
needed to be able to collect DROP TABLE events with embedded Xid's
(used by ddl log).
- Added a new variable "$grep_script" to binlog filter to be able to find
only rows that matches a regexp.
- Had to adjust some tests that changed because drop statements are a bit
larger in the binary log than before (as we have to store the xid).
Other things:
- MDEV-25588 Atomic DDL: Binlog query event written upon recovery is corrupt
fixed (in the original commit).
- Major rewrite of ddl_log.cc and ddl_log.h
- ddl_log.cc describes at the beginning how the recovery works.
- ddl_log.log has unique signature and is dynamic. It's easy to
add more information to the header and other ddl blocks while still
being able to execute old ddl entries.
- IO_SIZE for ddl blocks is now dynamic. Can be changed without affecting
recovery of old logs.
- Code is more modular and is now usable outside of partition handling.
- Renamed the log file to ddl_recovery.log and added the option
--log-ddl-recovery to allow one to specify the path & filename.
- Added ddl_log_entry_phase[], the number of phases for each DDL action,
which allowed me to greatly simplify set_global_from_ddl_log_entry().
- Changed how strings are stored in log entries, which allows us to
store much more information in a log entry.
- The ddl log is now always created at start and deleted on normal
shutdown. This simplifies things notably.
- Added probes debug_crash_here() and debug_simulate_error() to simplify
crash testing and allow a crash after a given number of times a probe
is executed. See the comments in debug_sync.cc and rename_table.test for
how this can be used.
- Reverting failed table and view renames is done through the ddl log.
This ensures that the ddl log is tested also outside of recovery.
- Added helper function 'handler::needs_lower_case_filenames()'
- Extend the binary log with Q_XID events. The ddl log handling is using
this to check if a ddl log entry was logged to the binary log (if yes,
it will be deleted from the log during ddl_log_close_binlogged_events()).
- If a DDL entry fails 3 times, disable it. This is to ensure that if
we have a crash in the ddl recovery code the server will not get stuck
in a forever crash-restart-crash loop.
mysqltest.cc changes:
- --die will now replace $variables with their values
- $error will contain the error of the last failed statement
storage engine changes:
- maria_rename() was changed to be more robust against crashes during
rename.
This change removed 68 explicit strlen() calls from the code.
The following renames were done to ensure we don't use the old names
when merging code from earlier releases, as using the new variables
for the print functions could result in crashes:
- charset->csname renamed to charset->cs_name
- charset->name renamed to charset->coll_name
Almost everything was a mechanical change, except:
- Changed to use the new Protocol::store(LEX_CSTRING..) when possible
- Changed to use field->store(LEX_CSTRING*, CHARSET_INFO*) when possible
- Changed to use String->append(LEX_CSTRING&) when possible
Other things:
- There were compiler issues with ensuring that all character set names
point to the same string: gcc doesn't allow one to use integer constants
when defining global structures (constant char * pointers work fine).
To get around this, I declared defines for each character set name
length.
Changes:
- To detect automatic strlen() I removed the methods in String that
use 'const char *' without a length:
- String::append(const char*)
- Binary_string(const char *str)
- String(const char *str, CHARSET_INFO *cs)
- append_for_single_quote(const char *)
All usage of append(const char*) is changed to either use
String::append(char), String::append(const char*, size_t length) or
String::append(LEX_CSTRING)
- Added STRING_WITH_LEN() around constant string arguments to
String::append()
- Added overflow argument to escape_string_for_mysql() and
escape_quotes_for_mysql() instead of returning (size_t) -1 on overflow.
This was needed as most usage of the above functions never tested the
result for -1 and would have given wrong results or crashes in case
of overflows.
- Added Item_func_or_sum::func_name_cstring(), which returns LEX_CSTRING.
Changed all Item_func::func_name()'s to func_name_cstring()'s.
The old Item_func_or_sum::func_name() is now an inline function that
returns func_name_cstring().str.
- Changed Item::mode_name() and Item::func_name_ext() to return
LEX_CSTRING.
- Changed for some functions the name argument from const char * to
to const LEX_CSTRING &:
- Item::Item_func_fix_attributes()
- Item::check_type_...()
- Type_std_attributes::agg_item_collations()
- Type_std_attributes::agg_item_set_converter()
- Type_std_attributes::agg_arg_charsets...()
- Type_handler_hybrid_field_type::aggregate_for_result()
- Type_handler_geometry::check_type_geom_or_binary()
- Type_handler::Item_func_or_sum_illegal_param()
- Predicant_to_list_comparator::add_value_skip_null()
- Predicant_to_list_comparator::add_value()
- cmp_item_row::prepare_comparators()
- cmp_item_row::aggregate_row_elements_for_comparison()
- Cursor_ref::print_func()
- Removed String_space() as it was only used in one case and that case
could be simplified to not use String_space(), thanks to the fixed
my_vsnprintf().
- Added some const LEX_CSTRING's for common strings:
- NULL_clex_str, DATA_clex_str, INDEX_clex_str.
- Changed primary_key_name to a LEX_CSTRING
- Renamed String::set_quick() to String::set_buffer_if_not_allocated() to
clarify what the function really does.
- Rename of protocol function:
bool store(const char *from, CHARSET_INFO *cs) to
bool store_string_or_null(const char *from, CHARSET_INFO *cs).
This was done both to clarify the difference between this 'store'
function and the others, and to make it easier to find suboptimal usage
of store() calls.
- Added Protocol::store(const LEX_CSTRING*, CHARSET_INFO*)
- Changed some 'const char*' arrays to instead be of type LEX_CSTRING.
- class Item_func_units now uses LEX_CSTRING for the name.
Other things:
- Fixed a bug in mysql.cc:construct_prompt() where a wrong escape character
in the prompt would cause some part of the prompt to be duplicated.
- Fixed a lot of instances where the length of the argument to
append is known or easily obtained but was not used.
- Removed some unneeded 'virtual' qualifiers for functions that were
inherited from the parent; added 'override' to these.
- Fixed Ordered_key::print() to preallocate the needed buffer. The old
code could cause memory overruns.
- Simplified some loops when adding char * to a String with delimiters.
Starting with MariaDB 10.5, roughly after MDEV-23855 was fixed,
we are observing sporadic hangs during the execution of the
RESET MASTER statement. We are hoping to fix the hangs with these
changes, but due to the rather infrequent occurrence of the hangs
and our inability to reliably reproduce the hangs, we cannot be
sure of this.
What we do know is that innodb_force_recovery=2 (or a larger setting)
will prevent srv_master_callback (the former srv_master_thread) from
running. In that mode, periodic log flushes would never occur and
RESET MASTER could hang indefinitely. That is demonstrated by the new
test case that was developed by Andrei Elkin. We fix this case by
implementing a special case for it.
This also includes some code cleanup and renames of misleadingly
named code. The interface has nothing to do with log checkpoints in
the storage engine; it is only about requesting log writes to be
persistent.
handlerton::commit_checkpoint_request,
commit_checkpoint_notify_ha(): Remove the unused parameter hton.
log_requests.start: Replaces pending_checkpoint_list.
log_requests.end: Replaces pending_checkpoint_list_end.
log_requests.mutex: Replaces pending_checkpoint_mutex.
log_flush_notify_and_unlock(), log_flush_notify(): Replaces
innobase_mysql_log_notify(). The new implementation should be
functionally equivalent to the old one.
innodb_log_flush_request(): Replaces innobase_checkpoint_request().
Implement a fast path for common cases, and reduce the mutex hold time.
POSSIBLE FIX OF THE HANG: We will invoke commit_checkpoint_notify_ha()
for the current request if it is already satisfied, as well as invoke
log_flush_notify_and_unlock() for any satisfied requests.
log_write(): Invoke log_flush_notify() when the write is already durable.
This was missing WITH_PMEM when the log is in persistent memory.
Reviewed by: Vladislav Vaintroub
This feature adds the functionality of ignorability for indexes.
Indexes are not ignored by default.
To control index ignorability explicitly for a new index,
use IGNORE or NOT IGNORE as part of the index definition for
CREATE TABLE, CREATE INDEX, or ALTER TABLE.
Primary keys (explicit or implicit) cannot be made ignorable.
The table INFORMATION_SCHEMA.STATISTICS gets a new column named IGNORED
that stores whether an index is ignored or not.
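Examples, following the keyword spelling and placement described above
(placement after the key parts is an assumption of this sketch):
CREATE TABLE t1 (a INT, b INT, KEY k1 (b) IGNORE);
CREATE INDEX k2 ON t1 (a) NOT IGNORE;
ALTER TABLE t1 ADD KEY k3 (a, b) IGNORE;
SELECT index_name, ignored
FROM information_schema.statistics
WHERE table_schema = DATABASE() AND table_name = 't1';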
* be strict in CREATE TABLE, just like in ALTER TABLE, because
CREATE TABLE, just like ALTER TABLE, can be rolled back for any engine
* but don't auto-convert warnings into errors for engine warnings
(handler::create) - this matches ALTER TABLE behavior
* and don't convert them when creating a default record either; these
errors are handled specially (and replaced with ER_INVALID_DEFAULT)
* always issue a Note when a non-unique key is truncated, because it's
not a Warning that can be converted to an Error. Before this commit
it was a Note for blobs and a Warning for all other data types.
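A sketch of the Note case (engine, charset and length chosen only to
force key truncation):
SET sql_mode= 'STRICT_ALL_TABLES';
CREATE TABLE t1 (a VARCHAR(2000), KEY k1 (a))
ENGINE=MyISAM CHARACTER SET utf8mb4;
SHOW WARNINGS;   -- a Note that the non-unique key k1 was truncated,
                 -- not escalated to an error despite strict mode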
Server part:
kill_handlerton() was accessing thd->ha_data[] for some other thd,
while it could be concurrently modified by its owner thd.
Protect thd->ha_data[] modifications with a mutex.
Require this mutex when accessing thd->ha_data[] from kill_handlerton().
InnoDB part:
On close_connection, detach the trx from the thd before freeing the trx.
failed in Diagnostics_area::set_ok_status on INSERT
Analysis: An error is not returned when strict mode is enabled and the
value is truncated because the double is outside the range.
Fix: Return HA_ERR_AUTOINC_ERANGE if the error was reported when the
double is outside the range.
..causes error on slave.
Cause: if the master doesn't have the frm file for the table,
DROP TABLE code will call ha_delete_table_force() to drop the table
in all available storage engines.
The issue was that this code path didn't check the
HTON_TABLE_MAY_NOT_EXIST_ON_SLAVE flag of the storage engine,
and so did not add "... IF EXISTS" to the statement that's written
to the binary log. This can cause an error on the slave when it tries
to drop a table that's already gone.
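Schematically, for a table whose engine may not exist on the slave:
DROP TABLE t1;               -- executed on the master (no .frm, force-dropped)
DROP TABLE IF EXISTS t1;     -- what is now written to the binary log,
                             -- so the slave does not fail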
After Sergei's cleanup this assertion is no longer applicable: we can't
predict whether the handler was used for a lookup, especially in a
multi-update scenario.
`position(old_data)` is done earlier in `ha_check_overlaps`, therefore
it is guaranteed that we compare the right refs.
The problem here was that ha_check_overlaps internally uses
ha_index_read, which on failure overwrites table->status. Even though
the handlers are different, they share a common table, so the value
gets spoiled anyway. This is bad, and table->status is badly designed
and overloaded with functionality, but nothing can be done about it,
since the code related to this logic is ancient and it's impossible to
extract it with reasonable effort.
So let's just save and restore the value in ha_update_row before and
after the checks.
Other operations like INSERT and simple UPDATE are not at risk, since
they don't use this table->status approach.
DELETE does not do any unique checks, so it's also safe.
Change xarecover_handlerton so that transactions with WSREP-prefixed
xids are rolled back when Galera is disabled.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
This commit fixed the problems with S3 after the "DROP TABLE FORCE" changes.
It also fixes all failing replication S3 tests.
A slave is delayed if it is trying to execute replicated queries on a
table that is already converted to S3 by the master later in the binlog.
Fixes for replication events on S3 tables for delayed slaves:
- INSERT and INSERT ... SELECT and CREATE TABLE are ignored but written
to the binary log. UPDATE & DELETE will be fixed in a future commit.
Other things:
- On slaves with --s3-slave-ignore-updates set, allow S3 tables to be
opened in read-write mode. This was done to be able to
ignore-but-replicate queries like insert. Without this change any
open of an S3 table failed with 'Table is read only' which is too
early to be able to replicate the original query.
- Errors are now printed if handler::extra() call fails in
wait_while_tables_are_used().
- The error for row changes is changed from HA_ERR_WRONG_COMMAND
to HA_ERR_TABLE_READONLY.
- Disabled some maria_extra() calls for S3 tables, as these could cause
S3 tables to fail in some cases.
- Added missing thr_lock_delete() to ma_open() in case of failure.
- Removed the unneeded argument 'table' from mysql_prepare_insert().
- Remove row_start/row_end from keys in fix_create_like();
- Disable manual adding of implicit row_start/row_end to indexes on
CREATE TABLE. INVISIBLE_SYSTEM fields are not operable by the user;
- Fix a memory leak on allocation of Key_part_spec.
- row_search_mvcc() should return DB_INTERRUPTED when it got killed.
- Add a syncpoint for the ICP check.
- Add test coverage for killed-during-ICP-check scenario
Backport of MDEV-22761 fixes for ICP from 10.4 commits:
* a6f956488c
* c03885cd9c
XtraDB was fixed in deb3b9a174
Reviewer: Daniel Black
Part #2:
- row_search_mvcc() should return DB_INTERRUPTED when it got killed
- Move the sync point from innodb internals to
handler_rowid_filter_check() where other storage engines can use
it too
- Add a similar syncpoint for the ICP check.
- Add a bigger test and test coverage for Rowid Filter with MyISAM
- Add test coverage for killed-during-ICP-check scenario
MDEV-21953 deadlock between BACKUP STAGE BLOCK_COMMIT and parallel
replication
Fixed by partly reverting MDEV-21953 to put back MDL_BACKUP_COMMIT locking
before log_and_order.
The original problem for MDEV-21953 was that while a thread was waiting
for other threads to commit in 'log_and_order', it held the
MDL_BACKUP_COMMIT lock. The backup thread was waiting to get the
MDL_BACKUP_WAIT_COMMIT lock, which blocks all new MDL_BACKUP_COMMIT locks.
This causes a deadlock as the waited-for thread can never get past the
MDL_BACKUP_COMMIT lock in ha_commit_trans.
The main part of the bug fix is to release the MDL_BACKUP_COMMIT lock while
a thread is waiting for other 'previous' threads to commit. This ensures
that no transactional thread keeps MDL_BACKUP_COMMIT while waiting, which
ensures that there are no deadlocks anymore.
failed or late ER_PERIOD_FIELD_WRONG_ATTRIBUTES upon attempt to create
existing table
Analysis: The error state is not stored when the field is checked in
Table_period_info::check_field().
Fix: Store the error state by setting res to true.
This happened when using XA transactions. I also added some extra asserts
to ensure that m_transactions are properly cleared.
Other things:
- Removed set_time() from THD::init_for_queries() as dispatch_command()
is already doing that.
- Removed duplicate init_for_queries() from prepare_new_connection_state().
The init_for_queries() functions should only be called once per
connection.
The issue was:
T1, a parallel slave worker thread, is waiting for another worker thread to
commit. While waiting, it has the MDL_BACKUP_COMMIT lock.
T2, working for mariabackup, is doing BACKUP STAGE BLOCK_COMMIT and blocks
all commits.
This causes a deadlock, as the thread that T1 is waiting for can't commit.
Fixed by moving locking of MDL_BACKUP_COMMIT from ha_commit_trans() to
commit_one_phase_2()
Other things:
- Added a new argument to ha_commit_one_phase() to signal if the
transaction was a write transaction.
- Ensured that ha_maria::implicit_commit() is always called under
MDL_BACKUP_COMMIT. This code is not needed in 10.5
- Ensure that MDL_Request values 'type' and 'ticket' are always
initialized. This makes it easier to check the state of the MDL_Request.
- Moved thd->store_globals() earlier in handle_rpl_parallel_thread() as
thd->init_for_queries() could use an MDL that could crash if
store_globals were not called.
- Don't call ha_enable_transactions() in THD::init_for_queries() as this
is both slow (uses MDL locks) and not needed.
First try the engines that support discovery, then the rest.
Otherwise every DROP TABLE non_existent will do lots of I/O trying to
remove .MYI/.MYD/.MAI/.MAD/.CSV/etc. files.
This matches the old behavior where DROP TABLE always tried to discover
the table before dropping.
Don't do table discovery on DROP. DROP falls back to the "force"
approach when a table isn't found and will try to drop it in all
engines anyway. That is, trying to discover the table in all engines
before the drop is redundant and may be expensive.
First step in moving DROP TABLE out of the handler.
TODO: other methods that don't need an open table.
For now hton->drop_table is optional, for backward compatibility
reasons.
- Rewrote bool Query_compressed_log_event::write() to make it more readable
(no logic changes).
- Changed DBUG_PRINT of 'is_error:' to 'is_error():' to make it easier to
find error: in traces.
- Ensure that 'db' is never null in Query_log_event (Simplified code).
The used code is largely based on code from Tencent
The problem is that in some rare cases there may be a conflict between .frm
files and the files in the storage engine. In this case the DROP TABLE
was not able to properly drop the table.
Some MariaDB/MySQL forks have solved this by adding a FORCE option to
DROP TABLE. After some discussion among MariaDB developers, we concluded
that users expect DROP TABLE to always work, even if the table is not
consistent. There should not be a need to use a separate keyword to
ensure that the table is really deleted.
The used solution is:
- If a .frm file doesn't exist, try dropping the table from all storage
engines.
- If the .frm file exists but the table does not exist in the engine,
try dropping the table from all storage engines.
- Update storage engines using many table files (CSV, MyISAM, Aria) to
succeed with the drop even if some of the files are missing.
- Add HTON_AUTOMATIC_DELETE_TABLE to handlertons where delete_table()
is not needed and always succeeds. This is used by ha_delete_table_force()
to know which handlers to ignore when trying to drop a table without
a .frm file.
The disadvantage of this solution is that a DROP TABLE on a non existing
table will be a bit slower as we have to ask all active storage engines
if they know anything about the table.
Other things:
- Added a new flag MY_IGNORE_ENOENT to my_delete() to not give an error
if the file doesn't exist. This simplifies some of the code.
- Don't clear thd->error in ha_delete_table() if there was an active
error. This is a bug fix.
- handler::delete_table() will not abort if the first file doesn't exist.
This is a bug fix to handle the case when a drop table was aborted in
the middle.
- Cleaned up mysql_rm_table_no_locks() to ensure that if_exists uses
same code path as when it's not used.
- Use non_existing_table_error() to detect if the table didn't exist.
The old code used different error tests in different positions.
- Table_triggers_list::drop_all_triggers() now drops the trigger file if
it can't be parsed, instead of leaving it hanging around (bug fix).
- InnoDB no longer prints an error about the .frm file being out of sync
with the InnoDB dictionary if the .frm file does not exist. This change
was required to be able to try to drop an InnoDB table when the .frm
doesn't exist.
- Fixed a bug in mi_delete_table() where the .MYD file would not be
dropped if the .MYI file didn't exist.
- Fixed a memory leak in Mroonga when deleting a non-existing table.
- Fixed a memory leak in Connect when deleting a non-existing table.
Fixed bugs that were introduced by the original version of this commit:
MDEV-22826 Presence of Spider prevents tables from being force-deleted from
other engines
Apply this patch from Percona Server (amended for 10.5):
commit cd7201514fee78aaf7d3eb2b28d2573c76f53b84
Author: Laurynas Biveinis <laurynas.biveinis@gmail.com>
Date: Tue Nov 14 06:34:19 2017 +0200
Fix bug 1704195 / 87065 / TDB-83 (Stop ANALYZE TABLE from flushing table definition cache)
Make ANALYZE TABLE stop flushing affected tables from the table
definition cache, which has the effect of not blocking any subsequent
new queries involving the table if there's a parallel long-running
query:
- new table flag HA_ONLINE_ANALYZE, return it for InnoDB and TokuDB
tables;
- in mysql_admin_table, if we are performing ANALYZE TABLE, and the
table flag is set, do not remove the table from the table
definition cache, do not invalidate query cache;
- in partitioning handler, refresh the query optimizer statistics
after ANALYZE if the underlying handler supports HA_ONLINE_ANALYZE;
- new testcases main.percona_nonflushing_analyze_debug,
parts.percona_nonflushing_analyze_debug and a supporting debug sync
point.
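As an illustration of the decision in mysql_admin_table() (the types and the
flag bit below are stand-ins, only the HA_ONLINE_ANALYZE name comes from the
patch), the rule boils down to a single predicate:

  typedef unsigned long long Table_flags;
  static const Table_flags HA_ONLINE_ANALYZE= 1ULL << 40;   // stand-in bit

  struct fake_handler { Table_flags table_flags; };

  // true  -> keep the table in the table definition cache and do not
  //          invalidate the query cache after ANALYZE
  // false -> old behaviour: flush and invalidate
  static bool analyze_keeps_tdc_entry(const fake_handler *file,
                                      bool is_analyze_operation)
  {
    return is_analyze_operation &&
           (file->table_flags & HA_ONLINE_ANALYZE) != 0;
  }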
For TokuDB, this change exposes bug TDB-83 (Index cardinality stats
updated for handler::info(HA_STATUS_CONST), not often enough for
tokudb_cardinality_scale_percent). TokuDB may return different
rec_per_key values depending on dynamic variable
tokudb_cardinality_scale_percent value. The server does not have a way
of knowing that changing this variable invalidates the previous
rec_per_key values in any opened table shares, and so does not call
info(HA_STATUS_CONST) again. Fix by updating rec_per_key for both
HA_STATUS_CONST and HA_STATUS_VARIABLE. This also forces a re-record
of tokudb.bugs.db756_card_part_hash_1_pick, with the new output
seeming to be more correct.
reduce the amount of engine-specific code in the server,
particularly as it does not serve any purpose now.
may be needed for VP engine,
to be reconsidered in MDEV-7795
MDEV-20578 Got error 126 when executing undo undo_key_delete
upon Aria crash recovery
The crash happens in this scenario:
- Table with unique keys and non unique keys
- Batch insert (LOAD DATA or INSERT ... SELECT) with REPLACE
- Some insert succeeds followed by duplicate key error
In the above scenario the table gets corrupted.
The bug was that we don't generate any undo entry for the
failed insert as the whole insert can be ignored by undo.
The code did, however, not take into account that when bulk
insert is used, cached keys are written to the file on
failure, and undo would wrongly ignore these.
Fixed by moving the writing of the cache keys after we write
the aborted-insert event to the log.
MDEV-22617 Galera node crashes when trying to log to slow_log table in
streaming replication mode
Other things:
- Changed the name of wsrep_after_row() (two arguments) to
wsrep_after_row_internal() (one argument) to not depend on a
function signature with unused arguments.
MDEV-22531 Remove maria::implicit_commit()
MDEV-22607 Assertion `ha_info->ht() != binlog_hton' failed in
MYSQL_BIN_LOG::unlog_xa_prepare
From the handler point of view, Aria now looks like a transactional
engine. One effect of this is that we don't need to call
maria::implicit_commit() anymore.
This change also forces the server to call trans_commit_stmt() after doing
any reads or writes to system tables. This work will also make it easier
to later allow users to have system tables in engines other than Aria.
To handle the case that Aria doesn't support rollback, a new
handlerton flag, HTON_NO_ROLLBACK, was added to engines that have
transactions without rollback (for the moment only binlog and Aria).
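A minimal sketch (stand-in declarations) of what the HTON_NO_ROLLBACK change
amounts to: the engine sets the flag once, and the old comparison against
binlog_hton becomes a capability test:

  struct fake_handlerton { unsigned int flags; };
  static const unsigned int HTON_NO_ROLLBACK= 1U << 0;       // stand-in bit

  static void register_no_rollback_engine(fake_handlerton *hton)
  {
    hton->flags|= HTON_NO_ROLLBACK;    // e.g. Aria and the binlog "engine"
  }

  // old test:  ht == binlog_hton
  // new test:  the capability flag, which also covers Aria
  static bool cannot_rollback(const fake_handlerton *ht)
  {
    return (ht->flags & HTON_NO_ROLLBACK) != 0;
  }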
Other things
- Moved freeing of MARIA_SHARE to a separate function, as the MARIA_SHARE
can still be part of a transaction even if the table has been closed.
- Changed Aria checkpoint to use the new MARIA_SHARE free function. This
fixes a possible memory leak when using S3 tables
- Changed testing of binlog_hton to instead test for HTON_NO_ROLLBACK
- Removed checking of has_transaction_manager() in handler.cc, as we can
assume that, since the transaction was started by the engine, it does
support transactions.
- Added a new class 'start_new_trans' that can be used to start independent
sub-transactions, for example while reading mysql.proc, using help or
status tables etc.
- open_system_tables...() and open_proc_table_for_read() no longer take
an Open_tables_backup list. This is now handled by 'start_new_trans'.
- Split thd::has_transactions() to thd::has_transactions() and
thd::has_transactions_and_rollback()
- Added handlerton code to free cached transactions objects.
Needed by InnoDB.
All changes (except one) are of the type
thd->transaction. -> thd->transaction->
thd->transaction points by default to 'thd->default_transaction'.
This allows us to 'easily' have multiple active transactions for a
THD object, like when reading data from the mysql.proc table.
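A simplified model of this indirection (all names below are stand-ins, in the
spirit of the 'start_new_trans' class from the previous commit): THD holds a
default transaction object and a pointer that normally refers to it, so an
independent transaction can be swapped in temporarily:

  struct fake_transaction_state { /* stmt + all state, simplified */ };

  struct fake_THD
  {
    fake_transaction_state default_transaction;
    fake_transaction_state *transaction= &default_transaction;
  };

  // RAII helper in the spirit of 'start_new_trans'
  class scoped_new_trans
  {
    fake_THD *thd;
    fake_transaction_state *saved;
    fake_transaction_state independent;
  public:
    scoped_new_trans(fake_THD *thd_arg)
      : thd(thd_arg), saved(thd_arg->transaction)
    {
      thd->transaction= &independent;  // all thd->transaction-> accesses now
    }                                  // hit the independent transaction
    ~scoped_new_trans() { thd->transaction= saved; }
  };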
MDEV-22468 BACKUP STAGE BLOCK_COMMIT should block commit in the Aria engine
This is needed to ensure that mariabackup works properly with Aria tables
This code adds new calls to ha_maria::implicit_commit(). These will be
deleted by MDEV-22531 (Remove maria::implicit_commit()).
Rowid Filter check is just like Index Condition Pushdown check: before
we check the filter, we must check if we have walked out of the range
we are scanning. (If we did, we should return, and not continue the scan).
Consequences of this:
- Rowid filtering doesn't work for keys that have partially-covered
blob columns (just like Index Condition Pushdown)
- The rowid filter function has three return values: CHECK_POS (passed),
CHECK_NEG (filtered out), and CHECK_OUT_OF_RANGE.
All of the above is implemented in this patch
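An illustrative sketch of the three-way result (only the CHECK_* names come
from the patch; the Cursor interface is hypothetical): the range test is done
before the filter test, and each outcome is handled differently in the scan
loop:

  enum check_result { CHECK_POS, CHECK_NEG, CHECK_OUT_OF_RANGE };

  // Hypothetical per-row check: first verify we are still inside the
  // scanned range, only then consult the rowid filter.
  template <class Cursor>
  check_result check_rowid_filter(Cursor &cur)
  {
    if (cur.key_past_end_of_range())   // same rule as Index Condition
      return CHECK_OUT_OF_RANGE;       // Pushdown: stop the scan
    return cur.rowid_matches_filter() ? CHECK_POS : CHECK_NEG;
  }

  template <class Cursor>
  void scan_with_rowid_filter(Cursor &cur)
  {
    while (cur.fetch_next_key())
    {
      switch (check_rowid_filter(cur))
      {
      case CHECK_OUT_OF_RANGE: return;       // walked out of the range
      case CHECK_NEG:          continue;     // filtered out, skip the row
      case CHECK_POS:          cur.return_row(); break;
      }
    }
  }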
MDEV-22088 S3 partitioning support
All ALTER PARTITION commands should now work on S3 tables except
REBUILD PARTITION
TRUNCATE PARTITION
REORGANIZE PARTITION
In addition, partitioned S3 tables can also be replicated.
This is achieved by storing the partitioned table's .frm and .par files
on S3 for partitioned shared (S3) tables.
The discovery methods are enhanced by allowing engines that support
discovery to also discover the partitioned table's .frm and .par files.
Things in more detail
- The .frm and .par files of partitioned tables are stored in S3 and kept
in sync.
- Added the hton callback create_partitioning_metadata to inform the handler
that metadata for a partitioned file has changed.
- Added back handler::discover_check_version() to be able to check if
a table's or a part table's definition has changed.
- Added handler::check_if_updates_are_ignored(). Needed for partitioning.
- Renamed rebind() -> rebind_psi(), as it was before.
- Changed the CHF_xxx handler flags to an enum.
- Changed some checks from using table->file->ht to use
table->file->partition_ht() to get discovery to work with partitioning.
- If TABLE_SHARE::init_from_binary_frm_image() fails, ensure that we
don't leave any .frm or .par files around.
- Fixed that writefrm() doesn't leave unusable .frm files around
- Appended the extension to the path for writefrm() to be able to reuse the
function for creating .par files.
- Added DBUG_PUSH("") to a few functions that caused a lot of non-critical
tracing.
If a transaction had no effect due to INSERT IGNORE and a new
transaction was started with START TRANSACTION without committing
the previous one, the server crashed on an assertion when starting
a new wsrep transaction.
As a fix, the condition for doing wsrep_commit_empty() at the end of
ha_commit_trans() was refined.
In main.index_merge_myisam we remove the test that was added in
commit a2d24def8c because
it duplicates the test case that was added in
commit 5af12e4635.
`inited == NONE` at initialization time does not always mean
that it'll be `NONE` later, at execution time. Use more complex,
caller-specific logic to decide whether to create a cloned lookup handler.
Besides LOAD (as in the original bug report) make sure that all
prepare_for_insert() invocations are covered by tests. Add tests for
CREATE ... SELECT, multi-UPDATE, and multi-DELETE.
Don't enable write cache with long uniques.
was restored.
Optionally roll back prepared XAs on "mariabackup --prepare".
The fix MUST NOT be ported to 10.5+, as the MDEV-742 fix solves the issue
for slaves.
That is, don't call alloc_lookup_buffer() and create_lookup_handler()
for every row.
Also, don't call ha_check_overlaps() for every partition
after it was already done on the ha_partition level.
* The overlaps check is implemented on a handler level per row command.
It creates a separate cursor (actually, another handler instance) and
caches it inside the original handler, when ha_update_row or
ha_insert_row is issued. Cursor closes on unlocking the handler.
* Having the same key in the index means a unique constraint violation
even in the usual sense. So we fetch the left and right neighbours and check
that they have the same key prefix, excluding only the period part from the
key. If it doesn't match, then there's no such neighbour and the check passes.
Otherwise, we check if this neighbour intersects with the considered key.
* The check does not introduce a new error and fails with the ER_DUPP_KEY
error. This might break the REPLACE workflow and should be fixed separately.
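A simplified, self-contained model of the neighbour test described above (the
key layout is a stand-in: one scalar key part plus a half-open [start, end)
period):

  #include <cstdint>

  struct period_key
  {
    int64_t key_part;                   // everything except the period
    int64_t period_start, period_end;   // half-open [start, end)
  };

  static bool same_prefix(const period_key &a, const period_key &b)
  {
    return a.key_part == b.key_part;    // compare the key without the period
  }

  static bool periods_intersect(const period_key &a, const period_key &b)
  {
    return a.period_start < b.period_end && b.period_start < a.period_end;
  }

  // true -> the real check would raise the duplicate-key error
  static bool overlaps_with_neighbour(const period_key &new_key,
                                      const period_key *neighbour)
  {
    if (!neighbour || !same_prefix(new_key, *neighbour))
      return false;                     // no such neighbour: check passes
    return periods_intersect(new_key, *neighbour);
  }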
* rename to a generic name
* move remaining initializations from query exec to prepare time
* simplify/unify key handling in open_table_from_share and delayed
* remove dead code
* move tests where they belong
- multi_range_read_info_const now uses the new records_in_range interface
- Added handler::avg_io_cost()
- Don't calculate avg_io_cost() in get_sweep_read_cost if avg_io_cost is
not 1.0. In this case we trust the avg_io_cost() from the handler.
- Changed test_quick_select to use TIME_FOR_COMPARE instead of
TIME_FOR_COMPARE_IDX to align this with the rest of the code.
- Fixed bug when using test_if_cheaper_ordering where we didn't use
keyread if index was changed
- Fixed a bug where we didn't use index only read when using order-by-index
- Added keyread_time() to HEAP.
The default keyread_time() was optimized for blocks and not suitable for
HEAP. The effect was that HEAP preferred table scans over ranges for btree
indexes.
- Fixed get_sweep_read_cost() for HEAP tables
- Ensure that range and ref have same cost for simple ranges
Added a small cost (MULTI_RANGE_READ_SETUP_COST) to ranges to ensure
we favour ref over range for simple queries.
- Fixed that matching_candidates_in_table() uses same number of records
as the rest of the optimizer
- Added avg_io_cost() to JT_EQ_REF cost. This helps calculate the cost for
HEAP and temporary tables better. A few tests changed because of this.
- heap::read_time() and heap::keyread_time() adjusted to not add +1.
This was to ensure that handler::keyread_time() doesn't give
higher cost for heap tables than for normal tables. One effect of
this is that heap and derived tables stored in heap will prefer
key access as this is now regarded as cheap.
- Changed cost for index read in sql_select.cc to match
multi_range_read_info_const(). All index cost calculation is now
done through one function.
- 'ref' will now use quick_cost for keys if it exists. This is done
so that for '=' ranges, 'ref' is preferred over 'range'.
- scan_time() now takes avg_io_cost() into account
- get_delayed_table_estimates() uses block_size and avg_io_cost()
- Removed default argument to test_if_order_by_key(); simplifies code
This was done to both simplify the code and also to be easier to handle
storage engines that are clustered on some other index than the primary
key.
As pk_is_clustering_key() and is_clustering_key() now use only
index_flags, these were removed from all storage engines.
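As a toy illustration of the ref-versus-range tie-breaking (the constant's
value and the helper below are made up for this sketch): both access methods
get essentially the same per-row cost, and the small setup cost added to
ranges makes ref win for simple '=' lookups:

  static const double MULTI_RANGE_READ_SETUP_COST= 0.001;   // illustrative

  struct access_costs { double ref_cost, range_cost; };

  static access_costs simple_eq_costs(double rows, double cost_per_row)
  {
    access_costs c;
    c.ref_cost=   rows * cost_per_row;
    c.range_cost= rows * cost_per_row + MULTI_RANGE_READ_SETUP_COST;
    return c;                           // ref wins the tie for '=' ranges
  }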
MDEV-21605 Clean up and speed up interfaces for binary row logging
MDEV-21617 Bug fix for previous version of this code
The intention is to have as few 'if's as possible in ha_write() and
related functions. This is done by pre-calculating once per statement the
row_logging state for all tables.
Benefits are simpler and faster code both when binary logging is disabled
and when it's enabled.
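The shape of the change, as a hedged sketch with stand-in types: the decision
is made once per statement and cached in the handler, so the per-row path
only tests a boolean:

  struct fake_handler
  {
    bool is_temporary_or_system= false;
    bool row_logging= false;           // cached per-statement decision
  };

  // Done once per statement for every table it touches
  // (cf. THD::binlog_prepare_for_row_logging()).
  static void prepare_for_row_logging(fake_handler *file,
                                      bool binlog_open, bool row_format)
  {
    file->row_logging= binlog_open && row_format &&
                       !file->is_temporary_or_system;
  }

  // Per-row fast path in ha_write_row() and friends: one cheap test
  // instead of a chain of 'if's.
  static bool must_binlog_row(const fake_handler *file)
  {
    return file->row_logging;
  }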
Changes:
- Added handler->row_logging to make it easy to check if a table should be
row logged. This also made it easier to disable row logging for system,
internal and temporary tables.
- The tables' row_logging capabilities are checked once per "statement
that updates tables" in THD::binlog_prepare_for_row_logging(), which
is called when needed from THD::decide_logging_format().
- Removed most usage of tmp_disable_binlog(), reenable_binlog() and
temporary saving and setting of thd->variables.option_bits.
- Moved checks that can't change during a statement from
check_table_binlog_row_based() to check_table_binlog_row_based_internal()
- Removed flag row_already_logged (used by sequence engine)
- Moved binlog_log_row() into the handler class.
- Moved write_locked_table_maps() to THD::binlog_write_table_maps() as
most other related binlog functions are in THD.
- Removed binlog_write_table_map() and binlog_log_row_internal() as
they are now obsolete as 'has_transactions()' is pre-calculated in
prepare_for_row_logging().
- Remove 'is_transactional' argument from binlog_write_table_map() as this
can now be read from handler.
- Changed order of 'if's in handler::external_lock() and wsrep_mysqld.h
to first evaluate fast and likely cases before more complex ones.
- Added error checking in ha_write_row() and related functions if
binlog_log_row() failed.
- Don't clear check_table_binlog_row_based_result in
clear_cached_table_binlog_row_based_flag() as it's not needed.
- THD::clear_binlog_table_maps() has been replaced with
THD::reset_binlog_for_next_statement()
- Added 'MYSQL_OPEN_IGNORE_LOGGING_FORMAT' flag to open_and_lock_tables()
to avoid calculating of binary log format for internal opens. This flag
is also used to avoid reading statistics tables for internal tables.
- Added OPTION_BINLOG_LOG_OFF as a simple way to turn off the binlog
temporarily for CREATE (instead of using THD::sql_log_bin_off).
- Removed the flag THD::sql_log_bin_off (not needed anymore).
- Speed up THD::decide_logging_format() by remembering if blackhole engine
is used and avoid a loop over all tables if it's not used
(the common case).
- THD::decide_logging_format() is not called anymore if no tables are used
for the statement. This will speed up pure stored procedure code with
about 5%+ according to some simple tests.
- We now get annotated events on slave if a CREATE ... SELECT statement
is transformed on the slave from statement to row logging.
- In the original code, the master could come into a state where row
logging was enforced for all future events even if statement logging
could be used. This is now partly fixed.
Other changes:
- Ensure that all tables used by a statement have query_id set.
- Had to restore the row_logging flag for unused tables in
THD::binlog_write_table_maps (not a normal scenario).
- Removed injector::transaction::use_table(server_id_type sid, table tbl)
as it's not used.
- Cleaned up set_slave_thread_options()
- Some more DBUG_ENTER/DBUG_RETURN, code comments and minor indentation
changes.
- Ensure we only call THD::decide_logging_format_low() once in
mysql_insert() (fixes an inefficiency).
- Don't annotate INSERT DELAYED
- Removed zeroing pos_in_table_list in THD::open_temporary_table() as it's
already 0
MDEV-21606 Improve update handler (long unique keys on blobs)
MDEV-21470 MyISAM and Aria start_bulk_insert doesn't work with long unique
MDEV-21606 Bug fix for previous version of this code
MDEV-21819 2 Assertion `inited == NONE || update_handler != this'
- Move update_handler from TABLE to handler
- Move out initialization of update handler from ha_write_row() to
prepare_for_insert()
- Fixed that INSERT DELAYED works with update handler
- Give an error if using long unique with an autoincrement column
- Added handler function to check if table has long unique hash indexes
- Disable the write cache in MyISAM and Aria when using update_handler, as
if the cache is used, the row will not be inserted until the end of the
statement and update_handler would not find conflicting rows.
- Removed not used handler argument from
check_duplicate_long_entries_update()
- Syntax cleanups
- Indentation fixes
- Don't use single-character identifiers for arguments
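A rough standalone sketch of the new ordering (stand-in members and a
hypothetical clone helper): the lookup handler is created up front in
prepare_for_insert(), and the write cache stays disabled when long unique
hash indexes exist, so conflicting rows remain visible to the duplicate check:

  struct fake_handler
  {
    bool has_long_unique= false;       // table has long unique hash indexes
    fake_handler *update_handler= nullptr;
    bool write_cache_enabled= false;

    // hypothetical helper standing in for cloning the lookup handler
    fake_handler *clone_for_lookup() { return new fake_handler(*this); }

    // Done once, before the first row (cf. prepare_for_insert()),
    // instead of lazily inside ha_write_row().
    void prepare_for_insert_sketch()
    {
      if (has_long_unique)
      {
        update_handler= clone_for_lookup();
        write_cache_enabled= false;    // a cached row would escape the
      }                                // duplicate check until stmt end
      else
        write_cache_enabled= true;
    }
  };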
MDEV-19964 S3 replication support
Added new configure options:
s3_slave_ignore_updates
"If the slave has shares same S3 storage as the master"
s3_replicate_alter_as_create_select
"When converting S3 table to local table, log all rows in binary log"
This allows one to configure slaves to have their S3 storage shared with
or independent from the master.
Other things:
Added a new session variable '@@sql_if_exists' to force IF EXISTS on DDLs.
Lifted the long-standing limitation of XA that rolled the transaction back
at its connection's close even if the XA was prepared.
A prepared XA transaction is now made to survive connection close or server
restart.
The patch consists of
- a binary logging extension to write the prepared XA part of a
transaction, signified with its XID, in a new XA_prepare_log_event.
The conclusion part - with the Commit or Rollback decision - is logged
separately as a Query_log_event.
That is, in the binlog the XA consists of two separate groups of
events.
That means a whole XA may interleave in the binlog with
other XAs or regular transactions, but with no harm to
replication and data consistency.
Gtid_log_event receives two more flags to identify which of the
two XA phases of the transaction it represents. With either flag
set, XID info is also added to the event.
When binlog is ON on the server, XID::formatID is
constrained to 4 bytes.
- engines are made aware of the server policy to keep user-prepared
XAs, so they (InnoDB, RocksDB) no longer roll them back
in their disconnect methods.
- the slave applier is refined to cope with two-phase logged XAs,
including parallel modes of execution.
This patch does not address crash-safe logging of the new events which
is being addressed by MDEV-21469.
CORNER CASES: read-only, pure MyISAM, binlog-*, @@skip_log_bin, etc.
These are addressed with the following policies.
1. The read-only at reconnect marks XID to fail for future
completion with ER_XA_RBROLLBACK.
2. A binlog-* filtered XA that changes engine data is regarded as
loggable even when nothing got cached for the binlog. An empty
XA-prepare group is recorded. The subsequent Commit-or-Rollback
succeeds in the engine(s) as well as being recorded into the binlog.
3. The same applies to the non-transactional engine XA.
4. @@skip_log_bin=OFF does not record anything at XA-prepare
(obviously), but the completion event is recorded into the binlog to
admit inconsistency with the slave.
The following actions are taken by the patch.
At XA-prepare:
when empty binlog cache - don't do anything to binlog if RO,
otherwise write empty XA_prepare (assert(binlog-filter case)).
At Disconnect:
when Prepared && RO (=> no binlogging was done)
set Xid_cache_element::error := ER_XA_RBROLLBACK
*keep* XID in the cache, and rollback the transaction.
At XA-"complete":
Discover the error; if there is one, don't binlog the "complete" and
return the error to the user.
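The disconnect rule above, restated as a small sketch (the types and the
error constant are stand-ins for Xid_cache_element and ER_XA_RBROLLBACK): the
engine transaction is rolled back, but the XID stays in the cache carrying
the error for a later XA COMMIT/ROLLBACK:

  enum { STANDIN_ER_XA_RBROLLBACK= 1 };   // stands in for ER_XA_RBROLLBACK

  struct fake_xid_cache_element
  {
    bool prepared= false;
    bool binlogged= false;    // false when RO / nothing written to binlog
    int  error= 0;            // 0 or STANDIN_ER_XA_RBROLLBACK
  };

  // Sketch of the policy at client disconnect.
  static void disconnect_policy(fake_xid_cache_element *xid,
                                void (*engine_rollback)())
  {
    if (xid->prepared && !xid->binlogged)   // the "Prepared && RO" case
    {
      xid->error= STANDIN_ER_XA_RBROLLBACK; // a later XA "complete" will see
      engine_rollback();                    // this error and skip binlogging
      // note: the XID itself is kept in the cache
    }
  }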
Kudos
-----
Alexey Botchkov took to drive this work initially.
Sergei Golubchik, Sergei Petrunja, Marko Mäkelä provided a number of
good recommendations.
Sergei Voitovich made a magnificent review and improvements to the code.
They all deserve a bunch of thanks for getting this work done!