FTS add index fails
Problem:
========
InnoDB double frees the table if auxiliary fts table
creation fails and fails to set the dict operation
for the transaction. It leads to failure while
dropping newly added index.
Solution:
=========
InnoDB should avoid double freeing and set the
dictionary operation of transaction in
fts_create_common_tables()
Many InnoDB data dictionary cache operations require that the
table name be copied so that it will be NUL terminated.
(For example, SYS_TABLES.NAME is not guaranteed to be NUL-terminated.)
dict_table_t::is_garbage_name(): Check if a name belongs to
the background drop table queue.
dict_check_if_system_table_exists(): Remove.
dict_sys_t::load_sys_tables(): Load the non-hard-coded system tables
SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL on startup.
dict_sys_t::create_or_check_sys_tables(): Replaces
dict_create_or_check_foreign_constraint_tables() and
dict_create_or_check_sys_virtual().
dict_sys_t::load_table(): Replaces dict_table_get_low()
and dict_load_table().
dict_sys_t::find_table(): Renamed from get_table().
dict_sys_t::sys_tables_exist(): Check whether all the non-hard-coded
tables SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL exist.
trx_t::has_stats_table_lock(): Moved to dict0stats.cc.
Some error messages will now report table names in the internal
databasename/tablename format, instead of `databasename`.`tablename`.
MDEV-25604 Atomic DDL: Binlog event written upon recovery does not
have default database
The purpose of this task is to ensure that ALTER TABLE is atomic even if
the MariaDB server would be killed at any point of the alter table.
This means that either the ALTER TABLE succeeds (including that triggers,
the status tables and the binary log are updated) or things should be
reverted to their original state.
If the server crashes before the new version is fully up to date and
commited, it will revert to the original table and remove all
temporary files and tables.
If the new version is commited, crash recovery will use the new version,
and update triggers, the status tables and the binary log.
The one execption is ALTER TABLE .. RENAME .. where no changes are done
to table definition. This one will work as RENAME and roll back unless
the whole statement completed, including updating the binary log (if
enabled).
Other changes:
- Added handlerton->check_version() function to allow the ddl recovery
code to check, in case of inplace alter table, if the table in the
storage engine is of the new or old version.
- Added handler->table_version() so that an engine can report the current
version of the table. This should be changed each time the table
definition changes.
- Added ha_signal_ddl_recovery_done() and
handlerton::signal_ddl_recovery_done() to inform all handlers when
ddl recovery has been done. (Needed by InnoDB).
- Added handlerton call inplace_alter_table_committed, to signal engine
that ddl_log has been closed for the alter table query.
- Added new handerton flag
HTON_REQUIRES_NOTIFY_TABLEDEF_CHANGED_AFTER_COMMIT to signal when we
should call hton->notify_tabledef_changed() during
mysql_inplace_alter_table. This was required as MyRocks and InnoDB
needed the call at different times.
- Added function server_uuid_value() to be able to generate a temporary
xid when ddl recovery writes the query to the binary log. This is
needed to be able to handle crashes during ddl log recovery.
- Moved freeing of the frm definition to end of mysql_alter_table() to
remove duplicate code and have a common exit strategy.
-------
InnoDB part of atomic ALTER TABLE
(Implemented by Marko Mäkelä)
innodb_check_version(): Compare the saved dict_table_t::def_trx_id
to determine whether an ALTER TABLE operation was committed.
We must correctly recover dict_table_t::def_trx_id for this to work.
Before purge removes any trace of DB_TRX_ID from system tables, it
will make an effort to load the user table into the cache, so that
the dict_table_t::def_trx_id can be recovered.
ha_innobase::table_version(): return garbage, or the trx_id that would
be used for committing an ALTER TABLE operation.
In InnoDB, table names starting with #sql-ib will remain special:
they will be dropped on startup. This may be revisited later in
MDEV-18518 when we implement proper undo logging and rollback
for creating or dropping multiple tables in a transaction.
Table names starting with #sql will retain some special meaning:
dict_table_t::parse_name() will not consider such names for
MDL acquisition, and dict_table_rename_in_cache() will treat such
names specially when handling FOREIGN KEY constraints.
Simplify InnoDB DROP INDEX.
Prevent purge wakeup
To ensure that dict_table_t::def_trx_id will be recovered correctly
in case the server is killed before ddl_log_complete(), we will block
the purge of any history in SYS_TABLES, SYS_INDEXES, SYS_COLUMNS
between ha_innobase::commit_inplace_alter_table(commit=true)
(purge_sys.stop_SYS()) and purge_sys.resume_SYS().
The completion callback purge_sys.resume_SYS() must be between
ddl_log_complete() and MDL release.
--------
MyRocks support for atomic ALTER TABLE
(Implemented by Sergui Petrunia)
Implement these SE API functions:
- ha_rocksdb::table_version()
- hton->check_version = rocksdb_check_versionMyRocks data dictionary
now stores table version for each table.
(Absence of table version record is interpreted as table_version=0,
that is, which means no upgrade changes are needed)
- For inplace alter table of a partitioned table, call the underlying
handlerton when checking if the table is ok. This assumes that the
partition engine commits all changes at once.
This patch changes the main name of 3 byte character set from utf8 to
utf8mb3. New old_mode UTF8_IS_UTF8MB3 is added and set TRUE by default,
so that utf8 would mean utf8mb3. If not set, utf8 would mean utf8mb4.
During data file creation, InnoDB holds dict_sys mutex, tries to
write page 0 of the file and flushes the file. This not only causing
unnecessary contention but also a deviation from the write-ahead
logging protocol.
The clean sequence of operations is that we first start a dictionary
transaction and write SYS_TABLES and SYS_INDEXES records that identify
the tablespace. Then, we durably write a FILE_CREATE record to the
write-ahead log and create the file.
Recovery should not unnecessarily insist that the first page of each
data file that is referred to by the redo log is valid. It must be
enough that page 0 of the tablespace can be initialized based on the
redo log contents.
We introduce a new data structure deferred_spaces that keeps track
of corrupted-looking files during recovery. The data structure holds
the last LSN of a FILE_ record referring to the data file, the
tablespace identifier, and the last known file name.
There are two scenarios can happen during recovery:
i) Sufficient memory: InnoDB can reconstruct the
tablespace after parsing all redo log records.
ii) Insufficient memory(multiple apply phase): InnoDB should
store the deferred tablespace redo logs even though
tablespace is not present. InnoDB should start constructing
the tablespace when it first encounters deferred tablespace
id.
Mariabackup copies the zero filled ibd file in backup_fix_ddl() as
the extension of .new file. Mariabackup test case does page flushing
when it deals with DDL operation during backup operation.
fil_ibd_create(): Remove the write of page0 and flushing of file
fil_ibd_load(): Return FIL_LOAD_DEFER if the tablespace has
zero filled page0
Datafile: Clean up the error handling, and do not report errors
if we are in the middle of recovery. The caller will check
Datafile::m_defer.
fil_node_t::deferred: Indicates whether the tablespace loading was
deferred during recovery
FIL_LOAD_DEFER: Returned by fil_ibd_load() to indicate that tablespace
file was cannot be loaded.
recv_sys_t::recover_deferred(): Invoke deferred_spaces.create() to
initialize fil_space_t based on buffered metadata and records to
initialize page 0. Ignore the flags in fil_name_t, because they are
intentionally invalid.
fil_name_process(): Update deferred_spaces.
recv_sys_t::parse(): Store the redo log if the tablespace id
is present in deferred spaces
recv_sys_t::recover_low(): Should recover the first page of
the tablespace even though the tablespace instance is not
present
recv_sys_t::apply(): Initialize the deferred tablespace
before applying the deferred tablespace records
recv_validate_tablespace(): Skip the validation for deferred_spaces.
recv_rename_files(): Moved and revised from recv_sys_t::apply().
For deferred-recovery tablespaces, do not attempt to rename the
file if a deferred-recovery tablespace is associated with the name.
recv_recovery_from_checkpoint_start(): Invoke recv_rename_files()
and initialize all deferred tablespaces before applying redo log.
fil_node_t::read_page0(): Skip page0 validation if the tablespace
is deferred
buf_page_create_deferred(): A variant of buf_page_create() when
the fil_space_t is not available yet
This is joint work with Thirunarayanan Balathandayuthapani,
who implemented an initial prototype.
Make DDL operations that involve FULLTEXT INDEX atomic.
In particular, we must drop the internal FTS_ tables in the same
DDL transaction with ALTER TABLE.
Remove all references to fts_drop_orphaned_tables().
row_merge_drop_temp_indexes(): Drop also the internal FTS_ tables
that are associated with index stubs that were created in
prepare_inplace_alter_table_dict() for
CREATE FULLTEXT INDEX before the server was killed.
fts_clear_all(): Remove the fts_drop_tables() call. It has to be
executed before the transaction is committed!
dict_load_indexes(): Do not load any metadata for index stubs
that had been created by prepare_inplace_alter_table_dict()
fts_create_one_common_table(), fts_create_common_tables(),
fts_create_one_index_table(), fts_create_index_tables():
Remove redundant error handling. The tables will be dropped
just fine by dict_drop_index_tree().
commit_try_norebuild(): Also drop the FTS_ tables when dropping
FULLTEXT INDEX.
The changes to the test case innodb_fts.crash_recovery has been
extensively tested. The non-debug server will be killed while
the 3 ALTER TABLE are in any phase of execution. With the debug
server, DEBUG_SYNC should make the test deterministic.
InnoDB tries to fetch the deleted doc ids for discarded
tablespace. In i_s_fts_deleted_generic_fill(), InnoDB needs
to check whether the table is discarded or not before fetching
deleted doc ids.
InnoDB should skip the dropped aborted FTS_DOC_ID_INDEX while
checking the existing FTS_DOC_ID_INDEX in the table. InnoDB
should able to create new FTS_DOC_ID_INDEX if the fulltext
index is being added for the first time.
- Aborting of fulltext index creation fails to remove the
index from sys indexes table. When we try to reload the
table definition, InnoDB fails with index count mismatch
error. InnoDB should remove the index from sys indexes while
rollbacking the secondary index creation.
In MDEV-515, we enabled an optimization where an insert into an
empty table will use table-level locking and undo logging.
This may break applications that expect row-level locking.
The SQL statements created by the mysqldump utility will include the
following:
SET unique_checks=0, foreign_key_checks=0;
We will use these flags to enable the table-level locked and logged
insert. Unless the parameters are set, INSERT will be executed in
the old way, with row-level undo logging and implicit record locks.
InnoDB set the space in dict_table_t as NULL when table
is discarded. So InnoDB shouldn't use the space present
in table to detect whether the given tablespace is
temporary tablespace.
This feature adds the functionality of ignorability for indexes.
Indexes are not ignored be default.
To control index ignorability explicitly for a new index,
use IGNORE or NOT IGNORE as part of the index definition for
CREATE TABLE, CREATE INDEX, or ALTER TABLE.
Primary keys (explicit or implicit) cannot be made ignorable.
The table INFORMATION_SCHEMA.STATISTICS get a new column named IGNORED that
would store whether an index needs to be ignored or not.
We implement an idea that was suggested by Michael 'Monty' Widenius
in October 2017: When InnoDB is inserting into an empty table or partition,
we can write a single undo log record TRX_UNDO_EMPTY, which will cause
ROLLBACK to clear the table.
For this to work, the insert into an empty table or partition must be
covered by an exclusive table lock that will be held until the transaction
has been committed or rolled back, or the INSERT operation has been
rolled back (and the table is empty again), in lock_table_x_unlock().
Clustered index records that are covered by the TRX_UNDO_EMPTY record
will carry DB_TRX_ID=0 and DB_ROLL_PTR=1<<55, and thus they cannot
be distinguished from what MDEV-12288 leaves behind after purging the
history of row-logged operations.
Concurrent non-locking reads must be adjusted: If the read view was
created before the INSERT into an empty table, then we must continue
to imagine that the table is empty, and not try to read any records.
If the read view was created after the INSERT was committed, then
all records must be visible normally. To implement this, we introduce
the field dict_table_t::bulk_trx_id.
This special handling only applies to the very first INSERT statement
of a transaction for the empty table or partition. If a subsequent
statement in the transaction is modifying the initially empty table again,
we must enable row-level undo logging, so that we will be able to
roll back to the start of the statement in case of an error (such as
duplicate key).
INSERT IGNORE will continue to use row-level logging and locking, because
implementing it would require the ability to roll back the latest row.
Since the undo log that we write only allows us to roll back the entire
statement, we cannot support INSERT IGNORE. We will introduce a
handler::extra() parameter HA_EXTRA_IGNORE_INSERT to indicate to storage
engines that INSERT IGNORE is being executed.
In many test cases, we add an extra record to the table, so that during
the 'interesting' part of the test, row-level locking and logging will
be used.
Replicas will continue to use row-level logging and locking until
MDEV-24622 has been addressed. Likewise, this optimization will be
disabled in Galera cluster until MDEV-24623 enables it.
dict_table_t::bulk_trx_id: The latest active or committed transaction
that initiated an insert into an empty table or partition.
Protected by exclusive table lock and a clustered index leaf page latch.
ins_node_t::bulk_insert: Whether bulk insert was initiated.
trx_t::mod_tables: Use C++11 style accessors (emplace instead of insert).
Unlike earlier, this collection will cover also temporary tables.
trx_mod_table_time_t: Add start_bulk_insert(), end_bulk_insert(),
is_bulk_insert(), was_bulk_insert().
trx_undo_report_row_operation(): Before accessing any undo log pages,
invoke trx->mod_tables.emplace() in order to determine whether undo
logging was disabled, or whether this is the first INSERT and we are
supposed to write a TRX_UNDO_EMPTY record.
row_ins_clust_index_entry_low(): If we are inserting into an empty
clustered index leaf page, set the ins_node_t::bulk_insert flag for
the subsequent trx_undo_report_row_operation() call.
lock_rec_insert_check_and_lock(), lock_prdt_insert_check_and_lock():
Remove the redundant parameter 'flags' that can be checked in the caller.
btr_cur_ins_lock_and_undo(): Simplify the logic. Correctly write
DB_TRX_ID,DB_ROLL_PTR after invoking trx_undo_report_row_operation().
trx_mark_sql_stat_end(), ha_innobase::extra(HA_EXTRA_IGNORE_INSERT),
ha_innobase::external_lock(): Invoke trx_t::end_bulk_insert() so that
the next statement will not be covered by table-level undo logging.
ReadView::changes_visible(trx_id_t) const: New accessor for the case
where the trx_id_t is not read from a potentially corrupted index page
but directly from the memory. In this case, we can skip a sanity check.
row_sel(), row_sel_try_search_shortcut(), row_search_mvcc():
row_sel_try_search_shortcut_for_mysql(),
row_merge_read_clustered_index(): Check dict_table_t::bulk_trx_id.
row_sel_clust_sees(): Replaces lock_clust_rec_cons_read_sees().
lock_sec_rec_cons_read_sees(): Replaced with lower-level code.
btr_root_page_init(): Refactored from btr_create().
dict_index_t::clear(), dict_table_t::clear(): Empty an index or table,
for the ROLLBACK of an INSERT operation.
ROW_T_EMPTY, ROW_OP_EMPTY: Note a concurrent ROLLBACK of an INSERT
into an empty table.
This is joint work with Thirunarayanan Balathandayuthapani,
who created a working prototype.
Thanks to Matthias Leich for extensive testing.
with wrong data type is added
Inplace alter fails to report error when fts_doc_id column with
wrong data type is added.
prepare_inplace_alter_table_dict(): Should check whether the column
is fts_doc_id. It should be of bigint type, should accept non null
data type and it should be in capital letters.
Let us introduce the parameter innodb_read_only_compressed
that is ON by default, making any ROW_FORMAT=COMPRESSED tables
read-only.
I developed the ROW_FORMAT=COMPRESSED format based on
Heikki Tuuri's rough design between 2005 and 2008. It might
have been a good idea back then, but no proper benchmarks were
ever run to validate the design or the implementation.
The format has been more or less obsolete for years.
It limits innodb_page_size to 16384 bytes (the default),
and instant ALTER TABLE is not supported.
This is the first step towards deprecating and removing
write support for ROW_FORMAT=COMPRESSED tables.
fts_query_t::nested_sub_exp: Keep track of nested
fts_ast_visit_sub_exp() calls.
fts_ast_visit_sub_exp(): Return DB_OUT_OF_MEMORY if the
maximum recursion depth is exceeded.
This is motivated by a change in MySQL 5.6.50:
mysql/mysql-server@e2a46b4834
Bug #29929684 USING MANY NESTED ARGUMENTS WITH BOOLEAN FTS CAN LEAD
TO TERMINATE SERVER
Marking of deletion of row in fts index happens twice in
self-referential foreign key relation. So while performing
referential checks of foreign key, InnoDB can avoid updating
of fts index if the foreign key has self-referential relationship.
Reviewed-by: Marko Mäkelä