innodb_doublewrite_update(): Disallow any change if srv_read_only_mode
holds, that is, the server was started with innodb_read_only=ON or
innodb_force_recovery=6.
This fixes up commit 1122ac978e (MDEV-33545).
By design, InnoDB has always hung when permanently running out of
buffer pool, for example when several threads are waiting to allocate
a block, and all of the buffer pool is buffer-fixed by the active threads.
The hang that we are fixing here occurs when the buffer pool is only
temporarily running out and the situation could be rescued by writing out
some dirty pages or evicting some clean pages.
buf_LRU_get_free_block(): Simplify the way how we wait for
the buf_flush_page_cleaner thread. This fixes occasional hangs
of the test encryption.innochecksum that were introduced by
commit a55b951e60 (MDEV-26827).
To play it safe, we use a timed wait when waiting for the
buf_flush_page_cleaner() thread to perform its job. Should that
thread get stuck, we will invoke buf_pool.LRU_warn() in order to
display a message that pages could not be freed, and keep trying
to wake up the buf_flush_page_cleaner() thread.
The INFORMATION_SCHEMA.INNODB_METRICS counters
buffer_LRU_single_flush_failure_count and
buffer_LRU_get_free_waits will be removed.
The latter is represented by buffer_pool_wait_free.
Also removed will be the message
"InnoDB: Difficult to find free blocks in the buffer pool"
because in d34479dc66 we
introduced a more precise message
"InnoDB: Could not free any blocks in the buffer pool"
in the buf_flush_page_cleaner thread.
buf_pool_t::LRU_warn(): Issue the warning message that we could
not free any blocks in the buffer pool. This may also be invoked
by buf_LRU_get_free_block() if buf_flush_page_cleaner() appears
to be stuck.
buf_pool_t::n_flush_dec(): Remove.
buf_pool_t::n_flush_dec_holding_mutex(): Rename to n_flush_dec().
buf_flush_LRU_list_batch(): Increment the eviction counter for blocks
of temporary, discarded or dropped tablespaces.
buf_flush_LRU(): Make static, and remove the constant parameter
evict=false. The only caller will be the buf_flush_page_cleaner()
thread.
IORequest::is_LRU(): Remove. The only case of evicting pages on
write completion will be when we are writing out pages of the
temporary tablespace. Those pages are not in buf_pool.flush_list,
only in buf_pool.LRU.
buf_page_t::flush(): Remove the parameter evict.
buf_page_t::write_complete(): Change the parameter "bool temporary"
to "bool persistent" and add a parameter for an already read state().
Reviewed by: Debarun Banerjee
- During non-last batch of multi-batch recovery, InnoDB holds
log_sys.mutex and preallocates the block which may intiate
page flush, which may initiate log flush, which requires
log_sys.mutex to acquire again. This leads to assert failure.
So InnoDB recovery should release log_sys.mutex before
preallocating the block.
Starting with 10.5, InnoDB crash recovery tests seem to time out
more easily under Valgrind, which emulates multiple threads by
interleaving them in a single operating system thread.
These tests will still be covered by
AddressSanitizer and MemorySanitizer.
MDEV-25604 Atomic DDL: Binlog event written upon recovery does not
have default database
The purpose of this task is to ensure that ALTER TABLE is atomic even if
the MariaDB server would be killed at any point of the alter table.
This means that either the ALTER TABLE succeeds (including that triggers,
the status tables and the binary log are updated) or things should be
reverted to their original state.
If the server crashes before the new version is fully up to date and
commited, it will revert to the original table and remove all
temporary files and tables.
If the new version is commited, crash recovery will use the new version,
and update triggers, the status tables and the binary log.
The one execption is ALTER TABLE .. RENAME .. where no changes are done
to table definition. This one will work as RENAME and roll back unless
the whole statement completed, including updating the binary log (if
enabled).
Other changes:
- Added handlerton->check_version() function to allow the ddl recovery
code to check, in case of inplace alter table, if the table in the
storage engine is of the new or old version.
- Added handler->table_version() so that an engine can report the current
version of the table. This should be changed each time the table
definition changes.
- Added ha_signal_ddl_recovery_done() and
handlerton::signal_ddl_recovery_done() to inform all handlers when
ddl recovery has been done. (Needed by InnoDB).
- Added handlerton call inplace_alter_table_committed, to signal engine
that ddl_log has been closed for the alter table query.
- Added new handerton flag
HTON_REQUIRES_NOTIFY_TABLEDEF_CHANGED_AFTER_COMMIT to signal when we
should call hton->notify_tabledef_changed() during
mysql_inplace_alter_table. This was required as MyRocks and InnoDB
needed the call at different times.
- Added function server_uuid_value() to be able to generate a temporary
xid when ddl recovery writes the query to the binary log. This is
needed to be able to handle crashes during ddl log recovery.
- Moved freeing of the frm definition to end of mysql_alter_table() to
remove duplicate code and have a common exit strategy.
-------
InnoDB part of atomic ALTER TABLE
(Implemented by Marko Mäkelä)
innodb_check_version(): Compare the saved dict_table_t::def_trx_id
to determine whether an ALTER TABLE operation was committed.
We must correctly recover dict_table_t::def_trx_id for this to work.
Before purge removes any trace of DB_TRX_ID from system tables, it
will make an effort to load the user table into the cache, so that
the dict_table_t::def_trx_id can be recovered.
ha_innobase::table_version(): return garbage, or the trx_id that would
be used for committing an ALTER TABLE operation.
In InnoDB, table names starting with #sql-ib will remain special:
they will be dropped on startup. This may be revisited later in
MDEV-18518 when we implement proper undo logging and rollback
for creating or dropping multiple tables in a transaction.
Table names starting with #sql will retain some special meaning:
dict_table_t::parse_name() will not consider such names for
MDL acquisition, and dict_table_rename_in_cache() will treat such
names specially when handling FOREIGN KEY constraints.
Simplify InnoDB DROP INDEX.
Prevent purge wakeup
To ensure that dict_table_t::def_trx_id will be recovered correctly
in case the server is killed before ddl_log_complete(), we will block
the purge of any history in SYS_TABLES, SYS_INDEXES, SYS_COLUMNS
between ha_innobase::commit_inplace_alter_table(commit=true)
(purge_sys.stop_SYS()) and purge_sys.resume_SYS().
The completion callback purge_sys.resume_SYS() must be between
ddl_log_complete() and MDL release.
--------
MyRocks support for atomic ALTER TABLE
(Implemented by Sergui Petrunia)
Implement these SE API functions:
- ha_rocksdb::table_version()
- hton->check_version = rocksdb_check_versionMyRocks data dictionary
now stores table version for each table.
(Absence of table version record is interpreted as table_version=0,
that is, which means no upgrade changes are needed)
- For inplace alter table of a partitioned table, call the underlying
handlerton when checking if the table is ok. This assumes that the
partition engine commits all changes at once.
The reason for this is to make all temporary file names similar and
also to be able to figure out from where a #sql-xxx name orginates.
New format is for most cases:
'#sql-name-current_pid-thread_id[-increment]'
Where name is one of subselect, alter, exchange, temptable or backup
The exceptions are:
ALTER PARTITION shadow files:
'#sql-shadow-thread_id-'original_table_name'
Names used with temp pool:
'#sql-name-current_pid-pool_number'
- Note that some issues was also fixed in 10.2 and 10.4. I also fixed them
here to be able to continue with making 10.5 valgrind safe again
- Disable connection threads warnings when doing shutdown
Use DEBUG_SYNC to hang the execution at the interesting point,
and then kill and restart the server externally. This will work
also with Valgrind. DBUG_SUICIDE() causes Valgrind to hang,
and it could also cause uninteresting reports about memory leaks.
While we are at it, let us clean up innodb.innodb_bulk_create_index_debug
so that it will actually test the desired functionality also in future
versions (with instant ADD COLUMN and DROP COLUMN) and avoid
some unnecessary restarts.
We are adding two DEBUG_SYNC points for ALTER TABLE, because there were
none that would be executed right before ha_commit_trans().
This is a backport of the following commits:
commit b4165985c9
commit 69e88de0fe
commit 40f4525f43
commit 656f66def2
Now that MDEV-14717 made RENAME TABLE crash-safe within InnoDB,
it should be safe to drop the #sql- tables within InnoDB during
crash recovery. These tables can be one of two things:
(1) #sql-ib related to deferred DROP TABLE (follow-up to MDEV-13407)
or to table-rebuilding ALTER TABLE...ALGORITHM=INPLACE
(since MDEV-14378, only related to the intermediate copy of a table),
(2) #sql- related to the intermediate copy of a table during
ALTER TABLE...ALGORITHM=COPY
We will not drop tables whose name starts with #sql2, because
the server can be killed during an ALGORITHM=COPY operation at
a point where the original table was renamed to #sql2 but the
finished intermediate copy was not yet renamed from #sql-
to the original table name.
If an old version of MariaDB Server before 10.2.13 (MDEV-11415)
was killed while ALTER TABLE...ALGORITHM=COPY was in progress,
after recovery there could be undo log records for some records that were
inserted into an intermediate copy of the table. Due to these undo log
records, InnoDB would resurrect locks at recovery, and the intermediate
table would be locked while we are trying to drop it. This would cause
a call to row_rename_table_for_mysql(), either from
row_mysql_drop_garbage_tables() or from the rollback of a RENAME
operation that was part of the ALTER TABLE.
row_rename_table_for_mysql(): Do not attempt to parse FOREIGN KEY
constraints when renaming from #sql-something to #sql-something-else,
because it does not make any sense.
row_drop_table_for_mysql(): When deferring DROP TABLE due to locks,
do not rename the table if its name already starts with the #sql-
prefix, which is what row_mysql_drop_garbage_tables() uses.
Previously, the too strict prefix #sql-ib was used, and some
tables were renamed unnecessarily.
If a crash occurs during ALTER TABLE…ALGORITHM=COPY, InnoDB would spend
a lot of time rolling back writes to the intermediate copy of the table.
To reduce the amount of busy work done, a work-around was introduced in
commit fd069e2bb3 in MySQL 4.1.8 and 5.0.2,
to commit the transaction after every 10,000 inserted rows.
A proper fix would have been to disable the undo logging altogether and
to simply drop the intermediate copy of the table on subsequent server
startup. This is what happens in MariaDB 10.3 with MDEV-14717,MDEV-14585.
In MariaDB 10.2, the intermediate copy of the table would be left behind
with a name starting with the string #sql.
This is a backport of a bug fix from MySQL 8.0.0 to MariaDB,
contributed by jixianliang <271365745@qq.com>.
Unlike recent MySQL, MariaDB supports ALTER IGNORE. For that operation
InnoDB must for now keep the undo logging enabled, so that the latest
row can be rolled back in case of an error.
In Galera cluster, the LOAD DATA statement will retain the existing
behaviour and commit the transaction after every 10,000 rows if
the parameter wsrep_load_data_splitting=ON is set. The logic to do
so (the wsrep_load_data_split() function and the call
handler::extra(HA_EXTRA_FAKE_START_STMT)) are joint work
by Ji Xianliang and Marko Mäkelä.
The original fix:
Author: Thirunarayanan Balathandayuthapani <thirunarayanan.balathandayuth@oracle.com>
Date: Wed Dec 2 16:09:15 2015 +0530
Bug#17479594 AVOID INTERMEDIATE COMMIT WHILE DOING ALTER TABLE ALGORITHM=COPY
Problem:
During ALTER TABLE, we commit and restart the transaction for every
10,000 rows, so that the rollback after recovery would not take so long.
Fix:
Suppress the undo logging during copy alter operation. If fts_index is
present then insert directly into fts auxiliary table rather
than doing at commit time.
ha_innobase::num_write_row: Remove the variable.
ha_innobase::write_row(): Remove the hack for committing every 10000 rows.
row_lock_table_for_mysql(): Remove the extra 2 parameters.
lock_get_src_table(), lock_is_table_exclusive(): Remove.
Reviewed-by: Marko Mäkelä <marko.makela@oracle.com>
Reviewed-by: Shaohua Wang <shaohua.wang@oracle.com>
Reviewed-by: Jon Olav Hauglid <jon.hauglid@oracle.com>