server failure in different, confusing ways
InnoDB fails to free the buffer pool instance mutex and zip mutex
If the allocation of buffer pool instance chunk fails. So it leads
to freeing of buffer pool before freeing the mutexes and
leads to double freeing of memory while freeing the mutex
during shutdown.
Problem:
The problem happened because of a conceptual flaw in the server code:
a. The table level CHARSET/COLLATE clause affected all data types,
including numeric and temporal ones:
CREATE TABLE t1 (a INT) CHARACTER SET utf8 [COLLATE utf8_general_ci];
In the above example, the Column_definition_attributes
(and then the FRM record) for the column "a" erroneously inherited
"utf8" as its character set.
b. The "ALTER TABLE t1 CONVERT TO CHARACTER SET csname" statement
also erroneously affected Column_definition_attributes::charset
for numeric and temporal data types and wrote "csname" as their
character set into FRM files.
So now we have arbitrary non-relevant charset ID values for numeric
and temporal data types in all FRM files in the world :)
The code in the server and the other engines did not seem to be affected
by this flaw. Only InnoDB inplace ALTER was affected.
Solution:
Fixing the code in the way that only character string data types
(CHAR,VARCHAR,TEXT,ENUM,SET):
- inherit the table level CHARSET/COLLATE clause
- get the charset value according to "CONVERT TO CHARACTER SET csname".
Numeric and temporal data types now always get &my_charset_numeric
in Column_definition_attributes::charset and always write its ID into FRM files:
- no matter what the table level CHARSET/COLLATE clause is, and
- no matter what "CONVERT TO CHARACTER SET" says.
Details:
1. Adding helper classes to pass small parts of HA_CREATE_INFO
into Type_handler methods:
- Column_derived_attributes - to pass table level CHARSET/COLLATE,
so columns that do not have explicit CHARSET/COLLATE clauses
can derive them from the table level, e.g.
CREATE TABLE t1 (a VARCHAR(1), b CHAR(1)) CHARACTER SET utf8;
- Column_bulk_alter_attributes - to pass bulk attribute changes
generated by the ALTER related code. These bulk changes affect
multiple columns at the same time:
ALTER TABLE ... CONVERT TO CHARACTER SET csname;
Note, passing the whole HA_CREATE_INFO directly to Type_handler
would not be good: HA_CREATE_INFO is huge and would need not desired
dependencies in sql_type.h and sql_type.cc. The Type_handler API should
use smallest possible data types!
2. Type_handler::Column_definition_prepare_stage1() is now responsible
to set Column_definition::charset properly, according to the data type,
for example:
- For string data types, Column_definition_attributes::charset is set from
the table level CHARSET/COLLATE clause (if not specified explicitly in
the column definition).
- For numeric and temporal fields, Column_definition_attributes::charset is
set to &my_charset_numeric, no matter what the table level
CHARSET/COLLATE says.
- For GEOMETRY, Column_definition_attributes::charset is set to
&my_charset_bin, no matter what the table level CHARSET/COLLATE says.
Previously this code (setting `charset`) was outside of of
Column_definition_prepare_stage1(), namely in
mysql_prepare_create_table(), and was erroneously called for
all data types.
3. Adding Type_handler::Column_definition_bulk_alter(), to handle
"ALTER TABLE .. CONVERT TO". Previously this code was inside
get_sql_field_charset() and was erroneously called for all data types.
4. Removing the Schema_specification_st parameter from
Type_handler::Column_definition_redefine_stage1().
Column_definition_attributes::charset is now fully properly initialized by
Column_definition_prepare_stage1(). So we don't need access to the
table level CHARSET/COLLATE clause in Column_definition_redefine_stage1()
any more.
5. Other changes:
- Removing global function get_sql_field_charset()
- Moving the part of the former get_sql_field_charset(), which was
responsible to inherit the table level CHARSET/COLLATE clause to
new methods:
-- Column_definition_attributes::explicit_or_derived_charset() and
-- Column_definition::prepare_charset_for_string().
This code is only needed for string data types.
Previously it was erroneously called for all data types.
- Moving another part, which was responsible to apply the
"CONVERT TO" clause, to
Type_handler_general_purpose_string::Column_definition_bulk_alter().
- Replacing the call for get_sql_field_charset() in sql_partition.cc
to sql_field->explicit_or_derived_charset() - it is perfectly enough.
The old code was redundant: get_sql_field_charset() was called from
sql_partition.cc only when there were no a "CONVERT TO CHARACTER SET"
clause involved, so its purpose was only to inherit the table
level CHARSET/COLLATE clause.
- Moving the code handling the BINCMP_FLAG flag from
mysql_prepare_create_table() to
Column_definition::prepare_charset_for_string():
This code is responsible to resolve the BINARY comparison style
into the corresponding _bin collation, to do the following transparent
rewrite:
CREATE TABLE t1 (a VARCHAR(10) BINARY) CHARSET utf8; ->
CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET utf8 COLLATE utf8_bin);
This code is only needed for string data types.
Previously it was erroneously called for all data types.
6. Renaming Table_scope_and_contents_source_pod_st::table_charset
to alter_table_convert_to_charset, because the only purpose it's used for
is handlering "ALTER .. CONVERT". The new name is much more self-descriptive.
InnoDB should skip the dropped aborted FTS_DOC_ID_INDEX while
checking the existing FTS_DOC_ID_INDEX in the table. InnoDB
should able to create new FTS_DOC_ID_INDEX if the fulltext
index is being added for the first time.
modified: storage/connect/CMakeLists.txt
modified: storage/connect/javaconn.cpp
- Check privileges while creating tables with Discovery
modified: storage/connect/ha_connect.cc
- Calculate LRECL for JSON tables created with Discovery
modified: storage/connect/tabjson.cpp
- Use CreateProcess (Windows) or fork/exec (linux)
to retrieve the result from REST queries
modified: storage/connect/tabrest.cpp
- Typo
modified: storage/connect/jmgoconn.cpp
This was because of a wrong test in encryption code that wrote random
numbers over the LSN for pages for transactional Aria tables during repair.
The effect was that after an ALTER TABLE ENABLE KEYS of a encrypted
recovery of the tables would not work.
The test cases will be pushed into 10.5 as it requires of several changes
to check table that safer not to backport.
During the prepare phase of restoring backups, "mariabackup" does
not seem to allow (or recognize) the option "innodb_force_recovery"
for the embedded InnoDB server instance that it starts.
If page corruption observed during page recovery, the prepare step
fails. While this is indeed the correct behavior ideally, allowing
this option to be set in case of emergencies might be useful when
the current backup is the only copy available. Some error messages
during "--prepare" suggest to set "innodb_force_recovery" to 1:
[ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
For backwards compatibility, "mariabackup --innobackupex --apply-log"
should also have this option.
Signed-off-by: Srinidhi Kaushik <shrinidhi.kaushik@gmail.com>
Virtual column fields are not found in prebuilt data type, so we should
match InnoDB fields with `get_innobase_type_from_mysql_type` method.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
- Aborting of fulltext index creation fails to remove the
index from sys indexes table. When we try to reload the
table definition, InnoDB fails with index count mismatch
error. InnoDB should remove the index from sys indexes while
rollbacking the secondary index creation.
- Importing table operation fails to punch the hole in
the filesystem when page compressed table is involved.
To achieve that, InnoDB firstly punches the hole for
the IOBuffer size(1MB). After that, InnoDB should write
page by page when page compression is involved.
Add condition on trx->state == TRX_STATE_COMMITTED_IN_MEMORY in order to
avoid unnecessary work. If a transaction has already been committed or
rolled back, it will release its locks in lock_release() and let
the waiting thread(s) continue execution.
Let BF wait on lock_rec_has_to_wait and if necessary other BF
is replayed.
wsrep_trx_order_before
If BF is not even replicated yet then they are ordered
correctly.
bg_wsrep_kill_trx
Make sure victim_trx is found and check also its state. If
state is TRX_STATE_COMMITTED_IN_MEMORY transaction is
already committed or rolled back and will release it locks
soon.
wsrep_assert_no_bf_bf_wait
Transaction requesting new record lock should be TRX_STATE_ACTIVE
Conflicting transaction can be in states TRX_STATE_ACTIVE,
TRX_STATE_COMMITTED_IN_MEMORY or in TRX_STATE_PREPARED.
If conflicting transaction is already committed in memory or
prepared we should wait. When transaction is committed in memory we
held trx mutex, but not lock_sys->mutex. Therefore, we
could end here before transaction has time to do lock_release()
that is protected with lock_sys->mutex.
lock_rec_has_to_wait
We very well can let bf to wait normally as other BF will be
replayed in case of conflict. For debug builds we will do
additional sanity checks to catch unsupported bf wait if any.
wsrep_kill_victim
Check is victim already in TRX_STATE_COMMITTED_IN_MEMORY state and
if it is we can return.
lock_rec_dequeue_from_page
lock_rec_unlock
Remove unnecessary wsrep_assert_no_bf_bf_wait function calls.
We can very well let BF wait here.
The function row_upd_clust_step() is invoking several static functions,
some of which used to commit the mini-transaction in some cases.
If innobase_get_computed_value() would fail due to some reason,
we would fail to invoke mtr_t::commit() and release buffer pool
page latches. This would likely lead to a hanging server later.
This regression was introduced in
commit 97db6c15ea (MDEV-20618).
row_upd_index_is_referenced(), row_upd_sec_index_entry(),
row_upd_sec_index_entry(): Cleanup: Replace some ibool with bool.
row_upd_clust_rec_by_insert(), row_upd_clust_rec(): Guarantee that
the mini-transaction will always remain in active state.
row_upd_del_mark_clust_rec(): Guarantee that
the mini-transaction will always remain in active state.
This fixes one "leak" of mini-transaction on DB_COMPUTE_VALUE_FAILED.
row_upd_clust_step(): Use only one return path, which will always
invoke mtr.commit(). After a failed row_upd_store_row() call, we
will no longer "leak" the mini-transaction.
This fix was verified by RQG on 10.6 (depending on MDEV-371 that
was introduced in 10.4). Unfortunately, it is challenging to
create a regression test for this, and a test case could soon become
invalid as more bugs in virtual column evaluation are fixed.
As suggested by Vladislav Vaintroub, let us remove misleading
and malformatted startup messages.
Even if the global variable srv_use_atomic_writes were set, we would
still invoke my_test_if_atomic_write() to check if writes are atomic
with a particular page size.
When using the default innodb_page_size=16k, page writes should be
atomic on NTFS when using ROW_FORMAT=COMPRESSED and KEY_BLOCK_SIZE<=4.
Disabling srv_use_atomic_writes when innodb_file_per_table=OFF does
not make sense, because that is a dynamic parameter.
We also correct the documentation string of innodb_use_atomic_writes
and remove the duplicate variable innobase_use_atomic_writes.
The debug parameter innodb_simulate_comp_failures injected compression
failures for ROW_FORMAT=COMPRESSED tables, breaking the pre-existing
logic that I had implemented in the InnoDB Plugin for MySQL 5.1 to prevent
compressed page overflows. A much better check is already achieved by
defining UNIV_ZIP_COPY at the compilation time.
(Only UNIV_ZIP_DEBUG is part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON.)
In commit eaeb8ec4b8 (MDEV-24653)
an incorrect debug assertion was introduced.
btr_pcur_store_position(): If the only record in the page is the
instant ALTER TABLE metadata record, we cannot expect there to be
a successor page. The situation could be improved by MDEV-24673 later.
Tests with 4096-byte sector size confirm that it is
safe to use O_DIRECT with page_compressed tables.
That had been disabled on Linux, in an attempt to fix MDEV-21584
which had been filed for the O_DIRECT problems earlier.
The fil_node_t::block_size was being set mostly correctly until
commit 10dd290b4b (MDEV-17380)
introduced a regression in MariaDB Server 10.4.4.
fil_node_open_file(): Only avoid setting O_DIRECT on
ROW_FORMAT=COMPRESSED tables that use KEY_BLOCK_SIZE=1 or 2
(1024 or 2048 bytes).
fil_ibd_create(): Avoid setting O_DIRECT on ROW_FORMAT=COMPRESSED tables
that use KEY_BLOCK_SIZE=1 or 2 (1024 or 2048 bytes).
fil_node_t::find_metadata(): Require fstat() to be always invoked
outside Microsoft Windows, so that fil_node_t::block_size can be set.
fil_node_t::read_page0(): Rely on find_metadata() to assign block_size.
Thanks to Vladislav Vaintroub for testing this on Microsoft Windows
using an old-fashioned rotational hard disk with 4KiB sector size.
Reviewed by: Vladislav Vaintroub
This is a port of commit 00f620b27e
and commit 6505662c23 from 10.2.
Before MDEV-14638, there was no race condition between the
execution of fetch_data_into_cache() and transaction commit.
fetch_data_into_cache(): Acquire trx_t::mutex before checking
trx_t::state, to prevent a concurrent transition from
TRX_STATE_COMMITTED_IN_MEMORY to TRX_STATE_NOT_STARTED
in trx_commit_in_memory().
ha_innobase::info_low(): While collecting statistics for
ANALYZE TABLE, ensure that dict_stats_process_entry_from_recalc_pool()
is not executing on the same table.
We observed result differences for the test innodb.innodb_stats because
dict_stats_empty_index() was being invoked by the background statistics
calculation while ha_innobase::analyze() was executing
dict_stats_analyze_index_level().
Tests with 4096-byte sector size confirm that it is
safe to use O_DIRECT with page_compressed tables.
That had been disabled on Linux, in an attempt to fix MDEV-21584
which had been filed for the O_DIRECT problems earlier.
The fil_node_t::block_size was being set mostly correctly until
commit 10dd290b4b (MDEV-17380)
introduced a regression in MariaDB Server 10.4.4.
fil_node_t::read_page0(): Initialize fil_node_t::block_size.
This will probably make similar code in fil_space_extend_must_retry()
redundant, but we play it safe and will not remove that code.
Thanks to Vladislav Vaintroub for testing this on Microsoft Windows
using an old-fashioned rotational hard disk with 4KiB sector size.
Reviewed by: Vladislav Vaintroub
This had been originally added in
mysql/mysql-server@192bb153b6
with the motivation to disable O_DIRECT for the dedicated tablespace
for temporary tables. In MariaDB Server,
commit 5eb539555b (MDEV-12227)
should be a better solution.
The code became orphaned later in
mysql/mysql-server@c61244c0e6
and it had been applied to MariaDB Server 10.2.2 in
commit 2e814d4702 and
commit fec844aca8.
Thanks to Vladislav Vaintroub for spotting this.
This when the HTTP contains & characters
modified: storage/connect/tabbson.cpp
modified: storage/connect/tabjson.cpp
- Make stringfy option work on only one Json item
modified: storage/connect/tabbson.cpp
modified: storage/connect/tabbson.h
modified: storage/connect/tabjson.cpp
modified: storage/connect/tabjson.h
- Make Json/Bson DATE columns accept JSON date syntax
modified: storage/connect/tabbson.cpp
modified: storage/connect/tabjson.cpp
- Fix bug making REST table default file not being
erased when dropping the table
modified: storage/connect/tabbson.cpp
modified: storage/connect/tabjson.cpp
modified: storage/connect/tabrest.cpp
modified: storage/connect/tabxml.cpp
- Suppress CHAR(36) --> VARCHAR(36) when DEVELOPMENT
This was fixed in MyClient
modified: storage/connect/ha_connect.cc
Keyvalue can be longer than REC_VERSION_56_MAX_INDEX_COL_LEN
and this leads out-of-array reference. Use dynamic memory
allocation using actual max length of key value.
Online log for insert operation of redundant table fails with
index->is_instant() assert. Purge can reset the n_core_fields when
alter is waiting to upgrade MDL for commit phase of DDL. In the
meantime, any insert DML tries to log the operation fails with
index is not being instant.
row_log_get_n_core_fields(): Get the n_core_fields of online log
for the given index.
rec_get_converted_size_comp_prefix_low(): Use n_core_fields of online
log when InnoDB calculates the size of data tuple during redundant
row format table rebuild.
rec_convert_dtuple_to_rec_comp(): Use n_core_fields of online log
when InnoDB does the conversion of data tuple to record during
redudant row format table rebuild.
- Adding the test case which has more than 129 instant columns.
- This is caused by merge commit a26e7a3726.
InnoDB fails to fetch the next index field when there is a externally
stored column length check involved.
MDEV-25105 (commit 7a4fbb55b0)
in MariaDB 10.6 will refuse the innodb_checksum_algorithm
values none, innodb, strict_none, strict_innodb.
We will issue a deprecation warning if innodb_checksum_algorithm
is set to any of these non-default unsafe values.
innodb_checksum_algorithm=crc32 was made the default in
MySQL 5.7 and MariaDB Server 10.2, and given that older versions
of the server have reached their end of life, there is no valid
reason to use anything else than innodb_checksum_algorithm=crc32
or innodb_checksum_algorithm=strict_crc32 in MariaDB 10.3.
Reviewed by: Sergei Golubchik
InnoDB set the space in dict_table_t as NULL when table
is discarded. So InnoDB shouldn't use the space present
in table to detect whether the given tablespace is
temporary tablespace.
btr_node_ptr_max_size(): Let us remove the debug assertion that was
added in MDEV-14637. The assertion assumed that no additional
indexes exist in mysql.innodb_index_stats or mysql.innodb_table_stats.
The code path is working around an incorrect definition of a table,
interpreting VARCHAR(64) as the more correct VARCHAR(199).
No test case will be added, because MDEV-24579 proves that executing
DDL on the statistics tables involves a race condition. The test
case included the following:
ALTER TABLE mysql.innodb_index_stats ADD KEY (stat_name);
CREATE TABLE t (a INT) ENGINE=InnoDB STATS_PERSISTENT=1;
failed in dtuple_convert_big_rec
In dtuple_convert_big_rec(), InnoDB fails to consider the
instant metadata blob while choosing the variable length
field.
Incorrect processing of an auto-incrementing field in the
WSREP-related code during applying transactions results in
a duplicate key being created. This is due to the fact that
at the beginning of the write_row() and update_row() functions,
the values of the auto-increment parameters are used, which
are read from the parameters of the current thread, but further
along the code other values are used, which are read from global
variables (when applying a transaction). This can happen when
the cluster configuration has changed while applying a transaction
(for example in the high_priority_service mode for Galera 4).
Further during IST processing duplicating key is detected, and
processing of the DB_DUPLICATE_KEY return code (inside innodb,
in the write_row() handler) results in a call to the
wsrep_thd_self_abort() function.
row_prebuilt_t::m_no_prefetch: Remove (it was always false).
row_prebuilt_t::m_read_virtual_key: Remove (it was always false).
Only ha_innopart ever set these fields.
innobase_rename_table(): Invoke dict_stats_wait_bg_to_stop_using_table()
to ensure that dict_stats_update() cannot be accessing the table name
that we will be modifying. If we are executing RENAME rather than TRUNCATE,
reset the flag at the end so that persistent statistics can be calculated
again.
The race condition was encountered with ASAN and rr.
Sorry, there is no test case, like there is for nothing related to
dict_stats_wait_bg_to_stop_using_table(). The entire code is an ugly
work-around for the failure of dict_stats_process_entry_from_recalc_pool()
to acquire MDL.
Note: It appears that an ALTER TABLE that is not rebuilding the table
will fail to reset the flag that blocks the processing of statistics.
In btr_index_rec_validate(), externally stored column
check is missing while matching the length of the field
with the length of the field data stored in record.
Fetch the length of the externally stored part and compare it
with the fixed field length.
This is a backport of commit 18535a4028
from 10.6.
lock_release(): Implement innodb_evict_tables_on_commit_debug.
Before releasing any locks, collect the identifiers of tables to
be evicted. After releasing all locks, look up for the tables and
evict them if it is safe to do so.
trx_commit_in_memory(): Invoke trx_update_mod_tables_timestamp()
before lock_release(), so that our locks will protect the tables
from being evicted.
When doing a truncate on an Innodb under lock tables, InnoDB would rename
the old table to #sql-... and recreate a new 't1' table. The table lock
would still be on the #sql-table.
When doing ALTER TABLE, Innodb would do the changes on the #sql table
(which would disappear on close).
When the SQL layer, as part of inline alter table, would close the
original t1 table (#sql in InnoDB) and then reopen the t1 table, Innodb
would notice that this does not match it's own (old) t1 table and
generate an error.
Fixed by adding code in truncate table that if we are under lock tables
and truncating an InnoDB table, we would close, reopen and lock the
table after truncate. This will remove the #sql table and ensure that
lock tables is using the new empty table.
Reviewer: Marko Mäkelä
eprintf() was missing a va_start(), which caused wrong filename to be
printed when printing recovery trace.
Added also missing new line when printing "Table is crashed" to trace file
- The commit 5fd3c7471e3e0673b50d309567c9747d36f09412(MDEV-24709)
resets the recv_no_ibuf_operations in
recv_recovery_from_checkpoint_start(), but InnoDB fails to reset
the variable recv_no_log_write() during that time and that leads
to the assert failure.
- This is caused by commit ad6171b91c
(MDEV-22456). InnoDB reloads the evicted table again from dictionary.
In that case, AHI entries and current index object mismatches
happens. When index object mismatches then InnoDB should drop
the page hash AHI entries for the block. In
btr_search_drop_page_hash_index(), InnoDB should take exclusive
lock on the AHI latch if index is already freed to avoid the
freed memory access during buf_pool_resize()
Problem was that thd::awake assumes now that you hold THD::LOCK_thd_data
so we need to keep it when we call wsrep_thd_awake from
wsrep_abort_transaction.
fsp_free_page() writes MLOG_INIT_FREE_PAGE, but does not update page
type. But fil_crypt_rotate_page() checks the type to understand if the
page is freshly initialized, and writes dummy record(updates space id)
to force rotation during recovery. This dummy record causes assertion
crash when the page is flushed after recovery, as it's supposed
that pages LSN is 0 for freshly initialized pages.
The bug is similiar to MDEV-24695, the difference is that in 10.5 the
assertion crashes during log record applying, but in 10.4 it crashes
during page flushing.
The fix could be in marking page as freed and not writing dummy record
during keys rotation procedure for such marked pages. But
bpage->file_page_was_freed is not consistent enough for release builds in
10.4, and the issue is fixed in 10.5 and does not exist in 10.[23] as
MLOG_INIT_FREE_PAGE was introduced since 10.4.
So the better solution is just to relax the assertion and implement some
additional property for freshly allocated pages, and check this property
during pages flushing.
The test is copied from MDEV-24695, the only change is in forcing pages
flushing after each server start to cause crash in non-fixed code.
There is no need to merge it to 10.5+, as the bug is already fixed by
MDEV-24695.
innobase_rename_column_try(): When renaming SYS_FIELDS records
for secondary indexes, try to use both formats of SYS_FIELDS.POS
as keys, in case the PRIMARY KEY includes a column prefix.
Without this fix, an ALTER TABLE that renames a column followed
by a server restart (or LRU eviction of the table definition
from dict_sys) would make the table inaccessible.
Add negative array indexes starting from the last
modified: storage/connect/bson.cpp
modified: storage/connect/bsonudf.cpp
modified: storage/connect/json.cpp
innodb_shutdown(): Check that fil_system.temp_space is not null
before invoking a member function.
This regression was caused by the merge
commit fa1aef39eb
of MDEV-24340 (commit 1eb59c307d).
bson.cpp:1775:3: error: case label value is less than minimum value for type [-Werror]
case TYPE_NULL:
bson.cpp:1776:7: error: statement will never be executed [-Werror=switch-unreachable]
b = true;
Analysis:
=========
Reason for test failure was a mutex deadlock between DeadlockChecker with stack
Thread 6 (Thread 0xffff70066070 (LWP 24667)):
0 0x0000ffff784e850c in __lll_lock_wait (futex=futex@entry=0xffff04002258, private=0) at lowlevellock.c:46
1 0x0000ffff784e19f0 in __GI___pthread_mutex_lock (mutex=mutex@entry=0xffff04002258) at pthread_mutex_lock.c:135
2 0x0000aaaaac8cd014 in inline_mysql_mutex_lock (src_file=0xaaaaacea0f28 "/home/buildbot/buildbot/build/mariadb-10.2.37/sql/wsrep_thd.cc", src_line=762, that=0xffff04002258) at /home/buildbot/buildbot/build/mariadb-10.2.37/include/mysql/psi/mysql_thread.h:675
3 wsrep_thd_is_BF (thd=0xffff040009a8, sync=sync@entry=1 '\001') at /home/buildbot/buildbot/build/mariadb-10.2.37/sql/wsrep_thd.cc:762
4 0x0000aaaaacadce68 in lock_rec_has_to_wait (for_locking=false, lock_is_on_supremum=<optimized out>, lock2=0xffff628952d0, type_mode=291, trx=0xffff62894070) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/lock/lock0lock.cc:826
5 lock_has_to_wait (lock1=<optimized out>, lock2=0xffff628952d0) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/lock/lock0lock.cc:873
6 0x0000aaaaacadd0b0 in DeadlockChecker::search (this=this@entry=0xffff70061fe8) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/lock/lock0lock.cc:7142
7 0x0000aaaaacae2dd8 in DeadlockChecker::check_and_resolve (lock=lock@entry=0xffff62894120, trx=trx@entry=0xffff62894070) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/lock/lock0lock.cc:7286
8 0x0000aaaaacae3070 in lock_rec_enqueue_waiting (c_lock=0xffff628952d0, type_mode=type_mode@entry=3, block=block@entry=0xffff62076c40, heap_no=heap_no@entry=2, index=index@entry=0xffff4c076f28, thr=thr@entry=0xffff4c078810, prdt=prdt@entry=0x0) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/lock/lock0lock.cc:1796
9 0x0000aaaaacae3900 in lock_rec_lock_slow (thr=0xffff4c078810, index=0xffff4c076f28, heap_no=2, block=0xffff62076c40, mode=3, impl=0) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/lock/lock0lock.cc:2106
10 lock_rec_lock (impl=false, mode=3, block=0xffff62076c40, heap_no=2, index=0xffff4c076f28, thr=0xffff4c078810) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/lock/lock0lock.cc:2168
11 0x0000aaaaacae3ee8 in lock_sec_rec_read_check_and_lock (flags=flags@entry=0, block=block@entry=0xffff62076c40, rec=rec@entry=0xffff6240407f "\303\221\342\200\232\303\220\302\265\303\220\302\272\303\221\302\201\303\221\342\200\232", index=index@entry=0xffff4c076f28, offsets=0xffff4c080690, offsets@entry=0xffff70062a30, mode=LOCK_X, mode@entry=1653162096, gap_mode=0, gap_mode@entry=281470749427104, thr=thr@entry=0xffff4c078810) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/lock/lock0lock.cc:6082
12 0x0000aaaaacb684c4 in sel_set_rec_lock (pcur=0xaaaac841c270, pcur@entry=0xffff4c077d58, rec=0xffff6240407f "\303\221\342\200\232\303\220\302\265\303\220\302\272\303\221\302\201\303\221\342\200\232", rec@entry=0x28 <error: Cannot access memory at address 0x28>, index=index@entry=0xffff4c076f28, offsets=0xffff70062a30, mode=281472334905456, type=281470749427104, thr=0xffff4c078810, thr@entry=0x9f, mtr=0x0, mtr@entry=0xffff70063928) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/row/row0sel.cc:1270
13 0x0000aaaaacb6bb64 in row_search_mvcc (buf=buf@entry=0xffff4c080690 "\376\026", mode=mode@entry=PAGE_CUR_GE, prebuilt=0xffff4c077b98, match_mode=match_mode@entry=1, direction=direction@entry=0) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/row/row0sel.cc:5181
14 0x0000aaaaacaae568 in ha_innobase::index_read (this=0xffff4c038a80, buf=0xffff4c080690 "\376\026", key_ptr=<optimized out>, key_len=768, find_flag=<optimized out>) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/handler/ha_innodb.cc:9393
15 0x0000aaaaac9201cc in handler::ha_index_read_map (this=0xffff4c038a80, buf=0xffff4c080690 "\376\026", key=0xffff4c07ccf8 "", keypart_map=keypart_map@entry=18446744073709551615, find_flag=find_flag@entry=HA_READ_KEY_EXACT) at /home/buildbot/buildbot/build/mariadb-10.2.37/sql/handler.cc:2718
16 0x0000aaaaac9f36b0 in Rows_log_event::find_row (this=this@entry=0xffff4c030098, rgi=rgi@entry=0xffff4c01b510) at /home/buildbot/buildbot/build/mariadb-10.2.37/sql/log_event.cc:13461
17 0x0000aaaaac9f3e44 in Update_rows_log_event::do_exec_row (this=0xffff4c030098, rgi=0xffff4c01b510) at /home/buildbot/buildbot/build/mariadb-10.2.37/sql/log_event.cc:13936
18 0x0000aaaaac9e7ee8 in Rows_log_event::do_apply_event (this=0xffff4c030098, rgi=0xffff4c01b510) at /home/buildbot/buildbot/build/mariadb-10.2.37/sql/log_event.cc:11101
19 0x0000aaaaac8ca4e8 in Log_event::apply_event (rgi=0xffff4c01b510, this=0xffff4c030098) at /home/buildbot/buildbot/build/mariadb-10.2.37/sql/log_event.h:1454
20 wsrep_apply_events (buf_len=0, events_buf=0x1, thd=0xffff4c0009a8) at /home/buildbot/buildbot/build/mariadb-10.2.37/sql/wsrep_applier.cc:164
21 wsrep_apply_cb (ctx=0xffff4c0009a8, buf=0x1, buf_len=18446743528248705000, flags=<optimized out>, meta=<optimized out>) at /home/buildbot/buildbot/build/mariadb-10.2.37/sql/wsrep_applier.cc:267
22 0x0000ffff7322d29c in galera::TrxHandle::apply (this=this@entry=0xffff4c027960, recv_ctx=recv_ctx@entry=0xffff4c0009a8, apply_cb=apply_cb@entry=0xaaaaac8c9fe8 <wsrep_apply_cb(void*, void const*, size_t, uint32_t, wsrep_trx_meta_t const*)>, meta=...) at /home/buildbot/buildbot/build/galera/src/trx_handle.cpp:317
23 0x0000ffff73239664 in apply_trx_ws (recv_ctx=recv_ctx@entry=0xffff4c0009a8, apply_cb=0xaaaaac8c9fe8 <wsrep_apply_cb(void*, void const*, size_t, uint32_t, wsrep_trx_meta_t const*)>, commit_cb=0xaaaaac8ca8d0 <wsrep_commit_cb(void*, uint32_t, wsrep_trx_meta_t const*, wsrep_bool_t*, bool)>, trx=..., meta=...) at /home/buildbot/buildbot/build/galera/src/replicator_smm.cpp:34
24 0x0000ffff7323c0c4 in galera::ReplicatorSMM::apply_trx (this=this@entry=0xaaaac7c7ebc0, recv_ctx=recv_ctx@entry=0xffff4c0009a8, trx=trx@entry=0xffff4c027960) at /home/buildbot/buildbot/build/galera/src/replicator_smm.cpp:454
25 0x0000ffff7323e8b8 in galera::ReplicatorSMM::process_trx (this=0xaaaac7c7ebc0, recv_ctx=0xffff4c0009a8, trx=0xffff4c027960) at /home/buildbot/buildbot/build/galera/src/replicator_smm.cpp:1258
26 0x0000ffff73268f68 in galera::GcsActionSource::dispatch (this=this@entry=0xaaaac7c7f348, recv_ctx=recv_ctx@entry=0xffff4c0009a8, act=..., exit_loop=@0xffff7006535f: false) at /home/buildbot/buildbot/build/galera/src/gcs_action_source.cpp:115
27 0x0000ffff73269dd0 in galera::GcsActionSource::process (this=0xaaaac7c7f348, recv_ctx=0xffff4c0009a8, exit_loop=@0xffff7006535f: false) at /home/buildbot/buildbot/build/galera/src/gcs_action_source.cpp:180
28 0x0000ffff7323ef5c in galera::ReplicatorSMM::async_recv (this=0xaaaac7c7ebc0, recv_ctx=0xffff4c0009a8) at /home/buildbot/buildbot/build/galera/src/replicator_smm.cpp:362
29 0x0000ffff73217760 in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at /home/buildbot/buildbot/build/galera/src/wsrep_provider.cpp:244
30 0x0000aaaaac8cb344 in wsrep_replication_process (thd=0xffff4c0009a8) at /home/buildbot/buildbot/build/mariadb-10.2.37/sql/wsrep_thd.cc:486
31 0x0000aaaaac8bc3a0 in start_wsrep_THD (arg=arg@entry=0xaaaac7cb3e38) at /home/buildbot/buildbot/build/mariadb-10.2.37/sql/wsrep_mysqld.cc:2173
32 0x0000aaaaaca89198 in pfs_spawn_thread (arg=<optimized out>) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/perfschema/pfs.cc:1869
33 0x0000ffff784defc4 in start_thread (arg=0xaaaaaca890d8 <pfs_spawn_thread(void*)>) at pthread_create.c:335
34 0x0000ffff7821c3f0 in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:89
and background victim transaction kill with stack
Thread 28 (Thread 0xffff485fa070 (LWP 24870)):
0 0x0000ffff784e530c in __pthread_cond_wait (cond=cond@entry=0xaaaac83e98e0, mutex=mutex@entry=0xaaaac83e98b0) at pthread_cond_wait.c:186
1 0x0000aaaaacb10788 in os_event::wait (this=0xaaaac83e98a0) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/os/os0event.cc:158
2 os_event::wait_low (reset_sig_count=2, this=0xaaaac83e98a0) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/os/os0event.cc:325
3 os_event_wait_low (event=0xaaaac83e98a0, reset_sig_count=<optimized out>) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/os/os0event.cc:507
4 0x0000aaaaacb98480 in sync_array_wait_event (arr=arr@entry=0xaaaac7dbb450, cell=@0xffff485f96e8: 0xaaaac7dbb560) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/sync/sync0arr.cc:471
5 0x0000aaaaacab53c8 in TTASEventMutex<GenericPolicy>::enter (line=19524, filename=0xaaaaacf2ce40 "/home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/handler/ha_innodb.cc", max_delay=<optimized out>, max_spins=0, this=0xaaaac83cc8c0) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/include/ib0mutex.h:516
6 PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0xaaaac83cc8c0, n_spins=<optimized out>, n_delay=<optimized out>, name=0xaaaaacf2ce40 "/home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/handler/ha_innodb.cc", line=19524) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/include/ib0mutex.h:637
7 0x0000aaaaacaaa52c in bg_wsrep_kill_trx (void_arg=0xffff4c057430) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/innobase/handler/ha_innodb.cc:19524
8 0x0000aaaaac79e7f0 in handle_manager (arg=arg@entry=0x0) at /home/buildbot/buildbot/build/mariadb-10.2.37/sql/sql_manager.cc:112
9 0x0000aaaaaca89198 in pfs_spawn_thread (arg=<optimized out>) at /home/buildbot/buildbot/build/mariadb-10.2.37/storage/perfschema/pfs.cc:1869
10 0x0000ffff784defc4 in start_thread (arg=0xaaaaaca890d8 <pfs_spawn_thread(void*)>) at pthread_create.c:335
11 0x0000ffff7821c3f0 in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:89
Fix:
====
Do not use THD::LOCK_thd_data mutex if we already hold lock_sys->mutex because it
will cause mutexing order violation. Victim transaction holding conflicting
locks can't be committed or rolled back while we hold lock_sys->mutex. Thus,
it is safe to do wsrep_thd_is_BF call with no additional mutexes.
recv_recovery_from_checkpoint_start(): Clear the recv_no_ibuf_operations
flag at the same time when we enabled writes to the log.
The failure to clear the flag might have caused some missed
change buffer merges, at least to the secondary index of SYS_TABLES
that were accessed by trx_resurrect_table_locks() while the last
recovery batch was in progress.
Thanks to Thirunarayanan Balathandayuthapani for suggesting this fix.
fil_rename_tablespace(): Do not write a redundant MLOG_FILE_RENAME2
record.
The recovery bug will be fixed later. The problem is that we are
invoking fil_op_replay_rename() too often, while we should skip
any 'intermediate' names of a tablespace and only apply the
very last rename for each tablespace identifier, and only if
the tablespace name is not already correct.
The assertion failed in handler::ha_reset upon SELECT under
READ UNCOMMITTED from table with index on virtual column.
This was the debug-only failure, though the problem is mush wider:
* MY_BITMAP is a structure containing my_bitmap_map, the latter is a raw
bitmap.
* read_set, write_set and vcol_set of TABLE are the pointers to MY_BITMAP
* The rest of MY_BITMAPs are stored in TABLE and TABLE_SHARE
* The pointers to the stored MY_BITMAPs, like orig_read_set etc, and
sometimes all_set and tmp_set, are assigned to the pointers.
* Sometimes tmp_use_all_columns is used to substitute the raw bitmap
directly with all_set.bitmap
* Sometimes even bitmaps are directly modified, like in
TABLE::update_virtual_field(): bitmap_clear_all(&tmp_set) is called.
The last three bullets in the list, when used together (which is mostly
always) make the program flow cumbersome and impossible to follow,
notwithstanding the errors they cause, like this MDEV-17556, where tmp_set
pointer was assigned to read_set, write_set and vcol_set, then its bitmap
was substituted with all_set.bitmap by dbug_tmp_use_all_columns() call,
and then bitmap_clear_all(&tmp_set) was applied to all this.
To untangle this knot, the rule should be applied:
* Never substitute bitmaps! This patch is about this.
orig_*, all_set bitmaps are never substituted already.
This patch changes the following function prototypes:
* tmp_use_all_columns, dbug_tmp_use_all_columns
to accept MY_BITMAP** and to return MY_BITMAP * instead of my_bitmap_map*
* tmp_restore_column_map, dbug_tmp_restore_column_maps to accept
MY_BITMAP* instead of my_bitmap_map*
These functions now will substitute read_set/write_set/vcol_set directly,
and won't touch underlying bitmaps.
We may end up with an empty leaf page (containing only an ADD COLUMN
metadata record) that is not the root page.
innobase_add_instant_try(): Disable an optimization for a non-canonical
empty table that contains a metadata record somewhere else than in
the root page.
btr_pcur_store_position(): Tolerate a non-canonical empty table.
mutex order violation here.
when wsrep bf thread kills a conflicting trx, the stack is
wsrep_thd_LOCK()
wsrep_kill_victim()
lock_rec_other_has_conflicting()
lock_clust_rec_read_check_and_lock()
row_search_mvcc()
ha_innobase::index_read()
ha_innobase::rnd_pos()
handler::ha_rnd_pos()
handler::rnd_pos_by_record()
handler::ha_rnd_pos_by_record()
Rows_log_event::find_row()
Update_rows_log_event::do_exec_row()
Rows_log_event::do_apply_event()
Log_event::apply_event()
wsrep_apply_events()
and mutexes are taken in the order
lock_sys->mutex -> victim_trx->mutex -> victim_thread->LOCK_thd_data
When a normal KILL statement is executed, the stack is
innobase_kill_query()
kill_handlerton()
plugin_foreach_with_mask()
ha_kill_query()
THD::awake()
kill_one_thread()
and mutexes are
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
To fix the mutex order violation we kill the victim thd asynchronously,
from the manager thread
Ever since commit 947efe17ed
InnoDB no longer writes binlog position in one place.
It will not at all be written to the TRX_SYS page, and
instead it will be written to the undo log header page that
changes the transaction state.
trx_rseg_mem_restore(): Recover the information from the latest
written page.
I_S tables were materialized too late, an attempt to use table
statistics before the table was created caused a crash.
Let's move table creation up. it only needs read_set to
be calculated properly, this happens in JOIN::optimize_inner(),
after semijoin transformation.
Note that tables are not populated at that point, so most of the
statistics would make no sense anyway. But at least field sizes
will be correct. And it won't crash.
innodb_io_capacity_update(): When the requested innodb_io_capacity
exceeds innodb_io_capacity_max and is more than half the maximum,
do not double it for computing innodb_io_capacity_max.
This integer arithmetics overflow was introduced in
commit 0f32299437 (MDEV-7035).
No test case is added, because sizeof(ulong) varies between platforms.
Database name mismatch happens while opening the table for
virtual column computation. Because table_name_parse() returns
the length of database and table name before converting the
filename to table name. This issue is caused by
commit 8b0d4cff07 (MDEV-15855).
Fix should be that table_name_parse() should return the length of
database and table name after converting the filename to table name.
Reviewed-by: Marko Mäkelä
When online alter rollbacks due to MDL time out, it doesn't mark the
index online status as ONLINE_INDEX_ABORTED. Concurrent update fails
to update the secondary index while building the entry.
InnoDB should check the online status of the secondary index before
building the secondary index entry.
Reviewed-by: Marko Mäkelä
Some DML operations on tables having unique secondary keys cause scanning
in the secondary index, for instance to find potential unique key violations
in the seconday index. This scanning may involve GAP locking in the index.
As this locking happens also when applying replication events in high priority
applier threads, there is a probabality for lock conflicts between two wsrep
high priority threads.
This PR avoids lock conflicts of high priority wsrep threads, which do
secondary index scanning e.g. for duplicate key detection.
The actual fix is the patch in sql_class.cc:thd_need_ordering_with(), where
we allow relaxed GAP locking protocol between wsrep high priority threads.
wsrep high priority threads (replication appliers, replayers and TOI processors)
are ordered by the replication provider, and they will not need serializability
support gained by secondary index GAP locks.
PR contains also a mtr test, which exercises a scenario where two replication
applier threads have a false positive conflict in GAP of unique secondary index.
The conflicting local committing transaction has to replay, and the test verifies
also that the replaying phase will not conflict with the latter repllication applier.
Commit also contains new test scenario for galera.galera_UK_conflict.test,
where replayer starts applying after a slave applier thread, with later seqno,
has advanced to commit phase. The applier and replayer have false positive GAP
lock conflict on secondary unique index, and replayer should ignore this.
This test scenario caused crash with earlier version in this PR, and to fix this,
the secondary index uniquenes checking has been relaxed even further.
Now innodb trx_t structure has new member: bool wsrep_UK_scan, which is set to
true, when high priority thread is performing unique secondary index scanning.
The member trx_t::wsrep_UK_scan is defined inside WITH_WSREP directive, to make
it possible to prepare a MariaDB build where this additional trx_t member is
not present and is not used in the code base. trx->wsrep_UK_scan is set to true
only for the duration of function call for: lock_rec_lock() trx->wsrep_UK_scan
is used only in lock_rec_has_to_wait() function to relax the need to wait if
wsrep_UK_scan is set and conflicting transaction is also high priority.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
The parameter innodb_idle_flush_pct that was introduced in
MariaDB Server 10.1.2 by MDEV-6932 has no effect ever since
the InnoDB changes from MySQL 5.7.9 were applied in
commit 2e814d4702.
Let us declare the parameter as deprecated and having no effect.
When executing a query like "select id, 0 as const, val from ...", there are 3 columns(items) in Query->select at handlerton->create_group_by(). After that, MariaDB makes a temporary table with 2 columns. The skipped items are const item, so fixing Spider to skip const items for items at Query->select.
Server part:
kill_handlerton() was accessing thd->ha_data[] for some other thd,
while it could be concurrently modified by its owner thd.
protect thd->ha_data[] modifications with a mutex.
require this mutex when accessing thd->ha_data[] from kill_handlerton.
InnoDB part:
on close_connection, detach trx from thd before freeing the trx
with wrong data type is added
Inplace alter fails to report error when fts_doc_id column with
wrong data type is added.
prepare_inplace_alter_table_dict(): Should check whether the column
is fts_doc_id. It should be of bigint type, should accept non null
data type and it should be in capital letters.