Problem:
=======
- In 10.11, during the copy algorithm, InnoDB uses bulk insert
for row-by-row insert operations. When the temporary directory
runs out of space, row_mysql_handle_errors() fails to handle
DB_TEMP_FILE_WRITE_FAIL.
- During the inplace algorithm, concurrent DML fails to write
the log operation into the temporary file. InnoDB fails to
mark the error for the online log.
- ddl_log_write() releases the global ddl lock prematurely, before
releasing the log memory entry
Fix:
===
row_mysql_handle_errors(): Rollback the transaction when
InnoDB encounters DB_TEMP_FILE_WRITE_FAIL
convert_error_code_to_mysql(): Report an aborted transaction
when InnoDB encounters DB_TEMP_FILE_WRITE_FAIL during
alter table algorithm=copy or innodb bulk insert operation
row_log_online_op(): Mark the error in online log when
InnoDB ran out of temporary space
fil_space_extend_must_retry(): Mark the os_has_said_disk_full
as true if os_file_set_size() fails
btr_cur_pessimistic_update(): Return error code when
btr_cur_pessimistic_insert() fails
ddl_log_write(): Release the global ddl lock after releasing
the log memory entry when error was encountered
btr_cur_optimistic_update(): Relax the assertion so that the
blob pointer can be null during rollback, because InnoDB can
run out of space while allocating the external page
ha_innobase::extra(): Roll back the transaction during DDL before
calling convert_error_code_to_mysql().
row_undo_mod_upd_exist_sec(): Remove the assertion which says
that InnoDB should fail to build the index entry when rolling back
an incomplete transaction after crash recovery. This scenario
can happen when InnoDB runs out of space.
row_upd_changes_ord_field_binary_func(): Relax the assertion so
that an externally stored field can be null when InnoDB runs out
of space.
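The ddl_log_write() ordering fix can be illustrated with a minimal sketch. All names below (DdlLogSketch, release_entry, write_entry, released_after_demo) are hypothetical and are not the server's API; the sketch only shows the memory entry being released while the lock protecting the shared free list is still held.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the DDL log state; not the real server types.
struct DdlLogSketch {
  std::mutex lock;             // plays the role of the global DDL log lock
  std::vector<int> free_list;  // memory entries available for reuse

  // Must be called with 'lock' held, because free_list is shared state.
  void release_entry(int entry) { free_list.push_back(entry); }

  // Returns true on error.
  bool write_entry(int entry, bool simulate_write_failure) {
    std::lock_guard<std::mutex> guard(lock);
    if (simulate_write_failure) {
      release_entry(entry);  // still under the lock
      return true;           // the guard unlocks only after this point
    }
    return false;
  }
};

// Helper for the example: number of entries on the free list after one
// failed and one successful write.
std::size_t released_after_demo() {
  DdlLogSketch log;
  bool failed = log.write_entry(7, true);
  bool ok = !log.write_entry(8, false);
  return (failed && ok) ? log.free_list.size() : 0;
}
```

Using a scoped guard makes the unlock the last action on every return path, so an error path cannot drop the lock before its cleanup.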
The problem was that the aria_backup_client code did not initialize
maria_tmpdir, which is used during recovery if repair table is needed to
reconstruct indexes.
Problem:
=======
- During the inplace algorithm, concurrent DML fails to write
the log operation into the temporary file. InnoDB fails to
mark the error for the online log.
- ddl_log_write() releases the global ddl lock prematurely, before
releasing the log memory entry
Fix:
===
row_log_online_op(): Mark the error in online log when
InnoDB ran out of temporary space
fil_space_extend_must_retry(): Mark the os_has_said_disk_full
as true if os_file_set_size() fails
btr_cur_pessimistic_update(): Return error code when
btr_cur_pessimistic_insert() fails
ddl_log_write(): Release the global ddl lock after releasing the
log memory entry when error was encountered
btr_cur_optimistic_update(): Relax the assertion so that the
blob pointer can be null during rollback, because InnoDB can
run out of space while allocating the external page
row_undo_mod_upd_exist_sec(): Remove the assertion which says
that InnoDB should fail to build the index entry when rolling back
an incomplete transaction after crash recovery. This scenario
can happen when InnoDB runs out of space.
row_upd_changes_ord_field_binary_func(): Relax the assertion so
that an externally stored field can be null when InnoDB runs out
of space.
rename_table_in_stat_tables(): Allocate TABLE_LIST[STATISTICS_TABLES]
from the heap instead of the stack, to pass -Wframe-larger-than=16384
in an optimized CMAKE_BUILD_TYPE=RelWithDebInfo build.
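A minimal sketch of moving a large array from the stack to the heap, assuming a hypothetical BigDescriptor type and an assumed table count of 3; the real TABLE_LIST and STATISTICS_TABLES definitions live in the server and are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical large per-table descriptor; the real TABLE_LIST is different.
struct BigDescriptor {
  char payload[8192];
  bool used;
};

constexpr std::size_t kStatTables = 3;  // assumed count, for illustration

// Allocating the array on the heap keeps this function's stack frame small.
// A local 'BigDescriptor tables[kStatTables]' would add about 24 KiB to the
// frame, which exceeds a 16384-byte frame limit.
std::size_t mark_all_used() {
  std::unique_ptr<BigDescriptor[]> tables(new BigDescriptor[kStatTables]());
  std::size_t count = 0;
  for (std::size_t i = 0; i < kStatTables; ++i) {
    tables[i].used = true;
    if (tables[i].used) ++count;
  }
  return count;  // the array is freed automatically when 'tables' goes away
}
```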
When table2myisam() prepares recinfo structures, the BIT field was skipped
because pack_length_in_rec() returns 0. Instead of the BIT field, the
DB_ROW_HASH_1 field was taken into the recinfo structure and its length
was added to reclength. This is wrong because not stored fields must
not be prepared as record columns (MI_COLUMNDEF) in the storage
layer. 0-length fields are prepared in the "reserve space for null bits"
branch.
The problem only occurs with tables where there is no data for the
main record outside of the null bits.
The fix updates the minpos condition so that we avoid fields located after
stored_rec_length, which are not stored by
definition. share->reclength already includes not stored lengths from
CREATE TABLE so we cannot use it as the minpos starting point.
In Aria there is no "reserve space for null bits" branch and it does not
create a column definition for BIT. Also there is no
setup_vcols_for_repair() to reproduce the issue. But nonetheless it
creates column definitions for not stored fields redundantly, so we
patch table2maria() as well. The test case for Aria tries to
demonstrate that the BIT field works; it does not reproduce any issues (as
a redundant column definition for a not stored field seems not to cause any
problems).
ER_DUP_ENTRY on partitioned table
Now that c1492f3d07 (MDEV-36115) restores m_last_part, table->file
points to partition p0 while the error happens in p1, so the error index
does not match ib_table in innobase_get_mysql_key_number_for_index().
This case is handled by a separate code block in
innobase_get_mysql_key_number_for_index(), which wrongly used a
secondary index for dict_index_is_auto_gen_clust() and was not
covered by the tests.
--result_format 2 command fails with assertion:
DbugExit (why=0x7fffffff49e0 "missing DBUG_RETURN or DBUG_VOID_RETURN
macro in function \"do_result_format_version\"\n") at
../src/dbug/dbug.c:2045
After the membership change on a newly synced node, this is just a
warning and is safe to suppress.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Test changes only. Both warnings are expected and
should be suppressed because we intentionally inject
different inconsistencies on two nodes and then join
them back with membership change.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Test changes only. Add wait_condition so that all nodes
are in the expected state and add debug output if issue
does reproduce.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Issue:
MariaDB acquires additional MDL locks on UPDATE/INSERT/DELETE statements
on tables with foreign keys. For example, if table t1 references t2, an
UPDATE to t1 will MDL lock t2 in addition to t1.
A replica may deliver an ALTER t1 and UPDATE t2 concurrently for
applying. Then the UPDATE may acquire an MDL lock on t1, followed by a
conflict when the ALTER attempts to MDL lock t1, causing a BF-BF
conflict.
Solution:
Additional keys for the referenced/foreign table need to be added
to avoid potential MDL conflicts between concurrent updates and DDLs.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Add wait_condition to wait until all inserted rows are replicated
so that show create table is deterministic.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
page_is_corrupted(): Do not allocate the buffers from stack,
but from the heap, in xb_fil_cur_open().
row_quiesce_write_cfg(): Issue one type of message when we
fail to create the .cfg file.
update_statistics_for_table(), read_statistics_for_table(),
delete_statistics_for_table(), rename_table_in_stat_tables():
Use a common stack buffer for Index_stat, Column_stat, Table_stat.
ha_connect::FileExists(): Invoke push_warning_printf() so that
we can avoid allocating a buffer for snprintf().
translog_init_with_table(): Do not duplicate TRANSLOG_PAGE_SIZE_BUFF.
Let us also globally enable the GCC 4.4 and clang 3.0 option
-Wframe-larger-than=16384 to reduce the possibility of introducing
such stack overflow in the future. For RocksDB and Mroonga we relax
these limits.
Reviewed by: Vladislav Lesin
In commit b6923420f3 (MDEV-29445)
some hash tables were accidentally created with the minimum size
(101 entries) instead of correctly deriving the size from the
initial innodb_buffer_pool_size. This led to very long hash bucket
chains, which are very slow to traverse.
ut_find_prime(): Assert that the size is nonzero in order to catch
this type of regression in the future.
innodb_init_params(): Do not bother reading buf_pool.curr_size()
when it is known to be 0.
srv_start(): Correctly initialize srv_lock_table_size to 5 times
buf_pool.curr_size(), that is, the buffer pool size in pages,
between invoking buf_pool.create() and lock_sys.create().
btr_search_enable(), dict_sys_t::create(), dict_sys_t::resize():
Correctly refer to buf_pool.curr_pool_size(), that is,
innodb_buffer_pool_size in bytes, when calculating the hash table size.
In MDEV-29445 the expressions buf_pool_get_curr_size() were
accidentally replaced with buf_pool.curr_size().
buf_buddy_shrink(): Properly cover the case when KEY_BLOCK_SIZE
corresponds to the innodb_page_size, that is, the ROW_FORMAT=COMPRESSED
page frame is directly allocated from the buffer pool, not via the
binary buddy allocator.
buf_LRU_check_size_of_non_data_objects(): Avoid a crash when the
buffer pool is being shrunk.
buf_pool_t::shrink(): Abort if over 95% of the shrunk buffer pool
would be occupied by the adaptive hash index or record locks.
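To illustrate why a hash table stuck at the minimum size hurts, here is a minimal sketch. The helper names are hypothetical, and find_prime_sketch() simply returns the smallest prime not below its argument, which is an assumption for illustration and not the algorithm used by ut_find_prime(). It does mirror the added assertion that the requested size is nonzero.

```cpp
#include <cassert>
#include <cstddef>

// Simple trial-division primality test; adequate for a sketch.
bool is_prime(std::size_t n) {
  if (n < 2) return false;
  for (std::size_t d = 2; d * d <= n; ++d)
    if (n % d == 0) return false;
  return true;
}

// Hypothetical helper: smallest prime >= n. A zero size is rejected so that
// an accidentally uninitialized size is caught early.
std::size_t find_prime_sketch(std::size_t n) {
  assert(n != 0);
  while (!is_prime(n)) ++n;
  return n;
}

// Average number of entries per bucket (rounded down), assuming a hash
// function that spreads entries uniformly.
std::size_t average_chain_length(std::size_t entries, std::size_t buckets) {
  return entries / buckets;
}
```

With one million entries, a table of 101 buckets averages about 9,900 entries per chain, while a table sized from the entry count keeps the average below one.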
In commit b6923420f3 (MDEV-29445)
we started to specify the MAP_POPULATE flag for allocating the
InnoDB buffer pool. This would cause a lot of time to be spent
on __mm_populate() inside the Linux kernel, such as 16 seconds
to pre-fault or commit innodb_buffer_pool_size=64G.
Let us revert to the previous way of allocating the buffer pool
at startup. Note: An attempt to increase the buffer pool size by
SET GLOBAL innodb_buffer_pool_size (up to innodb_buffer_pool_size_max)
will invoke my_virtual_mem_commit(), which will use MAP_POPULATE
to zero-fill and prefault the requested additional memory area, blocking
buf_pool.mutex.
Before MDEV-29445 we allocated the InnoDB buffer pool by invoking
mmap(2) once (via my_large_malloc()). After the change, we would
invoke mmap(2) twice, first via my_virtual_mem_reserve() and then
via my_virtual_mem_commit(). Outside Microsoft Windows, we are
reverting to my_large_malloc()-like allocation.
my_virtual_mem_reserve(): Define only for Microsoft Windows.
Other platforms should invoke my_large_virtual_alloc() and
update_malloc_size() instead of my_virtual_mem_reserve() and
my_virtual_mem_commit().
my_large_virtual_alloc(): Define only outside Microsoft Windows.
Do not specify MAP_NORESERVE nor MAP_POPULATE, to preserve compatibility
with my_large_malloc(). Were MAP_POPULATE specified, the mmap()
system call would be significantly slower, for example 18 seconds
to reserve 64 GiB upfront.
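A minimal sketch of an anonymous mapping without MAP_POPULATE on a POSIX-like system such as Linux. The function names are hypothetical and this is not the body of my_large_virtual_alloc(); it only shows that pages are faulted in on first touch instead of being pre-faulted inside the mmap() call.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Map 'size' bytes of zero-filled anonymous memory. Without MAP_POPULATE the
// call returns quickly and pages are faulted in lazily on first access.
void* alloc_lazy(std::size_t size) {
  void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}

void free_lazy(void* p, std::size_t size) {
  if (p) munmap(p, size);
}

// Helper for the example: map, touch one page, verify, unmap.
bool lazy_alloc_roundtrip(std::size_t size) {
  char* p = static_cast<char*>(alloc_lazy(size));
  if (!p) return false;
  bool zero_filled = (p[0] == 0);  // anonymous mappings start zero-filled
  p[size - 1] = 1;                 // the first touch faults this page in
  bool written = (p[size - 1] == 1);
  free_lazy(p, size);
  return zero_filled && written;
}
```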
log_t::append_prepare_wait(): Do not attempt to read log_sys.write_lsn
because it is not protected by log_sys.latch but by write_lock, which
we cannot hold here. The assertion could fail if log_t::write_buf()
is executing concurrently, and it has not yet executed log_write_buf()
or updated log_sys.write_lsn.
Fixes up commit acd071f599 (MDEV-21923)
There were two issues with the test:
1. A race between a wait_condition.inc and a status variable, where the
status variable Rpl_semi_sync_master_status could be ON before the
semi-sync connection finished establishing, resulting in
Rpl_semi_sync_master_clients showing 0 (instead of 1). To fix this,
we instead wait for Rpl_semi_sync_master_clients to be 1
before proceeding.
2. Another race between a wait_condition.inc and a status variable,
where the wait_condition waited for a process_list command of
'BINLOG DUMP' to disappear to infer that the binlog dump thread was
killed, after which we verified that the semi-sync state was correct
using status variables. However, the 'BINLOG DUMP' command is
overridden with a killed status before the semi-sync tear-down
happens, and thereby we could see invalid values. The fix for
this is to change the wait_condition to instead wait until the
connection with the replication user is gone, because that stays
through the binlog dump thread tear-down life-cycle.
The SQL service leaves the affected rows uninitialized.
The initialization of the spider plugin that uses
this service will fail under MSAN because there isn't
an initialized value to return at the end of the query.
Valgrind is single threaded and only changes threads as part of
system calls or waits.
Some busy loops were identified and fixed where the server assumes
that some other thread will change the state, which will not happen
with valgrind.
Based on a patch by Monty. The original patch introduced VALGRIND_YIELD,
which emits pthread_yield() only in valgrind builds. However, it was
agreed that it is a good idea to emit yield() unconditionally, so
that other affected schedulers (like SCHED_FIFO) benefit from this
change. Also avoid pthread_yield() in favour of the standard
std::this_thread::yield().
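A minimal sketch of a busy loop that yields on every iteration, so that a scheduler which does not preempt the spinning thread can still run the thread that changes the state. The function names are hypothetical and are not taken from the patch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Spin until 'flag' becomes true, giving up the CPU on every iteration so
// that the thread expected to set the flag gets a chance to run.
void wait_for_flag(const std::atomic<bool>& flag) {
  while (!flag.load(std::memory_order_acquire))
    std::this_thread::yield();
}

// Helper for the example: a producer publishes a value, then sets the flag.
int run_handoff() {
  std::atomic<bool> ready{false};
  int value = 0;
  std::thread producer([&] {
    value = 42;
    ready.store(true, std::memory_order_release);
  });
  wait_for_flag(ready);  // the acquire load makes 'value' visible here
  producer.join();
  return value;
}
```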
The minimum statistics level now is rocksdb::StatsLevel::kDisableAll.
The default remains rocksdb::StatsLevel::kExceptHistogramOrTimers
which is now 1 (it used to be 0).
Ensure that backup_reset_alter_copy_lock() is called in case of rollback
or error in mysql_inplace_alter_table() or copy_data_between_tables().
Other things:
- Improved error from mariabackup when unexpected DDL operation is
encountered.
- Added assert if backup_ddl_log() is called in the wrong context.
This is needed to make it easy for users to automatically ignore long
CHAR and VARCHAR columns when using ANALYZE TABLE PERSISTENT.
These fields can cause problems as they will consume
'CHARACTERS * MAX_CHARACTER_LENGTH * 2 * number_of_rows' space on disk
during analyze, which can easily be much bigger than the analyzed table.
This commit adds a new user variable, analyze_max_length, default value 4G.
Any field that is bigger than this in bytes will be ignored by
ANALYZE TABLE PERSISTENT unless it is specified in FOR COLUMNS().
While doing this patch, I noticed that we do not skip GEOMETRY columns in
ANALYZE TABLE, like we do with BLOB. This should be fixed when merging
to the 'main' branch. At the same time we should add a reasonable default
value for analyze_max_length, probably 1024, like we have for
max_sort_length.
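The disk space formula and the skip rule above can be worked through with a small sketch. The function names are hypothetical; only the formula and the "bigger than analyze_max_length unless listed in FOR COLUMNS()" rule come from the description.

```cpp
#include <cassert>
#include <cstdint>

// CHARACTERS * MAX_CHARACTER_LENGTH * 2 * number_of_rows, in bytes.
std::uint64_t analyze_space_estimate(std::uint64_t characters,
                                     std::uint64_t max_character_length,
                                     std::uint64_t number_of_rows) {
  return characters * max_character_length * 2 * number_of_rows;
}

// A field is ignored when it is bigger than analyze_max_length in bytes,
// unless it was explicitly listed in FOR COLUMNS().
bool is_ignored(std::uint64_t field_bytes, std::uint64_t analyze_max_length,
                bool listed_in_for_columns) {
  return field_bytes > analyze_max_length && !listed_in_for_columns;
}
```

For example, a 1000-character column with up to 4 bytes per character in a table of one million rows gives 1000 * 4 * 2 * 1000000 = 8000000000 bytes, roughly 8 GB of temporary space for that single column.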
In non-EXPLAIN queries with subqueries, the trace was flooded
with empty "join_execution":{} nodes. Now, they are gone.
The "Range checked for each record" optimization still prints
content into trace on join execution. Now, we wrap it into
"range-checked-for-each-record" to delimit the invocations.
This new object has the fields "select_id", which corresponds to
the outer query block, and "loop", which corresponds to
the inner query block iteration number. Additionally,
the field "row_estimation", which is itself an object, has
"table" and "range_analysis" fields that were moved
from the old "join_execution"'s steps array.
To check the rows, the table needs to be opened. To that end, and like
MDEV-36038, we force COPY algorithm on ALTER TABLE ... SEQUENCE=1.
This also results in checking the sequence state / metadata.
The table structure was already validated before this patch.
With view protocol collation_connection is reset in mysql_make_view in
the "SELECT * FROM mysqltest_tmp_v" query. In the case of
spider/bugfix.mdev_33434, it is reset to latin1_swedish_ci, with the
latin1 charset.
This results in no conversion needed since it is the same as
character_set_client and the corresponding argument in the udf remains
unchanged, with "dummy" srv value. Thus the reported error is
1477: 'The foreign server name you are trying to reference does not exist. Data source error: dummy'
Without view protocol, the character_set_connection ucs2 setting in
the test survives, and the conversion results in empty connection
parameters, and the reported error is 1429
ER_CONNECT_TO_FOREIGN_DATA_SOURCE
This failure is irrelevant to the test, or to spider at all. Therefore
we disable view protocol for the statement.
In spider/bugfix.mdev_29352, with flush tables with read lock,
statements blocked in THD::has_read_only_protection() by checking
THD::global_read_lock could cause view protocol to "hang" while waiting
to acquire an MDL in another THD.
In spider/bugfix.mdev_34555, within an XA transaction, statements
blocked by trans_check() by checking thd->transaction->xid_state could
cause view protocol to "hang" for the same reason.
Therefore we disable view protocol for relevant statements in these
tests.
Spider tables do not support SELECT SQL_CALC_FOUND_ROWS and the
correct test output is a coincidence. Debugging shows that the
limit_found_rows field was last updated in an unrelated statement:
SELECT STRAIGHT_JOIN a.a, a.b, date_format(b.c, '%Y-%m-%d %H:%i:%s') FROM ta_l a, tb_l b WHERE a.a = b.a ORDER BY a.a
As a byproduct, this fixes the "wrong found_rows() results" when
running these tests with view protocol.
The failure is caused by exec $stmt where $stmt has two queries.
mtr with view protocol transforms the first query into a view, leaving
the second query executed in the usual way. mtr being oblivious about
the second query then does not handle its results, resulting in
CR_COMMANDS_OUT_OF_SYNC. We disable view protocol for such edge cases.
After fixing these "Failed to drop view: 0: " errors, further failures emerge
from two of the tests, which are the same problem as MDEV-36454, so we
fix them too by disabling view protocol for the relevant SELECTs.