When spider encounters Item_aggregate_ref that is valid at gbh
creation, it could become invalid at gbh execution due to item list
substitution*. Therefore we ban Item_aggregate_ref for spider gbh
creation. To that end, we also make sure that str is NULL if and only
if in the creation stage, not the execution stage, including removing
a redundant check when str is not NULL.
*: Note that it is likely the same scenario as in MDEV-25116.
We spare spider_table_remove_share_from_crd/sts_thread and
spider_table_add_share_to_crd/sts_thread because the function body has
too many fields with crd/sts in the name and macro affects
debuggability.
At TRANSACTION ISOLATION LEVEL SERIALIZABLE, InnoDB would fail to flag
a write/read conflict, which would be a violation already at the more
relaxed REPEATABLE READ level when innodb_snapshot_isolation=ON.
Fix: Create a read view and start the transaction at the same time.
Thus, lock checks will be able to consult the correct read view
to flag ER_CHECKREAD if we are about to lock a record that was committed
after the start of our transaction.
innobase_start_trx_and_assign_read_view(): At any other isolation level
than READ UNCOMMITTED, do create a read view. This is needed for the
correct operation of START TRANSACTION WITH CONSISTENT SNAPSHOT.
ha_innobase::store_lock(): At SERIALIZABLE isolation level, if the
transaction was not started yet, start it and open a read view.
An alternative way to achieve this would be to make trans_begin()
treat START TRANSACTION (or BEGIN) in the same way as
START TRANSACTION WITH CONSISTENT SNAPSHOT when the isolation level
is SERIALIZABLE.
innodb_isolation_level(const THD*): A simpler version of
innobase_map_isolation_level(). Compared to earlier, we will return
READ UNCOMMITTED also if the :newraw option is set for the
InnoDB system tablespace.
Reviewed by: Vladislav Lesin
page_delete_rec_list_end(): Do not attempt to scrub the data of
an empty record.
The test case would reproduce a debug assertion failure in branches
where commit 358921ce32 (MDEV-26938)
is present. MariaDB Server 10.6 only supports ascending indexes,
and in those, the empty string would always be sorted first, never
last in a page.
Nevertheless, we fix the bug also in 10.6, in case it would be
reproducible in a slightly different scenario.
Reviewed by: Thirunarayanan Balathandayuthapani
Problem:
=======
- During slow shutdown, expectation is that change buffer index to
merge all the change buffer pages and free all the change
buffer pages except root. But InnoDB fails to free the
freed change buffer pages from segment and returns it to
system tablespace.
Solution:
=========
- During slow shutdown, remove all pages from free list of change
buffer and frees it to system tablespace
While updating the persistent defragmentation statistics
for the table, InnoDB opens the table only if it is in cache.
If dict_table_open_on_id() fails to find the table in cache
then it fails to unfreeze dict_sys.latch. This lead to crash
Problem:
=======
- There are two failures occurs for this test case:
(1) set global innodb_buf_flush_list_now=1 doesn't make sure that pages
are being flushed.
(2) InnoDB page cleaner thread aborts while writing the checkpoint information.
Problem is that When InnoDB startup aborts, InnoDB changes the shutdown
state to SRV_SHUTDOWN_EXIT_THREADS. By changing the shutdown state, InnoDB
doesn't advance the log_sys.lsn (avoids fil_names_clear()).
After InnoDB shutdown(innodb_shutdown()) is being initiated, shutdown state
again changed to SRV_SHUTDOWN_INITIATED. This leads the page cleaner thread
to fail with assertion ut_ad(srv_shutdown_state > SRV_SHUTDOWN_INITIATED)
in log_write_checkpoint_info()
Solution:
=========
(1) In order to avoid (1) failure, InnoDB can make the
variable innodb_max_dirty_pages_pct_lwm, innodb_max_dirty_pages_pct to 0.
Also make sure that InnoDB doesn't have any dirty pages in buffer pool
by adding wait_condition.
(2) Avoid changing the srv_shutdown_state to SRV_SHUTDOWN_EXIT_THREADS
when the InnoDB startup aborts
When REPAIRing a CSV table, the CSV engine marks the table as having
no rows in the case that there are rows only in memory but not yet
written to disk. In this case, there's actually nothing to repair,
since nothing exists on disk; make REPAIR idempotent for CSV tables.
This controls which linux implementation to use for
innodb_use_native_aio=ON.
innodb_linux_aio=auto is equivalent to innodb_linux_aio=io_uring when
it is available, and falling back to innodb_linux_aio=aio when not.
Debian packaging is no longer aio exclusive or uring, so
for those older Debian or Ubuntu releases, its a remove_uring directive.
For more recent releases, add mandatory liburing for consistent packaging.
WITH_LIBAIO is now an independent option from WITH_URING.
LINUX_NATIVE_AIO preprocessor constant is renamed to HAVE_LIBAIO,
analogous to existing HAVE_URING.
tpool::is_aio_supported(): A common feature check.
is_linux_native_aio_supported(): Remove. This had originally been added in
mysql/mysql-server@0da310b69d in 2012
to fix an issue where io_submit() on CentOS 5.5 would return EINVAL
for a /tmp/#sql*.ibd file associated with CREATE TEMPORARY TABLE.
But, starting with commit 2e814d4702 InnoDB
temporary tables will be written to innodb_temp_data_file_path.
The 2012 commit said that the error could occur on "old kernels".
Any GNU/Linux distribution that we currently support should be based
on a newer Linux kernel; for example, Red Hat Enterprise Linux 7
was released in 2014.
tpool::create_linux_aio(): Wraps the Linux implementations:
create_libaio() and create_liburing(), each defined in separate
compilation units (aio_linux.cc, aio_libaio.cc, aio_liburing.cc).
The CMake definitions are simplified using target_sources() and
target_compile_definitions(), all available since CMake 2.8.12.
With this change, there is no need to include ${CMAKE_SOURCE_DIR}/tpool
or add TPOOL_DEFINES flags anymore, target_link_libraries(lib tpool)
does all that.
This is joint work with Daniel Black and Vladislav Vaintroub.
In a UBSAN debug build, the comparisons with next_mrec_end are made
with index->online_log's head/tail members' block ptr with a sort buffer
size offset (1048576).
The logic that flows though to this point means that even srv_sort_buf_size
above a null pointer wouldn't contain the value of next_mrec_end.
As such this is a UBSAN type fix where we first check if the
head.block / tail.block is null before doing the asserts around
this debug condition. This would be required for the assertions
conditions not to segfault anyway.
Problem:
=========
- During truncation of a fulltext table, InnoDB does create the
table and does insert the default config fts values in fulltext
common config table using create table transaction.
- Before committing the create table transaction, InnoDB does update
the dictionary by loading the stopword into fts cache
and write the stopword configuration into fulltext common
config table by creating a separate transaction. This leads to
lock wait timeout error and rollbacks the transaction.
- But truncate table holds dict_sys.lock and rollback also
tries to acquire dict_sys.lock. This leads to assertion during
rollback.
Solution:
=========
ha_innobase::truncate(): Commit the create table transaction
before updating the dictionary after create table.
MSAN has been updated since 2022 when this macro was added
and as such the working around MSAN's deficient understanding
of the fstat/stat syscall behaviour at the time is no longer
required.
As an effective no-op a straight removal is sufficient.
Problem:
=======
- InnoDB unpoisons the freed page memory to make sure that
no other thread uses this freed page. In buf_pool_t::close(),
InnoDB unmap() the buffer pool memory during shutdown or it
encountered during startup. Later at some point, server
re-uses the same virtual address using mmap() and writes into
memory region. This leads to use_after_poison error.
This issue doesn't happen in latest clang and gcc version.
Older version of clang and gcc can still fail with this error.
ASAN should unpoison the memory while reusing the same virtual
address. This issue was already raised in
https://github.com/google/sanitizers/issues/1705
Fix:
===
In order to avoid this failure, let's unpoison the buffer
pool memory explictly during buf_pool_t::close() for
lesser than gcc-14 and clang-18 version.
Now that RocksDB has been synced up to 6.29 which includes the changes
mentioned in the CMake comment support for building on non-Linux aarch64
OSes can be enabled.
As of CMake 3.24 CMAKE_COMPILER_IS_GNU(CC|CXX) are deprecated and should
be replaced with CMAKE_(C|CXX)_COMPILER_ID which were introduced with
CMake 2.6.
In the main.plugin this function is called assuming the function
prototype int (*)(THD *, st_mysql_show_var *, void *, system_status_var *, enum_var_type)'
as changed in b4ff64568c.
We update the ha_example::show_func_example to match the prototype
on which it is called.
If the execution of the two reads in log_t::get_lsn_approx() is
interleaved with concurrent writes of those fields in
log_t::write_buf() or log_t::persist(), the returned approximation
will be an upper bound. If log_t::append_prepare_wait() is pending,
the approximation could be a lower bound.
We must adjust each caller of log_t::get_lsn_approx() for the
possibility that the return value is larger than
MAX(oldest_modification) in buf_pool.flush_list.
af_needed_for_redo(): Add a comment that explains why the glitch
is not a problem.
page_cleaner_flush_pages_recommendation(): Revise the logic for
the unlikely case cur_lsn < oldest_lsn. The original logic would have
invoked af_get_pct_for_lsn() with a very large age value, which
would likely cause an overflow of the local variable lsn_age_factor,
and make pct_for_lsn a "random number". Based on that value,
total_ratio would be normalized to something between 0.0 and 1.0.
Nothing extremely bad should have happened in this case;
the innodb_io_capacity_max should not be exceeded.
buf_pool_t::resize(): After successfully shrinking the buffer pool,
announce the success. The size had already been updated in shrunk().
After failing to shrink the buffer pool, re-enable the adaptive
hash index if it had been enabled.
Reviewed by: Debarun Banerjee
Calling SetArrayOptions with Nodes[i - 1].Key as its nm argument
exposed a MemorySanitizer error when i=0 for the tests:
* connect.bson_udf
* connect.json_udf
* connect.json_udf_bin
Its assumed that a basic optimization would have eliminated
these invalid expressions.
As the nm argument was unused, it has been removed.
Clang ~16+ on MSAN became quite strict with uninitalized
data being passed and returned from functions. Non-debug builds
have a basic optimization that hides these from those builds
Two innodb cases violate the assumptions, however once inlined
with a basic optimization those that existed for uninitialized
values are removed.
(MDEV-36316) rec_set_bit_field_2 calling mach_read_from_2 hits a read of
bits it wasn't actually changing.
(MDEV-36327) The function dict_process_sys_columns_rec left
nth_v_col uninitialized unless it was a virtual column. This was
ok as the function i_s_sys_columns_fill_table also didn't read
this value unless it was a virtual column.
Problem:
=======
- In 10.11, During Copy algorithm, InnoDB does use bulk insert
for row by row insert operation. When temporary directory
ran out of memory, row_mysql_handle_errors() fails to handle
DB_TEMP_FILE_WRITE_FAIL.
- During inplace algorithm, concurrent DML fails to write
the log operation into the temporary file. InnoDB fail to
mark the error for the online log.
- ddl_log_write() releases the global ddl lock prematurely before
release the log memory entry
Fix:
===
row_mysql_handle_errors(): Rollback the transaction when
InnoDB encounters DB_TEMP_FILE_WRITE_FAIL
convert_error_code_to_mysql(): Report an aborted transaction
when InnoDB encounters DB_TEMP_FILE_WRITE_FAIL during
alter table algorithm=copy or innodb bulk insert operation
row_log_online_op(): Mark the error in online log when
InnoDB ran out of temporary space
fil_space_extend_must_retry(): Mark the os_has_said_disk_full
as true if os_file_set_size() fails
btr_cur_pessimistic_update(): Return error code when
btr_cur_pessimistic_insert() fails
ddl_log_write(): Release the global ddl lock after releasing
the log memory entry when error was encountered
btr_cur_optimistic_update(): Relax the assertion that
blob pointer can be null during rollback because InnoDB can
ran out of space while allocating the external page
ha_innobase::extra(): Rollback the transaction during DDL before
calling convert_error_code_to_mysql().
row_undo_mod_upd_exist_sec(): Remove the assertion which says
that InnoDB should fail to build index entry when rollbacking
an incomplete transaction after crash recovery. This scenario
can happen when InnoDB ran out of space.
row_upd_changes_ord_field_binary_func(): Relax the assertion to
make that externally stored field can be null when InnoDB ran out
of space.
Problem:
=======
- During inplace algorithm, concurrent DML fails to write
the log operation into the temporary file. InnoDB fail to
mark the error for the online log.
- ddl_log_write() releases the global ddl lock prematurely before
release the log memory entry
Fix:
===
row_log_online_op(): Mark the error in online log when
InnoDB ran out of temporary space
fil_space_extend_must_retry(): Mark the os_has_said_disk_full
as true if os_file_set_size() fails
btr_cur_pessimistic_update(): Return error code when
btr_cur_pessimistic_insert() fails
ddl_log_write(): Release the global ddl lock after releasing the
log memory entry when error was encountered
btr_cur_optimistic_update(): Relax the assertion that
blob pointer can be null during rollback because InnoDB can
ran out of space while allocating the external page
row_undo_mod_upd_exist_sec(): Remove the assertion which says
that InnoDB should fail to build index entry when rollbacking
an incomplete transaction after crash recovery. This scenario
can happen when InnoDB ran out of space.
row_upd_changes_ord_field_binary_func(): Relax the assertion to
make that externally stored field can be null when InnoDB ran out
of space.
When table2myisam() prepares recinfo structures BIT field was skipped
because pack_length_in_rec() returns 0. Instead of BIT field
DB_ROW_HASH_1 field was taken into recinfo structure and its length
was added to reclength. This is wrong because not stored fields must
not be prepared as record columns (MI_COLUMNDEF) in storage
layer. 0-length fields are prepared in "reserve space for null bits"
branch.
The problem only occurs with tables where there is no data for the
main record outside of the null bits.
The fix updates minpos condition so we avoid fields after
stored_rec_length and these are not stored by
definition. share->reclength already includes not stored lengths from
CREATE TABLE so we cannot use it as minpos starting point.
In Aria there is no "reserve space for null bits" and it does not
create column definition for BIT. Also there is no
setup_vcols_for_repair() to reproduce the issue. But nonetheless it
creates column definition for not stored fields redundantly, so we
patch table2maria() as well. The test case for Aria tries to
demonstrate BIT field works, it does not reproduce any issues (as
redundant column definition for not stored field seem to not cause any
problems).
ER_DUP_ENTRY on partitioned table
Now as c1492f3d07 (MDEV-36115) restores m_last_part table->file
points to partition p0 while the error happens in p1, so error index
does not match ib_table in innobase_get_mysql_key_number_for_index().
This case is handled by separate code block in
innobase_get_mysql_key_number_for_index() which was wrong on using
secondary index for dict_index_is_auto_gen_clust() and it was not
covered by the tests.
page_is_corrupted(): Do not allocate the buffers from stack,
but from the heap, in xb_fil_cur_open().
row_quiesce_write_cfg(): Issue one type of message when we
fail to create the .cfg file.
update_statistics_for_table(), read_statistics_for_table(),
delete_statistics_for_table(), rename_table_in_stat_tables():
Use a common stack buffer for Index_stat, Column_stat, Table_stat.
ha_connect::FileExists(): Invoke push_warning_printf() so that
we can avoid allocating a buffer for snprintf().
translog_init_with_table(): Do not duplicate TRANSLOG_PAGE_SIZE_BUFF.
Let us also globally enable the GCC 4.4 and clang 3.0 option
-Wframe-larger-than=16384 to reduce the possibility of introducing
such stack overflow in the future. For RocksDB and Mroonga we relax
these limits.
Reviewed by: Vladislav Lesin
In commit b6923420f3 (MDEV-29445)
some hash tables were accidentally created with the minimum size
(101 entries) instead of correctly deriving the size from the
initial innodb_buffer_pool_size. This led to very long hash bucket
chains, which are very slow to traverse.
ut_find_prime(): Assert that the size is nonzero in order to catch
this type of regression in the future.
innodb_init_params(): Do not bother reading buf_pool.curr_size()
when it is known to be 0,
srv_start(): Correctly initialize srv_lock_table_size to 5 times
buf_pool.curr_size(), that is, the buffer pool size in pages,
between invoking buf_pool.create() and lock_sys.create().
btr_search_enable(), dict_sys_t::create(), dict_sys_t::resize():
Correctly refer to buf_pool.curr_pool_size(), that is,
innodb_buffer_pool_size in bytes, when calculating the hash table size.
In MDEV-29445 the expressions buf_pool_get_curr_size() were
accidentally replaced with buf_pool.curr_size().
buf_buddy_shrink(): Properly cover the case when KEY_BLOCK_SIZE
corresponds to the innodb_page_size, that is, the ROW_FORMAT=COMPRESSED
page frame is directly allocated from the buffer pool, not via the
binary buddy allocator.
buf_LRU_check_size_of_non_data_objects(): Avoid a crash when the
buffer pool is being shrunk.
buf_pool_t::shrink(): Abort if over 95% of the shrunk buffer pool
would be occupied by the adaptive hash index or record locks.
In commit b6923420f3 (MDEV-29445)
we started to specify the MAP_POPULATE flag for allocating the
InnoDB buffer pool. This would cause a lot of time to be spent
on __mm_populate() inside the Linux kernel, such as 16 seconds
to pre-fault or commit innodb_buffer_pool_size=64G.
Let us revert to the previous way of allocating the buffer pool
at startup. Note: An attempt to increase the buffer pool size by
SET GLOBAL innodb_buffer_pool_size (up to innodb_buffer_pool_size_max)
will invoke my_virtual_mem_commit(), which will use MAP_POPULATE
to zero-fill and prefault the requested additional memory area, blocking
buf_pool.mutex.
Before MDEV-29445 we allocated the InnoDB buffer pool by invoking
mmap(2) once (via my_large_malloc()). After the change, we would
invoke mmap(2) twice, first via my_virtual_mem_reserve() and then
via my_virtual_mem_commit(). Outside Microsoft Windows, we are
reverting back to my_large_malloc() like allocation.
my_virtual_mem_reserve(): Define only for Microsoft Windows.
Other platforms should invoke my_large_virtual_alloc() and
update_malloc_size() instead of my_virtual_mem_reserve() and
my_virtual_mem_commit().
my_large_virtual_alloc(): Define only outside Microsoft Windows.
Do not specify MAP_NORESERVE nor MAP_POPULATE, to preserve compatibility
with my_large_malloc(). Were MAP_POPULATE specified, the mmap()
system call would be significantly slower, for example 18 seconds
to reserve 64 GiB upfront.
log_t::append_prepare_wait(): Do not attempt to read log_sys.write_lsn
because it is not protected by log_sys.latch but by write_lock, which
we cannot hold here. The assertion could fail if log_t::write_buf()
is executing concurrently, and it has not yet executed log_write_buf()
or updated log_sys.write_lsn.
Fixes up commit acd071f599 (MDEV-21923)
log_t::append_prepare_wait(): Do not attempt to read log_sys.write_lsn
because it is not protected by log_sys.latch but by write_lock, which
we cannot hold here. The assertion could fail if log_t::write_buf()
is executing concurrently, and it has not yet executed log_write_buf()
or updated log_sys.write_lsn.
Fixes up commit acd071f599 (MDEV-21923)