When MariaDB Server is run in a container under
Windows Subsystem for Linux, the fstat(2) system calls that InnoDB
invokes in os_file_set_size() or os_file_get_size() may fail if the
file had been renamed in the past while the file handle was open.
This affects at least ALTER TABLE and OPTIMIZE TABLE.
os_file_get_size(): Invoke lseek(2) instead of fstat(2). We do not mind
that the file pointer moves to the end of the file, because InnoDB
exclusively uses positioned reads and writes or, in some rare cases,
appends to an existing file.
os_file_set_size(): Invoke os_file_get_size() instead of fstat(2).
Define the POSIX and Windows versions separately. Formerly, the
Windows version was called os_file_change_size_win32().
fil_node_t::read_page0(): Use os_file_get_size() to determine the
size, and do not crash on error.
fil_node_t::read_metadata(): Remove the non-Windows stat* parameter
and always invoke fstat(2) outside Windows, but do tolerate errors.
Because fstat(2) is more likely to fail than lseek(2), and this is
not time-critical code, we can afford the extra lseek(2) system call.
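As a minimal sketch (not the actual InnoDB code), the file size can be
determined with lseek(2) like this, assuming a POSIX file descriptor:

  #include <sys/types.h>
  #include <unistd.h>

  // Moving the file pointer to the end of the file is harmless when the
  // caller only performs positioned reads and writes (pread/pwrite).
  off_t file_get_size(int fd)
  {
    return lseek(fd, 0, SEEK_END); // -1 on error, otherwise the size
  }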
Reviewed by: Vladislav Vaintroub
lock_rec_unlock_unmodified() is executed either under lock_sys.wr_lock()
or under a combination of lock_sys.rd_lock() + record locks hash table
cell latch. It also requests a page latch to check whether the locked
records were changed by the current transaction.
Usually InnoDB acquires a page latch to find a certain record on the
page, and then requests the lock_sys and/or record lock hash cell latch
to set a record lock. lock_rec_unlock_unmodified() requests the latches
in the opposite order, which causes deadlocks. One possible deadlock
scenario is the following:
thread 1 - lock_rec_unlock_unmodified() is invoked under locks hash table
cell latch, the latch is acquired;
thread 2 - purge thread acquires page latch and tries to remove
delete-marked record, it invokes lock_update_delete(), which
requests locks hash table cell latch, held by thread 1;
thread 1 - requests page latch, held by thread 2.
To fix it, we need to release lock_sys.latch and/or the lock hash cell
latch, acquire the page latch, and re-acquire the lock_sys-related latches.
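An illustrative sketch of the re-latching order, with std::mutex standing
in for the InnoDB latches (names hypothetical):

  #include <mutex>

  // Canonical order: page latch first, then the record-lock hash cell
  // latch. Release the cell latch and re-acquire both in that order.
  void relatch_in_canonical_order(std::mutex &page_latch,
                                  std::mutex &cell_latch)
  {
    cell_latch.unlock(); // was acquired in the "wrong" order on entry
    page_latch.lock();   // canonical order: page latch first
    cell_latch.lock();   // ...then the hash cell latch
    // anything checked earlier may have changed meanwhile: re-validate
  }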
When lock_sys.latch and/or the lock hash cell latch are released in
lock_release_on_prepare() and lock_release_on_prepare_try(), the page on
which the current lock is held can be merged. In this case the bitmap
of the current lock must be cleared, and either the new lock must be
added to the end of the trx->lock.trx_locks list or the bitmap of an
already existing lock must be changed.
The new field trx_lock_t::set_nth_bit_calls indicates whether new locks
(bits in existing lock bitmaps or new lock objects) were created while
lock_sys was released during the trx->lock.trx_locks list iteration in
lock_release_on_prepare() or lock_release_on_prepare_try(). If so, we
traverse the list again.
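A hedged sketch of that re-traversal (release_unmodified_locks() is a
hypothetical stand-in for the loop body):

  #include <cstdint>

  struct trx_lock_like { uint64_t set_nth_bit_calls= 0; };

  // Stand-in declaration: may temporarily release lock_sys, letting
  // other code create new locks and advance the counter.
  void release_unmodified_locks(trx_lock_like&);

  void release_on_prepare(trx_lock_like &lk)
  {
    for (;;)
    {
      const uint64_t before= lk.set_nth_bit_calls;
      release_unmodified_locks(lk);
      if (lk.set_nth_bit_calls == before)
        break; // no new locks appeared; the list is stable
    }
  }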
The block can be freed while pages are being merged, which causes an
assertion failure in buf_page_get_gen(), as btr_block_get() passes
BUF_GET as the page get mode. That is why a page_get_mode parameter was
added to btr_block_get(), to pass BUF_GET_POSSIBLY_FREED from
lock_release_on_prepare() and lock_release_on_prepare_try() to
buf_page_get_gen().
As searching for the id of the transaction that modified a secondary
index record is quite an expensive operation, restrict its usage on the
master. A system variable was added to remove the restriction, to
simplify testing. The variable exists only in debug builds or in builds
with the -DINNODB_ENABLE_XAP_UNLOCK_UNMODIFIED_FOR_PRIMARY option, to
increase the probability of catching bugs in release builds with RQG.
Note that the code which does the primary index lookup to find out what
transaction modified a secondary index record is necessary only when
there is no primary key and no unique secondary key on a replica with
row-based replication, because only in this case can extra X locks on
unmodified records be set during the scan phase.
Reviewed by Marko Mäkelä.
There is no need to exclude exclusive non-gap locks from the procedure
of releasing locks on XA PREPARE execution in
lock_release_on_prepare_try() after commit
17e59ed3aa (MDEV-33454), because
lock_rec_unlock_unmodified() should check whether the record was modified
by the XA transaction, and release the lock if it was not.
lock_release_on_prepare_try(): don't skip X-locks; let
lock_rec_unlock_unmodified() process them.
lock_sec_rec_some_has_impl(): add a template parameter to skip acquiring
trx_t::mutex when the caller already holds it (see the sketch after this
list); don't crash if the lock's bitmap is empty.
row_vers_impl_x_locked(), row_vers_impl_x_locked_low(): add a new
argument to skip acquiring trx_t::mutex.
rw_trx_hash_t::validate_element(): don't acquire trx_t::mutex if the
current thread already holds it.
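A hedged sketch of the template-parameter pattern used for
lock_sec_rec_some_has_impl(), with std::mutex standing in for
trx_t::mutex:

  #include <mutex>

  // One instantiation for callers that already hold the mutex, one for
  // callers that do not; an empty bitmap reports "no implicit lock"
  // instead of crashing.
  template <bool caller_holds_trx_mutex>
  bool sec_rec_has_impl(std::mutex &trx_mutex, bool bitmap_has_bits)
  {
    if (!caller_holds_trx_mutex)
      trx_mutex.lock();
    bool found= bitmap_has_bits /* && ...examine the bitmap... */;
    if (!caller_holds_trx_mutex)
      trx_mutex.unlock();
    return found;
  }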
Thanks to Andrei Elkin for finding the bug.
Reviewed by Marko Mäkelä, Debarun Banerjee.
convert_error_code_to_mysql(): Treat DB_DEADLOCK and DB_RECORD_CHANGED
in the same way, that is, signal to the SQL layer that the transaction
had been rolled back.
Replication of non-transactional engines is experimental and
uses TOI. This naturally means that if there is an open transaction
involving a transactional engine, its changes will be rolled back.
Fixed by emitting an error message, together with a warning, if a
non-transactional engine is part of a multi-engine transaction.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
It is possible that recv_sys.scanned_lsn is ahead of recv_sys.recovered_lsn
by a few 512-byte log blocks in case the last mini-transaction in the log
had not been written out completely before the server was killed.
This is occasionally the case when running the test
innodb.innodb-32k-crash.
log_sort_flush_list(): Correct some debug assertions that had been added in
commit 0d175968d1 (MDEV-31354).
The writes of some blocks may be completed and the oldest_modification()
set to 1 at any time.
The bogus assertion failures led to occasional failures of the test
innodb.innodb-32k-crash.
Removed 'purpose' parameter from os_file_create() and related functions.
Always use FILE_FLAG_OVERLAPPED when opening Windows files.
No performance regression was measured, nor is there any measurable
improvement.
A debug assertion in buf_LRU_get_free_block() could fail if
SET GLOBAL innodb_lru_scan_depth is being executed during a workload
that involves allocating buffer pool pages.
buf_pool_t::LRU_scan_depth: Replaces srv_LRU_scan_depth.
buf_pool_t::flush_neighbors: Replaces srv_flush_neighbors.
innodb_buf_pool_update<T>(): Update a parameter of buf_pool
while holding buf_pool.mutex.
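A hedged sketch of the innodb_buf_pool_update<T>() idea (simplified
stand-in types; not the actual declarations):

  #include <mutex>

  struct buf_pool_like { std::mutex mutex; unsigned LRU_scan_depth= 1536; };
  buf_pool_like buf_pool;

  // Assign the parameter only while holding buf_pool.mutex, so readers
  // such as buf_LRU_get_free_block() never race with the update.
  template <typename T>
  void buf_pool_update(T &param, T value)
  {
    std::lock_guard<std::mutex> g(buf_pool.mutex);
    param= value;
  }

  // usage: buf_pool_update(buf_pool.LRU_scan_depth, 2048u);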
- InnoDB fulltext rebuilds the FTS COMMON table while adding a
new fulltext index. This can be optimized by skipping the rebuild
when the FTS COMMON table already exists.
Reviewed-by: Marko Mäkelä <marko.makela@mariadb.com>
The invariant of write-ahead logging is that before any change to a
page is written to the data file, the corresponding log record must
first have been durably written.
On crash recovery, there were some sloppy checks for this. Let us
implement accurate checks and flag an inconsistency as a hard error,
so that we can avoid further corruption of a corrupted database.
For data extraction from the corrupted database, innodb_force_recovery
can be used.
Before recovery is reading any data pages or invoking
buf_dblwr_t::recover() to recover torn pages from the
doublewrite buffer, InnoDB will have parsed the log until the
final LSN and updated log_sys.lsn to that. So, we can rely on
log_sys.lsn at all times. The doublewrite buffer recovery has been
refactored in such a way that the recv_sys.dblwr.pages may be consulted
while discovering files and their page sizes, but nothing will be
written back to data files before buf_dblwr_t::recover() is invoked.
A section of the test mariabackup.innodb_redo_overwrite
that is parsing some mariadb-backup --backup output has
been removed, because that output "redo log block is overwritten"
would often be missing in a Microsoft Windows environment
as a result of these changes.
recv_max_page_lsn, recv_lsn_checks_on: Remove.
recv_sys_t::validate_checkpoint(): Validate the write-ahead-logging
condition at the end of the recovery.
recv_dblwr_t::validate_page(): Keep track of the maximum LSN
(if we are checking a non-doublewrite copy of a page) but
do not complain about the LSN being in the future. The doublewrite
buffer is a special case, because it will be read early during recovery.
Besides, starting with commit 762bcb81b5
the dblwr=true copies of pages may legitimately be "too new".
recv_dblwr_t::find_page(): Find a valid page with the smallest
FIL_PAGE_LSN that is in the valid range for recovery.
recv_dblwr_t::restore_first_page(): Replaced by find_page().
Only buf_dblwr_t::recover() will write to data files.
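A hedged sketch of the find_page() selection rule (simplified stand-in
types):

  #include <cstdint>
  #include <vector>

  struct dblwr_copy { uint64_t fil_page_lsn; /* frame, checksum, ... */ };

  // Among the doublewrite copies of one page, pick the one with the
  // smallest FIL_PAGE_LSN that lies within the usable recovery range.
  const dblwr_copy *find_page(const std::vector<dblwr_copy> &copies,
                              uint64_t checkpoint_lsn, uint64_t end_lsn)
  {
    const dblwr_copy *best= nullptr;
    for (const dblwr_copy &c : copies)
      if (c.fil_page_lsn >= checkpoint_lsn && c.fil_page_lsn <= end_lsn &&
          (!best || c.fil_page_lsn < best->fil_page_lsn))
        best= &c;
    return best; // nullptr when no copy is usable
  }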
buf_dblwr_t::recover(): Simplify the message output. Do attempt
doublewrite recovery on a user page read error. Ignore doublewrite
pages whose FIL_PAGE_LSN is outside the usable bounds. Previously,
we could wrongly recover a too new page from the doublewrite buffer.
It is unlikely that this could have led to an actual error.
Write back all recovered pages from the doublewrite buffer here,
including for the first page of any tablespace.
buf_page_is_corrupted(): Distinguish the return values
CORRUPTED_FUTURE_LSN and CORRUPTED_OTHER.
buf_page_check_corrupt(): Return the error code DB_CORRUPTION
in case the LSN is in the future.
Datafile::read_first_page(): Handle FSP_SPACE_FLAGS=0xffffffff
in the same way on both 32-bit and 64-bit architectures.
Datafile::read_first_page_flags(): Split from read_first_page().
Take a copy of the first page as a parameter.
recv_sys_t::free_corrupted_page(): Take the file as a parameter
and return whether a message was displayed. This avoids some duplicated
and incomplete error messages.
buf_page_t::read_complete(): Remove some redundant output and always
display the name of the corrupted file. Never return DB_FAIL;
use it only in internal error handling.
IORequest::read_complete(): Assume that buf_page_t::read_complete()
will have reported any error.
fil_space_t::set_corrupted(): Return whether this is the first time
the tablespace had been flagged as corrupted.
Datafile::validate_first_page(), fil_node_open_file_low(),
fil_node_open_file(), fil_space_t::read_page0(),
fil_node_t::read_page0(): Add a parameter for a copy of the
first page, and a parameter to indicate whether the FIL_PAGE_LSN
check should be suppressed. Before buf_dblwr_t::recover() is
invoked, we cannot validate the FIL_PAGE_LSN, but we can trust the
FSP_SPACE_FLAGS and the tablespace ID that may be present in a
potentially too new copy of a page.
Reviewed by: Debarun Banerjee
dict_index_t::clear(), btr_drop_temporary_table(): Make use of the
root page guess if it is available.
btr_read_autoinc(): Invoke btr_root_block_get() to access the root page.
btr_blob_free(): Retain a buffer-fix on the page across mtr_t::commit()
in order to avoid a buf_pool.page_hash lookup.
dict_load_table_one(): Remove a redundant check for page id. It was
already validated in buf_page_t::read_complete().
trx_t::apply_log(): Make use of buf_pool.page_fix() to avoid some
mtr_t related overhead.
Reviewed by: Thirunarayanan Balathandayuthapani
Remove the workaround for MDEV-13941: it served for 5 years, and all
affected pre-release 10.2 installations should have been fixed in the
meantime.
Apparently InnoDB uses the is_sparse parameter of os_file_set_size()
inconsistently, and it now passes is_sparse=false during the first file
extension. With the MDEV-13941 workaround in place, this would unsparse
the file, which makes compression not work at all anymore.
In commit b7b9f3ce82 (MDEV-34515) we
accidentally made the InnoDB MVCC code acquire a shared
purge_sys.latch twice. Recursive shared latch acquisition may cause a
deadlock of InnoDB threads if another thread starts waiting for an
exclusive latch in between.
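A hedged illustration of the hazard, with std::shared_mutex standing in
for purge_sys.latch:

  #include <shared_mutex>

  std::shared_mutex latch; // stand-in for purge_sys.latch

  void reader()
  {
    latch.lock_shared();  // first shared acquisition
    // ...another thread calls latch.lock() here and starts waiting...
    latch.lock_shared();  // may queue behind the writer: self-deadlock
    latch.unlock_shared();
    latch.unlock_shared();
  }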
purge_sys_t::latch: In debug builds, use srw_lock_debug instead of
srw_spin_lock, so that bugs like this will result in debug assertion
failures.
trx_undo_report_row_operation(): Pass the view_guard to
trx_undo_prev_version() and the rest of the arguments in the same
order, so that the work to permute argument registers is minimized.
Problem:
=======
- A redundant table fails to insert into the table after an instant
drop of a BLOB column. Instant drop column only marks the column as
hidden, and a subsequent INSERT statement tries to insert a NULL value
for the dropped BLOB column while the fixed length of the BLOB type is
returned as 65535. This leads to a 'row size too large' error.
Fix:
====
For a redundant table, if the non-fixed dropped column can be NULL,
then set the length of the field type to 0.
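A hedged sketch of the length computation after the fix (stand-in
types; not the actual InnoDB code):

  #include <cstdint>

  struct column { bool dropped, fixed_length, nullable; uint32_t type_len; };

  // A dropped, non-fixed, nullable column contributes 0 to the row-size
  // check instead of the BLOB type's fixed length of 65535.
  uint32_t row_size_contribution(const column &c)
  {
    if (c.dropped && !c.fixed_length && c.nullable)
      return 0;
    return c.type_len;
  }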
Stop skipping const items when selecting, but skip them when storing
their results to the spider row, to avoid storing into mismatching
temporary table fields.
Skip auxiliary fields when SELECTing, and accordingly do not store
the (non-existing) results to the corresponding temporary table.
When there are BOTH auxiliary fields AND const items in the auxiliary
field items, do not use the spider GBH. This is a rare occasion if it
happens at all and not worth the added complexity to cover it.
Use the original item (item_ptr) in constructing GROUP BY and ORDER
BY, which also means using item->name instead of field->field_name as
aliases in constructing SELECT items. This fixes spurious regressions
caused by the above changes in some tests using ORDER BY, such as
mdev_24517.test. As a by-product, this also fixes MDEV-29546.
Therefore we update mdev_29008.test to include the MDEV-29546 case.
Remove dead code in Spider related to Spider's Oracle OCI support.
The code has been disabled for a long time and it is unlikely that it
will ever be enabled.
During spider query construction of certain cast functions, spider
locates the last occurrence of a keyword in the output of the
Item::print() function and appends from there to the query constructed
so far. For example, consider the following query
SELECT * FROM t2 ORDER BY CAST(c AS INET6);
It constructs the following query and executes it at the data
node (assuming the data node table is called t0).
select cast(t0.`c` as inet6) ``,t0.`c` `c` from `test`.`t1` t0 order by ``
When the construction has completed the initial part
select cast(t0.`c`
It then attempts to construct the " as inet6" part. To that end, it
calls print() on the Item_typecast_fbt corresponding to the cast item,
and obtains
cast(`test`.`t2`.`c` as inet6)
It then looks for " as ", and places cursor there for appending:
cast(`test`.`t2`.`c` as inet6)
^
In this patch, if the search fails, i.e. there's no " as ...", we
make sure that the cursor is not placed before the beginning of the
string (out of bound).
We also relax the search from " as char" to " as " in the case of
CHAR_TYPECAST_FUNC, since there is more than one Item type with this
func type. For example, "AS INET6" is an Item_typecast_fbt which has
this func type.
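A hedged sketch of the bound check (a simplified stand-in for the actual
spider code):

  #include <cstring>

  // Find the last " as " in the printed item; if absent, clamp the
  // cursor to the start of the string instead of stepping before it.
  const char *append_cursor(const char *printed)
  {
    const char *pos= nullptr;
    for (const char *p= printed; (p= strstr(p, " as ")); ++p)
      pos= p; // remember the last occurrence
    return pos ? pos : printed; // never before the beginning
  }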
When a DDL statement results in a local partition table with
partitions not covering all values in the table, an error is emitted.
However, when the table in question is a spider table, the issue does
not surface until some future statements (DELETE in the test examples
in this commit) are executed. This is consistent with the design of
spider which aims to minimise connections with the data node. The
resulting error is legitimate and should not result in an assertion
failure. Similarly, a partitioned spider table could have misplaced
rows, so we remove the other assertion as well.
- document tmp_share: temporary spider shares with only one
link (no ha)
- simplify spider_get_sys_tables_connect_info() where link_idx is
always 0
- InnoDB fails to set the index information or index number
for the spatial index error HA_ERR_NULL_IN_SPATIAL.
row_build_spatial_index_key(): Initialize the tmp_mbr array completely.
check_if_supported_inplace_alter(): Fix the spelling mistake of alter
heap-buffer-overflow in _mi_put_key_in_record
The rec buffer size depends on vreclength like this:
length= MY_MAX(length, info->s->vreclength);
The problem is that the rec buffer is allocated before vreclength is
calculated. The fix reallocates the rec buffer if vreclength has changed.
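A hedged sketch of the reallocation (a stand-in for the actual MyISAM
code); the backtraces below show where the buffer is allocated and where
vreclength is computed:

  #include <algorithm>
  #include <cstddef>
  #include <cstdlib>

  // If vreclength grew after the rec buffer was allocated, reallocate
  // the buffer so the record still fits.
  unsigned char *ensure_rec_buff(unsigned char *buf, size_t &buf_len,
                                 size_t length, size_t vreclength)
  {
    const size_t need= std::max(length, vreclength);
    if (need <= buf_len)
      return buf; // still large enough
    if (unsigned char *p= static_cast<unsigned char*>(realloc(buf, need)))
    {
      buf_len= need;
      return p;
    }
    return buf; // realloc failed; keep the old buffer
  }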
1. Rec buffer allocated
f0 mi_alloc_rec_buff (...) at ../src/storage/myisam/mi_open.c:738
f1 0x00005f4928244516 in mi_open (...) at ../src/storage/myisam/mi_open.c:671
f2 0x00005f4928210b98 in ha_myisam::open (...)
at ../src/storage/myisam/ha_myisam.cc:847
f3 0x00005f49273aba41 in handler::ha_open (...) at ../src/sql/handler.cc:3105
f4 0x00005f4927995a65 in open_table_from_share (...)
at ../src/sql/table.cc:4320
f5 0x00005f492769f084 in open_table (...) at ../src/sql/sql_base.cc:2024
f6 0x00005f49276a3ea9 in open_and_process_table (...)
at ../src/sql/sql_base.cc:3819
f7 0x00005f49276a29b8 in open_tables (...) at ../src/sql/sql_base.cc:4303
f8 0x00005f49276a6f3f in open_and_lock_tables (...)
at ../src/sql/sql_base.cc:5250
f9 0x00005f49275162de in open_and_lock_tables (...)
at ../src/sql/sql_base.h:509
f10 0x00005f4927a30d7a in open_only_one_table (...)
at ../src/sql/sql_admin.cc:412
f11 0x00005f4927a2c0c2 in mysql_admin_table (...)
at ../src/sql/sql_admin.cc:603
f12 0x00005f4927a2fda8 in Sql_cmd_optimize_table::execute (...)
at ../src/sql/sql_admin.cc:1517
f13 0x00005f49278102e3 in mysql_execute_command (...)
at ../src/sql/sql_parse.cc:6180
f14 0x00005f49278012d7 in mysql_parse (...) at ../src/sql/sql_parse.cc:8236
2. vreclength calculated
f0 ha_myisam::setup_vcols_for_repair (...)
at ../src/storage/myisam/ha_myisam.cc:1002
f1 0x00005f49282138b4 in ha_myisam::optimize (...)
at ../src/storage/myisam/ha_myisam.cc:1250
f2 0x00005f49273b4961 in handler::ha_optimize (...)
at ../src/sql/handler.cc:4896
f3 0x00005f4927a2d254 in mysql_admin_table (...)
at ../src/sql/sql_admin.cc:875
f4 0x00005f4927a2fda8 in Sql_cmd_optimize_table::execute (...)
at ../src/sql/sql_admin.cc:1517
f5 0x00005f49278102e3 in mysql_execute_command (...)
at ../src/sql/sql_parse.cc:6180
f6 0x00005f49278012d7 in mysql_parse (...) at ../src/sql/sql_parse.cc:8236
FYI, the backtraces were generated with the following gdb settings:
set print frame-info location
set print frame-arguments presence
set width 80
There were unused variables. They were not conditional
on defines, so they were removed.
Added error handling in proc_object for the case when there is no db,
as subsequent operations would have failed.
CMake rewriting the tests causes Mroonga to be un-buildable
in build environments where the source directory is read-only.
In the test results, the version wasn't particularly important.
Remove the version dependence of the tests.
storage/connect/tabfmt.cpp:419:24: error: '%.3d' directive writing between 3 and 10 bytes into a region of size 5 [-Werror=format-overflow=]
419 | sprintf(buf, "COL%.3d", i+1);
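A hedged sketch of a typical fix for this warning: give the buffer
worst-case room and bound the write with snprintf (the actual change in
tabfmt.cpp may differ):

  #include <cstdio>

  void make_col_name(char (&buf)[16], int i)
  {
    // "COL" + up to 11 digits of a signed int + NUL fits in 16 bytes
    snprintf(buf, sizeof buf, "COL%.3d", i + 1);
  }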
row_purge_reset_trx_id(): Reserve large enough offsets for accommodating
the maximum width PRIMARY KEY followed by DB_TRX_ID,DB_ROLL_PTR.
Reviewed by: Thirunarayanan Balathandayuthapani
purge_sys_t::get_page(): Avoid accessing a freed reference to pages[id]
after pages.erase(id). This heap-use-after-free would sometimes be
caught by AddressSanitizer.
purge_sys_t::iterator::free_history_rseg(): Do not crash if undo=nullptr
(the database is corrupted).
Reviewed by: Debarun Banerjee
Another chance for cutting back overhead due to C++ exceptions being
enabled; the `dict_sys_t` class is a good candidate because its
locking methods are called frequently.
Binary size reduction this time:
text data bss dec hex filename
24448622 2436488 9473537 36358647 22ac9f7 build/release/sql/mariadbd
24448474 2436488 9473601 36358563 22ac9a3 build/release/sql/mariadbd
MariaDB is compiled with C++ exceptions enabled, and that disallows
some optimizations (e.g. the stack must always be unwinding-safe). By
adding `noexcept` to functions that are guaranteed to never throw,
some of these optimizations can be regained. Low-level locking
functions that are called often are good candidates for this.
This shrinks the executable a bit (tested with GCC 14 on aarch64):
text data bss dec hex filename
24448910 2436488 9473185 36358583 22ac9b7 build/release/sql/mariadbd
24448622 2436488 9473537 36358647 22ac9f7 build/release/sql/mariadbd
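A hedged sketch of the pattern (stand-in class; not the actual
dict_sys_t declarations):

  #include <mutex>

  // Marking hot, provably non-throwing locking methods noexcept lets
  // the compiler omit unwinding bookkeeping around their call sites.
  class dict_sys_like
  {
    std::mutex m;
  public:
    void lock() noexcept   { m.lock(); }
    void unlock() noexcept { m.unlock(); }
  };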
Don't allow changing a referencing key column from NULL to NOT NULL
when
1) the foreign key constraint type is ON UPDATE SET NULL,
2) the foreign key constraint type is ON DELETE SET NULL, or
3) the foreign key constraint type is ON UPDATE CASCADE and the
referenced column is declared as NULL.
Don't allow changing a referenced key column from NOT NULL to NULL
when the foreign key constraint type is ON UPDATE CASCADE
and the referencing key columns don't allow NULL values.
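A hedged sketch of these rules as predicates (names hypothetical; the
actual checks live in the functions listed below):

  enum fk_action { ON_UPDATE_SET_NULL, ON_DELETE_SET_NULL,
                   ON_UPDATE_CASCADE, FK_OTHER };

  // Referencing column: may it change from NULL to NOT NULL?
  bool referencing_may_become_not_null(fk_action a,
                                       bool referenced_nullable)
  {
    if (a == ON_UPDATE_SET_NULL || a == ON_DELETE_SET_NULL)
      return false;
    if (a == ON_UPDATE_CASCADE && referenced_nullable)
      return false;
    return true;
  }

  // Referenced column: may it change from NOT NULL to NULL?
  bool referenced_may_become_nullable(fk_action a,
                                      bool referencing_nullable)
  {
    return !(a == ON_UPDATE_CASCADE && !referencing_nullable);
  }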
get_foreign_key_info(): InnoDB sends the information about
nullability of the foreign key fields and referenced key fields.
fk_check_column_changes(): Enforce the above rules for the COPY
algorithm.
innobase_check_foreign_drop_col(): Check whether the dropped
column exists in an existing foreign key relation.
innobase_check_foreign_low(): Enforce the above rules for the
INPLACE algorithm.
dict_foreign_t::check_fk_constraint_valid(): This is used
by the CREATE TABLE statement to check nullability for a foreign
key relation.
The method was declared to return an unsigned integer, but it is
really a boolean (and used as such by all callers).
A secondary change is the addition of "const" and "noexcept" to this
method.
In ha_mroonga.cpp, I also added "inline" to the two helper methods of
referenced_by_foreign_key(). This allows the compiler to flatten the
method.
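A hedged sketch of the resulting declaration (stand-in class):

  class handler_like
  {
  public:
    // was: uint referenced_by_foreign_key();
    bool referenced_by_foreign_key() const noexcept;
  };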