The assertion failure happened due to this scenario:
A query was run with optimizer_join_limit_pref_ratio=1.
The query had "ORDER BY t1.col LIMIT N".
The optimizer set join->limit_shortcut_applicable=1.
Then, table t1 was marked as constant.
The code in choose_query_plan() still set join->limit_optimization_mode=1,
which caused the optimizer to consider only t1 as the first non-const table.
But t1 had already been put into the join prefix as a constant table.
The optimizer couldn't produce any join order at all and crashed.
Fixed by not searching for a shortcut plan if the ORDER BY table is a
constant. We will not attempt sorting in this case anyway (and LIMIT
short-cutting will be done for any join order).
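A minimal standalone sketch of the guard's shape; the struct and field
names are illustrative stand-ins for the JOIN members, not the server's
actual code:
  #include <cassert>

  struct JoinModel {
    bool order_by_table_is_const;   // ORDER BY table was marked constant
    bool limit_shortcut_applicable; // "ORDER BY ... LIMIT N" pattern matched
  };

  // Only search for a LIMIT shortcut plan when the ORDER BY table is not
  // constant: for a constant table no sorting will be attempted, and LIMIT
  // short-cutting works for any join order.
  bool may_search_shortcut_plan(const JoinModel &join) {
    return join.limit_shortcut_applicable && !join.order_by_table_is_const;
  }

  int main() {
    assert(!may_search_shortcut_plan({true, true}));  // t1 became constant
    assert(may_search_shortcut_plan({false, true}));
  }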
(Variant 2: only allow rewrite for ref(const))
make_join_select() has a "ref_to_range" rewrite: it would rewrite
any ref access to a range access on the same index if the latter uses
more keyparts.
It seems the initial intent of this was to fix a poor query plan choice
in cases like
t.keypart1=const AND t.keypart2 < 'foo'
Due to a deficiency in the cost model, ref access could be picked even
though range would enumerate fewer rows and be cheaper.
However, the condition also forces a rewrite in cases like:
t.keypart1=prev_table.col AND t.keypart1<='foo' AND t.keypart2<'bar'
Here, it can be that
* keypart1=prev_table.col is highly selective
* (keypart1, keypart2) <= ('foo', 'bar') is not at all selective.
Still, the rewrite would be made and poor query plan chosen.
Fixed this by only doing the rewrite if the ref access was ref(const),
so we can be certain that the quick select also uses these restrictions
and will scan a subset of the rows that the ref access would scan.
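A standalone sketch of the tightened condition, with illustrative types
rather than the actual make_join_select() code:
  #include <cassert>

  struct RefAccess {
    unsigned keyparts;
    bool     is_ref_const; // all key values are constants, not prev-table cols
  };

  // Rewrite ref -> range only if range uses more keyparts of the same index
  // AND the ref access is ref(const): then the range scan is known to read
  // a subset of the rows the ref access would read.
  bool should_rewrite_to_range(const RefAccess &ref, unsigned range_keyparts) {
    return range_keyparts > ref.keyparts && ref.is_ref_const;
  }

  int main() {
    assert(should_rewrite_to_range({1, true}, 2));   // kp1=const AND kp2<'foo'
    assert(!should_rewrite_to_range({1, false}, 2)); // kp1=prev_table.col
  }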
The invariant of write-ahead logging is that before any change to a
page is written to the data file, the corresponding log record must
first have been durably written.
On crash recovery, there were some sloppy checks for this. Let us
implement accurate checks and flag an inconsistency as a hard error,
so that we can avoid further corruption of a corrupted database.
For data extraction from the corrupted database, innodb_force_recovery
can be used.
Before recovery is reading any data pages or invoking
buf_dblwr_t::recover() to recover torn pages from the
doublewrite buffer, InnoDB will have parsed the log until the
final LSN and updated log_sys.lsn to that. So, we can rely on
log_sys.lsn at all times. The doublewrite buffer recovery has been
refactored in such a way that the recv_sys.dblwr.pages may be consulted
while discovering files and their page sizes, but nothing will be
written back to data files before buf_dblwr_t::recover() is invoked.
A section of the test mariabackup.innodb_redo_overwrite
that is parsing some mariadb-backup --backup output has
been removed, because that output "redo log block is overwritten"
would often be missing in a Microsoft Windows environment
as a result of these changes.
recv_max_page_lsn, recv_lsn_checks_on: Remove.
recv_sys_t::validate_checkpoint(): Validate the write-ahead-logging
condition at the end of recovery.
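The condition itself is simple; here is a standalone sketch with
simplified stand-ins for the InnoDB types, where log_end_lsn plays the
role of log_sys.lsn:
  #include <cassert>
  #include <cstdint>

  enum class page_check { OK, CORRUPTED_FUTURE_LSN };

  // WAL invariant: no page in a data file may be newer than the last
  // durably written log record.
  page_check validate_page_lsn(uint64_t fil_page_lsn, uint64_t log_end_lsn) {
    return fil_page_lsn <= log_end_lsn ? page_check::OK
                                       : page_check::CORRUPTED_FUTURE_LSN;
  }

  int main() {
    assert(validate_page_lsn(100, 200) == page_check::OK);
    assert(validate_page_lsn(300, 200) == page_check::CORRUPTED_FUTURE_LSN);
  }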
recv_dblwr_t::validate_page(): Keep track of the maximum LSN
(if we are checking a non-doublewrite copy of a page) but do not
complain about an LSN being in the future. The doublewrite buffer
is a special case, because it will be read early during recovery.
Besides, starting with commit 762bcb81b5
the dblwr=true copies of pages may legitimately be "too new".
recv_dblwr_t::find_page(): Find a valid page with the smallest
FIL_PAGE_LSN that is in the valid range for recovery.
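A standalone sketch of the selection rule, on a simplified model (not
the actual recv_dblwr_t interface; checksum validation is omitted):
  #include <cassert>
  #include <cstdint>
  #include <vector>

  struct page_copy { uint64_t fil_page_lsn; /* ...page image... */ };

  // Among the valid copies of a page, pick the one with the smallest
  // FIL_PAGE_LSN that lies within the usable recovery range.
  const page_copy *find_page(const std::vector<page_copy> &copies,
                             uint64_t checkpoint_lsn, uint64_t log_end_lsn) {
    const page_copy *best = nullptr;
    for (const page_copy &c : copies)
      if (c.fil_page_lsn >= checkpoint_lsn && c.fil_page_lsn <= log_end_lsn &&
          (!best || c.fil_page_lsn < best->fil_page_lsn))
        best = &c;
    return best;
  }

  int main() {
    std::vector<page_copy> v{{90}, {120}, {150}, {300}};
    const page_copy *p = find_page(v, 100, 200);
    assert(p && p->fil_page_lsn == 120); // 90 and 300 are out of range
  }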
recv_dblwr_t::restore_first_page(): Replaced by find_page().
Only buf_dblwr_t::recover() will write to data files.
buf_dblwr_t::recover(): Simplify the message output. Do attempt
doublewrite recovery on a user page read error. Ignore doublewrite
pages whose FIL_PAGE_LSN is outside the usable bounds. Previously,
we could wrongly recover a too-new page from the doublewrite buffer.
It is unlikely that this could have led to an actual error.
Write back all recovered pages from the doublewrite buffer here,
including for the first page of any tablespace.
buf_page_is_corrupted(): Distinguish the return values
CORRUPTED_FUTURE_LSN and CORRUPTED_OTHER.
buf_page_check_corrupt(): Return the error code DB_CORRUPTION
in case the LSN is in the future.
Datafile::read_first_page(): Handle FSP_SPACE_FLAGS=0xffffffff
in the same way on both 32-bit and 64-bit architectures.
Datafile::read_first_page_flags(): Split from read_first_page().
Take a copy of the first page as a parameter.
recv_sys_t::free_corrupted_page(): Take the file as a parameter
and return whether a message was displayed. This avoids some duplicated
and incomplete error messages.
buf_page_t::read_complete(): Remove some redundant output and always
display the name of the corrupted file. Never return DB_FAIL;
use it only in internal error handling.
IORequest::read_complete(): Assume that buf_page_t::read_complete()
will have reported any error.
fil_space_t::set_corrupted(): Return whether this is the first time
the tablespace has been flagged as corrupted.
Datafile::validate_first_page(), fil_node_open_file_low(),
fil_node_open_file(), fil_space_t::read_page0(),
fil_node_t::read_page0(): Add a parameter for a copy of the
first page, and a parameter to indicate whether the FIL_PAGE_LSN
check should be suppressed. Before buf_dblwr_t::recover() is
invoked, we cannot validate the FIL_PAGE_LSN, but we can trust the
FSP_SPACE_FLAGS and the tablespace ID that may be present in a
potentially too new copy of a page.
Reviewed by: Debarun Banerjee
dict_index_t::clear(), btr_drop_temporary_table(): Make use of the
root page guess if it is available.
btr_read_autoinc(): Invoke btr_root_block_get() to access the root page.
btr_blob_free(): Retain a buffer-fix on the page across mtr_t::commit()
in order to avoid a buf_pool.page_hash lookup.
dict_load_table_one(): Remove a redundant check for page id. It was
already validated in buf_page_t::read_complete().
trx_t::apply_log(): Make use of buf_pool.page_fix() to avoid some
mtr_t related overhead.
Reviewed by: Thirunarayanan Balathandayuthapani
In commit b7b9f3ce82 (MDEV-34515) we
accidentally made the InnoDB MVCC code acquire a shared
purge_sys.latch twice. Recursive shared latch acquisition may cause a
deadlock of InnoDB threads if another thread starts waiting for an
exclusive latch in between.
purge_sys_t::latch: In debug builds, use srw_lock_debug instead of
srw_spin_lock, so that bugs like this will result in debug assertion
failures.
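To illustrate what such a debug latch can catch, here is a standalone
sketch of a shared-lock wrapper that asserts on recursive shared
acquisition. It tracks one latch per thread for simplicity; the real
srw_lock_debug is more elaborate:
  #include <cassert>
  #include <shared_mutex>

  class debug_shared_lock {
    std::shared_mutex m;
    static thread_local bool holds_shared; // does this thread hold S already?
  public:
    void lock_shared() {
      // Re-acquiring S can deadlock if another thread started waiting for X
      // in between, so fail fast in debug builds.
      assert(!holds_shared);
      m.lock_shared();
      holds_shared = true;
    }
    void unlock_shared() {
      holds_shared = false;
      m.unlock_shared();
    }
  };
  thread_local bool debug_shared_lock::holds_shared = false;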
trx_undo_report_row_operation(): Pass the view_guard to
trx_undo_prev_version() and the rest of the arguments in the same
order, so that the work to permute argument registers is minimized.
Pre-11.0 variant:
1. In recompute_join_cost_with_limit(), add an assertion that
partial_join_cost >= 0.0.
2. best_extension_by_limited_search() subtracts COST_EPS from
join->best_read, but it is not subtracted from
join->positions[0].read_time; add it back.
3. We could get a very small negative partial_join_cost due to rounding
errors. For fraction=1.0, we were computing essentially this (denote
as EXPR-1):
$row_read_cost + $where_cost - ($row_read_cost + $where_cost)
which should compute to 0.
But the computation was done in the following order (left-to-right),
EXPR-2:
($row_read_cost + $where_cost) - $row_read_cost - $where_cost
and this produced a value of -1.1102230246251565e-16 due to a rounding
error. Change the computation to use EXPR-1 instead of EXPR-2 (see the
demonstration below).
optimize_straight_join() and best_extension_by_limited_search()
use 0.001 to make the choice between plans with identical cost
deterministic. Use COST_EPS instead of 0.001, as is done in newer
versions.
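The rounding effect in point 3 is easy to reproduce standalone:
floating-point addition and subtraction are not associative, so the two
evaluation orders can differ by about one ulp. The cost values below
are illustrative, not the ones from the bug; the sign and magnitude of
the residue depend on the operands:
  #include <cstdio>

  int main() {
    double row_read_cost = 0.1, where_cost = 0.2;
    // EXPR-1: X - X, computes to exactly 0.
    double expr1 = (row_read_cost + where_cost) - (row_read_cost + where_cost);
    // EXPR-2: left-to-right, leaves a tiny nonzero residue.
    double expr2 = (row_read_cost + where_cost) - row_read_cost - where_cost;
    printf("EXPR-1 = %.17g\n", expr1); // 0
    printf("EXPR-2 = %.17g\n", expr2); // 2.7755575615628914e-17 here
  }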
Stop skipping const items when SELECTing, but skip them when storing
their results to the spider row, to avoid storing them in mismatching
temporary table fields.
Skip auxiliary fields in SELECTing, and accordingly do not store the
(non-existing) results to the corresponding temporary table.
When there are BOTH auxiliary fields AND const items among the
auxiliary field items, do not use the spider GBH. This is a rare
occasion, if it happens at all, and not worth the added complexity to
cover.
Use the original item (item_ptr) in constructing GROUP BY and ORDER
BY, which also means using item->name instead of field->field_name as
aliases in constructing SELECT items. This fixes spurious regressions
caused by the above changes in some tests using ORDER BY, such as
mdev_24517.test. As a by-product, this also fixes MDEV-29546.
Therefore we update mdev_29008.test to include the MDEV-29546 case.
Remove dead code in Spider related to Spider's Oracle OCI support.
The code has been disabled for a long time, and it is unlikely that it
will ever be enabled.
During Spider query construction of certain cast functions, Spider
locates the last occurrence of a keyword in the output of the
Item::print() function and appends from there to the query constructed
so far. For example, consider the following query:
SELECT * FROM t2 ORDER BY CAST(c AS INET6);
It constructs the following query and executes it at the data
node (assuming the data node table is called t0):
select cast(t0.`c` as inet6) ``,t0.`c` `c` from `test`.`t2` t0 order by ``
Suppose the construction has completed the initial part:
select cast(t0.`c`
It then attempts to construct the " as inet6" part. To that end, it
calls print() on the Item_typecast_fbt corresponding to the cast item
and obtains:
cast(`test`.`t2`.`c` as inet6)
It then looks for " as " and places the cursor there for appending:
cast(`test`.`t2`.`c` as inet6)
                    ^
In this patch, if the search fails, i.e. there is no " as ...", we
make sure that the cursor is not placed before the beginning of the
string (out of bounds).
We also relax the search from " as char" to " as " in the case of
CHAR_TYPECAST_FUNC, since there is more than one Item type with this
func type. For example, "AS INET6" is an Item_typecast_fbt which has
this func type.
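A standalone sketch of the hardened cursor placement, on a simplified
model (the real code operates on Spider's query string builder):
  #include <cassert>
  #include <string>

  // Offset at which appending resumes: the last " as " if present,
  // otherwise the start of the string, never an out-of-bounds position.
  size_t append_offset(const std::string &printed) {
    size_t pos = printed.rfind(" as ");
    return pos == std::string::npos ? 0 : pos;
  }

  int main() {
    assert(append_offset("cast(`test`.`t2`.`c` as inet6)") == 20);
    assert(append_offset("cast(`c` as char)") == 8);
    assert(append_offset("no keyword") == 0); // failed search stays in bounds
  }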
When a DDL statement results in a local partitioned table with
partitions not covering all values in the table, an error is emitted.
However, when the table in question is a spider table, the issue does
not surface until some future statements (DELETE in the test examples
in this commit) are executed. This is consistent with the design of
spider, which aims to minimise connections to the data node. The
resulting error is legitimate and should not result in an assertion
failure. Similarly, a partitioned spider table could have misplaced
rows, so we remove the other assertion as well.
- Document tmp_share: temporary spider shares with only one
link (no ha).
- Simplify spider_get_sys_tables_connect_info(), where link_idx is
always 0.
- InnoDB fails to set the index information or index number
for the spatial index error HA_ERR_NULL_IN_SPATIAL.
row_build_spatial_index_key(): Initialize the tmp_mbr array completely.
check_if_supported_inplace_alter(): Fix the spelling mistake of alter
Sometimes, in MariaDB Server 10.5 but apparently not in later branches,
the test would hang because con1 and con2 would be blocked in
DEBUG_SYNC (for example, lock_wait_suspend_thread_enter and
row_ins_sec_index_entry_dup_locks_created), thus preventing the purge
of transactions from completing.
To prevent an occasional DEBUG_SYNC induced hang in the test, we will
wait for everything to be purged, except the last 2 transactions.
This change should be null-merged to 10.6, because the test is not
failing in 10.6 or later major versions.
Field_blob::store() has special code for GROUP_CONCAT temporary table
(to store blob values in Blob_mem_storage - this prevents them
from being freed/overwritten when a next row is read).
Field_geom and Field_blob_compressed inherit from Field_blob but they
have their own ::store() method without this special Blob_mem_storage
support.
Considering that non-grouping CONCAT() of such fields converts
them to plain BLOB, let's do the same for GROUP_CONCAT. To do it,
Item_func_group_concat::setup() will signal that it is creating
a temporary table for GROUP_CONCAT, and the Field_blob::make_new_field()
override will create a base Field_blob when under GROUP_CONCAT.
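A standalone sketch of the shape of the fix, with much simplified
stand-in classes (the real override sits on Field_blob and checks the
signal set by Item_func_group_concat::setup()):
  struct Field_blob {
    virtual ~Field_blob() = default;
    // Under a GROUP_CONCAT temporary table, create a plain Field_blob so
    // that the Blob_mem_storage handling in Field_blob::store() applies.
    virtual Field_blob *make_new_field(bool for_group_concat) {
      (void) for_group_concat;
      return new Field_blob();
    }
  };

  struct Field_geom : Field_blob {
    Field_blob *make_new_field(bool for_group_concat) override {
      if (for_group_concat)
        return new Field_blob(); // degrade to plain BLOB, like CONCAT() does
      return new Field_geom();
    }
  };

  int main() { Field_geom g; delete g.make_new_field(true); }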
The hash index is a vcol-based wrapper (MDEV-371). row_end is added to
the unique index, so when row_end is updated, the unique hash index must
be recalculated via vcol_update_fields(). DELETE did not update virtual
fields, so DELETE HISTORY was getting a wrong hash value.
The fix calls update_virtual_fields() from vers_update_end(), so in
every case where row_end is updated, virtual fields are updated as well.
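A standalone sketch of the control flow after the fix, with illustrative
types and a stand-in hash (the real code recomputes the hash vcol):
  #include <cassert>

  struct row_model {
    long long row_end = 0;
    long long unique_hash = 0; // vcol-backed hash that covers row_end

    void update_virtual_fields() {
      unique_hash = row_end * 31; // stand-in for the real hash vcol
    }
    void vers_update_end(long long now) {
      row_end = now;
      update_virtual_fields(); // keep the hash index in sync on every path,
                               // including DELETE HISTORY
    }
  };

  int main() {
    row_model r;
    r.vers_update_end(42);
    assert(r.unique_hash == 42 * 31);
  }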
work consistently on replication
Row-based replication does not execute CREATE .. SELECT but instead
CREATE TABLE. CREATE .. SELECT creates implicit system fields in an
unusual place: in between the declared fields and the select fields.
That was done because the select_field_pos logic requires select fields
to go last in create_list.
So CREATE .. SELECT on the master and CREATE TABLE on the slave created
system fields at different positions, and replication got a field
mismatch.
To fix this, we changed CREATE .. SELECT to create implicit system
fields in the usual place, at the end, and updated select_field_pos to
handle this case.
heap-buffer-overflow in _mi_put_key_in_record
Rec buffer size depends on vreclength like this:
length= MY_MAX(length, info->s->vreclength);
The problem is that the rec buffer is allocated before vreclength is
calculated. The fix reallocates the rec buffer if vreclength changed
(a sketch follows the backtraces below).
1. Rec buffer allocated
f0 mi_alloc_rec_buff (...) at ../src/storage/myisam/mi_open.c:738
f1 0x00005f4928244516 in mi_open (...) at ../src/storage/myisam/mi_open.c:671
f2 0x00005f4928210b98 in ha_myisam::open (...)
at ../src/storage/myisam/ha_myisam.cc:847
f3 0x00005f49273aba41 in handler::ha_open (...) at ../src/sql/handler.cc:3105
f4 0x00005f4927995a65 in open_table_from_share (...)
at ../src/sql/table.cc:4320
f5 0x00005f492769f084 in open_table (...) at ../src/sql/sql_base.cc:2024
f6 0x00005f49276a3ea9 in open_and_process_table (...)
at ../src/sql/sql_base.cc:3819
f7 0x00005f49276a29b8 in open_tables (...) at ../src/sql/sql_base.cc:4303
f8 0x00005f49276a6f3f in open_and_lock_tables (...)
at ../src/sql/sql_base.cc:5250
f9 0x00005f49275162de in open_and_lock_tables (...)
at ../src/sql/sql_base.h:509
f10 0x00005f4927a30d7a in open_only_one_table (...)
at ../src/sql/sql_admin.cc:412
f11 0x00005f4927a2c0c2 in mysql_admin_table (...)
at ../src/sql/sql_admin.cc:603
f12 0x00005f4927a2fda8 in Sql_cmd_optimize_table::execute (...)
at ../src/sql/sql_admin.cc:1517
f13 0x00005f49278102e3 in mysql_execute_command (...)
at ../src/sql/sql_parse.cc:6180
f14 0x00005f49278012d7 in mysql_parse (...) at ../src/sql/sql_parse.cc:8236
2. vreclength calculated
f0 ha_myisam::setup_vcols_for_repair (...)
at ../src/storage/myisam/ha_myisam.cc:1002
f1 0x00005f49282138b4 in ha_myisam::optimize (...)
at ../src/storage/myisam/ha_myisam.cc:1250
f2 0x00005f49273b4961 in handler::ha_optimize (...)
at ../src/sql/handler.cc:4896
f3 0x00005f4927a2d254 in mysql_admin_table (...)
at ../src/sql/sql_admin.cc:875
f4 0x00005f4927a2fda8 in Sql_cmd_optimize_table::execute (...)
at ../src/sql/sql_admin.cc:1517
f5 0x00005f49278102e3 in mysql_execute_command (...)
at ../src/sql/sql_parse.cc:6180
f6 0x00005f49278012d7 in mysql_parse (...) at ../src/sql/sql_parse.cc:8236
FYI, the backtraces were obtained with
set print frame-info location
set print frame-arguments presence
set width 80
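A standalone sketch of the fix's shape, mirroring the sizing formula
above with simplified stand-ins for the MyISAM structures:
  #include <algorithm>
  #include <cstdlib>

  struct share_model { size_t vreclength; };
  struct info_model  { share_model *s; char *rec_buff; size_t rec_buff_size; };

  // Grow the rec buffer if vreclength was (re)calculated after the
  // original allocation.
  bool ensure_rec_buff(info_model *info, size_t length) {
    length = std::max(length, info->s->vreclength);
    if (length <= info->rec_buff_size)
      return true;
    if (char *p = static_cast<char *>(std::realloc(info->rec_buff, length))) {
      info->rec_buff = p;
      info->rec_buff_size = length;
      return true;
    }
    return false; // out of memory; caller reports the error
  }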
Search conditions were evaluated using val_int(), which was wrong.
Fixing the code to use val_bool() instead.
Details:
- Adding a new item_base_t::IS_COND flag which marks Items used
as <search condition> in WHERE, HAVING, JOIN ON, CASE WHEN clauses.
The flag is set at parse time.
These expressions must be evaluated using val_bool() rather than val_int().
Note, the optimizer creates more Items which are used as search conditions.
Most of these items are not marked with IS_COND yet. This is OK for now,
but eventually these Items can also be fixed to have the flag.
- Adding a method Item::is_cond() which tests if the Item has the IS_COND flag.
- Implementing Item_cache_bool. It evaluates the cached expression using
val_bool() rather than val_int().
Overriding Type_handler_bool::Item_get_cache() to create Item_cache_bool.
- Implementing Item::save_bool_in_field(). It uses val_bool() rather than
val_int() to evaluate the expression.
- Implementing Type_handler_bool::Item_save_in_field()
using Item::save_bool_in_field().
- Fixing all Item_bool_func descendants to implement a virtual val_bool()
rather than a virtual val_int().
- To find places where val_int() should be fixed to val_bool(), a few
DBUG_ASSERT(!is_cond()) calls were added into the val_int()
implementations of selected (most frequent) classes:
Item_field
Item_str_func
Item_datefunc
Item_timefunc
Item_datetimefunc
Item_cache_bool
Item_bool_func
Item_func_hybrid_field_type
Item_basic_constant descendants
- Fixing all places where a DBUG_ASSERT() fired during an "mtr" run
to use val_bool() instead of val_int().
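A standalone sketch of the Item_cache_bool idea, on a much simplified
Item surface (in the server, Item_cache keeps the cached expression in
its "example" member):
  #include <cassert>

  struct Item {
    virtual ~Item() = default;
    virtual bool val_bool() = 0;          // <search condition> evaluation
    virtual long long val_int() = 0;
  };

  struct Item_cache_bool : Item {
    Item *example;                        // the expression being cached
    bool value = false, value_cached = false;
    explicit Item_cache_bool(Item *e) : example(e) {}
    bool val_bool() override {
      if (!value_cached) {
        value = example->val_bool();      // val_bool(), not val_int()
        value_cached = true;
      }
      return value;
    }
    long long val_int() override { return val_bool() ? 1 : 0; }
  };

  struct Item_true : Item {
    bool val_bool() override { return true; }
    long long val_int() override { return 1; }
  };

  int main() {
    Item_true t;
    Item_cache_bool c(&t);
    assert(c.val_bool() && c.val_int() == 1);
  }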
Fixed by checking whether handler_stats is active, instead of checking
thd->variables.log_slow_verbosity & LOG_SLOW_VERBOSITY_ENGINE.
Reviewed-by: Sergei Petrunia <sergey@mariadb.com>