(Variant 2b: call greedy_search() twice, correct handling for limited
search_depth)
Modify the join optimizer to specifically try to produce join orders that
can short-cut their execution for ORDER BY..LIMIT clause.
The optimization is controlled by @@optimizer_join_limit_pref_ratio.
Default value 0 means don't construct short-cutting join orders.
Other value means construct short-cutting join order, and prefer it only
if it promises speedup of more than #value times.
In Optimizer Trace, look for these names:
* join_limit_shortcut_is_applicable
* join_limit_shortcut_plan_search
* join_limit_shortcut_choice
In commit fa8a46eb68 (MDEV-33613)
the parameter innodb_lru_flush_size ceased to have any effect.
Let us declare the parameter as deprecated and additionally as
MARIADB_REMOVED_OPTION, so that there will be a warning written
to the error log in case the option is specified in the command line.
Let us also do the same for the parameter
innodb_purge_rseg_truncate_frequency
that was deprecated&ignored earlier in MDEV-32050.
Reviewed by: Debarun Banerjee
In a Sysbench oltp_update_index workload that involves 1 table,
a serious contention between the workload and the purge of history
was observed. This was the worst when the table contained only 1 record.
This turned out to be fixed by setting innodb_purge_batch_size=128,
which corresponds to the number of usable persistent rollback segments.
When we go above that, there would be contention between row_purge_poss_sec()
and the workload, typically on the clustered index page latch, sometimes
also on a secondary index page latch. It might be that with smaller
batches, trx_sys.history_size() will end up pausing all concurrent
transaction start/commit frequently enough so that purge will be able
to make some progress, so that there would be less contention on the
index page latches between purge and SQL execution.
In commit aa719b5010 (part of MDEV-32050)
the interpretation of the parameter innodb_purge_batch_size was slightly
changed. It would correspond to the maximum desired size of the
purge_sys.pages cache. Before that change, the parameter was referring to
a number of undo log pages, but the accounting might have been inaccurate.
To avoid a regression, we will reduce the default value to
innodb_purge_batch_size=127, which will also be compatible with
innodb_undo_tablespaces>1 (which will disable rollback segment 0).
Additionally, some logic in the purge and MVCC checks is simplified.
The purge tasks will make use of purge_sys.pages when accessing undo
log pages to find out if a secondary index record can be removed.
If an undo page needs to be looked up in buf_pool.page_hash, we will
merely buffer-fix it. This is correct, because the undo pages are
append-only in nature. Holding purge_sys.latch or purge_sys.end_latch
or the fact that the current thread is executing as a part of an
in-progress purge batch will prevent the contents of the undo page from
being freed and subsequently reused. The buffer-fix will prevent the
page from being evicted form the buffer pool. Thanks to this logic,
we can refer to the undo log record directly in the buffer pool page
and avoid copying the record.
buf_pool_t::page_fix(): Look up and buffer-fix a page. This is useful
for accessing undo log pages, which are append-only by nature.
There will be no need to deal with change buffer or ROW_FORMAT=COMPRESSED
in that case.
purge_sys_t::view_guard::view_guard(): Allow the type of guard to be
acquired: end_latch, latch, or no latch (in case we are a purge thread).
purge_sys_t::view_guard::get(): Read-only accessor to purge_sys.pages.
purge_sys_t::get_page(): Invoke buf_pool_t::page_fix().
row_vers_old_has_index_entry(): Replaced with row_purge_is_unsafe()
and row_undo_mod_sec_unsafe().
trx_undo_get_undo_rec(): Merged to trx_undo_prev_version_build().
row_purge_poss_sec(): Add the parameter mtr and remove redundant
or unused parameters sec_pcur, sec_mtr, is_tree. We will use the
caller's mtr object but release any acquired page latches before
returning.
btr_cur_get_page(), page_cur_get_page(): Do not invoke page_align().
row_purge_remove_sec_if_poss_leaf(): Return the value of PAGE_MAX_TRX_ID
to be checked against the page in row_purge_remove_sec_if_poss_tree().
If the secondary index page was not changed meanwhile, it will be
unnecessary to invoke row_purge_poss_sec() again.
trx_undo_prev_version_build(): Access any undo log pages using
the caller's mini-transaction object.
row_purge_vc_matches_cluster(): Moved to the only compilation unit that
needs it.
Reviewed by: Debarun Banerjee
- During copy algorithm, InnoDB should use bulk insert operation
for row by row insert operation. By doing this, copy algorithm
can effectively build indexes. This optimization is disabled
for temporary table, versioning table and table which has
foreign key relation.
Introduced the variable innodb_alter_copy_bulk to allow
the bulk insert operation for copy alter operation
inside InnoDB. This is enabled by default
ha_innobase::extra(): HA_EXTRA_END_ALTER_COPY mode tries to apply
the buffered bulk insert operation, updates the non-persistent
table stats.
row_merge_bulk_t::write_to_index(): Update stat_n_rows after
applying the bulk insert operation
row_ins_clust_index_entry_low(): In case of copy algorithm,
switch to bulk insert operation.
copy_data_error_ignore(): Handles the error while copying
the data from source to target file.
(With trivial fixes by sergey@mariadb.com)
Added option fix_innodb_cardinality to optimizer_adjust_secondary_key_costs
Using fix_innodb_cardinality disables the 'divide by 2' of rec_per_key_int
in InnoDB that in effect doubles the Cardinality for secondary keys.
This has the biggest effect for indexes where a few rows has the same key
value. Using this may also cause table scans for very small tables (which
in some cases may be better than an index scan).
The user visible effect is that 'SHOW INDEX FROM table_name' will for
InnoDB show the true Cardinality (and not 2x the real value). It will
also allow the optimizer to chose a better index in some cases as the
division by 2 could have a bad effect for tables with 2-5 identical values
per key.
A few notes about using fix_innodb_cardinality:
- It has direct affect for SHOW INDEX FROM table_name. SHOW INDEX
will also update the statistics in table share.
- The effect of fix_innodb_cardinality for query plans or EXPLAIN
is only visible after first open of the table. This is why one must
do a flush tables or use SHOW INDEX for the option to take effect.
- Using fix_innodb_cardinality can thus affect all user in their query
plans if they are using the same tables.
Because of this, it is strongly recommended that one uses
optimizer_adjust_secondary_key_costs=fix_innodb_cardinality mainly
in configuration files to not cause issues for other users.
The feedback plugin server_uid variable and the calculate_server_uid()
function is moved from feedback/utils.cc to sql/mysqld.cc
server_uid is added as a global variable (shown in 'show variables') and
is written to the error log on server startup together with server version
and server commit id.
We have an issue if a user have the following in a configuration file:
log_slow_filter="" # Log everything to slow query log
log_queries_not_using_indexes=ON
This set log_slow_filter to 'not_using_index' which disables
slow_query_logging of most queries.
In effect, on should never use log_slow_filter="" in config files but
instead use log_slow_filter=ALL.
Fixed by changing log_slow_filter="" that comes either from a
configuration file or from the command line, when starting to the server,
to log_slow_filter=ALL.
A warning will be printed when this happens.
Other things:
- One can now use =ALL for any 'set' variable to set all options at once.
(backported from 10.6)
As part of commit 685d958e38 (MDEV-14425)
the parameter innodb_log_write_ahead_size was removed, because it was
thought that determining the physical block size would be a sufficient
replacement.
However, we can only determine the physical block size on Linux or
Microsoft Windows. On some file systems, the physical block size
is not relevant. For example, XFS uses a block size of 4096 bytes
even if the underlying block size may be smaller.
On Linux, we failed to determine the physical block size if
innodb_log_file_buffered=OFF was not requested or possible.
This will be fixed.
log_sys.write_size: The value of the reintroduced parameter
innodb_log_write_ahead_size. To keep it simple, this is read-only
and a power of two between 512 and 4096 bytes, so that the previous
alignment guarantees are fulfilled. This will replace the previous
log_sys.get_block_size().
log_sys.block_size, log_t::get_block_size(): Remove.
log_t::set_block_size(): Ensure that write_size will not be less
than the physical block size. There is no point to invoke this
function with 512 or less, because that is the minimum value of
write_size.
innodb_params_adjust(): Add some disabled code for adjusting
the minimum value and default value of innodb_log_write_ahead_size
to reflect the log_sys.write_size.
log_t::set_recovered(): Mark the recovery completed. This is the
place to adjust some things if we want to allow write_size>4096.
log_t::resize_write_buf(): Refer to write_size.
log_t::resize_start(): Refer to write_size instead of get_block_size().
log_write_buf(): Simplify some arithmetics and remove a goto.
log_t::write_buf(): Refer to write_size. If we are writing less than
that, do not switch buffers, but keep writing to the same buffer.
Move some code to improve the locality of reference.
recv_scan_log(): Refer to write_size instead of get_block_size().
os_file_create_func(): For type==OS_LOG_FILE on Linux, always invoke
os_file_log_maybe_unbuffered(), so that log_sys.set_block_size() will
be invoked even if we are not attempting to use O_DIRECT.
recv_sys_t::find_checkpoint(): Read the entire log header
in a single 12 KiB request into log_sys.buf.
Tested with:
./mtr --loose-innodb-log-write-ahead-size=4096
./mtr --loose-innodb-log-write-ahead-size=2048
When the system variables @@debug_dbug was assigned to
some expression, Sys_debug_dbug::do_check() did not properly
convert the value from the expression character set to utf8.
So the value was erroneously re-interpretted as utf8 without
conversion. In case of a tricky expression character set
(e.g. utf16le), this led to unexpected results.
Fix:
Re-using Sys_var_charptr::do_string_check() in Sys_debug_dbug::do_check().
On Microsoft Windows, ReadFile() as well as WriteFile() limit the size
of the request to DWORD, which is 32 bits (at most 4 GiB - 1) also on
64-bit systems.
On FreeBSD, sysctl debug.iosize_max_clamp could limit the size of a
write request to INT_MAX. The size of a read request is always limited
to INT_MAX. This would allow the request size to be 4095 bytes more than
the Linux limit (0x7ffff000 according to "man 2 read" and "man 2 write").
On OpenBSD, Solaris and possibly NetBSD, the read request size is limited
to SSIZE_T_MAX, which would be half the current maximum
innodb_log_buffer_size. This should be not much of an issue anyway,
because on contemporary 64-bit platforms, the virtual addresses are
limited to 48 bits.
IBM AIX documentation mentions OFF_MAX which would apply when
a 64-bit application is running on a 32-bit kernel.
Let us declare innodb_log_buffer_size as 32-bit unsigned and make the
maximum 0x7ffff000, to be compatible with the least common
denominator (Linux).
The maximum innodb_sort_buffer_size already was 64 MiB,
which is not a problem.
SyncFileIO::execute(): Assert that the size of a synchronous read or
write request is limited to the maximum.
Reviewed by: Vladislav Vaintroub
In commit 24648768b4 (MDEV-30136)
the parameter innodb_flush_method was deprecated, with no direct
replacement for innodb_flush_method=O_DIRECT_NO_FSYNC.
Let us change innodb_doublewrite from Boolean to ENUM that can
be changed while the server is running:
OFF: Assume that writes of innodb_page_size are atomic
ON: Prevent torn writes (the default)
fast: Like ON, but avoid synchronizing writes to data files
The deprecated start-up parameter innodb_flush_method=NO_FSYNC will cause
innodb_doublewrite=ON to be changed to innodb_doublewrite=fast,
which will prevent InnoDB from making any durable writes to data files.
This would normally be done right before the log checkpoint LSN is updated.
Depending on the file systems being used and their configuration,
this may or may not be safe.
The value innodb_doublewrite=fast differs from the previous combination of
innodb_doublewrite=ON and innodb_flush_method=O_DIRECT_NO_FSYNC by always
invoking os_file_flush() on the doublewrite buffer itself
in buf_dblwr_t::flush_buffered_writes_completed(). This should be safer
when there are multiple doublewrite batches between checkpoints.
Typically, once per second, buf_flush_page_cleaner() would write out
up to innodb_io_capacity pages and advance the log checkpoint.
Also typically, innodb_io_capacity>128, which is the size of the
doublewrite buffer in pages. Should os_file_flush_func() not be invoked
between doublewrite batches, writes could be reordered in an unsafe way.
The setting innodb_doublewrite=fast could be safe when the doublewrite
buffer (the first file of the system tablespace) and the data files
reside in the same file system.
This was tested by running "./mtr --rr innodb.alter_kill". On the first
server startup, with innodb_doublewrite=fast, os_file_flush_func()
would only be invoked on the ibdata1 file and possibly ib_logfile0.
On subsequent startups with innodb_doublewrite=OFF, os_file_flush_func()
will be invoked on the individual data files during log_checkpoint().
Note: The setting debug_no_sync (in the code, my_disable_sync) would
disable all durable writes to InnoDB files, which would be much less safe.
IORequest::Type: Introduce special values WRITE_DBL and PUNCH_DBL
for asynchronous writes that are submitted via the doublewrite buffer.
In this way, fil_space_t::use_doublewrite() or buf_dblwr.in_use()
will only be consulted during buf_page_t::flush() and the doublewrite
buffer can be enabled or disabled without any fear of inconsistency.
buf_dblwr_t::block_size: Replaces block_size().
buf_dblwr_t::flush_buffered_writes(): If !in_use() and the doublewrite
buffer is empty, just invoke fil_flush_file_spaces() and return. The
doublewrite buffer could have been disabled while a batch was in
progress.
innodb_init_params(): If innodb_flush_method=O_DIRECT_NO_FSYNC,
set innodb_doublewrite=fast or innodb_doublewrite=fearless.
Thanks to Mark Callaghan for reporting this, and Vladislav Vaintroub
for feedback.
Some fixes related to commit f838b2d799 and
Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row()
for system-versioned tables were provided by Nikita Malyavin.
This was required by test versioning.rpl,trx_id,row.
The log_sys.lsn_lock is a very contended resource with a small
critical section in log_sys.append_prepare(). On many processor
microarchitectures, replacing the system call based log_sys.lsn_lock
with a pure spin lock would fare worse during high concurrency workloads,
wasting a significant amount of CPU cycles in the spin loop.
On other microarchitectures, we would see a significant amount of time
being spent in native_queued_spin_lock_slowpath() in the Linux kernel,
plus context switching between user and kernel address space. This was
pointed out by Steve Shaw from Intel Corporation.
Depending on the workload and the hardware implementation, it may be
useful to use a pure spin lock in log_sys.append_prepare().
We will introduce a parameter. The statement
SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=50;
would enable a spin lock that will execute that many MY_RELAX_CPU()
operations (such as the x86 PAUSE instruction) between successive
attempts of acquiring the spin lock. The use of a system call based
log_sys.lsn_lock (which is the default setting) can be enabled by
SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=0;
This patch will also introduce #ifdef LOG_LATCH_DEBUG
(part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON) for more accurate
tracking of log_sys.latch ownership and reorganize the fields of
log_sys to improve the locality of reference and to reduce the
chances of false sharing.
When a spin lock is being used, it will be maintained in the
most significant bit of log_sys.buf_free. This is useful, because that is
one of the fields that is covered by the lock. For IA-32 or AMD64, we
implement the spin lock specially via log_t::lsn_lock_bts(), employing the
i386 LOCK BTS instruction. A straightforward std::atomic::fetch_or() would
translate into an inefficient loop around LOCK CMPXCHG.
mtr_t::spin_wait_delay: The value of innodb_log_spin_wait_delay.
mtr_t::finisher: Pointer to the currently used mtr_t::finish_write()
implementation. This allows to avoid introducing conditional branches.
We no longer invoke log_sys.is_pmem() at the mini-transaction level,
but we would do that in log_write_up_to().
mtr_t::finisher_update(): Update finisher when spin_wait_delay is
changed from or to 0 (the spin lock is changed to log_sys.lsn_lock or
vice versa).
https://jepsen.io/analyses/mysql-8.0.34 highlights that the
transaction isolation levels in the InnoDB storage engine do not
correspond to any widely accepted definitions, such as
"Generalized Isolation Level Definitions"
https://pmg.csail.mit.edu/papers/icde00.pdf
(PL-1 = READ UNCOMMITTED, PL-2 = READ COMMITTED, PL-2.99 = REPEATABLE READ,
PL-3 = SERIALIZABLE).
Only READ UNCOMMITTED in InnoDB seems to match the above definition.
The issue is that InnoDB does not detect write/write conflicts
(Section 4.4.3, Definition 6) in the above.
It appears that as soon as we implement write/write conflict detection
(SET SESSION innodb_snapshot_isolation=ON), the default isolation level
(SET TRANSACTION ISOLATION LEVEL REPEATABLE READ) will become
Snapshot Isolation (similar to Postgres), as defined in Section 4.2 of
"A Critique of ANSI SQL Isolation Levels", MSR-TR-95-51, June 1995
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf
Locking reads inside InnoDB used to read the latest committed version,
ignoring what should actually be visible to the transaction.
The added test innodb.lock_isolation illustrates this. The statement
UPDATE t SET a=3 WHERE b=2;
is executed in a transaction that was started before a read view or
a snapshot of the current transaction was created, and committed before
the current transaction attempts to execute
UPDATE t SET b=3;
If SET innodb_snapshot_isolation=ON is in effect when the second
transaction was started, the second transaction will be aborted with
the error ER_CHECKREAD. By default (innodb_snapshot_isolation=OFF),
the second transaction would execute inconsistently, displaying an
incorrect SELECT COUNT(*) FROM t in its read view.
If innodb_snapshot_isolation=ON, if an attempt to acquire a lock on a
record that does not exist in the current read view is made, an error
DB_RECORD_CHANGED (HA_ERR_RECORD_CHANGED, ER_CHECKREAD) will
be raised. This error will be treated in the same way as a deadlock:
the transaction will be rolled back.
lock_clust_rec_read_check_and_lock(): If the current transaction has
a read view where the record is not visible and
innodb_snapshot_isolation=ON, fail before trying to acquire the lock.
row_sel_build_committed_vers_for_mysql(): If innodb_snapshot_isolation=ON,
disable the "semi-consistent read" logic that had been implemented by
myself on the directions of Heikki Tuuri in order to address
https://bugs.mysql.com/bug.php?id=3300 that was motivated by a customer
wanting UPDATE to skip locked rows that do not match the WHERE condition.
It looks like my changes were included in the MySQL 5.1.5
commit ad126d90e019f223470e73e1b2b528f9007c4532; at that time, employees
of Innobase Oy (a recent acquisition of Oracle) had lost write access to
the repository.
The only reason why we set innodb_snapshot_isolation=OFF by default is
backward compatibility with applications, such as the one that motivated
the implementation of "semi-consistent read" back in 2005. In a later
major release, we can default to innodb_snapshot_isolation=ON.
Thanks to Peter Alvaro, Kyle Kingsbury and Alexey Gotsman for their work
on https://github.com/jepsen-io/ and to Kyle and Alexey for explanations
and some testing of this fix.
Thanks to Vladislav Lesin for the initial test for MDEV-26643,
as well as reviewing these changes.
In MariaDB up to 10.11, the test_if_cheaper_ordering() code (that tries
to optimizer how GROUP BY is executed) assumes that if a table scan is used
then if there is any index usable by GROUP BY it will be used.
The reason MySQL 10.4 provides a better plan is because of two differences:
- Plans using 'ref' has a cost of 1/10 of what it should be (as a
protection against table scans). This is why 'ref' is used in 10.4
and not in 10.5.
- When 'ref' is used, then GROUP BY will not use an index for GROUP BY.
In MariaDB 10.5 the chosen plan is a table scan (as it calculated to be
faster) but as 'ref' is not used, the test_if_cheaper_ordering()
optimizer phase decides (as ref is not usd) to use an index for GROUP BY,
which has bad performance.
Description of fix:
- All new code is protected by the "optimizer_adjust_secondary_key_costs"
variable, which is now a bit map, and is only executed if the option
"disable_forced_index_in_group_by" set.
- Corrects GROUP BY handling in test_if_cheaper_ordering() by making
the choise of using and index with GROUP BY cost based instead of rule
based.
- Adds TIME_FOR_COMPARE to all costs, when using group by, to make
read_time, index_scan_time and range_cost comparable.
Other things:
- Made optimizer_adjust_secondary_key_costs a bit map (compatible with old
code).
Notes:
Current code ignores costs for the algorithm used when doing GROUP
BY on the first table:
- Create an in-memory temporary table for handling group by and doing a
filesort of the result file
We can probably in 10.6 continue to ignore this cost.
This patch should NOT be merged to 11.0 series (not needed in 11.0).
Reason:
======
undo_space_dblwr test case fails if the first page of undo
tablespace is not flushed before restart the server. While
restarting the server, InnoDB fails to detect the first
page of undo tablespace from doublewrite buffer.
Fix:
===
Use "ib_log_checkpoint_avoid_hard" debug sync point
to avoid checkpoint and make sure to flush the
dirtied page before killing the server.
innodb_make_page_dirty(): Fails to set
srv_fil_make_page_dirty_debug variable.
Most things where wrong in the test suite.
The one thing that was a bug was that table_map_id was in some places
defined as ulong and in other places as ulonglong. On Linux 64 bit this
is not a problem as ulong == ulonglong, but on windows this caused failures.
Fixed by ensuring that all instances of table_map_id are ulonglong.