Some fixes related to commit f838b2d799 and
Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row()
for system-versioned tables were provided by Nikita Malyavin.
This was required by test versioning.rpl,trx_id,row.
The mini-benchmark.sh script failed to run in the latest Fedora
distributions in GitLab CI. Executing the benchmark inside a Docker
container had failed because the check for `perf` was done in a way that
caused the benchmark to exit because of the `set -e` option. Test and
skip `perf` to allowing the remaining benchmark activities to proceed.
This check was added in acb6684 but inadvertantly reverted in 42a1f94.
Logic was corrected to only run perf when the flag is enabled, and to
prevent perf stat and perf record from being simultaneously enabled.
Set -ex is also added to enable easier identification of mini-benchmark
issues in the future.
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
Item_func_group_concat::print() did not take into account
that Item_func_group_concat::separator can be of a different character set
than the "String *str" (when the printing is being done to).
Therefore, printing did not work correctly for:
- non-ASCII separators when GROUP_CONCAT is done on 8bit data
or multi-byte data with mbminlen==1.
- all separators (even including simple ones like comma)
when GROUP_CONCAT is done on ucs2/utf16/utf32 data (mbminlen>1).
Because of this problem, VIEW definitions did not print correctly to
their FRM files. This later led to a wrong SELECT and SHOW CREATE output.
Fix:
- Adding new String methods:
bool append_for_single_quote_using_mb_wc(const char *str, size_t length,
CHARSET_INFO *cs);
bool append_for_single_quote_opt_convert(const char *str,
size_t length,
CHARSET_INFO *cs)
which perform both escaping and character set conversion at the same time.
- Adding a new String method escaped_wc_for_single_quote(),
to reuse the code between the old and the new methods.
- Fixing Item_func_group_concat::print() to use the new
method append_for_single_quote_opt_convert().
innodb_log_spin_wait_delay_update(): Always acquire log_sys.latch
to protect the change of mtr_t::spin_wait_delay.
log_t::lock_lsn(): In the general case, actually use
mtr_t::spin_wait_delay as it was intended. In the x86 specific
log_t::lock_lsn_bts() we used mtr_t::spin_wait_delay.
Queries that select concatenated constant strings now have
colname and value that match. For example,
SELECT '123' 'x';
will return a result where the column name and value both
are '123x'.
Review: Daniel Black
.. even with MDEV-9095 fix
CapabilityBounding sets require filesystem setcap attributes
for the executable to gain privileges during execution.
A side effect of this however is the getauxvec(AT_SECURE) gets
set, and the secure_getenv from OpenSSL internals on
OPENSSL_CONF environment variable will get ignored (openssl gh issue
21770).
According to capabilities(7), Ambient capabilities don't trigger
ld.so triggering the secure execution mode.
Include SELinux and Apparmor capabilities for ipc_lock
This was the orginal implementation that reverted with a bunch of
commits.
This reverts commit a13e521bc5.
Revert "cmake: append to the array correctly"
This reverts commit 51e3f1daf5.
Revert "build failure with cmake < 3.10"
This reverts commit 49cf702ee5.
Revert "MDEV-33301 memlock with systemd still not working"
This reverts commit 8a1904d782.
We should not set debug sync point when holding a mutex
to avoid mutex ordering failure.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
User transactions may acquire explicit MDL locks from InnoDB level
when persistent statistics is re-read for a table.
If such a transaction would be subject to BF-abort, it was improperly
detected as a system transaction and wouldn't get aborted.
The fix: Check if a transaction holding explicit MDL locks is a user
transaction in the MDL conflict handling code.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Add "real ip:<ip_or_localhost>" part to the aborted message
Only for proxy-protocoled connection, so it does not not to cause
confusion to normal users.
Problem is that not all conflicting transactions have THD object.
Therefore, it must be checked that victim has THD
before it's identification is added to victim list as victim's
thread identification is later requested using thd_get_thread_id
function that requires that we have valid pointer to THD object
in trx->mysql_thd.
Victim might not have trx->mysql_thd in two cases:
(1) An incomplete transaction that was recovered from undo logs
on server startup (and not yet rolled back).
(2) Transaction that is in XA PREPARE state and whose client
connection was disconnected.
Neither of these can complete before lock_wait_wsrep()
releases lock_sys.latch.
(1) trx_t::commit_in_memory() is clearing both
trx_t::state and trx_t::is_recovered before it invokes
lock_release(trx_t*) (which would be blocked by the exclusive
lock_sys.latch that we are holding here). Hence, it is not
possible to write a debug assertion to document this scenario.
(2) If is in XA PREPARE state, it would eventually be rolled
back and the lock conflict would be resolved when an XA COMMIT
or XA ROLLBACK statement is executed in some other connection.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
.. is not updating some system tables
Some schema changes from MDEV-24312 master_host has 60 character limit, increase to 255 bytes
failed to happen in the upgrade for tables in the mysql schema:
* mysql.global_priv
* mysql.procs_priv
* mysql.proxies_priv
* mysql.roles_mapping
When we commit empty transaction we should allow wsrep
transaction to be on s_must_replay state for DDL that
was killed during certification.
Fix is tested with RQG because deterministic mtr-testcase
was not found.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
MONITOR_INC_VALUE_CUMULATIVE is a multiline macro, so the second statement
will be executed always, regardless of "if" condition.
These problems first started with
commit b1ab211dee (MDEV-15053).
Thanks to Yury Chaikou from ServiceNow for the report.
From the correctness point of view, it should be safe to release
all locks on index records that were not modified by the transaction.
Doing so should make the locks after XA PREPARE fully compatible
with what would happen if the server were restarted: InnoDB table
IX locks and exclusive record locks would be resurrected based on
undo log records.
Concurrently running transactions that are waiting for a lock may invoke
lock_rec_convert_impl_to_expl() to create an explicit record lock object
on behalf of the lock-owning transaction so that they can attaching
their waiting lock request on the explicit record lock object. Explicit
locks would be released by trx_t::release_locks() during commit or
rollback.
Any clustered index record whose DB_TRX_ID belongs to a transaction that
is in active or XA PREPARE state will be implicitly locked by that
transaction. On XA PREPARE, we can release explicit exclusive locks on
records whose DB_TRX_ID does not match the current transaction identifier.
lock_rec_unlock_unmodified(): Release record locks that are not implicitly
held by the current transaction.
lock_release_on_prepare_try(), lock_release_on_prepare():
Invoke lock_rec_unlock_unmodified().
row_trx_id_offset(): Declare non-static.
lock_rec_unlock(): Replaces lock_rec_unlock_supremum().
Reviewed by: Vladislav Lesin
By design, InnoDB has always hung when permanently running out of
buffer pool, for example when several threads are waiting to allocate
a block, and all of the buffer pool is buffer-fixed by the active threads.
The hang that we are fixing here occurs when the buffer pool is only
temporarily running out and the situation could be rescued by writing out
some dirty pages or evicting some clean pages.
buf_LRU_get_free_block(): Simplify the way how we wait for
the buf_flush_page_cleaner thread. This fixes occasional hangs
of the test encryption.innochecksum that were introduced by
commit a55b951e60 (MDEV-26827).
To play it safe, we use a timed wait when waiting for the
buf_flush_page_cleaner() thread to perform its job. Should that
thread get stuck, we will invoke buf_pool.LRU_warn() in order to
display a message that pages could not be freed, and keep trying
to wake up the buf_flush_page_cleaner() thread.
The INFORMATION_SCHEMA.INNODB_METRICS counters
buffer_LRU_single_flush_failure_count and
buffer_LRU_get_free_waits will be removed.
The latter is represented by buffer_pool_wait_free.
Also removed will be the message
"InnoDB: Difficult to find free blocks in the buffer pool"
because in d34479dc66 we
introduced a more precise message
"InnoDB: Could not free any blocks in the buffer pool"
in the buf_flush_page_cleaner thread.
buf_pool_t::LRU_warn(): Issue the warning message that we could
not free any blocks in the buffer pool. This may also be invoked
by buf_LRU_get_free_block() if buf_flush_page_cleaner() appears
to be stuck.
buf_pool_t::n_flush_dec(): Remove.
buf_pool_t::n_flush_dec_holding_mutex(): Rename to n_flush_dec().
buf_flush_LRU_list_batch(): Increment the eviction counter for blocks
of temporary, discarded or dropped tablespaces.
buf_flush_LRU(): Make static, and remove the constant parameter
evict=false. The only caller will be the buf_flush_page_cleaner()
thread.
IORequest::is_LRU(): Remove. The only case of evicting pages on
write completion will be when we are writing out pages of the
temporary tablespace. Those pages are not in buf_pool.flush_list,
only in buf_pool.LRU.
buf_page_t::flush(): Remove the parameter evict.
buf_page_t::write_complete(): Change the parameter "bool temporary"
to "bool persistent" and add a parameter for an already read state().
Reviewed by: Debarun Banerjee
The log_sys.lsn_lock is a very contended resource with a small
critical section in log_sys.append_prepare(). On many processor
microarchitectures, replacing the system call based log_sys.lsn_lock
with a pure spin lock would fare worse during high concurrency workloads,
wasting a significant amount of CPU cycles in the spin loop.
On other microarchitectures, we would see a significant amount of time
being spent in native_queued_spin_lock_slowpath() in the Linux kernel,
plus context switching between user and kernel address space. This was
pointed out by Steve Shaw from Intel Corporation.
Depending on the workload and the hardware implementation, it may be
useful to use a pure spin lock in log_sys.append_prepare().
We will introduce a parameter. The statement
SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=50;
would enable a spin lock that will execute that many MY_RELAX_CPU()
operations (such as the x86 PAUSE instruction) between successive
attempts of acquiring the spin lock. The use of a system call based
log_sys.lsn_lock (which is the default setting) can be enabled by
SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=0;
This patch will also introduce #ifdef LOG_LATCH_DEBUG
(part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON) for more accurate
tracking of log_sys.latch ownership and reorganize the fields of
log_sys to improve the locality of reference and to reduce the
chances of false sharing.
When a spin lock is being used, it will be maintained in the
most significant bit of log_sys.buf_free. This is useful, because that is
one of the fields that is covered by the lock. For IA-32 or AMD64, we
implement the spin lock specially via log_t::lsn_lock_bts(), employing the
i386 LOCK BTS instruction. A straightforward std::atomic::fetch_or() would
translate into an inefficient loop around LOCK CMPXCHG.
mtr_t::spin_wait_delay: The value of innodb_log_spin_wait_delay.
mtr_t::finisher: Pointer to the currently used mtr_t::finish_write()
implementation. This allows to avoid introducing conditional branches.
We no longer invoke log_sys.is_pmem() at the mini-transaction level,
but we would do that in log_write_up_to().
mtr_t::finisher_update(): Update finisher when spin_wait_delay is
changed from or to 0 (the spin lock is changed to log_sys.lsn_lock or
vice versa).
When using semi-sync replication with
rpl_semi_sync_master_wait_point=AFTER_COMMIT, the performance of the
primary can significantly reduce compared to AFTER_SYNC's
performance for workloads with many concurrent users executing
transactions. This is because all connections on the primary share
the same cond_wait variable/mutex pair, so any time an ACK is
received from a replica, all waiting connections are awoken to check
if the ACK was for itself, which is done in mutual exclusion.
This patch changes this such that the waiting THD will use its own
local condition variable, and the ACK receiver thread only signals
connections which have been ACKed for wakeup. That is, the
THD::LOCK_wakeup_ready condition variable is re-used for this
purpose, and the Active_tranx queue nodes are extended to hold the
waiting thread, so it can be signalled once ACKed.
Additionally:
1) Removed part of MDEV-11853 additions, which allowed suspended
connection threads awaiting their semi-sync ACKs to live until their
ACKs had been received. This part, however, wasn't needed. That is,
all that was needed was for the Ack_thread to survive. So now the
connection threads are killed during phase 1. Thereby
THD::is_awaiting_semisync_ack, and all its related code was removed.
2) COND_binlog_send is repurposed to signal on the condition when
Active_tranx is emptied during clear_active_tranx_nodes.
3) At master shutdown (when waiting for slaves), instead of the
main loop individually waiting for each ACK, await_slave_reply()
(renamed await_all_slave_replies()) just waits once for the
repurposed COND_binlog_send to signal it is empty.
4) Test rpl_semi_sync_shutdown_await_ack is updates as following:
4.1) Added test case (adapted from Kristian Nielsen) to ensure
that if a thread awaiting its ACK is killed while SHUTDOWN WAIT FOR
ALL SLAVES is issued, the primary will still wait for the ACK from
the killed thread.
4.2) As connections which by-passed phase 1 of thread killing no
longer are delayed for kill until phase 2, we can no longer query
yes/no tx after receiving an ACK/timeout. The check for these
variables is removed.
4.3) Comment descriptions are updated which mention that the
connection is alive; and adjusted to be the Ack_thread.
Reviewed By:
============
Kristian Nielsen <knielsen@knielsen-hq.org>
https://jepsen.io/analyses/mysql-8.0.34 highlights that the
transaction isolation levels in the InnoDB storage engine do not
correspond to any widely accepted definitions, such as
"Generalized Isolation Level Definitions"
https://pmg.csail.mit.edu/papers/icde00.pdf
(PL-1 = READ UNCOMMITTED, PL-2 = READ COMMITTED, PL-2.99 = REPEATABLE READ,
PL-3 = SERIALIZABLE).
Only READ UNCOMMITTED in InnoDB seems to match the above definition.
The issue is that InnoDB does not detect write/write conflicts
(Section 4.4.3, Definition 6) in the above.
It appears that as soon as we implement write/write conflict detection
(SET SESSION innodb_snapshot_isolation=ON), the default isolation level
(SET TRANSACTION ISOLATION LEVEL REPEATABLE READ) will become
Snapshot Isolation (similar to Postgres), as defined in Section 4.2 of
"A Critique of ANSI SQL Isolation Levels", MSR-TR-95-51, June 1995
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf
Locking reads inside InnoDB used to read the latest committed version,
ignoring what should actually be visible to the transaction.
The added test innodb.lock_isolation illustrates this. The statement
UPDATE t SET a=3 WHERE b=2;
is executed in a transaction that was started before a read view or
a snapshot of the current transaction was created, and committed before
the current transaction attempts to execute
UPDATE t SET b=3;
If SET innodb_snapshot_isolation=ON is in effect when the second
transaction was started, the second transaction will be aborted with
the error ER_CHECKREAD. By default (innodb_snapshot_isolation=OFF),
the second transaction would execute inconsistently, displaying an
incorrect SELECT COUNT(*) FROM t in its read view.
If innodb_snapshot_isolation=ON, if an attempt to acquire a lock on a
record that does not exist in the current read view is made, an error
DB_RECORD_CHANGED (HA_ERR_RECORD_CHANGED, ER_CHECKREAD) will
be raised. This error will be treated in the same way as a deadlock:
the transaction will be rolled back.
lock_clust_rec_read_check_and_lock(): If the current transaction has
a read view where the record is not visible and
innodb_snapshot_isolation=ON, fail before trying to acquire the lock.
row_sel_build_committed_vers_for_mysql(): If innodb_snapshot_isolation=ON,
disable the "semi-consistent read" logic that had been implemented by
myself on the directions of Heikki Tuuri in order to address
https://bugs.mysql.com/bug.php?id=3300 that was motivated by a customer
wanting UPDATE to skip locked rows that do not match the WHERE condition.
It looks like my changes were included in the MySQL 5.1.5
commit ad126d90e019f223470e73e1b2b528f9007c4532; at that time, employees
of Innobase Oy (a recent acquisition of Oracle) had lost write access to
the repository.
The only reason why we set innodb_snapshot_isolation=OFF by default is
backward compatibility with applications, such as the one that motivated
the implementation of "semi-consistent read" back in 2005. In a later
major release, we can default to innodb_snapshot_isolation=ON.
Thanks to Peter Alvaro, Kyle Kingsbury and Alexey Gotsman for their work
on https://github.com/jepsen-io/ and to Kyle and Alexey for explanations
and some testing of this fix.
Thanks to Vladislav Lesin for the initial test for MDEV-26643,
as well as reviewing these changes.
Though the test itself doesn't create any transactions
directly, the added test suppressions are replicated,
and when the SQL thread is stopped mid-execution,
it is set into an error state because these are
non-transactional events being aborted.
This patch fixes the test by ensuring that the test
suppressions are fully replicated before continuing
Problem:
=======
- In case of large file size, InnoDB eagerly adds the new extent
even though there are many existing unused pages of the segment.
Reason is that in case of larger file size, threshold
(1/8 of reserved pages) for adding new extent has been
reached frequently.
Solution:
=========
- Try to utilise the unused pages in the segment before adding
the new extent in the file segment.
need_for_new_extent(): In case of larger file size, try to use
the 4 * FSP_EXTENT_SIZE as threshold to allocate the new extent.
fseg_alloc_free_page_low(): Rewrote the function to allocate
the page in the following order.
1) Try to get the page from existing segment extent.
2) Check whether the segment needs new extent
(need_for_new_extent()) and allocate the new extent,
find the page.
3) Take individual page from the unused page from
segment or tablespace.
4) Allocate a new extent and take first page from it.
Removed FSEG_FILLFACTOR, FSEG_FRAG_LIMIT variable.
Do *not* check if socket is closed by another thread. This is
race-condition prone, unnecessary, and harmful. VIO state was introduced
to debug the errors, not to change the behavior.
Rather than checking if socket is closed, add a DBUG_ASSERT that it is
*not* closed, because this is an actual logic error, and can potentially
lead to all sorts of funny behavior like writing error packets to Innodb
files.
Unlike closesocket(), shutdown(2) is not actually race-condition prone,
and it breaks poll() and read(), and it worked for longer than a decade,
and it does not need any state check in the code.
Let us skip the recently added test main.mysql-interactive if
an instrumented ncurses library is not available.
In InnoDB, let us work around an uninstrumented libnuma, by
declaring that the objects returned by numa_get_mems_allowed()
are initialized.
Starting with clang-16, MemorySanitizer appears to check that
uninitialized values not be passed by value nor returned.
Previously, it was allowed to copy uninitialized data in such cases.
get_foreign_key_info(): Remove a local variable that was passed
uninitialized to a function.
DsMrr_impl: Initialize key_buffer, because DsMrr_impl::dsmrr_init()
is reading it.
test_bind_result_ext1(): MYSQL_TYPE_LONG is 32 bits, hence we must
use a 32-bit type, such as int. sizeof(long) differs between
LP64 and LLP64 targets.
maria_repair_parallel() clears the MY_THREAD_SPECIFIC flag for allocations
since it uses different threads. But it still did one _ma_alloc_buffer()
call as thread-specific which would later assert if another thread needed
to extend the buffer with realloc.
This patch, due to Monty, removes the MY_THREAD_SPECIFIC flag for
allocations that need to realloc in different threads, and preserves
it for those that are allocated/freed in the user's thread.
Also fixes MDEV-33562.
Reviewed-by: Monty <monty@mariadb.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Remove work-around that disables bulk insert optimization in replication
The root cause of the original problem is now fixed (MDEV-33475). Though the
bulk insert optimization will still be disabled in replication, as it is
only enabled in special circumstances meant for loading a mysqldump.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
An earlier patch for MDEV-13577 fixed the most common instances of this, but
missed one case for tables without primary key when the scan reaches the end
of the table. This patch adds similar code to handle this case, converting
the error to HA_ERR_RECORD_CHANGED when doing optimistic parallel apply.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This patch makes the server wait for the manager thread to actually start
before proceeding with server startup.
Without this, if thread scheduling is really slow and the server shutdowns
quickly, then it is possible that the manager thread is not yet started when
shutdown_performance_schema() is called. If the manager thread starts at just
the wrong moment and just before the main server reaches exit(), the thread
can try to access no longer available performance schema data. This was seen
as occasional assertion in the main.bootstrap test.
As an additional improvement, make sure to run all pending actions before
exiting the manager thread.
Reviewed-by: Monty <monty@mariadb.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
eliminate_item_equal() uses quick_fix_field() for Item objects it creates.
It computes some of their attributes on its own (see update_used_tables()
call) but it doesn't update not_null_tables_cache.
Recompute not_null_tables_cache also. Not computing it is currently
harmless, except for producing MSAN error when some other code
propagates the wrong value of not_null_tables_cache to other item.
- Suppress the "Difficult to find free blocks" warning
globally to avoid many different test case failing.
- Demote the error information in validate_first_page() to note.
So first page can recovered from doublewrite buffer and can throw
error in case the page wasn't found in doublewrite buffer.
to iterate over all status variables one should use
LOCK_all_status_vars not LOCK_status
this fixes sporadic mutex lock inversion in plugins.password_reuse_check:
* acl_cache->lock is taken over complex operations that might increment
status counters (under LOCK_status).
* acl_cache->lock is needed to get the values of Acl% status variables
when iterating over status variables