This commit implements
mysql/mysql-server@7037a0bdc8
functionality, i.e. if some transaction A holds a not-gap S-lock on some
record, and some other transactions B={b1, b2, ..., bn} have not-gap
X-locks waiting for the S-lock of transaction A, and transaction A
requests a not-gap, non-insert-intention X-lock which conflicts with
the X-locks of transactions B and does not conflict with other locks
in the queue, then grant the X-lock to transaction A.
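As an illustration, here is a minimal self-contained model of that rule
(toy types and names; the real decision also involves gap and
insert-intention subtleties that are omitted here):

  #include <cassert>
  #include <vector>

  // Toy model of the grant-bypass rule; trx/lock types are stand-ins.
  enum Mode { S, X };
  struct Lock { int trx; Mode mode; bool granted; bool gap; };

  // Does a new request (trx, mode) conflict with lock l?
  static bool conflicts(int trx, Mode mode, const Lock &l)
  {
    if (trx == l.trx) return false;    // own locks never conflict
    return mode == X || l.mode == X;   // S+S is the only compatible pair
  }

  // Can the new not-gap X request of trx be granted by bypassing the
  // conflicting locks that are merely waiting?
  static bool can_bypass(const std::vector<Lock> &queue, int trx)
  {
    bool holds_granted = false;
    for (const Lock &l : queue)
    {
      if (l.trx == trx && l.granted && !l.gap) holds_granted = true;
      else if (l.granted && conflicts(trx, X, l)) return false; // real conflict
      // conflicting but waiting locks are bypassed
    }
    return holds_granted;
  }

  int main()
  {
    // S1(granted) X2(waiting) X3(waiting): trx 1 may bypass the waiters.
    std::vector<Lock> q{{1, S, true, false}, {2, X, false, false},
                        {3, X, false, false}};
    assert(can_bypass(q, 1));
    q.push_back({4, X, true, false}); // a granted conflicting lock blocks it
    assert(!can_bypass(q, 1));
  }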
MySQL's commit contains the following explanation of why insert-intention
locks must not overtake waiting ordinary or gap locks:
"It is important that this decission rule doesn't allow
INSERT_INTENTION locks to overtake WAITING locks on gaps (`S`, `S|GAP`,
`X`, `X|GAP`), as inserting a record into a gap would split such WAITING
lock, violating the invariant that each transaction can have at most
single WAITING lock at any time."
I would add the following to the explanation. Suppose we have trx 1 which
holds an ordinary X-lock on some record. And trx 2 executes "DELETE FROM t"
or "SELECT * FOR UPDATE" in RR (see lock_delete_updated.test and
MDEV-27992), i.e. it creates a waiting ordinary X-lock on the same record.
And then trx 1 wants to insert some record just before the locked record.
It requests an insert-intention lock, and if that lock overtakes the trx 2
lock, there will be phantom records for trx 2 in RR.
lock_delete_updated.test shows how "DELETE" allows inserting records into
an already scanned gap and misses some records to delete.
The current implementation differs from the MySQL implementation. There
are two key differences:
1. Lock queue ordering. In MySQL all granted locks precede all waiting
locks: a new granted lock is added to the head of the queue, a new
waiting lock is added to the tail, and when a waiting lock is granted,
it is moved to the granted part of the queue. In MariaDB any new
lock is added to the end of the queue, and a waiting lock does not change
its position in the queue when the lock is granted. The rule is that a
blocking lock must be located before the blocked lock in the lock queue.
We maintain this rule by inserting the bypassing lock just before the
bypassed one.
2. The MySQL implementation uses an object (locksys::Trx_locks_cache) which
can be passed to consecutive calls to rec_lock_has_to_wait() for the
same trx and heap_no to cache the result of checking whether trx has a
granted lock which is blocking the waiting lock (see
locksys::Trx_locks_cache::has_granted_blocker()). The current
implementation does not use such an object, because it looks for such a
granted lock at the level of lock_rec_other_has_conflicting() and
lock_rec_has_to_wait_in_queue(). I.e. there is no need for an additional
lock queue iteration in
locksys::Trx_locks_cache::has_granted_blocker(), as we already iterate
the queue in lock_rec_other_has_conflicting() and
lock_rec_has_to_wait_in_queue().
During testing the following case was found. Suppose we have a
delete-marked record and are going to do an in-place insert into
that delete-marked record. Usually we don't create an explicit lock if
there are no locks conflicting with a not-gap X-lock (see
lock_clust_rec_modify_check_and_lock(), btr_cur_update_in_place()). The
implicit lock will be converted to an explicit one on demand.
That can happen during INSERT: a not-gap S-lock can
be acquired when searching for duplicates (see
row_ins_duplicate_error_in_clust()), and, if a delete-marked record is
found, the in-place insert (see btr_cur_upd_rec_in_place()) modifies the
record, which is treated as an implicit lock.
But there can be a case when some transaction trx1 holds a not-gap S-lock,
another transaction trx2 creates a waiting X-lock, and then trx1 tries to
do an in-place insert. Before the fix, the waiting X-lock of trx2 would be
a conflicting lock, and trx1 would try to create an explicit X-lock, which
would cause a deadlock, and one of the transactions would be rolled back.
But after the fix, the trx2 waiting X-lock is not treated as conflicting
with the trx1 X-lock anymore, as trx1 already holds an S-lock. If we don't
create an explicit lock, then some other transaction trx3 can create it
during implicit-to-explicit lock conversion and place it at the end of the
queue. So there can be the following lock order in the queue:
S1(granted) X2(waiting) X1(granted)
The above queue is not valid, because all granted trx1 locks must be
placed before the waiting trx2 lock. Besides, lock_rec_release_try() can
remove the S(granted, trx1) lock and grant the X lock to trx2, and there
can be two granted X-locks on the same record:
X2(granted) X1(granted)
Taking into account that lock_rec_release_try() can release the cell and
lock_sys latches leaving some locks unreleased, the queue validation
function can fail at any unexpected point.
It can be fixed in two ways:
1) Place the explicit X(granted, trx1) lock before the X(waiting, trx2)
lock during implicit-to-explicit lock conversion. This option is
implemented in MySQL, as a granted lock is always placed at the head of
the lock queue, and waiting locks are placed at the tail. MariaDB does
not do this, and implementing this variant would require a search for
conflicting locks before converting the implicit lock to an explicit
one, which, in turn, would require acquiring the cell and/or lock_sys
latch.
2) Create and place the X(granted, trx1) lock before X(waiting, trx2)
during the in-place INSERT, i.e. when lock_rec_lock() is invoked from
lock_clust_rec_modify_check_and_lock() or
lock_sec_rec_modify_check_and_lock(), if X(waiting, trx2) is
bypassed. This way we don't need an additional conflicting-lock
search, as conflicting locks are searched anyway in lock_rec_low().
This fix implements the second variant (see the changes around
c_lock_info.insert_after in lock_rec_lock()). I.e. if some record was
delete-marked and we do an in-place insert into such a record, and some
lock to bypass was found, create an explicit lock to avoid a
conflicting-lock search on each implicit-to-explicit lock conversion.
We can remove this once MDEV-35624 is implemented.
lock_rec_other_has_conflicting(), lock_rec_has_to_wait_in_queue():
search for locks to bypass along with searching for conflicting locks in
the same loop. The result is returned in the conflicting_lock_info object.
There can be several locks to bypass; only the first one is returned, to
limit lock_rec_find_similar_on_page() to the part of the queue before the
first bypassed lock and so preserve the "blocking before blocked"
invariant. conflicting_lock_info also contains a pointer to the lock after
which we can insert the bypassing lock; this lock precedes the bypassed one.
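The shape of the result object might look roughly like this (a sketch;
the field names are assumptions, the commit refers to the object as
conflicting_lock_info/c_lock_info):

  // Hypothetical shape of the search result; lock_t is the queue node type.
  struct lock_t;
  struct conflicting_lock_info_t
  {
    lock_t *conflicting;   // granted conflicting lock: the requester waits
    lock_t *bypassed;      // first waiting lock that the new lock may bypass
    lock_t *insert_after;  // lock preceding the bypassed one; the bypassing
                           // lock is inserted after it, keeping
                           // "blocking before blocked"
  };

  int main() { conflicting_lock_info_t info{}; return info.bypassed ? 1 : 0; }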
The bypassing lock can be a next-key lock, and the following cases are
possible:
1. S1(not-gap, granted) II2(granted) X3(waiting for S1).
When a new X1(ordinary) lock is acquired, there will be the following
lock queue:
S1(not-gap, granted) II2(granted) X1(ordinary, granted) X3(waiting for
S1)
If we had inserted the new X1 lock just after S1, and S1 had been
released on transaction commit or rollback, we would have the following
sequence in the lock queue:
X1(ordinary, granted) II2(granted) X3(waiting for X1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is not a real issue, as an II lock, once granted, can be ignored,
but it could possibly hit some assertion (taking into account that
lock_release_try() can release the lock_sys latch, and other threads
can acquire the latch and validate the lock queue), as it breaks our
design constraint that any granted lock in the queue must not conflict
with locks ahead of it in the queue. But lock_rec_queue_validate() does
not check the above constraint. We place the new bypassing lock just
before the bypassed one, but there can still be the case when a lock
bitmap is used instead of creating a new lock object (see
lock_rec_add_to_queue() and lock_rec_find_similar_on_page()), and the
lock which owns the bitmap can precede II2(granted). We can either
disable the lock_rec_find_similar_on_page() space optimization for
bypassing locks or treat the "X1(ordinary, granted) II2(granted)"
sequence as valid. As we don't currently have a function which would
fail on the above sequence, let's treat it as valid for the case when
lock_release() execution is in progress.
2. S1(ordinary, granted) II2(waiting for S1) X3(waiting for S1)
When a new X1(ordinary) lock is acquired, there will be the following
lock queue:
S1(ordinary, granted) II2(waiting for S1) X1(ordinary, granted)
X3(waiting for S1).
After S1 is released there will be:
II2(granted) X1(ordinary, granted) X3(waiting for S1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The above queue is valid because an ordinary lock does not conflict with
an II-lock (see lock_rec_has_to_wait()).
lock_rec_create_low(): insert the new lock at the position that
lock_rec_other_has_conflicting() and lock_rec_has_to_wait_in_queue()
return if the lock is bypassing.
lock_rec_find_similar_on_page(): add the ability to limit the similar-lock
search to a certain lock, to preserve the "blocking before blocked"
invariant for all bypassed locks.
lock_rec_add_to_queue(): don't treat bypassed locks as waiting ones, to
allow lock bitmap reuse for bypassing locks.
lock_rec_lock(): fix the in-place insert case explained above.
lock_rec_dequeue_from_page(), lock_rec_rebuild_waiting_queue(): move the
bypassing lock to its correct place to preserve the "blocking before
blocked" invariant.
The function is supposed to get the previous lock set on a record, but
if there are several locks set on the record, it will return only the
first one. Continue the lock list iteration up to the given lock even if
the corresponding bit in the lock bitmap is set.
The assertion fails during the wsrep recovery step, in the function
innobase_rollback_by_xid(). The transaction's xid is normally
cleared as part of the lookup by xid, unless the transaction has
a wsrep-specific xid.
This is a regression from MDEV-24035 (commit ddd7d5d8e3),
which removed the code that clears the xid before rollback for
transactions with a wsrep-specific xid.
In the function buf_page_create_low(), remove duplicate code that
incorrectly overwrote the ibuf_exist variable when only the compressed
page is loaded in the buffer pool. This helps remove any old change
buffer records immediately before reusing the page.
strerror_s on Linux will, for unknown error codes, display
'Unknown error <codenum>', and our tests are written with this assumption.
However, on macOS, strerror_s returns 'Unknown error: <codenum>' in the
same case, which breaks tests. Make my_strerror consistent across
platforms by removing the ':' when present.
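A sketch of the normalization step, assuming the message has already been
produced into a caller-owned buffer (toy helper, not the actual
my_strerror code):

  #include <cstring>
  #include <cstdio>

  // If the platform produced "Unknown error: <num>", drop the ':'.
  static void normalize_unknown_error(char *buf)
  {
    static const char prefix[] = "Unknown error:";
    if (!strncmp(buf, prefix, sizeof prefix - 1))
      memmove(buf + sizeof prefix - 2, buf + sizeof prefix - 1,
              strlen(buf + sizeof prefix - 1) + 1);
  }

  int main()
  {
    char msg[64] = "Unknown error: 12345";  // macOS-style text
    normalize_unknown_error(msg);
    puts(msg);                              // prints "Unknown error 12345"
  }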
The previous commit for fixing MDEV-35446 disabled setting
Galera errors on COM_STMT_PREPARE commands.
As a side effect, a number of tests started to fail
due to the client receiving error codes different from the
ones expected in the test, depending on whether --ps-protocol
was used.
Also, in the case of the test galera_ftwrl, it was found that
during a COM_STMT_PREPARE command we may perform a sync wait
operation, which can fail with a LOCK_WAIT_TIMEOUT error.
The revised fix consists in calling wsrep_after_command_before_result()
earlier, so that we check for BF aborts or errors during statement
prepare, before sending the statement metadata message back to the
client.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
Partition tests requiring lower_case_table_names = 2 (default on macOS)
fail on macOS because the product has changed over time but the tests were
not run regularly enough to observe their breakage.
The limit on socket path length on Unix according to libc is 108 (see
sockaddr_un::sun_path), but in the table it is a string of maximum length
64, which results in truncation of the socket path and failure to connect
by plugins using servers such as Spider.
The test turns out to be sensitive to @@global.gtid_cleanup_batch_size.
With a rather small default value of the latter,
SELECTing from mysql.gtid_slave_pos may not be deterministic: tests
that run before may grow the batch pending automatic deletion.
The test is refined to set its own value for the batch size, one
that is virtually unreachable.
Thanks to Kristian Nielsen for the analysis.
zlib-ng results in different compression lengths. The compression
length isn't that important, as the test output examines the
uncompressed results.
fixes for zlib-ng
backport of 75488a57f2
This problem occurred for statements like `INSERT INTO t1 SELECT 1`,
which do not have tables in the SELECT part. In such scenarios
SELECT_LEX::insert_tables was not properly set at `setup_tables()`,
and this led to either incorrect execution or a crash.
Reviewer: Oleksandr Byelkin <sanja@mariadb.com>
This regression was introduced in 10.6 by the following commit:
commit 35d477dd1d
MDEV-34453 Trying to read 16384 bytes at 70368744161280
The page state could change after being buffer-fixed and needs to be
read again after locking the page.
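A toy illustration of the pattern with stand-in types (the real code
deals with buf_page_t and its page latch):

  #include <atomic>
  #include <mutex>

  // Stand-ins: the state may change between the buffer-fix and the latch.
  static std::atomic<int> page_state{0};
  static std::mutex page_latch;

  static int read_state_under_latch()
  {
    int s = page_state.load();                 // may be stale immediately
    std::lock_guard<std::mutex> g(page_latch);
    s = page_state.load();                     // authoritative under latch
    return s;
  }

  int main() { return read_state_under_latch(); }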
Item_func_concat_ws::val_str():
- collects the result into the string "str" passed as a parameter;
- calls val_str(&tmp_buffer) to get arguments.
At some point, due to a heuristic, it decides to swap the buffers:
- collect the result into &tmp_buffer;
- call val_str(str) to get arguments.
Item_func_password::val_str_ascii() returns a String pointing to its
member tmp_value[SCRAMBLED_PASSWORD_CHAR_LENGTH+1].
As a result, it's possible that both str and tmp_buffer in
Item_func_concat_ws::val_str() point to Item_func_password::tmp_value.
Then memcmp() is called on overlapping memory fragments.
Fixing Item_func_password::val_str_ascii() to use Item::copy()
instead of Item::set().
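A self-contained toy showing the aliasing hazard and why copying fixes
it (simplified String with set()/copy() semantics; not the real class):

  #include <cassert>
  #include <cstring>

  // Simplified String: set() aliases external storage, copy() duplicates.
  struct String
  {
    char buf[64];
    const char *ptr = buf;
    size_t len = 0;
    void set(const char *s, size_t l) { ptr = s; len = l; }  // alias only
    void copy(const char *s, size_t l)
    { memcpy(buf, s, l); ptr = buf; len = l; }               // private copy
  };

  struct Item_func_password
  {
    char tmp_value[42];
    // Bug: set() leaves the result pointing into the member buffer, so
    // two String objects can alias the same storage; copy() avoids that.
    void val_str_ascii(String *str)
    {
      strcpy(tmp_value, "*HASH");
      str->copy(tmp_value, 5);    // was: str->set(tmp_value, 5);
    }
  };

  int main()
  {
    Item_func_password p;
    String a, b;
    p.val_str_ascii(&a);
    p.val_str_ascii(&b);
    assert(a.ptr != b.ptr);       // with set() both would alias tmp_value
  }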
trx_sys_t::find_same_or_older_in_purge(): Correct a mistake that
was made in commit 19acb0257e
(MDEV-35508) and make the caching logic correspond to the one in
trx_sys_t::find_same_or_older(). In the more common code path
for 64-bit systems, the condition !hot was inadvertently inverted,
making us wrongly skip calls to find_same_or_older_low() when the
transaction may still be active.
Furthermore, the call should have been to find_same_or_older_low()
and not the wrapper find_same_or_older().
This bug has the same nature as the issues
MDEV-34718: Trigger doesn't work correctly with bulk update
MDEV-24411: Trigger doesn't work correctly with bulk insert
To fix the issue covering all use cases, resetting thd->bulk_param
temporarily to nullptr before invoking triggers and restoring
its original value on finishing execution of a trigger is moved to the
method Table_triggers_list::process_triggers,
which is ultimately invoked for any kind of trigger.
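A sketch of the pattern, with a stand-in THD (the trigger body itself is
elided):

  struct THD { void *bulk_param = nullptr; };  // stand-in for the server class

  // Hide bulk_param from the trigger body, restore it afterwards.
  static bool process_triggers(THD *thd)
  {
    void *save_bulk_param = thd->bulk_param;
    thd->bulk_param = nullptr;      // triggers must not see the bulk context
    bool err = false;               // ... invoke the trigger body here ...
    thd->bulk_param = save_bulk_param;
    return err;
  }

  int main() { THD thd; return process_triggers(&thd); }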
add_special_frame_cursors() did not check the return
value of offset_func->fix_fields(). It can return an error
if the data type does not support the operator "minus".
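The missing check, sketched with stand-in types:

  struct THD;
  struct Item
  {
    virtual bool fix_fields(THD *, Item **) = 0;  // returns true on error
    virtual ~Item() {}
  };

  // Sketch: propagate the error instead of ignoring the return value.
  static bool add_special_frame_cursors(THD *thd, Item *&offset_func)
  {
    if (offset_func->fix_fields(thd, &offset_func))
      return true;  // e.g. the data type does not support "minus"
    // ... set up the frame cursors here ...
    return false;
  }

  struct Item_no_minus : Item
  {
    bool fix_fields(THD *, Item **) { return true; }  // type lacks "minus"
  };

  int main()
  {
    Item_no_minus item;
    Item *func = &item;
    return add_special_frame_cursors(nullptr, func) ? 0 : 1;
  }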
When UNION ALL is used with LIMIT ROWS EXAMINED, and when the limit is
exceeded for a SELECT that is not the last in the UNION, interrupt the
execution and call end_eof on the result. This makes sure that the
results are sent, and the query result status is conclusive rather
than empty, which would cause an assertion failure.
Added get_footprint() implementation for FreeBSD (and for other
non-Linux systems), and added "apparent file size" mode for Linux
to take into account the real file size (without compression) when
used with filesystems like ZFS.
This commit fixes some functions in wsrep_sst_common
to ensure that now and in the future return codes from
a number of helper functions will be zero on success.
Fixed some issues in the script code, mainly related
to handling situations when a failure occurs:
1) the signal handler in the mariadb-backup SST script
was using an uninitialized variable when trying to kill
a hung streaming process;
2) inaccurate error messages were being logged sometimes;
3) after completing SST, temporary or old (extra) files
could remain in database directories.
Because of CF_REEXECUTION_FRAGILE, a prepared SELECT statement needs to
check that the table has the same definition as previously; otherwise a
re-prepare of the statement occurs.
When running many 'SELECT DEFAULT(name) FROM table1_containing_sequence'
in parallel, TABLE_LIST::is_the_same_definition may be called when
m_table_ref_type is TABLE_REF_NULL because it hasn't been checked yet.
In this case populate the TABLE_LIST with the values determined by the
TABLE_SHARE and allow the execution to continue.
As a result of this, the main.ps_ddl test doesn't need to re-prepare,
as the definition hasn't changed. This is another case where
TABLE_LIST::is_the_same_definition is called when m_table_ref_type is
TABLE_REF_NULL, but that doesn't mean that the definition is different.
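A sketch of the intended handling with toy types (the real code adopts
the ref type and version from the TABLE_SHARE; names here are
simplified):

  enum table_ref_type { TABLE_REF_NULL, TABLE_REF_BASE_TABLE };

  struct TABLE_LIST
  {
    table_ref_type m_table_ref_type = TABLE_REF_NULL;
    unsigned long long m_table_ref_version = 0;

    // If the reference was never classified, adopt the TABLE_SHARE's
    // values instead of declaring the definition changed.
    bool is_the_same_definition(table_ref_type t, unsigned long long v)
    {
      if (m_table_ref_type == TABLE_REF_NULL)
      {
        m_table_ref_type = t;      // populate from the TABLE_SHARE
        m_table_ref_version = v;
        return true;               // not a definition change
      }
      return m_table_ref_type == t && m_table_ref_version == v;
    }
  };

  int main()
  {
    TABLE_LIST tl;
    return tl.is_the_same_definition(TABLE_REF_BASE_TABLE, 1) ? 0 : 1;
  }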
Fixing a wrong DBUG_ASSERT.
thd->start_time and thd->start_time_sec_part cannot be 0 at the same time.
But thd->start_time can be 0 when thd->start_time_sec_part is not 0,
e.g. after:
SET timestamp=0.99;
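The corrected condition, as a sketch (assuming the wrong assertion
rejected thd->start_time == 0 alone):

  #include <cassert>

  // start_time and start_time_sec_part cannot both be 0; either alone can.
  static void check_start_time(long start_time, long start_time_sec_part)
  {
    assert(start_time || start_time_sec_part);
  }

  int main() { check_start_time(0, 990000); } // e.g. after SET timestamp=0.99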
Under unknown circumstances, the SQL layer may wrongly disregard an
invocation of thd_mark_transaction_to_rollback() when an InnoDB
transaction had been aborted (rolled back) due to one of the following errors:
* HA_ERR_LOCK_DEADLOCK
* HA_ERR_RECORD_CHANGED (if innodb_snapshot_isolation=ON)
* HA_ERR_LOCK_WAIT_TIMEOUT (if innodb_rollback_on_timeout=ON)
Such an error used to cause a crash of InnoDB during transaction commit.
These changes aim to catch and report the error earlier, so that not only
this crash can be avoided but also the original root cause be found and
fixed more easily later.
The idea of this fix is from Michael 'Monty' Widenius.
HA_ERR_ROLLBACK: A new error code that will be translated into
ER_ROLLBACK_ONLY, signalling that the current transaction
has been aborted and the only allowed action is ROLLBACK.
trx_t::state: Add TRX_STATE_ABORTED that is like
TRX_STATE_NOT_STARTED, but noting that the transaction had been
rolled back and aborted.
trx_t::is_started(): Replaces trx_is_started().
ha_innobase: Check the transaction state in various places.
Simplify the logic around SAVEPOINT.
ha_innobase::is_valid_trx(): Replaces ha_innobase::is_read_only().
The InnoDB logic around transaction savepoints, commit, and rollback
was unnecessarily complex and might have contributed to this
inconsistency. So, we are simplifying that logic as well.
trx_savept_t: Replace with const undo_no_t*. When we rollback to
a savepoint, all we need to know is the number of undo log records
that must survive.
trx_named_savept_t, DB_NO_SAVEPOINT: Remove. We can store undo_no_t
directly in the space allocated at innobase_hton->savepoint_offset.
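A toy sketch of the idea (stand-in types and names; the real interface
is handlerton::savepoint_offset together with the savepoint hooks):

  #include <cstdint>
  #include <cstring>

  typedef uint64_t undo_no_t;  // number of undo log records written so far

  // The SQL layer allocates savepoint_offset bytes per savepoint and
  // hands the storage engine a pointer into that area.
  struct toy_hton { unsigned savepoint_offset = sizeof(undo_no_t); };

  static int savepoint_set(void *sv, undo_no_t undo_count)
  {
    memcpy(sv, &undo_count, sizeof undo_count); // records that must survive
    return 0;
  }

  static int savepoint_rollback(const void *sv, undo_no_t *keep)
  {
    memcpy(keep, sv, sizeof *keep); // roll back everything written after
    return 0;
  }

  int main()
  {
    toy_hton h;
    unsigned char area[sizeof(undo_no_t)];
    savepoint_set(area, 7);
    undo_no_t keep;
    savepoint_rollback(area, &keep);
    return keep == 7 && h.savepoint_offset == sizeof(undo_no_t) ? 0 : 1;
  }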
fts_trx_create(): Do not copy previous savepoints.
fts_savepoint_rollback(): If a savepoint was not found, roll back
everything after the default savepoint of fts_trx_create().
The test innodb_fts.savepoint is extended to cover this code.
Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
Most resource limit information is excessive, particularly
limits that aren't limited.
We restructure the output by considering the Linux format
of /proc/limits, which has its soft limits beginning at offset
26. "u"nlimited lines are skipped.
Example output:
Resource Limits (excludes unlimited resources):
Limit Soft Limit Hard Limit Units
Max stack size 8388608 unlimited bytes
Max processes 127235 127235 processes
Max open files 32198 32198 files
Max locked memory 8388608 8388608 bytes
Max pending signals 127235 127235 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
This is 8 lines fewer than before.
The FreeBSD limits file was /proc/curproc/rlimit, and in a different
format, so no loss for non-Linux proc-enabled OSes is expected.
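A sketch of the filtering, assuming the Linux /proc/self/limits format
with the soft-limit column at offset 26:

  #include <cstdio>
  #include <cstring>

  // Print /proc/self/limits, skipping rows whose soft limit is "unlimited".
  static void print_limited_resources()
  {
    FILE *f = fopen("/proc/self/limits", "r");
    if (!f) return;                  // non-Linux: no such file
    char line[256];
    puts("Resource Limits (excludes unlimited resources):");
    while (fgets(line, sizeof line, f))
    {
      if (strlen(line) > 26 && line[26] == 'u')
        continue;                    // soft limit "unlimited": skip
      fputs(line, stdout);
    }
    fclose(f);
  }

  int main() { print_limited_resources(); }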
Provide the bug URL in addition to how to report the bug.
Remove obsolete information like key_buffers and used connections
as they haven't meaningfully added value to a bug report for quite
a while. Remove information that comes from long fixed interfaces
in glibc/kernel.
Encourage the use of a full backtrace from the core with debug
symbols.
Let's be realistic about the error messages: it's users we are addressing,
not developers, so wording that gets the information communicated
is the key aspect.
All the user-readable text and instructions are in one place, as
non-understandable text ends the reading process for the user.
Remove the duplicate printing of the query.
Use my_progname rather than "mysqld" to reflect the program name.
So the signal handler output is now in the form:
1. User instructions
2. Server Information
3. Stacktrace
4. connection/query/optimizer_switch
5. Core information and resource limits
6. Kernel information
The segfault in wsrep_check_sequence is due to a
null pointer dereference on:
db_type= thd->lex->create_info.db_type->db_type;
where create_info.db_type is null. This occurred under
a used_engine==true condition which is set in the calling
function based on create_info.used_fields==HA_CREATE_USED_ENGINE.
However, create_info.used_fields was left over
from the parsing of the previous failed CREATE TABLE, which,
because of its failure, left db_type unpopulated.
This is corrected by cleaning the create_info when we start
to parse ALTER SEQUENCE statements.
The other paths to wsrep_check_sequence are via CREATE SEQUENCE
and CREATE TABLE LIKE, which both initialize the create_info
correctly.
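A sketch of the shape of the fix with stand-in parser types (assuming an
init() that resets the leftover state):

  // Stand-in for the parser state carried in LEX (simplified).
  struct HA_CREATE_INFO
  {
    unsigned used_fields = 0;
    void *db_type = nullptr;
    void init() { used_fields = 0; db_type = nullptr; }  // drop leftovers
  };

  struct LEX { HA_CREATE_INFO create_info; };

  // When parsing of ALTER SEQUENCE starts, clear create_info so a failed
  // previous CREATE TABLE cannot leave HA_CREATE_USED_ENGINE set while
  // db_type is still null.
  static void begin_alter_sequence(LEX *lex) { lex->create_info.init(); }

  int main() { LEX lex; begin_alter_sequence(&lex); }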
buf_dblwr_t::recover(): Correct a debug assertion failure that had
been added in commit bb47e575de (MDEV-34830).
The server may have been killed while a log write was in progress, and
therefore recv_sys.scanned_lsn may be up to RECV_PARSING_BUF_SIZE bytes
ahead of recv_sys.recovered_lsn.
Thanks to Matthias Leich for providing "rr replay" traces and
testing this.
fil_space_t::create(): Instead of invoking the default fil_space_t
constructor on a zero-filled buffer, allocate an uninitialized buffer
and invoke an explicitly defined constructor on it. Also, specify
initializer expressions for all constant data members, so that all of them
will be initialized in the constructor.
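The pattern, as a generic sketch (std::malloc stands in for the real
allocator, and space_t for fil_space_t):

  #include <cstdlib>
  #include <new>

  // Constant members initialized by an explicit constructor instead of
  // default-constructing over zero-filled memory.
  struct space_t
  {
    const unsigned id;
    const bool being_imported;
    space_t(unsigned id_, bool imp) : id(id_), being_imported(imp) {}
  };

  static space_t *space_create(unsigned id, bool imp)
  {
    // Allocate uninitialized storage and construct in place.
    void *buf = std::malloc(sizeof(space_t));
    return buf ? new (buf) space_t(id, imp) : nullptr;
  }

  int main()
  {
    space_t *s = space_create(42, false);
    if (s) { s->~space_t(); std::free(s); }
  }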
fil_space_t::being_imported: Replaces part of fil_space_t::purpose.
fil_space_t::is_being_imported(), fil_space_t::is_temporary():
Replaces fil_space_t::purpose.
fil_space_t::id: Changed the type from ulint to uint32_t to reduce
incompatibility with later branches that include
commit ca501ffb04 (MDEV-26195).
fil_space_t::try_to_close(): Do not attempt to close files that are
in an I/O bound phase of ALTER TABLE…IMPORT TABLESPACE.
log_file_op, first_page_init, recv_spaces_t:
Use uint32_t for the tablespace id.
Reviewed by: Debarun Banerjee
os_innodb_umask was of the incorrect type, resulting in warnings
with clang-19. The correct type is mode_t.
As os_innodb_umask was set during innodb_init from my_umask,
the type was corrected there along with its companion my_umask_dir.
Because of this, the default mask values in innodb never
had an effect.
The resulting change allowed finding signed differences in
my_create{,_nosymlink}, open_nosymlinks:
mysys/my_create.c:47:20: error: operand of ?: changes signedness from ‘int’ to ‘mode_t’ {aka ‘unsigned int’} due to unsignedness of other operand [-Werror=sign-compare]
47 | CreateFlags ? CreateFlags : my_umask);
Ref: clang-19 warnings:
[55/123] Building CXX object storage/innobase/CMakeFiles/innobase.dir/os/os0file.cc.o
storage/innobase/os/os0file.cc:1075:46: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
1075 | file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
| ~~~~ ^~~~~~~~~~~~~~~
storage/innobase/os/os0file.cc:1249:46: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
1249 | file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
| ~~~~ ^~~~~~~~~~~~~~~
storage/innobase/os/os0file.cc:1381:45: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
1381 | file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
| ~~~~ ^~~~~~~~~~~~~~~
Threads can normally exit without an explicit pthread_exit call.
The pthread_exit calls here seem to date back to old glibc bugs, many
around glibc 2.2.5. A semi-related bug was
https://bugs.mysql.com/bug.php?id=82886.
To improve safety in the signal handlers, DBUG_* code was removed.
These changes were also needed to avoid some unresolved MSAN stack
issues.
This is effectively a backport of 2719cc4925.
These tests rely on THR_KEY_mysys, but it is not initialized. On
Linux the corresponding thread variable is null, but on macOS it has a
nonzero value. In all cases, initialize the variable explicitly by
calling MY_INIT and my_end appropriately.
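A sketch of the fix pattern for a test's main(); MY_INIT and my_end are
the real mysys entry points, the test body is elided:

  #include <my_global.h>
  #include <my_sys.h>

  int main(int argc, char **argv)
  {
    MY_INIT(argv[0]);     // initializes THR_KEY_mysys among other state
    /* ... run the unit test body here ... */
    my_end(0);            // releases what MY_INIT set up
    return 0;
  }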