* rpl.rpl_system_versioning_partitions updated for MDEV-32188
* innodb.row_size_error_log_warnings_3 changed error for MDEV-33658
(checks are done in a different order)
In Log_event::read_log_event(), don't use IO_CACHE::error of the relay log's
IO_CACHE to signal an error back to the caller. When reading the active
relay log, this flag is also being used by the IO thread, and setting it can
randomly cause the IO thread to wrongly detect IO error on writing and
permanently disable the relay log.
This was seen sporadically in test case rpl.rpl_from_mysql80. The read
error set by the SQL thread in the IO_CACHE would be interpreted as a
write error by the IO thread, which would cause it to throw a fatal
error and close the relay log. And this would later cause CHANGE
MASTER to try to purge a closed relay log, resulting in nullptr crash.
SQL thread is not able to parse an event read from the relay log. This
can happen like here when replicating unknown events from a MySQL master,
potentially also for other reasons.
Also fix a mistake in my_b_flush_io_cache() introduced back in 2001
(fa09f2cd7e) where my_b_flush_io_cache() could wrongly return an error set
in IO_CACHE::error, even if the flush operation itself succeeded.
Also fix another sporadic failure in rpl.rpl_from_mysql80 where the outout
of MASTER_POS_WAIT() depended on timing of SQL and IO thread.
Reviewed-by: Monty <monty@mariadb.org>
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
`ATTRIBUITE_FORMAT` from #3360 uncovers issues on `my_snprintf` uses.
This commit fixes the one in `Binlog_commit_by_rotate::`
`replace_binlog_file()` about “required size too big”.
All I found is that it’s not present in 11.4
(after I prepared previous batches for all maintained branches),
for GitHub blame can’t process a file with over 10K lines.
The LOCK_global_system_variables must not be held when taking mutexes
such as LOCK_commit_ordered and LOCK_log, as this causes inconsistent
mutex locking order that can theoretically cause the server to
deadlock.
To avoid this, temporarily release LOCK_global_system_variables in two
system variable update functions, like it is done in many other
places.
Enforce the correct locking order at server startup, to more easily
catch (in debug builds) any remaining wrong orders that may be hidden
elsewhere in the code.
Note that when this is merged to 11.4, similar unlock/lock of
LOCK_global_system_variables must be added in update_binlog_space_limit()
as is done in binlog_checksum_update() and fix_max_binlog_size(), as this
is a new function added in 11.4 that also needs the same fix. Tests will
fail with wrong mutex order until this is done.
Reviewed-by: Sergei Golubchik <serg@mariadb.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Fixing a wrong DBUG_ASSERT.
thd->start_time and thd->start_time_sec_part cannot be 0 at the same time.
But thd->start_time can be 0 when thd->start_time_sec_part is not 0,
e.g. after:
SET timestamp=0.99;
into a separate transaction_participant structure
handlerton inherits it, so handlerton itself doesn't change.
but entities that only need to participate in a transaction,
like binlog or online alter log, use a transaction_participant
and no longer need to pretend to be a full-blown but invisible
storage engine which doesn't support create table.
For a primary configured with wait_point=AFTER_SYNC, if two threads
T1 (binlogging through MYSQL_BIN_LOG::write()) and T2 were
binlogging at the same time, T1 could accidentally wait for its
semi-sync ACK using the binlog coordinates of T2. Prior to
MDEV-33551, this only resulted in delayed transactions, because all
transactions shared the same condition variable for ACK signaling.
However, with the MDEV-33551 changes, each thread has its own
condition variable to signal. So T1 could wait indefinitely when
either:
1) T1's ACK is received but not T2's when T1 goes into
wait_after_sync(), because the ACK receiver thread has already
notified about the T1 ACK, but T1 was _actually_ waiting on T2's
ACK, and therefore tries to wait (in vain).
2) T1 goes to wait_after_sync() before any ACKs have arrived. When
T1's ACK comes in, T1 is woken up; however, sees it needs to wait
more (because it was actually waiting on T2's ACK), and goes to wait
again (this time, in vain).
Note that the actual cause of T1 waiting on T2's binlog coordinates
is when MYSQL_BIN_LOG::write() would call
Repl_semisync_master::wait_after_sync(), the binlog offset parameter
was read as the end of MYSQL_BIN_LOG::log_file, which is shared
among transactions. So if T2 had updated the binary log _after_ T1
had released LOCK_log, but not yet invoked wait_after_sync(), it
would use the end of the binary log file as the binlog offset, which
was that of T2 (or any future transaction).
The fix in this patch ensures consistency between the binary log
coordinates a transaction uses between report_binlog_update() and
wait_after_sync().
Reviewed By
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Andrei Elkin <andrei.elkin@mariadb.com>
for large transaction
Description
===========
When a transaction commits, it copies the binlog events from
binlog cache to binlog file. Very large transactions
(eg. gigabytes) can stall other transactions for a long time
because the data is copied while holding LOCK_log, which blocks
other commits from binlogging.
The solution in this patch is to rename the binlog cache file to
a binlog file instead of copy, if the commiting transaction has
large binlog cache. Rename is a very fast operation, it doesn't
block other transactions a long time.
Design
======
* binlog_large_commit_threshold
type: ulonglong
scope: global
dynamic: yes
default: 128MB
Only the binlog cache temporary files large than 128MB are
renamed to binlog file.
* #binlog_cache_files directory
To support rename, all binlog cache temporary files are managed
as normal files now. `#binlog_cache_files` directory is in the same
directory with binlog files. It is created at server startup if it doesn't
exist. Otherwise, all files in the directory is deleted at startup.
The temporary files are named with ML_ prefix and the memorary address
of the binlog_cache_data object which guarantees it is unique.
* Reserve space
To supprot rename feature, It must reserve enough space at the
begin of the binlog cache file. The space is required for
Format description, Gtid list, checkpoint and Gtid events when
renaming it to a binlog file.
Since binlog_cache_data's cache_log is directly accessed by binlog log,
online alter and wsrep. It is not easy to update all the code. Thus
binlog cache will not reserve space if it is not session binlog cache or
wsrep session is enabled.
- m_file_reserved_bytes
Stores the bytes reserved at the begin of the cache file.
It is initialized in write_prepare() and cleared by reset().
The reserved file header is hide to callers. Thus there is no
change for callers. E.g.
- get_byte_position() still get the length of binlog data
written to the cache, but not the file length.
- truncate(0) will truncate the file to m_file_reserved_bytes but not 0.
- write_prepare()
write_prepare() is called everytime when anything is being written
into the cache. It will call init_file_reserved_bytes() to create
the cache file (if it doesn't exist) and reserve suitable space if
the data written exceeds buffer's size.
* Binlog_commit_by_rotate
It is used to encapsulate the code for remaing a binlog cache
tempoary file to binlog file.
- should_commit_by_rotate()
it is called by write_transaction_to_binlog_events() to check if
a binlog cache should be rename to a binlog file.
- commit()
That is the entry to rename a binlog cache and commit the
transaction. Both rename and commit are protected by LOCK_log,
Thus not other transactions can write anything into the renamed
binlog before it.
Rename happens in a rotation. After the new binlog file is generated,
replace_binlog_file() is called to:
- copy data from the new binlog file to its binlog cache file.
- write gtid event.
- rename the binlog cache file to binlog file.
After that the rotation will continue to succeed. Then the transaction
is committed in a seperated group itself. Its cache file will be
detached and cache log will be reset before calling
trx_group_commit_with_engines(). Thus only Xid event be written.
The problem was that when using clang + asan, we do not get a correct value
for the thread stack as some local variables are not allocated at the
normal stack.
It looks like that for example clang 18.1.3, when compiling with
-O2 -fsanitize=addressan it puts local variables and things allocated by
alloca() in other areas than on the stack.
The following code shows the issue
Thread 6 "mariadbd" hit Breakpoint 3, do_handle_one_connection
(connect=0x5080000027b8,
put_in_cache=<optimized out>) at sql/sql_connect.cc:1399
THD *thd;
1399 thd->thread_stack= (char*) &thd;
(gdb) p &thd
(THD **) 0x7fffedee7060
(gdb) p $sp
(void *) 0x7fffef4e7bc0
The address of thd is 24M away from the stack pointer
(gdb) info reg
...
rsp 0x7fffef4e7bc0 0x7fffef4e7bc0
...
r13 0x7fffedee7060 140737185214560
r13 is pointing to the address of the thd. Probably some kind of
"local stack" used by the sanitizer
I have verified this with gdb on a recursive call that calls alloca()
in a loop. In this case all objects was stored in a local heap,
not on the stack.
To solve this issue in a portable way, I have added two functions:
my_get_stack_pointer() returns the address of the current stack pointer.
The code is using asm instructions for intel 32/64 bit, powerpc,
arm 32/64 bit and sparc 32/64 bit.
Supported compilers are gcc, clang and MSVC.
For MSVC 64 bit we are using _AddressOfReturnAddress()
As a fallback for other compilers/arch we use the address of a local
variable.
my_get_stack_bounds() that will return the address of the base stack
and stack size using pthread_attr_getstack() or NtCurrentTed() with
fallback to using the address of a local variable and user provided
stack size.
Server changes are:
- Moving setting of thread_stack to THD::store_globals() using
my_get_stack_bounds().
- Removing setting of thd->thread_stack, except in functions that
allocates a lot on the stack before calling store_globals(). When
using estimates for stack start, we reduce stack_size with
MY_STACK_SAFE_MARGIN (8192) to take into account the stack used
before calling store_globals().
I also added a unittest, stack_allocation-t, to verify the new code.
Reviewed-by: Sergei Golubchik <serg@mariadb.org>
RESET MASTER waits for storage engines to reply to a binlog checkpoint
requests. If this response is delayed for a long time for some reason, then
RESET MASTER can hang.
Fix this by forcing a log sync in all engines just before waiting for the
checkpoint reply.
(Waiting for old checkpoint responses is needed to preserve durability of
any commits that were synced to disk in the to-be-deleted binlog but not yet
synced in the engine.)
Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The option binlog_optimize_thread_scheduling was initially added
to provide a safe alternative for the newly added binlog group
commit logic, such that when 0, it would disable a leader thread
from performing the binlog write for all transactions that are a
part of the group commit. Any problems related to the binlog group
commit optimization should be sorted out by now, so we can
deprecate-to-eventually-remove the option altogether.
This commit performs the deprecation, and the removal is tracked
by MDEV-33745. Note, as the option is only able to be provided
via configuration at startup time, users will not see a
deprecation message unless looking through the CLI help
message.
Reviewed By
============
Kristian Nielsen <knielsen@knielsen-hq.org>
Sergei Golubchik <serg@mariadb.org>
fprintf() on Windows, when used on unbuffered FILE*, writes bytewise.
This can make crash handler messages harder to read, if they are mixed up
with other error log output.
Fixed , on Windows, by using a small buffer for formatting, and fwrite
instead of fprintf, if buffer is large enough for message.
This commit adds 3 new status variables to 'show all slaves status':
- Master_last_event_time ; timestamp of the last event read from the
master by the IO thread.
- Slave_last_event_time ; Master timestamp of the last event committed
on the slave.
- Master_Slave_time_diff: The difference of the above two timestamps.
All the above variables are NULL until the slave has started and the
slave has read one query event from the master that changes data.
- Added information_schema.slave_status, which allows us to remove:
- show_master_info(), show_master_info_get_fields(),
send_show_master_info_data(), show_all_master_info()
- class Sql_cmd_show_slave_status.
- Protocol::store(I_List<i_string_pair>* str_list) as it is not
used anymore.
- Changed old SHOW SLAVE STATUS and SHOW ALL SLAVES STATUS to
use the SELECT code path, as all other SHOW ... STATUS commands.
Other things:
- Xid_log_time is set to time of commit to allow slave that reads the
binary log to calculate Master_last_event_time and
Slave_last_event_time.
This is needed as there is not 'exec_time' for row events.
- Fixed that Load_log_event calculates exec_time identically to
Query_event.
- Updated RESET SLAVE to reset Master/Slave_last_event_time
- Updated SQL thread's update on first transaction read-in to
only update Slave_last_event_time on group events.
- Fixed possible (unlikely) bugs in sql_show.cc ...old_format() functions
if allocation of 'field' would fail.
Reviewed By:
Brandon Nesterenko <brandon.nesterenko@mariadb.com>
Kristian Nielsen <knielsen@knielsen-hq.org>