The performance_schema counter wait/io/file/innodb/innodb_log_file
is always reported as 0.
The way how redo log writes are being waited for was refactored in
commit 30ea63b7d2 by the introduction
of flush_lock and write_lock. Even before that change, all the
wait/io/file/innodb/ counters were always 0 in my tests.
Moreover, if the PMEM interface that was introduced in
commit 3daef523af
is being used, writes to the InnoDB log file will completely avoid
any system calls and performance_schema instrumentation.
In commit 685d958e38 also the reads
of the redo log (during recovery) would bypass any system calls.
A prominent bottleneck in mtr_t::commit() is log_sys.mutex between
log_sys.append_prepare() and log_close().
User-visible change: The minimum innodb_log_file_size will be
increased from 1MiB to 4MiB so that some conditions can be
trivially satisfied.
log_sys.latch (log_latch): Replaces log_sys.mutex and
log_sys.flush_order_mutex. Copying mtr_t::m_log to
log_sys.buf is protected by a shared log_sys.latch.
Writes from log_sys.buf to the file system will be protected
by an exclusive log_sys.latch.
log_sys.lsn_lock: Protects the allocation of log buffer
in log_sys.append_prepare().
sspin_lock: A simple spin lock, for log_sys.lsn_lock.
Thanks to Vladislav Vaintroub for suggesting this idea, and for
reviewing these changes.
mariadb-backup: Replace some use of log_sys.mutex with recv_sys.mutex.
buf_pool_t::insert_into_flush_list(): Implement sorting of flush_list
because ordering is otherwise no longer guaranteed. Ordering by LSN
is needed for the proper operation of redo log checkpoints.
log_sys.append_prepare(): Advance log_sys.lsn and log_sys.buf_free by
the length, and return the old values. Also increment write_to_buf,
which was previously done in log_close().
mtr_t::finish_write(): Obtain the buffer pointer from
log_sys.append_prepare().
log_sys.buf_free: Make the field Atomic_relaxed,
to simplify log_flush_margin(). Use only loads and stores
to avoid costly read-modify-write atomic operations.
buf_pool.flush_list_requests: Replaces
export_vars.innodb_buffer_pool_write_requests
and srv_stats.buf_pool_write_requests.
Protected by buf_pool.flush_list_mutex.
buf_pool_t::insert_into_flush_list(): Do not invoke page_cleaner_wakeup().
Let the caller do that after a batch of calls.
recv_recover_page(): Invoke a minimal part of
buf_pool.insert_into_flush_list().
ReleaseBlocks::modified: A number of pages added to buf_pool.flush_list.
ReleaseBlocks::operator(): Merge buf_flush_note_modification() here.
log_t::set_capacity(): Renamed from log_set_capacity().
In commit 685d958e38 (MDEV-14425),
the log parsing in mariadb-backup --backup was rewritten.
The parameter STORE_IF_EXISTS that is being passed to recv_sys.parse_mtr()
or recv_sys.parse_pmem() instead of STORE_NO caused unnecessary additional
memory allocation for redo log records.
- Store the deferred tablespace name while loading the tablespace
for backup process.
- Mariabackup stores the list of space ids which has page0 INIT_PAGE
records. backup_first_page_op() and first_page_init() was introduced
to track the page0 INIT_PAGE records.
- backup_file_op() and log_file_op() was changed to handle
FILE_MODIFY redo log records. It is used to identify the
deferred tablespace space id.
- Whenever file operation redo log was processed by backup,
backup_file_op() should check whether the space name exist
in deferred tablespace. If it is then it needs to store the
space id, name when FILE_MODIFY, FILE_RENAME redo log processed
and it should delete the tablespace name from defer list in other
cases.
- backup_fix_ddl() should check whether deferred tablespace has
any page0 init records. If it is then consider the tablespace
as newly created tablespace. If not then backup should try
to reload the tablespace with SRV_BACKUP_NO_DEFER mode to
avoid the deferring of tablespace.
In commit 685d958e38 (MDEV-14425)
a bug was introduced to mariadb-backup --backup for the case when
the log is wrapping around to log_sys.START_OFFSET (12288).
This could also cause a "Missing FILE_CHECKPOINT" error during
mariadb-backup --prepare, in case the log copying resumed after
the server had produced a multiple of innodb_log_file_size-12288
bytes of more log so that the last mini-transaction would end
exactly at the end of the log file.
xtrabackup_copy_logfile(): If the log wraps around, read everything
to the end of the log file, and then the rest from log_sys.START_OFFSET.
in my_print_defaults
Analysis: --defaults* option is recognized anywhere in the commandline
instead of only at the beginning because handle_options() recognizes
options in any order.
Fix: use get_defaults_options() which recognizes --defaults* options only at
the beginning. After this is done, we only want to recognize other options
given in any order which can be done using handle_options(). So only skip
--defaults* options and pass rest of them to handle_options().
Also, removed -e, -g and -c because only my_print_defaults supports them.
- compile wolfcrypt with kdf.c, to avoid undefined symbols in tls13.c
- define WOLFSSL_HAVE_ERROR_QUEUE to avoid endless loop SSL_get_error
- Do not use SSL_CTX_set_tmp_dh/get_dh2048, this would require additional
compilation options in WolfSSL. Disable it for WolfSSL build, it works
without it anyway.
- fix "macro already defined" Windows warning.
The InnoDB redo log used to be formatted in blocks of 512 bytes.
The log blocks were encrypted and the checksum was calculated while
holding log_sys.mutex, creating a serious scalability bottleneck.
We remove the fixed-size redo log block structure altogether and
essentially turn every mini-transaction into a log block of its own.
This allows encryption and checksum calculations to be performed
on local mtr_t::m_log buffers, before acquiring log_sys.mutex.
The mutex only protects a memcpy() of the data to the shared
log_sys.buf, as well as the padding of the log, in case the
to-be-written part of the log would not end in a block boundary of
the underlying storage. For now, the "padding" consists of writing
a single NUL byte, to allow recovery and mariadb-backup to detect
the end of the circular log faster.
Like the previous implementation, we will overwrite the last log block
over and over again, until it has been completely filled. It would be
possible to write only up to the last completed block (if no more
recent write was requested), or to write dummy FILE_CHECKPOINT records
to fill the incomplete block, by invoking the currently disabled
function log_pad(). This would require adjustments to some logic around
log checkpoints, page flushing, and shutdown.
An upgrade after a crash of any previous version is not supported.
Logically empty log files from a previous version will be upgraded.
An attempt to start up InnoDB without a valid ib_logfile0 will be
refused. Previously, the redo log used to be created automatically
if it was missing. Only with with innodb_force_recovery=6, it is
possible to start InnoDB in read-only mode even if the log file
does not exist. This allows the contents of a possibly corrupted
database to be dumped.
Because a prepared backup from an earlier version of mariadb-backup
will create a 0-sized log file, we will allow an upgrade from such
log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system
tablespace looks valid.
The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced
with 64-byte log checkpoint blocks at 0x1000 and 0x2000.
The start of log records will move from 0x800 to 0x3000. This allows us
to use 4096-byte aligned blocks for all I/O in a future revision.
We extend the MDEV-12353 redo log record format as follows.
(1) Empty mini-transactions or extra NUL bytes will not be allowed.
(2) The end-of-minitransaction marker (a NUL byte) will be replaced
with a 1-bit sequence number, which will be toggled each time when the
circular log file wraps back to the beginning.
(3) After the sequence bit, a CRC-32C checksum of all data
(excluding the sequence bit) will written.
(4) If the log is encrypted, 8 bytes will be written before
the checksum and included in it. This is part of the
initialization vector (IV) of encrypted log data.
(5) File names, page numbers, and checkpoint information will not be
encrypted. Only the payload bytes of page-level log will be encrypted.
The tablespace ID and page number will form part of the IV.
(6) For padding, arbitrary-length FILE_CHECKPOINT records may be written,
with all-zero payload, and with the normal end marker and checksum.
The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.
In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will
no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup
will require a valid log file. When resizing the log, we will create
a logically empty ib_logfile101 at the current LSN and use an atomic rename
to replace ib_logfile0 with it. See the test innodb.log_file_size.
Because there is no mandatory padding in the log file, we are able
to create a dummy log file as of an arbitrary log sequence number.
See the test mariabackup.huge_lsn.
The parameter innodb_log_write_ahead_size and the
INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed.
The minimum value of innodb_log_buffer_size will be increased to 2MiB
(because log_sys.buf will replace recv_sys.buf) and the increment
adjusted to 4096 bytes (the maximum log block size).
The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed:
os_log_fsyncs
os_log_pending_fsyncs
log_pending_log_flushes
log_pending_checkpoint_writes
The following status variables will be removed:
Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs)
Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design)
log_sys.get_block_size(): Return the physical block size of the log file.
This is only implemented on Linux and Microsoft Windows for now, and for
the power-of-2 block sizes between 64 and 4096 bytes (the minimum and
maximum size of a checkpoint block). If the block size is anything else,
the traditional 512-byte size will be used via normal file system
buffering.
If the file system buffers can be bypassed, a message like the following
will be issued:
InnoDB: File system buffers for log disabled (block size=512 bytes)
InnoDB: File system buffers for log disabled (block size=4096 bytes)
This has been tested on Linux and Microsoft Windows with both sizes.
On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC.
Tests in 3 different environments where the log is stored in a device
with a physical block size of 512 bytes are yielding better throughput
without O_DIRECT. This could be due to the fact that in the event the
last log block is being overwritten (if multiple transactions would
become durable at the same time, and each of will write a small
number of bytes to the last log block), it should be faster to re-copy
data from log_sys.buf or log_sys.flush_buf to the kernel buffer,
to be finally written at fdatasync() time.
The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for
data files. This option will enable O_DIRECT on the log file on Linux.
It may be unsafe to use when the storage device does not support
FUA (Force Unit Access) mode.
When the server is compiled WITH_PMEM=ON, we will use memory-mapped
I/O for the log file if the log resides on a "mount -o dax" device.
We will identify PMEM in a start-up message:
InnoDB: log sequence number 0 (memory-mapped); transaction id 3
On Linux, we will also invoke mmap() on any ib_logfile0 that resides
in /dev/shm, effectively treating the log file as persistent memory.
This should speed up "./mtr --mem" and increase the test coverage of
PMEM on non-PMEM hardware. It also allows users to estimate how much
the performance would be improved by installing persistent memory.
On other tmpfs file systems such as /run, we will not use mmap().
mariadb-backup: Eliminated several variables. We will refer
directly to recv_sys and log_sys.
backup_wait_for_lsn(): Detect non-progress of
xtrabackup_copy_logfile(). In this new log format with
arbitrary-sized blocks, we can only detect log file overrun
indirectly, by observing that the scanned log sequence number
is not advancing.
xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit,
because we are not allowed to modify the server's log file, and our
memory mapping is read-only.
trx_flush_log_if_needed_low(): Do not use the callback on pmem.
Using neither flush_lock nor write_lock around PMEM writes seems
to yield the best performance. The pmem_persist() calls may
still be somewhat slower than the pwrite() and fdatasync() based
interface (PMEM mounted without -o dax).
recv_sys_t::buf: Remove. We will use log_sys.buf for parsing.
recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE.
recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn.
recv_sys_t, log_sys_t: Removed many data members.
recv_sys.lsn: Renamed from recv_sys.recovered_lsn.
recv_sys.offset: Renamed from recv_sys.recovered_offset.
log_sys.buf_size: Replaces srv_log_buffer_size.
recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset]
when the buffer is being allocated from the memory heap.
recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is
backed by ib_logfile0. The pointer will wrap from recv_sys.len
(log_sys.file_size) to log_sys.START_OFFSET. For the record that
wraps around, we may copy file name or record payload data to
the auxiliary buffer decrypt_buf in order to have a contiguous
block of memory. The maximum size of a record is less than
innodb_page_size bytes.
recv_sys_t::parse(): Take the smart pointer as a template parameter.
Do not temporarily add a trailing NUL byte to FILE_ records, because
we are not supposed to modify the memory-mapped log file. (It is
attached in read-write mode already during recovery.)
recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse().
recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be
returned on PMEM, use recv_ring to wrap around the buffer to the start.
mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free
on PMEM, because it has no meaning on the mmap-based log.
log_sys.write_to_buf: Count writes to log_sys.buf. Replaces
srv_stats.log_write_requests and export_vars.innodb_log_write_requests.
Protected by log_sys.mutex. Updated consistently in log_close().
Previously, mtr_t::commit() conditionally updated the count,
which was inconsistent.
log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf,
for writing to log_sys.log (the ib_logfile0). Replaces
srv_stats.log_writes and export_vars.innodb_log_writes.
Protected by log_sys.mutex.
log_sys.waits: Count waits in append_prepare(). Replaces
srv_stats.log_waits and export_vars.innodb_log_waits.
recv_recover_page(): Do not unnecessarily acquire
log_sys.flush_order_mutex. We are inserting the blocks in arbitary
order anyway, to be adjusted in recv_sys.apply(true).
We will change the definition of flush_lock and write_lock to
avoid potential false sharing. Depending on sizeof(log_sys) and
CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could
share a cache line with each other or with the last data members
of log_sys.
Thanks to Matthias Leich for providing https://rr-project.org traces
for various failures during the development, and to
Thirunarayanan Balathandayuthapani for his help in debugging
some of the recovery code. And thanks to the developers of the
rr debugger for a tool without which extensive changes to InnoDB
would be very challenging to get right.
Thanks to Vladislav Vaintroub for useful feedback and
to him, Axel Schwenke and Krunal Bauskar for testing the performance.
The previous default innodb_buffer_pool_chunk_size of 128M
made sense when the innodb buffer pool size was a few GB.
When the pool size is 128GB this means the chunk size is 0.1%
of this. Fine tuning the buffer pool size on such a fine
increment doesn't make practical sense. Also on extremely
large buffer pool systems, initializing on the default 128M can
also take a considerable amount of time.
When large pages are enabled, the chunk size has to be a multiple
of an available large page size or memory allocation without
use can occur.
Previously the default 0 was documented as disabling resizing.
With srv_buf_pool_chunk_unit > 0 assertions in the code and the
minimium value set, I doubt this was ever the case.
As such the autosizing (based on default 0) takes place as follows:
* a 64th of the innodb_buffer_pool_size
* if large pages, this is rounded down the the nearest multiple
of the large page size.
* If less than 1MB, set to 1MB.
This does mean the new default innodb_buffer_pool_chunk size is
2MB, derived form the above formular with 128MB as the buffer pool
size.
The innodb_buffer_pool_chunk_size is changed to a size_t for
better compatiblity with the memory allocations which use size_t.
The previous upper limit is changed to the maxium of a size_t. The
maximium value used is the buffer pool size anyway.
Getting this default value of the chunk size to a more practical
size facilitates further development of more automated resizing
without significant overhead or memory fragmentation.
innodb_buffer_pool_resize test adjusted based on 1M default
chunk size thanks Wlad.
1) Removed symlinks that are not very well supported in tar under Windows.
2) Added comment + changed code formatting in viosslfactories.c
3) Fixed a small bug in the yassl code.
4) Fixed a typo in the script code.
The previous threads locked need to be released too.
This occurs if the initialization of any of the non-first
mutex/conditition variables errors occurs.
This is follow-up to commit 1193a793c4.
We will set innodb_use_native_aio=OFF by default also in mariadb-backup
when running on a potentially affected kernel.
because plugin code is not only about encryption anymore
(also loads provider plugins), and xb_ prefix prevents name
clashes with the server code (that mariabackup links with).
bzip2/lz4/lzma/lzo/snappy compression is now provided via *services*
they're almost like normal services, but in include/providers/
and they're supposed to provide exactly the same interface
as original compression libraries (but not everything,
only enough of if for the code to compile).
the services are implemented via dummy functions that return
corresponding error values (LZMA_PROG_ERROR, LZO_E_INTERNAL_ERROR, etc).
the actual compression libraries are linked into corresponding
provider plugins. Providers are daemon plugins that when loaded
replace service pointers to point to actual compression functions.
That is, run-time dependency on compression libraries is now on plugins,
and the server doesn't need any compression libraries to run, but
will automatically support the compression when a plugin is loaded.
InnoDB and Mroonga use compression plugins now. RocksDB doesn't,
because it comes with standalone utility binaries that cannot
load plugins.
https://jira.mariadb.org/browse/MDEV-26221
my_sys DYNAMIC_ARRAY and DYNAMIC_STRING inconsistancy
The DYNAMIC_STRING uses size_t for sizes, but DYNAMIC_ARRAY used uint.
This patch adjusts DYNAMIC_ARRAY to use size_t like DYNAMIC_STRING.
As the MY_DIR member number_of_files is copied from a DYNAMIC_ARRAY,
this is changed to be size_t.
As MY_TMPDIR members 'cur' and 'max' are copied from a DYNAMIC_ARRAY,
these are also changed to be size_t.
The lists of plugins and stored procedures use DYNAMIC_ARRAY,
but their APIs assume a size of 'uint'; these are unchanged.