Commit graph

2863 commits

Author SHA1 Message Date
Marko Mäkelä
620c55e708 Merge 10.4 into 10.5 2022-04-21 15:33:50 +03:00
Vlad Lesin
1b558dd462 MDEV-27919 mariabackup --log-copy-interval is measured in millisecondss in 10.5 and in microseconds in 10.6
Multiply polling interval by 1000.
2022-04-21 15:24:59 +03:00
Marko Mäkelä
aec856073d WolfSSL v5.2.0-stable 2022-04-21 12:02:36 +03:00
Marko Mäkelä
394784095e Merge 10.3 into 10.4 2022-04-21 11:33:59 +03:00
Nayuta Yanagisawa
cbf9d8a8d5 Merge 10.7 into 10.8 2022-04-13 17:52:27 +09:00
Marko Mäkelä
aa3a9d1ef5 Merge 10.6 into 10.7 2022-04-12 16:11:29 +03:00
Sergei Golubchik
bbdec04d59 MDEV-24317 Data race in LOGGER::init_error_log at sql/log.cc:1443 and in LOGGER::error_log_print at sql/log.cc:1181
don't initialize error_log_handler_list in set_handlers()
* error_log_handler_list is initialized to LOG_FILE early, in init_base()
* set_handlers always reinitializes it to LOG_FILE, so it's pointless
* after init_base() concurrent threads start using sql_log_warning,
  so following set_handlers() shouldn't modify error_log_handler_list
  without some protection
2022-04-12 13:07:20 +02:00
Marko Mäkelä
ca3bbf4c0c Merge 10.5 into 10.6 2022-04-12 09:26:02 +03:00
Daniel Black
4ee00a29e3 MDEV-28250 aix test case failure innodb_zip.innochecksum_3,4k,crc32,innodb
As discovered by tracing, but also presenting in AIX fseeko
documentation, seeking beyond the EOF is acceptable, as you can write
there.

To display the same error in AIX to other implementations that return
errors on seek, we take the EFBIG error code on reading and error the
same way.

An AIX truss of an aspect of the test:

truss extra/innochecksum --page=18446744073709551615 $MYSQLD_DATADIR/test/tab1.ibd

  statx("./mysql-test/var/log/innodb_zip.innochecksum_3-4k,crc32,innodb/mysqld.1/data//test/tab1.ibd", 0x0FFFFFFFFFFFF610, 176, 010) = 0
  kopen("./mysql-test/var/log/innodb_zip.innochecksum_3-4k,crc32,innodb/mysqld.1/data//test/tab1.ibd", O_RDONLY|O_LARGEFILE) = 3
  kfcntl(3, 12, 0x00000001100006C8)               = 0
  kfcntl(3, F_GETFL, 0x00000001100A6CF8)          = 67108864
  kioctl(3, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
  klseek(3, 0, 1, 0x0FFFFFFFFFFFF3F0)             = 0
  kioctl(3, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
  kread(3, "DEADBEEF\0\0\0\0FFFFFFFF".., 4096)    = 4096
  klseek(3, 0, 1, 0x0FFFFFFFFFFFF450)             = 0
  klseek(3, 17592186040320, 0, 0x0FFFFFFFFFFFF450) = 0
  klseek(3, 0, 1, 0x0FFFFFFFFFFFF3F0)             = 0
  kread(3, "DEADBEEF\0\0\0\0FFFFFFFF".., 4096)    Err#27 EFBIG

An equivalent Linux trace:

ltrace extra/innochecksum --page=18446744073709551615 $MYSQLD_DATADIR/test/tab1.ibd

  stat64(0x7fff10ea2dc3, 0x7fff10ea0670, 88, 0x8026be41)         = 0
  open64("./mysql-test/var/log/innodb_zip."..., 0, 02072403160)  = 3
  fcntl64(3, 6, 0x139f180, 1)                                    = 0
  fgetpos64(0x615000000080, 0x7fff10ea0760, 1, 0)                = 0
  fseeko64(0x615000000080, 0xffffffff000, 0, 5 <unfinished ...>
  pthread_getspecific(0, 0x4d0eb8, 0x7fff10ea0490, 0)            = 0x7f7b2806d000
  <... fseeko64 resumed> )                                       = 0
  fgetpos64(0x615000000080, 0x7fff10ea0760, 1, 1)                = 0
  feof(0x615000000080)                                           = 0
  feof(0x615000000080)                                           = 1
  Error: Unable to seek to necessary offset
2022-04-07 15:17:52 +10:00
Marko Mäkelä
b2baeba415 Merge 10.7 into 10.8 2022-04-06 13:28:25 +03:00
Marko Mäkelä
2d8e38bc94 Merge 10.6 into 10.7 2022-04-06 13:00:09 +03:00
Marko Mäkelä
9d94c60f2b Merge 10.5 into 10.6 2022-04-06 12:08:30 +03:00
Marko Mäkelä
5d8dcfd86c MDEV-25975: Merge 10.4 into 10.5 2022-04-06 10:30:49 +03:00
Marko Mäkelä
cacb61b6be Merge 10.4 into 10.5 2022-04-06 10:06:39 +03:00
Marko Mäkelä
d172df9913 MDEV-25975: Merge 10.3 into 10.4 2022-04-06 09:18:38 +03:00
Marko Mäkelä
d6d66c6e90 Merge 10.3 into 10.4 2022-04-06 08:59:09 +03:00
Marko Mäkelä
e9735a8185 MDEV-25975 innodb_disallow_writes causes shutdown to hang
We will remove the parameter innodb_disallow_writes because it is badly
designed and implemented. The parameter was never allowed at startup.
It was only internally used by Galera snapshot transfer.
If a user executed
SET GLOBAL innodb_disallow_writes=ON;
the server could hang even on subsequent read operations.

During Galera snapshot transfer, we will block writes
to implement an rsync friendly snapshot, as follows:

sst_flush_tables() will acquire a global lock by executing
FLUSH TABLES WITH READ LOCK, which will block any writes
at the high level.

sst_disable_innodb_writes(), invoked via ha_disable_internal_writes(true),
will suspend or disable InnoDB background tasks or threads that could
initiate writes. As part of this, log_make_checkpoint() will be invoked
to ensure that anything in the InnoDB buf_pool.flush_list will be written
to the data files. This has the nice side effect that the Galera joiner
will avoid crash recovery.

The changes to sql/wsrep.cc and to the tests are based on a prototype
that was developed by Jan Lindström.

Reviewed by: Jan Lindström
2022-04-06 08:06:49 +03:00
Marko Mäkelä
7c584d8270 Merge 10.2 into 10.3 2022-04-06 08:06:35 +03:00
Marko Mäkelä
5c69e93630 Merge 10.7 into 10.8 2022-03-30 09:34:07 +03:00
Marko Mäkelä
a4d753758f Merge 10.6 into 10.7 2022-03-30 08:52:05 +03:00
Vlad Lesin
33ff18627e MDEV-27835 innochecksum -S crashes for encrypted .ibd tablespace
As main() invokes parse_page() when -S or -D are set, it can be a case
when parse_page() is invoked when -D filename is not set, that is why
any attempt to write to page dump file must be done only if the file
name is set with -D.

The bug is caused by 2ef7a5a13a
(MDEV-13443).
2022-03-29 13:47:37 +03:00
Marko Mäkelä
b2fa874e46 MDEV-28181 The innochecksum -w option was inadvertently removed
In commit 7a4fbb55b0 (MDEV-25105)
the innochecksum option --write (-w) was removed altogether.
It should have been made a Boolean option, so that old data files
may be converted to a format that is compatible with
innodb_checksum_algorithm=strict_crc32 by executing the following:

innochecksum -n -w ibdata* */*.ibd

It would be better to use an older-version innochecksum
for such a conversion, so that page checksums will be validated
before updating the checksum.

It never was possible for innochecksum to convert files to the
innodb_checksum_algorithm=full_crc32 format that is the default
for new InnoDB data files.
2022-03-28 11:35:10 +03:00
Marko Mäkelä
dce8a846ae Merge 10.7 into 10.8 2022-03-03 11:34:58 +02:00
Marko Mäkelä
64ea3eab8f Merge 10.6 into 10.7 2022-03-03 11:11:00 +02:00
Otto Kekäläinen
1fa872f6ef Fix various spelling errors
Among others:
existance -> existence
reinitialze -> reinitialize
successfuly -> successfully
2022-03-03 13:42:49 +11:00
Marko Mäkelä
32d741b5b0 Merge 10.7 into 10.8 2022-02-25 16:24:13 +02:00
Marko Mäkelä
3d88f9f34c Merge 10.6 into 10.7 2022-02-25 16:09:16 +02:00
Marko Mäkelä
a23414dd32 MDEV-27939 Log buffer wrap-around errors on PMEM
When the log is stored in persistent memory, log_sys.buf[] is
a ring buffer that directly maps to the circular ib_logfile0 file.

There were several errors that could occur in the special case
when a log record ends exactly at the end of the log file and the
next record would start at log_sys.buf[log_sys.START_OFFSET].

mariabackup.huge_lsn,strict_full_crc32: Write the first record
at the very end of the circular file, to reproduce the failure
scenarios.

recv_sys_t::parse(): On PMEM, wrap the end offset of the record
from log_sys.file_size to log_sys.START_OFFSET if needed.
Otherwise, both InnoDB recovery and mariadb-backup would try
to parse the next record from an invalid address.

filename_to_spacename(): Remove an assumption about the format
of file names. While the server currently writes file names like
./databasename/tablename.ibd we might want to stop writing the
redundant ./ prefix in the future. The test mariabackup.huge_lsn
is generating such file names.

xtrabackup_copy_logfile(): Correctly copy a record that ends at
the very end of the log_sys.buf[].

The errors in mariadb-backup were reproduced with the test
mariabackup.huge_lsn,strict_full_crc32 and an additional patch
to use the start checkpoint of the test:

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index 27dce5fa17d..e17a1692d6f 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -1796,7 +1796,8 @@ dberr_t recv_sys_t::find_checkpoint()
         continue;
       }

-      if (checkpoint_lsn >= log_sys.next_checkpoint_lsn)
+      if (checkpoint_lsn >= log_sys.next_checkpoint_lsn &&
+          checkpoint_lsn != 0x1000fffffe10)
       {
         log_sys.next_checkpoint_lsn= checkpoint_lsn;
         log_sys.next_checkpoint_no= field == log_t::CHECKPOINT_1;
2022-02-25 15:50:09 +02:00
Marko Mäkelä
6daf8f8a0d Merge 10.5 into 10.6 2022-02-25 13:48:47 +02:00
Marko Mäkelä
b791b942e1 Merge 10.4 into 10.5 2022-02-25 13:27:41 +02:00
Marko Mäkelä
f5ff7d09c7 Merge 10.3 into 10.4 2022-02-25 13:00:48 +02:00
Marko Mäkelä
00b70bbb51 Merge 10.2 into 10.3 2022-02-25 10:43:38 +02:00
Julius Goryavsky
17e0f5224c MDEV-27524: Incorrect binlogs after Galera SST using rsync and mariabackup
This commit adds correct handling of binlogs for SST using rsync
or mariabackup. Before this fix, binlogs were handled incorrectly -
- only one (last) binary log file was transferred during SST, which
then led to various failures (for example, when trying to list all
events from the binary log). These bugs were long masked by flaws
in the primitive binlogs handling code in the SST scripts, which
causing binary logs files to be erased after transfer or not added
to the binlog index on the joiner node. Now the correct transfer
of all binary logs (not just the last of the binary log files) has
been implemented both for the rsync (at the script level) and for
the mariabackup (at the level of the main utility code).

This commit also adds a new sst_max_binlogs=<n> parameter, which
can be located in the [sst] section or in the [xtrabackup] section
(historically, supported for mariabackup only, not for rsync), or
in one of the server sections. This parameter specifies the number
of binary log files to be sent to the joiner node during SST. This
option is added for compatibility with old SST scripting behavior,
which can be emulated by setting the sst_max_binlogs=1 (although
in general this can cause problems for the reasons described above).
In addition, setting the sst_max_binlogs=0 can be used to suppress
the transmission of binary logs to the joiner nodes during SST
(although sometimes a single file with the current binary log can
still be transmitted to the joiner, even with sst_max_binlogs=0,
because this sometimes necessary in modes that involve the use of
GTIDs with Galera).

Also, this commit ensures correct handling of paths to various
innodb files and directories in the SST scripts, and fixes some
problems with this that existed in mariabackup utility (which
were associated with incorrect handling of the innodb_data_dir
parameter in some scenarios).

In addition, this commit contains the following enhancements:

 1) Added tests for mtr, which check the correct work with binlogs
    after SST (using rsync and mariabackup);
 2) Added correct handling of slashes at the end of all paths that
    the SST script receives as parameters;
 3) Improved parsing code for --mysqld-args parameters. Now it
    correctly processes the sequence "--" after the name of the
    one-letter option;
 4) Checking the secret signature during joiner authentication
    is made independent of presence of bash (as a unix shell)
    in the system and diff utility no longer needed to check
    certificates compliance;
 5) All directories that are necessary for the correct placement
    of various logs are automatically created by SST scripts in
    advance (before running mariabackup on the joiner node);
 6) Removal of old binary logs on joiner is done using the binlog
    index (if it exists) (not only by fixed pattern that based
    on the current binlog name, as before);
 7) Paths for placing binary logs are correctly processed if they
    are set as relative paths (to the datadir);
 8) SST scripts are made even more resistant to spaces in filenames
    (now for binlogs);
 9) In case of failure, SST scripts now always end with an exit
    code other than zero;
10) SST script for rsync now correctly create a tar file with
    the binlogs, even if the paths to them (in the binlog index
    file) are specified as a mix of absolute and relative paths,
    and even if they do not match with the datadir path specified
    in the current configuration settings.
2022-02-22 10:45:06 +01:00
Julius Goryavsky
571eb9d775 mariabackup: cosmetic changes (whitespaces and indentation) 2022-02-22 10:20:58 +01:00
Marko Mäkelä
f1beeb58e6 MDEV-27848: Remove unused wait/io/file/innodb/innodb_log_file
The performance_schema counter wait/io/file/innodb/innodb_log_file
is always reported as 0.

The way how redo log writes are being waited for was refactored in
commit 30ea63b7d2 by the introduction
of flush_lock and write_lock. Even before that change, all the
wait/io/file/innodb/ counters were always 0 in my tests.

Moreover, if the PMEM interface that was introduced in
commit 3daef523af
is being used, writes to the InnoDB log file will completely avoid
any system calls and performance_schema instrumentation.
In commit 685d958e38 also the reads
of the redo log (during recovery) would bypass any system calls.
2022-02-15 15:03:15 +02:00
Marko Mäkelä
a635c40648 MDEV-27774 Reduce scalability bottlenecks in mtr_t::commit()
A prominent bottleneck in mtr_t::commit() is log_sys.mutex between
log_sys.append_prepare() and log_close().

User-visible change: The minimum innodb_log_file_size will be
increased from 1MiB to 4MiB so that some conditions can be
trivially satisfied.

log_sys.latch (log_latch): Replaces log_sys.mutex and
log_sys.flush_order_mutex. Copying mtr_t::m_log to
log_sys.buf is protected by a shared log_sys.latch.
Writes from log_sys.buf to the file system will be protected
by an exclusive log_sys.latch.

log_sys.lsn_lock: Protects the allocation of log buffer
in log_sys.append_prepare().

sspin_lock: A simple spin lock, for log_sys.lsn_lock.

Thanks to Vladislav Vaintroub for suggesting this idea, and for
reviewing these changes.

mariadb-backup: Replace some use of log_sys.mutex with recv_sys.mutex.

buf_pool_t::insert_into_flush_list(): Implement sorting of flush_list
because ordering is otherwise no longer guaranteed. Ordering by LSN
is needed for the proper operation of redo log checkpoints.

log_sys.append_prepare(): Advance log_sys.lsn and log_sys.buf_free by
the length, and return the old values. Also increment write_to_buf,
which was previously done in log_close().

mtr_t::finish_write(): Obtain the buffer pointer from
log_sys.append_prepare().

log_sys.buf_free: Make the field Atomic_relaxed,
to simplify log_flush_margin(). Use only loads and stores
to avoid costly read-modify-write atomic operations.

buf_pool.flush_list_requests: Replaces
export_vars.innodb_buffer_pool_write_requests
and srv_stats.buf_pool_write_requests.
Protected by buf_pool.flush_list_mutex.

buf_pool_t::insert_into_flush_list(): Do not invoke page_cleaner_wakeup().
Let the caller do that after a batch of calls.

recv_recover_page(): Invoke a minimal part of
buf_pool.insert_into_flush_list().

ReleaseBlocks::modified: A number of pages added to buf_pool.flush_list.

ReleaseBlocks::operator(): Merge buf_flush_note_modification() here.

log_t::set_capacity(): Renamed from log_set_capacity().
2022-02-10 16:37:12 +02:00
Marko Mäkelä
8c7c92adf3 MDEV-27787 mariadb-backup --backup is allocating extra memory for log records
In commit 685d958e38 (MDEV-14425),
the log parsing in mariadb-backup --backup was rewritten.
The parameter STORE_IF_EXISTS that is being passed to recv_sys.parse_mtr()
or recv_sys.parse_pmem() instead of STORE_NO caused unnecessary additional
memory allocation for redo log records.
2022-02-10 15:39:27 +02:00
Oleksandr Byelkin
4fb2cb1a30 Merge branch '10.7' into 10.8 2022-02-04 14:50:25 +01:00
Oleksandr Byelkin
9ed8deb656 Merge branch '10.6' into 10.7 2022-02-04 14:11:46 +01:00
Oleksandr Byelkin
f5c5f8e41e Merge branch '10.5' into 10.6 2022-02-03 17:01:31 +01:00
Oleksandr Byelkin
cf63eecef4 Merge branch '10.4' into 10.5 2022-02-01 20:33:04 +01:00
Thirunarayanan Balathandayuthapani
8d742fe4ac MDEV-26326 mariabackup skip valid ibd file
- Store the deferred tablespace name while loading the tablespace
for backup process.

- Mariabackup stores the list of space ids which has page0 INIT_PAGE
records. backup_first_page_op() and first_page_init() was introduced
to track the page0 INIT_PAGE records.

- backup_file_op() and log_file_op() was changed to handle
FILE_MODIFY redo log records. It is used to identify the
deferred tablespace space id.

- Whenever file operation redo log was processed by backup,
backup_file_op() should check whether the space name exist
in deferred tablespace. If it is then it needs to store the
space id, name when FILE_MODIFY, FILE_RENAME redo log processed
and it should delete the tablespace name from defer list in other
cases.

- backup_fix_ddl() should check whether deferred tablespace has
any page0 init records. If it is then consider the tablespace
as newly created tablespace. If not then backup should try
to reload the tablespace with SRV_BACKUP_NO_DEFER mode to
avoid the deferring of tablespace.
2022-02-01 19:50:08 +05:30
Oleksandr Byelkin
a576a1cea5 Merge branch '10.3' into 10.4 2022-01-30 09:46:52 +01:00
Oleksandr Byelkin
41a163ac5c Merge branch '10.2' into 10.3 2022-01-29 15:41:05 +01:00
Marko Mäkelä
c64e507fad MDEV-27621 Backup fails with FATAL ERROR: Was only able to copy log
In commit 685d958e38 (MDEV-14425)
a bug was introduced to mariadb-backup --backup for the case when
the log is wrapping around to log_sys.START_OFFSET (12288).

This could also cause a "Missing FILE_CHECKPOINT" error during
mariadb-backup --prepare, in case the log copying resumed after
the server had produced a multiple of innodb_log_file_size-12288
bytes of more log so that the last mini-transaction would end
exactly at the end of the log file.

xtrabackup_copy_logfile(): If the log wraps around, read everything
to the end of the log file, and then the rest from log_sys.START_OFFSET.
2022-01-27 16:17:40 +02:00
Rucha Deodhar
5217211e07 MDEV-26238: Remove inconsistent behaviour of --default-* options
in my_print_defaults

Analysis: --defaults* option is recognized anywhere in the commandline
instead of only at the beginning because handle_options() recognizes
options in any order.
Fix: use get_defaults_options() which recognizes --defaults* options only at
the beginning. After this is done, we only want to recognize other options
given in any order which can be done using handle_options(). So only skip
--defaults* options and pass rest of them to handle_options().
Also, removed -e, -g and -c because only my_print_defaults supports them.
2022-01-26 18:43:06 +01:00
Vladislav Vaintroub
be1d965384 MDEV-27373 wolfSSL 5.1.1
- compile wolfcrypt with kdf.c, to avoid undefined symbols in tls13.c
- define WOLFSSL_HAVE_ERROR_QUEUE to avoid endless loop SSL_get_error
- Do not use SSL_CTX_set_tmp_dh/get_dh2048, this would require additional
  compilation options in WolfSSL. Disable it for WolfSSL build, it works
  without it anyway.
- fix "macro already defined" Windows warning.
2022-01-25 11:19:00 +01:00
Oleksandr Byelkin
8db47403ff WolfSSL v5.1.1 2022-01-25 11:19:00 +01:00
Marko Mäkelä
5d54fd611f Cleanup: Replace ut_crc32c(x,y) with my_crc32c(0,x,y) 2022-01-21 16:13:04 +02:00
Marko Mäkelä
685d958e38 MDEV-14425 Improve the redo log for concurrency
The InnoDB redo log used to be formatted in blocks of 512 bytes.
The log blocks were encrypted and the checksum was calculated while
holding log_sys.mutex, creating a serious scalability bottleneck.

We remove the fixed-size redo log block structure altogether and
essentially turn every mini-transaction into a log block of its own.
This allows encryption and checksum calculations to be performed
on local mtr_t::m_log buffers, before acquiring log_sys.mutex.
The mutex only protects a memcpy() of the data to the shared
log_sys.buf, as well as the padding of the log, in case the
to-be-written part of the log would not end in a block boundary of
the underlying storage. For now, the "padding" consists of writing
a single NUL byte, to allow recovery and mariadb-backup to detect
the end of the circular log faster.

Like the previous implementation, we will overwrite the last log block
over and over again, until it has been completely filled. It would be
possible to write only up to the last completed block (if no more
recent write was requested), or to write dummy FILE_CHECKPOINT records
to fill the incomplete block, by invoking the currently disabled
function log_pad(). This would require adjustments to some logic around
log checkpoints, page flushing, and shutdown.

An upgrade after a crash of any previous version is not supported.
Logically empty log files from a previous version will be upgraded.

An attempt to start up InnoDB without a valid ib_logfile0 will be
refused. Previously, the redo log used to be created automatically
if it was missing. Only with with innodb_force_recovery=6, it is
possible to start InnoDB in read-only mode even if the log file
does not exist. This allows the contents of a possibly corrupted
database to be dumped.

Because a prepared backup from an earlier version of mariadb-backup
will create a 0-sized log file, we will allow an upgrade from such
log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system
tablespace looks valid.

The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced
with 64-byte log checkpoint blocks at 0x1000 and 0x2000.

The start of log records will move from 0x800 to 0x3000. This allows us
to use 4096-byte aligned blocks for all I/O in a future revision.

We extend the MDEV-12353 redo log record format as follows.

(1) Empty mini-transactions or extra NUL bytes will not be allowed.
(2) The end-of-minitransaction marker (a NUL byte) will be replaced
with a 1-bit sequence number, which will be toggled each time when the
circular log file wraps back to the beginning.
(3) After the sequence bit, a CRC-32C checksum of all data
(excluding the sequence bit) will written.
(4) If the log is encrypted, 8 bytes will be written before
the checksum and included in it. This is part of the
initialization vector (IV) of encrypted log data.
(5) File names, page numbers, and checkpoint information will not be
encrypted. Only the payload bytes of page-level log will be encrypted.
The tablespace ID and page number will form part of the IV.
(6) For padding, arbitrary-length FILE_CHECKPOINT records may be written,
with all-zero payload, and with the normal end marker and checksum.
The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.

In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will
no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup
will require a valid log file. When resizing the log, we will create
a logically empty ib_logfile101 at the current LSN and use an atomic rename
to replace ib_logfile0 with it. See the test innodb.log_file_size.

Because there is no mandatory padding in the log file, we are able
to create a dummy log file as of an arbitrary log sequence number.
See the test mariabackup.huge_lsn.

The parameter innodb_log_write_ahead_size and the
INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed.

The minimum value of innodb_log_buffer_size will be increased to 2MiB
(because log_sys.buf will replace recv_sys.buf) and the increment
adjusted to 4096 bytes (the maximum log block size).

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed:

os_log_fsyncs
os_log_pending_fsyncs
log_pending_log_flushes
log_pending_checkpoint_writes

The following status variables will be removed:

Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs)
Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design)

log_sys.get_block_size(): Return the physical block size of the log file.
This is only implemented on Linux and Microsoft Windows for now, and for
the power-of-2 block sizes between 64 and 4096 bytes (the minimum and
maximum size of a checkpoint block). If the block size is anything else,
the traditional 512-byte size will be used via normal file system
buffering.

If the file system buffers can be bypassed, a message like the following
will be issued:

InnoDB: File system buffers for log disabled (block size=512 bytes)
InnoDB: File system buffers for log disabled (block size=4096 bytes)

This has been tested on Linux and Microsoft Windows with both sizes.

On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC.
Tests in 3 different environments where the log is stored in a device
with a physical block size of 512 bytes are yielding better throughput
without O_DIRECT. This could be due to the fact that in the event the
last log block is being overwritten (if multiple transactions would
become durable at the same time, and each of will write a small
number of bytes to the last log block), it should be faster to re-copy
data from log_sys.buf or log_sys.flush_buf to the kernel buffer,
to be finally written at fdatasync() time.

The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for
data files. This option will enable O_DIRECT on the log file on Linux.
It may be unsafe to use when the storage device does not support
FUA (Force Unit Access) mode.

When the server is compiled WITH_PMEM=ON, we will use memory-mapped
I/O for the log file if the log resides on a "mount -o dax" device.
We will identify PMEM in a start-up message:

InnoDB: log sequence number 0 (memory-mapped); transaction id 3

On Linux, we will also invoke mmap() on any ib_logfile0 that resides
in /dev/shm, effectively treating the log file as persistent memory.
This should speed up "./mtr --mem" and increase the test coverage of
PMEM on non-PMEM hardware. It also allows users to estimate how much
the performance would be improved by installing persistent memory.
On other tmpfs file systems such as /run, we will not use mmap().

mariadb-backup: Eliminated several variables. We will refer
directly to recv_sys and log_sys.

backup_wait_for_lsn(): Detect non-progress of
xtrabackup_copy_logfile(). In this new log format with
arbitrary-sized blocks, we can only detect log file overrun
indirectly, by observing that the scanned log sequence number
is not advancing.

xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit,
because we are not allowed to modify the server's log file, and our
memory mapping is read-only.

trx_flush_log_if_needed_low(): Do not use the callback on pmem.
Using neither flush_lock nor write_lock around PMEM writes seems
to yield the best performance. The pmem_persist() calls may
still be somewhat slower than the pwrite() and fdatasync() based
interface (PMEM mounted without -o dax).

recv_sys_t::buf: Remove. We will use log_sys.buf for parsing.

recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE.

recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn.

recv_sys_t, log_sys_t: Removed many data members.

recv_sys.lsn: Renamed from recv_sys.recovered_lsn.
recv_sys.offset: Renamed from recv_sys.recovered_offset.
log_sys.buf_size: Replaces srv_log_buffer_size.

recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset]
when the buffer is being allocated from the memory heap.

recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is
backed by ib_logfile0. The pointer will wrap from recv_sys.len
(log_sys.file_size) to log_sys.START_OFFSET. For the record that
wraps around, we may copy file name or record payload data to
the auxiliary buffer decrypt_buf in order to have a contiguous
block of memory. The maximum size of a record is less than
innodb_page_size bytes.

recv_sys_t::parse(): Take the smart pointer as a template parameter.
Do not temporarily add a trailing NUL byte to FILE_ records, because
we are not supposed to modify the memory-mapped log file. (It is
attached in read-write mode already during recovery.)

recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse().

recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be
returned on PMEM, use recv_ring to wrap around the buffer to the start.

mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free
on PMEM, because it has no meaning on the mmap-based log.

log_sys.write_to_buf: Count writes to log_sys.buf. Replaces
srv_stats.log_write_requests and export_vars.innodb_log_write_requests.
Protected by log_sys.mutex. Updated consistently in log_close().
Previously, mtr_t::commit() conditionally updated the count,
which was inconsistent.

log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf,
for writing to log_sys.log (the ib_logfile0). Replaces
srv_stats.log_writes and export_vars.innodb_log_writes.
Protected by log_sys.mutex.

log_sys.waits: Count waits in append_prepare(). Replaces
srv_stats.log_waits and export_vars.innodb_log_waits.

recv_recover_page(): Do not unnecessarily acquire
log_sys.flush_order_mutex. We are inserting the blocks in arbitary
order anyway, to be adjusted in recv_sys.apply(true).

We will change the definition of flush_lock and write_lock to
avoid potential false sharing. Depending on sizeof(log_sys) and
CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could
share a cache line with each other or with the last data members
of log_sys.

Thanks to Matthias Leich for providing https://rr-project.org traces
for various failures during the development, and to
Thirunarayanan Balathandayuthapani for his help in debugging
some of the recovery code. And thanks to the developers of the
rr debugger for a tool without which extensive changes to InnoDB
would be very challenging to get right.

Thanks to Vladislav Vaintroub for useful feedback and
to him, Axel Schwenke and Krunal Bauskar for testing the performance.
2022-01-21 16:03:47 +02:00