When using the default innodb_log_buffer_size=2m, mariadb-backup --backup
would spend a lot of time re-reading and re-parsing the log. For reads,
it would be beneficial to memory-map the entire ib_logfile0 to the
address space (typically 48 bits or 256 TiB) and read it from there,
both during --backup and --prepare.
We will introduce the Boolean read-only parameter innodb_log_file_mmap
that will be OFF by default on most platforms, to avoid aggressive
read-ahead of the entire ib_logfile0 in when only a tiny portion would be
accessed. On Linux and FreeBSD the default is innodb_log_file_mmap=ON,
because those platforms define a specific mmap(2) option for enabling
such read-ahead and therefore it can be assumed that the default would
be on-demand paging. This parameter will only have impact on the initial
InnoDB startup and recovery. Any writes to the log will use regular I/O,
except when the ib_logfile0 is stored in a specially configured file system
that is backed by persistent memory (Linux "mount -o dax").
We also experimented with allowing writes of the ib_logfile0 via a
memory mapping and decided against it. A fundamental problem would be
unnecessary read-before-write in case of a major page fault, that is,
when a new, not yet cached, virtual memory page in the circular
ib_logfile0 is being written to. There appears to be no way to tell
the operating system that we do not care about the previous contents of
the page, or that the page fault handler should just zero it out.
Many references to HAVE_PMEM have been replaced with references to
HAVE_INNODB_MMAP.
The predicate log_sys.is_pmem() has been replaced with
log_sys.is_mmap() && !log_sys.is_opened().
Memory-mapped regular files differ from MAP_SYNC (PMEM) mappings in the
way that an open file handle to ib_logfile0 will be retained. In both
code paths, log_sys.is_mmap() will hold. Holding a file handle open will
allow log_t::clear_mmap() to disable the interface with fewer operations.
It should be noted that ever since
commit 685d958e38 (MDEV-14425)
most 64-bit Linux platforms on our CI platforms
(s390x a.k.a. IBM System Z being a notable exception) read and write
/dev/shm/*/ib_logfile0 via a memory mapping, pretending that it is
persistent memory (mount -o dax). So, the memory mapping based log
parsing that this change is enabling by default on Linux and FreeBSD
has already been extensively tested on Linux.
::log_mmap(): If a log cannot be opened as PMEM and the desired access
is read-only, try to open a read-only memory mapping.
xtrabackup_copy_mmap_snippet(), xtrabackup_copy_mmap_logfile():
Copy the InnoDB log in mariadb-backup --backup from a memory
mapped file.
The problem was that the signal thread was not killed when using
unireg_abort().
The bug was introduced by:
MDEV-30260: Slave crashed:reload_acl_and_cache during shutdown
Other things fixed:
- Don't produce memory leaks with safemalloc if all threads was not
ended properly (not useful)
Fixed that internal temporary tables are not waiting for freed disk space.
Other things:
- 'kill id' will now kill a query waiting for free disk space instantly.
Before it could take up to 60 seconds for the kill would be noticed.
- Fixed that sorting one index is not using MY_WAIT_IF_FULL for temp files.
- Fixed bug where share->write_flag set MY_WAIT_IF_FULL for temp files.
It is quite hard to do a test case for this. Instead I tested all
combinations interactively.
The leaks are all 40 bytes and happens in this call stack when running
mtr vcol.vcol_syntax:
alloc_root()
...
Virtual_column_info::fix_and_check_exp()
...
Delayed_insert::get_local_table()
The problem was that one copied a MEM_ROOT from THD to a TABLE without
taking into account that new blocks would be allocated through the
TABLE memroot (and would thus be leaked).
In general, one should NEVER copy MEM_ROOT from one object to another
without clearing the copied memroot!
Fixed by, at end of get_local_table(), copy all new allocated objects
to client_thd->mem_root.
Other things:
- Removed references to MEM_ROOT::total_alloc that was wrongly left
after a previous commit
Use ICU to work with timezones, to retrieve current timezone name,
abbreviation, and offset from GMT. However in case TZ environment variable
is used to set timezone, and ICU does not have corresponding one,
C runtime functions will be used.
Moved some of timezone handling to mysys.
Added unit tests.
Compute binlog checksums (when enabled) already when writing events
into the statement or transaction caches, where before it was done
when the caches are copied to the real binlog file. This moves the
checksum computation outside of holding LOCK_log, improving
scalabitily.
At stmt/trx cache write time, the final end_log_pos values are not
known, so with this patch these will be set to 0. Events that are
written directly to the binlog file (not through stmt/trx cache) keep
the correct end_log_pos value. The GTID and COMMIT/XID events at the
start and end of event groups are written directly, so the zero
end_log_pos is only for events in the middle of event groups, which
do not negatively affect replication.
An option --binlog-legacy-event-pos, off by default, is provided to
disable this behavior to provide backwards compatibility with any
external applications that might rely on end_log_pos in events in the
middle of event groups.
Checksums cannot be pre-computed when binlog encryption is enabled, as
encryption relies on correct end_log_pos to provide part of the
nonce/IV.
Checksum pre-computation is also disabled for WSREP/Galera, as it uses
events differently in its write-sets and so on. Extending pre-computation of
checksums to Galera where it makes sense could be added in a future patch.
The current --binlog-checksum configuration is saved in
binlog_cache_data at transaction start and used to pre-compute
checksums in cache, if applicable. When the cache is later copied to
the binlog, a check is made if the saved value still matches the
configured global value; if so, the events are block-copied directly
into the binlog file. If --binlog-checksum was changed during the
transaction, events are re-written to the binlog file one-by-one and
the checksums recomputed/discarded as appropriate.
Reviewed-by: Monty <monty@mariadb.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The MDEV-29693 conflict resolution is from Monty, as well as is
a bug fix where ANALYZE TABLE wrongly built histograms for
single-column PRIMARY KEY.
Also includes a fix for safe_malloc error reporting.
Other things:
- Copied main.log_slow from 10.4 to avoid mtr issue
Disabled test:
- spider/bugfix.mdev_27239 because we started to get
+Error 1429 Unable to connect to foreign data source: localhost
-Error 1158 Got an error reading communication packets
- main.delayed
- Bug#54332 Deadlock with two connections doing LOCK TABLE+INSERT DELAYED
This part is disabled for now as it fails randomly with different
warnings/errors (no corruption).
Problem:
Under terms of MDEV-27490, we'll update Unicode version used
to compare identifiers to 14.0.0. Unlike in the old Unicode version,
in the new version a string can grow during lower-case. We cannot
perform check_db_name() inplace any more.
Change summary:
- Allocate memory to store lower-cased identifiers in memory root
- Removing check_db_name() performing both in-place lower-casing and validation
at the same time. Splitting it into two separate stages:
* creating a memory-root lower-cased copy of an identifier
(using new MEM_ROOT functions and Query_arena wrapper methods)
* performing validation on a constant string
(using Lex_ident_fs methods)
Implementation details:
- Adding a mysys helper function to allocate lower-cased strings on MEM_ROOT:
lex_string_casedn_root()
and a Query_arena wrappers for it:
make_ident_casedn()
make_ident_opt_casedn()
- Adding a Query_arena method to perform both MEM_ROOT lower-casing and
database name validation at the same time:
to_ident_db_internal_with_error()
This method is very close to the old (pre-11.3) check_db_name(),
but performs lower-casing to a newly allocated MEM_ROOT
memory (instead of performing lower-casing the original string in-place).
- Adding a Table_ident method which additionally handles derived table names:
to_ident_db_internal_with_error()
- Removing the old check_db_name()
Revert the old work-around for buggy fdatasync() on Linux ext3. This bug was
fixed in Linux > 10 years ago back to kernel version at least 3.0.
Reviewed-by: Marko Mäkelä <marko.makela@mariadb.com>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This includes:
- cleanup and optimization of filtering and pushdown engine code.
- Adjusted costs for rowid filters (based on extensive testing
and profiling).
This made a small two changes to the handler_rowid_filter_is_active()
API:
- One should not call it with a zero pointer!
- One does not need to call handler_rowid_filter_is_active() for every
row anymore. It is enough to check if filter is active by calling it
call it during index_init() or when handler::rowid_filter_changed()
is called
The changes was to avoid unnecessary function calls and checks if
pushdown conditions and rowid_filter is not used.
Updated costs for rowid_filter_lookup() to be closer to reality.
The old cost was based only on rowid_compare_cost. This is now
changed to take into account the overhead in checking the rowid.
Changed the Range_rowid_filter class to use DYNAMIC_ARRAY directly
instead of Dynamic_array<>. This was done to be able to use the new
append_dynamic() functions which gives a notable speed improvment
compared to the old code. Removing the abstraction also makes
the code easier to understand.
The cost of filtering is now slightly lower than before, which
is reflected in some test cases that is now using rowid filters.