This commit updates default memory allocations size used with MEM_ROOT
objects to minimize the number of calls to malloc().
Changes:
- Updated MEM_ROOT block sizes in sql_const.h
- Updated MALLOC_OVERHEAD to also take into account the extra memory
allocated by my_malloc()
- Updated init_alloc_root() to only take MALLOC_OVERHEAD into account as
buffer size, not MALLOC_OVERHEAD + sizeof(USED_MEM).
- Reset mem_root->first_block_usage if and only if first block was used.
- Increase MEM_ROOT buffers sized used by my_load_defaults, plugin_init,
Create_tmp_table, allocate_table_share, TABLE and TABLE_SHARE.
This decreases number of malloc calls during queries.
- Use a small buffer for THD->main_mem_root in THD::THD. This avoids
multiple malloc() call for new connections.
I tried the above changes on a complex select query with 12 tables.
The following shows the number of extra allocations that where used
to increase the size of the MEM_ROOT buffers.
Original code:
- Connection to MariaDB: 9 allocations
- First query run: 146 allocations
- Second query run: 24 allocations
Max memory allocated for thd when using with heap table: 61,262,408
Max memory allocated for thd when using Aria tmp table: 419,464
After changes:
Connection to MariaDB: 0 allocations
- First run: 25 allocations
- Second run: 7 allocations
Max memory allocated for thd when using with heap table: 61,347,424
Max memory allocated for thd when using Aria table: 529,168
The new code uses slightly more memory, but avoids memory fragmentation
and is slightly faster thanks to much fewer calls to malloc().
Reviewed-by: Sergei Golubchik <serg@mariadb.org>
This is done by mapping most of the existing MySQL unicode 0900 collations
to MariadB 1400 unicode collations. The assumption is that 1400 is a super
set of 0900 for all practical purposes.
I also added a new function 'compare_collations()' and changed most code
to use this instead of comparing character sets directly.
This enables one to seamlessly mix-and-match the corresponding 0900 and
1400 sets. Field comparision and alter table treats the character sets
as identical.
All MySQL 8.0 0900 collations are supported except:
- utf8mb4_ja_0900_as_cs
- utf8mb4_ja_0900_as_cs_ks
- utf8mb4_ru_0900_as_cs
- utf8mb4_zh_0900_as_cs
These do not have corresponding entries in the MariadB 01400 collations.
Other things:
- Added COMMENT colum to information_schema.collations. For utf8mb4_0900
colletions it contains the corresponding alias collation.
os_innodb_umask was of the incorrect type resulting in warnings
in clang-19. The correct type is mode_t.
As os_innodb_umask was set during innnodb_init from my_umask,
corrected the type there along with its companion my_umask_dir.
Because of this, the defaults mask values in innodb never
had an effect.
The resulting change allow found signed differences in
my_create{,_nosymlink}, open_nosymlinks:
mysys/my_create.c:47:20: error: operand of ?: changes signedness from ‘int’ to ‘mode_t’ {aka ‘unsigned int’} due to unsignedness of other operand [-Werror=sign-compare]
47 | CreateFlags ? CreateFlags : my_umask);
Ref: clang-19 warnings:
[55/123] Building CXX object storage/innobase/CMakeFiles/innobase.dir/os/os0file.cc.o
storage/innobase/os/os0file.cc:1075:46: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
1075 | file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
| ~~~~ ^~~~~~~~~~~~~~~
storage/innobase/os/os0file.cc:1249:46: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
1249 | file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
| ~~~~ ^~~~~~~~~~~~~~~
storage/innobase/os/os0file.cc:1381:45: warning: implicit conversion loses integer precision: 'ulint' (aka 'unsigned long') to 'mode_t' (aka 'unsigned int') [-Wshorten-64-to-32]
1381 | file = open(name, create_flag | O_CLOEXEC, os_innodb_umask);
| ~~~~ ^~~~~~~~~~~~~~~
Partial commit of the greater MDEV-34348 scope.
MDEV-34348: MariaDB is violating clang-16 -Wcast-function-type-strict
The functions queue_compare, qsort2_cmp, and qsort_cmp2
all had similar interfaces, and were used interchangable
and unsafely cast to one another.
This patch consolidates the functions all into the
qsort_cmp2 interface.
Reviewed By:
============
Marko Mäkelä <marko.makela@mariadb.com>
* preserve the graph in memory between statements
* keep it in a TABLE_SHARE, available for concurrent searches
* nodes are generally read-only, walking the graph doesn't change them
* distance to target is cached, calculated only once
* SIMD-optimized bloom filter detects visited nodes
* nodes are stored in an array, not List, to better utilize bloom filter
* auto-adjusting heuristic to estimate the number of visited nodes
(to configure the bloom filter)
* many threads can concurrently walk the graph. MEM_ROOT and Hash_set
are protected with a mutex, but walking doesn't need them
* up to 8 threads can concurrently load nodes into the cache,
nodes are partitioned into 8 mutexes (8 is chosen arbitrarily, might
need tuning)
* concurrent editing is not supported though
* this is fine for MyISAM, TL_WRITE protects the TABLE_SHARE and the
graph (note that TL_WRITE_CONCURRENT_INSERT is not allowed, because an
INSERT into the main table means multiple UPDATEs in the graph)
* InnoDB uses secondary transaction-level caches linked in a list in
in thd->ha_data via a fake handlerton
* on rollback the secondary cache is discarded, on commit nodes
from the secondary cache are invalidated in the shared cache
while it is exclusively locked
* on savepoint rollback both caches are flushed. this can be improved
in the future with a row visibility callback
* graph size is controlled by @@mhnsw_cache_size, the cache is flushed
when it reaches the threshold
savepoint support was added a year ago as get_last_memroot_block()
and free_all_new_blocks(), but it didn't work and was disabled.
* fix it to work
* instead of freeing the memory, only mark blocks free - this feature
is supposed to be used inside a loop (otherwise there is no need
to free anything, end of statement will do it anyway). And freeing
blocks inside a loop is a bad idea, as they'll be all malloc-ed
on the next iteration again. So, don't.
* fix a bug in mark_blocks_free() - it doesn't change the number of blocks
When using the default innodb_log_buffer_size=2m, mariadb-backup --backup
would spend a lot of time re-reading and re-parsing the log. For reads,
it would be beneficial to memory-map the entire ib_logfile0 to the
address space (typically 48 bits or 256 TiB) and read it from there,
both during --backup and --prepare.
We will introduce the Boolean read-only parameter innodb_log_file_mmap
that will be OFF by default on most platforms, to avoid aggressive
read-ahead of the entire ib_logfile0 in when only a tiny portion would be
accessed. On Linux and FreeBSD the default is innodb_log_file_mmap=ON,
because those platforms define a specific mmap(2) option for enabling
such read-ahead and therefore it can be assumed that the default would
be on-demand paging. This parameter will only have impact on the initial
InnoDB startup and recovery. Any writes to the log will use regular I/O,
except when the ib_logfile0 is stored in a specially configured file system
that is backed by persistent memory (Linux "mount -o dax").
We also experimented with allowing writes of the ib_logfile0 via a
memory mapping and decided against it. A fundamental problem would be
unnecessary read-before-write in case of a major page fault, that is,
when a new, not yet cached, virtual memory page in the circular
ib_logfile0 is being written to. There appears to be no way to tell
the operating system that we do not care about the previous contents of
the page, or that the page fault handler should just zero it out.
Many references to HAVE_PMEM have been replaced with references to
HAVE_INNODB_MMAP.
The predicate log_sys.is_pmem() has been replaced with
log_sys.is_mmap() && !log_sys.is_opened().
Memory-mapped regular files differ from MAP_SYNC (PMEM) mappings in the
way that an open file handle to ib_logfile0 will be retained. In both
code paths, log_sys.is_mmap() will hold. Holding a file handle open will
allow log_t::clear_mmap() to disable the interface with fewer operations.
It should be noted that ever since
commit 685d958e38 (MDEV-14425)
most 64-bit Linux platforms on our CI platforms
(s390x a.k.a. IBM System Z being a notable exception) read and write
/dev/shm/*/ib_logfile0 via a memory mapping, pretending that it is
persistent memory (mount -o dax). So, the memory mapping based log
parsing that this change is enabling by default on Linux and FreeBSD
has already been extensively tested on Linux.
::log_mmap(): If a log cannot be opened as PMEM and the desired access
is read-only, try to open a read-only memory mapping.
xtrabackup_copy_mmap_snippet(), xtrabackup_copy_mmap_logfile():
Copy the InnoDB log in mariadb-backup --backup from a memory
mapped file.
Apart from better performance when accessing thread local variables,
we'll get rid of things that depend on initialization/cleanup of
pthread_key_t variables.
Where appropriate, use compiler-dependent pre-C++11 thread-local
equivalents, where it makes sense, to avoid initialization check overhead
that non-static thread_local can suffer from.
Two new variables added:
- max_tmp_space_usage : Limits the the temporary space allowance per user
- max_total_tmp_space_usage: Limits the temporary space allowance for
all users.
New status variables: tmp_space_used & max_tmp_space_used
New field in information_schema.process_list: TMP_SPACE_USED
The temporary space is counted for:
- All SQL level temporary files. This includes files for filesort,
transaction temporary space, analyze, binlog_stmt_cache etc.
It does not include engine internal temporary files used for repair,
alter table, index pre sorting etc.
- All internal on disk temporary tables created as part of resolving a
SELECT, multi-source update etc.
Special cases:
- When doing a commit, the last flush of the binlog_stmt_cache
will not cause an error even if the temporary space limit is exceeded.
This is to avoid giving errors on commit. This means that a user
can temporary go over the limit with up to binlog_stmt_cache_size.
Noteworthy issue:
- One has to be careful when using small values for max_tmp_space_limit
together with binary logging and with non transactional tables.
If a the binary log entry for the query is bigger than
binlog_stmt_cache_size and one hits the limit of max_tmp_space_limit
when flushing the entry to disk, the query will abort and the
binary log will not contain the last changes to the table.
This will also stop the slave!
This is also true for all Aria tables as Aria cannot do rollback
(except in case of crashes)!
One way to avoid it is to use @@binlog_format=statement for
queries that updates a lot of rows.
Implementation:
- All writes to temporary files or internal temporary tables, that
increases the file size, are routed through temp_file_size_cb_func()
which updates and checks the temp space usage.
- Most of the temporary file monitoring is done inside IO_CACHE.
Temporary file monitoring is done inside the Aria engine.
- MY_TRACK and MY_TRACK_WITH_LIMIT are new flags for ini_io_cache().
MY_TRACK means that we track the file usage. TRACK_WITH_LIMIT means
that we track the file usage and we give an error if the limit is
breached. This is used to not give an error on commit when
binlog_stmp_cache is flushed.
- global_tmp_space_used contains the total tmp space used so far.
This is needed quickly check against max_total_tmp_space_usage.
- Temporary space errors are using EE_LOCAL_TMP_SPACE_FULL and
handler errors are using HA_ERR_LOCAL_TMP_SPACE_FULL.
This is needed until we move general errors to it's own error space
so that they cannot conflict with system error numbers.
- Return value of my_chsize() and mysql_file_chsize() has changed
so that -1 is returned in the case my_chsize() could not decrease
the file size (very unlikely and will not happen on modern systems).
All calls to _chsize() are updated to check for > 0 as the error
condition.
- At the destruction of THD we check that THD::tmp_file_space == 0
- At server end we check that global_tmp_space_used == 0
- As a precaution against errors in the tmp_space_used code, one can set
max_tmp_space_usage and max_total_tmp_space_usage to 0 to disable
the tmp space quota errors.
- truncate_io_cache() function added.
- Aria tables using static or dynamic row length are registered in 8K
increments to avoid some calls to update_tmp_file_size().
Other things:
- Ensure that all handler errors are registered. Before, some engine
errors could be printed as "Unknown error".
- Fixed bug in filesort() that causes a assert if there was an error
when writing to the temporay file.
- Fixed that compute_window_func() now takes into account write errors.
- In case of parallel replication, rpl_group_info::cleanup_context()
could call trans_rollback() with thd->error set, which would cause
an assert. Fixed by resetting the error before calling trans_rollback().
- Fixed bug in subselect3.inc which caused following test to use
heap tables with low value for max_heap_table_size
- Fixed bug in sql_expression_cache where it did not overflow
heap table to Aria table.
- Added Max_tmp_disk_space_used to slow query log.
- Fixed some bugs in log_slow_innodb.test
This task is to ensure we have a clear definition and rules of how to
repair or optimize a table.
The rules are:
- REPAIR should be used with tables that are crashed and are
unreadable (hardware issues with not readable blocks, blocks with
'unexpected data' etc)
- OPTIMIZE table should be used to optimize the storage layout for the
table (recover space for delete rows and optimize the index
structure.
- ALTER TABLE table_name FORCE should be used to rebuild the .frm file
(the table definition) and the table (with the original table row
format). If the table is from and older MariaDB/MySQL release with a
different storage format, it will convert the data to the new
format. ALTER TABLE ... FORCE is used as part of mariadb-upgrade
Here follows some more background:
The 3 ways to repair a table are:
1) ALTER TABLE table_name FORCE" (not other options).
As an alias we allow: "ALTER TABLE table_name ENGINE=original_engine"
2) "REPAIR TABLE" (without FORCE)
3) "OPTIMIZE TABLE"
All of the above commands will optimize row space usage (which means that
space will be needed to hold a temporary copy of the table) and
re-generate all indexes. They will also try to replicate the original
table definition as exact as possible.
For ALTER TABLE and "REPAIR TABLE without FORCE", the following will hold:
If the table is from an older MariaDB version and data conversion is
needed (for example for old type HASH columns, MySQL JSON type or new
TIMESTAMP format) "ALTER TABLE table_name FORCE, algorithm=COPY" will be
used.
The differences between the algorithms are
1) Will use the fastest algorithm the engine supports to do a full repair
of the table (except if data conversions are is needed).
2) Will use the storage engine internal REPAIR facility (MyISAM, Aria).
If the engine does not support REPAIR then
"ALTER TABLE FORCE, ALGORITHM=COPY" will be used.
If there was data incompatibilities (which means that FORCE was used)
then there will be a warning after REPAIR that ALTER TABLE FORCE is
still needed.
The reason for this is that REPAIR may be able to go around data
errors (wrong incompatible data, crashed or unreadable sectors) that
ALTER TABLE cannot do.
3) Will use the storage engine internal OPTIMIZE. If engine does not
support optimize, then "ALTER TABLE FORCE" is used.
The above will ensure that ALTER TABLE FORCE is able to
correct almost any errors in the row or index data. In case of
corrupted blocks then REPAIR possible followed by ALTER TABLE is needed.
This is important as mariadb-upgrade executes ALTER TABLE table_name
FORCE for any table that must be re-created.
Bugs fixed with InnoDB tables when using ALTER TABLE FORCE:
- No error for INNODB_DEFAULT_ROW_FORMAT=COMPACT even if row length
would be too wide. (Independent of innodb_strict_mode).
- Tables using symlinks will be symlinked after any of the above commands
(Independent of the setting of --symbolic-links)
If one specifies an algorithm together with ALTER TABLE FORCE, things
will work as before (except if data conversion is required as then
the COPY algorithm is enforced).
ALTER TABLE .. OPTIMIZE ALL PARTITIONS will work as before.
Other things:
- FORCE argument added to REPAIR to allow one to first run internal
repair to fix damaged blocks and then follow it with ALTER TABLE.
- REPAIR will not update frm_version if ha_check_for_upgrade() finds
that table is still incompatible with current version. In this case the
REPAIR will end with an error.
- REPAIR for storage engines that does not have native repair, like InnoDB,
is now using ALTER TABLE FORCE.
- REPAIR csv-table USE_FRM now works.
- It did not work before as CSV tables had extension list in wrong
order.
- Default error messages length for %M increased from 128 to 256 to not
cut information from REPAIR.
- Documented HA_ADMIN_XX variables related to repair.
- Added HA_ADMIN_NEEDS_DATA_CONVERSION to signal that we have to
do data conversions when converting the table (and thus ALTER TABLE
copy algorithm is needed).
- Fixed typo in error message (caused test changes).
The problem was that the signal thread was not killed when using
unireg_abort().
The bug was introduced by:
MDEV-30260: Slave crashed:reload_acl_and_cache during shutdown
Other things fixed:
- Don't produce memory leaks with safemalloc if all threads was not
ended properly (not useful)
Fixed that internal temporary tables are not waiting for freed disk space.
Other things:
- 'kill id' will now kill a query waiting for free disk space instantly.
Before it could take up to 60 seconds for the kill would be noticed.
- Fixed that sorting one index is not using MY_WAIT_IF_FULL for temp files.
- Fixed bug where share->write_flag set MY_WAIT_IF_FULL for temp files.
It is quite hard to do a test case for this. Instead I tested all
combinations interactively.
The leaks are all 40 bytes and happens in this call stack when running
mtr vcol.vcol_syntax:
alloc_root()
...
Virtual_column_info::fix_and_check_exp()
...
Delayed_insert::get_local_table()
The problem was that one copied a MEM_ROOT from THD to a TABLE without
taking into account that new blocks would be allocated through the
TABLE memroot (and would thus be leaked).
In general, one should NEVER copy MEM_ROOT from one object to another
without clearing the copied memroot!
Fixed by, at end of get_local_table(), copy all new allocated objects
to client_thd->mem_root.
Other things:
- Removed references to MEM_ROOT::total_alloc that was wrongly left
after a previous commit
Use ICU to work with timezones, to retrieve current timezone name,
abbreviation, and offset from GMT. However in case TZ environment variable
is used to set timezone, and ICU does not have corresponding one,
C runtime functions will be used.
Moved some of timezone handling to mysys.
Added unit tests.
Compute binlog checksums (when enabled) already when writing events
into the statement or transaction caches, where before it was done
when the caches are copied to the real binlog file. This moves the
checksum computation outside of holding LOCK_log, improving
scalabitily.
At stmt/trx cache write time, the final end_log_pos values are not
known, so with this patch these will be set to 0. Events that are
written directly to the binlog file (not through stmt/trx cache) keep
the correct end_log_pos value. The GTID and COMMIT/XID events at the
start and end of event groups are written directly, so the zero
end_log_pos is only for events in the middle of event groups, which
do not negatively affect replication.
An option --binlog-legacy-event-pos, off by default, is provided to
disable this behavior to provide backwards compatibility with any
external applications that might rely on end_log_pos in events in the
middle of event groups.
Checksums cannot be pre-computed when binlog encryption is enabled, as
encryption relies on correct end_log_pos to provide part of the
nonce/IV.
Checksum pre-computation is also disabled for WSREP/Galera, as it uses
events differently in its write-sets and so on. Extending pre-computation of
checksums to Galera where it makes sense could be added in a future patch.
The current --binlog-checksum configuration is saved in
binlog_cache_data at transaction start and used to pre-compute
checksums in cache, if applicable. When the cache is later copied to
the binlog, a check is made if the saved value still matches the
configured global value; if so, the events are block-copied directly
into the binlog file. If --binlog-checksum was changed during the
transaction, events are re-written to the binlog file one-by-one and
the checksums recomputed/discarded as appropriate.
Reviewed-by: Monty <monty@mariadb.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The MDEV-29693 conflict resolution is from Monty, as well as is
a bug fix where ANALYZE TABLE wrongly built histograms for
single-column PRIMARY KEY.
Also includes a fix for safe_malloc error reporting.
Other things:
- Copied main.log_slow from 10.4 to avoid mtr issue
Disabled test:
- spider/bugfix.mdev_27239 because we started to get
+Error 1429 Unable to connect to foreign data source: localhost
-Error 1158 Got an error reading communication packets
- main.delayed
- Bug#54332 Deadlock with two connections doing LOCK TABLE+INSERT DELAYED
This part is disabled for now as it fails randomly with different
warnings/errors (no corruption).