MemorySanitizer (clang -fsanitize=memory) requires that all code
be compiled with instrumentation enabled. The only exception is the
C runtime library. Failure to use instrumented libraries will cause
bogus messages about memory being uninitialized.
In WITH_MSAN builds, we must avoid calling getservbyname(),
because even though it is a standard library function, it is
not instrumented, not even in clang 10.
Note: Before MariaDB Server 10.5, ./mtr will typically fail
due to the old PCRE library, which was updated in MDEV-14024.
The following cmake options were tested on 10.5
in commit 94d0bb4dbe:
cmake \
-DCMAKE_C_FLAGS='-march=native -O2' \
-DCMAKE_CXX_FLAGS='-stdlib=libc++ -march=native -O2' \
-DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=Debug \
-DWITH_INNODB_{BZIP2,LZ4,LZMA,LZO,SNAPPY}=OFF \
-DPLUGIN_{ARCHIVE,TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT,SPIDER}=NO \
-DWITH_SAFEMALLOC=OFF \
-DWITH_{ZLIB,SSL,PCRE}=bundled \
-DHAVE_LIBAIO_H=0 \
-DWITH_MSAN=ON
MEM_MAKE_DEFINED(): An alias for VALGRIND_MAKE_MEM_DEFINED()
and __msan_unpoison().
MEM_GET_VBITS(), MEM_SET_VBITS(): Aliases for
VALGRIND_GET_VBITS(), VALGRIND_SET_VBITS(), __msan_copy_shadow().
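A minimal sketch of how such wrappers can be expressed (the guard macros and
header choices here are illustrative assumptions, not necessarily the exact
ones used in the tree):

#ifndef __has_feature
# define __has_feature(x) 0  /* for compilers without __has_feature() */
#endif

#if defined HAVE_valgrind
# include <valgrind/memcheck.h>
# define MEM_MAKE_DEFINED(addr, len)     VALGRIND_MAKE_MEM_DEFINED(addr, len)
# define MEM_GET_VBITS(addr, vbits, len) VALGRIND_GET_VBITS(addr, vbits, len)
# define MEM_SET_VBITS(addr, vbits, len) VALGRIND_SET_VBITS(addr, vbits, len)
#elif __has_feature(memory_sanitizer)
# include <sanitizer/msan_interface.h>
# define MEM_MAKE_DEFINED(addr, len)     __msan_unpoison(addr, len)
# define MEM_GET_VBITS(addr, vbits, len) __msan_copy_shadow(vbits, addr, len)
# define MEM_SET_VBITS(addr, vbits, len) __msan_copy_shadow(addr, vbits, len)
#else
# define MEM_MAKE_DEFINED(addr, len)     ((void) 0)
# define MEM_GET_VBITS(addr, vbits, len) ((void) 0)
# define MEM_SET_VBITS(addr, vbits, len) ((void) 0)
#endif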
InnoDB: Replace the UNIV_MEM_ macros with corresponding MEM_ macros.
ut_crc32_8_hw(), ut_crc32_64_low_hw(): Use the compiler built-in
functions instead of inline assembler when building WITH_MSAN.
This will require at least -msse4.2 when building for IA-32 or AMD64.
The inline assembler would not be instrumented, and would thus cause
bogus failures.
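As an illustration of the built-in approach (the helper name below is generic,
not the actual ut_crc32 code), the SSE4.2 intrinsic replaces the hand-written
asm, so that MSAN can track the data flow through it:

#include <stdint.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics; needs -msse4.2 on IA-32/AMD64 */

/* CRC-32C over one 64-bit word, using the compiler intrinsic instead of
   inline assembler; the intrinsic is instrumented like ordinary C code. */
static inline uint32_t crc32c_word(uint32_t crc, uint64_t data)
{
  return (uint32_t) _mm_crc32_u64(crc, data);
}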
mariabackup tries to allocate a buffer of page_size*page_size/4 bytes.
For a 64k page this means 1 GiB, which does not work well on 32-bit builders.
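For the record: 65536 * 65536 / 4 = 1073741824 bytes = 1 GiB.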
Skip the 64k page test on 32-bit builds.
The issue here is that for a DEPENDENT subquery, an aggregate function in the ORDER BY clause
is wrapped inside an Item_aggregate_ref. For the computation of ORDER BY we need to refer to the
temp table field corresponding to this item. But in the function make_sortorder(), we were
explicitly casting Item_aggregate_ref to Item_sum, which led to us not getting the temp
table field corresponding to the item.
This function is called very frequently in a debug build. I can even see it in
a profiler.
This patch reduces execution time of fil_validate() from
8948ns, 8367ns, 8650ns, 8906ns, 8448ns
to
260ns, 232ns, 403ns, 275ns, 169ns
in my environment.
The trick is a faster fil_space_t iteration. The hash table
is typically initialized with a size of 50,000, and looping through
it is slow: slower than iterating over the exact set of fil_space_t objects,
which typically number fewer than ten.
Only debug builds are affected.
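A minimal illustration of the idea, with generic names rather than the actual
InnoDB data structures: instead of scanning every cell of a hash table sized
for 50,000 entries, walk a container that holds exactly the existing elements.

#include <cassert>
#include <list>
#include <vector>

struct space { unsigned id; };

/* Slow: visits all ~50,000 hash cells even if fewer than ten are occupied. */
static void validate_by_hash(const std::vector<space*> &hash_cells)
{
  for (const space *s : hash_cells)
    if (s)
      assert(s->id != 0);  /* stands in for the real invariant checks */
}

/* Fast: visits only the tablespaces that actually exist. */
static void validate_by_list(const std::list<space> &spaces)
{
  for (const space &s : spaces)
    assert(s.id != 0);
}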
The issue here was that the left expr and right expr of the ANY subquery
had different character sets, so we were converting the left expr to the utf8
character set. When this conversion happened, we were actually converting the
item inside the cache, so it looked like <cache>(convert(t1.l1 using utf8)),
which is incorrect.
To fix this problem we now store a reference to the left expr and convert that
to the utf8 character set, so it looks like convert(<cache>(`test`.`t1`.`l1`) using utf8)
Problem:
========
Relay_log_info::flush() reports the following MSAN issue:
==17820==WARNING: MemorySanitizer: use-of-uninitialized-value is reported
#5 0x00005584f0981441 in my_write (Filedes=22,
Buffer=0x72500003e818 "5\n./slave-relay-bin.000003\n21385\n
master-bin.000001\n21643\n0\n", '\245' <repeats 141 times>..., Count=118,
MyFlags=532) at /home/sujatha/bug_repo/test-10.5-msan/mysys/my_write.c:49
Analysis:
=========
In parallel replication, at the end of each statement execution the worker's
execution status is updated in the 'relay-log.info' file. When two workers try
to flush the status at the same time, since the write to the cache is not
serialized, both workers write to the same address simultaneously and increment
the length twice. Because of this, the length of the buffer is greater than the
actual data. When the flush code tries to read the buffer beyond the valid data
length, MSAN reports an uninitialized-value error.
Fix:
===
Serialize the relay log flush operation using "rli->data_lock".
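A rough sketch of the pattern, using std::mutex and generic names rather than
the actual Relay_log_info code:

#include <mutex>
#include <string>

struct relay_info_sketch
{
  std::mutex data_lock;  // plays the role of rli->data_lock
  std::string cache;     // status buffer shared by the workers

  void flush()
  {
    std::lock_guard<std::mutex> guard(data_lock);  // serialize the writers
    // Format the status into the cache and write it to relay-log.info.
    // Only one worker can be here at a time, so the cache length always
    // matches the data that was actually written.
  }
};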
Deadlock in DbugParse, on Linux.
In 10.1, the DBUG recursive mutex was improperly implemented:
the CODE_STATE::locked counter was never updated.
Copy the code around LockMutex/UnlockMutex from 10.2.
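The shape of that logic, sketched with stand-in types (CODE_STATE is
per-thread state, so the counter itself needs no atomic protection):

#include <pthread.h>

static pthread_mutex_t THR_LOCK_sketch = PTHREAD_MUTEX_INITIALIZER;

struct code_state_sketch
{
  unsigned locked;  /* how many times this thread has entered LockMutex() */
};

static void LockMutex(code_state_sketch *cs)
{
  if (cs->locked++ == 0)  /* take the mutex only on the first entry */
    pthread_mutex_lock(&THR_LOCK_sketch);
}

static void UnlockMutex(code_state_sketch *cs)
{
  if (--cs->locked == 0)  /* release only when the outermost call exits */
    pthread_mutex_unlock(&THR_LOCK_sketch);
}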
Analysis:
========
When "Profiling" is enabled, server collects the resource usage of each
statement that gets executed in current session. Profiling doesn't support
nested statements. In order to ensure this behavior when profiling is enabled
for a statement, there should not be any other active query which is being
profiled. This active query information is stored in 'current' variable. When
a nested query arrives it finds 'current' being not NULL and server aborts.
When 'init_connect' and 'init_slave' system variables are set they contain a
set of statements to be executed. "execute_init_command" is the function call
which invokes "dispatch_command" for each statement provided in
'init_connect', 'init_slave' system variables. "execute_init_command" invokes
"start_new_query" and it passes the statement list to "dispatch_command". This
"dispatch_command" intern invokes "start_new_query" which leads to nesting of
queries. Hence '!current' assert is triggered.
Fix:
===
Remove profiling from "execute_init_command" as it will be done within
"dispatch_command" execution.
The galera.galera_parallel_autoinc_manytrx mtr test opens and runs the test
scenario through 3 connections to node 1 and one connection to node 2.
In the test initialization phase, the test creates two tables 't1' and 'ten'
and then creates a stored procedure 'p1' to operate on these tables.
These 3 CREATE DDL statements are issued through the same connection to node 1.
In the next test phase, the mtr script uses the send command to launch the call
to the p1 stored procedure through all 3 connections to node 1 and through
the one connection to node 2. As the mtr send command is asynchronous,
this test phase is a non-blocking and fast operation.
Now, if the replication between nodes is slow, it may happen that the
initialization phase DDL statements have not been received or have not been
fully applied in node 2. Therefore there is no guarantee that the test tables
and the stored procedure have been created in node 2. Yet, the test is trying
to call p1 in node 2.
In the failure case, the error log contains the error message
"MTR failed: query 'reap' failed: 1305: PROCEDURE test.p1 does not exist"
The reap command through the connection to node 2 is the first place where the
test execution may observe that the test tables and/or the stored procedure
have not yet been created in node 2.
The fix in this commit adds a wait condition in the connection to node 2, to
wait until the stored procedure is created before calling it.
The wait is implemented by looking in information_schema.routines for the p1
stored procedure.
MDEV-22140 galera.galera_drop_database MTR failed: InnoDB: MySQL is trying to drop database `fts`.`` though there are still open handles
Add wait conditions to wait until all operations are done in both
nodes.
On FreeBSD, perl isn't in /usr/bin, it's in /usr/local/bin or
elsewhere in the path.
Like storage/{maria/unittest/,}ma_test_*, we use /usr/bin/env to
find perl and run it.
Item_func_json_extract did not implement val_decimal(),
so CAST(JSON_EXTRACT('{"x":true}', '$.x') AS DECIMAL) erroneously
returned 0 with a warning because of the conversion from the string "true"
to decimal.
Implement val_decimal(), so boolean values are correctly handled.
This feature adds support for rr in mtr. The following options are added:
--rr           run mysqld in rr record mode
--rr_option=   run rr with a custom record option; for multiple
               options use --rr_option= once for each option.
For example:
./mtr main.view --rr_option=-h --rr_option=-u --rr_option=-c=23
--boot-rr      run the mysqld that performs bootstrap in rr record mode
Recordings are stored in the mysql-test/var/rr folder.
To replay a recording, run:
rr replay var/rr/mysql-X
Limitations
Restart will create a new recording.
Repeat will work on the same recording, so it might be harder to debug.
If a test creates multiple instances of mariadb, all of them will be stored in var/rr.
The code in fill_schema_schemata() did not take into account that
make_db_list() can leave db_names empty if the requested database
name was too long, so the call to db_names.at(0) crashed on an assert.
- Moving the code testing if the database directory exists
into a separate function verify_database_directory_exists()
- Modifying the test to check if db_names is not empty
Problem:
========
- InnoDB clears the FTS resources when the last FTS index is being dropped
if the table has a user-defined FTS_DOC_ID. While creating a new FTS
index, InnoDB expects the FTS resources to exist.
Fix:
===
fts_drop_index(): Removed the clearing of the FTS resources.
fts_clear_all(): Clear the FTS resources when there are no new FTS
indexes to be added.
commit_cache_norebuild(), row_merge_drop_indexes():
Try to clear the FTS resources after removing the associated FTS index
from the table object.
copy_data_between_tables() sets to->s->default_fields to 0 as a part
of the code disabling ON UPDATE actions for all old fields
(so ON UPDATE is enabled only for new fields during copying).
After the actual copying, copy_data_between_tables() did not restore
to->s->default_fields to the original value. As a result, the
TABLE_SHARE to->s was left in a wrong state after copy_data_between_tables(),
and a further open_table_from_share() using this TABLE_SHARE did not
populate TABLE::default_field, which in turn made
TABLE::evaluate_update_default_function() crash on access to a NULL
pointer.
Fix:
Changing copy_data_between_tables() to restore to->s->default_fields
to its original value after the copying loop.
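A minimal sketch of the save/restore pattern with stand-in types (not the
actual TABLE/TABLE_SHARE definitions):

#include <cstdint>

struct table_share_sketch { uint32_t default_fields; };
struct table_sketch { table_share_sketch *s; };

static void copy_rows(table_sketch *to)
{
  const uint32_t saved= to->s->default_fields;
  to->s->default_fields= 0;      // ON UPDATE stays active only for new fields
  /* ... copying loop ... */
  to->s->default_fields= saved;  // the restore that was missing before the fix
}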
Problem:
=======
The "Start binlog_dump" message hasn't been updated to include the slave's
requested GTID position:
20:05:05 139836760311552 [Note] Start binlog_dump to slave_server(2), pos(, 4)
For diagnostic purposes, it would be helpful if the GTID position were
included.
Fix:
===
Improve the "Start binlog_dump" print message to include the "using_gtid"
flag and the "GTID position" requested by the slave.
Ex:
[Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1),
gtid('1-1-201,2-2-100')
[Note] Start binlog_dump to slave_server(3), pos('mariadb-bin.004142',
507988273), using_gtid(0), gtid('')
When a prepared statement parameter '?' is used in a CTE that is used
multiple times, the following happens:
- The CTE definition is re-parsed multiple times.
- There are multiple Item_param objects referring to the same "?" in
the original query.
- Prepared_statement::param has a pointer to the first of them, the
others are "clones".
- When a prepared statement parameter gets its value, it should be passed
over to the clones with a param->sync_clones() call.
This call is made in insert_params(), etc. It was not made in
insert_params_with_log().
This would cause the Item_param to not have any value, which would confuse
the query optimizer.
Added the missing call.
The test case that was added for MDEV-21217
(commit b68f1d847f)
should have only two possible outcomes for the locking SELECT statement:
(1) The statement is blocked, and the test will eventually fail
with a lock wait timeout. This is what I observed when the
code fix for MDEV-21217 was missing.
(2) The lock conflict will ensure that the statement will execute
after the rollback has completed, and an empty table will be observed.
This is the expected outcome with the recovery fix.
What occasionally happens (in some of our CI environments only, so far)
is that the locking SELECT will return all 1,000 rows of the table that
had been inserted by the transaction that was never supposed to be
committed. One possibility is that the transaction was unexpectedly
committed when the server was killed.
Let us disable the test until the reason of the failure has been
determined and addressed.
Item_cache_datetime::decimals was always copied from example->decimals
without being limited to 6 (the maximum possible number of fractional digits),
so val_str() later crashed on asserts inside my_time_to_str() and
my_datetime_to_str().
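A minimal sketch of the intended clamping (the constant name is an assumption;
the server has its own symbol for the maximum number of fractional digits):

#include <algorithm>
#include <cstdint>

static const uint8_t MAX_DATETIME_DECIMALS= 6;  // fractional second digits

static uint8_t cache_decimals_from_example(uint8_t example_decimals)
{
  // Copying example->decimals verbatim can exceed 6 and trip the asserts
  // in my_time_to_str()/my_datetime_to_str(); clamp it instead.
  return std::min(example_decimals, MAX_DATETIME_DECIMALS);
}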
The code erroneously used buff[100] in a few places to make
a GRANTEE value in the form:
'user'@'host'
Fix:
- Fixing the code to use (USER_HOST_BUFF_SIZE + 6) instead of 100.
- Adding a DBUG_ASSERT to make sure the buffer is large enough.
- Wrapping the code into a class Grantee_str, to make it easier to reuse in 4 places.
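A sketch of the wrapper; the length constants below are illustrative
assumptions, while the real code derives USER_HOST_BUFF_SIZE from the server's
username/hostname limits:

#include <cassert>
#include <cstddef>
#include <cstdio>

static const size_t USERNAME_LEN= 128;
static const size_t HOSTNAME_LEN= 255;
static const size_t USER_HOST_BUFF_SIZE= USERNAME_LEN + HOSTNAME_LEN + 2;

class Grantee_str
{
  char buff[USER_HOST_BUFF_SIZE + 6];  // room for the quotes, '@' and '\0'
public:
  Grantee_str(const char *user, const char *host)
  {
    int len= std::snprintf(buff, sizeof(buff), "'%s'@'%s'", user, host);
    assert(len > 0 && (size_t) len < sizeof(buff));  // must never truncate
  }
  const char *ptr() const { return buff; }
};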
On large hard disks (> 2TB), the plugin won't function correctly, always
showing 2 TB of available space due to integer overflow. Upgrade table
fields to bigint to resolve this problem.
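If the columns are signed 32-bit integers counting KiB, they cap out at
2^31 KiB = 2 TiB, which would match the observed ceiling; a 64-bit (bigint)
column removes that limit.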
The real problem was that the attempt to roll back changes after running out
of memory in the query cache was made incorrectly and led to the use of
uninitialized memory.
(The bug has nothing to do with the resize operation; it is just a
lack-of-resources error processed incorrectly.)