Atomic_relaxed<T>: add fetch_or() and fetch_and()
innodb_init(): rely on a zero-initialization of a global variable
monitor_set_tbl: make Atomic_relaxed<ulint> array and use proper operations
for setting bit, unsetting bit and reading bit
Reviewed by: Marko Mäkelä
The clang++ -stdlib=libc++ header file <fstream> depends on
<filesystem> that defines a member function path::root_name(),
which conflicts with the rather unused #define root_name()
that had been introduced in
commit 7c58e97bf6.
Because an instrumented -stdlib=libc++ (rather than the default
-stdlib=libstdc++) is easier to build for a working -fsanitize=memory
(cmake -DWITH_MSAN=ON), let us remove the conflicting #define for now.
This reverts commit 6cf8f05fd9.
Original patch assumed that MAP_HUGETLB as consistent across
achitectures which isn't the case. Defining it unconditionally
broke large pages on every achitecutre where the value differed
from x86_64.
With the EOL for Centos/RHEL6 announced in 10.5.7, <3.8 linux
kernels are no longer supported.
This PR fixes same issue as MDEV-21577 for TRUNCATE TABLE.
MDEV-21577 fixed TOI replication for OPTIMIZE, REPAIR and ALTER TABLE
operating on FK child table. It was later found out that also TRUNCATE
has similar problem and needs a fix.
The actual fix is to do FK parent table lookup before TRUNCATE TOI
isolation and append found FK parent table names in certification key
list for the write set.
PR contains also new test scenario in galera_ddl_fk_conflict test where
FK child has two FK parent tables and there are two DML transactions operating
on both parent tables.
For development convenience, new TO isolation macro was added:
WSREP_TO_ISOLATION_BEGIN_IF and WSREP_TO_ISOLATION_BEGIN_ALTER macro was changed
to skip the goto statement.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
MDEV-23855 broke the handling of innodb_flush_sync=OFF.
That parameter is supposed to limit the page write rate
in case the log capacity is being exceeded and log checkpoints
are needed.
With this fix, the following should pass:
./mtr --mysqld=--loose-innodb-flush-sync=0
One of our best regression tests for page flushing is
encryption.innochecksum. With innodb_page_size=16k and
innodb_flush_sync=OFF it would likely hang without this fix.
log_sys.last_checkpoint_lsn: Declare as Atomic_relaxed<lsn_t>
so that we are allowed to read the value while not holding
log_sys.mutex.
buf_flush_wait_flushed(): Let the page cleaner perform the flushing
also if innodb_flush_sync=OFF. After the page cleaner has
completed, perform a checkpoint if it is needed, because
buf_flush_sync_for_checkpoint() will not be run if
innodb_flush_sync=OFF.
buf_flush_ahead(): Simplify the condition. We do not really care
whether buf_flush_page_cleaner() is running.
buf_flush_page_cleaner(): Evaluate innodb_flush_sync at the low
level. If innodb_flush_sync=OFF, rate-limit the batches to
innodb_io_capacity_max pages per second.
Reviewed by: Vladislav Vaintroub
Some DDL statements appear to acquire MDL locks for a table referenced by
foreign key constraint from the actual affected table of the DDL statement.
OPTIMIZE, REPAIR and ALTER TABLE belong to this class of DDL statements.
Earlier MariaDB version did not take this in consideration, and appended
only affected table in the certification key list in write set.
Because of missing certification information, it could happen that e.g.
OPTIMIZE table for FK child table could be allowed to apply in parallel
with DML operating on the foreign key parent table, and this could lead to
unhandled MDL lock conflicts between two high priority appliers (BF).
The fix in this patch, changes the TOI replication for OPTIMIZE, REPAIR and
ALTER TABLE statements so that before the execution of respective DDL
statement, there is foreign key parent search round. This FK parent search
contains following steps:
* open and lock the affected table (with permissive shared locks)
* iterate over foreign key contstraints and collect and array of Fk parent
table names
* close all tables open for the THD and release MDL locks
* do the actual TOI replication with the affected table and FK parent
table names as key values
The patch contains also new mtr test for verifying that the above mentioned
DDL statements replicate without problems when operating on FK child table.
The mtr test scenario #1, which can be used to check if some other DDL
(on top of OPTIMIZE, REPAIR and ALTER) could cause similar excessive FK
parent table locking.
Reviewed-by: Aleksey Midenkov <aleksey.midenkov@mariadb.com>
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
This follows up commit
commit 94a520ddbe and
commit 7c5519c12d.
After these changes, the default test suites on a
cmake -DWITH_UBSAN=ON build no longer fail due to passing
null pointers as parameters that are declared to never be null,
but plenty of other runtime errors remain.
There are 2 issues here:
Issue #1: memory allocation.
An IO_CACHE that uses encryption uses a larger buffer (it needs space for the encrypted data,
decrypted data, IO_CACHE_CRYPT struct to describe encryption parameters etc).
Issue #2: IO_CACHE::seek_not_done
When IO_CACHE objects are cloned, they still share the file descriptor.
This means, operation on one IO_CACHE may change the file read position
which will confuse other IO_CACHEs using it.
The fix of these issues would be:
Allocate the buffer to also include the extra size needed for encryption.
Perform seek again after one IO_CACHE reads the file.
The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.
The configuration parameters will be changed as follows:
innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)
Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows:
* When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
* As a buf_pool.free limit in buf_LRU_list_batch() for terminating
the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().
The status variables will be changed as follows:
innodb_buffer_pool_pages_flushed: This includes also the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably,
updated one by one in buf_flush_page() to give more real-time
statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.
The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call
When moving to a single buffer pool instance in MDEV-15058, we missed
some opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.
Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.
The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.
The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a unless log_warnings>2.
Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are are currently exclusively locked.
The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message
will be removed, because by design, the buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.
We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.
To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.
For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.
mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().
buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.
buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.
buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.
buf_pool_t::done_flush_list: Condition variable for !n_flush_list.
buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.
buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.
buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().
flush_counters_t::unzip_LRU_evicted: Remove.
IORequest: Make more members const. FIXME: m_fil_node should be removed.
buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).
page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.
pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().
recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().
recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.
srv_started_redo: Replaces srv_start_state.
SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().
buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().
buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
function, which was their only caller, and remove 2 unnecessary
buf_pool.mutex release/re-acquisition that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.
buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.
buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.
buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().
buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.
buf_flush_check_neighbor(): Take id.fold() as a parameter.
buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.
buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.
buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_do_LRU_batch(): Return the number of pages flushed.
buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.
buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.
buf_LRU_check_size_of_non_data_objects(): Simplify the code.
buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.
buf_page_create(): Take a pre-allocated page as a parameter.
buf_pool_t::free_block(): Free a pre-allocated block.
recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.
BtrBulk::logFreeCheck(): Skip a redundant condition.
row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.
sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.
Reviewed by: Vladislav Vaintroub
Add CRC32C code to mysys. The x86-64 implementation uses PCMULQDQ in addition to CRC32 instruction
after Intel whitepaper, and is ported from rocksdb code.
Optimized ARM and POWER CRC32 were already present in mysys.
Remove incorrect BF (brute force) handling from lock_rec_has_to_wait_in_queue
and move condition to correct callers. Add a function to report
BF lock waits and assert if incorrect BF-BF lock wait happens.
wsrep_report_bf_lock_wait
Add a new function to report BF lock wait.
wsrep_assert_no_bf_bf_wait
Add a new function to check do we have a
BF-BF wait and if we have report this case
and assert as it is a bug.
lock_rec_has_to_wait
Use new wsrep_assert_bf_wait to check BF-BF wait.
lock_rec_create_low
lock_table_create
Use new function to report BF lock waits.
lock_rec_insert_by_trx_age
lock_grant_and_move_on_page
lock_grant_and_move_on_rec
Assert that trx is not Galera as VATS is not compatible
with Galera.
lock_rec_add_to_queue
If there is conflicting lock in a queue make sure that
transaction is BF.
lock_rec_has_to_wait_in_queue
Remove incorrect BF handling. If there is conflicting
locks in a queue all transactions must wait.
lock_rec_dequeue_from_page
lock_rec_unlock
If there is conflicting lock make sure it is not
BF-BF case.
lock_rec_queue_validate
Add Galera record locking rules comment and use
new function to report BF lock waits.
All attempts to reproduce the original assertion have been
failed. Therefore, there is no test case on this commit.
This follows up MDEV-14374, which was filed against MariaDB Server 10.3.
Back then, on a 48-core Qualcomm Centriq 2400, the performance of
delay loops for spinloops was tested both with and without the dummy
compare-and-swap operation, and it was decided to keep the dummy
operation.
On target architectures where nothing special is available (other than
x86 (IA-32, AMD64) or POWER), we perform a dummy compare-and-swap operation.
This is contrary to the idea of the x86 PAUSE instruction and the
__ppc_get_timebase(), which aim to keep the memory bus idle for a while,
to allow other cores to better execute code while a spinloop is waiting
for something to be changed.
On MariaDB Server 10.4 and another implementation of the ARMv8 ISA,
omitting the dummy compare-and-swap improved performance by up to 12%.
So, let us avoid the dummy compare-and-swap on ARM.
For now, we are retaining the dummy compare-and-swap on other ISAs
(such as SPARC, MIPS, S390x, RISC-V) because we do not have any
performance data for them.
Raspberry Pi 4 supports crc32 but doesn't support pmull (MDEV-23030).
The PR #1645 offers a solution to fix this issue. But it does not consider
the condition that the target platform does support crc32 but not support PMULL.
In this condition, it should leverage the Arm64 crc32 instruction (__crc32c) and
just only skip parallel computation (pmull/vmull) rather than skip all hardware
crc32 instruction of computation.
The PR also removes unnecessary CRC32_ZERO branch in 'crc32c_aarch64' for MariaDB,
formats the indent and coding style.
Change-Id: I76371a6bd767b4985600e8cca10983d71b7e9459
Signed-off-by: Yuqi Gu <yuqi.gu@arm.com>
In 10.3, DBUG_ASSERT() may expand to something that includes
__builtin_expect(), which expects integer arguments, not pointers.
To avoid any compiler warnings, let us use an explicit rather than
implicit comparison to the null pointer.
MariaDB adopted a hardware optimized crc32c approach on ARM64 starting 10.5.
Said implementation of crc32c needs support from target hardware for crc32
and pmull instructions. Existing logic is checking only for crc32 support
from target hardware through a runtime check and so if target hardware
doesn't support pmull it would cause things to fail/crash.
Expanded runtime check to ensure pmull support is also checked on the target
hardware along with existing crc32.
Thanks to Marko and Daniel for review.
Due to restricted size of the threadpool, execution of client queries can
be delayed (queued) for a while. This delay was interpreted as client
inactivity, and connection is closed, if client idle time + queue time
exceeds wait_timeout.
But users did not expect queue time to be included into wait_timeout.
This patch changes the behavior. We don't close connection anymore,
if there is some unread data present on connection,
even if wait_timeout is exceeded. Unread data means that client
was not idle, it sent a query, which we did not have time to process yet.
aarch64 timer is available to userspace via arch register.
clang's __builtin_readcyclecounter is wrong for aarch64 (reads the PMU
cycle counter instead of the archi-timer register), so we don't use it.
my_rdtsc unit-test on AWS m6g shows:
frequency: 121830845
resolution: 1
overhead: 1
This counter is not strictly increasing, but it is non-decreasing.
This patch ensures that all identical character sets shares the same
cs->csname.
This allows us to replace strcmp() in my_charset_same() with comparisons
of pointers. This fixes a long standing performance issue that could cause
as strcmp() for every item sent trough the protocol class to the end user.
One consequence of this patch is that we don't allow one to add a character
definition in the Index.xml file that changes the csname of an existing
character set. This is by design as changing character set names of existing
ones is extremely dangerous, especially as some storage engines just records
character set numbers.
As we now have a hash over character set's csname, we can in the future
use that for faster access to a specific character set. This could be done
by changing the hash to non unique and use the hash to find the next
character set with same csname.
When high priority replication slave applier encounters lock conflict in innodb,
it will force the conflicting lock holder transaction (victim) to rollback.
This is a must in multi-master sychronous replication model to avoid cluster lock-up.
This high priority victim abort (aka "brute force" (BF) abort), is started
from innodb lock manager while holding the victim's transaction's (trx) mutex.
Depending on the execution state of the victim transaction, it may happen that the
BF abort will call for THD::awake() to wake up the victim transaction for the rollback.
Now, if BF abort requires THD::awake() to be called, then the applier thread executed
locking protocol of: victim trx mutex -> victim THD::LOCK_thd_data
If, at the same time another DBMS super user issues KILL command to abort the same victim,
it will execute locking protocol of: victim THD::LOCK_thd_data -> victim trx mutex.
These two locking protocol acquire mutexes in opposite order, hence unresolvable mutex locking
deadlock may occur.
The fix in this commit adds THD::wsrep_aborter flag to synchronize who can kill the victim
This flag is set both when BF is called for from innodb and by KILL command.
Either path of victim killing will bail out if victim's wsrep_killed is already
set to avoid mutex conflicts with the other aborter execution. THD::wsrep_aborter
records the aborter THD's ID. This is needed to preserve the right to kill
the victim from different locations for the same aborter thread.
It is also good error logging, to see who is reponsible for the abort.
A new test case was added in galera.galera_bf_kill_debug.test for scenario where
wsrep applier thread and manual KILL command try to kill same idle victim
In fsp_path_to_space_name(), we would access a byte right before
the start of the string, tripping AddressSanitizer.
This reverts commit d87006a1c1
and commit a7634281aa.
This version is not optimized yet. It could have bugs because I didn't
check it with unit tests. Also, std::char_traits are not really supported.
So, now it's not possible to create f.ex. a case insensitive string_view.
fil_path_to_space_name(): renamed, moved to another file
and refactored to use string_view
accept might return an error, including SOCKET_EAGAIN/
SOCKET_EINTR. The caller, usually handle_connections_sockets
can these however and invalid file descriptor isn't something
to call fcntl on.
Thanks to Etienne Guesnet (ATOS) for diagnosis,
sample patch description and testing.
In AddressSanitizer, we only want memory poisoning to happen
in connection with custom memory allocation or freeing.
The primary use of MEM_UNDEFINED is for declaring memory uninitialized
in Valgrind or MemorySanitizer. We do not want MEM_UNDEFINED to
have the unwanted side effect that AddressSanitizer would no longer
be able to complain about accessing unallocated memory.
MEM_UNDEFINED(): Define as no-op for AddressSanitizer.
MEM_MAKE_ADDRESSABLE(): Define as MEM_UNDEFINED() or
ASAN_UNPOISON_MEMORY_REGION().
MEM_CHECK_ADDRESSABLE(): Wrap also __asan_region_is_poisoned().
- Some of the bug fixes are backports from 10.5!
- The fix in innobase/fil/fil0fil.cc is just a backport to get less
error messages in mysqld.1.err when running with valgrind.
- Renamed HAVE_valgrind_or_MSAN to HAVE_valgrind
MemorySanitizer (clang -fsanitize=memory) requires that all code
be compiled with instrumentation enabled. The only exception is the
C runtime library. Failure to use instrumented libraries will cause
bogus messages about memory being uninitialized.
In WITH_MSAN builds, we must avoid calling getservbyname(),
because even though it is a standard library function, it is
not instrumented, not even in clang 10.
Note: Before MariaDB Server 10.5, ./mtr will typically fail
due to the old PCRE library, which was updated in MDEV-14024.
The following cmake options were tested on 10.5
in commit 94d0bb4dbe:
cmake \
-DCMAKE_C_FLAGS='-march=native -O2' \
-DCMAKE_CXX_FLAGS='-stdlib=libc++ -march=native -O2' \
-DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=Debug \
-DWITH_INNODB_{BZIP2,LZ4,LZMA,LZO,SNAPPY}=OFF \
-DPLUGIN_{ARCHIVE,TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT,SPIDER}=NO \
-DWITH_SAFEMALLOC=OFF \
-DWITH_{ZLIB,SSL,PCRE}=bundled \
-DHAVE_LIBAIO_H=0 \
-DWITH_MSAN=ON
MEM_MAKE_DEFINED(): An alias for VALGRIND_MAKE_MEM_DEFINED()
and __msan_unpoison().
MEM_GET_VBITS(), MEM_SET_VBITS(): Aliases for
VALGRIND_GET_VBITS(), VALGRIND_SET_VBITS(), __msan_copy_shadow().
InnoDB: Replace the UNIV_MEM_ macros with corresponding MEM_ macros.
ut_crc32_8_hw(), ut_crc32_64_low_hw(): Use the compiler built-in
functions instead of inline assembler when building WITH_MSAN.
This will require at least -msse4.2 when building for IA-32 or AMD64.
The inline assembler would not be instrumented, and would thus cause
bogus failures.
When high priority replication slave applier encounters lock conflict in innodb,
it will force the conflicting lock holder transaction (victim) to rollback.
This is a must in multi-master sychronous replication model to avoid cluster lock-up.
This high priority victim abort (aka "brute force" (BF) abort), is started
from innodb lock manager while holding the victim's transaction's (trx) mutex.
Depending on the execution state of the victim transaction, it may happen that the
BF abort will call for THD::awake() to wake up the victim transaction for the rollback.
Now, if BF abort requires THD::awake() to be called, then the applier thread executed
locking protocol of: victim trx mutex -> victim THD::LOCK_thd_data
If, at the same time another DBMS super user issues KILL command to abort the same victim,
it will execute locking protocol of: victim THD::LOCK_thd_data -> victim trx mutex.
These two locking protocol acquire mutexes in opposite order, hence unresolvable mutex locking
deadlock may occur.
The fix in this commit adds THD::wsrep_aborter flag to synchronize who can kill the victim
This flag is set both when BF is called for from innodb and by KILL command.
Either path of victim killing will bail out if victim's wsrep_killed is already
set to avoid mutex conflicts with the other aborter execution. THD::wsrep_aborter
records the aborter THD's ID. This is needed to preserve the right to kill
the victim from different locations for the same aborter thread.
It is also good error logging, to see who is reponsible for the abort.
A new test case was added in galera.galera_bf_kill_debug.test for scenario where
wsrep applier thread and manual KILL command try to kill same idle victim
The idea was borrowed from http://wg21.link/p0052
scope_exit class is a helper, its name is hidden from user in
the namespace detail.
Alternative implementation of scope_exit with std::function
looks slower on goldbolt.org as it may require allocation, etc.
scope_exit doesn't need to own a callable, so beeing a pointer
is enough. And std::decay produces such a pointer from callable.
include/maria.h is a common header included in half of the server,
if should only contain definitions and declarations that are
used outside of storage/maria
internal definitions and declarations should be in maria_def.h
also remove few duplicate declarations
MDEV-21298: mariabackup doesn't read from the [mariadbd] and [mariadbd-X.Y]
server option groups from configuration files
MDEV-21301: mariabackup doesn't read [mariadb-backup] option group in
configuration file
All three issues require to change the same code, that is why their
fixes are joined in one commit.
The fix is in invoking load_defaults_or_exit() and handle_options() for
backup-specific groups separately from client-server groups to let the last
handle_options() call fail on unknown backup-specific options.
The order of options procesing is the following:
1) Load server groups and process server options, ignore unknown
options
2) Load client groups and process client options, ignore unknown
options
3) Load backup groups and process client-server options, exit on
unknown option
4) Process --mysqld-args command line options, ignore unknown options
New global flag my_handle_options_init_variables was added to have
ability to invoke handle_options() for the same allowed options set
several times without re-initialising previously set option values.
--password value destroying is moved from option processing callback to
mariabackup's handle_options() function to have ability to invoke server's
handle_options() several times for the same possible allowed options
set.
Galera invokes wsrep_sst_mariabackup.sh with mysqld command line
options to configure mariabackup as close to the server as possible.
It is not known what server options are supported by mariabackup when the
script is invoked. That is why new mariabackup option "--mysqld-args" is added,
all unknown options that follow this option will be silently ignored.
wsrep_sst_mariabackup.sh was also changed to:
- use "--mysqld-args" mariabackup option to pass mysqld options,
- remove deprecated innobackupex mode,
- remove unsupported mariabackup options:
--encrypt
--encrypt-key
--rebuild-indexes
--rebuild-threads
Problem was that FLUSH TABLES where trying to read latest sequence state
which conflicted with a running ALTER SEQUENCE. Removed the reading
of the state, when opening a table for FLUSH, as it's not needed in this
case.
Other thing:
- Fixed a potential issue with concurrently running ALTER SEQUENCE where
the later ALTER could potentially read old data
MCOL-3875 Columnstore write cache
The main change is to change thr_lock function get_status to
return a value that indicates we have to abort the lock.
Other thing:
- Made start_bulk_insert() and end_bulk_insert() protected so that the
insert cache can use these
The used code is largely based on code from Tencent
The problem is that in some rare cases there may be a conflict between .frm
files and the files in the storage engine. In this case the DROP TABLE
was not able to properly drop the table.
Some MariaDB/MySQL forks has solved this by adding a FORCE option to
DROP TABLE. After some discussion among MariaDB developers, we concluded
that users expects that DROP TABLE should always work, even if the
table would not be consistent. There should not be a need to use a
separate keyword to ensure that the table is really deleted.
The used solution is:
- If a .frm table doesn't exists, try dropping the table from all storage
engines.
- If the .frm table exists but the table does not exist in the engine
try dropping the table from all storage engines.
- Update storage engines using many table files (.CVS, MyISAM, Aria) to
succeed with the drop even if some of the files are missing.
- Add HTON_AUTOMATIC_DELETE_TABLE to handlerton's where delete_table()
is not needed and always succeed. This is used by ha_delete_table_force()
to know which handlers to ignore when trying to drop a table without
a .frm file.
The disadvantage of this solution is that a DROP TABLE on a non existing
table will be a bit slower as we have to ask all active storage engines
if they know anything about the table.
Other things:
- Added a new flag MY_IGNORE_ENOENT to my_delete() to not give an error
if the file doesn't exist. This simplifies some of the code.
- Don't clear thd->error in ha_delete_table() if there was an active
error. This is a bug fix.
- handler::delete_table() will not abort if first file doesn't exists.
This is bug fix to handle the case when a drop table was aborted in
the middle.
- Cleaned up mysql_rm_table_no_locks() to ensure that if_exists uses
same code path as when it's not used.
- Use non_existing_Table_error() to detect if table didn't exists.
Old code used different errors tests in different position.
- Table_triggers_list::drop_all_triggers() now drops trigger file if
it can't be parsed instead of leaving it hanging around (bug fix)
- InnoDB doesn't anymore print error about .frm file out of sync with
InnoDB directory if .frm file does not exists. This change was required
to be able to try to drop an InnoDB file when .frm doesn't exists.
- Fixed bug in mi_delete_table() where the .MYD file would not be dropped
if the .MYI file didn't exists.
- Fixed memory leak in Mroonga when deleting non existing table
- Fixed memory leak in Connect when deleting non existing table
Bugs fixed introduced by the original version of this commit:
MDEV-22826 Presence of Spider prevents tables from being force-deleted from
other engines
Compiler tells something about argument-dependent lookup. I do not
understand how that ADL works. But I know that such operators should
be free functions, instead of methods:
http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Ro-symmetric
Such syntax defines 'friend' free functions.
* FreeBSD calls amd64 what Linux calls x86_64
* signal returns void (*)(int)
* struct pam_message has char*, not const char*
* krb5_free_unparsed_name exists, but is deprecated
Introduce a new ATTRIBUTE_NOINLINE to
ib::logger member functions, and add UNIV_UNLIKELY hints to callers.
Also, remove some crash reporting output. If needed, the
information will be available using debugging tools.
Furthermore, remove some fts_enable_diag_print output that included
indexed words in raw form. The code seemed to assume that words are
NUL-terminated byte strings. It is not clear whether a NUL terminator
is always guaranteed to be present. Also, UCS2 or UTF-16 strings would
typically contain many NUL bytes.
Existing implementation used my_checksum (from mysys)
for calculating table checksum and binlog checksum.
This implementation was optimized for powerpc only and lacked
SIMD implementation for x86 (using clmul) and ARM
(using ACLE) instead used zlib-crc32.
mariabackup had its own copy of the crc32 implementation
using hardware optimized implementation only for x86 and lagged
hardware based implementation for powerpc and ARM.
Patch helps unifies all such calls and help aggregate all of them
using an unified interface my_checksum().
Said unification also enables hardware optimized calls for all
architecture viz. x86, ARM, POWERPC.
Default always fallback to zlib crc32.
Thanks to Daniel Black for reviewing, fixing and testing
PowerPC changes. Thanks to Marko and Daniel for early code feedback.
namespace intrusive: removed
split class into two: ilist<T> and sized_ilist<T> which has a size field.
ilist<T> no more NULLify pointers to bring a slignly better performance.
As a consequence, fil_space_t::is_in_unflushed_spaces and
fil_space_t::is_in_rotation_list boolean members are needed now.
When my_vsnprintf() is patched, the code protected disabled with
'WAITING_FOR_BUGFIX_TO_VSPRINTF' should be enabled again. Also all %b
formats in this patch should be revert to %s again
In the merge 9e6e43551f
we made Atomic_counter a more generic wrapper of std::atomic
so that dict_index_t would support the implicit assignment operator.
It is better to revert the changes to Atomic_counter and
instead introduce Atomic_relaxed as a generic wrapper to std::atomic.
Unlike Atomic_counter, we will not define operator++, operator+=
or similar, because we want to make the operations more explicit
in the users of Atomic_wrapper, because unlike loads and stores,
atomic read-modify-write operations always incur some overhead.
When neither MSAN nor Valgrind are enabled, declare
Field::mark_unused_memory_as_defined() as an empty inline function,
instead of declaring it as a virtual function.
MDEV-22073 MSAN use-of-uninitialized-value in
collect_statistics_for_table()
Other things:
innodb.analyze_table was changed to mainly test statistic
collection. Was discussed with Marko.
The code did not take into account that:
- U+005C (backslash) can occupy more than mbminlen characters (e.g. in sjis)
- Some character sets do not have a code for U+005C (e.g. swe7)
Adding a new function my_wc_to_printable into MY_CHARSET_HANDLER to
cover all special cases easier.
Fix WolfSSL build:
- Do not build with TLSv1.0,it stopped working,at least with SChannel client
- Disable a test that depends on TLSv1.0
- define FP_MAX_BITS always, to fix 32bit builds.
- Increase MAX_AES_CTX_SIZE, to fix build on Linux
Rowid Filter check is just like Index Condition Pushdown check: before
we check the filter, we must check if we have walked out of the range
we are scanning. (If we did, we should return, and not continue the scan).
Consequences of this:
- Rowid filtering doesn't work for keys that have partially-covered
blob columns (just like Index Condition Pushdown)
- The rowid filter function has three return values: CHECK_POS (passed)
CHECK_NEG (filtered out), CHECK_OUT_OF_RANGE.
All of the above is implemented in this patch
queues.c cleanup and refactoring.
Restore old version of _downhead() (from before cd483c5520)
that works well in an average case. Use it for queue_fix().
Move existing specialized version of _downhead() to queue_replace()
where it'll be handling the case it was specifically optimized for
(moving the element to the end of the queue).
And correct it to fix the heap not only down, but also up
(this fixes BUG#30301356).
Add unit tests.
Collateral cosmetic fixes.
This is a way do disable DBUG_ENTER()/DBUG_EXIT() stuff which is
needed to dbug trace. Those who doesn't need it may avoid tests
slowdown with -DWITH_DBUG_TRACE=OFF
dbug/tests.c: add define which is neede always in this test
innodb.log_file_name_debug.test: do not depend on DBUG trace stuff
in test
Benchmark results: each test eats less CPU and you can have more
parallel jobs in MTR.
patched:
./mtr -mem -par=8 -suite=innodb 185.34s user 86.85s system 133% cpu 3:23.27 total
./mtr -mem -par=8 -suite=main 80.96s user 36.01s system 182% cpu 1:04.07 total
main.select [ pass ] 1660
main.select [ pass ] 1513
main.select [ pass ] 1543
main.select [ pass ] 1660
main.select [ pass ] 1521
main.select [ pass ] 1511
main.select [ pass ] 1508
main.select [ pass ] 1520
main.select [ pass ] 1514
main.select [ pass ] 1522
vanilla:
./mtr -mem -par=8 -suite=innodb 203.61s user 92.16s system 140% cpu 3:30.16 total
./mtr -mem -par=8 -suite=main 94.11s user 35.51s system 206% cpu 1:02.69 total
main.select [ pass ] 2032
main.select [ pass ] 2017
main.select [ pass ] 2040
main.select [ pass ] 2183
main.select [ pass ] 2253
main.select [ pass ] 2075
main.select [ pass ] 2109
main.select [ pass ] 2080
main.select [ pass ] 2098
main.select [ pass ] 2114
KEY_MULTI_RANGE::range_flag does not have correct flag bits for
per-endpoint flags (NEAR_MIN, NEAR_MAX, NO_MIN_RANGE, NO_MAX_RANGE).
It only has bits for flags that describe both endpoints.
So
- Document this.
- Switch optimizer trace to using {start|end}_key.flag values, instead.
This fixes the bug.
- Switch records_in_column_ranges() to doing that too. (This used to
work, because KEY_MULTI_RANGE::range_flag had correct flag value
for the last key component, and EITS only uses one-component
pseudo-indexes)
sig_return: Solaris/OSX returns different function ptr
Move defination to my_alarm.h as its the only use.
prevents compile warnings (copied from 10.3 branch)
mysys/my_sync.c:136:19: error: 'cur_dir_name' defined but not used [-Werror=unused-const-variable=]
136 | static const char cur_dir_name[]= {FN_CURLIB, 0};
| ^~~~~~~~~~~~
fix compile error (DEPRECATED) leaked from ssl headers.
In file included from /export/home/dan/mariadb-server-10.4/sql/sys_vars.cc:37:
/export/home/dan/mariadb-server-10.4/sql/sys_vars.ic:69: error: "DEPRECATED" redefined [-Werror]
69 | #define DEPRECATED(X) X
|
In file included from /export/home/dan/mariadb-server-10.4/include/violite.h:150,
from /export/home/dan/mariadb-server-10.4/sql/sql_class.h:38,
from /export/home/dan/mariadb-server-10.4/sql/sys_vars.cc:36:
/usr/include/openssl/ssl.h:2356: note: this is the location of the previous definition
2356 | # define DEPRECATED __attribute__((deprecated))
|
Avoid Werror condition on non-Linux:
plugin/server_audit/server_audit.c:2267:7: error: variable 'db_len_off' set but not used [-Werror=unused-but-set-variable]
2267 | int db_len_off;
| ^~~~~~~~~~
plugin/server_audit/server_audit.c:2266:7: error: variable 'db_off' set but not used [-Werror=unused-but-set-variable]
2266 | int db_off;
| ^~~~~~
auth_gssapi fix include path for Solaris
Consistent with the upstream packaged patch:
https://github.com/OpenIndiana/oi-userland/blob/oi/hipster/components/database/mariadb-103/patches/06-gssapi.h.patch
compile warnings on Solaris
[ 91%] Building C object plugin/server_audit/CMakeFiles/server_audit.dir/server_audit.c.o
/plugin/server_audit/server_audit.c: In function 'auditing_v8':
/plugin/server_audit/server_audit.c:2194:20: error: unused variable 'db_len_off' [-Werror=unused-variable]
2194 | static const int db_len_off= 128;
| ^~~~~~~~~~
/plugin/server_audit/server_audit.c:2193:20: error: unused variable 'db_off' [-Werror=unused-variable]
2193 | static const int db_off= 120;
| ^~~~~~
/plugin/server_audit/server_audit.c:2192:20: error: unused variable 'cmd_off' [-Werror=unused-variable]
2192 | static const int cmd_off= 4432;
| ^~~~~~~
At top level:
/plugin/server_audit/server_audit.c:2192:20: error: 'cmd_off' defined but not used [-Werror=unused-const-variable=]
/plugin/server_audit/server_audit.c:2193:20: error: 'db_off' defined but not used [-Werror=unused-const-variable=]
2193 | static const int db_off= 120;
| ^~~~~~
/plugin/server_audit/server_audit.c:2194:20: error: 'db_len_off' defined but not used [-Werror=unused-const-variable=]
2194 | static const int db_len_off= 128;
| ^~~~~~~~~~
cc1: all warnings being treated as errors
tested on:
$ uname -a
SunOS openindiana 5.11 illumos-b97b1727bc i86pc i386 i86pc
read TLS with my_thread_var
write TLS with set_mysys_var()
my_thread_var is no longer __attribute__ ((const)): this attribute
is simply incorrect here. Read gcc manual for more information.
sql/threadpool_generic.cc fails with that attribute.
The reason why we have wsrep_on() at all is that the macro WSREP(thd)
depends on the definition of THD, and that is intentionally an opaque
data type for InnoDB. So, we cannot avoid invoking wsrep_on(), but
we can evaluate the less expensive conditions thd && WSREP_ON before
calling the function.
Global_read_lock: Use WSREP_NNULL(thd) instead of wsrep_on(thd)
because we not only know the definition of THD but also that
the pointer is not null.
wsrep_open(): Use WSREP(thd) instead of wsrep_on(thd).
InnoDB: Replace thd && wsrep_on(thd) with wsrep_on(thd), now that
the condition has been merged to the definition of the macro
wsrep_on().
If the server is compiled WITH_WSREP=OFF, we should avoid evaluating
conditions on a global variable that is constant.
WSREP_ON_: Renamed from WSREP_ON. Defined only WITH_WSREP=ON.
WSREP_ON: Defined as unlikely(WSREP_ON_).
wsrep_on(): Defined as WSREP_ON && wsrep_service->wsrep_on_func().
The reason why we have wsrep_on() at all is that the macro WSREP(thd)
depends on the definition of THD, and that is intentionally an opaque
data type for InnoDB. So, we cannot avoid invoking wsrep_on(), but
we can evaluate the less expensive condition WSREP_ON before calling
the function.
Stop linking plugins to the server executable on Windows.
Instead, extract whole server functionality into a large DLL, called
server.dll. Link both plugins, and small server "stub" exe to it.
This eliminates plugin dependency on the name of the server executable.
It also reduces the size of the packages (since tiny mysqld.exe
and mariadbd.exe are now both linked to one big DLL)
Also, simplify the functionality of exporing all symbols from selected
static libraries. Rely on WINDOWS_EXPORT_ALL_SYMBOLS, rather than old
self-backed solution.
fix compile error
replace GetProcAddress(GetModuleHandle(NULL), "variable_name")
for server exported data with actual variable names.
Runtime loading was never required,was error prone
, since symbols could be missing at runtime, and now it actually failed,
because we do not export symbols from executable anymore, but from a shared
library
This did require a MYSQL_PLUGIN_IMPORT decoration for the plugin,
but made the code more straightforward, and avoids missing symbols at
runtime (as mentioned before).
The audit plugin is still doing some dynamic loading, as it aims to work
cross-version. Now it won't work cross-version on Windows, as it already
uses some symbols that are *not* dynamically loaded, e.g fn_format
and those symbols now exported from server.dll , when earlier they were
exported by mysqld.exe
Windows, fixes for storage engine plugin loading
after various rebranding stuff
Create server.dll containing functionality of the whole server
make mariadbd.exe/mysqld.exe a stub that is only calling mysqld_main()
fix build
For ENGINE=Aria, we work around bugs in various tests that catch
writes of uninitialized bytes from the Aria page cache.
(Why do we even write anything on DROP TABLE?)
MemorySanitizer (clang -fsanitize=memory) requires that all code
be compiled with instrumentation enabled. The C runtime library
is an exception. Failure to use instrumented libraries will cause
bogus messages about memory being uninitialized.
In WITH_MSAN builds, we must avoid calling getservbyname(),
because even though it is a standard library function, it is
not instrumented, not even in clang 10.
The following cmake options were tested:
-DCMAKE_C_FLAGS='-march=native -O2'
-DCMAKE_CXX_FLAGS='-stdlib=libc++ -march=native -O2'
-DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=Debug
-DWITH_INNODB_{BZIP2,LZ4,LZMA,LZO,SNAPPY}=OFF
-DPLUGIN_{ARCHIVE,TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT,SPIDER}=NO
-DWITH_SAFEMALLOC=OFF
-DWITH_{ZLIB,SSL,PCRE}=bundled
-DHAVE_LIBAIO_H=0
-DWITH_MSAN=ON
MEM_MAKE_DEFINED(): An alias for VALGRIND_MAKE_MEM_DEFINED()
and in the future, __msan_unpoison().
For now, neither MEM_MAKE_DEFINED() nor MEM_UNDEFINED()
perform any action under MSAN. Enabling them will catch more bugs, but
will also require some more fixes or work-arounds.
Json_writer::add_double(): Work around a frequently occurring
failure in optimizer tests, related to EXPLAIN FORMAT=JSON.
dtoa(): Disable MSAN altogether. For some reason, this function
is triggering a lot of trouble, especially when invoked for
DBUG functions. The MDL default timeout is dd=86400 seconds,
and for some reason it is claimed to be uninitialized.
InnoDB: Define UNIV_DEBUG_VALGRIND also WITH_MSAN.
ut_crc32_8_hw(), ut_crc32_64_low_hw(): Use the compiler built-in
functions instead of inline assembler when building WITH_MSAN.
This will require at least -msse4.2 when building for IA-32 or AMD64.
The inline assembler would not be instrumented, and would thus cause
bogus failures.
- multi_range_read_info_const now uses the new records_in_range interface
- Added handler::avg_io_cost()
- Don't calculate avg_io_cost() in get_sweep_read_cost if avg_io_cost is
not 1.0. In this case we trust the avg_io_cost() from the handler.
- Changed test_quick_select to use TIME_FOR_COMPARE instead of
TIME_FOR_COMPARE_IDX to align this with the rest of the code.
- Fixed bug when using test_if_cheaper_ordering where we didn't use
keyread if index was changed
- Fixed a bug where we didn't use index only read when using order-by-index
- Added keyread_time() to HEAP.
The default keyread_time() was optimized for blocks and not suitable for
HEAP. The effect was the HEAP prefered table scans over ranges for btree
indexes.
- Fixed get_sweep_read_cost() for HEAP tables
- Ensure that range and ref have same cost for simple ranges
Added a small cost (MULTI_RANGE_READ_SETUP_COST) to ranges to ensure
we favior ref for range for simple queries.
- Fixed that matching_candidates_in_table() uses same number of records
as the rest of the optimizer
- Added avg_io_cost() to JT_EQ_REF cost. This helps calculate the cost for
HEAP and temporary tables better. A few tests changed because of this.
- heap::read_time() and heap::keyread_time() adjusted to not add +1.
This was to ensure that handler::keyread_time() doesn't give
higher cost for heap tables than for normal tables. One effect of
this is that heap and derived tables stored in heap will prefer
key access as this is now regarded as cheap.
- Changed cost for index read in sql_select.cc to match
multi_range_read_info_const(). All index cost calculation is now
done trough one function.
- 'ref' will now use quick_cost for keys if it exists. This is done
so that for '=' ranges, 'ref' is prefered over 'range'.
- scan_time() now takes avg_io_costs() into account
- get_delayed_table_estimates() uses block_size and avg_io_cost()
- Removed default argument to test_if_order_by_key(); simplifies code
Prototype change:
- virtual ha_rows records_in_range(uint inx, key_range *min_key,
- key_range *max_key)
+ virtual ha_rows records_in_range(uint inx, const key_range *min_key,
+ const key_range *max_key,
+ page_range *res)
The handler can ignore the page_range parameter. In the case the handler
updates the parameter, the optimizer can deduce the following:
- If previous range's last key is on the same block as next range's first
key
- If the current key range is in one block
- We can also assume that the first and last block read are cached!
This can be used for a better calculation of IO seeks when we
estimate the cost of a range index scan.
The parameter is fully implemented for MyISAM, Aria and InnoDB.
A separate patch will update handler::multi_range_read_info_const() to
take the benefits of this change and also remove the double
records_in_range() calls that are not anymore needed.
The code that commit bb24fa31fa
moved to a separate file assume_aligned.h was introduced in
commit 25e2a556de and developed
by employees of MariaDB Corporation.
move span.h to a proper place to make it available for the whole server
Reformat it.
Constuctors from a contigous container are fixed
to use cont.data() instead of cont.begin()
span<>::index_type is replaced with span<>::size_type
Several macros such as sint2korr() and uint4korr() are using the
arithmetic + operator while a bitwise or operator would suffice.
GCC 5 and clang 5 and later can detect patterns consisting of
bitwise or and shifts by multiples of 8 bits, such as those used
in the InnoDB function mach_read_from_4(). They actually translate
that verbose low-level code into high-level machine language
(i486 bswap instruction or fused into the Haswell movbe instruction).
We should do the same for MariaDB Server code that is outside InnoDB.
Note: The Microsoft C compiler is lacking this optimization.
There, we might consider using _byteswap_ushort(), _byteswap_ulong(),
_byteswap_uint64(). But, those would lead to unaligned reads, which are
bad for reasons stated in MDEV-20277. Besides, outside InnoDB,
most data is already being stored in the native little-endian format
of that compiler.
Some .c and .cc files are compiled as part of Mariabackup.
Enabling -Wconversion for InnoDB would also enable it for
Mariabackup. The .h files are being included during
InnoDB or Mariabackup compilation.
Notably, GCC 5 (but not GCC 4 or 6 or later versions)
would report -Wconversion for x|=y when the type is
unsigned char. So, we will either write x=(uchar)(x|y)
or disable the -Wconversion warning for GCC 5.
bitmap_set_bit(), bitmap_flip_bit(), bitmap_clear_bit(), bitmap_is_set():
Always implement as inline functions.
Introduced a new wsrep_strict_ddl configuration variable in which
Galera checks storage engine of the effected table. If table is not
InnoDB (only storage engine currently fully supporting Galera
replication) DDL-statement will return error code:
ER_GALERA_REPLICATION_NOT_SUPPORTED
eng "DDL-statement is forbidden as table storage engine does not support Galera replication"
However, when wsrep_replicate_myisam=ON we allow DDL-statements to
MyISAM tables. If effected table is allowed storage engine Galera
will run normal TOI.
This new setting should be for now set globally on all
nodes in a cluster. When this setting is set following DDL-clauses
accessing tables not supporting Galera replication are refused:
* CREATE TABLE (e.g. CREATE TABLE t1(a int) engine=Aria
* ALTER TABLE
* TRUNCATE TABLE
* CREATE VIEW
* CREATE TRIGGER
* CREATE INDEX
* DROP INDEX
* RENAME TABLE
* DROP TABLE
Statements on PROCEDURE, EVENT, FUNCTION are allowed as effected
tables are known only at execution. Furthermore, USER, ROLE, SERVER,
DATABASE statements are also allowed as they do not really have
effected table.
Also added support for MAP_SYNC. It allows to achieve decent performance
with DAX devices even when libpmem is unavailable.
Fixed Windows version of my_msync(): according to manual FlushViewOfFile()
may return before flush is actually completed. It is advised to issue
FlushFileBuffers() after FlushViewOfFile().
Problem:-
So the issue is when we do bulk insert with rows
> MI_MIN_ROWS_TO_DISABLE_INDEXES(100) , We try to disable the indexes to
speedup insert. But current logic also disables the long unique indexes.
Solution:- In ha_myisam::start_bulk_insert if we find long hash index
(HA_KEY_ALG_LONG_HASH) we will not disable the index.
This commit also refactors the mi_disable_indexes_for_rebuild function,
Since this is function is called at only one place, it is inlined into
start_bulk_insert
mi_clear_key_active is added into myisamdef.h because now it is also used
in ha_myisam.cc file.
(Same is done for Aria Storage engine)
I found that memcpy_aligned was used incorrectly at redo log and decided to put
assertions in aligned functions. And found even more incorrect cases.
Given the amount discovered of bugs, I left assertions to prevent future bugs.
my_assume_aligned(): instead of MY_ASSUME_ALIGNED macro
Features:
* STL-like interface
* Fast modification: no branches on insertion or deletion
* Fast iteration: one pointer dereference and one pointer comparison
* Your class can be a part of several lists
Modeled after std::list<T> but currently has fewer methods (not complete yet)
For even more performance it's possible to customize list with templates so
it won't have size counter variable or won't NULLify unlinked node.
How existing lists differ?
No existing lists support STL-like interface.
I_List:
* slower iteration (one more branch on iteration)
* element can't be a part of two lists simultaneously
I_P_List:
* slower modification (branches, except for the fastest push_back() case)
* slower iteration (one more branch on iteration)
UT_LIST_BASE_NODE_T:
* slower modification (branches)
Three UT_LISTs were replaced: two in fil_system_t and one in dyn_buf_t.
This is joint work with Thirunarayanan Balathandayuthapani.
The MDL interface between InnoDB and the rest of the server
(in storage/innobase/dict/dict0dict.cc and in include/)
is my work, while most everything else is Thiru's.
The collection of InnoDB persistent statistics and the
defragmentation were not refactored to use MDL. They will
keep relying on lower-level interlocking with
fil_check_pending_operations().
The purge of transaction history and the background operations on
fulltext indexes will use MDL. We will revert
commit 2c4844c9e7
(MDEV-17813) because thanks to MDL, purge cannot conflict
with DDL operations anymore. For a similar reason, we will remove
the MDEV-16222 test case from gcol.innodb_virtual_debug_purge.
Purge is essentially replacing all use of the global dict_sys.latch
with MDL. Purge will skip the undo log records for tables whose names
start with #sql-ib or #sql2. Theoretically, such tables might
be renamed back to visible table names if TRUNCATE fails to
create a new table, or the final rename in ALTER TABLE...ALGORITHM=COPY
fails. In that case, purge could permanently leave some garbage
in the table. Such garbage will be tolerated; the table would not
be considered corrupted.
To avoid repeated MDL releases and acquisitions,
trx_purge_attach_undo_recs() will sort undo log records by table_id,
and purge_node_t will keep the MDL and table handle open for multiple
successive undo log records.
get_purge_table(): A new accessor, used during the purge of
history for indexed virtual columns. This interface should ideally
not exist at all.
thd_mdl_context(): Accessor of THD::mdl_context.
Wrapped in a new thd_mdl_service.
dict_get_db_name_len(): Define inline.
dict_acquire_mdl_shared(): Acquire explicit shared MDL on a table name
if needed.
dict_table_open_on_id(): Return MDL_ticket, if requested.
dict_table_close(): Release MDL ticket, if requested.
dict_fts_index_syncing(), dict_index_t::index_fts_syncing: Remove.
row_drop_table_for_mysql() no longer needs to check these, because
MDL guarantees that a fulltext index sync will not be in progress
while MDL_EXCLUSIVE is protecting a DDL operation.
dict_table_t::parse_name(): Parse the table name for acquiring MDL.
purge_node_t::undo_recs: Change the type to std::list<trx_purge_rec_t*>
(different container, and storing also roll_ptr).
purge_node_t: Add mdl_ticket, last_table_id, purge_thd, mdl_hold_recs
for acquiring MDL and for keeping the table open across multiple
undo log records.
purge_vcol_info_t, row_purge_store_vsec_cur(), row_purge_restore_vsec_cur():
Remove. We will acquire the MDL earlier.
purge_sys_t::heap: Added, for reading undo log records.
fts_sync_during_ddl(): Invoked during ALGORITHM=INPLACE operations
to ensure that fts_sync_table() will not conflict with MDL_EXCLUSIVE.
Uses fts_t::sync_message for bookkeeping.
This patch contains two fixes:
* wsrep_handle_mdl_conflict(): handle the case where SR transaction
is in aborting state. Previously, a BF-BF conflict was reported, and
the process would abort.
* wsrep_thd_bf_abort(): do not restore thread vars after calling
wsrep_bf_abort(). Thread vars are already restored in wsrep-lib if
necessary. This also removes the assumption that the caller of
wsrep_thd_bf_abort() is the given bf_thd, which is not the case.
Also in this patch:
* Remove unnecessary check for active victim transaction in
wsrep_thd_bf_abort(): the exact same check is performed later in
wsrep_bf_abort().
* Make wsrep_thd_bf_abort() and wsrep_log_thd() const-correct.
* Change signature of wsrep_abort_thd() to take THD pointers instead
of void pointers.
Use my_thread_var::stack_ends_here inside lf_pinbox_real_free() for address
where thread stack ends.
Remove LF_PINS::stack_ends_here.
It is not safe to assume that mysys_var that was used during pin allocation,
remains correct during free. E.g with binlog group commit in Innodb,
that frees pins for multiple Innodb transactions, it does not work
correctly.
Introduce memcpy_aligned<N>(), memcmp_aligned<N>(), memset_aligned<N>()
and use them for accessing InnoDB page header fields that are known
to be aligned.
MY_ASSUME_ALIGNED(): Wrapper for the GCC/clang __builtin_assume_aligned().
Nothing similar seems to exist in Microsoft Visual Studio, and the
C++20 std::assume_aligned is not available to us yet.
Explicitly specified alignment guarantees allow compilers to generate
faster code on platforms with strict alignment rules, instead of
emitting calls to potentially unaligned memcpy(), memcmp(), or memset().
The fix consists of three commits backported from 10.3:
1) Cleanup isnan() portability checks
(cherry picked from commit 7ffd7fe962)
2) Cleanup isinf() portability checks
Original problem reported by Wlad: re-compilation of 10.3 on top of 10.2
build would cache undefined HAVE_ISINF from 10.2, whereas it is expected
to be 1 in 10.3.
std::isinf() seem to be available on all supported platforms.
(cherry picked from commit bc469a0bdf)
3) Use std::isfinite in C++ code
This is addition to parent revision fixing build failures.
(cherry picked from commit 54999f4e75)
Threadpool will need a functionality for periodic thr_timer
(the threadpool maintainence task is a timer that runs periodically).
Also increase the stack size for the timer thread, 8k won't be enough.
A conflict between MDEV-19514 (b42294bc64)
and MDEV-20934 (d7a2401750)
was resolved. We will not invoke the function ibuf_delete_recs()
from ibuf_merge_or_delete_for_page(). Instead, we will add that
logic to the function ibuf_read_merge_pages().
Don't save/restore HP_INFO as it could be changed by a concurrent thread.
different parts of HP_INFO are protected by different mutexes and
the mutex that protect most of the HP_INFO does not protect its open_list
data.
As a bonus, make heap_check_heap() to take const HP_INFO* and not
make any changes there whatsoever.
Lock wait can happen on secondary index when doing FK checks for wsrep.
We should just return error to upper layer and applier will retry
operation when needed.
- Defining MariaDB_FUNCTION_PLUGIN
- Changing the code in /plugins/type_inet/ and /plugins/type_test/
to use MariaDB_FUNCTION_PLUGIN instead of MariaDB_FUNCTION_COLLECTION_PLUGIN.
- Changing maturity for the INET6 data type plugin from experimental to alpha.
make load_defaults() store the file name in the generated option list
using a special marker ---file-marker--- option.
Pick up this filename in handle_options().
Remove ---args-separator---, use ---file-marker--- with an empty file
name instead - this simplifies checks on the caller, only one special
option to recognize.
only my_getopt should use it, because it changes my_getopt's behavior.
If one simply wants to skip the separator - don't ask it to be added
in the first place
process all --defaults* options uniformly,
get rid of special case for --no-defaults and --print-defaults
use realpath instead of blindly concatenating pwd and relative path.
it turns out that practically every single user of handle_options()
used the get_one_option callback. Simplify the code,
make it mandatory, adjust unit tests.
almost all my_getopt settings and callbacks are global variables,
directly assignable to configure my_getopt. Only getopt_get_addr
was using a setter function. Get rid of it, make it a global
directly assignable variable like all other settings.
Also make getopt_compare_strings() static.
This is a remnant of "MySQL Instance Manager", which was removed in
MySQL-5.5.0 and never existed in MariaDB
Remove callback, simplify and optimize the code accordingly.