If the allocation of spider_table_sts_threads failed,
we would DBUG_RETURN(error_num) without error_num
ever having been initialized.
Pre-initialize error_num to HA_ERR_OUT_OF_MEM and remove
a lot of assignments that thus became redundant.
This error was introduced in 207594afac
(Spider 3.3.13).
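For illustration, a minimal sketch of the pattern (simplified names, not
the actual Spider code; HA_ERR_OUT_OF_MEM_SKETCH stands in for the real
handler error code):

#include <cstdlib>

static const int HA_ERR_OUT_OF_MEM_SKETCH= 128;

static int create_sts_threads_sketch(std::size_t n, void **out)
{
  /* Pre-initialized: every early return path now yields a defined value. */
  int error_num= HA_ERR_OUT_OF_MEM_SKETCH;
  void *threads= std::calloc(n, sizeof(void *));
  if (!threads)
    return error_num;  /* previously this could return garbage */
  *out= threads;
  return 0;            /* the explicit error_num assignments before each
                          allocation became redundant */
}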
Problem:
=========
One of the purge threads accesses the corrupted page and tries to remove it
from the LRU list. In the meantime, other purge threads are waiting for the
same page in buf_wait_for_read(). The assertion (buf_fix_count == 0) fails
for the purge thread that tries to remove the page from the LRU list.
Solution:
========
- Set the page id to FIL_NULL to indicate that the page is corrupted before
removing the block from the LRU list. Acquire the hash lock for the
particular page id and wait for the other threads to release the
buf_fix_count for the block (sketched below).
- Added an error check for btr_cur_open() in row_search_on_row_ref().
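For illustration, a minimal sketch of the buf_fix_count wait mentioned
above (simplified; the real code performs it under the buffer pool hash
lock for the page id):

#include <atomic>
#include <thread>

struct block_sketch
{
  std::atomic<unsigned> buf_fix_count{0};
};

static void wait_until_unfixed_sketch(block_sketch &block)
{
  /* Threads that found the page in buf_wait_for_read() still hold
     buf-fixes; the remover must wait for all of them to be released
     before the block may be freed from the LRU list. */
  while (block.buf_fix_count.load(std::memory_order_acquire) != 0)
    std::this_thread::yield();
}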
page_zip_compress(), page_zip_compress_write_log(),
page_zip_copy_recs(): Replace the parameters page, page_zip with block,
and set buf_page_t::init_on_flush on success
if innodb_log_optimize_ddl=OFF.
page_zip_parse_compress_no_data(): Merge with the only caller
recv_parse_or_apply_log_rec_body().
Thanks to MDEV-12699, the doublewrite buffer will only be needed in
those cases when a page is being updated in the data file. If the page
had never been written to the data file since it was initialized,
then recovery will be able to reconstruct the page based solely on
the contents of the redo log files.
The doublewrite buffer is only really needed when recovery needs to read
the page in order to apply redo log.
Note: As noted in MDEV-19739, we cannot safely disable the doublewrite
buffer if any MLOG_INDEX_LOAD records were written in the past or will
be written in the future. These records denote that redo logging was
disabled for some pages in a tablespace. Ideally, we would have
the setting innodb_log_optimize_ddl=OFF by default, and would not allow
it to be set while the server is running. If we wanted to make this
safe, assignments with SET GLOBAL innodb_log_optimize_ddl=...
should not only issue a redo log checkpoint (including a write of all
dirty pages from the entire buffer pool), but it should also wait for
all pending ALTER TABLE activity to complete. We elect not to do this.
Avoiding unnecessary use of the doublewrite buffer should improve the
write performance of InnoDB.
buf_page_t::init_on_flush: A new flag to indicate whether it is safe to
skip doublewrite buffering when writing the page.
fsp_init_file_page(): When writing a MLOG_INIT_FILE_PAGE2 record,
set the init_on_flush flag if innodb_log_optimize_ddl=OFF.
This is the only function that writes that log record.
buf_flush_write_block_low(): Skip doublewrite if init_on_flush is set.
fil_aio_wait(): Clear init_on_flush.
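For illustration, a minimal sketch of the flush-time decision (simplified
types; the real check lives in buf_flush_write_block_low()):

struct page_sketch
{
  bool init_on_flush; /* set when MLOG_INIT_FILE_PAGE2 was logged */
};

static bool use_doublewrite_sketch(const page_sketch &page,
                                   bool doublewrite_enabled)
{
  /* If the page has never been written to the data file since it was
     initialized, recovery can rebuild it from the redo log alone, so a
     torn write is harmless and the doublewrite buffer may be skipped. */
  return doublewrite_enabled && !page.init_on_flush;
}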
GCC 9.1.1 noticed that sd_notifyf() was always being invoked with a
str=NULL argument for "%s". This code was added in
commit 2e814d4702
but was not mentioned in the commit comment.
The STATUS messages for systemd matter during startup and shutdown,
and should not be emitted during normal operation.
ib_senderrf(): Remove the potentially harmful sd_notifyf() calls.
- If InnoDB encounters a garbage or incompletely written log block during
recovery, don't throw an error. Treat it as the end of the log.
- This kind of incomplete or empty block can be the result of killing
InnoDB while it was writing the redo log.
Follow-up for be5c432a42:
ha_partition::calculate_checksum() has to invoke calculate_checksum()
for partitions unconditionally, not only under (HA_HAS_OLD_CHECKSUM |
HA_HAS_NEW_CHECKSUM). The server uses ::info() to ask for a live checksum,
while calculate_checksum() must, precisely, calculate it the slow way,
also for tables that have no live checksum at all.
Also, fix the compilation on Windows (ha_checksum/ulonglong type mix).
InnoDB crash recovery buffers redo log records in a hash table.
The function recv_read_in_area() would pick a random hash bucket
and then try to submit read requests for a few nearby pages.
Let us replace recv_sys.addr_hash with a std::map, which will
automatically be iterated in sorted order (see the sketch below).
recv_sys_t::pages: Replaces recv_sys_t::addr_hash, recv_sys_t::n_addrs.
recv_sys_t::recs: Replaces most of recv_addr_t.
recv_t: Encapsulate a raw singly-linked list of records. This reduces
overhead compared to std::forward_list: storage and cache overhead,
because the next-element pointer also points to the data payload, and
processing overhead, because recv_sys_t::recs_t::last will point to
the last record, so that recv_sys_t::add() can append straight to the
end of the list.
RECV_PROCESSED, RECV_DISCARDED: Remove. When a page is fully processed,
it will be deleted from recv_sys.pages.
recv_sys_t::trim(): Replaces recv_addr_trim().
recv_sys_t::add(): Use page_id_t for identifying pages.
recv_fold(), recv_hash(), recv_get_fil_addr_struct(): Remove.
recv_read_in_area(): Simplify the iteration.
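For illustration, a minimal sketch of the sorted iteration (simplified
key and record types; not the actual recv_sys code):

#include <cstdint>
#include <map>

using page_id_sketch= uint64_t;
struct page_recs_sketch { /* buffered redo records for one page */ };

static void read_in_area_sketch(
  const std::map<page_id_sketch, page_recs_sketch> &pages,
  page_id_sketch start, page_id_sketch area_size)
{
  /* std::map iterates in key order, so neighbouring pages are visited
     consecutively instead of via randomly chosen hash buckets. */
  for (auto it= pages.lower_bound(start);
       it != pages.end() && it->first < start + area_size; ++it)
  {
    /* submit an asynchronous read request for page it->first ... */
  }
}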
Some I/O functions and macros that are declared in os0file.h used to
return a Boolean status code (nonzero on success). In MySQL 5.7, they
were changed to return dberr_t instead. Alas, in MariaDB Server 10.2,
some uses of these functions were not adjusted to the changed return value.
Until MDEV-19231, the valid values of dberr_t were always nonzero.
This means that some code that was incorrectly checking for a zero
return value from the functions would never detect a failure.
After MDEV-19231, some tests for ALTER ONLINE TABLE would fail with
cmake -DPLUGIN_PERFSCHEMA=NO. It turned out that the wrappers
pfs_os_file_read_no_error_handling_int_fd_func() and
pfs_os_file_write_int_fd_func() were wrongly returning
bool instead of dberr_t. Also the callers of these functions were
wrongly expecting bool (nonzero on success) instead of dberr_t.
This mistake had been made when the addition of these functions was
merged from MySQL 5.6.36 and 5.7.18 into MariaDB Server 10.2.7.
This fix also reverts commit 40becbc3c7
which attempted to work around the problem.
Problem:
=======
fil_iterate() writes the imported tablespace's page 0 as-is to the
discarded tablespace; not even the space id is changed. When the
tablespace is later opened, it fails with a space id mismatch error.
Fix:
====
fil_iterate() copies page 0 to the imported tablespace with the
discarded tablespace's space id.
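For illustration, a minimal sketch of stamping the correct space id into
the copied page 0 (the header offset constant is an assumption here,
modeled on FIL_PAGE_SPACE_ID):

#include <cstddef>
#include <cstdint>

static const std::size_t FIL_PAGE_SPACE_ID_SKETCH= 34; /* assumed offset */

static void stamp_space_id_sketch(uint8_t *page0, uint32_t space_id)
{
  /* InnoDB stores the space id big-endian in the page header; the copied
     page 0 must carry the discarded tablespace's id, not the imported
     file's original one. */
  page0[FIL_PAGE_SPACE_ID_SKETCH + 0]= uint8_t(space_id >> 24);
  page0[FIL_PAGE_SPACE_ID_SKETCH + 1]= uint8_t(space_id >> 16);
  page0[FIL_PAGE_SPACE_ID_SKETCH + 2]= uint8_t(space_id >> 8);
  page0[FIL_PAGE_SPACE_ID_SKETCH + 3]= uint8_t(space_id);
}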
os_file_write_func() and os_file_read_no_error_handling_func() returned
different result values depending on whether UNIV_PFS_IO was defined.
Other things:
- Added some comments about the return values of some functions
The issue is that two MARIA_HA instances share the same MARIA_STATUS_INFO
object during UNION execution, so the second MARIA_HA instance's state
pointer MARIA_HA::state points to the MARIA_HA::state_save of the first
MARIA_HA instance.
This happens in
thr_multi_lock(...) {
...
for (first_lock=data, pos= data+1 ; pos < end ; pos++)
{
...
if (pos[0]->lock == pos[-1]->lock && pos[0]->lock->copy_status)
(pos[0]->lock->copy_status)((*pos)->status_param,
(*first_lock)->status_param);
...
}
...
}
Usually the state is restored from ha_maria::external_lock(...):
#0 _ma_update_status (param=0x6290000e6270) at ./storage/maria/ma_state.c:309
#1 0x00005555577ccb15 in _ma_update_status_with_lock (info=0x6290000e6270) at ./storage/maria/ma_state.c:361
#2 0x00005555577c7dcc in maria_lock_database (info=0x6290000e6270, lock_type=2) at ./storage/maria/ma_locking.c:66
#3 0x0000555557802ccd in ha_maria::external_lock (this=0x61d0001b1308, thd=0x62a000048270, lock_type=2) at ./storage/maria/ha_maria.cc:2727
But _ma_update_status() does not take into account the case when
MARIA_HA::state points to the MARIA_HA::state_save of the other MARIA_HA
instance.
The fix is to restore MARIA_HA::state in ha_maria::external_lock() after
the maria_lock_database() call for transactional tables.
Reverted an incorrect change introduced by 548d03d7.
Since result is char**, the third qsort() parameter must be sizeof(char*).
Not sizeof(result[0] + 2), which is the same as sizeof(result[0]).
Not sizeof(result[0]) + 2 either, which would cause invalid memory access.
Proper sorting is the responsibility of the logfilenamecompare() callback.
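For illustration, the corrected call shape (a sketch; the comparator here
is a plain strcmp-based stand-in for the real logfilenamecompare()):

#include <cstdlib>
#include <cstring>

static int logfilenamecompare_sketch(const void *a, const void *b)
{
  /* each array element is a char*, so a and b point to char* values */
  return std::strcmp(*static_cast<char *const *>(a),
                     *static_cast<char *const *>(b));
}

static void sort_file_names(char **result, std::size_t count)
{
  /* the third parameter is the element width: sizeof(char *) */
  std::qsort(result, count, sizeof(char *), logfilenamecompare_sketch);
}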
Problem:
=======
rpl_blackhole.test fails when executed with the following options:
mysqld=--binlog_annotate_row_events=1, mysqld=--replicate_annotate_row_events=1
Test output:
------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
rpl.rpl_blackhole_bug 'mix' [ pass ] 791
rpl.rpl_blackhole_bug 'row' [ fail ]
Replicate_Wild_Ignore_Table
Last_Errno 1032
Last_Error Could not execute Update_rows_v1 event on table test.t1; Can't find
record in 't1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's
master log master-bin.000001, end_log_pos 1510
Analysis:
=========
Enabling "replicate_annotate_row_events" on slave, Tells the slave to write
annotate rows events received from the master to its own binary log. The
received annotate events are applied after the Gtid event as shown below.
thd->query() will be set to the actual query received from the master, through
annotate event. Annotate_rows event should not be deleted after the event is
applied as the thd->query will be used to generate new Annotate_rows event
during applying the subsequent Rows events. After the last Rows event has been
applied, the saved Annotate_rows event (if any) will be deleted.
In the blackhole engine, all DML operations are no-ops, as the engine
does not store any data. They simply return success without doing any
operation. But the existing code strictly expects thd->query() to be NULL
to identify that row-based replication is in use. This assumption fails
when row annotations are enabled, as the query is not NULL. Hence various
row-based operations like 'update', 'delete' and 'index lookup' will fail
when row annotations are enabled.
Fix:
===
Extend the row-based replication check to include row annotations as well,
i.e., either thd->query() is NULL, or thd->query() points to a query and
row annotations are in use.
row_search_mvcc(): Duplicate the logic of btr_pcur_move_to_next()
so that an infinite loop can be avoided when advancing to the next
page fails due to a corrupted page.
At higher levels of innodb_force_recovery, the InnoDB transaction
subsystem will not be set up at all.
At slightly lower levels, recovered transactions will not be rolled back,
and DDL operations could hang due to locks being held.
Let us consistently refuse all writes if the predicate
high_level_read_only holds. We failed to refuse DROP TABLE
and DROP DATABASE. (Refusing DROP TABLE is a partial backport
from MDEV-19570 in the 10.5 branch.)
- If one of the encryption threads already started the initialization
of the tablespace, then don't remove the other uninitialized tablespace
from the rotation list.
- If there is a change in innodb_encrypt_tables, then don't remove the
processed tablespace from the rotation list.
- Don't apply redo log for the corrupted page when innodb_force_recovery > 0.
- Allow the table to be dropped when the index root page is
corrupted and innodb_force_recovery > 0.
The update callback functions for several settable global InnoDB variables
are acquiring InnoDB latches while holding LOCK_global_system_variables.
On the other hand, some InnoDB code is invoking THDVAR() while holding
InnoDB latches. An example of this is thd_lock_wait_timeout() that is
called by lock_rec_enqueue_waiting(). In some cases, the
intern_sys_var_ptr() that is invoked by THDVAR() may acquire
LOCK_global_system_variables, via sync_dynamic_session_variables().
In lock_rec_enqueue_waiting(), we really must be holding some InnoDB
latch while invoking THDVAR(). This implies that
LOCK_global_system_variables must conceptually reside below any InnoDB
latch in the latching order. That in turn implies that the various
update callback functions must release LOCK_global_system_variables
before acquiring any InnoDB mutexes or rw-locks, and reacquire
LOCK_global_system_variables later (see the sketch after the list
below). The validate functions are being
invoked while not holding LOCK_global_system_variables and thus they
do not need any changes.
The following statements are affected by this:
SET GLOBAL innodb_adaptive_hash_index = …;
SET GLOBAL innodb_cmp_per_index_enabled = 1;
SET GLOBAL innodb_old_blocks_pct = …;
SET GLOBAL innodb_fil_make_page_dirty_debug = …; -- debug builds only
SET GLOBAL innodb_buffer_pool_evict = uncompressed; -- debug builds only
SET GLOBAL innodb_purge_run_now = 1; -- debug builds only
SET GLOBAL innodb_purge_stop_now = 1; -- debug builds only
SET GLOBAL innodb_log_checkpoint_now = 1; -- debug builds only
SET GLOBAL innodb_buf_flush_list_now = 1; -- debug builds only
SET GLOBAL innodb_buffer_pool_dump_now = 1;
SET GLOBAL innodb_buffer_pool_load_now = 1;
SET GLOBAL innodb_buffer_pool_load_abort = 1;
SET GLOBAL innodb_status_output = …;
SET GLOBAL innodb_status_output_locks = …;
SET GLOBAL innodb_encryption_threads = …;
SET GLOBAL innodb_encryption_rotate_key_age = …;
SET GLOBAL innodb_encryption_rotation_iops = …;
SET GLOBAL innodb_encrypt_tables = …;
SET GLOBAL innodb_disallow_writes = …;
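For illustration, a minimal sketch of the release/reacquire pattern in
such an update callback (the function shape follows the plugin sysvar
update hook; the ulong-valued variable is an assumption):

static void innodb_example_update(THD *, st_mysql_sys_var *,
                                  void *var_ptr, const void *save)
{
  /* entered while holding LOCK_global_system_variables */
  mysql_mutex_unlock(&LOCK_global_system_variables);
  /* ... acquire InnoDB mutexes or rw-locks, apply the new value ... */
  *static_cast<ulong *>(var_ptr)= *static_cast<const ulong *>(save);
  /* ... release the InnoDB latches ... */
  mysql_mutex_lock(&LOCK_global_system_variables);
}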
buf_LRU_old_ratio_update(): Correct the return type.
C++11 defines the singly-linked std::forward_list. Prefer it to
the doubly-linked std::list in cases where we do not really need it.
Also, clean up some code.
dict_index_remove_from_v_col_list(): Remove.
Obsoleted by dict_index_t::detach_columns().
There is no std::forward_list::push_back(). Use push_front() instead.
The ordering does not really matter.
dict_v_col_t::n_v_indexes: Added. There is no std::forward_list::size(),
and trx_undo_log_v_idx() needs to know the size.
rtr_info_track_t::rtr_active: Encapsulate. There really was no justification
for the pointer indirection.
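For illustration, a minimal sketch of the push_front() plus explicit
counter pattern that n_v_indexes implements (simplified types):

#include <forward_list>

struct v_col_sketch
{
  /* indexes that reference this virtual column */
  std::forward_list<const void *> v_indexes;
  unsigned n_v_indexes= 0; /* std::forward_list has no size() */

  void add_index(const void *index)
  {
    /* there is no push_back() either; ordering does not matter here */
    v_indexes.push_front(index);
    ++n_v_indexes;
  }
};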
MDEV-19581 Valgrind error with WolfSSL and encrypted binlog
WolfSSL can read memory out of bounds in EVP_CipherUpdate()
in decrypt/NOPAD mode when the input length is not a multiple of the AES
block size.
The workaround ensures that the input will have some padding at the end,
by allocating a slightly larger buffer or by padding the structures
with 16 more bytes.
ARMv8 (AArch64) CPUs implement the CRC32 extension, which is used here via
inline assembly, so they can also benefit from hardware acceleration in
IO-intensive workloads.
The patch optimizes the crc32c calculation with the ARMv8 CRC32
instructions (intrinsics) when available, rather than the original linear
CRC computation.
Change-Id: I05d36a64c726d910c47befad93390108f4e6567f
Signed-off-by: Yuqi Gu <yuqi.gu@arm.com>
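For illustration, a minimal sketch of hardware CRC-32C on AArch64 using
the ACLE intrinsics (not the patch itself; assumes compiling with
-march=armv8-a+crc):

#include <arm_acle.h>
#include <cstdint>
#include <cstring>

static uint32_t crc32c_aarch64_sketch(uint32_t crc,
                                      const uint8_t *buf, std::size_t len)
{
  crc= ~crc;
  while (len >= 8)               /* 8 bytes per CRC32CX instruction */
  {
    uint64_t chunk;
    std::memcpy(&chunk, buf, 8); /* safe unaligned load */
    crc= __crc32cd(crc, chunk);
    buf+= 8;
    len-= 8;
  }
  while (len--)                  /* tail, one byte at a time */
    crc= __crc32cb(crc, *buf++);
  return ~crc;
}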
There is only one InnoDB crash recovery subsystem.
Allocating recv_sys statically removes one level of pointer indirection,
makes the code more readable, and removes the awkward initialization of
recv_sys->dblwr.
recv_sys_t::create(): Replaces recv_sys_init().
recv_sys_t::debug_free(): Replaces recv_sys_debug_free().
recv_sys_t::close(): Replaces recv_sys_close().
recv_sys_t::add(): Replaces recv_add_to_hash_table().
recv_sys_t::empty(): Replaces recv_sys_empty_hash().
InnoDB duplicates the file descriptor returned by create_temp_file() to
work around further inconsistent use of this descriptor.
Use mysys file descriptors consistently for innobase_mysql_tmpfile(path).
Mostly close it via the appropriate mysys wrappers.
InnoDB duplicates the file descriptor returned by create_temp_file() to
work around further inconsistent use of this descriptor.
Use mysys file descriptors consistently for innobase_mysql_tmpfile(NULL).
Mostly close it via the appropriate mysys wrappers.
The INFORMATION_SCHEMA plugin INNODB_SYS_VIRTUAL, which was introduced
in MariaDB 10.2.2 along with the dictionary table SYS_VIRTUAL,
is similar to other, much older and already stable plugins that
provide access to InnoDB dictionary tables.
The option innodb_rollback_segments was deprecated already in
MariaDB Server 10.0. Its misleadingly named replacement innodb_undo_logs
is of very limited use. It makes sense to always create and use the
maximum number of rollback segments.
Let us remove the deprecated parameter innodb_rollback_segments and
deprecate&ignore the parameter innodb_undo_logs (to be removed in a
later major release).
This work involves some cleanup of InnoDB startup. Similar to other
write operations, DROP TABLE will no longer be allowed if
innodb_force_recovery is set to a value larger than 3.
The parameter innodb_stats_sample_pages became an alias for
innodb_stats_transient_sample_pages and was deprecated in
MariaDB Server 10.0. Let us finally remove that alias.
The transaction isolation levels READ COMMITTED and READ UNCOMMITTED
should behave similarly to the old deprecated setting
innodb_locks_unsafe_for_binlog=1, that is, avoid acquiring gap locks.
row_search_mvcc(): Reduce the scope of some variables, and clean up
the initialization and use of the variable set_also_gap_locks.
The parameter innodb_log_checksums that was introduced in MariaDB 10.2.2
via mysql/mysql-server@af0acedd88
does not make much sense. The original motivation of introducing this
parameter (initially called innodb_log_checksum_algorithm in
mysql/mysql-server@22ba38218e)
was that the InnoDB redo log used the slow and insecure innodb algorithm.
With hardware or SIMD accelerated CRC-32C, there should be no reason to
allow checksums to be disabled on the redo log.
The parameter innodb_encrypt_log already implies innodb_log_checksums=ON.
Let us deprecate the parameter innodb_log_checksums and always compute
redo log checksums, even if innodb_log_checksums=OFF is specified.
An upgrade from MariaDB 10.2.2 or later will only be possible after
using the default value innodb_log_checksums=ON. If the non-default
value innodb_log_checksums=OFF was in effect when the server was shut down,
a log block checksum mismatch will be reported and the upgraded server
will fail to start up.
A read-only storage engine that stores its data in (AWS) S3.
To store data in S3, one can use ALTER TABLE:
ALTER TABLE table_name ENGINE=S3
libmarias3 integration done by Sergei Golubchik
libmarias3 created by Andrew Hutchings
The reason for the change was that ha_notify_table_changed() was done
after the table was opened when the .frm had been replaced, which caused
failures in engines that check on open whether the .frm matches the
engine's table definition.
Other changes:
- Remove an unneeded open/close call at the end of inline ALTER TABLE.
Some tests that depended on the table being in the table cache after
ALTER TABLE had to be updated.
Use thd_get_ha_data()/thd_set_ha_data(), which protect against plugin
removal for as long as the THD holds the plugin's ha_data.
Do not reset THD ha_data in rocksdb_close_connection(); the cleaner
approach is to let ha_close_connection() do it.
Removed the transaction object cleanup from rocksdb_done_func(). As we
lock the plugin properly, there must be no transaction objects left
during RocksDB shutdown.
Part of MDEV-19515 - Improve connect speed
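For illustration, a minimal sketch of the close path under this scheme
(Example_trx and example_close_connection are hypothetical;
thd_get_ha_data() and thd_set_ha_data() are the real plugin API):

#include <mysql/plugin.h>

struct Example_trx { /* per-connection engine state */ };

static int example_close_connection(handlerton *hton, THD *thd)
{
  Example_trx *trx=
    static_cast<Example_trx *>(thd_get_ha_data(thd, hton));
  delete trx; /* free the per-connection state */
  /* Do NOT call thd_set_ha_data(thd, hton, NULL) here:
     ha_close_connection() resets the slot, which also releases the
     plugin lock that thd_set_ha_data() acquired. */
  return 0;
}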
Use thd_get_ha_data()/thd_set_ha_data(), which protect against plugin
removal for as long as the THD holds the plugin's ha_data.
Do not reset THD ha_data in sphinx_close_connection(); the cleaner
approach is to let ha_close_connection() do it.
Part of MDEV-19515 - Improve connect speed
Use thd_get_ha_data()/thd_set_ha_data(), which protect against plugin
removal for as long as the THD holds the plugin's ha_data.
Do not reset THD ha_data in mrn_close_connection(); the cleaner
approach is to let ha_close_connection() do it.
Part of MDEV-19515 - Improve connect speed
Do not reset THD ha_data in spider_close_connection(); the cleaner
approach is to let ha_close_connection() do it.
Part of MDEV-19515 - Improve connect speed
Use thd_get_ha_data()/thd_set_ha_data(), which protect against plugin
removal for as long as the THD holds the plugin's ha_data.
Do not reset THD ha_data in ha_federatedx::disconnect(); the cleaner
approach is to let ha_close_connection() do it.
Part of MDEV-19515 - Improve connect speed
Bootstrapping a new cluster from a backup created from a MariaDB
version prior to 10.3.5 may result in error "SST position can't be
set in past" when attempting to join additional nodes.
The problem stems from the fact that when reading the wsrep position
from InnoDB, the position is looked up in two places: the TRX_SYS page,
where versions prior to 10.3.5 used to store the WSREP position, and the
rollback segments, which is where newer versions store the position.
When starting a new cluster, the starting seqno is 0 and a new cluster
UUID is generated. This is persisted in rollback segments, but the old
UUID and seqno are not cleared from TRX_SYS page.
Subsequently, when reading back the position,
trx_rseg_read_wsrep_checkpoint() is going to return the maximum seqno
found in both the TRX_SYS page and the rollback segments. So in the case
of a newly bootstrapped cluster, it is always going to return the old
cluster information.
The fix consists of changing trx_rseg_read_wsrep_checkpoint() so that
only rollback segments are looked up. On startup, position is read
from the TRX_SYS page, and if present, it is copied to rollback
segments (unless a newer position is already present in the rollback
segments).
Finally the position stored in TRX_SYS page is cleared.
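For illustration, a minimal sketch of the startup migration described
above (simplified types; not the actual trx_sys/trx_rseg code):

#include <cstdint>
#include <cstring>

struct wsrep_pos_sketch
{
  uint8_t uuid[16];
  int64_t seqno; /* -1 when no position is stored */
};

static void migrate_wsrep_position_sketch(wsrep_pos_sketch &trx_sys_copy,
                                          wsrep_pos_sketch &rseg_copy)
{
  /* Copy the legacy TRX_SYS position into the rollback segments unless
     a newer position is already stored there ... */
  if (trx_sys_copy.seqno > rseg_copy.seqno)
    rseg_copy= trx_sys_copy;
  /* ... then clear the TRX_SYS copy so that a stale UUID/seqno can never
     shadow a freshly bootstrapped cluster's position again. */
  std::memset(&trx_sys_copy, 0, sizeof trx_sys_copy);
  trx_sys_copy.seqno= -1;
}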
row_insert_for_mysql(): InnoDB sets the values of row_start and row_end,
and this function used to return those values to the server in
ha_innobase::write_row(). This buggy behavior was removed. Also,
a piece of code in this function was reformatted.
upd_node_t::make_versioned_helper(): Assert that the preallocated size
of the update vector is not exceeded.
dict_sys.lock(), dict_sys_lock(): Acquire both mutex and latch.
dict_sys.unlock(), dict_sys_unlock(): Release both mutex and latch.
dict_sys.assert_locked(): Assert that both mutex and latch are held.
dict_sys_t::create(): Renamed from dict_init().
dict_sys_t::close(): Renamed from dict_close().
dict_sys_t::add(): Sliced from dict_table_t::add_to_cache().
dict_sys_t::remove(): Renamed from dict_table_remove_from_cache().
dict_sys_t::prevent_eviction(): Renamed from
dict_table_move_from_lru_to_non_lru().
dict_sys_t::acquire(): Replaces dict_move_to_mru() and some more logic.
dict_sys_t::resize(): Renamed from dict_resize().
dict_sys_t::find(): Replaces dict_lru_find_table() and
dict_non_lru_find_table().
Fix both code paths:
- Change the test source code so that it doesn't cause an "unused
variable" warning (which -Werror converted into an error and caused
CMake not to set HAVE_THREAD_LOCAL)
- If the system doesn't seem to support HAVE_THREAD_LOCAL, refuse to
compile (rather than producing a binary that crashes for some tests)
Originally submitted at https://github.com/facebook/mysql-5.6/pull/905