rpl_semi_sync_slave_enabled_consistent.test and the first part of
the commit message comes from Brandon Nesterenko.
A test to show how to induce the "Read semi-sync reply magic number
error" message on a primary. In short, if semi-sync is turned on
during the hand-shake process between a primary and replica, but
later a user negates the rpl_semi_sync_slave_enabled variable while
the replica's IO thread is running; if the io thread exits, the
replica can skip a necessary call to kill_connection() in
repl_semisync_slave.slave_stop() due to its reliance on a global
variable. Then, the replica will send a COM_QUIT packet to the
primary on an active semi-sync connection, causing the magic number
error.
The test in this patch exits the IO thread by forcing an error;
though note a call to STOP SLAVE could also do this, but it ends up
needing more synchronization. That is, the STOP SLAVE command also
tries to kill the VIO of the replica, which makes a race with the IO
thread to try and send the COM_QUIT before this happens (which would
need more debug_sync to get around). See THD::awake_no_mutex for
details as to the killing of the replica’s vio.
Notes:
- The MariaDB documentation does not make it clear that when one
enables semi-sync replication it does not matter if one enables
it first in the master or slave. Any order works.
Changes done:
- The rpl_semi_sync_slave_enabled variable is now a default value for
when semisync is started. The variable does not anymore affect
semisync if it is already running. This fixes the original reported
bug. Internally we now use repl_semisync_slave.get_slave_enabled()
instead of rpl_semi_sync_slave_enabled. To check if semisync is
active on should check the @@rpl_semi_sync_slave_status variable (as
before).
- The semisync protocol conflicts in the way that the original
MySQL/MariaDB client-server protocol was designed (client-server
send and reply packets are strictly ordered and includes a packet
number to allow one to check if a packet is lost). When using
semi-sync the master and slave can send packets at 'any time', so
packet numbering does not work. The 'solution' has been that each
communication starts with packet number 1, but in some cases there
is still a chance that the packet number check can fail. Fixed by
adding a flag (pkt_nr_can_be_reset) in the NET struct that one can
use to signal that packet number checking should not be done. This
is flag is set when semi-sync is used.
- Added Master_info::semi_sync_reply_enabled to allow one to configure
some slaves with semisync and other other slaves without semisync.
Removed global variable semi_sync_need_reply that would not work
with multi-master.
- Repl_semi_sync_master::report_reply_packet() can now recognize
the COM_QUIT packet from semisync slave and not give a
"Read semi-sync reply magic number error" error for this case.
The slave will be removed from the Ack listener.
- On Windows, don't stop semisync Ack listener just because one
slave connection is using socket_id > FD_SETSIZE.
- Removed busy loop in Ack_receiver::run() by using
"Self-pipe trick" to signal new slave and stop Ack_receiver.
- Changed some Repl_semi_sync_slave functions that always returns 0
from int to void.
- Added Repl_semi_sync_slave::slave_reconnect().
- Removed dummy_function Repl_semi_sync_slave::reset_slave().
- Removed some duplicate semisync notes from the error log.
- Add test of "if (get_slave_enabled() && semi_sync_need_reply)"
before calling Repl_semi_sync_slave::slave_reply().
(Speeds up the code as we can skip all initializations).
- If epl_semisync_slave.slave_reply() fails, we disable semisync
for that connection.
- We do not call semisync.switch_off() if there are no active slaves.
Instead we check in Repl_semi_sync_master::commit_trx() if there are
no active threads. This simplices the code.
- Changed assert() to DBUG_ASSERT() to ensure that the DBUG log is
flushed in case of asserts.
- Removed the internal rpl_semi_sync_slave_status as it is not needed
anymore. The @@rpl_semi_sync_slave_status status variable is now
mapped to rpl_semi_sync_enabled.
- Removed rpl_semi_sync_slave_enabled as it is not needed anymore.
Repl_semi_sync_slave::get_slave_enabled() contains the active status.
- Added checking that we do not add a slave twice with
Ack_receiver::add_slave(). This could happen with old code.
- Removed Repl_semi_sync_master::check_and_switch() as it is not
needed anymore.
- Ensure that when we call Ack_receiver::remove_slave() that the slave
is removed from the listener before function returns.
- Call listener.listen_on_sockets() outside of mutex for better
performance and less contested mutex.
- Ensure that listening is ignoring newly added slaves when checking for
responses.
- Fixed the master ack_receiver listener is not killed if there are no
connected slaves (and thus stop semisync handling of future
connections). This could happen if all slaves sockets where would be
marked as unreliable.
- Added unlink() to base_ilist_iterator and remove() to
I_List_iterator. This enables us to remove 'dead' slaves in
Ack_recever::run().
- kill_zombie_dump_threads() now does killing of dump threads properly.
- It can now kill several threads (should be impossible but could
happen if IO slaves reconnects very fast).
- We now wait until the dump thread is done before starting the
dump.
- Added an error if kill_zombie_dump_threads() fails.
- Set thd->variables.server_id before calling
kill_zombie_dump_threads(). This simplies the code.
- Added a lot of comments both in code and tests.
- Removed DBUG_EVALUATE_IF "failed_slave_start" as it is not used.
Test changes:
- rpl.rpl_session_var2 added which runs rpl.rpl_session_var test with
semisync enabled.
- Some timings changed slight with startup of slave which caused
rpl_binlog_dump_slave_gtid_state_info.text to fail as it checked the
error log file before the slave had started properly. Fixed by
adding wait_for_pattern_in_file.inc that allows waiting for the
pattern to appear in the log file.
- Tests have been updated so that we first set
rpl_semi_sync_master_enabled on the master and then set
rpl_semi_sync_slave_enabled on the slaves (this is according to how
the MariaDB documentation document how to setup semi-sync).
- Error text "Master server does not have semi-sync enabled" has been
replaced with "Master server does not support semi-sync" for the
case when the master supports semi-sync but semi-sync is not
enabled.
Other things:
- Some trivial cleanups in Repl_semi_sync_master::update_sync_header().
- We should in 11.3 changed the default value for
rpl-semi-sync-master-wait-no-slave from TRUE to FALSE as the TRUE
does not make much sense as default. The main difference with using
FALSE is that we do not wait for semisync Ack if there are no slave
threads. In the case of TRUE we wait once, which did not bring any
notable benefits except slower startup of master configured for
using semisync.
Co-author: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
This solves the problem reported in MDEV-32960 where a new
slave may not be registered in time and the master disables
semi sync because of that.
... on semisync slave
To provide semisync master crash-recovery the same server-id transactions
were made to accept for execution on the semisync slave when the strict gtid
mode (see MDEV-27760).
That however caused out-of-order error on a master's transaction
server of the circular setup.
The error was fair in the sense of the gtid strict mode rule as indeed
under the condition of the circular setup the replicated transaction
already exists in the local binlog.
This is fixed by the commit to ignore on the gtid strict mode semisync
slave those gtids that exist in the slave's binlog that effectively restores
the default same-server-id ignore policy.
At the same time the fixes complies with MDEV-21117 semisync slave recovery
to accept the same server-id transactions that do not exist in local binlog.
Problem:
========
This patch addresses two issues.
First, if a CHANGE MASTER command is issued and an error happens
while locating the replica’s relay logs, the logs can be put into an
invalid state where future updates fail and future CHANGE MASTER
calls crash the server. More specifically, right before a replica
purges the relay logs (part of the `CHANGE MASTER TO` logic), the
relay log is temporarily closed with state LOG_TO_BE_OPENED. If the
server errors in-between the temporary log closure and purge, i.e.
during the function find_log_pos, the log should be closed.
MDEV-25284 reveals the log is not properly closed.
Second, upon issuing a RESET SLAVE ALL command, a slave’s GTID
filters are not cleared (DO_DOMAIN_IDS, IGNORE_DOMIAN_IDS,
IGNORE_SERVER_IDS). MySQL had a similar bug report, Bug #18816897,
which fixed this issue to clear IGNORE_SERVER_IDS after issuing
RESET SLAVE ALL in version 5.7.
Solution:
=========
To fix the first problem, the CHANGE MASTER error handling logic was
extended to transition the relay log state to LOG_CLOSED from
LOG_TO_BE_OPENED.
To fix the second problem, the RESET SLAVE ALL logic is extended to
clear the domain_id filter and ignore_server_ids.
Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
Changes:
- To detect automatic strlen() I removed the methods in String that
uses 'const char *' without a length:
- String::append(const char*)
- Binary_string(const char *str)
- String(const char *str, CHARSET_INFO *cs)
- append_for_single_quote(const char *)
All usage of append(const char*) is changed to either use
String::append(char), String::append(const char*, size_t length) or
String::append(LEX_CSTRING)
- Added STRING_WITH_LEN() around constant string arguments to
String::append()
- Added overflow argument to escape_string_for_mysql() and
escape_quotes_for_mysql() instead of returning (size_t) -1 on overflow.
This was needed as most usage of the above functions never tested the
result for -1 and would have given wrong results or crashes in case
of overflows.
- Added Item_func_or_sum::func_name_cstring(), which returns LEX_CSTRING.
Changed all Item_func::func_name()'s to func_name_cstring()'s.
The old Item_func_or_sum::func_name() is now an inline function that
returns func_name_cstring().str.
- Changed Item::mode_name() and Item::func_name_ext() to return
LEX_CSTRING.
- Changed for some functions the name argument from const char * to
to const LEX_CSTRING &:
- Item::Item_func_fix_attributes()
- Item::check_type_...()
- Type_std_attributes::agg_item_collations()
- Type_std_attributes::agg_item_set_converter()
- Type_std_attributes::agg_arg_charsets...()
- Type_handler_hybrid_field_type::aggregate_for_result()
- Type_handler_geometry::check_type_geom_or_binary()
- Type_handler::Item_func_or_sum_illegal_param()
- Predicant_to_list_comparator::add_value_skip_null()
- Predicant_to_list_comparator::add_value()
- cmp_item_row::prepare_comparators()
- cmp_item_row::aggregate_row_elements_for_comparison()
- Cursor_ref::print_func()
- Removes String_space() as it was only used in one cases and that
could be simplified to not use String_space(), thanks to the fixed
my_vsnprintf().
- Added some const LEX_CSTRING's for common strings:
- NULL_clex_str, DATA_clex_str, INDEX_clex_str.
- Changed primary_key_name to a LEX_CSTRING
- Renamed String::set_quick() to String::set_buffer_if_not_allocated() to
clarify what the function really does.
- Rename of protocol function:
bool store(const char *from, CHARSET_INFO *cs) to
bool store_string_or_null(const char *from, CHARSET_INFO *cs).
This was done to both clarify the difference between this 'store' function
and also to make it easier to find unoptimal usage of store() calls.
- Added Protocol::store(const LEX_CSTRING*, CHARSET_INFO*)
- Changed some 'const char*' arrays to instead be of type LEX_CSTRING.
- class Item_func_units now used LEX_CSTRING for name.
Other things:
- Fixed a bug in mysql.cc:construct_prompt() where a wrong escape character
in the prompt would cause some part of the prompt to be duplicated.
- Fixed a lot of instances where the length of the argument to
append is known or easily obtain but was not used.
- Removed some not needed 'virtual' definition for functions that was
inherited from the parent. I added override to these.
- Fixed Ordered_key::print() to preallocate needed buffer. Old code could
case memory overruns.
- Simplified some loops when adding char * to a String with delimiters.
Problem:
========
Following typo in error log:
2019-03-13 15:58:10 0 [Note] Reading of all Master_info entries succeded
Should be 'succeeded'
Fix:
===
Fixed the typo with the right word 'succeeded'.
Signal handler is now respoinsible for setting abort_loop and breaking
poll() in main thread. The rest is handled by main thread itself.
Removed redundant LOCK_error_log init/destroy wrappers.
Removed redundant unireg_end(): it is trivial and it has only one caller.
Removed unused ready_to_exit from PFS.
Removed kill_in_progress: duplicates abort_loop.
Removed shutdown_in_progress: duplicates abort_loop.
Removed ready_to_exit: was used to make sure main thread waits for
cleanups, which are now done by main thread itself.
Removed SIGNALS_DONT_BREAK_READ, MAYBE_BROKEN_SYSCALL,
kill_broken_server: never defined/used.
Make clean_up() static.
In this commit we are adding three more status variable to SHOW SLAVE
STATUS. Slave_DDL_Events and Slave_Non_Transactional_Events.
Slave_DDL_Groups:- This status variable counts the occurrence of DDL
statements
Slave_Non_Transactional_Groups:- This variable count the occurrence
of non-transnational event group.
Slave_Transactional_Groups:- This variable count the occurrence
of transnational event group.
Patch Credit:- Kristian Nielsen
Handle string length as size_t, consistently (almost always:))
Change function prototypes to accept size_t, where in the past
ulong or uint were used. change local/member variables to size_t
when appropriate.
This fix excludes rocksdb, spider,spider, sphinx and connect for now.
- Added sql/mariadb.h file that should be included first by files in sql
directory, if sql_plugin.h is not used (sql_plugin.h adds SHOW variables
that must be done before my_global.h is included)
- Removed a lot of include my_global.h from include files
- Removed include's of some files that my_global.h automatically includes
- Removed duplicated include's of my_sys.h
- Replaced include my_config.h with my_global.h
Problem:- In the case of multisource replication/named slave
when we run "FLUSH LOGS" , it does not flush logs.
Solution:- A new function Master_info_index->flush_all_relay_logs()
is created which will rotate relay logs for all named slave.
This will be called in reload_acl_and_cache function when
connection_name.length == 0
Significantly reduce the amount of InnoDB, XtraDB and Mariabackup
code changes by defining pfs_os_file_t as something that is
transparently compatible with os_file_t.
Benefits of this patch:
- Removed a lot of calls to strlen(), especially for field_string
- Strings generated by parser are now const strings, less chance of
accidently changing a string
- Removed a lot of calls with LEX_STRING as parameter (changed to pointer)
- More uniform code
- Item::name_length was not kept up to date. Now fixed
- Several bugs found and fixed (Access to null pointers,
access of freed memory, wrong arguments to printf like functions)
- Removed a lot of casts from (const char*) to (char*)
Changes:
- This caused some ABI changes
- lex_string_set now uses LEX_CSTRING
- Some fucntions are now taking const char* instead of char*
- Create_field::change and after changed to LEX_CSTRING
- handler::connect_string, comment and engine_name() changed to LEX_CSTRING
- Checked printf() related calls to find bugs. Found and fixed several
errors in old code.
- A lot of changes from LEX_STRING to LEX_CSTRING, especially related to
parsing and events.
- Some changes from LEX_STRING and LEX_STRING & to LEX_CSTRING*
- Some changes for char* to const char*
- Added printf argument checking for my_snprintf()
- Introduced null_clex_str, star_clex_string, temp_lex_str to simplify
code
- Added item_empty_name and item_used_name to be able to distingush between
items that was given an empty name and items that was not given a name
This is used in sql_yacc.yy to know when to give an item a name.
- select table_name."*' is not anymore same as table_name.*
- removed not used function Item::rename()
- Added comparision of item->name_length before some calls to
my_strcasecmp() to speed up comparison
- Moved Item_sp_variable::make_field() from item.h to item.cc
- Some minimal code changes to avoid copying to const char *
- Fixed wrong error message in wsrep_mysql_parse()
- Fixed wrong code in find_field_in_natural_join() where real_item() was
set when it shouldn't
- ER_ERROR_ON_RENAME was used with extra arguments.
- Removed some (wrong) ER_OUTOFMEMORY, as alloc_root will already
give the error.
TODO:
- Check possible unsafe casts in plugin/auth_examples/qa_auth_interface.c
- Change code to not modify LEX_CSTRING for database name
(as part of lower_case_table_names)