Commit graph

193 commits

Author SHA1 Message Date
Michael Widenius
7af50e4df4 MDEV-32551: "Read semi-sync reply magic number error" warnings on master
rpl_semi_sync_slave_enabled_consistent.test and the first part of
the commit message comes from Brandon Nesterenko.

A test to show how to induce the "Read semi-sync reply magic number
error" message on a primary. In short, if semi-sync is turned on
during the hand-shake process between a primary and replica, but
later a user negates the rpl_semi_sync_slave_enabled variable while
the replica's IO thread is running; if the io thread exits, the
replica can skip a necessary call to kill_connection() in
repl_semisync_slave.slave_stop() due to its reliance on a global
variable. Then, the replica will send a COM_QUIT packet to the
primary on an active semi-sync connection, causing the magic number
error.

The test in this patch exits the IO thread by forcing an error;
though note a call to STOP SLAVE could also do this, but it ends up
needing more synchronization. That is, the STOP SLAVE command also
tries to kill the VIO of the replica, which makes a race with the IO
thread to try and send the COM_QUIT before this happens (which would
need more debug_sync to get around). See THD::awake_no_mutex for
details as to the killing of the replica’s vio.

Notes:
- The MariaDB documentation does not make it clear that when one
  enables semi-sync replication it does not matter if one enables
  it first in the master or slave. Any order works.

Changes done:
- The rpl_semi_sync_slave_enabled variable is now a default value for
  when semisync is started. The variable does not anymore affect
  semisync if it is already running. This fixes the original reported
  bug.  Internally we now use repl_semisync_slave.get_slave_enabled()
  instead of rpl_semi_sync_slave_enabled. To check if semisync is
  active on should check the @@rpl_semi_sync_slave_status variable (as
  before).
- The semisync protocol conflicts in the way that the original
  MySQL/MariaDB client-server protocol was designed (client-server
  send and reply packets are strictly ordered and includes a packet
  number to allow one to check if a packet is lost). When using
  semi-sync the master and slave can send packets at 'any time', so
  packet numbering does not work. The 'solution' has been that each
  communication starts with packet number 1, but in some cases there
  is still a chance that the packet number check can fail.  Fixed by
  adding a flag (pkt_nr_can_be_reset) in the NET struct that one can
  use to signal that packet number checking should not be done. This
  is flag is set when semi-sync is used.
- Added Master_info::semi_sync_reply_enabled to allow one to configure
  some slaves with semisync and other other slaves without semisync.
  Removed global variable semi_sync_need_reply that would not work
  with multi-master.
- Repl_semi_sync_master::report_reply_packet() can now recognize
  the COM_QUIT packet from semisync slave and not give a
  "Read semi-sync reply magic number error" error for this case.
  The slave will be removed from the Ack listener.
- On Windows, don't stop semisync Ack listener just because one
  slave connection is using socket_id > FD_SETSIZE.
- Removed busy loop in Ack_receiver::run() by using
 "Self-pipe trick" to signal new slave and stop Ack_receiver.
- Changed some Repl_semi_sync_slave functions that always returns 0
  from int to void.
- Added Repl_semi_sync_slave::slave_reconnect().
- Removed dummy_function Repl_semi_sync_slave::reset_slave().
- Removed some duplicate semisync notes from the error log.
- Add test of "if (get_slave_enabled() && semi_sync_need_reply)"
  before calling Repl_semi_sync_slave::slave_reply().
  (Speeds up the code as we can skip all initializations).
- If epl_semisync_slave.slave_reply() fails, we disable semisync
  for that connection.
- We do not call semisync.switch_off() if there are no active slaves.
  Instead we check in Repl_semi_sync_master::commit_trx() if there are
  no active threads. This simplices the code.
- Changed assert() to DBUG_ASSERT() to ensure that the DBUG log is
  flushed in case of asserts.
- Removed the internal rpl_semi_sync_slave_status as it is not needed
  anymore. The @@rpl_semi_sync_slave_status status variable is now
  mapped to rpl_semi_sync_enabled.
- Removed rpl_semi_sync_slave_enabled  as it is not needed anymore.
  Repl_semi_sync_slave::get_slave_enabled() contains the active status.
- Added checking that we do not add a slave twice with
  Ack_receiver::add_slave(). This could happen with old code.
- Removed Repl_semi_sync_master::check_and_switch() as it is not
  needed anymore.
- Ensure that when we call Ack_receiver::remove_slave() that the slave
  is removed from the listener before function returns.
- Call listener.listen_on_sockets() outside of mutex for better
  performance and less contested mutex.
- Ensure that listening is ignoring newly added slaves when checking for
  responses.
- Fixed the master ack_receiver listener is not killed if there are no
  connected slaves (and thus stop semisync handling of future
  connections). This could happen if all slaves sockets where would be
  marked as unreliable.
- Added unlink() to base_ilist_iterator and remove() to
  I_List_iterator. This enables us to remove 'dead' slaves in
  Ack_recever::run().
- kill_zombie_dump_threads() now does killing of dump threads properly.
  - It can now kill several threads (should be impossible but could
    happen if IO slaves reconnects very fast).
  - We now wait until the dump thread is done before starting the
    dump.
- Added an error if kill_zombie_dump_threads() fails.
- Set thd->variables.server_id before calling
  kill_zombie_dump_threads(). This simplies the code.
- Added a lot of comments both in code and tests.
- Removed DBUG_EVALUATE_IF "failed_slave_start" as it is not used.

Test changes:
- rpl.rpl_session_var2 added which runs rpl.rpl_session_var test with
  semisync enabled.
- Some timings changed slight with startup of slave which caused
  rpl_binlog_dump_slave_gtid_state_info.text to fail as it checked the
  error log file before the slave had started properly. Fixed by
  adding wait_for_pattern_in_file.inc that allows waiting for the
  pattern to appear in the log file.
- Tests have been updated so that we first set
  rpl_semi_sync_master_enabled on the master and then set
  rpl_semi_sync_slave_enabled on the slaves (this is according to how
  the MariaDB documentation document how to setup semi-sync).
- Error text "Master server does not have semi-sync enabled" has been
  replaced with "Master server does not support semi-sync" for the
  case when the master supports semi-sync but semi-sync is not
  enabled.

Other things:
- Some trivial cleanups in Repl_semi_sync_master::update_sync_header().
- We should in 11.3 changed the default value for
  rpl-semi-sync-master-wait-no-slave from TRUE to FALSE as the TRUE
  does not make much sense as default. The main difference with using
  FALSE is that we do not wait for semisync Ack if there are no slave
  threads.  In the case of TRUE we wait once, which did not bring any
  notable benefits except slower startup of master configured for
  using semisync.

Co-author: Brandon Nesterenko <brandon.nesterenko@mariadb.com>

This solves the problem reported in MDEV-32960 where a new
slave may not be registered in time and the master disables
semi sync because of that.
2024-01-23 13:03:11 +02:00
Andrei
8d238d4726 MDEV-28609 refine gtid-strict-mode to ignore same server-id gtid from the past
... on semisync slave

To provide semisync master crash-recovery the same server-id transactions
were made to accept for execution on the semisync slave when the strict gtid
mode (see MDEV-27760).
That however caused out-of-order error on a master's transaction
server of the circular setup.
The error was fair in the sense of the gtid strict mode rule as indeed
under the condition of the circular setup the replicated transaction
already exists in the local binlog.

This is fixed by the commit to ignore on the gtid strict mode semisync
slave those gtids that exist in the slave's binlog that effectively restores
the default same-server-id ignore policy.
At the same time the fixes complies with MDEV-21117 semisync slave recovery
to accept the same server-id transactions that do not exist in local binlog.
2022-07-26 16:01:14 +03:00
Marko Mäkelä
73f5cbd0b6 Merge 10.5 into 10.6 2021-10-21 16:06:34 +03:00
Marko Mäkelä
5f8561a6bc Merge 10.4 into 10.5 2021-10-21 15:26:25 +03:00
Marko Mäkelä
489ef007be Merge 10.3 into 10.4 2021-10-21 14:57:00 +03:00
Marko Mäkelä
e4a7c15dd6 Merge 10.2 into 10.3 2021-10-21 13:41:04 +03:00
Brandon Nesterenko
2291f8ef73 MDEV-25284: Assertion `info->type == READ_CACHE || info->type == WRITE_CACHE' failed
Problem:
========
This patch addresses two issues.

First, if a CHANGE MASTER command is issued and an error happens
while locating the replica’s relay logs, the logs can be put into an
invalid state where future updates fail and future CHANGE MASTER
calls crash the server. More specifically, right before a replica
purges the relay logs (part of the `CHANGE MASTER TO` logic), the
relay log is temporarily closed with state LOG_TO_BE_OPENED. If the
server errors in-between the temporary log closure and purge, i.e.
during the function find_log_pos, the log should be closed.
MDEV-25284 reveals the log is not properly closed.

Second, upon issuing a RESET SLAVE ALL command, a slave’s GTID
filters are not cleared (DO_DOMAIN_IDS, IGNORE_DOMIAN_IDS,
IGNORE_SERVER_IDS). MySQL had a similar bug report, Bug #18816897,
which fixed this issue to clear IGNORE_SERVER_IDS after issuing
RESET SLAVE ALL in version 5.7.

Solution:
=========

To fix the first problem, the CHANGE MASTER error handling logic was
extended to transition the relay log state to LOG_CLOSED from
LOG_TO_BE_OPENED.

To fix the second problem, the RESET SLAVE ALL logic is extended to
clear the domain_id filter and ignore_server_ids.

Reviewed By:
============
Andrei Elkin <andrei.elkin@mariadb.com>
2021-10-18 10:43:51 -06:00
Monty
d754d3d951 Less noise in the error log
- Updated error messages for recovery
- Changed printing of debug sync point information to make it fit 80 char
2021-05-19 22:54:12 +02:00
Monty
b6ff139aa3 Reduce usage of strlen()
Changes:
- To detect automatic strlen() I removed the methods in String that
  uses 'const char *' without a length:
  - String::append(const char*)
  - Binary_string(const char *str)
  - String(const char *str, CHARSET_INFO *cs)
  - append_for_single_quote(const char *)
  All usage of append(const char*) is changed to either use
  String::append(char), String::append(const char*, size_t length) or
  String::append(LEX_CSTRING)
- Added STRING_WITH_LEN() around constant string arguments to
  String::append()
- Added overflow argument to escape_string_for_mysql() and
  escape_quotes_for_mysql() instead of returning (size_t) -1 on overflow.
  This was needed as most usage of the above functions never tested the
  result for -1 and would have given wrong results or crashes in case
  of overflows.
- Added Item_func_or_sum::func_name_cstring(), which returns LEX_CSTRING.
  Changed all Item_func::func_name()'s to func_name_cstring()'s.
  The old Item_func_or_sum::func_name() is now an inline function that
  returns func_name_cstring().str.
- Changed Item::mode_name() and Item::func_name_ext() to return
  LEX_CSTRING.
- Changed for some functions the name argument from const char * to
  to const LEX_CSTRING &:
  - Item::Item_func_fix_attributes()
  - Item::check_type_...()
  - Type_std_attributes::agg_item_collations()
  - Type_std_attributes::agg_item_set_converter()
  - Type_std_attributes::agg_arg_charsets...()
  - Type_handler_hybrid_field_type::aggregate_for_result()
  - Type_handler_geometry::check_type_geom_or_binary()
  - Type_handler::Item_func_or_sum_illegal_param()
  - Predicant_to_list_comparator::add_value_skip_null()
  - Predicant_to_list_comparator::add_value()
  - cmp_item_row::prepare_comparators()
  - cmp_item_row::aggregate_row_elements_for_comparison()
  - Cursor_ref::print_func()
- Removes String_space() as it was only used in one cases and that
  could be simplified to not use String_space(), thanks to the fixed
  my_vsnprintf().
- Added some const LEX_CSTRING's for common strings:
  - NULL_clex_str, DATA_clex_str, INDEX_clex_str.
- Changed primary_key_name to a LEX_CSTRING
- Renamed String::set_quick() to String::set_buffer_if_not_allocated() to
  clarify what the function really does.
- Rename of protocol function:
  bool store(const char *from, CHARSET_INFO *cs) to
  bool store_string_or_null(const char *from, CHARSET_INFO *cs).
  This was done to both clarify the difference between this 'store' function
  and also to make it easier to find unoptimal usage of store() calls.
- Added Protocol::store(const LEX_CSTRING*, CHARSET_INFO*)
- Changed some 'const char*' arrays to instead be of type LEX_CSTRING.
- class Item_func_units now used LEX_CSTRING for name.

Other things:
- Fixed a bug in mysql.cc:construct_prompt() where a wrong escape character
  in the prompt would cause some part of the prompt to be duplicated.
- Fixed a lot of instances where the length of the argument to
  append is known or easily obtain but was not used.
- Removed some not needed 'virtual' definition for functions that was
  inherited from the parent. I added override to these.
- Fixed Ordered_key::print() to preallocate needed buffer. Old code could
  case memory overruns.
- Simplified some loops when adding char * to a String with delimiters.
2021-05-19 22:27:48 +02:00
Sergei Golubchik
c1c5222cae cleanup: PSI key is *always* the first argument 2020-03-10 19:24:23 +01:00
Sergei Golubchik
7c58e97bf6 perfschema memory related instrumentation changes 2020-03-10 19:24:22 +01:00
Sergei Golubchik
2ac3121af2 perfschema - various collateral cleanups and small changes 2020-03-10 19:24:22 +01:00
Oleksandr Byelkin
a15234bf4b Merge branch '10.3' into 10.4 2019-12-09 15:09:41 +01:00
Faustin Lammler
2df2238cb8 Lintian complains on spelling error
The lintian check complains on spelling error:
https://salsa.debian.org/mariadb-team/mariadb-10.3/-/jobs/95739
2019-12-02 12:41:13 +02:00
Marko Mäkelä
e9c1701e11 Merge 10.3 into 10.4 2019-07-25 18:42:06 +03:00
Eugene Kosov
0f83c8878d Merge 10.2 into 10.3 2019-07-16 18:39:21 +03:00
Eugene Kosov
26c389b7b7 Merge 10.1 into 10.2 2019-07-09 13:22:22 +03:00
Eugene Kosov
838bb9fad4 fix Galera memory leak
This was caused by 7f2cfa8f47
2019-07-08 17:04:18 +03:00
Sachin
7f2cfa8f47 MDEV-8874 Replication filters configured in my.cnf are ignored if slave reset and reconfigured
Don't delete the rpl_filter on RESET SLAVE.
2019-06-27 09:54:20 +05:30
Oleksandr Byelkin
f66d1850ac Merge branch '10.3' into 10.4 2019-06-14 22:10:50 +02:00
Oleksandr Byelkin
4a3d51c76c Merge branch '10.2' into 10.3 2019-06-14 07:36:47 +02:00
Marko Mäkelä
4bbd8be482 Merge 10.1 into 10.2 2019-06-12 10:30:01 +03:00
Sujatha
78c1be8b6b MDEV-18913: typo in error log
Problem:
========
Following typo in error log:

2019-03-13 15:58:10 0 [Note] Reading of all Master_info entries succeded

Should be 'succeeded'

Fix:
===
Fixed the typo with the right word 'succeeded'.
2019-05-30 12:11:57 +05:30
Oleksandr Byelkin
c07325f932 Merge branch '10.3' into 10.4 2019-05-19 20:55:37 +02:00
Marko Mäkelä
be85d3e61b Merge 10.2 into 10.3 2019-05-14 17:18:46 +03:00
Marko Mäkelä
26a14ee130 Merge 10.1 into 10.2 2019-05-13 17:54:04 +03:00
Vicențiu Ciorbaru
cb248f8806 Merge branch '5.5' into 10.1 2019-05-11 22:19:05 +03:00
Vicențiu Ciorbaru
5543b75550 Update FSF Address
* Update wrong zip-code
2019-05-11 21:29:06 +03:00
Sergey Vojtovich
8553525931 MDEV-18400 - Move shutdown handling to main thread
Signal handler is now respoinsible for setting abort_loop and breaking
poll() in main thread. The rest is handled by main thread itself.

Removed redundant LOCK_error_log init/destroy wrappers.
Removed redundant unireg_end(): it is trivial and it has only one caller.
Removed unused ready_to_exit from PFS.
Removed kill_in_progress: duplicates abort_loop.
Removed shutdown_in_progress: duplicates abort_loop.
Removed ready_to_exit: was used to make sure main thread waits for
cleanups, which are now done by main thread itself.
Removed SIGNALS_DONT_BREAK_READ, MAYBE_BROKEN_SYSCALL,
kill_broken_server: never defined/used.
Make clean_up() static.
2019-01-29 11:56:35 +04:00
Sergei Golubchik
c9061d1102 mysys: rename ME_xxx flags to match plugin api 2018-06-04 12:32:23 +02:00
Monty
30ebc3ee9e Add likely/unlikely to speed up execution
Added to:
- if (error)
- Lex
- sql_yacc.yy and sql_yacc_ora.yy
- In header files to alloc() calls
- Added thd argument to thd_net_is_killed()
2018-05-07 00:07:32 +03:00
Sachin Setiya
419385dbf1 Mdev-10664 Add statuses about optimistic parallel replication stalls
In this commit we are adding three more status variable to SHOW SLAVE
STATUS.  Slave_DDL_Events and Slave_Non_Transactional_Events.

Slave_DDL_Groups:- This status variable counts the occurrence of DDL
statements

Slave_Non_Transactional_Groups:- This variable count the occurrence
of non-transnational event group.

Slave_Transactional_Groups:- This variable count the occurrence
of transnational event group.

Patch Credit:- Kristian Nielsen
2018-04-19 16:04:23 +05:30
Vladislav Vaintroub
6c279ad6a7 MDEV-15091 : Windows, 64bit: reenable and fix warning C4267 (conversion from 'size_t' to 'type', possible loss of data)
Handle string length as size_t, consistently (almost always:))
Change function prototypes to accept size_t, where in the past
ulong or uint were used. change local/member variables to size_t
when appropriate.

This fix excludes rocksdb, spider,spider, sphinx and connect for now.
2018-02-06 12:55:58 +00:00
Sergei Golubchik
bb8e99fdc3 Merge branch 'bb-10.2-ext' into 10.3 2017-08-26 00:34:43 +02:00
Sergei Golubchik
27412877db Merge branch '10.2' into bb-10.2-ext 2017-08-25 10:25:48 +02:00
Michael Widenius
4aaa38d26e Enusure that my_global.h is included first
- Added sql/mariadb.h file that should be included first by files in sql
  directory, if sql_plugin.h is not used (sql_plugin.h adds SHOW variables
  that must be done before my_global.h is included)
- Removed a lot of include my_global.h from include files
- Removed include's of some files that my_global.h automatically includes
- Removed duplicated include's of my_sys.h
- Replaced include my_config.h with my_global.h
2017-08-24 01:05:44 +02:00
Sergei Golubchik
cb1e76e4de Merge branch '10.1' into 10.2 2017-08-17 11:38:34 +02:00
Sergei Golubchik
8e8d42ddf0 Merge branch '10.0' into 10.1 2017-08-08 10:18:43 +02:00
Sergei Golubchik
2804a3fac4 memory leak: add a missing end_relay_log_info() 2017-07-27 12:42:21 +02:00
Vicențiu Ciorbaru
786ad0a158 Merge remote-tracking branch 'origin/5.5' into 10.0 2017-07-25 00:41:54 +03:00
Sergei Golubchik
9a5fe1f4ea Merge remote-tracking branch 'mysql/5.5' into 5.5 2017-07-18 14:59:10 +02:00
Kristian Nielsen
1d91910b94 MDEV-12179: Per-engine mysql.gtid_slave_pos table
Merge into MariaDB 10.3.
2017-07-03 09:33:41 +02:00
Alexander Barkov
765347384a Merge remote-tracking branch 'origin/10.2' into bb-10.2-ext 2017-06-15 15:27:11 +04:00
Marko Mäkelä
2d8fdfbde5 Merge 10.1 into 10.2
Replace have_innodb_zip.inc with innodb_page_size_small.inc.
2017-06-08 12:45:08 +03:00
Sachin Setiya
da61107fc8 MDEV-9544 FLUSH [RELAY] LOGS does not rotate logs for a named slave
Problem:- In the case of multisource replication/named slave
when we run "FLUSH LOGS" , it does not flush logs.

Solution:- A new function Master_info_index->flush_all_relay_logs()
is created which will rotate relay logs for all named slave.
This will be called in reload_acl_and_cache function when
connection_name.length == 0
2017-06-05 13:11:10 +05:30
Marko Mäkelä
8f643e2063 Merge 10.1 into 10.2 2017-05-23 11:09:47 +03:00
Marko Mäkelä
65e1399e64 Merge 10.0 into 10.1
Significantly reduce the amount of InnoDB, XtraDB and Mariabackup
code changes by defining pfs_os_file_t as something that is
transparently compatible with os_file_t.
2017-05-20 08:41:20 +03:00
Vicențiu Ciorbaru
339a290d22 Merge remote-tracking branch 'origin/5.5' into 10.0 2017-05-17 15:42:36 +03:00
Sergei Golubchik
2e1428c0b5 MDEV-12799 Buffer overflow
with a specially corrupted master.info one can
get an invalid heartbeat_period that will
trigger a heap overflow.
2017-05-15 22:01:15 +02:00
Monty
5a759d31f7 Changing field::field_name and Item::name to LEX_CSTRING
Benefits of this patch:
- Removed a lot of calls to strlen(), especially for field_string
- Strings generated by parser are now const strings, less chance of
  accidently changing a string
- Removed a lot of calls with LEX_STRING as parameter (changed to pointer)
- More uniform code
- Item::name_length was not kept up to date. Now fixed
- Several bugs found and fixed (Access to null pointers,
  access of freed memory, wrong arguments to printf like functions)
- Removed a lot of casts from (const char*) to (char*)

Changes:
- This caused some ABI changes
  - lex_string_set now uses LEX_CSTRING
  - Some fucntions are now taking const char* instead of char*
- Create_field::change and after changed to LEX_CSTRING
- handler::connect_string, comment and engine_name() changed to LEX_CSTRING
- Checked printf() related calls to find bugs. Found and fixed several
  errors in old code.
- A lot of changes from LEX_STRING to LEX_CSTRING, especially related to
  parsing and events.
- Some changes from LEX_STRING and LEX_STRING & to LEX_CSTRING*
- Some changes for char* to const char*
- Added printf argument checking for my_snprintf()
- Introduced null_clex_str, star_clex_string, temp_lex_str to simplify
  code
- Added item_empty_name and item_used_name to be able to distingush between
  items that was given an empty name and items that was not given a name
  This is used in sql_yacc.yy to know when to give an item a name.
- select table_name."*' is not anymore same as table_name.*
- removed not used function Item::rename()
- Added comparision of item->name_length before some calls to
  my_strcasecmp() to speed up comparison
- Moved Item_sp_variable::make_field() from item.h to item.cc
- Some minimal code changes to avoid copying to const char *
- Fixed wrong error message in wsrep_mysql_parse()
- Fixed wrong code in find_field_in_natural_join() where real_item() was
  set when it shouldn't
- ER_ERROR_ON_RENAME was used with extra arguments.
- Removed some (wrong) ER_OUTOFMEMORY, as alloc_root will already
  give the error.

TODO:
- Check possible unsafe casts in plugin/auth_examples/qa_auth_interface.c
- Change code to not modify LEX_CSTRING for database name
  (as part of lower_case_table_names)
2017-04-23 22:35:46 +03:00