Problem:
========
We have a Master/Master Setup on two servers, but are only writing to one of
those servers (so it is essentially Master/Slave) We upgraded from 10.1.* to
10.2.22 last week and starting with the upgrade, we are getting duplicate key
errors on the slave. BINLOG=mixed.
Analysis:
=========
This issue happens with LOCK TABLES and binlog_format=MIXED combination. When an
UNSAFE statement is encountered in 'MIXED' mode, it is logged in the form of
'ROW' format. For all the tables that are part of LOCK TABLES list their table maps
are written into the binary log. For each table in the list a check is
done to see if 'check_table_binlog_row_based_done' flag is set or not. If it is not set
a check process is initiated to see if table qualifies for row based binary
logging or not and 'check_table_binlog_row_based_done' is set. This flag will be
cleared at the time of closing thread tables.
But there can be special cases where the LOCK TABLES contains more number of
tables but the unsafe query is actually using subset of tables from LOCK TABLES
list.
For example: LOCK TABLES locks t1,t2,t3 but the unsafe statement makes use of
only two tables t1,t3. In this case the 'check_table_binlog_row_based_done' flag
is enabled for table 't2' while writing table map, but 'close_thread_tables'
function call will not reset this flag. Since the flag is not cleared for table
't2' even a safe statement which used t2 will be logged in the form of row based
format.
This leads to an assert on debug builds and causes duplicate entries in release
builds. In release builds a statement is logged in the form of both ROW and
STATEMENT format. This causes the slave to fail with duplicate key error.
Fix:
===
During 'close_thread_tables' when LOCK TABLE modes are active "ha_reset" is done
for all the tables which were part of current statement. As mentioned in the
example 'ha_reset' is called for tables 't1' and 't3'. This will clear the
'check_table_binlog_row_based_done' flag. At this point add a check for the rest
of the tables to see if 'check_table_binlog_row_based_done' is enabled or not.
If enabled clear the flag.
Some places didn't match the previous rules, making the Floor
address wrong.
Additional sed rules:
sed -i -e 's/Place.*Suite .*, Boston/Street, Fifth Floor, Boston/g'
sed -i -e 's/Suite .*, Boston/Fifth Floor, Boston/g'
InnoDB could return the same list again and again if the buffer
passed to trx_recover_for_mysql() is smaller than the number of
transactions that InnoDB recovered in XA PREPARE state.
We introduce the transaction state TRX_PREPARED_RECOVERED, which
is like TRX_PREPARED, but will be set during trx_recover_for_mysql()
so that each transaction will only be returned once.
Because init_server_components() is invoking ha_recover() twice,
we must reset the state of the transactions back to TRX_PREPARED
after returning the complete list, so that repeated traversals
will see the complete list again, instead of seeing an empty list.
Without this tweak, the test main.tc_heuristic_recover would hang
in MariaDB 10.1.
If we have a 2+ node cluster which is replicating from an async master
and the binlog_format is set to STATEMENT and multi-row inserts are executed
on a table with an auto_increment column such that values are automatically
generated by MySQL, then the server node generates wrong auto_increment
values, which are different from what was generated on the async master.
In the title of the MDEV-9519 it was proposed to ban start slave on a Galera
if master binlog_format = statement and wsrep_auto_increment_control = 1,
but the problem can be solved without such a restriction.
The causes and fixes:
1. We need to improve processing of changing the auto-increment values
after changing the cluster size.
2. If wsrep auto_increment_control switched on during operation of
the node, then we should immediately update the auto_increment_increment
and auto_increment_offset global variables, without waiting of the next
invocation of the wsrep_view_handler_cb() callback. In the current version
these variables retain its initial values if wsrep_auto_increment_control
is switched on during operation of the node, which leads to inconsistent
results on the different nodes in some scenarios.
3. If wsrep auto_increment_control switched off during operation of the node,
then we must return the original values of the auto_increment_increment and
auto_increment_offset global variables, as the user has set. To make this
possible, we need to add a "shadow copies" of these variables (which stores
the latest values set by the user).
https://jira.mariadb.org/browse/MDEV-9519
If we have a 2+ node cluster which is replicating from an async master
and the binlog_format is set to STATEMENT and multi-row inserts are executed
on a table with an auto_increment column such that values are automatically
generated by MySQL, then the server node generates wrong auto_increment
values, which are different from what was generated on the async master.
In the title of the MDEV-9519 it was proposed to ban start slave on a Galera
if master binlog_format = statement and wsrep_auto_increment_control = 1,
but the problem can be solved without such a restriction.
The causes and fixes:
1. We need to improve processing of changing the auto-increment values
after changing the cluster size.
2. If wsrep auto_increment_control switched on during operation of
the node, then we should immediately update the auto_increment_increment
and auto_increment_offset global variables, without waiting of the next
invocation of the wsrep_view_handler_cb() callback. In the current version
these variables retain its initial values if wsrep_auto_increment_control
is switched on during operation of the node, which leads to inconsistent
results on the different nodes in some scenarios.
3. If wsrep auto_increment_control switched off during operation of the node,
then we must return the original values of the auto_increment_increment and
auto_increment_offset global variables, as the user has set. To make this
possible, we need to add a "shadow copies" of these variables (which stores
the latest values set by the user).
https://jira.mariadb.org/browse/MDEV-9519
If the TC log did not provide list of XIDs to recover, the
commit by XID was skipped during wsrep recovery if binlog emulation
was on. However, with wsrep we want to commit every prepared transaction
with assigned wsrep XID since the transaction has already been
committed in the cluster.
Added a special condition to always proceed to commit by XID in
xarecover_handlerton() if binlog is off and the recovered transaction
has wsrep XID.
Clear wsrep XID in innobase_rollback_by_xid() for recovered wsrep
transaction in order to avoid resetting XID storage when rolling back
wsrep transaction during recovery.
Sort wsrep XIDs read from storage engine in ascending order and
erify that the range is continuous during crash recovery. If binlog is off,
commit all recovered transactions for continuous seqno range. This is safe
because all transactions with wsrep XID have been certified and must be
committed in the cluster. On the other hand if binlog is on, respect binlog
as a transaction coordinator in order to avoid missing transactions in binlog
that have been committed into storage engine .
extra/mariabackup/fil_cur.cc:361:42: warning: format specifies type 'unsigned long' but the argument has type 'ib_int64_t' (aka 'long long') [-Wformat]
extra/mariabackup/fil_cur.cc:376:9: warning: format specifies type 'unsigned long' but the argument has type 'ib_int64_t' (aka 'long long') [-Wformat]
sql/handler.cc:6196:45: warning: format specifies type 'unsigned long' but the argument has type 'wsrep_trx_id_t' (aka 'unsigned long long') [-Wformat]
sql/log.cc:1681:16: warning: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat]
sql/log.cc:1687:16: warning: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat]
sql/wsrep_sst.cc:1388:86: warning: format specifies type 'long' but the argument has type 'wsrep_seqno_t' (aka 'long long') [-Wformat]
sql/wsrep_sst.cc:232:86: warning: format specifies type 'long' but the argument has type 'wsrep_seqno_t' (aka 'long long') [-Wformat]
storage/connect/filamdbf.cpp:450:47: warning: format specifies type 'short' but the argument has type 'int' [-Wformat]
storage/connect/filamdbf.cpp:970:47: warning: format specifies type 'short' but the argument has type 'int' [-Wformat]
storage/connect/inihandl.cpp:197:16: warning: address of array 'key->name' will always evaluate to 'true' [-Wpointer-bool-conversion]
storage/innobase/btr/btr0scrub.cc:151:17: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/innobase/buf/buf0buf.cc:5085:8: warning: nonnull parameter 'bpage' will evaluate to 'true' on first encounter [-Wpointer-bool-conversion]
storage/innobase/fil/fil0crypt.cc:2454:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/innobase/handler/ha_innodb.cc:18685:7: warning: format specifies type 'unsigned long' but the argument has type 'wsrep_trx_id_t' (aka 'unsigned long long') [-Wformat]
storage/innobase/row/row0mysql.cc:3319:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/innobase/row/row0mysql.cc:3327:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/maria/ma_norec.c:35:10: warning: implicit conversion from 'int' to 'my_bool' (aka 'char') changes value from 131 to -125 [-Wconstant-conversion]
storage/maria/ma_norec.c:42:10: warning: implicit conversion from 'int' to 'my_bool' (aka 'char') changes value from 131 to -125 [-Wconstant-conversion]
storage/maria/ma_test2.c:1009:12: warning: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat]
storage/maria/ma_test2.c:1010:12: warning: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat]
storage/mroonga/ha_mroonga.cpp:9189:44: warning: use of logical '&&' with constant operand [-Wconstant-logical-operand]
storage/mroonga/vendor/groonga/lib/expr.c:4987:22: warning: comparison of constant -1 with expression of type 'grn_operator' is always false [-Wtautological-constant-out-of-range-compare]
storage/xtradb/btr/btr0scrub.cc:151:17: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/xtradb/buf/buf0buf.cc:5047:8: warning: nonnull parameter 'bpage' will evaluate to 'true' on first encounter [-Wpointer-bool-conversion]
storage/xtradb/fil/fil0crypt.cc:2454:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/xtradb/row/row0mysql.cc:3324:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
storage/xtradb/row/row0mysql.cc:3332:5: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
unittest/sql/mf_iocache-t.cc:120:35: warning: format specifies type 'unsigned long' but the argument has type 'int' [-Wformat]
unittest/sql/mf_iocache-t.cc:96:35: note: expanded from macro 'INFO_TAIL'
If we have a 2+ node cluster which is replicating from an async master
and the binlog_format is set to STATEMENT and multi-row inserts are executed
on a table with an auto_increment column such that values are automatically
generated by MySQL, then the server node generates wrong auto_increment
values, which are different from what was generated on the async master.
The causes and fixes:
1. We need to improve processing of changing the auto-increment values
after changing the cluster size.
2. If wsrep auto_increment_control switched on during operation of
the node, then we should immediately update the auto_increment_increment
and auto_increment_offset global variables, without waiting of the next
invocation of the wsrep_view_handler_cb() callback. In the current version
these variables retain its initial values if wsrep_auto_increment_control
is switched on during operation of the node, which leads to inconsistent
results on the different nodes in some scenarios.
3. If wsrep auto_increment_control switched off during operation of the node,
then we must return the original values of the auto_increment_increment and
auto_increment_offset global variables, as the user has set. To make this
possible, we need to add a "shadow copies" of these variables (which stores
the latest values set by the user).
specific temporary errors
The optimistic parallel slave's worker thread could face a run-time error due to
the algorithm's specifics which allows for conflicts like the reported
"Can't find record in 'table'".
A typical stack is like
{noformat}
#0 handler::print_error (this=0x61c00008f8a0, error=149, errflag=0) at handler.cc:3650
#1 0x0000555555e95361 in write_record (thd=thd@entry=0x62a0000a2208, table=table@entry=0x61f00008ce88, info=info@entry=0x7fffdee356d0) at sql_insert.cc:1944
#2 0x0000555555ea7767 in mysql_insert (thd=thd@entry=0x62a0000a2208, table_list=0x61b00012ada0, fields=..., values_list=..., update_fields=..., update_values=..., duplic=<optimized out>, ignore=<optimized out>) at sql_insert.cc:1039
#3 0x0000555555efda90 in mysql_execute_command (thd=thd@entry=0x62a0000a2208) at sql_parse.cc:3927
#4 0x0000555555f0cc50 in mysql_parse (thd=0x62a0000a2208, rawbuf=<optimized out>, length=<optimized out>, parser_state=<optimized out>) at sql_parse.cc:7449
#5 0x00005555566d4444 in Query_log_event::do_apply_event (this=0x61200005b9c8, rgi=<optimized out>, query_arg=<optimized out>, q_len_arg=<optimized out>) at log_event.cc:4508
#6 0x00005555566d639e in Query_log_event::do_apply_event (this=<optimized out>, rgi=<optimized out>) at log_event.cc:4185
#7 0x0000555555d738cf in Log_event::apply_event (rgi=0x61d0001ea080, this=0x61200005b9c8) at log_event.h:1343
#8 apply_event_and_update_pos_apply (ev=ev@entry=0x61200005b9c8, thd=thd@entry=0x62a0000a2208, rgi=rgi@entry=0x61d0001ea080, reason=<optimized out>) at slave.cc:3479
#9 0x0000555555d8596b in apply_event_and_update_pos_for_parallel (ev=ev@entry=0x61200005b9c8, thd=thd@entry=0x62a0000a2208, rgi=rgi@entry=0x61d0001ea080) at slave.cc:3623
#10 0x00005555562aca83 in rpt_handle_event (qev=qev@entry=0x6190000fa088, rpt=rpt@entry=0x62200002bd68) at rpl_parallel.cc:50
#11 0x00005555562bd04e in handle_rpl_parallel_thread (arg=arg@entry=0x62200002bd68) at rpl_parallel.cc:1258
{noformat}
Here {{handler::print_error}} computes whether to error log the
current error when --log-warnings > 1. The decision flag is consulted
bu {{my_message_sql()}} which can be eventually called.
In the bug case the decision is to log.
However in the optimistic mode slave applier case any conflict is
attempted to resolve with rollback and retry to success. Hence the
logging is at least extraneous.
The case is fixed with adding a new flag {{ME_LOG_AS_WARN}} which
{{handler::print_error}} may propagate further on through {{my_error}}
when the error comes from an optimistically running slave worker thread.
The new flag effectively requests the warning level for the errlog record,
while the thread's DA records the actual error (which is regarded as temporary one
by the parallel slave error handler).
Modern compilers (such as GCC 8) emit warnings that the
'register' keyword is deprecated and not valid C++17.
Let us remove most use of the 'register' keyword.
Code in 'extra/' is not touched.
This patch re-enables test galera.galera_var_max_ws_rows.
The test did not work because there were two distinct places where
the server was incrementing member THD::wsrep_affected_rows before
enforcing wsrep_max_ws_rows. Essentially, the test would fail because
every inserted row was counted twice.
The patch removes the extra code.
Also, allow the MariaDB 10.2 server to link InnoDB dynamically
against ha_innodb.so (which is what mysql-test-run.pl expects
to exist, instead of the default name ha_innobase.so).
wsrep_load_data_split(): Instead of referring to innodb_hton_ptr,
check the handlerton::db_type. This was recently broken by me in
MDEV-11415.
innodb_lock_schedule_algorithm: Define as a weak global symbol,
so that WITH_WSREP will not depend on InnoDB being linked statically.
I tested this manually. Notably, running a test that only does
SET GLOBAL wsrep_on=1;
with a static or dynamic InnoDB and
./mtr --mysqld=--loose-innodb-lock-schedule-algorithm=fcfs
will crash with SIGSEGV at shutdown. With the default VATS
combination the wsrep_on is properly refused for both the
static and dynamic InnoDB.
ha_close_connection(): Do invoke the method also for plugins
for which UNINSTALL PLUGIN was deferred due to open connections.
Thanks to @svoj for pointing this out.
thd_to_trx(): Return a pointer, not a reference to a pointer.
check_trx_exists(): Invoke thd_set_ha_data() for assigning a transaction.
log_write_checkpoint_info(): Remove an unused DEBUG_SYNC point
that would cause an assertion failure on shutdown after deferred
UNINSTALL PLUGIN.
This was tested as follows:
cmake -DWITH_WSREP=1 -DPLUGIN_INNOBASE:STRING=DYNAMIC \
-DWITH_MARIABACKUP:BOOL=OFF ...
make
cd mysql-test
./mtr innodb.innodb_uninstall
May also fix: MDEV-14970 "MariaDB crashed with signal 11 and Aria table"
I am not able to reproduce a crash, however there was no protection in
print_keydup_error() if the storage engine reported the wrong key number.
This patch adds such a protection and should stop any further crashes
in this case.
Other things:
- Added extra protection in Aria to not set errkey to more than number of
keys. (Don't think this is cause of this crash, but better safe than
sorry)
- Extend test_if_equal_repl_errors() to handle different cases of
ER_DUP_ENTRY. This is just mainly precaution for the future.