This occured when the SQL thread (but not the IO thread) stops while
GTID and parallel replication are used with multiple domain ids in the
GTID position, and is restarted.
In this case, the SQL needs to start some way back in the relay log,
applying or skipping events within each replication domain as
appropriate.
The SQL threads starts at the beginning of an old relay log file, and
this position may be in the middle of an event group. The bug was that
such partial event group could be re-applied, causing replication
corruption.
This patch fixes the issue, by making sure to skip any initial events
that were part of an earlier (already applied) event group.
cherry-pick from 5.7:
commit 6b24763
Author: Manish Kumar <manish.4.kumar@oracle.com>
Date: Tue Mar 27 13:10:42 2012 +0530
BUG#12977988 - ON STOP SLAVE: ERROR READING PACKET FROM SERVER: LOST CONNECTION
TO MYSQL SERVER
BUG#11761457 - ERROR 2013 + "ERROR READING RELAY LOG EVENT" ON STOP SLAVEBUG#12977988 - ON STOP SLAVE: ERROR READING PACKET FROM SERVER: LOST CONNECTION
TO MYSQL SERVER
ENOTEMPTY is 247 on hppa, not 39 like on Linux, according to this report:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=837369
So add another replacement for this to rpl.rpl_drop_db and
binlog.binlog_databasae tests (there were already a couple similar
replacements for other platforms).
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The issue was that when running with valgrind the wait for master_pos_Wait()
was not long enough.
This patch also fixes two other failures that could affect rpl_mdev6020:
- check_if_conflicting_replication_locks() didn't properly check domains
- 'did_mark_start_commit' was after signals to other threads was sent which could
get the variable read too early.
1. remove unnecessary rpl-tokudb combination file.
2. fix rpl_ignore_table to cleanup properly (not leave test
grants in memory)
3. check_temp_dir() is supposed to set the error in stmt_da - do
it even when called multiple times, this fixes a crash when
rpl.rpl_slave_load_tmpdir_not_exist is run twice.
In some cases, MariaDB 10.0 could write a master.info file that was read
incorrectly by 10.1 and could cause server to fail to start after an upgrade.
(If writing a new master.info file that is shorter than the old, extra
junk may remain at the end of the file. This is handled properly in
10.1 with an END_MARKER line, but this line is not written by
10.0. The fix here is to make 10.1 robust at reading the master.info
files written by 10.0).
Fix several things around reading master.info and read_mi_key_from_file():
- read_mi_key_from_file() did not distinguish between a line with and
without an eqals '=' sign.
- If a line was empty, read_mi_key_from_file() would incorrectly return
the key from the previous call.
- An extra using_gtid=X line left-over by MariaDB 10.0 might incorrectly
be read and overwrite the correct value.
- Fix incorrect usage of strncmp() which should be strcmp().
- Add test cases.
LOG-SLOW-SLAVE-STATEMENTS NOT DISPLAYED.
These parameters were moved from the command line options to
the system variables section. Treatment of the
opt_log_slow_slave_statements changed to let the
dynamic change of the variable.
delete deferred events after they're executed
(otherwise they can be executed again for a sub-statement)
See also
commit 0e78d1d
Author: Venkatesh Duggirala <venkatesh.duggirala@oracle.com>
Date: Wed Mar 20 11:20:47 2013 +0530
BUG#15850951-DUPLICATE ERROR IN REPLICATION WITH SLAVE
TRIGGERS AND AUTO INCREMENT
The main.merge test case was failing when tested using row based
binlog format.
While analyzing the issue it was found the following issues:
a) The server is calling binlog related code even when a statement will
not be binlogged;
b) The child table list was not present into table structure by the time
to generate the create table statement;
c) The tables in the child table list will not be opened yet when
generating table create info using row based replication;
d) CREATE TABLE LIKE TEMP_TABLE does not preserve original table storage
engine when using row based replication;
This patch addressed all above issues.
@ sql/sql_class.h
Added a function to determine if the binary log is disabled to
the current session. This is related with issue (a) above.
@ sql/sql_table.cc
Added code to skip binary logging related code if the statement
will not be binlogged. This is related with issue (a) above.
Added code to add the children to the query list of the table that
will have its CREATE TABLE generated. This is related with issue (b)
above.
Added code to force the storage engine to be generated into the
CREATE TABLE. This is related with issue (d) above.
@ storage/myisammrg/ha_myisammrg.cc
Added a test to skip a table getting info about a child table if the
child table is not opened. This is related to issue (c) above.
The reason for the assertion failure is that the update statement for
the minimal row image sets only the PK column in the write_set of the
table to true. On the other hand, the trigger aims to update a different
column.
Make sure that triggers update the used columns accordingly, when being
processed.
when replicating old temporal type fields (that don't store
metadata in the binlog), take the precision from
destination fields.
(this fixes the replication failure, crashes were
fixed in a different commit)
Problem:
mysql-test/suite/rpl/t/rpl_killed_ddl.test
This test contains code which was disabled because of certain bugs.
BUG#44041 declared to be a duplicate of Bug#45516 which was fixed 2010
BUG#43353 fixed 2012
BUG#44171 fixed 2010
Fix:
Enabled the test code related to the above mentioned bugs.
This occurs when replication stops with an error, domain-based parallel
replication is used, and the GTID position contains more than one domain.
Furthermore, it relates to the case where the SQL thread is restarted
without first stopping the IO thread.
In this case, the file/offset relay-log position does not correctly
represent the slave's multi-dimensional position, because other domains may
be far ahead of, or behind, the domain with the failing event. So the code
reverts the relay log position back to the start of a relay log file that is
known to be before all active domains.
There was a bug that when the SQL thread was restarted, the
rli->relay_log_state was incorrectly initialised from @@gtid_slave_pos. This
position will likely be too far ahead, due to reverting the relay log
position. Thus, if the replication fails again after the SQL thread restart,
the rli->restart_gtid_pos might be updated incorrectly. This in turn would
cause a second SQL thread restart to replicate from the wrong position, if
the IO thread was still left running.
The fix is to initialise rli->relay_log_state from @@gtid_slave_pos only
when we actually purge and re-fetch relay logs from the master, not at every
SQL thread start.
A related problem is the use of sql_slave_skip_counter to resolve
replication failures in this kind of scenario. Since the slave position is
multi-dimensional, sql_slave_skip_counter can not work properly - it is
indeterminate exactly which event is to be skipped, and is unlikely to work
as expected for the user. So make this an error in the case where
domain-based parallel replication is used with multiple domains, suggesting
instead the user to set @@gtid_slave_pos to reliably skip the desired event.
The problem was that wait_for_slave_io_to_start reported that the io thread
was ready, when it was still initializing. This caused test suite to
continue too early, for example before the semi sync plugin was properly
enabled.
Fixed by introducing a new internal stage: "Preparing". Slave_IO_Running is
now set to "Yes" only when all initializing is done and the IO thread is
ready to read things from the master.
The only test affected by this change is rpl_flsh_tbls, which got stuck in
the preparing phase while trying to read the GTID position from a table.
Fixed by having this test waiting for Preparing instead of Yes.
NOT NULL constraint must be checked *after* the BEFORE triggers.
That is for INSERT and UPDATE statements even NOT NULL fields
must be able to store a NULL temporarily at least while
BEFORE INSERT/UPDATE triggers are running.
Problem:
========
1) Drop table queries are re-generated by server
before writing the events(queries) into binlog
for various reasons. If table name/db name contains
a non regular characters (like latin characters),
the generated query is wrong. Hence it breaks the
replication.
2) In the edge case, when table name/db name contains
64 characters, server is throwing an assert
assert(M_TBLLEN < 128)
3) In the edge case, when db name contains 64 latin
characters, binlog content is interpreted badly
which is leading replication failure.
Analysis & Fix :
================
1) Parser reads the table name from the query and converts
it to standard charset(utf8) and stores it in table_name variable.
When drop table query is regenerated with the same table_name
variable, it should be converted back to the original charset
from standard charset(utf8).
2) Latin character takes two bytes for each character. Limit
of the identifier is 64. SYSTEM_CHARSET_MBMAXLEN is set to '3'.
So there is a possiblity that tablename/dbname contains 3 * 64.
Hence assert is changed to
(M_TBLLEN <= NAME_CHAR_LEN*SYSTEM_CHARSET_MBMAXLEN)
3) db_len in the binlog event header is taking 1 byte.
db_len is ranged from 0 to 192 bytes (3 * 64).
While reading the db_len from the event, server
is casting to uint instead of uchar which is leading
to bad db_len. This problem is fixed by changing the
cast type to uchar.
- Better error from check_slave_param
- Better error message from TokuDB if it can't be compiled.
- Marked rpl_mixed_drop_create_temp_table and
rpl_stm_drop_create_temp_table as big tests to stop timeout
failures on power8
- Added sync_slave_with_master to semisync_future-7591 to
ensure that slave is up to date with master before calling
rpl_end.
- Disabled compiler warnings from connect and mroonga and on
MacOSX.
Mroonga:
- Fixed bug when testing if file is a normal file that can be deleted
- Marked a lot of date and datetime test to not run on macosx.
This is because mktime() can't handle negative years and this
restricts mroonga so that it can only store dates after the year 1900.
- Added some extra command to rpl_start_stop to ensure that the
IO thread has connected to the master before we shut down the server.
- if signal returns signalhandler_t, use this with the alarm code
- Added missing tests to sys_vars
- Fixed some possible overflow bugs in tabxml.cpp
Post-fix: The test case pushed with the fix had each node
acting as slave to the other two nodes with different set
of filters on server_id's. The slave's gtid_slave_pos is
updated after it processes the events received from master
nodes irrespective of whether the events were filtered
or not. Thus, sync_with_master_gtid.inc could unblock even
on filtered events.
As a result, sync_with_master_gtid.inc would fail to block
until the desired changes have been replicated. Fixed by
simplifying the topology.
Also, modified CHANGE MASTER commands to ignore based
on gtid_domain_id instead of server_id.