when replicating old temporal type fields (that don't store
metadata in the binlog), take the precision from
destination fields.
(this fixes the replication failure, crashes were
fixed in a different commit)
This occurs when replication stops with an error, domain-based parallel
replication is used, and the GTID position contains more than one domain.
Furthermore, it relates to the case where the SQL thread is restarted
without first stopping the IO thread.
In this case, the file/offset relay-log position does not correctly
represent the slave's multi-dimensional position, because other domains may
be far ahead of, or behind, the domain with the failing event. So the code
reverts the relay log position back to the start of a relay log file that is
known to be before all active domains.
There was a bug that when the SQL thread was restarted, the
rli->relay_log_state was incorrectly initialised from @@gtid_slave_pos. This
position will likely be too far ahead, due to reverting the relay log
position. Thus, if the replication fails again after the SQL thread restart,
the rli->restart_gtid_pos might be updated incorrectly. This in turn would
cause a second SQL thread restart to replicate from the wrong position, if
the IO thread was still left running.
The fix is to initialise rli->relay_log_state from @@gtid_slave_pos only
when we actually purge and re-fetch relay logs from the master, not at every
SQL thread start.
A related problem is the use of sql_slave_skip_counter to resolve
replication failures in this kind of scenario. Since the slave position is
multi-dimensional, sql_slave_skip_counter can not work properly - it is
indeterminate exactly which event is to be skipped, and is unlikely to work
as expected for the user. So make this an error in the case where
domain-based parallel replication is used with multiple domains, suggesting
instead the user to set @@gtid_slave_pos to reliably skip the desired event.
The problem was that wait_for_slave_io_to_start reported that the io thread
was ready, when it was still initializing. This caused test suite to
continue too early, for example before the semi sync plugin was properly
enabled.
Fixed by introducing a new internal stage: "Preparing". Slave_IO_Running is
now set to "Yes" only when all initializing is done and the IO thread is
ready to read things from the master.
The only test affected by this change is rpl_flsh_tbls, which got stuck in
the preparing phase while trying to read the GTID position from a table.
Fixed by having this test waiting for Preparing instead of Yes.
NOT NULL constraint must be checked *after* the BEFORE triggers.
That is for INSERT and UPDATE statements even NOT NULL fields
must be able to store a NULL temporarily at least while
BEFORE INSERT/UPDATE triggers are running.
Problem:
========
1) Drop table queries are re-generated by server
before writing the events(queries) into binlog
for various reasons. If table name/db name contains
a non regular characters (like latin characters),
the generated query is wrong. Hence it breaks the
replication.
2) In the edge case, when table name/db name contains
64 characters, server is throwing an assert
assert(M_TBLLEN < 128)
3) In the edge case, when db name contains 64 latin
characters, binlog content is interpreted badly
which is leading replication failure.
Analysis & Fix :
================
1) Parser reads the table name from the query and converts
it to standard charset(utf8) and stores it in table_name variable.
When drop table query is regenerated with the same table_name
variable, it should be converted back to the original charset
from standard charset(utf8).
2) Latin character takes two bytes for each character. Limit
of the identifier is 64. SYSTEM_CHARSET_MBMAXLEN is set to '3'.
So there is a possiblity that tablename/dbname contains 3 * 64.
Hence assert is changed to
(M_TBLLEN <= NAME_CHAR_LEN*SYSTEM_CHARSET_MBMAXLEN)
3) db_len in the binlog event header is taking 1 byte.
db_len is ranged from 0 to 192 bytes (3 * 64).
While reading the db_len from the event, server
is casting to uint instead of uchar which is leading
to bad db_len. This problem is fixed by changing the
cast type to uchar.
- Better error from check_slave_param
- Better error message from TokuDB if it can't be compiled.
- Marked rpl_mixed_drop_create_temp_table and
rpl_stm_drop_create_temp_table as big tests to stop timeout
failures on power8
- Added sync_slave_with_master to semisync_future-7591 to
ensure that slave is up to date with master before calling
rpl_end.
- Disabled compiler warnings from connect and mroonga and on
MacOSX.
Mroonga:
- Fixed bug when testing if file is a normal file that can be deleted
- Marked a lot of date and datetime test to not run on macosx.
This is because mktime() can't handle negative years and this
restricts mroonga so that it can only store dates after the year 1900.
- Added some extra command to rpl_start_stop to ensure that the
IO thread has connected to the master before we shut down the server.
- if signal returns signalhandler_t, use this with the alarm code
- Added missing tests to sys_vars
- Fixed some possible overflow bugs in tabxml.cpp
Post-fix: The test case pushed with the fix had each node
acting as slave to the other two nodes with different set
of filters on server_id's. The slave's gtid_slave_pos is
updated after it processes the events received from master
nodes irrespective of whether the events were filtered
or not. Thus, sync_with_master_gtid.inc could unblock even
on filtered events.
As a result, sync_with_master_gtid.inc would fail to block
until the desired changes have been replicated. Fixed by
simplifying the topology.
Also, modified CHANGE MASTER commands to ignore based
on gtid_domain_id instead of server_id.
Problem & Analysis: If DML invokes a trigger or a
stored function that inserts into an AUTO_INCREMENT column,
that DML has to be marked as 'unsafe' statement. If the
tables are locked in the transaction prior to DML statement
(using LOCK TABLES), then the same statement is not marked as
'unsafe' statement. The logic of checking whether unsafeness
is protected with if (!thd->locked_tables_mode). Hence if
we lock the tables prior to DML statement, it is *not* entering
into this if condition. Hence the statement is not marked
as unsafe statement.
Fix: Irrespective of locked_tables_mode value, the unsafeness
check should be done. Now with this patch, the code is moved
out to 'decide_logging_format()' function where all these checks
are happening and also with out 'if(!thd->locked_tables_mode)'.
Along with the specified test case in the bug scenario
(BINLOG_STMT_UNSAFE_AUTOINC_COLUMNS), we also identified that
other cases BINLOG_STMT_UNSAFE_AUTOINC_NOT_FIRST,
BINLOG_STMT_UNSAFE_WRITE_AUTOINC_SELECT, BINLOG_STMT_UNSAFE_INSERT_TWO_KEYS
are also protected with thd->locked_tables_mode which is not right. All
of those checks also moved to 'decide_logging_format()' function.
In domain ID based filtering, a flag is used to filter-out
the events that belong to a particular domain. This flag gets
set when IO thread receives a GTID_EVENT for the domain on
filter list and its reset at the last event in the GTID group.
The resetting, however, was wrongly done before the decision to
write/filter the event from relay log is made. As a result, the
last event in the group will always pass through the filter.
Fixed by deferring the reset logic. Also added a test case.
Problem is that FLUSH TABLES WITH READ LOCK first blocks threads from
starting new commits, then waits for running commits to complete. But
in-order parallel replication needs commits to happen in a particular
order, so this can easily deadlock.
To fix this problem, this patch introduces a way to temporarily pause
the parallel replication worker threads. Before starting FTWRL, we let
all worker threads complete in-progress transactions, and then
wait. Then we proceed to take the global read lock. Once the lock is
obtained, we unpause the worker threads. Now commits are blocked from
starting by the global read lock, so the deadlock will no longer occur.
(even when configured with --binlog-format=statement).
Before we got an error on the slave and the slave stopped if the master
was configured with --binlog-format=mixed or --binlog-format=row.
if we didn't write the CREATE TEMPORARY TABLE statement.
- Enable old code from one of my older changesets that didn't make into 10.0
- Fix test cased that failed as they expected DROP TEMPORARY TABLE in the log.
Use sync_with_master_gtid.inc instead of --sync_with_master. The latter is
not correct because the test case uses RESET MASTER; this invalidates the
existing binlog positions on the slave.
In this particular case, there was a small window where --sync_with_master
could trigger too early (on the old position), causing the test case to miss
one event.
Adjust the test case to try and avoid some sporadic failures on loaded test
hosts.
The wait for SQL thread to stop may complete before worker threads
have completed.
The code was using the wrong variable when comparing the binlog name
for the UNTIL position. This could cause the comparison to fail after
binlog rotation, in turn causing the UNTIL clause to not trigger slave
stop.