The SQL thread keeps track of the position in the current relay log from
which to read the next event. This position is not normally used, but a
certain interaction with the IO thread can cause the SQL thread to re-open
the relay log and seek to the stored position.
In parallel replication, there were a couple of places where the position
was not updated. This created a race where a re-open of the relay log could
seek to the wrong position and start re-reading and processing events
already handled once, causing various kinds of problems.
Fix this by moving the position update into a single place in
apply_event_and_update_pos(), which should ensure that the position is
always updated in the parallel replication case.
This problem was found from the testcase of MDEV-10863, but it is logically
a separate problem.
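The single update point described above can be sketched roughly as follows. This is a minimal illustration, not the server code: RelayPos, Event, apply_and_update_pos() and apply_all() are hypothetical names, and applying the event itself is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the relay log state; not MariaDB types.
struct Event { std::uint64_t end_pos; };

struct RelayPos {
  std::uint64_t next_read_pos = 0;  // where a re-opened relay log would seek to
};

// Single choke point: every applied event advances the stored position here,
// regardless of which code path handed us the event.
inline void apply_and_update_pos(RelayPos &pos, const Event &ev) {
  // ... applying the event is omitted in this sketch ...
  pos.next_read_pos = ev.end_pos;
}

// Apply a batch and return the position a re-open would seek to.
inline std::uint64_t apply_all(const std::vector<Event> &events) {
  RelayPos pos;
  for (const Event &ev : events)
    apply_and_update_pos(pos, ev);
  return pos.next_read_pos;
}
```

Because no caller updates the position on its own, a path that forgets the update cannot exist in this shape.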
This has no functional changes, but it helps avoid merge problems from 10.0
to 10.1. In 10.0, code that checks for parallel replication uses
opt_slave_parallel_threads > 0, but this check needs to be
mi->using_parallel() in 10.1. By using the same check in 10.0 (with
unchanged semantics), merge problems to 10.1 are avoided.
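As a rough sketch of the idea, a named predicate can wrap the 10.0 condition so call sites read the same in both branches. MasterInfoSketch and its parallel_threads member are hypothetical stand-ins; only the name using_parallel() and the "> 0" check come from the description above.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the server's Master_info.
struct MasterInfoSketch {
  unsigned long parallel_threads;  // plays the role of opt_slave_parallel_threads

  // Same semantics as the old inline check, but behind one name.
  bool using_parallel() const { return parallel_threads > 0; }
};
```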
This occurred when the SQL thread (but not the IO thread) stops and is
restarted while GTID and parallel replication are used with multiple
domain ids in the GTID position.
In this case, the SQL thread needs to start some way back in the relay log,
applying or skipping events within each replication domain as
appropriate.
The SQL thread starts at the beginning of an old relay log file, and
this position may be in the middle of an event group. The bug was that
such a partial event group could be re-applied, causing replication
corruption.
This patch fixes the issue, by making sure to skip any initial events
that were part of an earlier (already applied) event group.
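The skipping rule can be illustrated with a simplified sketch. It assumes a hypothetical event model (Ev, events_to_apply) in which each event only records whether it starts a new group; the real fix also has to track GTID positions per replication domain, which is left out here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical event model: only whether an event starts a new group matters.
struct Ev { bool starts_group; int id; };

// Return the ids of the events that should be applied: everything from the
// first group start onward. Events before it belong to a group that began in
// an earlier relay log and is assumed to have been applied already.
inline std::vector<int> events_to_apply(const std::vector<Ev> &log) {
  std::vector<int> out;
  bool seen_group_start = false;
  for (const Ev &e : log) {
    if (e.starts_group) seen_group_start = true;
    if (seen_group_start) out.push_back(e.id);
  }
  return out;
}
```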
cherry-pick from 5.7:
commit 6b24763
Author: Manish Kumar <manish.4.kumar@oracle.com>
Date: Tue Mar 27 13:10:42 2012 +0530
BUG#12977988 - ON STOP SLAVE: ERROR READING PACKET FROM SERVER: LOST CONNECTION
TO MYSQL SERVER
BUG#11761457 - ERROR 2013 + "ERROR READING RELAY LOG EVENT" ON STOP SLAVE
Minor review comments/changes:
- A bunch of style-fixes.
- Change macros to static inline functions.
- Update check_event_type() with compressed event types.
- Small .result file update.
Add event types for compressed events:
QUERY_COMPRESSED_EVENT,
WRITE_ROWS_COMPRESSED_EVENT_V1,
UPDATE_ROWS_COMPRESSED_EVENT_V1,
DELETE_ROWS_COMPRESSED_EVENT_V1,
WRITE_ROWS_COMPRESSED_EVENT,
UPDATE_ROWS_COMPRESSED_EVENT,
DELETE_ROWS_COMPRESSED_EVENT.
These events inherit from the corresponding uncompressed events. One of their
constructors and the write function have been overridden to uncompress and
compress; everything else is the same.
On the slave, the IO thread uncompresses and converts the events when it
receives them from the master, so the SQL and worker threads can stay unchanged.
Currently zlib is used as the compression algorithm. Other algorithms may be
supported in the future.
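The conversion step done by the IO thread can be pictured as a type mapping. The sketch below assumes a hypothetical enum (EvType) and helper (uncompressed_type); the enumerator values are not the server's real event type codes, the V1 variants are omitted, and the actual zlib decompression of the payload is not shown.

```cpp
#include <cassert>

// Hypothetical enum for illustration; values are NOT real event type codes.
enum class EvType {
  Query, WriteRows, UpdateRows, DeleteRows,
  QueryCompressed, WriteRowsCompressed,
  UpdateRowsCompressed, DeleteRowsCompressed
};

// Map a compressed type to its uncompressed counterpart; leave others as-is.
inline EvType uncompressed_type(EvType t) {
  switch (t) {
    case EvType::QueryCompressed:      return EvType::Query;
    case EvType::WriteRowsCompressed:  return EvType::WriteRows;
    case EvType::UpdateRowsCompressed: return EvType::UpdateRows;
    case EvType::DeleteRowsCompressed: return EvType::DeleteRows;
    default:                           return t;
  }
}
```

After such a mapping, code downstream of the IO thread only ever sees the uncompressed types.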
Merge feature into 10.2 from feature branch.
Delayed replication adds an option
CHANGE MASTER TO master_delay=<seconds>
Replication will then delay applying events with that many
seconds. This creates a replication slave that reflects the state of
the master some time in the past.
Feature is ported from MySQL source tree.
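A small sketch of the timing arithmetic, under the assumption that the delay is measured from the event's timestamp on the master. seconds_to_wait() and its parameters are hypothetical, and clock differences between master and slave are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Seconds the SQL thread should still wait before applying an event.
// event_ts: when the event ran on the master (seconds since epoch, assumed).
// delay:    the configured master_delay in seconds.
// now:      current time on the slave, same clock assumed.
inline std::int64_t seconds_to_wait(std::int64_t event_ts,
                                    std::int64_t delay,
                                    std::int64_t now) {
  const std::int64_t due = event_ts + delay;
  return due > now ? due - now : 0;
}
```

For example, an event stamped 1000 with master_delay=60 would still wait 50 seconds when the slave clock reads 1010.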
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Problem:
When using the delayed slave feature, and the SQL thread is delaying,
and the user issues STOP SLAVE, the event we wait for was executed.
It should not be executed.
Fix:
Check the return value from the delay function,
slave.cc:slave_sleep(). If the return value is 1, the thread has been
stopped, and in this case we don't execute the statement.
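The control flow of the fix can be sketched as below. execute_after_delay() is a hypothetical helper; delay_fn models the delay function and returns true when the thread was stopped while waiting.

```cpp
#include <cassert>
#include <functional>

// delay_fn models the delay function: it returns true if the thread was
// stopped while waiting, false if the full delay elapsed.
// Returns true if the event was executed.
inline bool execute_after_delay(const std::function<bool()> &delay_fn,
                                const std::function<void()> &execute) {
  if (delay_fn())
    return false;  // stopped during the delay: the event must not run
  execute();
  return true;
}
```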
Also refactored the test case for delayed slave a little: added the
test script include/rpl_assert.inc, which asserts that a condition holds
and prints a message if not. Made rpl_delayed_slave.test use this. The
advantage is that the test file is much easier to read and maintain,
because it is clear what is an assertion and what is not. The expected
result can also be found in the test file, so you don't have to compare
it to the result file.
Manually merged into MariaDB from MySQL commit
fd2b210383358fe7697f201e19ac9779879ba72a
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The original MySQL patch left some refactoring todos, possibly
because of known conflicts with other parallel development (like
the info-repository feature perhaps).
This patch fixes those todos/refactorings.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Initial merge of delayed replication from MySQL git.
The code from the initial push into MySQL is merged, and the
associated test case passes. A number of tasks are still pending:
1. Check full test suite run for any regressions or .result file updates.
2. Extend the feature to also work for parallel replication.
3. There are some todo-comments about future refactoring left from
MySQL, these should be located and merged on top.
4. There are some later related MySQL commits, these should be checked
and merged. These include:
e134b9362ba0b750d6ac1b444780019622d14aa5
b38f0f7857c073edfcc0a64675b7f7ede04be00f
fd2b210383358fe7697f201e19ac9779879ba72a
afc397376ec50e96b2918ee64e48baf4dda0d37d
5. The testcase from MySQL relies heavily on sleep and timing for
testing, and seems likely to sporadically fail on heavily loaded test
servers in buildbot or distro build farms.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The function apply_event_and_update_pos() is called with the
rli->data_lock mutex held. However, there seems to be nothing in the
function actually needing the mutex to be held. Certainly not in the
parallel replication case, where sql_slave_skip_counter is always 0
since the non-zero case is handled by the SQL driver thread.
So this patch makes parallel replication use a variant of
apply_event_and_update_pos() without the need to take the
rli->data_lock mutex. This avoids one contended global mutex for each
event executed, which might improve performance on CPU-bound workloads
somewhat.
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When a deadlock kill is detected inside the storage engine, the kill
is not done immediately, to avoid calling back into the storage engine
kill_query method with various lock subsystem mutexes held. Instead the
kill is queued and done later by a slave background thread.
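A minimal sketch of such a deferred kill queue, assuming a hypothetical DeferredKills class that identifies victims by thread id. The real kill call is replaced by returning the processed ids.

```cpp
#include <cassert>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical queue of pending kills, identified by thread id.
class DeferredKills {
 public:
  // Called from inside the engine: only records the request, never kills.
  void request_kill(unsigned long thread_id) {
    std::lock_guard<std::mutex> guard(mu_);
    pending_.push(thread_id);
  }

  // Called later from a background thread, with no engine mutexes held.
  // Returns the ids that were processed, in request order.
  std::vector<unsigned long> drain() {
    std::queue<unsigned long> local;
    {
      std::lock_guard<std::mutex> guard(mu_);
      local.swap(pending_);  // take everything, release the lock quickly
    }
    std::vector<unsigned long> done;
    while (!local.empty()) {
      done.push_back(local.front());  // the actual kill would happen here
      local.pop();
    }
    return done;
  }

 private:
  std::mutex mu_;
  std::queue<unsigned long> pending_;
};
```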
This patch is in preparation for fixing TokuDB optimistic parallel
replication, as well as for removing locking hacks in InnoDB/XtraDB in
10.2.
Signed-off-by: Kristian Nielsen <knielsen at knielsen-hq.org>
- When waiting for events, start time is now counted from start of wait
- Instead of having "Connect" as "Command" for all replication threads we
  now have:
  - Slave_IO for Slave thread reading relay log
  - Slave_SQL for slave executing SQL commands or distributing queries to
    Slave workers
  - Slave_worker for slave threads executing SQL commands in parallel replication
In well-defined C++ code, the "this" pointer is never NULL. Previously, we
were potentially dereferencing a NULL pointer (master_info_index). GCC v6
removes any "if (!this)" conditions as it assumes this is always a
non-null pointer. In order to prevent undefined behaviour, check the
pointer before dereferencing and remove the check within member
functions.
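The pattern can be sketched as follows. IndexSketch and safe_size() are hypothetical names used only for illustration; they are not the server's Master_info_index API.

```cpp
#include <cassert>

// Hypothetical, simplified index type; not the server's Master_info_index.
struct IndexSketch {
  int count = 0;
  // No "if (!this)" here: a member function may assume a valid object.
  int size() const { return count; }
};

// The null check lives at the call site, before the pointer is dereferenced.
inline int safe_size(const IndexSketch *idx) {
  return idx ? idx->size() : 0;
}
```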
- Fixed typos
- Added --core-on-failure to mysql-test-run
- More DBUG_PRINT in viosocket.c
- Don't forget CLIENT_REMEMBER_OPTIONS for compressed slave protocol
- Removed unused stage variables
Make the slave SQL thread always output to the error log the message "Slave
SQL thread exiting, replication stopped in ..." whenever it previously
outputted "Slave SQL thread initialized, starting replication ...".
Before this patch, it was somewhat inconsistent in which cases the message
would be output and in which not, depending on the exact time and cause of
the condition that caused the SQL thread to stop.
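One way to get this kind of pairing is a scope guard, sketched below. This is a swapped-in technique for illustration and not necessarily how the patch does it; SqlThreadLogScope and run_sql_thread() are hypothetical, and the messages are shortened.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical guard: logs the start line on construction and the exit line
// on destruction, so the two always come in pairs.
class SqlThreadLogScope {
 public:
  explicit SqlThreadLogScope(std::vector<std::string> &log) : log_(log) {
    log_.push_back("Slave SQL thread initialized");
  }
  ~SqlThreadLogScope() { log_.push_back("Slave SQL thread exiting"); }

 private:
  std::vector<std::string> &log_;
};

// Whatever path leaves this function, the exit line follows the start line.
inline void run_sql_thread(std::vector<std::string> &log, bool fail_early) {
  SqlThreadLogScope scope(log);
  if (fail_early) return;  // early error exit
  // ... normal processing is omitted in this sketch ...
}
```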
1. remove unnecessary rpl-tokudb combination file.
2. fix rpl_ignore_table to clean up properly (not leave test
grants in memory)
3. check_temp_dir() is supposed to set the error in stmt_da - do
it even when called multiple times, this fixes a crash when
rpl.rpl_slave_load_tmpdir_not_exist is run twice.
Cherry-picked commits from codership/mysql-wsrep.
MW-284: Slave I/O retry on ER_COM_UNKNOWN_ERROR
Slave would treat ER_COM_UNKNOWN_ERROR as fatal error and stop.
The fix here is to treat it as a network error and rely on the
built-in mechanism to retry.
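The change in classification can be sketched like this. IoError and is_retryable() are hypothetical; the real server works with numeric error codes, and the Fatal category is only a placeholder for errors that still stop the slave.

```cpp
#include <cassert>

// Hypothetical error categories; the real server uses numeric error codes.
enum class IoError { ConnectionLost, ComUnknownError, Fatal };

// Decide whether the IO thread should reconnect and retry.
inline bool is_retryable(IoError e) {
  switch (e) {
    case IoError::ConnectionLost:
    case IoError::ComUnknownError:  // treated like a network error after the fix
      return true;
    case IoError::Fatal:
      return false;
  }
  return false;
}
```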
MW-284: Add an MTR test
mysqld maintains a list of TABLE objects for all temporary
tables created within a session in THD. Here each table is
represented by a TABLE object.
A query referencing a particular temporary table more
than once, however, failed with an ER_CANT_REOPEN_TABLE error
because a TABLE_SHARE was allocated together with the TABLE,
so temporary tables always had only one TABLE per TABLE_SHARE.
This patch lifts this restriction by separating TABLE and
TABLE_SHARE objects and storing TABLE_SHAREs for temporary
tables in a list in THD, and TABLEs in a list within their
respective TABLE_SHAREs.
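The resulting ownership layout can be sketched as below. TableSketch, ShareSketch and SessionSketch are hypothetical, heavily simplified stand-ins for TABLE, TABLE_SHARE and THD; they only show that one share can now hold several open table instances.

```cpp
#include <cassert>
#include <list>
#include <string>

// Hypothetical, heavily simplified stand-ins for TABLE, TABLE_SHARE and THD.
struct TableSketch { int handle_id; };

struct ShareSketch {
  std::string name;
  std::list<TableSketch> tables;  // every open instance of this temp table
};

struct SessionSketch {
  std::list<ShareSketch> temp_shares;  // one share per temporary table
  int next_handle = 0;

  // Create the share for a new temporary table.
  void create_temp(const std::string &name) {
    temp_shares.push_back(ShareSketch{name, {}});
  }

  // Open another instance of an existing temp table; nullptr if unknown.
  TableSketch *open_temp(const std::string &name) {
    for (ShareSketch &s : temp_shares) {
      if (s.name == name) {
        s.tables.push_back(TableSketch{next_handle++});
        return &s.tables.back();
      }
    }
    return nullptr;
  }
};
```

Opening the same name twice yields two distinct instances under one share, which is the case that previously failed.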