The cause of the current bug is an unguarded access to mi->slave_running
by the shutdown thread calling end_slave(). This is the issue of bug#29968
(which, alas, was not cross-linked with the current bug).
Fixed:
by removing the unguarded read of the running status
and reading it in terminate_slave_thread() instead,
while run_lock is held (mostly a backport of bug#29968, with some
improvements over that patch - see the error reporting from
terminate_slave_thread()).
The issue of bug#38716 is fixed here for the 5.0 branch as well.
Note:
There has been a separate artifact identified -
a race condition between init_slave() and end_slave() -
reported as Bug#44467.
- Remove bothersome warning messages. This change focuses on the warnings
that are covered by the ignore file: support-files/compiler_warnings.supp.
- Strings are guaranteed to be max uint in length
Problem: if the IO slave thread is attempting to connect,
STOP SLAVE waits for the attempt to finish.
It may take a long time.
Fix: don't wait, stop the slave immediately.
MASTER_POS_WAIT return values differ from what is expected when the server is not a slave.
It returns -1 instead of NULL.
Fixed by correcting st_relay_log_info::wait_for_pos() to return the proper
value when the rli info is not initialized.
log-slave-updates and circular replication
The slave SQL thread may execute one extra event when there are events
skipped by the slave I/O thread (e.g. events originated by the same server),
even though the UNTIL condition requested it not to do so.
This happens because we compare with the end position of the previously
executed event. This is fine when no events were skipped by the slave I/O
thread, as the end position of the previous event equals the start
position of the event to be executed. Otherwise this position equals the
start position of the skipped event.
This is fixed by:
- reading the event to be executed before checking if the until condition
is satisfied.
- comparing the start position of the event to be executed. Since we do
not have the start position available, we compute it by subtracting
event length from end position (which is available).
- if, at the time the slave SQL thread starts, there are no events on the
event queue that meet the until condition, we stop immediately, as in this
case we do not want to wait for the next event.
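The position arithmetic from the list above can be sketched as follows. This is only an illustration: Event_sketch, start_pos() and until_satisfied() are hypothetical names, the log file name comparison is left out, and the use of ">=" for the comparison is an assumption.

```cpp
#include <cassert>

// Hypothetical, simplified view of an event: only the two values the
// text mentions (end position and length) are kept.
struct Event_sketch
{
  unsigned long long end_pos;   // position right after the event
  unsigned long long length;    // length of the event in bytes
};

// The start position is not stored, so derive it: end position minus length.
unsigned long long start_pos(const Event_sketch &ev)
{
  assert(ev.length <= ev.end_pos);
  return ev.end_pos - ev.length;
}

// True if the SQL thread should stop before executing `next`.
// Assumption: the until condition is met once the start position of the
// event to be executed has reached until_pos.
bool until_satisfied(const Event_sketch &next, unsigned long long until_pos)
{
  return start_pos(next) >= until_pos;
}
```

With an executed event ending at 100, a skipped event covering 100..150 and the next event covering 150..200, an until position of 120 stops the thread before the next event, whereas a comparison against the previous end position (100) would have let it run.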
and
bug#33932 assertion at handle_slave_sql if init_slave_thread() fails
The asserts were caused by:
bug33931: thd already being deleted by the time the err: code executes, plus
a missed initialization;
bug33932: a missed initialization of the slave_is_running member.
Fixed by relocating the initialization of the mi members and removing the delete of thd.
This is safe because the deletion happens later, explicitly, in the caller of
init_slave_thread().
Todo: when merging, the test is better moved into suite/bugs for 5.x (where x>0).
to leave
The artifact was caused by
a flaw in concurrent access to the slave's io thd by
the io thread itself and a thread handling SHOW SLAVE STATUS.
Namely, show_master_info did not acquire the mi->run_lock mutex that is
specified to guard the mi->io_thd member.
Fixed by adding the mutex locking and unlocking. The mutex is held for a
short time and without interleaving with the mi->data_lock mutex.
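A minimal sketch of that locking pattern, assuming a stripped-down stand-in for the master info structure. Mi_lock_sketch, Status_row_sketch and show_master_info_sketch() are hypothetical; only the run_lock and data_lock names come from the text, and which field data_lock guards here is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>
#include <string>

// Hypothetical stand-in for the master info structure.
struct Mi_lock_sketch
{
  pthread_mutex_t run_lock;            // guards io_thd_state
  pthread_mutex_t data_lock;           // guards master_log_pos (assumption)
  const char *io_thd_state;            // stands for data read via mi->io_thd
  unsigned long long master_log_pos;
};

struct Status_row_sketch
{
  std::string io_state;
  unsigned long long master_log_pos;
};

Status_row_sketch show_master_info_sketch(Mi_lock_sketch *mi)
{
  assert(mi != NULL);
  Status_row_sketch row;

  // Copy what is needed under run_lock and release it right away.
  pthread_mutex_lock(&mi->run_lock);
  row.io_state = mi->io_thd_state ? mi->io_thd_state : "";
  pthread_mutex_unlock(&mi->run_lock);

  // data_lock is taken only after run_lock has been released,
  // so the two mutexes are never held at the same time.
  pthread_mutex_lock(&mi->data_lock);
  row.master_log_pos = mi->master_log_pos;
  pthread_mutex_unlock(&mi->data_lock);
  return row;
}
```

Copying the value out instead of keeping a pointer keeps the critical section short and avoids any lock-ordering question between the two mutexes.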
Todo: report and fix an issue with
sys_var_slave_skip_counter::{methods},
which seem to incorrectly acquire
active_mi->rli.run_lock
instead of the specified
active_mi->rli.data_lock
A test case is difficult to compose, so rpl_packet should continue serving
as the indicator.
Complementary patch since LOAD DATA INFILE was not covered in
the previous patch.
This patch adds a check so that the slave skip counter is not
decreased to zero on seeing a BEGIN_LOAD_QUERY_EVENT,
APPEND_BLOCK_EVENT, or CREATE_FILE_EVENT, since these cannot
end a group. The group is terminated by an
EXECUTE_LOAD_QUERY_EVENT or DELETE_FILE_EVENT.
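The check described above might look roughly like this. The event type names are the ones from the text; the enum, its values, QUERY_EVENT_SKETCH, and both functions are hypothetical and only illustrate the rule.

```cpp
#include <cassert>

// Illustrative enum: the names come from the text, the values do not
// match the server's definition.
enum Event_type_sketch
{
  QUERY_EVENT_SKETCH,            // stands for any ordinary event
  BEGIN_LOAD_QUERY_EVENT,
  APPEND_BLOCK_EVENT,
  CREATE_FILE_EVENT,
  EXECUTE_LOAD_QUERY_EVENT,
  DELETE_FILE_EVENT
};

// True for events that are always followed by more events of the same group.
bool cannot_end_group(Event_type_sketch type)
{
  return type == BEGIN_LOAD_QUERY_EVENT ||
         type == APPEND_BLOCK_EVENT ||
         type == CREATE_FILE_EVENT;
}

// Returns the skip counter after skipping one event of the given type.
unsigned long skip_counter_after(unsigned long counter, Event_type_sketch type)
{
  assert(counter > 0);           // only meaningful while events are being skipped
  if (counter == 1 && cannot_end_group(type))
    return counter;              // stay at 1 until the group is terminated
  return counter - 1;
}
```

The counter only reaches zero on an event that can close the group, so the slave never resumes execution in the middle of a LOAD DATA INFILE group.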
The report claims that Seconds_behind_master behaves unexpectedly.
Code analysis shows an evident flaw: the FormatDescription event is treated incorrectly,
so that after FLUSH LOGS on the slave the calculation of Seconds_behind_master slips and an incorrect
value can be reported by SHOW SLAVE STATUS.
Even worse, the gap between the correct and incorrect deltas grows with time.
Fixed by prohibiting changes to rpl->last_master_timestamp by artificial events (of any kind).
A suggestion is added as comments on how to cope with the lack of info on the slave side by means of
the upcoming heartbeat feature.
The test cannot easily be made fully deterministic.
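The rule about artificial events can be illustrated with a small sketch. Timed_event_sketch, Rli_ts_sketch and update_last_master_timestamp() are hypothetical names; only last_master_timestamp comes from the text, and the "artificial" flag is a simplified stand-in for however the server marks such events.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>

// Hypothetical event: its timestamp plus a flag telling whether it was
// generated artificially (e.g. on log rotation) rather than by a
// statement executed on the master.
struct Timed_event_sketch
{
  time_t when;
  bool artificial;
};

// Hypothetical stand-in for the structure holding the timestamp.
struct Rli_ts_sketch
{
  time_t last_master_timestamp;
};

// Only real events may move the timestamp used for Seconds_behind_master.
void update_last_master_timestamp(Rli_ts_sketch *rli, const Timed_event_sketch &ev)
{
  assert(rli != NULL);
  if (!ev.artificial)
    rli->last_master_timestamp = ev.when;
}
```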
In case of an out-of-memory error received from the master, print the corresponding message to the error log and stop the slave I/O thread to avoid reconnecting with a wrong binary log position.
The issue found with bug 25411 is due to the function skip_rear_comments(),
which damages the source code while implementing a workaround.
The root cause of the problem is in the lexical analyser, which does not
process special comments properly.
For special comments like:
[1] aaa /*!50000 bbb */ ccc
since 5.0 is a version older than the current code, the parser inlines
the content of the special comment, so that the query to process is
[2] aaa bbb ccc
However, the text of the query captured when processing a stored procedure,
stored function or trigger (or event in 5.1) can be, after rebuilding it:
[3] aaa bbb */ ccc
which is wrong.
To fix bug 25411 properly, the lexical analyser needs to return [2] when
inlining special comments.
In order to implement this, some preliminary cleanup is required in the code,
which is implemented by this patch.
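To make the transformation from [1] to [2] concrete, here is a standalone sketch. It is not the server's lexer: inline_special_comments() is a hypothetical helper, the version is passed as a plain number (50100 standing for 5.1.0 is an assumption about the encoding), and whitespace around the comment is kept as is, so the result has two spaces where [2] shows one.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: expands /*!NNNNN ... */ comments whose version
// number is <= current_version and drops the others entirely.
std::string inline_special_comments(const std::string &query,
                                    unsigned long current_version)
{
  std::string out;
  std::string::size_type pos = 0;
  while (pos < query.size())
  {
    std::string::size_type open = query.find("/*!", pos);
    if (open == std::string::npos)
    {
      out.append(query, pos, std::string::npos);   // no more special comments
      break;
    }
    out.append(query, pos, open - pos);            // text before the comment

    // Parse the optional version digits right after "/*!".
    std::string::size_type p = open + 3;
    unsigned long version = 0;
    while (p < query.size() && std::isdigit((unsigned char) query[p]))
      version = version * 10 + (unsigned long) (query[p++] - '0');

    std::string::size_type close = query.find("*/", p);
    if (close == std::string::npos)
    {
      out.append(query, open, std::string::npos);  // unterminated: keep as is
      break;
    }
    if (version <= current_version)
      out.append(query, p, close - p);             // inline the comment body
    pos = close + 2;                               // the closing "*/" is dropped too
  }
  return out;
}
```

The point of the sketch is the last step: the closing "*/" is consumed together with the opening marker, which is exactly what the rebuilt text in [3] fails to do.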
Before this change, the structure named LEX (or st_lex) contained attributes
that belong to lexical analysis, as well as attributes that represent the
abstract syntax tree (AST) of a statement.
Creating a new LEX structure for each statement (which makes sense for the
AST part) also re-initialized the lexical analysis phase each time, which
is conceptually wrong.
With this patch, the previous st_lex structure has been split in two:
- st_lex represents the Abstract Syntax Tree for a statement. The name "lex"
has not been changed to avoid a bigger impact in the code base.
- class lex_input_stream represents the internal state of the lexical
analyser, which by definition should *not* be reinitialized when parsing
multiple statements from the same input stream.
This change is a pre-requisite for bug 25411, since the implementation of
lex_input_stream will later improve to deal properly with special comments,
and this processing can not be done with the current implementation of
sp_head::reset_lex and sp_head::restore_lex, which interfere with the lexer.
This change set alone does not fix bug 25411.
Problem: to handle the situation where the size of an event on the master is greater than max_allowed_packet on the slave, we checked for the wrong constant (ER_NET_PACKET_TOO_LARGE instead of CR_NET_PACKET_TOO_LARGE).
Solution: test for the client "packet too large" error code instead of the server one in the slave I/O thread.
"INSERT... ON DUPLICATE KEY UPDATE skips auto_increment values"
didn't make it into 5.0.36 and 5.1.16,
so we need to adjust the bug-detection-based-on-version-number code.
Because the version of the rpl tree is too old, rpl_insert_id cannot pass,
so I disable it (as is already the case in 5.1-rpl for the same reason),
and the repl team will re-enable it when they merge 5.0 and 5.1 into
their trees (thus getting the right version number).
"INSERT... ON DUPLICATE KEY UPDATE skips auto_increment values".
When an INSERT ON DUPLICATE KEY UPDATE on a table with
an autoincrement column inserted some autogenerated values and
also updated some rows, some autogenerated values were not used
(for example, even if 10 was the largest autoinc value in the table
at the start of the statement, 12 could be the first autogenerated
value inserted by the statement, instead of 11). One autogenerated
value was lost per updated row. This led to exhausting the range of the
autoincrement column faster.
Bug introduced by fix of BUG#20188; present since 5.0.24 and 5.1.12.
This bug breaks replication from a pre-5.0.24 master.
But the present bugfix, as it makes INSERT ON DUP KEY UPDATE
behave like pre-5.0.24, breaks replication from a [5.0.24,5.0.34]
master to a fixed (5.0.36) slave! To warn users against this when
they upgrade their slave, as agreed with the support team, we add
code for a fixed slave to detect that it is connected to a buggy
master in a situation (INSERT ON DUP KEY UPDATE into autoinc column)
likely to break replication, in which case it cannot replicate so
stops and prints a message to the slave's error log and to SHOW SLAVE
STATUS.
For 5.0.36->[5.0.24,5.0.34] replication we cannot warn as master
does not know the slave's version (but we always recommended to users
to have slave at least as new as master).
As agreed with support, I'll also ask for an alert to be put into
the MySQL Network Monitoring and Advisory Service.
The possibility of the race is removed by changing the sequence of calls
pthread_mutex_unlock(&mi->run_lock);
pthread_cond_broadcast(&mi->stop_cond);
into
pthread_cond_broadcast(&mi->stop_cond);
pthread_mutex_unlock(&mi->run_lock);
at the end of the I/O thread (with a similar change at the end of the SQL thread). This
ensures that no thread waiting on the condition can run between the broadcast and the
unlock, and thus none can delete the mi structure in that window, which is what caused the bug.
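A self-contained sketch of the corrected ordering, using plain pthreads. Mi_stop_sketch, io_thread_sketch() and stop_and_free_sketch() are hypothetical; only run_lock, stop_cond and the order of the two calls come from the text. The pthread_join() is there to keep the example tidy and is not part of the described fix.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

// Hypothetical stand-in for the master info structure.
struct Mi_stop_sketch
{
  pthread_mutex_t run_lock;
  pthread_cond_t stop_cond;
  bool slave_running;
};

// End of the sketched I/O thread: broadcast while still holding run_lock.
void *io_thread_sketch(void *arg)
{
  Mi_stop_sketch *mi = static_cast<Mi_stop_sketch *>(arg);
  assert(mi != NULL);
  pthread_mutex_lock(&mi->run_lock);
  mi->slave_running = false;
  pthread_cond_broadcast(&mi->stop_cond);  // waiters wake but cannot get run_lock yet
  pthread_mutex_unlock(&mi->run_lock);     // last access to *mi by this thread
  return NULL;
}

// Stopping side: waits for the I/O thread to finish, then frees mi.
bool stop_and_free_sketch()
{
  Mi_stop_sketch *mi = new Mi_stop_sketch;
  pthread_mutex_init(&mi->run_lock, NULL);
  pthread_cond_init(&mi->stop_cond, NULL);
  mi->slave_running = true;

  pthread_t io;
  bool started = pthread_create(&io, NULL, io_thread_sketch, mi) == 0;
  if (started)
  {
    pthread_mutex_lock(&mi->run_lock);
    while (mi->slave_running)              // loop guards against spurious wakeups
      pthread_cond_wait(&mi->stop_cond, &mi->run_lock);
    pthread_mutex_unlock(&mi->run_lock);
    pthread_join(io, NULL);                // reap the thread before freeing mi
  }

  pthread_cond_destroy(&mi->stop_cond);
  pthread_mutex_destroy(&mi->run_lock);
  delete mi;
  return started;
}
```

With the old order (unlock, then broadcast) a waiter could observe slave_running == false, free the structure, and leave the I/O thread calling pthread_cond_broadcast() on freed memory.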
The relay log may not be open for some reason (e.g. disk error) after rotation,
and using it causes the slave to crash.
Fix: check that it is open before accessing it, and return an error otherwise.
The update_slave_list() call is a remnant of attempts to implement failsafe
replication. This code is now obsolete and not maintained (see comments in
rpl_failsafe.cc).
Inspecting the code, one can see that this function does not interfere with normal
slave operation and thus can be safely removed. This will solve the issue
reported in the bug (errors on slave reconnection).
A related issue is to remove unnecessary reconnections done by the slave. This is
handled in the patch for BUG#20435.
- Removed unused variables and functions
- Added #ifdef around code that is not used
- Renamed variables and functions to avoid conflicts
- Removed some unused arguments
Fixed some class/struct warnings in ndb
Added define IS_LONGDATA() to simplify code in libmysql.c
I ran gcov on the changes and added 'purecov' comments on almost all lines that were not just variable name changes
Fixed compiler warnings (detected by VC++):
- Removed unused variables
- Added casts
- Fixed wrong assignments to bool
- Fixed wrong calls with bool arguments
- Added missing argument to store(longlong), which caused wrong store method to be called.
(Mostly in DBUG_PRINT() and unused arguments)
Fixed bug in query cache when used with tracing (--with-debug)
Fixed memory leak in mysqldump
Removed warnings from mysqltest scripts (replaced -- with #)