Problem : The basic problem is the way the thread sleeps in mysql-5.5 and also in mysql-5.1
when we execute a stop slave on windows platform.
On windows platform if the stop slave is executed after the master dies, we have
this long wait before the stop slave return a value. This is because there is a
sleep of the thread. The sleep is uninterruptable in the two above version,
which was fixed by Davi patch for the BUG#11765860 for mysql-trunk. Backporting
his patch for mysql-5.5 fixes the problem.
Solution : A new pair of mutex and condition variable is introduced to synchronize thread
sleep and finalization. A new mutex is required because the slave threads are
terminated while holding the slave thread locks (run_lock), which can not be
relinquished during termination as this would affect the lock order.
Before BUG#28796, an empty host was used to identify that an instance was no
longer a slave. However, BUG#28796 changed this behavior and one cannot set
an empty host. Besides, a RESET SLAVE only cleans up information on the next
event to retrieve from the master, disables ssl and resets heartbeat period.
So a call to SHOW SLAVE STATUS after issuing a RESET SLAVE still returns some
valid information, such as host, port, user and password.
To fix this problem, we have introduced the command RESET SLAVE ALL that does
what a regular RESET SLAVE does and also clears host, port, user and password
information thus allowing users to identify when an instance is no longer a
slave.
Essentially, the problem is that safemalloc is excruciatingly
slow as it checks all allocated blocks for overrun at each
memory management primitive, yielding a almost exponential
slowdown for the memory management functions (malloc, realloc,
free). The overrun check basically consists of verifying some
bytes of a block for certain magic keys, which catches some
simple forms of overrun. Another minor problem is violation
of aliasing rules and that its own internal list of blocks
is prone to corruption.
Another issue with safemalloc is rather the maintenance cost
as the tool has a significant impact on the server code.
Given the magnitude of memory debuggers available nowadays,
especially those that are provided with the platform malloc
implementation, maintenance of a in-house and largely obsolete
memory debugger becomes a burden that is not worth the effort
due to its slowness and lack of support for detecting more
common forms of heap corruption.
Since there are third-party tools that can provide the same
functionality at a lower or comparable performance cost, the
solution is to simply remove safemalloc. Third-party tools
can provide the same functionality at a lower or comparable
performance cost.
The removal of safemalloc also allows a simplification of the
malloc wrappers, removing quite a bit of kludge: redefinition
of my_malloc, my_free and the removal of the unused second
argument of my_free. Since free() always check whether the
supplied pointer is null, redudant checks are also removed.
Also, this patch adds unit testing for my_malloc and moves
my_realloc implementation into the same file as the other
memory allocation primitives.
Problem: SQL and IO thread were racing for the IO_CACHE. The former to
flush it, the latter to close it. In some cases this would cause the
SQL thread to lock an invalid IO_CACHE mutex (it had been destroyed by
IO thread). This would happen when SQL thread was initializing the
master.info
Solution: We solve this by locking the log and checking if it is
hot. If it is we keep the log while seeking. Otherwise we release it
right away, because a log can get from hot to cold, but not from cold
to hot.
- Reserve line 17 in master.info for master_bind which has been
added in MySQL Cluster 6.3
- move the line for "list of server id for ignorable servers" to line 18
at mf_iocache.c, line 1722
The slave crashed while two threads: IO thread and user thread
raced for the same mutex (the append_buffer_lock protecting the
relay log's IO_CACHE). The IO thread was trying to flush the
cache, and for that was grabbing the append_buffer_lock.
However, the other thread was closing and reopening the relay log
when the IO thread tried to lock. Closing and reopening the log
includes destroying and reinitialising the IO_CACHE
mutex. Therefore, the IO thread tried to lock a destroyed mutex.
We fix this by backporting patch for BUG#50364 which fixed this
bug in mysql server 5.5+. The patch deploys missing
synchronization when flush_master_info is called and the relay
log is flushed by the IO thread. In detail the patch backports
revision (from mysql-trunk):
- luis.soares@sun.com-20100203165617-b1yydr0ee24ycpjm
This patch already includes the post-push fix also in BUG#50364:
- luis.soares@sun.com-20100222002629-0cijwqk6baxhj7gr
This patch:
- Moves all definitions from the mysql_priv.h file into
header files for the component where the variable is
defined
- Creates header files if the component lacks one
- Eliminates all include directives from mysql_priv.h
- Eliminates all circular include cycles
- Rename time.cc to sql_time.cc
- Rename mysql_priv.h to sql_priv.h
fails in PB sporadically)
The IO thread can concurrently access the relay log IO_CACHE
while another thread is performing an FLUSH LOGS procedure.
FLUSH LOGS closes and reopens the relay log and while doing so it
(re)initializes its IO_CACHE. During this procedure the IO_CACHE
mutex is also reinitialized, which can cause problems if some
other thread (namely the IO THREAD) is concurrently accessing it
at the time .
This patch fixes the problem by extending the interface of the
flush_master_info function to also include a second paramater,
"need_relay_log_lock", stating whether the thread should grab the
relay log lock or not before actually flushing the relay log.
Also, IO thread now calls flush_master_info with this flag set
when it flushes master info with in the event read_event loop.
Finally, we also increase loop time in rpl_heartbeat_basic test
case, so that the number of calls to flush logs doubles, stressing
this part of the code a little more.
NOTE: Backporting the patch to next-mr.
The slave was crashing while failing to execute the init_slave() function.
The issue stems from two different reasons:
1 - A failure while allocating the master info structure generated a
segfault due to a NULL pointer.
2 - A failure while recovering generated a segfault due to a non-initialized
relay log file. In other words, the mi->init and rli->init were both set to true
before executing the recovery process thus creating an inconsistent state as the
relay log file was not initialized.
To circumvent such problems, we refactored the recovery process which is now executed
while initializing the relay log. It is ensured that the master info structure is
created before accessing it and any error is propagated thus avoiding to set mi->init
and rli->init to true when for instance the relay log is not initialized or the relay
info is not flushed.
The changes related to the refactory are described below:
1 - Removed call to init_recovery from init_slave.
2 - Changed the signature of the function init_recovery.
3 - Removed flushes. They are called while initializing the relay log and master
info.
4 - Made sure that if the relay info is not flushed the mi-init and rli-init are not
set to true.
In this patch, we also replaced the exit(1) in the fault injection by DBUG_ABORT()
to make it compliant with the code guidelines.
NOTE: Backporting the patch to next-mr.
The fix proposed in BUG#35542 and BUG#31665 introduces a performance issue
when fsyncing the master.info, relay.info and relay-log.bin* after #th events.
Although such solution has been proposed to reduce the probability of corrupted
files due to a slave-crash, the performance penalty introduced by it has
made the approach impractical for highly intensive workloads.
In a nutshell, the option --syn-relay-log proposed in BUG#35542 and BUG#31665
simultaneously fsyncs master.info, relay-log.info and relay-log.bin* and
this is the main source of performance issues.
This patch introduces new options that give more control to the user on
what should be fsynced and how often:
1) (--sync-master-info, integer) which syncs the master.info after #th event;
2) (--sync-relay-log, integer) which syncs the relay-log.bin* after #th
events.
3) (--sync-relay-log-info, integer) which syncs the relay.info after #th
transactions.
To provide both performance and increased reliability, we recommend the following
setup:
1) --sync-master-info = 0 eventually the operating system will fsync it;
2) --sync-relay-log = 0 eventually the operating system will fsync it;
3) --sync-relay-log-info = 1 fsyncs it after every transaction;
Notice, that the previous setup does not reduce the probability of
corrupted master.info and relay-log.bin*. To overcome the issue, this patch also
introduces a recovery mechanism that right after restart throws away relay-log.bin*
retrieved from a master and updates the master.info based on the relay.info:
4) (--relay-log-recovery, boolean) which enables a recovery mechanism that
throws away relay-log.bin* after a crash.
However, it can only recover the incorrect binlog file and position in master.info,
if other informations (host, port password, etc) are corrupted or incorrect,
then this recovery mechanism will fail to work.
BUG#31665 sync_binlog should cause relay logs to be synchronized
NOTE: Backporting the patch to next-mr.
Add sync_relay_log option to server, this option works for relay log
the same as option sync_binlog for binlog. This option also synchronize
master info to disk when set to non-zero value.
Original patches from Sinisa and Mark, with some modifications
Adding new fields Last_{IO,SQL}_Errno and Last_{IO,SQL}_Error to output
of SHOW SLAVE STATUS to hold errors from I/O and SQL thread respectively.
Old fields Last_Error and Last_Errno are aliases for Last_SQL_Error and
Last_SQL_Errno respectively.
Fields are added last to output of SHOW SLAVE STATUS to allow old applications
to use the same positional arguments into the row, while allowing new
application to benefit from the added information.
In addition, some new error codes are added (especially for the I/O
thread) to be able to provide sensible error message.
- Add MASTER_SSL_VERIFY_SERVER_CERT option to CHANGE MASTER TO
- Add Master_Ssl_Serify_Server_Cert to SHOW SLAVE STATUS
- Save and restore ssl_verify_server_cert to master info file
setting it to disabled as default.