This patch fixes two problems that may arise when changing the
value of wsrep_slave_threads:
1) Threads may be leaked if wsrep_slave_threads is changed
repeatedly. Specifically, when changing the number of slave
threads, we keep track of wsrep_slave_count_change, the
number of slaves to start / stop. The problem arises when
wsrep_slave_count_change is updated before slaves had a
chance to exit (threads may take some time to exit, as they
exit only after commiting one more replication event).
The fix is to update wsrep_slave_count_change such that it
reflects the number of threads that already exited or are
scheduled to exit.
2) Attempting to set out of range value for wsrep_slave_threads
(below 1 / above 512) results in wsrep_slave_count_change to
be computed based on the out of range value, even though a
warning is generated and wsrep_slave_threads is set to a
truncated value. wsrep_slave_count_change is computed in
wsrep_slave_threads_check(), which is called before mysql
checks for valid range. Fix is to update wsrep_count_change
whenever wsrep_slave_threads is updated with a valid value.
The reason for this is that stop slave takes LOCK_active_mi over the
whole operation while some slave operations will also need LOCK_active_mi
which causes deadlocks.
Fixed by introducing object counting for Master_info and not taking
LOCK_active_mi over stop slave or even stop_all_slaves()
Another benefit of this approach is that it allows:
- Multiple threads can run SHOW SLAVE STATUS at the same time
- START/STOP/RESET/SLAVE STATUS on a slave will not block other slaves
- Simpler interface for handling get_master_info()
- Added some missing unlock of 'log_lock' in error condtions
- Moved rpl_parallel_inactivate_pool(&global_rpl_thread_pool) to end
of stop_slave() to not have to use LOCK_active_mi inside
terminate_slave_threads()
- Changed argument for remove_master_info() to Master_info, as we always
have this available
- Fixed core dump when doing FLUSH TABLES WITH READ LOCK and parallel
replication. Problem was that waiting for pause_for_ftwrl was not done
when deleting rpt->current_owner after a force_abort.
Tasks:-
Changes in wsrep_dirty_reads variable
1.) Global + Session scope (Current: session-only)
2.) Can be set using command line.
3.) Allow all commands that do not change data (besides SELECT)
4.) Allow prepared Statements that do not change data
5.) Works with wsrep_sync_wait enabled
Tasks:-
Changes in wsrep_dirty_reads variable
1.) Global + Session scope (Current: session-only)
2.) Can be set using command line.
3.) Allow all commands that do not change data (besides SELECT)
4.) Allow prepared Statements that do not change data
5.) Works with wsrep_sync_wait enabled
Increase max number of possible table_open_cache instances from 512K
to 1024K. This only affects user who are trying to set the variable over
the old limit.
Delete not used test table_open_cache_instances_basic
(Need to be added back and rewritten in 10.2)
This has no functional changes, but it helps avoid merge problems from 10.0
to 10.1. In 10.0, code that checks for parallel replication uses
opt_slave_parallel_threads > 0, but this check needs to be
mi->using_parallel() in 10.1. By using the same check in 10.0 (with
unchanged semantics), merge problems to 10.1 are avoided.
In well defined C code, the "this" pointer is never NULL. Currently, we
were potentially dereferencing a NULL pointer (master_info_index). GCC v6
removes any "if (!this)" conditions as it assumes this is always a
non-null pointer. In order to prevent undefined behaviour, check the
pointer before dereferencing and remove the check within member
functions.
This changes variable wsrep_max_ws_size so that its value
is linked to the value of provider option repl.max_ws_size.
That is, changing the value of variable wsrep_max_ws_size
will change the value of provider option repl.max_ws_size,
and viceversa.
The writeset size limit is always enforced in the provider,
regardless of which option is used.
Since wsrep_sync_wait & wsrep_causal_reads variables are related,
they are always kept in sync whenever one of them changes.
Same is tried on server start, where wsrep_sync_wait get updated
based on wsrep_causal_reads' value. But, since wsrep_causal_reads
is OFF by default, wsrep_sync_wait's value gets modified and loses
its WSREP_SYNC_WAIT_BEFORE_READ bit.
Fixed by syncing wsrep_sync_wait & wsrep_causal_reads values
individually on server start in mysqld_get_one_option() based
on command line arguments used.
Post-fix #2:
- Update test results
- Make the optimization conditional under @@optimizer_switch flag.
- The optimization is now disabled by default, so .result files
are changed back to be what they were before the MDEV-8989 patch.
LOG-SLOW-SLAVE-STATEMENTS NOT DISPLAYED.
These parameters were moved from the command line options to
the system variables section. Treatment of the
opt_log_slow_slave_statements changed to let the
dynamic change of the variable.
* make a local variable for target_table->field[col]
* move an often-used bit function to my_bit.h
* remove a non-static and not really needed trivial comparison
function with a very generic name
This occurs when replication stops with an error, domain-based parallel
replication is used, and the GTID position contains more than one domain.
Furthermore, it relates to the case where the SQL thread is restarted
without first stopping the IO thread.
In this case, the file/offset relay-log position does not correctly
represent the slave's multi-dimensional position, because other domains may
be far ahead of, or behind, the domain with the failing event. So the code
reverts the relay log position back to the start of a relay log file that is
known to be before all active domains.
There was a bug that when the SQL thread was restarted, the
rli->relay_log_state was incorrectly initialised from @@gtid_slave_pos. This
position will likely be too far ahead, due to reverting the relay log
position. Thus, if the replication fails again after the SQL thread restart,
the rli->restart_gtid_pos might be updated incorrectly. This in turn would
cause a second SQL thread restart to replicate from the wrong position, if
the IO thread was still left running.
The fix is to initialise rli->relay_log_state from @@gtid_slave_pos only
when we actually purge and re-fetch relay logs from the master, not at every
SQL thread start.
A related problem is the use of sql_slave_skip_counter to resolve
replication failures in this kind of scenario. Since the slave position is
multi-dimensional, sql_slave_skip_counter can not work properly - it is
indeterminate exactly which event is to be skipped, and is unlikely to work
as expected for the user. So make this an error in the case where
domain-based parallel replication is used with multiple domains, suggesting
instead the user to set @@gtid_slave_pos to reliably skip the desired event.