use the same restriction for character_set_client on the command line
and from SQL.
Also: remove strange hack from thd_init_client_charset() that contradicted
the manual (collation_connection and character_set_result were not always set)
ALTER TABLE: don't fill default values per row, do it once.
And do it in two places - for copy_data_between_tables() and for online ALTER.
Also, run function_defaults test both for MyISAM and for InnoDB.
when reading data into the record buffer, the tail of the VARCHAR
(between real and max varchar length) is not written to. initialize the record
buffer to avoid writing uninitialized memory to disk.
reset default fields not for every modified row, but only once,
at the beginning, as the set of modified fields doesn't change.
exception: INSERT ... ON DUPLICATE KEY UPDATE - the set of fields
does change per row and in that case we reset default fields per row.
MDEV-6971 Bad results with joins comparing TIME and DOUBLE/DECIMAL columns
Disallow using indexes on non-temporal columns to optimize
ref access, range access and table elimination when the counterpart's
cmp_type is TIME_RESULT, e.g.:
SELECT * FROM t1 WHERE indexed_int_column=time_expression;
Only index on a temporal column can be used to optimize temporal comparison
operations.
When a master server restarts, it writes a restart format_description event as
the first event in the next binlog file. The parallel slave SQL thread queues
a special restart entry for the current worker thread to signal this, so that
the worker thread can roll back any prior partial transaction that might have
been written to the binlog due to master crashing.
This queueing was missing a mysql_cond_signal() to notify the worker
thread. This could cause the worker thread to not process the restart entry,
and this in turn would cause the SQL thread to hang infinitely waiting for the
worker thread to complete processing.
Fix by adding the missing wakeup signalling for this case.
The real problem here was inconsistent handling of entry->commit_errno in
MYSQL_BIN_LOG::write_transaction_or_stmt(). Some return paths were setting it
to the value of errno, some where not. And the setting was redundant anyway,
as it is set consistently by the caller.
Fix by consistently setting it in the caller, and not in each return path in
the function.
The test failure happened because a DBUG_EXECUTE_IF() used in the test case
set an entry->commit_errno that was immediately overwritten in the caller with
whatever happened to be the value of errno. This could lead to different error
message in the .result file.
This bug was seen when parallel replication experienced a deadlock between
transactions T1 and T2, where T2 has reached the commit phase and is waiting
for T1 to commit first. In this case, the deadlock is broken by sending a kill
to T2; that kill error is then later detected and converted to a deadlock
error, which causes T2 to be rolled back and retried.
The problem was that the kill caused ha_commit_trans() to errorneously call
wakeup_subsequent_commits() on T3, signalling it to abort because T2 failed
during commit. This is incorrect, because the error in T2 is only a temporary
error, which will be resolved by normal transaction retry. We should not
signal error to the next transaction until we have executed the code that
handles such temporary errors.
So this patch just removes the calls to wakeup_subsequent_commits() from
ha_commit_trans(). They are incorrect in this case, and they are not needed in
general, as wakeup_subsequent_commits() must in any case be called in
finish_event_group() to wakeup any transactions that may have started to wait
after ha_commit_trans(). And normally, wakeup will in fact have happened
earlier, either from the binlog group commit code, or (in case of no
binlogging) after the fast part of InnoDB/XtraDB group commit.
The symptom of this bug was that replication would break on some transaction
with "Commit failed due to failure of an earlier commit on which this one
depends", but with no such failure of an earlier commit visible anywhere.
The retry of an event group in parallel replication set the wrong value for
the end log position of the event that was retried
(qev->future_event_relay_log_pos). It was too large by the size of the event,
so it pointed into the middle of the following event.
If the retry happened in the very last event of the event group, _and_ the SQL
thread was stopped just after successfully retrying that event, then the SQL
threads's relay log position would be left incorrect. Restarting the SQL
thread could then try to read events from a garbage offset in the relay log,
usually leading to an error about not being able to read the event.
The code in binlog group commit around wait_for_commit that controls commit
order, did the wakeup of subsequent commits early, as soon as a following
transaction is put into the group commit queue, but before any such commit has
actually taken place. This causes problems with too early wakeup of
transactions that need to wait for prior to commit, but do not take part in
the binlog group commit for one reason or the other.
This patch solves the problem, by moving the wakeup to happen only after the
binlog group commit is completed.
This requires a new solution to ensure that transactions that arrive later
than the leader are still able to participate in group commit. This patch
introduces a flag wait_for_commit::commit_started. When this is set, a waiter
can queue up itself in the group commit queue.
This way, effectively the wait_for_prior_commit() is skipped only for
transactions that participate in group commit, so that skipping the wait is
safe. Other transactions still wait as needed for correctness.
The code that handles free lists of various objects passed to worker threads
in parallel replication handles freeing in batches, to avoid taking and
releasing LOCK_rpl_thread too often. However, it was possible for freeing to
be delayed to the point where one thread could stall the SQL driver thread due
to full queue, while other worker threads might be idle. This could
significantly degrade possible parallelism and thus performance.
Clean up the batch freeing code so that it is more robust and now able to
regularly free batches of object, so that normally the queue will not run full
unless the SQL driver thread is really far ahead of the worker threads.
The bug occured in parallel replication when re-trying transactions that
failed due to deadlock. In this case, the relay log file is re-opened and the
events are read out again. This reading requires a format description event of
the appropriate version. But the code was using a description event stored in
rli, which is not thread-safe. This could lead to various rare races if the
format description event was replaced by the SQL driver thread at the exact
moment where a worker thread was trying to use it.
The fix is to instead make the retry code create and maintain its own format
description event. When the relay log file is opened, we first read the format
description event from the start of the file, before seeking to the current
position. This now uses the same code as when the SQL driver threads starts
from a given relay log position. This also makes sure that the correct format
description event version will be used in cases where the version of the
binlog could change during replication.
In parallel replication, threads can do two different waits for a prior
transaction. One is for the prior transaction to start commit, the other is
for it to complete commit.
It turns out that the same PSI_stage_info message was errorneously used in
both cases (probably a merge error), causing SHOW PROCESSLIST to be
misleading.
Fix by using correct, distinct message in each case.
In parallel replication, the wait_for_commit facility is used to ensure that
events are written into the binlog in the correct order. This is handled in an
optimised way in the binlogging group commit code.
However, some statements, for example GRANT, are written directly into the
binlog, outside of the group commit code. There was a bug that this direct
write does not correctly wait for the prior transactions to have been written
first, which allows f.ex. GRANT to be written ahead of earlier transactions.
This patch adds the missing wait_for_prior_commit() before writing directly to
the binlog.
However, the problem is still there, although the race is much less likely to
occur now. The problem is that the optimised group commit code does wakeup of
following transactions early, before the binlog write is actually done. A
woken-up following transaction is then allowed to run ahead and queue up for
the group commit, which will ensure that binlog write happens in correct order
in the end. However, the code for directly written events currently bypass
this mechanism, so they get woken up and written too early.
This will be fixed properly in a later patch.
The real bug was that open_tables() returned error in case of
thd->killed() without properly calling thd->send_kill_message()
to set the correct error. This was fixed some time ago.
So remove the, now redundant, extra checks for thd->is_error(),
possibly allowing to catch in debug builds more incorrect
error handling cases.
In SAFE_MUTEX builds, reset the wait_for_commit mutex (destroy and
re-initialise), so that SAFE_MUTEX lock order check does not become
confused when the mutex is re-used for a different purpose.
The bug is not very important per se, but it was helpful to move
Item_func_strcmp out of Item_bool_func2 (to Item_int_func),
for the purposes of "MDEV-4912 Add a plugin to field types (column types)".
(Backport to 5.3)
(Attempt #2)
- Don't attempt to use BKA for materialized derived tables. The
table is neither filled nor fully opened yet, so attempt to
call handler->multi_range_read_info() causes crash.
(Backport to 5.3)
(variant #2, with fixed coding style)
- Make Mrr_ordered_index_reader::resume_read() restore index position
only if it was saved before with Mrr_ordered_index_reader::interrupt_read().
- TABLE::create_key_part_by_field() should not set PART_KEY_FLAG in field->flags
= The reason is that it is used by hash join code which calls it to create a hash
table lookup structure. It doesn't create a real index.
= Another caller of the function is TABLE::add_tmp_key(). Made it to set the flag itself.
- The differences in join_cache.result could also be observed before this patch: one
could put "FLUSH TABLES" before the queries and get exactly the same difference.
(Attempt #2)
- Don't attempt to use BKA for materialized derived tables. The
table is neither filled nor fully opened yet, so attempt to
call handler->multi_range_read_info() causes crash.
1. Do not use NULL `info' field in processlist to select the thread of
interest. This can fail if the read of processlist ends up happening after
REAP succeeds, but before the `info' field is reset. Instead, select on the
CONNECTION_ID(), making sure we still scan the whole list to trigger the same
code as in the original test case.
2. Wait for the query to really complete before reading it in the
processlist. When REAP returns, it only means that ack has been sent to
client, the reset of query stage happens a bit later in the code.
- Don't attempt to use BKA for materialized derived tables. The
table is neither filled nor fully opened yet, so attempt to
call handler->multi_range_read_info() causes crash.
- test_if_skip_sort_order()/create_ref_for_key() may change table
access from EQ_REF(index1) to REF(index2).
- Doing so doesn't make much sense from optimization POV, but since
they are doing it, they should update tab->read_record.unlock_row
accordingly.
Don't double-check privileges for a column in the GROUP BY that refers to
the same column in SELECT clause. Privileges were already checked for SELECT clause.