BUG#13519696 - 62940: SELECT RESULTS VARY WITH VERSION AND
WITH/WITHOUT INDEX RANGE SCAN
BUG#13453382 - REGRESSION SINCE 5.1.39, RANGE OPTIMIZER WRONG
RESULTS WITH DECIMAL CONVERSION
BUG#13463488 - 63437: CHAR & BETWEEN WITH INDEX RETURNS WRONG
RESULT AFTER MYSQL 5.1.
Those are all cases where the range optimizer got it wrong
with > and >=.
Root cause for this bug is that the optimizer try to detect&
optimize the special case:
'<field> BETWEEN c1 AND c1' and handle this as the condition '<field> = c1'
This was implemented inside add_key_field(.. *field, *value[]...)
which assumed field to refer key Field, and value[] to refer a [low...high]
constant pair. value[0] and value[1] was then compared for equality.
In a 'normal' BETWEEN condition of the form '<field> BETWEEN val1 and val2' the
BETWEEN operation is represented with an argementlist containing the
values [<field>, val1, val2] - add_key_field() is then called with
parameters field=<field>, *value=val1.
However, if the BETWEEN predicate specified:
1) '<const1> BETWEEN<const2> AND<field>
the 'field' and 'value' arguments to add_key_field() had to be swapped.
This was implemented by trying to cheat add_key_field() to handle it like:
2) '<const1> GE<const2> AND<const1> LE<field>'
As we didn't really replace the BETWEEN operation with 'ge' and 'le',
add_key_field() still handled it as a 'BETWEEN' and compared the (swapped)
arguments<const1> and<const2> for equality. If they was equal, the
condition 1) was incorrectly 'optimized' to:
3) '<field> EQ <const1>'
This fix moves this optimization of '<field> BETWEEN c1 AND c1' into
add_key_fields() which then calls add_key_equal_fields() to collect
key equality / comparison for the key fields in the BETWEEN condition.
Queries involving predicates of the form "const NOT BETWEEN
not_indexed_column AND indexed_column" could return wrong data
due to incorrect handling by the range optimizer.
For "c NOT BETWEEN f1 AND f2" predicates, get_mm_tree()
produces a disjunction of the SEL_ARG trees for "f1 > c" and
"f2 < c". If one of the trees is empty (i.e. one of the
arguments is not sargable) the resulting tree should be empty
as well, since the whole expression in this case is not
sargable.
The above logic is implemented in get_mm_tree() as follows. The
initial state of the resulting tree is NULL (aka empty). We
then iterate through arguments and compute the corresponding
SEL_ARG tree (either "f1 > c" or "f2 < c"). If the resulting
tree is NULL, it is simply replaced by the generated
tree. Otherwise it is replaced by a disjunction of itself and
the generated tree. The obvious flaw in this implementation is
that if the first argument is not sargable and thus produces a
NULL tree, the resulting tree will simply be replaced by the
tree for the second argument. As a result, "c NOT BETWEEN f1
AND f2" will end up as just "f2 < c".
Fixed by adding a check so that when the first argument
produces an empty tree for the NOT BETWEEN case, the loop is
aborted with an empty tree as a result. The whole idea of using
a loop for 2 arguments does not make much sense, but it was
probably used to avoid code duplication for several BETWEEN
variants.
feature
The test for bug no 50939 was put in range.test which isn't such a good idea
since it requires partitioning. Fixed by moving the test case to
partitioning_range.test.
remember range endpoints
The Loose Index Scan optimization keeps track of a sequence
of intervals. For the current interval it maintains the
current interval's endpoints. But the maximum endpoint was
not stored in the SQL layer; rather, it relied on the
storage engine to retain this value in-between reads. By
coincidence this holds for MyISAM and InnoDB. Not for the
partitioning engine, however.
Fixed by making the key values iterator
(QUICK_RANGE_SELECT) keep track of the current maximum endpoint.
This is also more efficient as we save a call through the
handler API in case of open-ended intervals.
The code to calculate endpoints was extracted into
separate methods in QUICK_RANGE_SELECT, and it was possible to
get rid of some code duplication as part of fix.
for each record'
There was an error in an internal structure in the range
optimizer (SEL_ARG). Bad design causes parts of a data
structure not to be initialized when it is in a certain
state. All client code must check that this state is not
present before trying to access the structure's data. Fixed
by
- Checking the state before trying to access data (in
several places, most of which not covered by test case.)
- Copying the keypart id when cloning SEL_ARGs
When merging ranges during calculation of the result of OR
to two range sets the current range may be obsoleted by the
resulting merged range.
The first overlapping range can be obsoleted as well.
Fixed by moving the pointer to the first overlapping range to the
pointer of the resulting union range.
Added few comments at key places in key_or().
When a query was using a DATE or DATETIME value formatted
using any other separator characters beside hyphen '-', a
query with a greater-or-equal '>=' condition matching only
the greatest value in an indexed column, the result was
empty if index range scan was employed.
The range optimizer got a new feature between 5.1.38 and
5.1.39 that changes a greater-or-equal condition to a
greater-than if the value matching that in the query was not
present in the table. But the value comparison function
compared the dates as strings instead of dates.
The bug was fixed by splitting the function
get_date_from_str in two: One part that parses and does
error checking. This function is now visible outside the
module. The old get_date_from_str now calls the new
function.
The problem was in incorrect handling of predicates involving
NULL as a constant value by the range optimizer.
For example, when creating a SEL_ARG node from a condition of
the form "field < const" (which would normally result in the
"NULL < field < const" SEL_ARG), the special case when "const"
is NULL was not taken into account, so "NULL < field < NULL"
was produced for the "field < NULL" condition.
As a result, SEL_ARG structures of this form could not be
further optimized which in turn could lead to incorrectly
constructed SEL_ARG trees. In particular, code assuming SEL_ARG
structures to always form a sequence of ordered disjoint
intervals could enter an infinite loop under some
circumstances.
Fixed by changing get_mm_leaf() so that for any sargable
predicate except "<=>" involving NULL as a constant, "empty"
SEL_ARG is returned, since such a predicate is always false.
The problem was in incorrect handling of predicates involving
NULL as a constant value by the range optimizer.
For example, when creating a SEL_ARG node from a condition of
the form "field < const" (which would normally result in the
"NULL < field < const" SEL_ARG), the special case when "const"
is NULL was not taken into account, so "NULL < field < NULL"
was produced for the "field < NULL" condition.
As a result, SEL_ARG structures of this form could not be
further optimized which in turn could lead to incorrectly
constructed SEL_ARG trees. In particular, code assuming SEL_ARG
structures to always form a sequence of ordered disjoint
intervals could enter an infinite loop under some
circumstances.
Fixed by changing get_mm_leaf() so that for any sargable
predicate except "<=>" involving NULL as a constant, "empty"
SEL_ARG is returned, since such a predicate is always false.
covering index
When two range predicates were combined under an OR
predicate, the algorithm tried to merge overlapping ranges
into one. But the case when a range overlapped several other
ranges was not handled. This lead to
1) ranges overlapping, which gave repeated results and
2) a range that overlapped several other ranges was cut off.
Fixed by
1) Making sure that a range got an upper bound equal to the
next range with a greater minimum.
2) Removing a continue statement
WHERE f1 < n ignored row if f1 was indexed integer column and
f1 = TYPE_MAX ^ n = TYPE_MAX+1. The latter value when treated
as TYPE overflowed (obviously). This was not handled, it is now.
Two disjuncts containing equalities of the form key=const1 and key=const2 can
be merged into one if const1 is equal to const2. To check it the common
collation of the constants were used rather than the collation of the field key.
For example when the default collation of the constants was cases insensitive
while the collation of the field was case sensitive, then two or-ed equality
predicates key='b' and key='B' incorrectly were merged into one f='b'. As a
result ref access was used instead of range access and wrong result sets were
returned in many cases.
Fixed the problem by comparing constant in the or-ed predicate with collation of
the key field.
- added join cache indication in EXPLAIN (Extra column).
- prefer filesort over full scan over
index for ORDER BY (because it's faster).
- when switching from REF to RANGE because
RANGE uses longer key turn off sort on
the head table only as the resulting
RANGE access is a candidate for join cache
and we don't want to disable it by sorting
on the first table only.
Pushbuild fixes:
- Make MAX_SEL_ARGS smaller (even 16K records_in_range() calls is
more than it makes sense to do in typical cases)
- Don't call sel_arg->test_use_count() if we've already allocated
more than MAX_SEL_ARGs elements. The test will succeed but will take
too much time for the test suite (and not provide much value).
- Added PARAM::alloced_sel_args where we count the # of SEL_ARGs
created by SEL_ARG tree cloning operations.
- Made the range analyzer to shortcut and not do any more cloning
if we've already created MAX_SEL_ARGS SEL_ARG objects in cloning.
- Added comments about space complexity of SEL_ARG-graph
representation.
for queries using 'range checked for each record'.
The problem was fixed in 5.0 by the patch for bug 12291.
This patch down-ported the corresponding code from 5.0 into
QUICK_SELECT::init() and added a new test case.