The bug is a regression introduced in 5.1 by the patch for bug28404.
Under some circumstances test_if_skip_sort_order() could leave some
data structures in an inconsistent state so that some parts of code
could assume the selected execution strategy for GROUP BY/DISTINCT as
a loose index scan (e.g. JOIN_TAB::is_using_loose_index_scan()), while
the actual strategy chosen was an ordered index scan, which led to
wrong data being returned.
Fixed test_if_skip_sort_order() so that when changing the type for a
join table, select->quick is reset not only for EXPLAIN, but for the
actual join execution as well, to not confuse code that depends on its
value to determine the chosen GROUP BY/DISTINCT strategy.
The optimization that uses a unique index to remove GROUP BY did not
ensure that the index was actually used, thus violating the ORDER BY
that is implied by GROUP BY.
Fixed by replacing GROUP BY with ORDER BY if the GROUP BY clause contains
a unique index over non-nullable field(s). In case GROUP BY ... ORDER BY
null is used, GROUP BY is simply removed.
This patch adds cost estimation for the queries with ORDER BY / GROUP BY
and LIMIT.
If there was a ref/range access to the table whose rows were required
to be ordered in the result set the optimizer always employed this access
though a scan by a different index that was compatible with the required
order could be cheaper to produce the first L rows of the result set.
Now for such queries the optimizer makes a choice between the cheapest
ref/range accesses not compatible with the given order and index scans
compatible with it.
- added join cache indication in EXPLAIN (Extra column).
- prefer filesort over full scan over
index for ORDER BY (because it's faster).
- when switching from REF to RANGE because
RANGE uses longer key turn off sort on
the head table only as the resulting
RANGE access is a candidate for join cache
and we don't want to disable it by sorting
on the first table only.
DATE and DATETIME can be compared either as strings or as int. Both
methods have their disadvantages. Strings can contain valid DATETIME value
but have insignificant zeros omitted thus became non-comparable with
other DATETIME strings. The comparison as int usually will require conversion
from the string representation and the automatic conversion in most cases is
carried out in a wrong way thus producing wrong comparison result. Another
problem occurs when one tries to compare DATE field with a DATETIME constant.
The constant is converted to DATE losing its precision i.e. losing time part.
This fix addresses the problems described above by adding a special
DATE/DATETIME comparator. The comparator correctly converts DATE/DATETIME
string values to int when it's necessary, adds zero time part (00:00:00)
to DATE values to compare them correctly to DATETIME values. Due to correct
conversion malformed DATETIME string values are correctly compared to other
DATE/DATETIME values.
As of this patch a DATE value equals to DATETIME value with zero time part.
For example '2001-01-01' equals to '2001-01-01 00:00:00'.
The compare_datetime() function is added to the Arg_comparator class.
It implements the correct comparator for DATE/DATETIME values.
Two supplementary functions called get_date_from_str() and get_datetime_value()
are added. The first one extracts DATE/DATETIME value from a string and the
second one retrieves the correct DATE/DATETIME value from an item.
The new Arg_comparator::can_compare_as_dates() function is added and used
to check whether two given items can be compared by the compare_datetime()
comparator.
Two caching variables were added to the Arg_comparator class to speedup the
DATE/DATETIME comparison.
One more store() method was added to the Item_cache_int class to cache int
values.
The new is_datetime() function was added to the Item class. It indicates
whether the item returns a DATE/DATETIME value.
The optimizer transforms DISTINCT into a GROUP BY
when possible.
It does that by constructing the same structure
(a list of ORDER instances) the parser makes when
parsing GROUP BY.
While doing that it also eliminates duplicates.
But if a duplicate is found it doesn't advance the
pointer to ref_pointer array, so the next
(and subsequent) ORDER structures point to the wrong
element in the SELECT list.
Fixed by advancing the pointer in ref_pointer_array
even in the case of a duplicate.
The optimizer takes away columns from GROUP BY/DISTINCT if they constitute
all the parts of an unique index.
However if some of the columns can contain NULLs this cannot be done
(because an UNIQUE index can have multiple rows with NULL values).
Fixed by not using UNIQUE indexes with nullable columns to remove
grouping columns from GROUP BY/DISTINCT.
The optimizer removes expressions from GROUP BY/DISTINCT
if they happen to participate in a <expression> = <const>
predicates of the WHERE clause (the idea being that if
it's always equal to a constant it can't have multiple
values).
However for predicates where the expression and the
constant item are of different result type this is not
valid (e.g. a string column compared to 0).
Fixed by additional check of the result types of the
expression and the constant and if they differ the
expression don't get removed from the group by list.
GROUP BY/DISTINCT pruning optimization must be done before ORDER BY
optimization because ORDER BY may be removed when GROUP BY/DISTINCT
sorts as a side effect, e.g. in
SELECT DISTINCT <non-key-col>,<pk> FROM t1
ORDER BY <non-key-col> DISTINCT
must be removed before ORDER BY as if done the other way around
it will remove both.
- Make the range-et-al optimizer produce E(#table records after table
condition is applied),
- Make the join optimizer use this value,
- Add "filtered" column to EXPLAIN EXTENDED to show
fraction of records left after table condition is applied
- Adjust test results, add comments
'SELECT DISTINCT a,b FROM t1' should not use temp table if there is unique
index (or primary key) on a.
There are a number of other similar cases that can be calculated without the
use of a temp table : multi-part unique indexes, primary keys or using GROUP BY
instead of DISTINCT.
When a GROUP BY/DISTINCT clause contains all key parts of a unique
index, then it is guaranteed that the fields of the clause will be
unique, therefore we can optimize away GROUP BY/DISTINCT altogether.
This optimization has two effects:
* there is no need to create a temporary table to compute the
GROUP/DISTINCT operation (or the temporary table will be smaller if only GROUP
is removed and DISTINCT stays or if DISTINCT is removed and GROUP BY stays)
* this causes the statement in effect to become updatable in Connector/Java
because the result set columns will be direct reference to the primary key of
the table (instead to the temporary table that it currently references).
Implemented a check that will optimize away GROUP BY/DISTINCT for queries like
the above.
Currently it will work only for single non-constant table in the FROM clause.
When converting DISTINCT to GROUP BY where the columns are from the covering
index and they are quoted twice in the SELECT list the optimizer is creating
improper processing sequence. This is because of the fact that the columns
of the covering index are not recognized as such and treated as non-index
columns.
Generally speaking duplicate columns can safely be removed from the GROUP
BY/DISTINCT list because this will not add or remove new rows in the
resulting set. Duplicates can be removed even if they are not consecutive
(as is the case for ORDER BY, where the duplicate columns can be removed
only if they are consecutive).
So we can safely transform "SELECT DISTINCT a,a FROM ... ORDER BY a" to
"SELECT a,a FROM ... GROUP BY a ORDER BY a" instead of
"SELECT a,a FROM .. GROUP BY a,a ORDER BY a". We can even transform
"SELECT DISTINCT a,b,a FROM ... ORDER BY a,b" to
"SELECT a,b,a FROM ... GROUP BY a,b ORDER BY a,b".
The fix to this bug consists of checking for duplicate columns in the SELECT
list when constructing the GROUP BY list in transforming DISTINCT to GROUP
BY and skipping the ones that are already in.
Added test cases for bug #12625.
sql_select.cc:
Fixed bug #12625.
Fixed invalid removal of constant items from the DISTINCT
list in the function create_distinct_group.