SELECT ... WHERE ... IN (NULL, ...) does full table scan,
even if the same query without the NULL uses efficient range scan.
The bugfix for the bug 18360 introduced an optimization:
if
1) all right-hand arguments of the IN function are constants
2) result types of all right argument items are compatible
enough to use the same single comparison function to
compare all of them to the left argument,
then
we can convert the right-hand list of constant items to an array
of equally-typed constant values for the further
QUICK index access etc. (see Item_func_in::fix_length_and_dec()).
The Item_null constant item objects have STRING_RESULT
result types, so, as far as Item_func_in::fix_length_and_dec()
is aware of NULLs in the right list, this improvement efficiently
optimizes IN function calls with a mixed right list of NULLs and
string constants. However, the optimization doesn't affect mixed
lists of NULLs and integers, floats etc., because there is no
unique common comparator.
New optimization has been added to ignore the result type
of NULL constants in the static analysis of mixed right-hand lists.
This is safe, because at the execution phase we care about
presence of NULLs anyway.
1. The collect_cmp_types() function has been modified to optionally
ignore NULL constants in the item list.
2. NULL-skipping code of the Item_func_in::fix_length_and_dec()
function has been modified to work not only with in_string
vectors but with in_vectors of other types.
HAVING
When calculating GROUP BY the server caches some expressions. It does
that by allocating a string slot (Item_copy_string) and assigning the
value of the expression to it. This effectively means that the result
type of the expression can be changed from whatever it was to a string.
As this substitution takes place after the compile-time result type
calculation for IN but before the run-time type calculations,
it causes the type calculations in the IN function done at run time
to get unexpected results different from what was prepared at compile time.
In the CASE ... WHEN ... THEN ... statement there was a similar problem
and it was solved by artificially adding a STRING argument to the set of
types of the IN/CASE arguments at compile time, so if any of the
arguments of the CASE function changes its type to a string it will
still be covered by the information prepared at compile time.
and HAVING
When calculating GROUP BY the server caches some expressions. It does
that by allocating a string slot (Item_copy_string) and assigning the
value of the expression to it. This effectively means that the result
type of the expression can be changed from whatever it was to a string.
As this substitution takes place after the compile-time result type
calculation for IN but before the run-time type calculations,
it causes the type calculations in the IN function done at run time
to get unexpected results different from what was prepared at compile time.
In the CASE ... WHEN ... THEN ... statement there was a similar problem
and it was solved by artificially adding a STRING argument to the matrix
at compile time, so if any of the arguments of the CASE function changes
its type to a string it will still be covered by the information prepared
at compile time.
Extended the CASE fix for cover the IN case.
An alternative way of fixing this problem is by caching the result type of
the arguments at compile time and using the cached information at run time
instead of re-calculating the result types.
Preferred the CASE approach for uniformity and fix localization.
Execution of queries containing the CASE function of
aggregate function like in "SELECT ... CASE ARGV(...) WHEN ..."
crashed the server.
The CASE function caches pointers to concrete comparison
functions for an each pair of types of CASE-WHERE clause
parameters, i.e. for the "CASE INT_RESULT WHERE REAL_RESULT
THEN ... WHERE DECIMAL_RESULT ... END" function call it
caches comparisons for INT_RESULT with REAL_RESULT and
for INT_RESULT with DECIMAL_RESULT. Usually a result
type is known after a call to the fix_fields function,
however, the setup_copy_fields function call may
wrap aggregate items with Item_copy_string that has
STRING_RESULT result type, so setup_copy_fields may
change argument result types of the CASE function after
call to Item_func_case::fix_fields/fix_length_and_dec.
Then the Item_func_case::find_item function tries to
use comparison function for unexpected pair of the
STRING_RESULT and some other type - that caused
an assertion failure of server crash.
The Item_func_case::fix_length_and_dec function has
been modified to take into account possible STRING_RESULT
result type in the presence of aggregate arguments of
the CASE function.
and value-list
The server returns unexpected results if a right side of the
NOT IN clause consists of NULL value and some constants of
the same type, for example:
SELECT * FROM t WHERE NOT t.id IN (NULL, 1, 2)
may return 3, 4, 5 etc if a table contains these values.
The Item_func_in::val_int method has been modified:
unnecessary resets of an Item_func_case::has_null field
value has been moved outside of an argument comparison
loop. (Also unnecessary re-initialization of the null_value
field has been moved).
but not collation.
The problem here was that text literals in a view were always
dumped with character set introducer. That lead to loosing
collation information.
The fix is to dump character set introducer only if it was
in the original query. That is now possible because there
is no problem any more of loss of character set of string
literals in views -- after WL#4052 the view is dumped
in the original character set.
The `SELECT col FROM t WHERE col NOT IN (col, ...) GROUP BY col'
crashed in the range optimizer.
The get_func_mm_tree function has been modified to check the
Item_func_in::array field for the NULL value before using of that
value.
- BUG#11986: Stored routines and triggers can fail if the code
has a non-ascii symbol
- BUG#16291: mysqldump corrupts string-constants with non-ascii-chars
- BUG#19443: INFORMATION_SCHEMA does not support charsets properly
- BUG#21249: Character set of SP-var can be ignored
- BUG#25212: Character set of string constant is ignored (stored routines)
- BUG#25221: Character set of string constant is ignored (triggers)
There were a few general problems that caused these bugs:
1. Character set information of the original (definition) query for views,
triggers, stored routines and events was lost.
2. mysqldump output query in client character set, which can be
inappropriate to encode definition-query.
3. INFORMATION_SCHEMA used strings with mixed encodings to display object
definition;
1. No query-definition-character set.
In order to compile query into execution code, some extra data (such as
environment variables or the database character set) is used. The problem
here was that this context was not preserved. So, on the next load it can
differ from the original one, thus the result will be different.
The context contains the following data:
- client character set;
- connection collation (character set and collation);
- collation of the owner database;
The fix is to store this context and use it each time we parse (compile)
and execute the object (stored routine, trigger, ...).
2. Wrong mysqldump-output.
The original query can contain several encodings (by means of character set
introducers). The problem here was that we tried to convert original query
to the mysqldump-client character set.
Moreover, we stored queries in different character sets for different
objects (views, for one, used UTF8, triggers used original character set).
The solution is
- to store definition queries in the original character set;
- to change SHOW CREATE statement to output definition query in the
binary character set (i.e. without any conversion);
- introduce SHOW CREATE TRIGGER statement;
- to dump special statements to switch the context to the original one
before dumping and restore it afterwards.
Note, in order to preserve the database collation at the creation time,
additional ALTER DATABASE might be used (to temporary switch the database
collation back to the original value). In this case, ALTER DATABASE
privilege will be required. This is a backward-incompatible change.
3. INFORMATION_SCHEMA showed non-UTF8 strings
The fix is to generate UTF8-query during the parsing, store it in the object
and show it in the INFORMATION_SCHEMA.
Basically, the idea is to create a copy of the original query convert it to
UTF8. Character set introducers are removed and all text literals are
converted to UTF8.
This UTF8 query is intended to provide user-readable output. It must not be
used to recreate the object. Specialized SHOW CREATE statements should be
used for this.
The reason for this limitation is the following: the original query can
contain symbols from several character sets (by means of character set
introducers).
Example:
- original query:
CREATE VIEW v1 AS SELECT _cp1251 'Hello' AS c1;
- UTF8 query (for INFORMATION_SCHEMA):
CREATE VIEW v1 AS SELECT 'Hello' AS c1;
Problem: we may get unexpected results comparing [u]longlong values as doubles.
Fix: adjust the test to use integer comparators.
Note: it's not a real fix, we have to implement some new comparators
to completely solve the original problem (see my comment in the bug report).
The IN function was comparing DATE/DATETIME values either as ints or as
strings. Both methods have their disadvantages and may lead to a wrong
result.
Now IN function checks whether all of its arguments has the STRING result
types and at least one of them is a DATE/DATETIME item. If so it uses either
an object of the in_datetime class or an object of the cmp_item_datetime
class to perform its work. If the IN() function arguments are rows then
row columns are checked whether the DATE/DATETIME comparator should be used
to compare them.
The in_datetime class is used to find occurence of the item to be checked
in the vector of the constant DATE/DATETIME values. The cmp_item_datetime
class is used to compare items one by one in the DATE/DATETIME context.
Both classes obtain values from items with help of the get_datetime_value()
function and cache the left item if it is a constant one.
of its argument happened to be a decimal expression returning
the NULL value.
The crash was due to the fact the function in_decimal::set did
not take into account that val_decimal() could return 0 if
the decimal expression had been evaluated to NULL.
Several problems here :
1. The conversion to double of an hex string const item
was not taking into account the unsigned flag.
2. IN was not behaving in the same was way as comparisons
when performed over an INT/DATE/DATETIME/TIMESTAMP column
and a constant. The ordinary comparisons in that case
convert the constant to an INTEGER value and do int
comparisons. Fixed the IN to do the same.
3. IN is not taking into account the unsigned flag when
calculating <expr> IN (<int_const1>, <int_const2>, ...).
Extended the implementation of IN to store and process
the unsigned flag for its arguments.
When checking if an IN predicate can be evaluated using a key
the optimizer makes sure that all the arguments of IN are of
the same result type. To assure that it check whether
Item_func_in::array is filled in.
However Item_func_in::array is set if the types are
the same AND all the arguments are compile time constants.
Fixed by introducing Item_func_in::arg_types_compatible
flag to allow correct checking of the desired condition.
The optimizer needs to evaluate whether predicates are better
evaluated using an index. IN is one such predicate.
To qualify an IN predicate must involve a field of the index
on the left and constant arguments on the right.
However whether an expression is a constant can be determined only
by knowing the preceding tables in the join order.
Assuming that only IN predicates with expressions on the right that
are constant for the whole query qualify limits the scope of
possible optimizations of the IN predicate (more specifically it
doesn't allow the "Range checked for each record" optimization for
such an IN predicate.
Fixed by not pre-determining the optimizability of the IN predicate
in the case when all right IN operands are not SQL constant expressions
The problem was that some functions (namely IN() starting with 4.1, and
CHAR() starting with 5.0) were returning NULL in certain conditions,
while they didn't set their maybe_null flag. Because of that there could
be some problems with 'IS NULL' check, and statements that depend on the
function value domain, like CREATE TABLE t1 SELECT 1 IN (2, NULL);.
The fix is to set maybe_null correctly.
result
The IN function aggregates result types of all expressions. It uses that
type in comparison of left expression and expressions in right part.
This approach works in most cases. But let's consider the case when the
right part contains both strings and integers. In that case this approach may
cause wrong results because all strings which do not start with a digit are
evaluated as 0.
CASE uses the same approach when a CASE expression is given thus it's also
affected.
The idea behind this fix is to make IN function to compare expressions with
different result types differently. For example a string in the left
part will be compared as string with strings specified in right part and
will be converted to real for comparison to int or real items in the right
part.
A new function called collect_cmp_types() is added. It collects different
result types for comparison of first item in the provided list with each
other item in the list.
The Item_func_in class now can refer up to 5 cmp_item objects: 1 for each
result type for comparison purposes. cmp_item objects are allocated according
to found result types. The comparison of the left expression with any
right part expression is now based only on result types of these expressions.
The Item_func_case class is modified in the similar way when a CASE
expression is specified. Now it can allocate up to 5 cmp_item objects
to compare CASE expression with WHEN expressions of different types.
The comparison of the CASE expression with any WHEN expression now based only
on result types of these expressions.
- Make the range-et-al optimizer produce E(#table records after table
condition is applied),
- Make the join optimizer use this value,
- Add "filtered" column to EXPLAIN EXTENDED to show
fraction of records left after table condition is applied
- Adjust test results, add comments
The IN() function uses agg_cmp_type() to aggregate all types of its arguments
to find out some common type for comparisons. In this particular case the
char() and the int was aggregated to double because char() can contain values
like '1.5'. But all strings which do not start from a digit are converted to
0. thus 'a' and 'z' become equal.
This behaviour is reasonable when all function arguments are constants. But
when there is a field or an expression this can lead to false comparisons. In
this case it makes more sense to coerce constants to the type of the field
argument.
The agg_cmp_type() function now aggregates types of constant and non-constant
items separately. If some non-constant items will be found then their
aggregated type will be returned. Thus after the aggregation constants will be
coerced to the aggregated type.
- When manually constructing a SEL_TREE for "t.key NOT IN(...)", take into account that
get_mm_parts may return a tree with type SEL_TREE::IMPOSSIBLE
- Added missing OOM checks
- Added comments
too much memory. Instead, either create the equvalent SEL_TREE manually, or create only two ranges that
strictly include the area to scan
(Note: just to re-iterate: increasing NOT_IN_IGNORE_THRESHOLD will make optimization run slower for big
IN-lists, but the server will not run out of memory. O(N^2) memory use has been eliminated)
new file
mysql_fix_privilege_tables.sql, mysql_create_system_tables.sh:
Adding true BINARY/VARBINARY: fixing "password" type, not to be 0x00-padding.
Many files:
Adding true BINARY/VARBINARY: fixing tests not to output 0x00 bytes.
Adding true BINARY/VARBINARY: new pad_char structure member.
ctype-bin.c:
Adding true BINARY/VARBINARY: new pad_char structure member.
New strnxfrm, with two trailing length bytes.
field.cc:
Adding true BINARY/VARBINARY.