Step#1: fixing the return type of strnxfrm() from size_t to this structure:
typedef struct
{
  size_t m_output_length;
  size_t m_source_length_used;
  uint m_warnings;
} my_strnxfrm_ret_t;
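A hedged sketch (the caller and the warning check are hypothetical) of what
the structured return value buys: a bare size_t could not tell the caller
whether the whole source was consumed or whether truncation happened:

  #include <cstddef>

  typedef unsigned int uint;

  typedef struct
  {
    size_t m_output_length;       // bytes written to the destination
    size_t m_source_length_used;  // bytes consumed from the source
    uint m_warnings;              // e.g. truncation / bad-input flags
  } my_strnxfrm_ret_t;

  // Hypothetical caller: detect a partial sort key instead of guessing.
  static bool sort_key_is_complete(const my_strnxfrm_ret_t &ret,
                                   size_t src_length)
  {
    return ret.m_warnings == 0 && ret.m_source_length_used == src_length;
  }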
This patch also fixes:
MDEV-33050 Build-in schemas like oracle_schema are accent insensitive
MDEV-33084 LASTVAL(t1) and LASTVAL(T1) do not work well with lower-case-table-names=0
MDEV-33085 Tables T1 and t1 do not work well with ENGINE=CSV and lower-case-table-names=0
MDEV-33086 SHOW OPEN TABLES IN DB1 -- is case insensitive with lower-case-table-names=0
MDEV-33088 Cannot create triggers in the database `MYSQL`
MDEV-33103 LOCK TABLE t1 AS t2 -- alias is not case sensitive with lower-case-table-names=0
MDEV-33109 DROP DATABASE MYSQL -- does not drop SP with lower-case-table-names=0
MDEV-33110 HANDLER commands are case insensitive with lower-case-table-names=0
MDEV-33119 User is case insensitive in INFORMATION_SCHEMA.VIEWS
MDEV-33120 System log table names are case insensitive with lower-case-table-names=0
- Removing the virtual function strnncoll() from MY_COLLATION_HANDLER
- Adding a wrapper function CHARSET_INFO::streq(), to compare
two strings for equality. For now it calls strnncoll() internally.
In the future it will turn into a virtual function.
- Adding new accent sensitive case insensitive collations:
- utf8mb4_general1400_as_ci
- utf8mb3_general1400_as_ci
They implement accent sensitive case insensitive comparison.
The weight of a character is equal to the code point of its
upper case variant. These collations use Unicode-14.0.0 casefolding data.
The result of
my_charset_utf8mb3_general1400_as_ci.strcoll()
is very close to the former
my_charset_utf8mb3_general_ci.strcasecmp()
There is only a difference in a couple of dozen rare characters, caused by:
- the switch from "tolower" to "toupper" comparison, made to keep
utf8mb3_general1400_as_ci closer to utf8mb3_general_ci
- the switch from Unicode-3.0.0 to Unicode-14.0.0
This difference should be tolerable. See the list of affected
characters in the MDEV description.
Note, utf8mb4_general1400_as_ci correctly handles non-BMP characters!
Unlike utf8mb4_general_ci, it does not treat all non-BMP characters
as equal.
- Adding classes representing names of the file based database objects:
Lex_ident_db
Lex_ident_table
Lex_ident_trigger
Their comparison collation depends on the underlying
file system case sensitivity and on --lower-case-table-names
and can be either my_charset_bin or my_charset_utf8mb3_general1400_as_ci.
- Adding classes representing names of other database objects,
whose names have case insensitive comparison style,
using my_charset_utf8mb3_general1400_as_ci:
Lex_ident_column
Lex_ident_sys_var
Lex_ident_user_var
Lex_ident_sp_var
Lex_ident_ps
Lex_ident_i_s_table
Lex_ident_window
Lex_ident_func
Lex_ident_partition
Lex_ident_with_element
Lex_ident_rpl_filter
Lex_ident_master_info
Lex_ident_host
Lex_ident_locale
Lex_ident_plugin
Lex_ident_engine
Lex_ident_server
Lex_ident_savepoint
Lex_ident_charset
engine_option_value::Name
- All the mentioned Lex_ident_xxx classes implement a method streq():
if (ident1.streq(ident2))
do_equal();
This method works as a wrapper for CHARSET_INFO::streq().
- Changing a lot of "LEX_CSTRING name" to "Lex_ident_xxx name"
in class members and in function/method parameters.
- Replacing all calls like
system_charset_info->coll->strcasecmp(ident1, ident2)
with
ident1.streq(ident2)
- Taking advantage of the C++11 user-defined literal operator
for LEX_CSTRING (see m_strings.h) and Lex_ident_xxx (see lex_ident.h)
data types. Usage example:
const Lex_ident_column primary_key_name= "PRIMARY"_Lex_ident_column;
is now a shorter version of:
const Lex_ident_column primary_key_name=
Lex_ident_column({STRING_WITH_LEN("PRIMARY")});
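A compilable sketch of the streq() wrapper and the user-defined literal, with
simplified types (strncasecmp stands in for the collation-aware
CHARSET_INFO::streq(); the real comparison uses my_charset_bin or
my_charset_utf8mb3_general1400_as_ci as described above):

  #include <cstddef>
  #include <strings.h>   // strncasecmp (POSIX)

  struct Lex_ident_column_sketch
  {
    const char *str;
    size_t length;

    // Stand-in for the collation-aware equality check
    bool streq(const Lex_ident_column_sketch &other) const
    {
      return length == other.length &&
             strncasecmp(str, other.str, length) == 0;
    }
  };

  Lex_ident_column_sketch operator""_Lex_ident_column(const char *s,
                                                      size_t len)
  {
    return Lex_ident_column_sketch{s, len};
  }

  const Lex_ident_column_sketch primary_key_name=
    "PRIMARY"_Lex_ident_column;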
Under the terms of MDEV-27490 we'll add support for non-BMP identifiers
and upgrade casefolding information to Unicode version 14.0.0.
In Unicode-14.0.0 conversion to lower and upper cases can increase octet length
of the string, so conversion won't be possible in-place any more.
This patch removes virtual functions performing in-place casefolding:
- my_charset_handler_st::casedn_str()
- my_charset_handler_st::caseup_str()
and fixes the code to use the non-inplace functions instead:
- my_charset_handler_st::casedn()
- my_charset_handler_st::caseup()
This is a non-functional change and should not change the server behavior.
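A hedged sketch of the caller-side rewrite (signatures and buffer sizing are
approximations; the real calls go through my_charset_handler_st::casedn()):
the result is written to a separate destination buffer instead of mutating
the source:

  #include <cstddef>
  #include <cctype>

  // Approximated non-inplace interface: read src, write dst, return the
  // number of bytes written.
  static size_t casedn_sketch(const char *src, size_t srclen,
                              char *dst, size_t dstlen)
  {
    size_t i;
    for (i= 0; i < srclen && i < dstlen; i++)
      dst[i]= (char) std::tolower((unsigned char) src[i]);
    return i;
  }

  // Before (in-place, removed):  casedn_str(cs, name);
  // After (non-inplace):         casedn(cs, name, len, buf, sizeof(buf));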
Casefolding information is now stored in items of a new data type MY_CASEFOLD_CHARACTER:
typedef struct casefold_info_char_t
{
  uint32 toupper;
  uint32 tolower;
} MY_CASEFOLD_CHARACTER;
Before this change, casefolding tables for Asian collations were stored in:
typedef struct unicase_info_char_st
{
  uint32 toupper;
  uint32 tolower;
  uint32 sort;
} MY_UNICASE_CHARACTER;
The "sort" member was not used in the code handling Asian collations,
it only wasted space.
(it's only used by Unicode _general_ci and _general_mysql500_ci collations).
Unicode collations (at least UCA and _bin) should also be refactored later,
but under terms of a separate task.
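A hedged sketch of how the slimmer records are typically consumed (the page
layout and names are assumptions modeled on MY_UNICASE_INFO): lookups go
through a per-page table, and dropping "sort" shrinks each entry from 12 to
8 bytes:

  #include <cstdint>

  typedef struct casefold_info_char_t
  {
    uint32_t toupper;
    uint32_t tolower;
  } MY_CASEFOLD_CHARACTER;

  // Hypothetical page table covering the BMP: 256 pages of 256 entries.
  struct MY_CASEFOLD_INFO_sketch
  {
    const MY_CASEFOLD_CHARACTER *page[256];
  };

  static uint32_t my_toupper_sketch(const MY_CASEFOLD_INFO_sketch *fold,
                                    uint32_t wc)
  {
    if (wc > 0xFFFF)                       // outside the table: identity
      return wc;
    const MY_CASEFOLD_CHARACTER *p= fold->page[wc >> 8];
    return p ? p[wc & 0xFF].toupper : wc;  // unmapped page: identity
  }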
String length growth during upper/lower conversion
in Unicode collations depends only on the underlying MY_UNICASE_INFO
used in the collation.
Maintaining the separate members CHARSET_INFO::caseup_multiply and
CHARSET_INFO::casedn_multiply duplicated this information
and caused bugs when MY_UNICASE_INFO and case??_multiply
went out of sync because of incomplete CHARSET_INFO initialization.
Fix:
Changing CHARSET_INFO::caseup_multiply and CHARSET_INFO::casedn_multiply
from members to virtual functions.
The virtual functions in Unicode collations calculate case conversion
growth factors from the MY_UNICASE_INFO. This guarantees that the growth
factors are always in sync with the MY_UNICASE_INFO.
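A hedged sketch of the idea (member and flag names are assumptions, not the
real API): the growth factor becomes a function of the casefolding data, so
it can no longer drift out of sync with it:

  // The casefold data itself advertises whether any mapping grows.
  struct MY_UNICASE_INFO_sketch
  {
    bool has_expanding_caseup;   // some toupper mapping grows in bytes
    bool has_expanding_casedn;
  };

  struct CHARSET_INFO_sketch
  {
    const MY_UNICASE_INFO_sketch *casefold;

    // Formerly plain data members, now derived from the single source
    // of truth on every call:
    unsigned caseup_multiply() const
    { return casefold && casefold->has_expanding_caseup ? 2 : 1; }
    unsigned casedn_multiply() const
    { return casefold && casefold->has_expanding_casedn ? 2 : 1; }
  };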
my_copy_fix_mb() passed MIN(src_length,dst_length) to
my_append_fix_badly_formed_tail(). It could break a multi-byte
character in the middle, which put a question mark into the
destination.
Fixing the code to pass the true src_length to
my_append_fix_badly_formed_tail().
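A hedged, UTF-8-only illustration of why a byte-count cut is dangerous (the
helper is hypothetical): MIN(src_length, dst_length) can land inside a
multi-byte character, so a safe cut has to step back to a character boundary,
while the actual fix simply passes the true src_length so that
my_append_fix_badly_formed_tail() sees the whole tail:

  #include <cstddef>

  // Move a UTF-8 cut position back so it does not split a character.
  // Assumes s holds at least cut+1 bytes, i.e. a real truncation happened.
  static size_t utf8_char_boundary(const unsigned char *s, size_t cut)
  {
    while (cut > 0 && (s[cut] & 0xC0) == 0x80)   // 10xxxxxx: continuation
      cut--;
    return cut;
  }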
Passing a null pointer to a nonnull argument is not only undefined
behaviour, but it also grants the compiler the permission to optimize
away further checks whether the pointer is null. GCC -O2, at least
starting with version 8, may do that, potentially causing SIGSEGV.
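A self-contained illustration of the hazard (GCC/Clang attribute syntax):

  #include <cstdio>

  __attribute__((nonnull))
  static void use_string(const char *p)
  {
    std::printf("%s\n", p);
  }

  static void call(const char *p)
  {
    use_string(p);      // the compiler may now assume p != nullptr ...
    if (p == nullptr)   // ... so GCC -O2 may drop this "dead" check
      std::printf("never printed under -O2\n");
  }

  int main()
  {
    call("hello");
    return 0;
  }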
After the MDEV-13118 fix there's no code in the server that
wants caseup/casedn to change the argument in place for simple
charsets. Let's remove this logic and always return the result in a
new string for all charsets, both simple and complex.
1. Removing the optimization that *some* character sets used in casedn()
and caseup(), which allowed (and required) changing the case in-place,
overwriting the string passed as the "src" argument.
Now all CHARSET_INFOs work in the same way:
none of them changes the source string in-place; all of them convert
case from the source string to the destination string, leaving
the source string untouched.
2. Adding "const" qualifier to the "char *src" parameter
to caseup() and casedn().
3. Removing duplicate implementations in ctype-mb.c.
Now the caseup() and casedn() implementations for all CJK character sets
internally use the same function, my_casefold_mb()
(the former my_casefold_mb_varlen()).
4. Removing the "unused" attribute from parameters of some my_case{up|dn}_xxx()
implementations, as the affected parameters are now *used* in the code.
Previously these parameters were used only in DBUG_ASSERT().
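As a sketch of items 1 and 2 combined (the signature is approximated from the
description, not copied from the tree): the source buffer is now const and the
folded result always goes to a distinct destination:

  #include <cstddef>

  struct CHARSET_INFO;   // opaque in this sketch

  // Before: char *src, and some charsets overwrote it in place.
  // After: const char *src, every charset writes only to dst.
  typedef size_t (*my_caseup_func_sketch)(CHARSET_INFO *cs,
                                          const char *src, size_t srclen,
                                          char *dst, size_t dstlen);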
This is a pre-requisite patch for:
- MDEV-8433 Make field<'broken-string' use indexes
- MDEV-8625 Bad result set with ignorable characters when using a prefix key
- MDEV-8626 Bad result set with contractions when using a prefix key
Adding a new virtual function MY_CHARSET_HANDLER::copy_abort().
Moving character set specific code into the corresponding implementations
(for simple, multi-byte and mbmaxlen>1 character sets).
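The exact contract is defined by the patch; as a rough sketch of the concept
(the pure-ASCII validity check and the function name are placeholders),
copying aborts at the first badly formed byte instead of fixing it:

  #include <cstddef>

  // Copy up to nbytes, stopping before the first invalid byte.
  // Real code dispatches per charset via MY_CHARSET_HANDLER::copy_abort().
  static size_t copy_abort_ascii_sketch(char *dst, const char *src,
                                        size_t nbytes)
  {
    size_t i;
    for (i= 0; i < nbytes; i++)
    {
      if ((unsigned char) src[i] >= 0x80)   // "badly formed" for pure ASCII
        break;                              // abort instead of fixing
      dst[i]= src[i];
    }
    return i;                               // bytes actually copied
  }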
The problem was that my_hash_sort did not properly remove end-space characters,
so strings that should compare as identical were seen as different strings.
(Space was handled correctly, but NBSP was not.)
This caused duplicate key errors when a heap table was converted to Aria
as part of overflow in GROUP BY.
Fixed by removing all characters that compare as end space when creating the hash.
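A hedged sketch of the fix for a simple 8-bit charset (names approximated;
the real change is inside my_hash_sort_simple and friends): trailing bytes
whose sort weight equals the weight of ' ' are excluded before hashing, so
'a', 'a ' and 'a<NBSP>' hash identically whenever the collation compares
them as equal.

  #include <cstddef>

  // sort_order: the collation's 256-entry per-byte weight table.
  static const unsigned char *skip_end_space(const unsigned char *key,
                                             const unsigned char *end,
                                             const unsigned char *sort_order)
  {
    while (end > key && sort_order[end[-1]] == sort_order[' '])
      end--;
    return end;                    // hash only the bytes in [key, end)
  }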
Other things:
- Fixed that --sorted_result also works for errors in mysqltest.
- Speed up hash by not comparing strings that have different hashes.
- Speed up many my_hash_sort functions by using registers to calculate the hash instead of pointers.
This was previously done for some functions, but not for all.
- Made a macro of the hash function, to simplify code and to be able to experiment with new hash functions.
client/mysqltest.cc:
Fixed that --sorted_result also works for error messages.
mysql-test/r/ctype_partitions.result:
New test to ensure that partitioning by hash works
mysql-test/suite/multi_source/gtid.result:
Updated result
mysql-test/suite/multi_source/gtid.test:
Test that --sorted_result works for error messages
mysql-test/suite/multi_source/gtid_ignore_duplicates.result:
Updated result
mysql-test/suite/multi_source/gtid_ignore_duplicates.test:
Updated result
mysql-test/suite/multi_source/load_data.result:
Updated result
mysql-test/suite/multi_source/load_data.test:
Updated result
mysql-test/t/ctype_partitions.test:
New test to ensure that partitioning by hash works
storage/heap/hp_write.c:
Speed up hash by not comparing strings that have different hashes.
storage/maria/ma_check.c:
Extra debug
strings/ctype-bin.c:
Use macro for hash function
strings/ctype-latin1.c:
Use macro for hash function
Use registers to calculate hash (speedup)
strings/ctype-mb.c:
Use macro for hash function
Use registers to calculate hash (speedup)
strings/ctype-simple.c:
Use macro for hash function
Use same variable names as in other my_hash_sort functions.
Update my_hash_sort_simple() to properly remove end space (patch by Bar)
strings/ctype-uca.c:
Ignore duplicated space inside strings and end space in my_hash_sort_uca(). This fixed MDEV-6255
Use macro for hash function
Use registers to calculate hash (speedup)
strings/ctype-ucs2.c:
Use macro for hash function
Use registers to calculate hash (speedup)
strings/ctype-utf8.c:
Use macro for hash function
Use registers to calculate hash (speedup)
strings/strings_def.h:
Made a macro of the hash function, to simplify code and to be able to experiment with new hash functions.
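The macro in strings_def.h could look roughly like the classic my_hash_sort
mixing step below (treat the exact constants and the MY_HASH_ADD name as an
approximation of the real header):

  #include <cstddef>

  // Mix one input byte/weight into the running hash state.
  #define MY_HASH_ADD(A, B, value)                        \
    do                                                    \
    {                                                     \
      (A)^= (((A) & 63) + (B)) * (value) + ((A) << 8);    \
      (B)+= 3;                                            \
    } while (0)

  // Keep the state in local variables (registers), not behind pointers.
  static void hash_bytes_sketch(const unsigned char *key, size_t len,
                                unsigned long *nr1, unsigned long *nr2)
  {
    unsigned long n1= *nr1, n2= *nr2;
    for (const unsigned char *end= key + len; key < end; key++)
      MY_HASH_ADD(n1, n2, (unsigned long) *key);
    *nr1= n1;
    *nr2= n2;
  }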
LIKE COMPARISONS FAIL FOR HALF WIDTH KATAKANA WHEN SJIS/CP932 COLLATIONS ARE USED.
ISSUE :
-------
Code points of HALF WIDTH KATAKANA in SJIS/CP932 range from
A1 to DF. In the function my_wildcmp_mb_bin_impl, while comparing
such single-byte code points, there is code which compares a
signed character with an unsigned character. Because of this,
a comparison of two identical code points representing a HALF
WIDTH KATAKANA character always fails.
Solution:
---------
A code point of HALF WIDTH KATAKANA needs at least 8 bits.
Promoting the variable from uchar to int fixes the issue.
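A self-contained demonstration of the failure mode on a platform where plain
char is signed:

  #include <cstdio>

  int main()
  {
    char a= '\xA1';            // a HALF WIDTH KATAKANA byte as plain char
    unsigned char b= 0xA1;     // the same byte, unsigned

    // a promotes to int -95, b promotes to int 161, so the "same"
    // code point compares as different:
    std::printf("naive: %s\n", a == b ? "equal" : "different");

    // Promoting through a wide enough type compares the code points:
    std::printf("fixed: %s\n", (int)(unsigned char) a == (int) b
                               ? "equal" : "different");
    return 0;
  }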
mysql-test/t/ctype_cp932.test:
Tests which have conditions
LIKE 'STRING PATTERN WITH HALF WIDTH KATAKANA'.
strings/ctype-mb.c:
A code point of HALF WIDTH KATAKANA needs at least 8 bits.
Promoting the variable from uchar to int fixes the issue.
Description: A typo in create_tailoring() causes the "contraction_flags" to be written
into cs->contractions in the wrong place. This causes two problems:
(1) Anyone relying on `contraction_flags` to decide "could this character be
part of a contraction" is 100% broken.
(2) Anyone relying on `contractions` to determine the weight of a contraction
is mostly broken
Analysis: When we are preparing the contractions in create_tailoring(), we are
corrupting the cs->contractions memory location, which is supposed to store
the weights (8 KB) plus the contraction information (256 bytes). Because of a
logic flaw in the code, we started storing the contraction information after
the 4 KB location instead.
Fix: When we create the contractions, we need to compute the address of the
contraction information as (char*) (cs->contractions + 0x40*0x40) instead of
((char*) cs->contractions) + 0x40*0x40. This moves the address 8 KB past the
start of cs->contractions, and the contraction information is stored there.
Similarly, when calculating it for LIKE range queries, we need to read it
from the 8 KB offset onwards, by changing the logic to
(const char*) (cs->contractions + 0x40*0x40). And for ucs2 charsets we need
to modify my_cs_can_be_contraction_head() and my_cs_can_be_contraction_tail()
to point past the 8 KB boundary as well.
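A self-contained illustration of the two address calculations, assuming
cs->contractions points to uint16 weights as the analysis describes:

  #include <cstdint>
  #include <cstdio>

  int main()
  {
    static uint16_t contractions[0x40 * 0x40 + 128]; // weights + flag space

    // Buggy form: the cast happens first, so the offset counts bytes
    // and lands only 4 KB (0x1000 bytes) into the weights:
    char *flags_bad= ((char *) contractions) + 0x40 * 0x40;

    // Fixed form: pointer arithmetic on uint16_t* counts elements,
    // so this lands 0x40*0x40*2 = 8 KB past the start, after all weights:
    char *flags_good= (char *) (contractions + 0x40 * 0x40);

    std::printf("bad:  %td bytes\n", flags_bad - (char *) contractions);
    std::printf("good: %td bytes\n", flags_good - (char *) contractions);
    return 0;
  }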