Step#1: fixing the return type of strnxfrm() from size_t to this structure:
typedef struct
{
size_t m_output_length;
size_t m_source_length_used;
uint m_warnings;
} my_strnxfrm_ret_t;
This patch also fixes:
MDEV-33050 Build-in schemas like oracle_schema are accent insensitive
MDEV-33084 LASTVAL(t1) and LASTVAL(T1) do not work well with lower-case-table-names=0
MDEV-33085 Tables T1 and t1 do not work well with ENGINE=CSV and lower-case-table-names=0
MDEV-33086 SHOW OPEN TABLES IN DB1 -- is case insensitive with lower-case-table-names=0
MDEV-33088 Cannot create triggers in the database `MYSQL`
MDEV-33103 LOCK TABLE t1 AS t2 -- alias is not case sensitive with lower-case-table-names=0
MDEV-33109 DROP DATABASE MYSQL -- does not drop SP with lower-case-table-names=0
MDEV-33110 HANDLER commands are case insensitive with lower-case-table-names=0
MDEV-33119 User is case insensitive in INFORMATION_SCHEMA.VIEWS
MDEV-33120 System log table names are case insensitive with lower-cast-table-names=0
- Removing the virtual function strnncoll() from MY_COLLATION_HANDLER
- Adding a wrapper function CHARSET_INFO::streq(), to compare
two strings for equality. For now it calls strnncoll() internally.
In the future it will turn into a virtual function.
- Adding new accent sensitive case insensitive collations:
- utf8mb4_general1400_as_ci
- utf8mb3_general1400_as_ci
They implement accent sensitive case insensitive comparison.
The weight of a character is equal to the code point of its
upper case variant. These collations use Unicode-14.0.0 casefolding data.
The result of
my_charset_utf8mb3_general1400_as_ci.strcoll()
is very close to the former
my_charset_utf8mb3_general_ci.strcasecmp()
There is only a difference in a couple dozen rare characters, because:
- the switch from "tolower" to "toupper" comparison, to make
utf8mb3_general1400_as_ci closer to utf8mb3_general_ci
- the switch from Unicode-3.0.0 to Unicode-14.0.0
This difference should be tolarable. See the list of affected
characters in the MDEV description.
Note, utf8mb4_general1400_as_ci correctly handles non-BMP characters!
Unlike utf8mb4_general_ci, it does not treat all BMP characters
as equal.
- Adding classes representing names of the file based database objects:
Lex_ident_db
Lex_ident_table
Lex_ident_trigger
Their comparison collation depends on the underlying
file system case sensitivity and on --lower-case-table-names
and can be either my_charset_bin or my_charset_utf8mb3_general1400_as_ci.
- Adding classes representing names of other database objects,
whose names have case insensitive comparison style,
using my_charset_utf8mb3_general1400_as_ci:
Lex_ident_column
Lex_ident_sys_var
Lex_ident_user_var
Lex_ident_sp_var
Lex_ident_ps
Lex_ident_i_s_table
Lex_ident_window
Lex_ident_func
Lex_ident_partition
Lex_ident_with_element
Lex_ident_rpl_filter
Lex_ident_master_info
Lex_ident_host
Lex_ident_locale
Lex_ident_plugin
Lex_ident_engine
Lex_ident_server
Lex_ident_savepoint
Lex_ident_charset
engine_option_value::Name
- All the mentioned Lex_ident_xxx classes implement a method streq():
if (ident1.streq(ident2))
do_equal();
This method works as a wrapper for CHARSET_INFO::streq().
- Changing a lot of "LEX_CSTRING name" to "Lex_ident_xxx name"
in class members and in function/method parameters.
- Replacing all calls like
system_charset_info->coll->strcasecmp(ident1, ident2)
to
ident1.streq(ident2)
- Taking advantage of the c++11 user defined literal operator
for LEX_CSTRING (see m_strings.h) and Lex_ident_xxx (see lex_ident.h)
data types. Use example:
const Lex_ident_column primary_key_name= "PRIMARY"_Lex_ident_column;
is now a shorter version of:
const Lex_ident_column primary_key_name=
Lex_ident_column({STRING_WITH_LEN("PRIMARY")});
This is a refactoring patch, it does not change the behaviour.
The MTR tests are being added only to cover the LIKE predicate better.
(these tests should have been added earlier under terms of MDEV 9711).
This patch does not need its own specific MTR tests.
Moving the duplicate code into a new shared file ctype-wildcmp.inl
and including it from multiple places, to define the following functions:
- my_wildcmp_uca_impl(), in ctype-uca.c
For utf8mb3, utf8mb4, ucs2, utf16, utf32, using cs->cset->mb_wc().
For UCA based collations.
- my_wildcmp_mb2_or_mb4_general_ci_impl(), in ctype-ucs2.c:
For ucs2, utf16, utf32, using cs->cset->mb_wc().
For general_ci-style collations:
- xxx_general_ci
- xxx_general_mysql500_ci
- xxx_general_nopad_ci
- my_wildcmp_mb2_or_mb4_bin_impl(), in ctype-ucs2.c:
For ucs2, utf16, utf32, using cs->cset->mb_wc().
For _bin collations:
- xxx_bin
- xxx_nopad_bin
- my_wildcmp_utf8mb3_general_ci_impl(), in ctype-utf8.c
Optimized for utf8mb3, using my_mb_wc_utf8mb3_quick().
For general_ci-style collations:
- utf8mb3_general_ci
- utf8mb3_general_mysql500_ci
- utf8mb3_general_nopad_ci
- my_wildcmp_utf8mb4_general_ci_impl(), in ctype-utf8.c
Optimized for utf8mb4, using my_mb_wc_utf8mb4_quick().
For general_ci-style collations:
- utf8mb4_general_ci
- utf8mb4_general_nopad_ci
Under terms of MDEV 27490 we'll add support for non-BMP identifiers
and upgrade casefolding information to Unicode version 14.0.0.
In Unicode-14.0.0 conversion to lower and upper cases can increase octet length
of the string, so conversion won't be possible in-place any more.
This patch removes virtual functions performing in-place casefolding:
- my_charset_handler_st::casedn_str()
- my_charset_handler_st::caseup_str()
and fixes the code to use the non-inplace functions instead:
- my_charset_handler_st::casedn()
- my_charset_handler_st::caseup()
(Variant#3: Allow cross-charset comparisons, use a special
CHARSET_INFO to create lookup keys. Review input addressed.)
Equalities that compare utf8mb{3,4}_general_ci strings, like:
WHERE ... utf8mb3_key_col=utf8mb4_value (MB3-4-CMP)
can now be used to construct ref[const] access and also participate
in multiple-equalities.
This means that utf8mb3_key_col can be used for key-lookups when
compared with an utf8mb4 constant, field or expression using '=' or
'<=>' comparison operators.
This is controlled by optimizer_switch='cset_narrowing=on', which is
OFF by default.
IMPLEMENTATION
Item value comparison in (MB3-4-CMP) is done using utf8mb4_general_ci.
This is valid as any utf8mb3 value is also an utf8mb4 value.
When making index lookup value for utf8mb3_key_col, we do "Charset
Narrowing": characters that are in the Basic Multilingual Plane (=BMP) are
copied as-is, as they can be represented in utf8mb3. Characters that are
outside the BMP cannot be represented in utf8mb3 and are replaced
with U+FFFD, the "Replacement Character".
In utf8mb4_general_ci, the Replacement Character compares as equal to any
character that's not in BMP. Because of this, the constructed lookup value
will find all index records that would be considered equal by the original
condition (MB3-4-CMP).
Approved-by: Monty <monty@mariadb.org>
In the hexlo function, the element type of the array hex_lo_digit is not
explicitly declared as signed char, causing elements with a value of -1
to be converted to 255 on Arm64. The problem occurs because "char" is
unsigned by default on Arm64 compiler, but signed on x86 compiler. This
problem can be seen in https://godbolt.org/z/rT775xshj
The above issue causes "use-after-poison" exception in my_mb_wc_filename
function. The code snippet where the error occurred is shown below,
copied from below link.
5fc19e7137/strings/ctype-utf8.c (L2728)
2728 if ((byte1= hexlo(byte1)) >= 0 &&
2729 (byte2= hexlo(byte2)) >= 0)
{
2731 int byte3= hexlo(s[3]);
…
}
At line 2729, when byte2 is 0, which indicates the end of the string s.
(1) On x86, hexlo(0) return -1 and line 2731 is skipped, as expected.
(2) On Arm64, hexlo(0) return 255 and line 2731 is executed, not as
expected, accessing s[3] after the null character of string s, thus
raising the "user-after-poison" error.
The problem was discovered when executing the main.mysqlcheck test.
Signed-off-by: Xiaotong Niu <xiaotong.niu@arm.com>
(Review input addressed)
(Added handling of UPDATE/DELETE and partitioning w/o index)
If the properties of the used collation allow, do the following
equivalent rewrites:
1. UPPER(key_col)=expr -> key_col=expr
expr=UPPER(key_col) -> expr=key_col
(also rewrite both sides of the equality at the same time)
2. UPPER(key_col) IN (constant-list) -> key_col IN (constant-list)
- Mark utf8mb{3,4}_general_ci as collations that allow this.
- Add optimizer_switch='sargable_casefold=ON' to control this.
(ON by default in this patch)
- Cover the rewrite in Optimizer Trace, rewrite name is
"sargable_casefold_removal".
This is a non-functional change. It changes the way how case folding data
and weight data (for simple Unicode collations) are stored:
- Removing data types MY_UNICASE_CHARACTER, MY_UNICASE_INFO
- Using data types MY_CASEFOLD_CHARACTER, MY_CASEFOLD_INFO instead.
This patch changes simple Unicode collations in a similar way
how MDEV-30695 previously changed Asian collations.
No new MTR tests are needed. The underlying code is thoroughly
covered by a number of ctype_*_ws.test and ctype_*_casefold.test
files, which were added recently as a preparation
for this change.
Old and new Unicode data layout
-------------------------------
Case folding data is now stored in separate tables
consisting of MY_CASEFOLD_CHARACTER elements with two members:
typedef struct casefold_info_char_t
{
uint32 toupper;
uint32 tolower;
} MY_CASEFOLD_CHARACTER;
while weight data (for simple non-UCA collations xxx_general_ci
and xxx_general_mysql500_ci) is stored in separate arrays of
uint16 elements.
Before this change case folding data and simple weight data were
stored together, in tables of the following elements with three members:
typedef struct unicase_info_char_st
{
uint32 toupper;
uint32 tolower;
uint32 sort; /* weights for simple collations */
} MY_UNICASE_CHARACTER;
This data format was redundant, because weights (the "sort" member) were
needed only for these two simple Unicode collations:
- xxx_general_ci
- xxx_general_mysql500_ci
Adding case folding information for Unicode-14.0.0 using the old
format would waste memory without purpose.
Detailed changes
----------------
- Changing the underlying data types as described above
- Including unidata-dump.c into the sources.
This program was earlier used to dump UnicodeData.txt
(e.g. https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt)
into MySQL / MariaDB source files.
It was originally written in 2002, but has not been distributed yet
together with MySQL / MariaDB sources.
- Removing the old format Unicode data earlier dumped from UnicodeData.txt
(versions 3.0.0 and 5.2.0) from ctype-utf8.c.
Adding Unicode data in the new format into separate header files,
to maintain the code easier:
- ctype-unicode300-casefold.h
- ctype-unicode300-casefold-tr.h
- ctype-unicode300-general_ci.h
- ctype-unicode300-general_mysql500_ci.h
- ctype-unicode520-casefold.h
- Adding a new file ctype-unidata.c as an aggregator for
the header files listed above.
Removing similar functions from ctype-utf8.c and ctype-ucs2.c
- my_tosort_utf16()
- my_tosort_utf32()
- my_tosort_ucs2()
- my_tosort_unicode()
Adding new shared functions into ctype-unidata.h:
- my_tosort_unicode_bmp() - reused for utf8mb3, ucs2
- my_tosort_unicode() - reused for utf8mb4, utf16, utf32
For simplicity, the new version of my_tosort_unicode*()
does not include the code handling the MY_CS_LOWER_SORT flag because:
- it affects performance negatively
- we don't have any collations with this flag yet anyway
(This code was most likely earlier erroneously merged from
MySQL's utf8_tolower_ci at some point.)
1. Adding a separate MY_COLLATION_HANDLER
my_collation_ucs2_general_mysql500_ci_handler
implementing a proper order for ucs2_general_mysql500_ci
The problem happened because ucs2_general_mysql500_ci
erroneously used my_collation_ucs2_general_ci_handler.
2. Cosmetic changes: Renaming:
- plane00_mysql500 to my_unicase_mysql500_page00
- my_unicase_pages_mysql500 to my_unicase_mysql500_pages
to use the same naming style with:
- my_unicase_default_page00
- my_unicase_defaul_pages
3. Moving code fragments from
- handler::check_collation_compatibility() in handler.cc
- upgrade_collation() in table.cc
into new methods in class Charset, to reuse the code easier.
This is a non-functional change and should not change the server behavior.
Casefolding information is now stored in items of a new data type MY_CASEFOLD_CHARACTER:
typedef struct casefold_info_char_t
{
uint32 toupper;
uint32 tolower;
} MY_CASEFOLD_CHARACTER;
Before this change, casefolding tables for Asian collations were stored in:
typedef struct unicase_info_char_st
{
uint32 toupper;
uint32 tolower;
uint32 sort;
} MY_UNICASE_CHARACTER;
The "sort" member was not used in the code handling Asian collations,
it only wasted space.
(it's only used by Unicode _general_ci and _general_mysql500_ci collations).
Unicode collations (at least UCA and _bin) should also be refactored later,
but under terms of a separate task.
String length growth during upper/lower conversion
in Unicode collations depends only on the underlying MY_UNICASE_INFO
used in the collation.
Maintaining a separate member CHARSET_INFO::caseup_multiply and
CHARSET_INFO::casedn_multiply duplicated this information
and caused bugs like this (when MY_UNICASE_INFO and case??_multiply
when out of sync because of incomplete CHARSET_INFO initialization).
Fix:
Changing CHARSET_INFO::caseup_multiply and CHARSET_INFO::casedn_multiply
from members to virtual functions.
The virtual functions in Unicode collations calculate case conversion
growth factors from the MY_UNICASE_INFO. This guarantees that the growth
factors are always in sync with the MY_UNICASE_INFO.
- Added one neutral and 22 tailored (language specific) collations based on
Unicode Collation Algorithm version 14.0.0.
Collations were added for Unicode character sets
utf8mb3, utf8mb4, ucs2, utf16, utf32.
Every tailoring was added with four accent and case
sensitivity flag combinations, e.g:
* utf8mb4_uca1400_swedish_as_cs
* utf8mb4_uca1400_swedish_as_ci
* utf8mb4_uca1400_swedish_ai_cs
* utf8mb4_uca1400_swedish_ai_ci
and their _nopad_ variants:
* utf8mb4_uca1400_swedish_nopad_as_cs
* utf8mb4_uca1400_swedish_nopad_as_ci
* utf8mb4_uca1400_swedish_nopad_ai_cs
* utf8mb4_uca1400_swedish_nopad_ai_ci
- Introducing a conception of contextually typed named collations:
CREATE DATABASE db1 CHARACTER SET utf8mb4;
CREATE TABLE db1.t1 (a CHAR(10) COLLATE uca1400_as_ci);
The idea is that there is no a need to specify the character set prefix
in the new collation names. It's enough to type just the suffix
"uca1400_as_ci". The character set is taken from the context.
In the above example script the context character set is utf8mb4.
So the CREATE TABLE will make a column with the collation
utf8mb4_uca1400_as_ci.
Short collations names can be used in any parts of the SQL syntax
where the COLLATE clause is understood.
- New collations are displayed only one time
(without character set combinations) by these statements:
SELECT * FROM INFORMATION_SCHEMA.COLLATIONS;
SHOW COLLATION;
For example, all these collations:
- utf8mb3_uca1400_swedish_as_ci
- utf8mb4_uca1400_swedish_as_ci
- ucs2_uca1400_swedish_as_ci
- utf16_uca1400_swedish_as_ci
- utf32_uca1400_swedish_as_ci
have just one entry in INFORMATION_SCHEMA.COLLATIONS and SHOW COLLATION,
with COLLATION_NAME equal to "uca1400_swedish_as_ci", which is the suffix
without the character set name:
SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLLATIONS
WHERE COLLATION_NAME LIKE '%uca1400_swedish_as_ci';
+-----------------------+
| COLLATION_NAME |
+-----------------------+
| uca1400_swedish_as_ci |
+-----------------------+
Note, the behaviour of old collations did not change.
Non-unicode collations (e.g. latin1_swedish_ci) and
old UCA-4.0.0 collations (e.g. utf8mb4_unicode_ci)
are still displayed with the character set prefix, as before.
- The structure of the table INFORMATION_SCHEMA.COLLATIONS was changed.
The NOT NULL constraint was removed from these columns:
- CHARACTER_SET_NAME
- ID
- IS_DEFAULT
and from the corresponding columns in SHOW COLLATION.
For example:
SELECT COLLATION_NAME, CHARACTER_SET_NAME, ID, IS_DEFAULT
FROM INFORMATION_SCHEMA.COLLATIONS
WHERE COLLATION_NAME LIKE '%uca1400_swedish_as_ci';
+-----------------------+--------------------+------+------------+
| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT |
+-----------------------+--------------------+------+------------+
| uca1400_swedish_as_ci | NULL | NULL | NULL |
+-----------------------+--------------------+------+------------+
The NULL value in these columns now means that the collation
is applicable to multiple character sets.
The behavioir of old collations did not change.
Make sure your client programs can handle NULL values in these columns.
- The structure of the table
INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY was changed.
Three new NOT NULL columns were added:
- FULL_COLLATION_NAME
- ID
- IS_DEFAULT
New collations have multiple entries in COLLATION_CHARACTER_SET_APPLICABILITY.
The column COLLATION_NAME contains the collation name without the character
set prefix. The column FULL_COLLATION_NAME contains the collation name with
the character set prefix.
Old collations have full collation name in both FULL_COLLATION_NAME and
COLLATION_NAME.
SELECT COLLATION_NAME, FULL_COLLATION_NAME, CHARACTER_SET_NAME, ID, IS_DEFAULT
FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
WHERE FULL_COLLATION_NAME RLIKE '^(utf8mb4|latin1).*swedish.*ci$';
+-----------------------------+-------------------------------------+--------------------+------+------------+
| COLLATION_NAME | FULL_COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT |
+-----------------------------+-------------------------------------+--------------------+------+------------+
| latin1_swedish_ci | latin1_swedish_ci | latin1 | 8 | Yes |
| latin1_swedish_nopad_ci | latin1_swedish_nopad_ci | latin1 | 1032 | |
| utf8mb4_swedish_ci | utf8mb4_swedish_ci | utf8mb4 | 232 | |
| uca1400_swedish_ai_ci | utf8mb4_uca1400_swedish_ai_ci | utf8mb4 | 2368 | |
| uca1400_swedish_as_ci | utf8mb4_uca1400_swedish_as_ci | utf8mb4 | 2370 | |
| uca1400_swedish_nopad_ai_ci | utf8mb4_uca1400_swedish_nopad_ai_ci | utf8mb4 | 2372 | |
| uca1400_swedish_nopad_as_ci | utf8mb4_uca1400_swedish_nopad_as_ci | utf8mb4 | 2374 | |
+-----------------------------+-------------------------------------+--------------------+------+------------+
- Other INFORMATION_SCHEMA queries:
SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.PARAMETERS;
SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES;
SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA;
SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.ROUTINES;
SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.EVENTS;
SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.EVENTS;
SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.ROUTINES;
SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.ROUTINES;
SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.TRIGGERS;
SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.TRIGGERS;
SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.VIEWS;
display full collation names, including character sets prefix,
for all collations, including new collations.
Corresponding SHOW commands also display full collation names
in collation related columns:
SHOW CREATE TABLE t1;
SHOW CREATE DATABASE db1;
SHOW TABLE STATUS;
SHOW CREATE FUNCTION f1;
SHOW CREATE PROCEDURE p1;
SHOW CREATE EVENT ev1;
SHOW CREATE TRIGGER tr1;
SHOW CREATE VIEW;
These INFORMATION_SCHEMA queries and SHOW statements may change in
the future, to display show collation names.
This change removed 68 explict strlen() calls from the code.
The following renames was done to ensure we don't use the old names
when merging code from earlier releases, as using the new variables
for print function could result in crashes:
- charset->csname renamed to charset->cs_name
- charset->name renamed to charset->coll_name
Almost everything where mechanical changes except:
- Changed to use the new Protocol::store(LEX_CSTRING..) when possible
- Changed to use field->store(LEX_CSTRING*, CHARSET_INFO*) when possible
- Changed to use String->append(LEX_CSTRING&) when possible
Other things:
- There where compiler issues with ensuring that all character set names
points to the same string: gcc doesn't allow one to use integer constants
when defining global structures (constant char * pointers works fine).
To get around this, I declared defines for each character set name
length.
This patch changes the main name of 3 byte character set from utf8 to
utf8mb3. New old_mode UTF8_IS_UTF8MB3 is added and set TRUE by default,
so that utf8 would mean utf8mb3. If not set, utf8 would mean utf8mb4.
This patch ensures that all identical character sets shares the same
cs->csname.
This allows us to replace strcmp() in my_charset_same() with comparisons
of pointers. This fixes a long standing performance issue that could cause
as strcmp() for every item sent trough the protocol class to the end user.
One consequence of this patch is that we don't allow one to add a character
definition in the Index.xml file that changes the csname of an existing
character set. This is by design as changing character set names of existing
ones is extremely dangerous, especially as some storage engines just records
character set numbers.
As we now have a hash over character set's csname, we can in the future
use that for faster access to a specific character set. This could be done
by changing the hash to non unique and use the hash to find the next
character set with same csname.
Replacing the slow loop in my_hash_sort_utf8mbX() to the fast
skip_trailing_spaces(), which consumes 8 bytes in one iteration,
and is around 8 times faster on long data.
Also, renaming:
- my_hash_sort_utf8() to my_hash_sort_utf8mb3()
- my_hash_sort_utf8_nopad() to my_hash_sort_utf8mb3_nopad()
to merge to 10.5 easier (automatically?).
The code did not take into account that:
- U+005C (backslash) can occupy more than mbminlen characters (e.g. in sjis)
- Some character sets do not have a code for U+005C (e.g. swe7)
Adding a new function my_wc_to_printable into MY_CHARSET_HANDLER to
cover all special cases easier.