This is done by mapping most of the existing MySQL unicode 0900 collations
to MariadB 1400 unicode collations. The assumption is that 1400 is a super
set of 0900 for all practical purposes.
I also added a new function 'compare_collations()' and changed most code
to use this instead of comparing character sets directly.
This enables one to seamlessly mix-and-match the corresponding 0900 and
1400 sets. Field comparision and alter table treats the character sets
as identical.
All MySQL 8.0 0900 collations are supported except:
- utf8mb4_ja_0900_as_cs
- utf8mb4_ja_0900_as_cs_ks
- utf8mb4_ru_0900_as_cs
- utf8mb4_zh_0900_as_cs
These do not have corresponding entries in the MariadB 01400 collations.
Other things:
- Added COMMENT colum to information_schema.collations. For utf8mb4_0900
colletions it contains the corresponding alias collation.
strerror_s on Linux will, for unknown error codes, display
'Unknown error <codenum>' and our tests are written with this assumption.
However, on macOS, sterror_s returns 'Unknown error: <codenum>' in the
same case, which breaks tests. Make my_strerror consistent across the
platforms by removing the ':' when present.
During a query execution some sorting and grouping operations
on strings may be involved. System variable max_sort_length defines
the maximum number of bytes to use when comparing strings during
sorting/grouping. Thus, the comparable parts of strings may be less
than their actual size, so the results of the query may be not
sorted/grouped properly.
To indicate that some comparisons were done on a truncated lengths,
a new warning has been introduced with this commit.
Step#1: fixing the return type of strnxfrm() from size_t to this structure:
typedef struct
{
size_t m_output_length;
size_t m_source_length_used;
uint m_warnings;
} my_strnxfrm_ret_t;
The code in my_strtoll10_mb2 and my_strtoll10_utf32
could hit undefinite behavior by negation of LONGLONG_MIN.
Fixing to avoid this.
Also, fixing my_strtoll10() in the same style.
The previous reduction produced a redundant warning on
CAST(_latin1'-9223372036854775808' AS SIGNED)
The code in my_strntoull_8bit() and my_strntoull_mb2_or_mb4()
could hit undefinite behavior by negating of LONGLONG_MIN.
Fixing the code to avoid this.
This patch fixes two problems:
- The code inside my_strtod_int() in strings/dtoa.c could test the byte
behind the end of the string when processing the mantissa.
Rewriting the code to avoid this.
- The code in test_if_number() in sql/sql_analyse.cc called my_atof()
which is unsafe and makes the called my_strtod_int() look behind
the end of the string if the input string is not 0-terminated.
Fixing test_if_number() to use my_strtod() instead, passing the correct
end pointer.
nullptr+0 is an UB (undefined behavior).
- Fixing my_string_metadata_get_mb() to handle {nullptr,0} without UB.
- Fixing THD::copy_with_error() to disallow {nullptr,0} by DBUG_ASSERT().
- Fixing parse_client_handshake_packet() to call THD::copy_with_error()
with an empty string {"",0} instead of NULL string {nullptr,0}.
- Fixing the code in get_interval_value() to use Longlong_hybrid_null.
This allows to handle correctly:
- Signed and unsigned arguments
(the old code assumed the argument to be signed)
- Avoid undefined negation behavior the corner case with LONGLONG_MIN
This fixes the UBSAN warning:
negation of -9223372036854775808 cannot be represented
in type 'long long int';
- Fixing the code in get_interval_value() to avoid overflow in
the INTERVAL_QUARTER and INTERVAL_WEEK branches.
This fixes the UBSAN warning:
signed integer overflow: -9223372036854775808 * 7 cannot be represented
in type 'long long int'
- Fixing the INTERVAL_WEEK branch in date_add_interval() to handle
huge numbers correctly. Before the change, huge positive numeber
were treated as their negative complements.
Note, some other branches still can be affected by this problem
and should also be fixed eventually.
Fixing the condition to raise an overflow in the ulonglong
representation of the number is greater or equal to 0x8000000000000000ULL.
Before this change the condition did not catch -9223372036854775808
(the smallest possible signed negative longlong number).
The problem was introduced by MDEV-30879.
The function my_strntoll_8bit() was correctly changed by MDEV-30879.
The function my_strntoll_mb2_or_mb4() was not.
Applying the missing change to my_strntoll_mb2_or_mb4().
This patch also fixes:
MDEV-33050 Build-in schemas like oracle_schema are accent insensitive
MDEV-33084 LASTVAL(t1) and LASTVAL(T1) do not work well with lower-case-table-names=0
MDEV-33085 Tables T1 and t1 do not work well with ENGINE=CSV and lower-case-table-names=0
MDEV-33086 SHOW OPEN TABLES IN DB1 -- is case insensitive with lower-case-table-names=0
MDEV-33088 Cannot create triggers in the database `MYSQL`
MDEV-33103 LOCK TABLE t1 AS t2 -- alias is not case sensitive with lower-case-table-names=0
MDEV-33109 DROP DATABASE MYSQL -- does not drop SP with lower-case-table-names=0
MDEV-33110 HANDLER commands are case insensitive with lower-case-table-names=0
MDEV-33119 User is case insensitive in INFORMATION_SCHEMA.VIEWS
MDEV-33120 System log table names are case insensitive with lower-cast-table-names=0
- Removing the virtual function strnncoll() from MY_COLLATION_HANDLER
- Adding a wrapper function CHARSET_INFO::streq(), to compare
two strings for equality. For now it calls strnncoll() internally.
In the future it will turn into a virtual function.
- Adding new accent sensitive case insensitive collations:
- utf8mb4_general1400_as_ci
- utf8mb3_general1400_as_ci
They implement accent sensitive case insensitive comparison.
The weight of a character is equal to the code point of its
upper case variant. These collations use Unicode-14.0.0 casefolding data.
The result of
my_charset_utf8mb3_general1400_as_ci.strcoll()
is very close to the former
my_charset_utf8mb3_general_ci.strcasecmp()
There is only a difference in a couple dozen rare characters, because:
- the switch from "tolower" to "toupper" comparison, to make
utf8mb3_general1400_as_ci closer to utf8mb3_general_ci
- the switch from Unicode-3.0.0 to Unicode-14.0.0
This difference should be tolarable. See the list of affected
characters in the MDEV description.
Note, utf8mb4_general1400_as_ci correctly handles non-BMP characters!
Unlike utf8mb4_general_ci, it does not treat all BMP characters
as equal.
- Adding classes representing names of the file based database objects:
Lex_ident_db
Lex_ident_table
Lex_ident_trigger
Their comparison collation depends on the underlying
file system case sensitivity and on --lower-case-table-names
and can be either my_charset_bin or my_charset_utf8mb3_general1400_as_ci.
- Adding classes representing names of other database objects,
whose names have case insensitive comparison style,
using my_charset_utf8mb3_general1400_as_ci:
Lex_ident_column
Lex_ident_sys_var
Lex_ident_user_var
Lex_ident_sp_var
Lex_ident_ps
Lex_ident_i_s_table
Lex_ident_window
Lex_ident_func
Lex_ident_partition
Lex_ident_with_element
Lex_ident_rpl_filter
Lex_ident_master_info
Lex_ident_host
Lex_ident_locale
Lex_ident_plugin
Lex_ident_engine
Lex_ident_server
Lex_ident_savepoint
Lex_ident_charset
engine_option_value::Name
- All the mentioned Lex_ident_xxx classes implement a method streq():
if (ident1.streq(ident2))
do_equal();
This method works as a wrapper for CHARSET_INFO::streq().
- Changing a lot of "LEX_CSTRING name" to "Lex_ident_xxx name"
in class members and in function/method parameters.
- Replacing all calls like
system_charset_info->coll->strcasecmp(ident1, ident2)
to
ident1.streq(ident2)
- Taking advantage of the c++11 user defined literal operator
for LEX_CSTRING (see m_strings.h) and Lex_ident_xxx (see lex_ident.h)
data types. Use example:
const Lex_ident_column primary_key_name= "PRIMARY"_Lex_ident_column;
is now a shorter version of:
const Lex_ident_column primary_key_name=
Lex_ident_column({STRING_WITH_LEN("PRIMARY")});
This is a refactoring patch, it does not change the behaviour.
The MTR tests are being added only to cover the LIKE predicate better.
(these tests should have been added earlier under terms of MDEV 9711).
This patch does not need its own specific MTR tests.
Moving the duplicate code into a new shared file ctype-wildcmp.inl
and including it from multiple places, to define the following functions:
- my_wildcmp_uca_impl(), in ctype-uca.c
For utf8mb3, utf8mb4, ucs2, utf16, utf32, using cs->cset->mb_wc().
For UCA based collations.
- my_wildcmp_mb2_or_mb4_general_ci_impl(), in ctype-ucs2.c:
For ucs2, utf16, utf32, using cs->cset->mb_wc().
For general_ci-style collations:
- xxx_general_ci
- xxx_general_mysql500_ci
- xxx_general_nopad_ci
- my_wildcmp_mb2_or_mb4_bin_impl(), in ctype-ucs2.c:
For ucs2, utf16, utf32, using cs->cset->mb_wc().
For _bin collations:
- xxx_bin
- xxx_nopad_bin
- my_wildcmp_utf8mb3_general_ci_impl(), in ctype-utf8.c
Optimized for utf8mb3, using my_mb_wc_utf8mb3_quick().
For general_ci-style collations:
- utf8mb3_general_ci
- utf8mb3_general_mysql500_ci
- utf8mb3_general_nopad_ci
- my_wildcmp_utf8mb4_general_ci_impl(), in ctype-utf8.c
Optimized for utf8mb4, using my_mb_wc_utf8mb4_quick().
For general_ci-style collations:
- utf8mb4_general_ci
- utf8mb4_general_nopad_ci
Under terms of MDEV 27490 we'll add support for non-BMP identifiers
and upgrade casefolding information to Unicode version 14.0.0.
In Unicode-14.0.0 conversion to lower and upper cases can increase octet length
of the string, so conversion won't be possible in-place any more.
This patch removes virtual functions performing in-place casefolding:
- my_charset_handler_st::casedn_str()
- my_charset_handler_st::caseup_str()
and fixes the code to use the non-inplace functions instead:
- my_charset_handler_st::casedn()
- my_charset_handler_st::caseup()