mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-16 03:52:35 +01:00

Author	SHA1	Message	Date
Alexander Barkov	6075f12c65	MDEV-31071 Refactor case folding data types in Unicode collations This is a non-functional change. It changes the way how case folding data and weight data (for simple Unicode collations) are stored: - Removing data types MY_UNICASE_CHARACTER, MY_UNICASE_INFO - Using data types MY_CASEFOLD_CHARACTER, MY_CASEFOLD_INFO instead. This patch changes simple Unicode collations in a similar way how MDEV-30695 previously changed Asian collations. No new MTR tests are needed. The underlying code is thoroughly covered by a number of ctype__ws.test and ctype__casefold.test files, which were added recently as a preparation for this change. Old and new Unicode data layout ------------------------------- Case folding data is now stored in separate tables consisting of MY_CASEFOLD_CHARACTER elements with two members: typedef struct casefold_info_char_t { uint32 toupper; uint32 tolower; } MY_CASEFOLD_CHARACTER; while weight data (for simple non-UCA collations xxx_general_ci and xxx_general_mysql500_ci) is stored in separate arrays of uint16 elements. Before this change case folding data and simple weight data were stored together, in tables of the following elements with three members: typedef struct unicase_info_char_st { uint32 toupper; uint32 tolower; uint32 sort; /* weights for simple collations */ } MY_UNICASE_CHARACTER; This data format was redundant, because weights (the "sort" member) were needed only for these two simple Unicode collations: - xxx_general_ci - xxx_general_mysql500_ci Adding case folding information for Unicode-14.0.0 using the old format would waste memory without purpose. Detailed changes ---------------- - Changing the underlying data types as described above - Including unidata-dump.c into the sources. This program was earlier used to dump UnicodeData.txt (e.g. https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt) into MySQL / MariaDB source files. It was originally written in 2002, but has not been distributed yet together with MySQL / MariaDB sources. - Removing the old format Unicode data earlier dumped from UnicodeData.txt (versions 3.0.0 and 5.2.0) from ctype-utf8.c. Adding Unicode data in the new format into separate header files, to maintain the code easier: - ctype-unicode300-casefold.h - ctype-unicode300-casefold-tr.h - ctype-unicode300-general_ci.h - ctype-unicode300-general_mysql500_ci.h - ctype-unicode520-casefold.h - Adding a new file ctype-unidata.c as an aggregator for the header files listed above.	2023-04-18 11:29:25 +04:00
Alexander Barkov	2ad287caad	MDEV-31069 Reuse duplicate char-to-weight conversion code in ctype-utf8.c and ctype-ucs2.c Removing similar functions from ctype-utf8.c and ctype-ucs2.c - my_tosort_utf16() - my_tosort_utf32() - my_tosort_ucs2() - my_tosort_unicode() Adding new shared functions into ctype-unidata.h: - my_tosort_unicode_bmp() - reused for utf8mb3, ucs2 - my_tosort_unicode() - reused for utf8mb4, utf16, utf32 For simplicity, the new version of my_tosort_unicode*() does not include the code handling the MY_CS_LOWER_SORT flag because: - it affects performance negatively - we don't have any collations with this flag yet anyway (This code was most likely earlier erroneously merged from MySQL's utf8_tolower_ci at some point.)	2023-04-18 10:24:05 +04:00
Alexander Barkov	30b4bb4204	MDEV-31068 Reuse duplicate case conversion code in ctype-utf8.c and ctype-ucs2.c	2023-04-18 06:44:03 +04:00
Marko Mäkelä	a009280e60	Merge 10.9 into 10.10	2023-04-14 12:24:14 +03:00
Marko Mäkelä	5bada1246d	Merge 10.5 into 10.6	2023-04-11 16:15:19 +03:00
Oleksandr Byelkin	ac5a534a4c	Merge remote-tracking branch '10.4' into 10.5	2023-03-31 21:32:41 +02:00
Alexander Barkov	965bdf3e66	MDEV-30746 Regression in ucs2_general_mysql500_ci 1. Adding a separate MY_COLLATION_HANDLER my_collation_ucs2_general_mysql500_ci_handler implementing a proper order for ucs2_general_mysql500_ci The problem happened because ucs2_general_mysql500_ci erroneously used my_collation_ucs2_general_ci_handler. 2. Cosmetic changes: Renaming: - plane00_mysql500 to my_unicase_mysql500_page00 - my_unicase_pages_mysql500 to my_unicase_mysql500_pages to use the same naming style with: - my_unicase_default_page00 - my_unicase_defaul_pages 3. Moving code fragments from - handler::check_collation_compatibility() in handler.cc - upgrade_collation() in table.cc into new methods in class Charset, to reuse the code easier.	2023-03-01 15:38:02 +04:00
Alexander Barkov	33f8f92b74	MDEV-30695 Refactor case folding data types in Asian collations This is a non-functional change and should not change the server behavior. Casefolding information is now stored in items of a new data type MY_CASEFOLD_CHARACTER: typedef struct casefold_info_char_t { uint32 toupper; uint32 tolower; } MY_CASEFOLD_CHARACTER; Before this change, casefolding tables for Asian collations were stored in: typedef struct unicase_info_char_st { uint32 toupper; uint32 tolower; uint32 sort; } MY_UNICASE_CHARACTER; The "sort" member was not used in the code handling Asian collations, it only wasted space. (it's only used by Unicode _general_ci and _general_mysql500_ci collations). Unicode collations (at least UCA and _bin) should also be refactored later, but under terms of a separate task.	2023-02-21 14:10:25 +04:00
Alexander Barkov	7f6b648d7d	MDEV-30661 UPPER() returns an empty string for U+0251 in uca1400 collations for utf8 String length growth during upper/lower conversion in Unicode collations depends only on the underlying MY_UNICASE_INFO used in the collation. Maintaining a separate member CHARSET_INFO::caseup_multiply and CHARSET_INFO::casedn_multiply duplicated this information and caused bugs like this (when MY_UNICASE_INFO and case??_multiply when out of sync because of incomplete CHARSET_INFO initialization). Fix: Changing CHARSET_INFO::caseup_multiply and CHARSET_INFO::casedn_multiply from members to virtual functions. The virtual functions in Unicode collations calculate case conversion growth factors from the MY_UNICASE_INFO. This guarantees that the growth factors are always in sync with the MY_UNICASE_INFO.	2023-02-17 17:33:27 +04:00
Alexander Barkov	133446828c	MDEV-27009 Add UCA-14.0.0 collations - Added one neutral and 22 tailored (language specific) collations based on Unicode Collation Algorithm version 14.0.0. Collations were added for Unicode character sets utf8mb3, utf8mb4, ucs2, utf16, utf32. Every tailoring was added with four accent and case sensitivity flag combinations, e.g: * utf8mb4_uca1400_swedish_as_cs * utf8mb4_uca1400_swedish_as_ci * utf8mb4_uca1400_swedish_ai_cs * utf8mb4_uca1400_swedish_ai_ci and their _nopad_ variants: * utf8mb4_uca1400_swedish_nopad_as_cs * utf8mb4_uca1400_swedish_nopad_as_ci * utf8mb4_uca1400_swedish_nopad_ai_cs * utf8mb4_uca1400_swedish_nopad_ai_ci - Introducing a conception of contextually typed named collations: CREATE DATABASE db1 CHARACTER SET utf8mb4; CREATE TABLE db1.t1 (a CHAR(10) COLLATE uca1400_as_ci); The idea is that there is no a need to specify the character set prefix in the new collation names. It's enough to type just the suffix "uca1400_as_ci". The character set is taken from the context. In the above example script the context character set is utf8mb4. So the CREATE TABLE will make a column with the collation utf8mb4_uca1400_as_ci. Short collations names can be used in any parts of the SQL syntax where the COLLATE clause is understood. - New collations are displayed only one time (without character set combinations) by these statements: SELECT * FROM INFORMATION_SCHEMA.COLLATIONS; SHOW COLLATION; For example, all these collations: - utf8mb3_uca1400_swedish_as_ci - utf8mb4_uca1400_swedish_as_ci - ucs2_uca1400_swedish_as_ci - utf16_uca1400_swedish_as_ci - utf32_uca1400_swedish_as_ci have just one entry in INFORMATION_SCHEMA.COLLATIONS and SHOW COLLATION, with COLLATION_NAME equal to "uca1400_swedish_as_ci", which is the suffix without the character set name: SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%uca1400_swedish_as_ci'; +-----------------------+ \| COLLATION_NAME \| +-----------------------+ \| uca1400_swedish_as_ci \| +-----------------------+ Note, the behaviour of old collations did not change. Non-unicode collations (e.g. latin1_swedish_ci) and old UCA-4.0.0 collations (e.g. utf8mb4_unicode_ci) are still displayed with the character set prefix, as before. - The structure of the table INFORMATION_SCHEMA.COLLATIONS was changed. The NOT NULL constraint was removed from these columns: - CHARACTER_SET_NAME - ID - IS_DEFAULT and from the corresponding columns in SHOW COLLATION. For example: SELECT COLLATION_NAME, CHARACTER_SET_NAME, ID, IS_DEFAULT FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE '%uca1400_swedish_as_ci'; +-----------------------+--------------------+------+------------+ \| COLLATION_NAME \| CHARACTER_SET_NAME \| ID \| IS_DEFAULT \| +-----------------------+--------------------+------+------------+ \| uca1400_swedish_as_ci \| NULL \| NULL \| NULL \| +-----------------------+--------------------+------+------------+ The NULL value in these columns now means that the collation is applicable to multiple character sets. The behavioir of old collations did not change. Make sure your client programs can handle NULL values in these columns. - The structure of the table INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY was changed. Three new NOT NULL columns were added: - FULL_COLLATION_NAME - ID - IS_DEFAULT New collations have multiple entries in COLLATION_CHARACTER_SET_APPLICABILITY. The column COLLATION_NAME contains the collation name without the character set prefix. The column FULL_COLLATION_NAME contains the collation name with the character set prefix. Old collations have full collation name in both FULL_COLLATION_NAME and COLLATION_NAME. SELECT COLLATION_NAME, FULL_COLLATION_NAME, CHARACTER_SET_NAME, ID, IS_DEFAULT FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY WHERE FULL_COLLATION_NAME RLIKE '^(utf8mb4\|latin1).swedish.ci$'; +-----------------------------+-------------------------------------+--------------------+------+------------+ \| COLLATION_NAME \| FULL_COLLATION_NAME \| CHARACTER_SET_NAME \| ID \| IS_DEFAULT \| +-----------------------------+-------------------------------------+--------------------+------+------------+ \| latin1_swedish_ci \| latin1_swedish_ci \| latin1 \| 8 \| Yes \| \| latin1_swedish_nopad_ci \| latin1_swedish_nopad_ci \| latin1 \| 1032 \| \| \| utf8mb4_swedish_ci \| utf8mb4_swedish_ci \| utf8mb4 \| 232 \| \| \| uca1400_swedish_ai_ci \| utf8mb4_uca1400_swedish_ai_ci \| utf8mb4 \| 2368 \| \| \| uca1400_swedish_as_ci \| utf8mb4_uca1400_swedish_as_ci \| utf8mb4 \| 2370 \| \| \| uca1400_swedish_nopad_ai_ci \| utf8mb4_uca1400_swedish_nopad_ai_ci \| utf8mb4 \| 2372 \| \| \| uca1400_swedish_nopad_as_ci \| utf8mb4_uca1400_swedish_nopad_as_ci \| utf8mb4 \| 2374 \| \| +-----------------------------+-------------------------------------+--------------------+------+------------+ - Other INFORMATION_SCHEMA queries: SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS; SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.PARAMETERS; SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES; SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA; SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.ROUTINES; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.EVENTS; SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.EVENTS; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.ROUTINES; SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.ROUTINES; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.TRIGGERS; SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.TRIGGERS; SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.VIEWS; display full collation names, including character sets prefix, for all collations, including new collations. Corresponding SHOW commands also display full collation names in collation related columns: SHOW CREATE TABLE t1; SHOW CREATE DATABASE db1; SHOW TABLE STATUS; SHOW CREATE FUNCTION f1; SHOW CREATE PROCEDURE p1; SHOW CREATE EVENT ev1; SHOW CREATE TRIGGER tr1; SHOW CREATE VIEW; These INFORMATION_SCHEMA queries and SHOW statements may change in the future, to display show collation names.	2022-08-10 15:04:24 +02:00
Oleksandr Byelkin	f5c5f8e41e	Merge branch '10.5' into 10.6	2022-02-03 17:01:31 +01:00
Oleksandr Byelkin	cf63eecef4	Merge branch '10.4' into 10.5	2022-02-01 20:33:04 +01:00
Oleksandr Byelkin	a576a1cea5	Merge branch '10.3' into 10.4	2022-01-30 09:46:52 +01:00
Oleksandr Byelkin	41a163ac5c	Merge branch '10.2' into 10.3	2022-01-29 15:41:05 +01:00
Alexander Barkov	b915f79e4e	MDEV-25904 New collation functions to compare InnoDB style trimmed NO PAD strings	2022-01-21 12:16:07 +04:00
Vladislav Vaintroub	47e18af906	MDEV-27494 Rename .ic files to .inl	2022-01-17 16:41:51 +01:00
Alexander Barkov	0d68b0a2d6	MDEV-26669 Add MY_COLLATION_HANDLER functions min_str() and max_str()	2021-09-27 17:10:22 +04:00
Monty	a206658b98	Change CHARSET_INFO character set and collaction names to LEX_CSTRING This change removed 68 explict strlen() calls from the code. The following renames was done to ensure we don't use the old names when merging code from earlier releases, as using the new variables for print function could result in crashes: - charset->csname renamed to charset->cs_name - charset->name renamed to charset->coll_name Almost everything where mechanical changes except: - Changed to use the new Protocol::store(LEX_CSTRING..) when possible - Changed to use field->store(LEX_CSTRING, CHARSET_INFO) when possible - Changed to use String->append(LEX_CSTRING&) when possible Other things: - There where compiler issues with ensuring that all character set names points to the same string: gcc doesn't allow one to use integer constants when defining global structures (constant char * pointers works fine). To get around this, I declared defines for each character set name length.	2021-05-19 22:54:07 +02:00
Sergei Golubchik	25d9d2e37f	Merge branch 'bb-10.4-release' into bb-10.5-release	2021-02-15 16:43:15 +01:00
Sergei Golubchik	00a313ecf3	Merge branch 'bb-10.3-release' into bb-10.4-release Note, the fix for "MDEV-23328 Server hang due to Galera lock conflict resolution" was null-merged. 10.4 version of the fix is coming up separately	2021-02-12 17:44:22 +01:00
Sergei Golubchik	60ea09eae6	Merge branch '10.2' into 10.3	2021-02-01 13:49:33 +01:00
Daniel Black	f2fea295b4	ucs2: cppcheck - add va_end	2021-01-21 16:46:59 +11:00
Marko Mäkelä	cf87f3e08c	Merge 10.4 into 10.5	2020-08-14 11:33:35 +03:00
Marko Mäkelä	2f7b37b021	Merge 10.3 into 10.4, except MDEV-22543 Also, fix GCC -Og -Wmaybe-uninitialized in run_backup_stage()	2020-08-13 18:48:41 +03:00
Marko Mäkelä	4bd56a697f	Merge 10.2 into 10.3	2020-08-13 18:18:25 +03:00
Marko Mäkelä	31aef3ae99	Fix GCC 10.2.0 -Og -Wmaybe-uninitialized For some reason, GCC emits more -Wmaybe-uninitialized warnings when using the flag -Og than when using -O2. Many of the warnings look genuine.	2020-08-11 15:58:16 +03:00
Monty	dbcd3384e0	MDEV-7947 strcmp() takes 0.37% in OLTP RO This patch ensures that all identical character sets shares the same cs->csname. This allows us to replace strcmp() in my_charset_same() with comparisons of pointers. This fixes a long standing performance issue that could cause as strcmp() for every item sent trough the protocol class to the end user. One consequence of this patch is that we don't allow one to add a character definition in the Index.xml file that changes the csname of an existing character set. This is by design as changing character set names of existing ones is extremely dangerous, especially as some storage engines just records character set numbers. As we now have a hash over character set's csname, we can in the future use that for faster access to a specific character set. This could be done by changing the hash to non unique and use the hash to find the next character set with same csname.	2020-07-23 10:54:33 +03:00
Alexander Barkov	cfe5ee90c8	MDEV-22043 Special character leads to assertion in my_wc_to_printable_generic on 10.5.2 (debug) The code did not take into account that: - U+005C (backslash) can occupy more than mbminlen characters (e.g. in sjis) - Some character sets do not have a code for U+005C (e.g. swe7) Adding a new function my_wc_to_printable into MY_CHARSET_HANDLER to cover all special cases easier.	2020-05-09 16:01:30 +04:00
Alexander Barkov	f1e13fdc8d	MDEV-21581 Helper functions and methods for CHARSET_INFO	2020-01-28 12:29:23 +04:00
Oleksandr Byelkin	c07325f932	Merge branch '10.3' into 10.4	2019-05-19 20:55:37 +02:00
Marko Mäkelä	be85d3e61b	Merge 10.2 into 10.3	2019-05-14 17:18:46 +03:00
Marko Mäkelä	26a14ee130	Merge 10.1 into 10.2	2019-05-13 17:54:04 +03:00
Vicențiu Ciorbaru	cb248f8806	Merge branch '5.5' into 10.1	2019-05-11 22:19:05 +03:00
Vicențiu Ciorbaru	5543b75550	Update FSF Address * Update wrong zip-code	2019-05-11 21:29:06 +03:00
Alexander Barkov	a8efe7ab1f	MDEV-17502 MDEV-17474 Change Unicode xxx_general_ci and xxx_bin collation implementation to "inline" style	2018-10-19 14:35:01 +04:00
Alexander Barkov	6eae037c4c	MDEV-17474 Change Unicode collation implementation from "handler" to "inline" style	2018-10-17 06:44:40 +04:00
Marko Mäkelä	05459706f2	Merge 10.2 into 10.3	2018-08-03 15:57:23 +03:00
Marko Mäkelä	ef3070e997	Merge 10.1 into 10.2	2018-08-02 08:19:57 +03:00
Oleksandr Byelkin	cb5952b506	Merge branch '10.0' into bb-10.1-merge-sanja	2018-07-25 22:24:40 +02:00
Alexander Barkov	e2ac4098ed	Simplify caseup() and casedn() in charsets After the MDEV-13118 fix there's no code in the server that wants caseup/casedn to change the argument in place for simple charsets. Let's remove this logic and always return the result in a new string for all charsets, both simple and complex. 1. Removing the optimization that some character sets used in casedn() and caseup(), which allowed (and required) to change the case in-place, overwriting the string passed as the "src" argument. Now all CHARSET_INFO's work in the same way: non of them change the source string in-place, all of them now convert case from the source string to the destination string, leaving the source string untouched. 2. Adding "const" qualifier to the "char src" parameter to caseup() and casedn(). 3. Removing duplicate implementations in ctype-mb.c. Now both caseup() and casedn() implementations for all CJK character sets use internally the same function my_casefold_mb() (the former my_casefold_mb_varlen()). 4. Removing the "unused" attribute from parameters of some my_case{up\|dn}_xxx() implementations, as the affected parameters are now used* in the code. Previously these parameters were used only in DBUG_ASSERT().	2018-07-19 13:02:14 +04:00
luz.paz	3dd01669b4	Misc. typos Found via `codespell -i 3 -w --skip="./debian/po" -I ../mariadb-server-word-whitelist.txt ./cmake/ ./debian/ ./Docs/ ./include/ ./man/ ./plugin/ ./strings/`	2018-04-05 15:26:57 +04:00
Alexander Barkov	c7a2f23a7b	Merge remote-tracking branch 'origin/bb-10.2-ext' into 10.3	2018-01-29 12:44:20 +04:00
Vladislav Vaintroub	9891ee5a2a	Fix and reenable Windows compiler warning C4800 (size_t conversion).	2018-01-26 10:37:46 +00:00
Marko Mäkelä	2c1067166d	Merge bb-10.2-ext into 10.3	2017-10-04 08:24:06 +03:00
Vladislav Vaintroub	7354dc6773	MDEV-13384 - misc Windows warnings fixed	2017-09-28 17:20:46 +00:00
Monty	536215e32f	Added DBUG_ASSERT_AS_PRINTF compile flag If compiling a non DBUG binary with -DDBUG_ASSERT_AS_PRINTF asserts will be changed to printf + stack trace (of stack trace are enabled). - Changed #ifndef DBUG_OFF to #ifdef DBUG_ASSERT_EXISTS for those DBUG_OFF that was just used to enable assert - Assert checking that could greatly impact performance where changed to DBUG_ASSERT_SLOW which is not affected by DBUG_ASSERT_AS_PRINTF - Added one extra option to my_print_stacktrace() to get more silent in case of stack trace printing as part of assert.	2017-08-24 01:05:50 +02:00
Sergei Golubchik	4a5d25c338	Merge branch '10.1' into 10.2	2016-12-29 13:23:18 +01:00
kevg	780db8e252	fix build and some warnings	2016-11-24 17:36:02 +03:00
Alexander Barkov	1d8eafbeaf	Removing the unused function my_bincmp() from strings/ctype-ucs2.c	2016-11-24 15:55:55 +04:00
Alexey Botchkov	27025221fe	MDEV-9143 JSON_xxx functions. strings/json_lib.c added as a JSON library. SQL frunction added with sql/item_jsonfunc.h/cc	2016-10-19 14:10:03 +04:00

1 2 3 4 5

233 commits