mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-01-16 20:12:31 +01:00

Author	SHA1	Message	Date
Alexander Barkov	c21745dbe4	MDEV-30577 Case folding for uca1400 collations is not up to date Adding casefolding for Unicode-14.0.0 into uca1400 collations.	2023-04-18 11:31:05 +04:00
Alexander Barkov	6075f12c65	MDEV-31071 Refactor case folding data types in Unicode collations This is a non-functional change. It changes the way how case folding data and weight data (for simple Unicode collations) are stored: - Removing data types MY_UNICASE_CHARACTER, MY_UNICASE_INFO - Using data types MY_CASEFOLD_CHARACTER, MY_CASEFOLD_INFO instead. This patch changes simple Unicode collations in a similar way how MDEV-30695 previously changed Asian collations. No new MTR tests are needed. The underlying code is thoroughly covered by a number of ctype_*_ws.test and ctype_*_casefold.test files, which were added recently as a preparation for this change. Old and new Unicode data layout ------------------------------- Case folding data is now stored in separate tables consisting of MY_CASEFOLD_CHARACTER elements with two members: typedef struct casefold_info_char_t { uint32 toupper; uint32 tolower; } MY_CASEFOLD_CHARACTER; while weight data (for simple non-UCA collations xxx_general_ci and xxx_general_mysql500_ci) is stored in separate arrays of uint16 elements. Before this change case folding data and simple weight data were stored together, in tables of the following elements with three members: typedef struct unicase_info_char_st { uint32 toupper; uint32 tolower; uint32 sort; /* weights for simple collations */ } MY_UNICASE_CHARACTER; This data format was redundant, because weights (the "sort" member) were needed only for these two simple Unicode collations: - xxx_general_ci - xxx_general_mysql500_ci Adding case folding information for Unicode-14.0.0 using the old format would waste memory without purpose. Detailed changes ---------------- - Changing the underlying data types as described above - Including unidata-dump.c into the sources. This program was earlier used to dump UnicodeData.txt (e.g. https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt) into MySQL / MariaDB source files. It was originally written in 2002, but has not been distributed yet together with MySQL / MariaDB sources. - Removing the old format Unicode data earlier dumped from UnicodeData.txt (versions 3.0.0 and 5.2.0) from ctype-utf8.c. Adding Unicode data in the new format into separate header files, to maintain the code easier: - ctype-unicode300-casefold.h - ctype-unicode300-casefold-tr.h - ctype-unicode300-general_ci.h - ctype-unicode300-general_mysql500_ci.h - ctype-unicode520-casefold.h - Adding a new file ctype-unidata.c as an aggregator for the header files listed above.	2023-04-18 11:29:25 +04:00
Alexander Barkov	2ad287caad	MDEV-31069 Reuse duplicate char-to-weight conversion code in ctype-utf8.c and ctype-ucs2.c Removing similar functions from ctype-utf8.c and ctype-ucs2.c - my_tosort_utf16() - my_tosort_utf32() - my_tosort_ucs2() - my_tosort_unicode() Adding new shared functions into ctype-unidata.h: - my_tosort_unicode_bmp() - reused for utf8mb3, ucs2 - my_tosort_unicode() - reused for utf8mb4, utf16, utf32 For simplicity, the new version of my_tosort_unicode*() does not include the code handling the MY_CS_LOWER_SORT flag because: - it affects performance negatively - we don't have any collations with this flag yet anyway (This code was most likely earlier erroneously merged from MySQL's utf8_tolower_ci at some point.)	2023-04-18 10:24:05 +04:00
Alexander Barkov	30b4bb4204	MDEV-31068 Reuse duplicate case conversion code in ctype-utf8.c and ctype-ucs2.c	2023-04-18 06:44:03 +04:00
Alexander Barkov	965bdf3e66	MDEV-30746 Regression in ucs2_general_mysql500_ci 1. Adding a separate MY_COLLATION_HANDLER my_collation_ucs2_general_mysql500_ci_handler implementing a proper order for ucs2_general_mysql500_ci. The problem happened because ucs2_general_mysql500_ci erroneously used my_collation_ucs2_general_ci_handler. 2. Cosmetic changes: Renaming: - plane00_mysql500 to my_unicase_mysql500_page00 - my_unicase_pages_mysql500 to my_unicase_mysql500_pages to use the same naming style with: - my_unicase_default_page00 - my_unicase_default_pages 3. Moving code fragments from - handler::check_collation_compatibility() in handler.cc - upgrade_collation() in table.cc into new methods in class Charset, to reuse the code easier.	2023-03-01 15:38:02 +04:00
Alexander Barkov	a8efe7ab1f	MDEV-17502 MDEV-17474 Change Unicode xxx_general_ci and xxx_bin collation implementation to "inline" style	2018-10-19 14:35:01 +04:00

6 commits
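
The data layout change described in MDEV-31071 (6075f12c65) can be illustrated with a small C sketch. The two typedefs are taken from the commit message, with `uint32_t` and `uint16_t` from `<stdint.h>` standing in for the server's `uint32` and `uint16`. Everything else is an assumption made for illustration and is not the server's code: the 256-entries-per-page figure, the helper names `old_page_bytes()`, `new_casefold_page_bytes()`, `new_weight_page_bytes()`, `demo_toupper()` and `demo_upper_of_a()`, and the two-level page lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old layout (quoted in the commit message): folding and weight together. */
typedef struct unicase_info_char_st {
  uint32_t toupper;
  uint32_t tolower;
  uint32_t sort;   /* weights for simple collations */
} MY_UNICASE_CHARACTER;

/* New layout (quoted in the commit message): case folding only. */
typedef struct casefold_info_char_t {
  uint32_t toupper;
  uint32_t tolower;
} MY_CASEFOLD_CHARACTER;

static_assert(sizeof(MY_UNICASE_CHARACTER) == 12, "three uint32 members");
static_assert(sizeof(MY_CASEFOLD_CHARACTER) == 8, "two uint32 members");

/* ASSUMPTION: one page covers 256 code points. */
enum { DEMO_PAGE_SIZE = 256 };

/* Hypothetical helpers: bytes needed for one page in each layout. */
size_t old_page_bytes(void)
{ return DEMO_PAGE_SIZE * sizeof(MY_UNICASE_CHARACTER); }

size_t new_casefold_page_bytes(void)
{ return DEMO_PAGE_SIZE * sizeof(MY_CASEFOLD_CHARACTER); }

/* Weights are uint16 and only needed by the two simple collations. */
size_t new_weight_page_bytes(void)
{ return DEMO_PAGE_SIZE * sizeof(uint16_t); }

/* Hypothetical two-level lookup: high byte selects a page, low byte a slot.
   Code points without a page, or outside the BMP, map to themselves. */
uint32_t demo_toupper(const MY_CASEFOLD_CHARACTER *const *pages, uint32_t wc)
{
  const MY_CASEFOLD_CHARACTER *page;
  if (wc > 0xFFFF)
    return wc;
  page = pages[wc >> 8];
  return page ? page[wc & 0xFF].toupper : wc;
}

/* Builds a toy page 0 with a single 'a' <-> 'A' mapping and looks up 'a'. */
uint32_t demo_upper_of_a(void)
{
  static MY_CASEFOLD_CHARACTER page00[DEMO_PAGE_SIZE];
  static const MY_CASEFOLD_CHARACTER *pages[256];
  uint32_t i;
  for (i = 0; i < DEMO_PAGE_SIZE; i++) {
    page00[i].toupper = i;   /* identity by default */
    page00[i].tolower = i;
  }
  page00['a'].toupper = 'A';
  page00['A'].tolower = 'a';
  pages[0] = page00;         /* all other pages stay NULL */
  return demo_toupper(pages, 'a');
}
```

Under these assumptions a page costs 3072 bytes in the old layout against 2048 bytes for case folding in the new one, plus 512 bytes of weights only where a simple collation actually needs them. This matches the commit's point that carrying the "sort" member for every character would waste memory.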