Commit graph

8 commits

Author SHA1 Message Date
Alexander Barkov
6075f12c65 MDEV-31071 Refactor case folding data types in Unicode collations
This is a non-functional change. It changes the way how case folding data
and weight data (for simple Unicode collations) are stored:

- Removing data types MY_UNICASE_CHARACTER, MY_UNICASE_INFO
- Using data types MY_CASEFOLD_CHARACTER, MY_CASEFOLD_INFO instead.

This patch changes simple Unicode collations in a similar way
how MDEV-30695 previously changed Asian collations.

No new MTR tests are needed. The underlying code is thoroughly
covered by a number of ctype_*_ws.test and ctype_*_casefold.test
files, which were added recently as a preparation
for this change.

Old and new Unicode data layout
-------------------------------

Case folding data is now stored in separate tables
consisting of MY_CASEFOLD_CHARACTER elements with two members:

    typedef struct casefold_info_char_t
    {
      uint32 toupper;
      uint32 tolower;
    } MY_CASEFOLD_CHARACTER;

while weight data (for simple non-UCA collations xxx_general_ci
and xxx_general_mysql500_ci) is stored in separate arrays of
uint16 elements.

Before this change case folding data and simple weight data were
stored together, in tables of the following elements with three members:

    typedef struct unicase_info_char_st
    {
      uint32 toupper;
      uint32 tolower;
      uint32 sort;          /* weights for simple collations */
    } MY_UNICASE_CHARACTER;

This data format was redundant, because weights (the "sort" member) were
needed only for these two simple Unicode collations:
- xxx_general_ci
- xxx_general_mysql500_ci

Adding case folding information for Unicode-14.0.0 using the old
format would waste memory without purpose.

Detailed changes
----------------
- Changing the underlying data types as described above

- Including unidata-dump.c into the sources.
  This program was earlier used to dump UnicodeData.txt
  (e.g. https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt)
  into MySQL / MariaDB source files.
  It was originally written in 2002, but has not been distributed yet
  together with MySQL / MariaDB sources.

- Removing the old format Unicode data earlier dumped from UnicodeData.txt
  (versions 3.0.0 and 5.2.0) from ctype-utf8.c.
  Adding Unicode data in the new format into separate header files,
  to maintain the code easier:

    - ctype-unicode300-casefold.h
    - ctype-unicode300-casefold-tr.h
    - ctype-unicode300-general_ci.h
    - ctype-unicode300-general_mysql500_ci.h
    - ctype-unicode520-casefold.h

- Adding a new file ctype-unidata.c as an aggregator for
  the header files listed above.
2023-04-18 11:29:25 +04:00
Marko Mäkelä
1d1e0ab2cc Merge 10.6 into 10.8 2023-04-12 15:50:08 +03:00
Alexander Barkov
62e137d4d7 Merge remote-tracking branch 'origin/10.4' into 10.5 2023-04-05 16:16:19 +04:00
Alexander Barkov
8020b1bd73 MDEV-30034 UNIQUE USING HASH accepts duplicate entries for tricky collations
- Adding a new argument "flag" to MY_COLLATION_HANDLER::strnncollsp_nchars()
  and a flag MY_STRNNCOLLSP_NCHARS_EMULATE_TRIMMED_TRAILING_SPACES.
  The flag defines if strnncollsp_nchars() should emulate trailing spaces
  which were possibly trimmed earlier (e.g. in InnoDB CHAR compression).
  This is important for NOPAD collations.

  For example, with this input:
   - str1= 'a '    (Latin letter a followed by one space)
   - str2= 'a  '   (Latin letter a followed by two spaces)
   - nchars= 3
  if the flag is given, strnncollsp_nchars() will virtually restore
  one trailing space to str1 up to nchars (3) characters and compare two
  strings as equal:
  - str1= 'a  '  (one extra trailing space emulated)
  - str2= 'a  '  (as is)

  If the flag is not given, strnncollsp_nchars() does not add trailing
  virtual spaces, so in case of a NOPAD collation, str1 will be compared
  as less than str2 because it is shorter.

- Field_string::cmp_prefix() now passes the new flag.
  Field_varstring::cmp_prefix() and Field_blob::cmp_prefix() do
  not pass the new flag.

- The branch in cmp_whole_field() in storage/innobase/rem/rem0cmp.cc
  (which handles the CHAR data type) now also passed the new flag.

- Fixing UCA collations to respect the new flag.
  Other collations are possibly also affected, however
  I had no success in making an SQL script demonstrating the problem.
  Other collations will be extended to respect this flags in a separate
  patch later.

- Changing the meaning of the last parameter of Field::cmp_prefix()
  from "number of bytes" (internal length)
  to "number of characters" (user visible length).

  The code calling cmp_prefix() from handler.cc was wrong.
  After this change, the call in handler.cc became correct.

  The code calling cmp_prefix() from key_rec_cmp() in key.cc
  was adjusted according to this change.

- Old strnncollsp_nchar() related tests in unittest/strings/strings-t.c
  now pass the new flag.
  A few new tests also were added, without the flag.
2023-04-04 12:30:50 +04:00
Oleksandr Byelkin
9ed8deb656 Merge branch '10.6' into 10.7 2022-02-04 14:11:46 +01:00
Oleksandr Byelkin
cf63eecef4 Merge branch '10.4' into 10.5 2022-02-01 20:33:04 +01:00
Oleksandr Byelkin
a576a1cea5 Merge branch '10.3' into 10.4 2022-01-30 09:46:52 +01:00
Vladislav Vaintroub
47e18af906 MDEV-27494 Rename .ic files to .inl 2022-01-17 16:41:51 +01:00
Renamed from strings/strcoll.ic (Browse further)