Commit graph

162 commits

Author SHA1 Message Date
Alexander Barkov
133446828c MDEV-27009 Add UCA-14.0.0 collations
- Added one neutral and 22 tailored (language specific) collations based on
  Unicode Collation Algorithm version 14.0.0.

  Collations were added for Unicode character sets
  utf8mb3, utf8mb4, ucs2, utf16, utf32.

  Every tailoring was added with four accent and case
  sensitivity flag combinations, e.g:

  * utf8mb4_uca1400_swedish_as_cs
  * utf8mb4_uca1400_swedish_as_ci
  * utf8mb4_uca1400_swedish_ai_cs
  * utf8mb4_uca1400_swedish_ai_ci

  and their _nopad_ variants:

  * utf8mb4_uca1400_swedish_nopad_as_cs
  * utf8mb4_uca1400_swedish_nopad_as_ci
  * utf8mb4_uca1400_swedish_nopad_ai_cs
  * utf8mb4_uca1400_swedish_nopad_ai_ci

- Introducing a conception of contextually typed named collations:

  CREATE DATABASE db1 CHARACTER SET utf8mb4;
  CREATE TABLE db1.t1 (a CHAR(10) COLLATE uca1400_as_ci);

  The idea is that there is no a need to specify the character set prefix
  in the new collation names. It's enough to type just the suffix
  "uca1400_as_ci". The character set is taken from the context.

  In the above example script the context character set is utf8mb4.
  So the CREATE TABLE will make a column with the collation
  utf8mb4_uca1400_as_ci.

  Short collations names can be used in any parts of the SQL syntax
  where the COLLATE clause is understood.

- New collations are displayed only one time
  (without character set combinations) by these statements:

     SELECT * FROM INFORMATION_SCHEMA.COLLATIONS;
     SHOW COLLATION;

  For example, all these collations:
  - utf8mb3_uca1400_swedish_as_ci
  - utf8mb4_uca1400_swedish_as_ci
  - ucs2_uca1400_swedish_as_ci
  - utf16_uca1400_swedish_as_ci
  - utf32_uca1400_swedish_as_ci
  have just one entry in INFORMATION_SCHEMA.COLLATIONS and SHOW COLLATION,
  with COLLATION_NAME equal to "uca1400_swedish_as_ci", which is the suffix
  without the character set name:

SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLLATIONS
WHERE COLLATION_NAME LIKE '%uca1400_swedish_as_ci';

+-----------------------+
| COLLATION_NAME        |
+-----------------------+
| uca1400_swedish_as_ci |
+-----------------------+

  Note, the behaviour of old collations did not change.
  Non-unicode collations (e.g. latin1_swedish_ci) and
  old UCA-4.0.0 collations (e.g. utf8mb4_unicode_ci)
  are still displayed with the character set prefix, as before.

- The structure of the table INFORMATION_SCHEMA.COLLATIONS was changed.

  The NOT NULL constraint was removed from these columns:
  - CHARACTER_SET_NAME
  - ID
  - IS_DEFAULT
  and from the corresponding columns in SHOW COLLATION.

  For example:

SELECT COLLATION_NAME, CHARACTER_SET_NAME, ID, IS_DEFAULT
FROM INFORMATION_SCHEMA.COLLATIONS
WHERE COLLATION_NAME LIKE '%uca1400_swedish_as_ci';
+-----------------------+--------------------+------+------------+
| COLLATION_NAME        | CHARACTER_SET_NAME | ID   | IS_DEFAULT |
+-----------------------+--------------------+------+------------+
| uca1400_swedish_as_ci | NULL               | NULL | NULL       |
+-----------------------+--------------------+------+------------+

  The NULL value in these columns now means that the collation
  is applicable to multiple character sets.
  The behavioir of old collations did not change.
  Make sure your client programs can handle NULL values in these columns.

- The structure of the table
  INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY was changed.

  Three new NOT NULL columns were added:
  - FULL_COLLATION_NAME
  - ID
  - IS_DEFAULT

  New collations have multiple entries in COLLATION_CHARACTER_SET_APPLICABILITY.
  The column COLLATION_NAME contains the collation name without the character
  set prefix. The column FULL_COLLATION_NAME contains the collation name with
  the character set prefix.

  Old collations have full collation name in both FULL_COLLATION_NAME and
  COLLATION_NAME.

SELECT COLLATION_NAME, FULL_COLLATION_NAME, CHARACTER_SET_NAME, ID, IS_DEFAULT
FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
WHERE FULL_COLLATION_NAME RLIKE '^(utf8mb4|latin1).*swedish.*ci$';
+-----------------------------+-------------------------------------+--------------------+------+------------+
| COLLATION_NAME              | FULL_COLLATION_NAME                 | CHARACTER_SET_NAME | ID   | IS_DEFAULT |
+-----------------------------+-------------------------------------+--------------------+------+------------+
| latin1_swedish_ci           | latin1_swedish_ci                   | latin1             |    8 | Yes        |
| latin1_swedish_nopad_ci     | latin1_swedish_nopad_ci             | latin1             | 1032 |            |
| utf8mb4_swedish_ci          | utf8mb4_swedish_ci                  | utf8mb4            |  232 |            |
| uca1400_swedish_ai_ci       | utf8mb4_uca1400_swedish_ai_ci       | utf8mb4            | 2368 |            |
| uca1400_swedish_as_ci       | utf8mb4_uca1400_swedish_as_ci       | utf8mb4            | 2370 |            |
| uca1400_swedish_nopad_ai_ci | utf8mb4_uca1400_swedish_nopad_ai_ci | utf8mb4            | 2372 |            |
| uca1400_swedish_nopad_as_ci | utf8mb4_uca1400_swedish_nopad_as_ci | utf8mb4            | 2374 |            |
+-----------------------------+-------------------------------------+--------------------+------+------------+

- Other INFORMATION_SCHEMA queries:

  SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
  SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.PARAMETERS;
  SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES;
  SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA;
  SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.ROUTINES;
  SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.EVENTS;
  SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.EVENTS;
  SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.ROUTINES;
  SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.ROUTINES;
  SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.TRIGGERS;
  SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.TRIGGERS;
  SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.VIEWS;

  display full collation names, including character sets prefix,
  for all collations, including new collations.

  Corresponding SHOW commands also display full collation names
  in collation related columns:

  SHOW CREATE TABLE t1;
  SHOW CREATE DATABASE db1;
  SHOW TABLE STATUS;
  SHOW CREATE FUNCTION f1;
  SHOW CREATE PROCEDURE p1;
  SHOW CREATE EVENT ev1;
  SHOW CREATE TRIGGER tr1;
  SHOW CREATE VIEW;

  These INFORMATION_SCHEMA queries and SHOW statements may change in
  the future, to display show collation names.
2022-08-10 15:04:24 +02:00
Oleksandr Byelkin
9ed8deb656 Merge branch '10.6' into 10.7 2022-02-04 14:11:46 +01:00
Oleksandr Byelkin
f5c5f8e41e Merge branch '10.5' into 10.6 2022-02-03 17:01:31 +01:00
Oleksandr Byelkin
cf63eecef4 Merge branch '10.4' into 10.5 2022-02-01 20:33:04 +01:00
Oleksandr Byelkin
a576a1cea5 Merge branch '10.3' into 10.4 2022-01-30 09:46:52 +01:00
Alexander Barkov
b915f79e4e MDEV-25904 New collation functions to compare InnoDB style trimmed NO PAD strings 2022-01-21 12:16:07 +04:00
Vladislav Vaintroub
47e18af906 MDEV-27494 Rename .ic files to .inl 2022-01-17 16:41:51 +01:00
Marko Mäkelä
b36d6f92a8 Merge 10.6 into 10.7 2021-09-30 11:01:07 +03:00
Alexander Barkov
0d68b0a2d6 MDEV-26669 Add MY_COLLATION_HANDLER functions min_str() and max_str() 2021-09-27 17:10:22 +04:00
Alexander Barkov
0629711db4 MDEV-26572 Improve simple multibyte collation performance on the ASCII range 2021-09-13 08:03:25 +04:00
Monty
a206658b98 Change CHARSET_INFO character set and collaction names to LEX_CSTRING
This change removed 68 explict strlen() calls from the code.

The following renames was done to ensure we don't use the old names
when merging code from earlier releases, as using the new variables
for print function could result in crashes:
- charset->csname renamed to charset->cs_name
- charset->name renamed to charset->coll_name

Almost everything where mechanical changes except:
- Changed to use the new Protocol::store(LEX_CSTRING..) when possible
- Changed to use field->store(LEX_CSTRING*, CHARSET_INFO*) when possible
- Changed to use String->append(LEX_CSTRING&) when possible

Other things:
- There where compiler issues with ensuring that all character set names
  points to the same string: gcc doesn't allow one to use integer constants
  when defining global structures (constant char * pointers works fine).
  To get around this, I declared defines for each character set name
  length.
2021-05-19 22:54:07 +02:00
Monty
dbcd3384e0 MDEV-7947 strcmp() takes 0.37% in OLTP RO
This patch ensures that all identical character sets shares the same
cs->csname.
This allows us to replace strcmp() in my_charset_same() with comparisons
of pointers. This fixes a long standing performance issue that could cause
as strcmp() for every item sent trough the protocol class to the end user.

One consequence of this patch is that we don't allow one to add a character
definition in the Index.xml file that changes the csname of an existing
character set. This is by design as changing character set names of existing
ones is extremely dangerous, especially as some storage engines just records
character set numbers.

As we now have a hash over character set's csname, we can in the future
use that for faster access to a specific character set. This could be done
by changing the hash to non unique and use the hash to find the next
character set with same csname.
2020-07-23 10:54:33 +03:00
Alexander Barkov
cfe5ee90c8 MDEV-22043 Special character leads to assertion in my_wc_to_printable_generic on 10.5.2 (debug)
The code did not take into account that:
- U+005C (backslash) can occupy more than mbminlen characters (e.g. in sjis)
- Some character sets do not have a code for U+005C (e.g. swe7)

Adding a new function my_wc_to_printable into MY_CHARSET_HANDLER to
cover all special cases easier.
2020-05-09 16:01:30 +04:00
Marko Mäkelä
26a14ee130 Merge 10.1 into 10.2 2019-05-13 17:54:04 +03:00
Vicențiu Ciorbaru
cb248f8806 Merge branch '5.5' into 10.1 2019-05-11 22:19:05 +03:00
Vicențiu Ciorbaru
5543b75550 Update FSF Address
* Update wrong zip-code
2019-05-11 21:29:06 +03:00
Alexander Barkov
5058ced5df MDEV-7769 MY_CHARSET_INFO refactoring# On branch 10.2
Part 3 (final): removing MY_CHARSET_HANDLER::well_formed_len().
2016-10-10 14:36:09 +04:00
Alexander Barkov
ee19806b8e MDEV-9711 NO PAD collations
Based on the patch from Daniil Medvedev (a Google Summer of Code task)
2016-09-06 12:50:02 +04:00
Alexander Barkov
e7ff281d2e MDEV-6353 my_ismbchar() and my_mbcharlen() refactoring 2016-05-17 15:27:10 +04:00
Alexander Barkov
e09299511e MDEV-9665 Remove cs->cset->ismbchar()
Using a more powerfull cs->cset->charlen() instead.
2016-03-16 10:55:12 +04:00
Alexander Barkov
78b80cb6ba Adding MY_CHARSET_HANDLER::native_to_mb().
This is a pre-requisite patch for:
- MDEV-8433 Make field<'broken-string' use indexes
- MDEV-8625 Bad result set with ignorable characters when using a prefix key
- MDEV-8626 Bad result set with contractions when using a prefix key
2015-08-14 18:34:41 +04:00
Alexander Barkov
4f828a1cac MDEV-8214 Asian MB2 charsets: compare broken bytes as "greater than any non-broken character" 2015-06-26 13:40:28 +04:00
Alexander Barkov
197afb413f MDEV-6566 Different INSERT behaviour on bad bytes with and without character set conversion 2015-03-13 16:51:36 +04:00
Alexander Barkov
a7ed8523e3 Adding a shared include file ctype-mb.ic and removing a number
of very similar copies of my_well_formed_len_xxx(), implemented
for big5, cp932, euckr, eucjpms, gb2312m gbk, sjis, ujis.
2015-03-04 09:16:43 +04:00
Alexander Barkov
b1b6101af2 A preparatory patch for MDEV-6566.
Adding a new virtual function MY_CHARSET_HANDLER::copy_abort().
Moving character set specific code into the correspoding implementations
(for simple, multi-byte and mbmaxlen>1 character sets).
2015-03-02 18:24:22 +04:00
Sergei Golubchik
2ae7541bcf cleanup: s/const CHARSET_INFO/CHARSET_INFO/
as CHARSET_INFO is already const, using const on it
is redundant and results in compiler warnings (on Windows)
2014-12-04 10:41:51 +01:00
Alexander Barkov
426d246f5b MDEV-5163 Merge WEIGHT_STRING function from MySQL-5.6 2013-10-23 20:25:52 +04:00
Alexander Barkov
0b6c4bb34f MDEV-4928 Merge collation customization improvements
Merging the following MySQL-5.6 changes:
- WL#5624: Collation customization improvements
  http://dev.mysql.com/worklog/task/?id=5624

- WL#4013: Unicode german2 collation
  http://dev.mysql.com/worklog/task/?id=4013

- Bug#62429 XML: ExtractValue, UpdateXML max arg length 127 chars
  http://bugs.mysql.com/bug.php?id=62429
  (required by WL#5624)
2013-10-02 15:04:07 +04:00
Sergei Golubchik
4f435bddfd 5.3 merge 2012-01-13 15:50:02 +01:00
Michael Widenius
6920457142 Merge with MariaDB 5.1 2011-11-24 18:48:58 +02:00
Michael Widenius
a8d03ab235 Initail merge with MySQL 5.1 (XtraDB still needs to be merged)
Fixed up copyright messages.
2011-11-21 19:13:14 +02:00
Sergei Golubchik
0e007344ea mysql-5.5.18 merge 2011-11-03 19:17:05 +01:00
Sergei Golubchik
76f0b94bb0 merge with 5.3
sql/sql_insert.cc:
  CREATE ... IF NOT EXISTS may do nothing, but
  it is still not a failure. don't forget to my_ok it.
  ******
  CREATE ... IF NOT EXISTS may do nothing, but
  it is still not a failure. don't forget to my_ok it.
sql/sql_table.cc:
  small cleanup
  ******
  small cleanup
2011-10-19 21:45:18 +02:00
Sergei Golubchik
9809f05199 5.5-merge 2011-07-02 22:08:51 +02:00
Kent Boortz
68f00a5686 Updated/added copyright headers 2011-06-30 17:37:13 +02:00
Kent Boortz
02e07e3b51 Updated/added copyright headers 2011-06-30 17:46:53 +02:00
Michael Widenius
1be5462d59 Merge with MariaDB 5.1 2011-05-03 19:10:10 +03:00
Michael Widenius
e415ba0fb2 Merge with MySQL 5.1.57/58
Moved some BSD string functions from Unireg
2011-05-02 20:58:45 +03:00
Michael Widenius
3358cdd504 Merge with 5.1 to get in changes from MySQL 5.1.55 2011-02-28 19:39:30 +02:00
Michael Widenius
785695e7c3 Flush DBUG log in case of DBUG_ASSERT()
Added strings_def.h into strings library to be able to have a DBUG_ASSERT() version without _db_flush() call (as strings.a should not depend on dbug.a)
Remove include of m_string.h in all string files (as it's included by string_def.h).
Fixed include order.
Changed "m_ctype.h" -> <m_ctype.h>

 

include/my_dbug.h:
  Flush DBUG log in case of DBUG_ASSERT()
strings/bchange.c:
  Include strings_def.h
strings/bcmp.c:
  Include strings_def.h
strings/bfill.c:
  Include strings_def.h
strings/bmove.c:
  Include strings_def.h
strings/bmove512.c:
  Include strings_def.h
strings/bmove_upp.c:
  Include strings_def.h
strings/conf_to_src.c:
  Include strings_def.h
  Fixed copyright
strings/ctype-big5.c:
  Include strings_def.h
strings/ctype-bin.c:
  Include strings_def.h
strings/ctype-cp932.c:
  Include strings_def.h
strings/ctype-czech.c:
  Include strings_def.h
strings/ctype-euc_kr.c:
  Include strings_def.h
strings/ctype-eucjpms.c:
  Include strings_def.h
strings/ctype-extra.c:
  Include strings_def.h
strings/ctype-gbk.c:
  Include strings_def.h
strings/ctype-latin1.c:
  Include strings_def.h
strings/ctype-mb.c:
  Include strings_def.h
strings/ctype-simple.c:
  Include strings_def.h
strings/ctype-sjis.c:
  Include strings_def.h
strings/ctype-tis620.c:
  Include strings_def.h
strings/ctype-uca.c:
  Include strings_def.h
strings/ctype-ucs2.c:
  Include strings_def.h
strings/ctype-ujis.c:
  Include strings_def.h
strings/ctype-utf8.c:
  Include strings_def.h
strings/ctype-win1250ch.c:
  Include strings_def.h
strings/ctype.c:
  Include strings_def.h
strings/decimal.c:
  Include strings_def.h
strings/do_ctype.c:
  Include strings_def.h
strings/int2str.c:
  Include strings_def.h
strings/is_prefix.c:
  Include strings_def.h
strings/llstr.c:
  Include strings_def.h
strings/longlong2str.c:
  Include strings_def.h
strings/longlong2str_asm.c:
  Include strings_def.h
strings/my_strchr.c:
  Include strings_def.h
strings/my_strtoll10.c:
  Include strings_def.h
strings/my_vsnprintf.c:
  Include strings_def.h
strings/r_strinstr.c:
  Include strings_def.h
strings/str2int.c:
  Include strings_def.h
strings/str_alloc.c:
  Include strings_def.h
strings/str_test.c:
  Include strings_def.h
  Fixed compiler warnings
strings/strappend.c:
  Include strings_def.h
strings/strcend.c:
  Include strings_def.h
strings/strcont.c:
  Include strings_def.h
strings/strend.c:
  Include strings_def.h
strings/strfill.c:
  Include strings_def.h
strings/strinstr.c:
  Include strings_def.h
strings/strmake.c:
  Include strings_def.h
strings/strmov.c:
  Include strings_def.h
strings/strmov_overlapp.c:
  Include strings_def.h
strings/strnlen.c:
  Include strings_def.h
strings/strnmov.c:
  Include strings_def.h
strings/strstr.c:
  Include strings_def.h
strings/strto.c:
  Include strings_def.h
strings/strtod.c:
  Include strings_def.h
strings/strtol.c:
  Include strings_def.h
strings/strtoll.c:
  Include strings_def.h
strings/strtoul.c:
  Include strings_def.h
strings/strtoull.c:
  Include strings_def.h
strings/strxmov.c:
  Include strings_def.h
strings/strxnmov.c:
  Include strings_def.h
strings/uctypedump.c:
  Include strings_def.h
  Fixed compiler warnings
  Removed double include of m_ctype.h
strings/udiv.c:
  Include strings_def.h
strings/xml.c:
  Include strings_def.h
2011-01-30 12:41:44 +02:00
Alexander Barkov
435289acd4 Updating Copyright information 2011-01-19 16:17:52 +03:00
Alexander Barkov
dfb7930b33 Merging Copyright update from 5.1 2011-01-19 16:31:17 +03:00
Sergei Golubchik
65ca700def merge.
checkpoint.
does not compile.
2010-11-25 18:17:28 +01:00
Sergei Golubchik
a3d80d952d merge with 5.1 2010-09-11 20:43:48 +02:00
Michael Widenius
ad6d95d3cb Merge with MySQL 5.1.50
- Changed to still use bcmp() in certain cases becasue
  - Faster for short unaligneed strings than memcmp()
  - Bettern when using valgrind
- Changed to use my_sprintf() instead of sprintf() to get higher portability for old systems
- Changed code to use MariaDB version of select->skip_record()
- Removed -%::SCCS/s.% from Makefile.am:s to remove automake warnings
2010-08-27 17:12:44 +03:00
Alexander Barkov
fc74792a6b Merging from mysql-5.1-bugteam 2010-07-26 10:47:03 +04:00
Alexander Barkov
e57a9d6fe0 Bug#45012 my_like_range_cp932 generates invalid string
Problem: The functions my_like_range_xxx() returned
badly formed maximum strings for Asian character sets,
which made problems for storage engines.

Fix: 
- Removed a number my_like_range_xxx() implementations,
  which were in fact dumplicate code pieces.
- Using generic my_like_range_mb() instead.
- Setting max_sort_char member properly for Asian character sets
- Adding unittest/strings/strings-t.c, 
  to test that my_like_range_xxx() return well-formed 
  min and max strings.

Notes:

- No additional tests in mysql/t/ available.
  Old tests cover the affected code well enough.
2010-07-26 09:06:18 +04:00
Davi Arnaut
13f7a1d244 WL#5486: Remove code for unsupported platforms
Remove MS-DOS specific code.
2010-07-15 08:16:06 -03:00
Alexander Barkov
4e4122b999 WL#3090 Japanese Character Set adjustments
added:
  @ mysql-test/include/ctype_utf8_table.inc
    Adding a share file to populate all utf8 values [U+0000..U+FFFF]
modified:
  @ include/m_ctype.h
    Introducing MB2 and MY_PUT_MB2 macros

  @ mysql-test/r/ctype_cp932_binlog_stm.result
  @ mysql-test/r/ctype_eucjpms.result
  @ mysql-test/r/ctype_sjis.result
  @ mysql-test/r/ctype_ujis.result
  @ mysql-test/t/ctype_cp932_binlog_stm.test
  @ mysql-test/t/ctype_eucjpms.test
  @ mysql-test/t/ctype_sjis.test
  @ mysql-test/t/ctype_ujis.test
    Adding test

  @ strings/ctype-cp932.c
  @ strings/ctype-eucjpms.c
  @ strings/ctype-sjis.c
  @ strings/ctype-ujis.c
    Adding new functions using Big-Table approach.
2010-02-15 09:57:24 +04:00
Alexander Barkov
8dfc3fbbab WL#4583 Case conversion in Asian character sets
modified:
  include/m_ctype.h
  - Changing type for tolower/toupper members, to store values >= 0xFFFF.
  - Adding function prototypes

  mysql-test/r/ctype_big5.result
  mysql-test/r/ctype_cp932_binlog_stm.result
  mysql-test/r/ctype_eucjpms.result*
  mysql-test/r/ctype_euckr.result
  mysql-test/r/ctype_gb2312.result
  mysql-test/r/ctype_gbk.result
  mysql-test/r/ctype_sjis.result
  mysql-test/r/ctype_ujis.result
  mysql-test/t/ctype_big5.test
  mysql-test/t/ctype_cp932_binlog_stm.test
  mysql-test/t/ctype_eucjpms.test
  mysql-test/t/ctype_euckr.test
  mysql-test/t/ctype_gb2312.test
  mysql-test/t/ctype_gbk.test
  mysql-test/t/ctype_sjis.test
  mysql-test/t/ctype_ujis.test
  -  Adding tests

  strings/ctype-big5.c
  strings/ctype-cp932.c
  strings/ctype-euc_kr.c
  strings/ctype-eucjpms.c
  strings/ctype-gb2312.c
  strings/ctype-gbk.c
  strings/ctype-sjis.c
  - Adding upper/lower case conversion data

  strings/ctype-mb.c
  - Adding handling of upper/lower conversion for multi-byte characters.

  strings/ctype-ujis.c
  - Implementing shared upper/lower conversion
    functions  for ujis and eucjpms
  - Adding upper/lower case conversion data for ujis
2010-01-14 15:17:57 +04:00