Commit graph

1799 commits

Author SHA1 Message Date
Sergei Golubchik
b2fc885469 Merge branch '11.4' into 11.5 2024-05-26 20:13:16 +02:00
Oleksandr Byelkin
dd7d9d7fb1 Merge branch '11.4' into 11.5 2024-05-23 17:01:43 +02:00
Alexander Barkov
2dfc6c4410 MDEV-33729 UBSAN negation of -X cannot be represented in type 'long long int'; cast to an unsigned type to negate this value to itself in my_strntoll_mb2_or_mb4
The problem was introduced by MDEV-30879.
The function my_strntoll_8bit() was correctly changed by MDEV-30879.
The function my_strntoll_mb2_or_mb4() was not.

Applying the missing change to my_strntoll_mb2_or_mb4().
2024-05-23 13:17:31 +04:00
Oleksandr Byelkin
99b370e023 Merge branch '11.2' into 11.4 2024-05-21 19:38:51 +02:00
Sergei Golubchik
f9807aadef Merge branch '10.11' into 11.0 2024-05-12 12:18:28 +02:00
Brian White
fb9af3f30e fix build with WITH_EXTRA_CHARSETS=none in cmake 2024-04-24 19:19:48 +10:00
Alexander Barkov
fd247cc21f MDEV-31340 Remove MY_COLLATION_HANDLER::strcasecmp()
This patch also fixes:
  MDEV-33050 Build-in schemas like oracle_schema are accent insensitive
  MDEV-33084 LASTVAL(t1) and LASTVAL(T1) do not work well with lower-case-table-names=0
  MDEV-33085 Tables T1 and t1 do not work well with ENGINE=CSV and lower-case-table-names=0
  MDEV-33086 SHOW OPEN TABLES IN DB1 -- is case insensitive with lower-case-table-names=0
  MDEV-33088 Cannot create triggers in the database `MYSQL`
  MDEV-33103 LOCK TABLE t1 AS t2 -- alias is not case sensitive with lower-case-table-names=0
  MDEV-33109 DROP DATABASE MYSQL -- does not drop SP with lower-case-table-names=0
  MDEV-33110 HANDLER commands are case insensitive with lower-case-table-names=0
  MDEV-33119 User is case insensitive in INFORMATION_SCHEMA.VIEWS
  MDEV-33120 System log table names are case insensitive with lower-cast-table-names=0

- Removing the virtual function strnncoll() from MY_COLLATION_HANDLER

- Adding a wrapper function CHARSET_INFO::streq(), to compare
  two strings for equality. For now it calls strnncoll() internally.
  In the future it will turn into a virtual function.

- Adding new accent sensitive case insensitive collations:
    - utf8mb4_general1400_as_ci
    - utf8mb3_general1400_as_ci
  They implement accent sensitive case insensitive comparison.
  The weight of a character is equal to the code point of its
  upper case variant. These collations use Unicode-14.0.0 casefolding data.

  The result of
     my_charset_utf8mb3_general1400_as_ci.strcoll()
  is very close to the former
     my_charset_utf8mb3_general_ci.strcasecmp()

  There is only a difference in a couple dozen rare characters, because:
    - the switch from "tolower" to "toupper" comparison, to make
      utf8mb3_general1400_as_ci closer to utf8mb3_general_ci
    - the switch from Unicode-3.0.0 to Unicode-14.0.0
  This difference should be tolarable. See the list of affected
  characters in the MDEV description.

  Note, utf8mb4_general1400_as_ci correctly handles non-BMP characters!
  Unlike utf8mb4_general_ci, it does not treat all BMP characters
  as equal.

- Adding classes representing names of the file based database objects:

    Lex_ident_db
    Lex_ident_table
    Lex_ident_trigger

  Their comparison collation depends on the underlying
  file system case sensitivity and on --lower-case-table-names
  and can be either my_charset_bin or my_charset_utf8mb3_general1400_as_ci.

- Adding classes representing names of other database objects,
  whose names have case insensitive comparison style,
  using my_charset_utf8mb3_general1400_as_ci:

  Lex_ident_column
  Lex_ident_sys_var
  Lex_ident_user_var
  Lex_ident_sp_var
  Lex_ident_ps
  Lex_ident_i_s_table
  Lex_ident_window
  Lex_ident_func
  Lex_ident_partition
  Lex_ident_with_element
  Lex_ident_rpl_filter
  Lex_ident_master_info
  Lex_ident_host
  Lex_ident_locale
  Lex_ident_plugin
  Lex_ident_engine
  Lex_ident_server
  Lex_ident_savepoint
  Lex_ident_charset
  engine_option_value::Name

- All the mentioned Lex_ident_xxx classes implement a method streq():

  if (ident1.streq(ident2))
     do_equal();

  This method works as a wrapper for CHARSET_INFO::streq().

- Changing a lot of "LEX_CSTRING name" to "Lex_ident_xxx name"
  in class members and in function/method parameters.

- Replacing all calls like
    system_charset_info->coll->strcasecmp(ident1, ident2)
  to
    ident1.streq(ident2)

- Taking advantage of the c++11 user defined literal operator
  for LEX_CSTRING (see m_strings.h) and Lex_ident_xxx (see lex_ident.h)
  data types. Use example:

  const Lex_ident_column primary_key_name= "PRIMARY"_Lex_ident_column;

  is now a shorter version of:

  const Lex_ident_column primary_key_name=
    Lex_ident_column({STRING_WITH_LEN("PRIMARY")});
2024-04-18 15:22:10 +04:00
Alexander Barkov
1e889a6e6c MDEV-33621 Unify duplicate code in my_wildcmp_uca_impl() and my_wildcmp_unicode_impl()
This is a refactoring patch, it does not change the behaviour.
The MTR tests are being added only to cover the LIKE predicate better.
(these tests should have been added earlier under terms of MDEV 9711).
This patch does not need its own specific MTR tests.

Moving the duplicate code into a new shared file ctype-wildcmp.inl
and including it from multiple places, to define the following functions:

- my_wildcmp_uca_impl(), in ctype-uca.c

  For utf8mb3, utf8mb4, ucs2, utf16, utf32, using cs->cset->mb_wc().
  For UCA based collations.

- my_wildcmp_mb2_or_mb4_general_ci_impl(), in ctype-ucs2.c:

  For ucs2, utf16, utf32, using cs->cset->mb_wc().
  For general_ci-style collations:
      - xxx_general_ci
      - xxx_general_mysql500_ci
      - xxx_general_nopad_ci

- my_wildcmp_mb2_or_mb4_bin_impl(), in ctype-ucs2.c:

  For ucs2, utf16, utf32, using cs->cset->mb_wc().
  For _bin collations:
      - xxx_bin
      - xxx_nopad_bin

- my_wildcmp_utf8mb3_general_ci_impl(), in ctype-utf8.c

  Optimized for utf8mb3, using my_mb_wc_utf8mb3_quick().

  For general_ci-style collations:
      - utf8mb3_general_ci
      - utf8mb3_general_mysql500_ci
      - utf8mb3_general_nopad_ci

- my_wildcmp_utf8mb4_general_ci_impl(), in ctype-utf8.c

  Optimized for utf8mb4, using my_mb_wc_utf8mb4_quick().

  For general_ci-style collations:
      - utf8mb4_general_ci
      - utf8mb4_general_nopad_ci
2024-03-12 09:33:20 +04:00
Alexander Barkov
929c2e06aa MDEV-31531 Remove my_casedn_str() and my_caseup_str()
Under terms of MDEV 27490 we'll add support for non-BMP identifiers
and upgrade casefolding information to Unicode version 14.0.0.
In Unicode-14.0.0 conversion to lower and upper cases can increase octet length
of the string, so conversion won't be possible in-place any more.

This patch removes virtual functions performing in-place casefolding:
  - my_charset_handler_st::casedn_str()
  - my_charset_handler_st::caseup_str()
and fixes the code to use the non-inplace functions instead:
  - my_charset_handler_st::casedn()
  - my_charset_handler_st::caseup()
2024-02-28 22:20:29 +04:00
Oleksandr Byelkin
fa69b085b1 Merge branch '11.3' into 11.4 2024-02-15 13:53:21 +01:00
Oleksandr Byelkin
d21cb43db1 Merge branch '11.2' into 11.3 2024-02-04 16:42:31 +01:00
Sergei Golubchik
87e13722a9 Merge branch '10.6' into 10.11 2024-02-01 18:36:14 +01:00
Sergei Golubchik
3f6038bc51 Merge branch '10.5' into 10.6 2024-01-31 18:04:03 +01:00
Sergei Golubchik
01f6abd1d4 Merge branch '10.4' into 10.5 2024-01-31 17:32:53 +01:00
Oleksandr Byelkin
fe490f85bb Merge branch '10.11' into 11.0 2024-01-30 08:54:10 +01:00
Oleksandr Byelkin
14d930db5d Merge branch '10.6' into 10.11 2024-01-30 08:17:58 +01:00
Oleksandr Byelkin
25c0806867 Merge branch '10.5' into 10.6 2024-01-30 07:43:15 +01:00
Oleksandr Byelkin
50107c4b22 Merge branch '10.4' into 10.5 2024-01-30 07:26:17 +01:00
Andrew Hutchings
f552febe43 MDEV-30879 Add support for up to BASE 62 to CONV()
BASE 62 uses 0-9, A-Z and then a-z to give the numbers 0-61. This patch
increases the range of the string functions to cover this.

Based on ideas and tests in PR #2589, but re-written into the charset
functions.

Includes fix by Sergei, UBSAN complained:
ctype-simple.c:683:38: runtime error: negation of -9223372036854775808
cannot be represented in type 'long long int'; cast to an unsigned
type to negate this value to itself

Co-authored-by: Weijun Huang <huangweijun1001@gmail.com>
Co-authored-by: Sergei Golubchik <serg@mariadb.org>
2024-01-17 15:24:26 +00:00
Robin Newhouse
615f4a8c9e MDEV-32587 Allow json exponential notation starting with zero
Modify the NS_ZERO state in the JSON number parser to allow
exponential notation with a zero coefficient (e.g. 0E-4).

The NS_ZERO state transition on 'E' was updated to move to the
NS_EX state rather than returning a syntax error. Similar change
was made for the NS_ZE1 (negative zero) starter state.

This allows accepted number grammar to include cases like:

- 0E4
- -0E-10

which were previously disallowed. Numeric parsing remains
the same for all other states.

Test cases are added to func_json.test to validate parsing for
various exponential numbers starting with zero coefficients.

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services.
2024-01-17 19:25:43 +05:30
Sergei Golubchik
7f0094aac8 Merge branch '11.2' into 11.3 2023-12-21 02:14:59 +01:00
Sergei Golubchik
8c8bce05d2 Merge branch '10.11' into 11.0 2023-12-19 15:53:18 +01:00
Sergei Golubchik
fd0b47f9d6 Merge branch '10.6' into 10.11 2023-12-18 11:19:04 +01:00
Sergei Golubchik
e95bba9c58 Merge branch '10.5' into 10.6 2023-12-17 11:20:43 +01:00
Sergei Golubchik
98a39b0c91 Merge branch '10.4' into 10.5 2023-12-02 01:02:50 +01:00
Marko Mäkelä
02701a8430 Merge 11.2 into 11.3 2023-11-28 11:19:50 +02:00
Oleksandr Byelkin
34272bd6a5 Merge branch '11.2' into 11.3 2023-11-14 18:33:03 +01:00
Oleksandr Byelkin
644a6b1035 Merge branch '11.0' into mariadb-11.0.4 2023-11-14 09:21:35 +01:00
Alexander Barkov
1710b6454b MDEV-26743 InnoDB: CHAR+nopad does not work well
The patch for "MDEV-25440: Indexed CHAR ... broken with NO_PAD collations"
fixed these scenarios from MDEV-26743:
- Basic latin letter vs equal accented letter
- Two letters vs equal (but space padded) expansion

However, this scenario was still broken:
- Basic latin letter (but followed by an ignorable character)
  vs equal accented letter

Fix:
When processing for a NOPAD collation a string with trailing ignorable
characters, like:
  '<non-ignorable><ignorable><ignorable>'

the string gets virtually converted to:
  '<non-ignorable><ignorable><ignorable><space><space><space>...'

After the fix the code works differently in these two cases:
1. <space> fits into the "nchars" limit
2. <space> does not fit into the "nchars" limit

Details:

1. If "nchars" is large enough (4+ in this example),
   return weights as follows:

  '[weight-for-non-ignorable, 1 char] [weight-for-space-character, 3 chars]'

  i.e. the weight for the virtual trailing space character now indicates
  that it corresponds to total 3 characters:
  - two ignorable characters
  - one virtual trailing space character

2. If "nchars" is small (3), then the virtual trailing space character
   does not fit into the "nchar" limit, so return 0x00 as weight, e.g.:

  '[weight-for-non-ignorable, 1 char] [0x00, 2 chars]'

Adding corresponding MTR tests and unit tests.
2023-11-10 06:17:23 +04:00
Oleksandr Byelkin
fecd78b837 Merge branch '10.10' into 10.11 2023-11-08 16:46:47 +01:00
Oleksandr Byelkin
04d9a46c41 Merge branch '10.6' into 10.10 2023-11-08 16:23:30 +01:00
Oleksandr Byelkin
b83c379420 Merge branch '10.5' into 10.6 2023-11-08 15:57:05 +01:00
Oleksandr Byelkin
6cfd2ba397 Merge branch '10.4' into 10.5 2023-11-08 12:59:00 +01:00
Rucha Deodhar
9cc179cc7e MDEV-32007: JSON_VALUE and JSON_EXTRACT doesn't handle dash (-)
as first character in key

Analysis:
While parsing the path, if '-' is encountered as a part of the key,
the state of the parser changes to error. Hence NULL is returned eventually.

Fix:
If '-' encountered as part of the key, change the state appropriately to
continue scanning the key.
2023-11-03 01:10:40 +05:30
Marko Mäkelä
3036b36f9b Merge 10.10 into 10.11 2023-10-23 18:44:12 +03:00
Marko Mäkelä
5a8fca5a4f Merge 10.6 into 10.10 2023-10-23 18:43:36 +03:00
Sergei Petrunia
4941ac9192 MDEV-32113: utf8mb3_key_col=utf8mb4_value cannot be used for ref
(Variant#3: Allow cross-charset comparisons, use a special
CHARSET_INFO to create lookup keys. Review input addressed.)

Equalities that compare utf8mb{3,4}_general_ci strings, like:

  WHERE ... utf8mb3_key_col=utf8mb4_value    (MB3-4-CMP)

can now be used to construct ref[const] access and also participate
in multiple-equalities.
This means that utf8mb3_key_col can be used for key-lookups when
compared with an utf8mb4 constant, field or expression using '=' or
'<=>' comparison operators.

This is controlled by optimizer_switch='cset_narrowing=on', which is
OFF by default.

IMPLEMENTATION
Item value comparison in (MB3-4-CMP) is done using utf8mb4_general_ci.
This is valid as any utf8mb3 value is also an utf8mb4 value.

When making index lookup value for utf8mb3_key_col, we do "Charset
Narrowing": characters that are in the Basic Multilingual Plane (=BMP) are
copied as-is, as they can be represented in utf8mb3. Characters that are
outside the BMP cannot be represented in utf8mb3 and are replaced
with U+FFFD, the "Replacement Character".

In utf8mb4_general_ci, the Replacement Character compares as equal to any
character that's not in BMP. Because of this, the constructed lookup value
will find all index records that would be considered equal by the original
condition (MB3-4-CMP).

Approved-by: Monty <monty@mariadb.org>
2023-10-19 17:24:30 +03:00
Xiaotong Niu
8f2f8f3173 MDEV-26494 Fix buffer overflow of string lib on Arm64
In the hexlo function, the element type of the array hex_lo_digit is not
explicitly declared as signed char, causing elements with a value of -1
to be converted to 255 on Arm64. The problem occurs because "char" is
unsigned by default on Arm64 compiler, but signed on x86 compiler. This
problem can be seen in https://godbolt.org/z/rT775xshj

The above issue causes "use-after-poison" exception in my_mb_wc_filename
function. The code snippet where the error occurred is shown below,
copied from below link.
5fc19e7137/strings/ctype-utf8.c (L2728)

2728    if ((byte1= hexlo(byte1)) >= 0 &&
2729     (byte2= hexlo(byte2)) >= 0)
  	{
2731    	int byte3= hexlo(s[3]);
    		…
  	}

At line 2729, when byte2 is 0, which indicates the end of the string s.
(1) On x86, hexlo(0) return -1 and line 2731 is skipped, as expected.
(2) On Arm64, hexlo(0) return 255 and line 2731 is executed, not as
expected, accessing s[3] after the null character of string s, thus
raising the "user-after-poison" error.

The problem was discovered when executing the main.mysqlcheck test.

Signed-off-by: Xiaotong Niu <xiaotong.niu@arm.com>
2023-10-18 20:23:27 +11:00
Nikita Malyavin
28b4037242 Merge branch '11.2' into 11.3 2023-09-21 14:15:04 +04:00
Sergei Petrunia
e987b9350c MDEV-31496: Make optimizer handle UCASE(varchar_col)=...
(Review input addressed)
(Added handling of UPDATE/DELETE and partitioning w/o index)

If the properties of the used collation allow, do the following
equivalent rewrites:

1. UPPER(key_col)=expr  ->  key_col=expr
   expr=UPPER(key_col)  ->  expr=key_col
   (also rewrite both sides of the equality at the same time)

2. UPPER(key_col) IN (constant-list)  -> key_col IN (constant-list)

- Mark utf8mb{3,4}_general_ci as collations that allow this.
- Add optimizer_switch='sargable_casefold=ON' to control this.
  (ON by default in this patch)
- Cover the rewrite in Optimizer Trace, rewrite name is
  "sargable_casefold_removal".
2023-09-12 17:14:43 +03:00
Oleksandr Byelkin
036df5f970 Merge branch '10.10' into 10.11 2023-08-08 14:57:31 +02:00
Oleksandr Byelkin
ced243a099 Merge branch '10.9' into 10.10 2023-08-05 20:34:09 +02:00
Oleksandr Byelkin
34a8e78581 Merge branch '10.6' into 10.9 2023-08-04 08:01:06 +02:00
Oleksandr Byelkin
6bf8483cac Merge branch '10.5' into 10.6 2023-08-01 15:08:52 +02:00
Oleksandr Byelkin
7564be1352 Merge branch '10.4' into 10.5 2023-07-26 16:02:57 +02:00
Oleksandr Byelkin
f52954ef42 Merge commit '10.4' into 10.5 2023-07-20 11:54:52 +02:00
Alexander Barkov
03c2157dd6 MDEV-28384 UBSAN: null pointer passed as argument 1, which is declared to never be null in my_strnncoll_binary on SELECT ... COUNT or GROUP_CONCAT
Also fixes:
  MDEV-30982 UBSAN: runtime error: null pointer passed as argument 2, which is declared to never be null in my_strnncoll_binary on DELETE

Calling memcmp() with a NULL pointer is undefined behaviour
according to the C standard, even if the length argument is 0.

Adding tests for length==0 before calling memcmp() into:
- my_strnncoll_binary()
- my_strnncoll_8bit_bin
2023-07-20 11:56:19 +04:00
Marko Mäkelä
c04284e747 Merge 10.10 into 10.11 2023-06-07 15:01:43 +03:00
Marko Mäkelä
82230aa423 Merge 10.9 into 10.10 2023-06-07 14:48:37 +03:00
anson1014
1db4fc543b Ensure that source files contain only valid UTF8 encodings (#2188)
Modern software (including text editors, static analysis software,
and web-based code review interfaces) often requires source code files
to be interpretable via a consistent character encoding, with UTF-8 or
ASCII (a strict subset of UTF-8) as the default. Several of the MariaDB
source files contain bytes that are not valid in either the UTF-8 or
ASCII encodings, but instead represent strings encoded in the
ISO-8859-1/Latin-1 or ISO-8859-2/Latin-2 encodings.

These inconsistent encodings may prevent software from correctly
presenting or processing such files. Converting all source files to
valid UTF8 characters will ensure correct handling.

Comments written in Czech were replaced with lightly-corrected
translations from Google Translate. Additionally, comments describing
the proper handling of special characters were changed so that the
comments are now purely UTF8.

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer
Amazon Web Services, Inc.

Co-authored-by: Andrew Hutchings <andrew@linuxjedi.co.uk>
2023-05-19 13:21:34 +01:00