Ensure that source files contain only valid UTF8 encodings (#2188)

Modern software (including text editors, static analysis software,
and web-based code review interfaces) often requires source code files
to be interpretable via a consistent character encoding, with UTF-8 or
ASCII (a strict subset of UTF-8) as the default. Several of the MariaDB
source files contain bytes that are not valid in either the UTF-8 or
ASCII encodings, but instead represent strings encoded in the
ISO-8859-1/Latin-1 or ISO-8859-2/Latin-2 encodings.
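
As an illustration of what "not valid in UTF-8" means at the byte level, here is a minimal sketch of a well-formedness check in C. The function name `is_valid_utf8` is hypothetical and is not part of the MariaDB tree or of this commit; the rules it applies (no stray continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF) follow the UTF-8 definition in RFC 3629.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
  size_t i = 0;
  assert(buf != NULL || len == 0);

  while (i < len) {
    unsigned char b = buf[i];
    size_t extra;       /* continuation bytes expected after the lead byte */
    unsigned long cp;   /* code point being decoded */
    unsigned long min;  /* smallest code point allowed for this length */

    if (b < 0x80) {                     /* plain ASCII */
      i++;
      continue;
    } else if ((b & 0xE0) == 0xC0) {    /* 2-byte sequence */
      extra = 1; cp = b & 0x1F; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) {    /* 3-byte sequence */
      extra = 2; cp = b & 0x0F; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) {    /* 4-byte sequence */
      extra = 3; cp = b & 0x07; min = 0x10000;
    } else {
      return false;                     /* stray continuation or invalid lead */
    }

    if (len - i - 1 < extra)
      return false;                     /* sequence truncated at end of input */

    for (size_t k = 1; k <= extra; k++) {
      unsigned char c = buf[i + k];
      if ((c & 0xC0) != 0x80)
        return false;                   /* not a continuation byte */
      cp = (cp << 6) | (c & 0x3F);
    }

    /* Reject overlong encodings, surrogates and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return false;

    i += extra + 1;
  }
  return true;
}
```

A Latin-1 'é' is the single byte 0xE9. UTF-8 treats 0xE9 as the lead byte of a three-byte sequence, so when an ASCII letter follows it the check fails, which is the kind of breakage described above.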

These inconsistent encodings may prevent software from correctly
presenting or processing such files. Converting all source files to
valid UTF-8 ensures that they are handled correctly.

Comments written in Czech were replaced with lightly-corrected
translations from Google Translate. Additionally, comments describing
the proper handling of special characters were changed so that they
now contain only valid UTF-8.

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer
Amazon Web Services, Inc.

Co-authored-by: Andrew Hutchings <andrew@linuxjedi.co.uk>
anson1014 2022-08-30 04:21:40 -04:00 committed by Andrew Hutchings
parent c205f6c127
commit 1db4fc543b
4 changed files with 31 additions and 57 deletions


@@ -92,7 +92,7 @@ extern "C" FILE *my_win_popen(const char *cmd, const char *mode)
 goto error;
 break;
 default:
-/* Unknown mode, éxpected "r", "rt", "w", "wt" */
+/* Unknown mode, expected "r", "rt", "w", "wt" */
 abort();
 }
 if (!SetHandleInformation(parent_pipe_end, HANDLE_FLAG_INHERIT, 0))


@@ -642,7 +642,6 @@ bool DOMNODELIST::DropItem(PGLOBAL g, int n)
 if (Listp == NULL || Listp->length < n)
 return true;
-//Listp->item[n] = NULL; La propriété n'a pas de méthode 'set'
 return false;
 } // end of DeleteItem


@@ -23,13 +23,13 @@
 solution was needed than the one-to-one conversion table. To
 note a few, here is an example of a Czech sorting sequence:
-co < hlaska < hláska < hlava < chlapec < krtek
+co < hlaska < hláska < hlava < chlapec < krtek
 It because some of the rules are: double char 'ch' is sorted
-between 'h' and 'i'. Accented character 'á' (a with acute) is
+between 'h' and 'i'. Accented character 'á' (a with acute) is
 sorted after 'a' and before 'b', but only if the word is
 otherwise the same. However, because 's' is sorted before 'v'
-in hlava, the accentness of 'á' is overridden. There are many
+in hlava, the accentness of 'á' is overridden. There are many
 more rules.
 This file defines functions my_strxfrm and my_strcoll for
@@ -42,8 +42,9 @@
 passes, that's why we need four times more space for expanded
 string.
-This file also contains the ISO-Latin-2 definitions of
-characters.
+The non-ASCII literal strings in this file are encoded
+in the iso-8859-2 / latin-2 character set
+(https://en.wikipedia.org/wiki/ISO/IEC_8859-2)
 Author: (c) 1997--1998 Jan Pazdziora, adelton@fi.muni.cz
 Jan Pazdziora has a shared copyright for this code
@@ -111,7 +112,7 @@ static const struct wordvalue doubles[] = {
 };
 /*
-Unformal description of the algorithm:
+Informal description of the algorithm:
 We walk the string left to right.
@@ -126,7 +127,7 @@ static const struct wordvalue doubles[] = {
 End of pass is marked with value 1 on the output.
-For each character, we read it's value from the table.
+For each character, we read its value from the table.
 If the value is ignore (0), we go straight to the next character.
@@ -138,31 +139,6 @@ static const struct wordvalue doubles[] = {
 exists behind it, find its value.
 We append 0 to the end.
----
-Neformální popis algoritmu:
-Procházíme řetězec zleva doprava.
-Konec řetězce je předán buď jako parametr, nebo je to *p == 0.
-Toto je ošetřeno makrem IS_END.
-Pokud jsme došli na konec řetězce při průchodu 0, nejdeme na
-začátek, ale na uloženou pozici, protože první a druhý průchod
-běží současně.
-Konec vstupu (průchodu) označíme na výstupu hodnotou 1.
-Pro každý znak řetězce načteme hodnotu z třídící tabulky.
-Jde-li o hodnotu ignorovat (0), skočíme ihned na další znak..
-Jde-li o hodnotu konec slova (2) a je to průchod 0 nebo 1,
-přeskočíme všechny další 0 -- 2 a prohodíme průchody.
-Jde-li o kompozitní znak (255), otestujeme, zda následuje
-správný do dvojice, dohledáme správnou hodnotu.
-Na konci připojíme znak 0
 */
 #define ADD_TO_RESULT(dest, len, totlen, value) \
@@ -335,24 +311,23 @@ my_strnxfrm_czech(CHARSET_INFO *cs __attribute__((unused)),
 /*
-Neformální popis algoritmu:
-procházíme řetězec zleva doprava
-konec řetězce poznáme podle *p == 0
-pokud jsme došli na konec řetězce při průchodu 0, nejdeme na
-začátek, ale na uloženou pozici, protože první a druhý
-průchod běží současně
-konec vstupu (průchodu) označíme na výstupu hodnotou 1
-načteme hodnotu z třídící tabulky
-jde-li o hodnotu ignorovat (0), skočíme na další průchod
-jde-li o hodnotu konec slova (2) a je to průchod 0 nebo 1,
-přeskočíme všechny další 0 -- 2 a prohodíme
-průchody
-jde-li o kompozitní znak (255), otestujeme, zda následuje
-správný do dvojice, dohledáme správnou hodnotu
-na konci připojíme znak 0
+Informal description of the algorithm:
+we pass the chain from left to right
+we know the end of the string by *p == 0
+if we reached the end of the string on transition 0, then we don't go to
+start, but to the saved position, because the first and second
+the passage runs concurrently
+we mark the end of the input (transition) with the value 1 on the output
+then we load the value from the sorting table
+if the value is ignore (0), we jump to the next pass
+if the value is the end of the word (2) and it is a 0 or 1 transition,
+we skip all the other 0 -- 2 and switch transitions
+if it is a composite character (255), we test whether it follows
+correct to the pair, we find the correct value
+then we add the character 0 at the end
 */


@@ -499,19 +499,19 @@ struct charset_info_st my_charset_latin1_nopad=
 *
 * The modern sort order is used, where:
 *
-* 'ä' -> "ae"
-* 'ö' -> "oe"
-* 'ü' -> "ue"
-* 'ß' -> "ss"
+* 'ä' -> "ae"
+* 'ö' -> "oe"
+* 'ü' -> "ue"
+* 'ß' -> "ss"
 */
 /*
 * This is a simple latin1 mapping table, which maps all accented
 * characters to their non-accented equivalents. Note: in this
-* table, 'ä' is mapped to 'A', 'ÿ' is mapped to 'Y', etc. - all
+* table, 'ä' is mapped to 'A', 'ÿ' is mapped to 'Y', etc. - all
 * accented characters except the following are treated the same way.
-* Ü, ü, Ö, ö, Ä, ä
+* Ü, ü, Ö, ö, Ä, ä
 */
 static const uchar sort_order_latin1_de[] = {
@@ -577,7 +577,7 @@ static const uchar combo2map[]={
 my_strnxfrm_latin_de() on both strings and compared the result strings.
 This means that:
-Ä must also matches ÁE and , because my_strxn_frm_latin_de() will convert
+Ä must also matches ÁE and , because my_strxn_frm_latin_de() will convert
 both to AE.
 The other option would be to not do any accent removal in
@@ -703,7 +703,7 @@ void my_hash_sort_latin1_de(CHARSET_INFO *cs __attribute__((unused)),
 /*
 Remove end space. We have to do this to be able to compare
-'AE' and 'Ä' as identical
+'AE' and 'Ä' as identical
 */
 end= skip_trailing_space(key, len);
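
The German sort order described in the comments of the last file ('ä' -> "ae", 'ö' -> "oe", 'ü' -> "ue", 'ß' -> "ss") can be illustrated with a small, self-contained sketch. The function `de_sort_key` below is hypothetical and is not the MariaDB implementation, which works from the `sort_order_latin1_de` and `combo2map` tables. It assumes Latin-1 input, expands only the four characters listed above, upper-cases ASCII letters, and leaves out both the accent removal for other characters and the trailing-space handling mentioned in the comments.

```c
#include <assert.h>
#include <stddef.h>

/*
  Hypothetical helper: builds a comparison key from Latin-1 bytes using the
  two-letter expansions named in the comments. 'out' must have room for
  2 * len bytes. Returns the number of bytes written.
*/
size_t de_sort_key(const unsigned char *in, size_t len, unsigned char *out)
{
  size_t n = 0;
  assert(in != NULL || len == 0);
  assert(out != NULL || len == 0);

  for (size_t i = 0; i < len; i++) {
    unsigned char b = in[i];
    switch (b) {
    case 0xC4: case 0xE4:               /* Latin-1 A/a with diaeresis */
      out[n++] = 'A'; out[n++] = 'E';
      break;
    case 0xD6: case 0xF6:               /* Latin-1 O/o with diaeresis */
      out[n++] = 'O'; out[n++] = 'E';
      break;
    case 0xDC: case 0xFC:               /* Latin-1 U/u with diaeresis */
      out[n++] = 'U'; out[n++] = 'E';
      break;
    case 0xDF:                          /* Latin-1 sharp s */
      out[n++] = 'S'; out[n++] = 'S';
      break;
    default:
      if (b >= 'a' && b <= 'z')
        b = (unsigned char)(b - ('a' - 'A'));   /* ASCII upper-casing only */
      out[n++] = b;
      break;
    }
  }
  return n;
}
```

Under these assumptions the Latin-1 byte 0xC4 ('Ä') and the ASCII string "ae" both produce the key "AE", which is the equivalence the comments describe.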